Sitemap Generator

Hello! For some reason, this *very* old blog post gets a lot of traffic. Like an insane amount. If you don't mind, send me a quick email (raymondcamden@gmail.com), or DM on Twitter (@raymondcamden), telling me how you got here. I'm just incredibly curious as to where the traffic is coming from. Thank you! First person to answer the mystery gets mega brownie points!

Earlier today Yahoo and Google announced their collaboration on Sitemaps.org. Sitemaps provide a way to describe to a search engine what pages make up your web site. I've had sitemap support in BlogCFC for a while, but today I wrote a little UDF you can use to generate sitemap XML. It takes either a comma-delimited list of URLs or a query with a url column (and, optionally, lastmod, changefreq, and priority columns). Enjoy. I'll post it to CFLib later in the week.

<cffunction name="generateSiteMap" output="false" returnType="xml">
	<cfargument name="data" type="any" required="true">
	<cfargument name="lastmod" type="date" required="false">
	<cfargument name="changefreq" type="string" required="false">
	<cfargument name="priority" type="numeric" required="false">
	
	<cfset var header = "<?xml version=""1.0"" encoding=""UTF-8""?><urlset xmlns=""http://www.sitemaps.org/schemas/sitemap/0.9"">">
	<cfset var result = header>
	<cfset var aurl = "">
	<cfset var item = "">
	<cfset var validChangeFreq = "always,hourly,daily,weekly,monthly,yearly,never">
	<cfset var newDate = "">
	<cfset var tz = getTimeZoneInfo().utcHourOffset>
	
	<cfif structKeyExists(arguments, "changefreq") and not listFindNoCase(validChangeFreq, arguments.changefreq)>
		<cfthrow message="Invalid changefreq (#arguments.changefreq#) passed. Valid values are #validChangeFreq#">
	</cfif>

	<cfif structKeyExists(arguments, "priority") and (arguments.priority lt 0 or arguments.priority gt 1)>
		<cfthrow message="Invalid priority (#arguments.priority#) passed. Must be between 0.0 and 1.0">
	</cfif>
	
	<!--- reformat datetime as w3c datetime / http://www.w3.org/TR/NOTE-datetime --->
	<cfif structKeyExists(arguments, "lastmod")>			
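		<cfset newDate = dateFormat(arguments.lastmod, "yyyy-mm-dd") & "T" & timeFormat(arguments.lastmod, "HH:mm")>
		<!--- utcHourOffset is the offset from local time to UTC: positive west of UTC, negative east of it --->
		<cfif tz gt 0>
			<cfset newDate = newDate & "-" & numberFormat(tz, "00") & ":00">
		<cfelse>
			<!--- abs() avoids output such as "+-1:00" when tz is negative --->
			<cfset newDate = newDate & "+" & numberFormat(abs(tz), "00") & ":00">
		</cfif>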
	</cfif>
	
	<!--- Support either a query or list of URLs --->
	<cfif isSimpleValue(arguments.data)>
		<cfloop index="aurl" list="#arguments.data#">
			<cfsavecontent variable="item">
<cfoutput>
<url>
	<loc>#xmlFormat(aurl)#</loc>
	<cfif structKeyExists(arguments,"lastmod")>
	<lastmod>#newDate#</lastmod>
	</cfif>
	<cfif structKeyExists(arguments,"changefreq")>
	<changefreq>#arguments.changefreq#</changefreq>
	</cfif>
	<cfif structKeyExists(arguments,"priority")>
	<priority>#arguments.priority#</priority>
	</cfif>
</url>
</cfoutput>
			</cfsavecontent>
			<cfset item = trim(item)>
			<cfset result = result & item>
		</cfloop>
		
	<cfelseif isQuery(arguments.data)>
		<cfloop query="arguments.data">
			<cfsavecontent variable="item">
<cfoutput>
<url>
	<loc>#xmlFormat(url)#</loc>
	<cfif listFindNoCase(arguments.data.columnlist,"lastmod")>
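		<cfset newDate = dateFormat(lastmod, "yyyy-mm-dd") & "T" & timeFormat(lastmod, "HH:mm")>
		<!--- same offset handling as the list branch: sign from tz, two-digit hour from abs(tz) --->
		<cfif tz gt 0>
			<cfset newDate = newDate & "-" & numberFormat(tz, "00") & ":00">
		<cfelse>
			<cfset newDate = newDate & "+" & numberFormat(abs(tz), "00") & ":00">
		</cfif>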
		<lastmod>#newDate#</lastmod>
	</cfif>
	<cfif listFindNoCase(arguments.data.columnlist,"changefreq")>
	<changefreq>#changefreq#</changefreq>
	</cfif>
	<cfif listFindNoCase(arguments.data.columnlist,"priority")>
	<priority>#priority#</priority>
	</cfif>
</url>
</cfoutput>
			</cfsavecontent>
			<cfset item = trim(item)>
			<cfset result = result & item>
		
		</cfloop>
	</cfif>
	
	<cfset result = result & "</urlset>">
	
	<cfreturn result>
	
</cffunction>
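
As a quick illustration of the list form, here is a minimal sketch of a call. The example.com addresses are hypothetical placeholders; the optional arguments, when passed, apply to every URL in the list.

```cfml
<!--- Hypothetical URLs, for illustration only. The list is comma-delimited,
      so a URL that itself contains a comma would be split in two. --->
<cfset urls = "http://www.example.com/,http://www.example.com/about.cfm">

<!--- lastmod, changefreq and priority are optional --->
<cfset siteMapXML = generateSiteMap(data=urls, lastmod=now(), changefreq="daily", priority=0.8)>

<!--- xmlParse throws if the string is not well-formed XML, so this doubles as a sanity check --->
<cfdump var="#xmlParse(siteMapXML)#">
```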

Archived Comments

Comment 1 by aleksandar posted on 11/20/2006 at 5:50 PM

How does it actually work?

Comment 2 by Raymond Camden posted on 11/20/2006 at 7:02 PM

You pass in either a list of URLs or a query. I added it to CFLib last night and there is a bit more documentation there.

http://www.cflib.org/udf.cf...

Comment 3 by BL posted on 11/28/2006 at 9:41 PM

do you think it would be hard to build a site crawler and link parser in cf to use with this udf?

Comment 4 by Raymond Camden posted on 11/29/2006 at 1:31 AM

BL: Sure, I'll make it a Friday test. ;)

Comment 5 by BL posted on 11/29/2006 at 5:21 AM

nice. you feelin a little regexy?

Comment 6 by dickbob posted on 3/3/2007 at 6:07 PM

Can I offer a couple of amendments in the light of my experience of using this UDF to submit to Google?

Code changes occur after the comment "reformat datetime as w3c datetime / http://www.w3.org/TR/NOTE-d...".

1. Change the test of tz to be "gt" rather than "gte". To be honest this is really just a personal style thing, +00:00 looks better than -00:00 to me, and doesn't seem to affect Google.

2. Make the hour number format "00" for the newDate offset eg. numberFormat(tz,"00"). So the lines should read newDate = newDate & "-" & numberFormat(tz,"00") & ":00" and newDate = newDate & "+" & numberFormat(tz,"00") & ":00"

HTH

Comment 7 by Raymond Camden posted on 3/4/2007 at 3:20 AM

I've made both changes. I've also changed the UDF to use stringbuffer, this makes it quicker. Unfortunately - CFLIB is causing me fits now - so it's not hooked up yet. I will refresh it later.

Comment 8 by dickbob posted on 3/6/2007 at 8:13 PM

I know you most likely thought of it but I should have mentioned that the same changes need to be applied when the data is supplied as a query.

Comment 9 by Ruth posted on 4/19/2007 at 11:45 PM

Hi there, I was just tasked to produce an XML sitemap. I noticed that you mention that you would make a Friday test out of the idea of making a site crawler. I searched for that term in your blog and didn't see any results. Did this ever occur?

Comment 10 by Raymond Camden posted on 4/20/2007 at 2:32 AM

Not yet - no.

Comment 11 by Ruth posted on 4/21/2007 at 12:04 AM

I started a solution to create a map on the server side - but it isn't a crawler. I'll post the code after I get it cleaned up enough and some of the kinks worked out.

I am having issues with cfdirectory w/recursion at webroot. I get the pesky null pointer error, which I am attributing to archived directories, etc bloating the query. I still need to prove that is the cause.

Comment 12 by Raymond Camden posted on 4/21/2007 at 12:11 AM

Ruth, don't forget that if you confirm a bug, you can report it at:

http://www.adobe.com/go/wish

Comment 13 by Jeremy Halliwell posted on 8/14/2007 at 4:06 PM

Ray, the udf on cflib although dated March 9 2007 doesn't have the changes you mentioned you'd added to timezone and the use of stringbuffer. Also the getTimeZoneInfo().utcHourOffset assigned to tz returns (for me at least) an offset value, eg "-1" for UK, so the test which adds the "+" or "-" later is unnecessary and makes the date format invalid (eg 2007-08-04T19:24+-1:00).
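
To make the sign issue concrete, here is a rough sketch of a standalone formatter. The name w3cDateTime is hypothetical and is not part of the UDF on CFLib. It assumes whole-hour offsets only, so zones with a half-hour offset would need utcMinuteOffset as well.

```cfml
<cffunction name="w3cDateTime" output="false" returnType="string">
	<cfargument name="dt" type="date" required="true">
	<cfset var tz = getTimeZoneInfo().utcHourOffset>
	<cfset var sign = "+">

	<!--- utcHourOffset is the offset from local time to UTC, so a positive
	      value means the server is behind UTC and needs a "-" sign --->
	<cfif tz gt 0>
		<cfset sign = "-">
	</cfif>

	<!--- abs() plus a "00" mask gives a two-digit hour with no stray sign --->
	<cfreturn dateFormat(arguments.dt, "yyyy-mm-dd") & "T" & timeFormat(arguments.dt, "HH:mm") & sign & numberFormat(abs(tz), "00") & ":00">
</cffunction>
```

With an offset of -1 this yields a suffix of +01:00 rather than +-1:00.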

Comment 14 by Raymond Camden posted on 8/14/2007 at 5:00 PM

I'm updating it in 10 seconds. Will you please give it a try?

Comment 15 by Jeremy Halliwell posted on 8/14/2007 at 5:22 PM

Yes that works great, thanks Ray.

Comment 16 by Snake posted on 9/4/2007 at 9:40 PM

You say blogcfc has had sitemap support for some time, in what way?
Is it supposed to generate a sitemap.xml file?
And if so, how? I can't find any option to do this.
Or do I need to update, as I am on blogCFC 5.5?

Comment 17 by Raymond Camden posted on 9/4/2007 at 9:53 PM

There should be a file named sitemap or googlesitemap.cfm in the root directory.

Comment 18 by marco posted on 10/13/2007 at 6:17 AM

Hi, I can't figure out how to combine these values into one XML output:

<cfset siteMapXML = generateSiteMap(data=urls,changefreq="daily",priority="1.0", lastmod=now())>
<cfdump var="#xmlParse(siteMapXML)#">
<cfset siteMapXML = generateSiteMap(qurls)>
<cfdump var="#xmlParse(siteMapXML)#">

I want these combined as I need to put it all into one XML sitemap. The .cfm sitemap takes too long to load because it is a big sitemap.

thanks

Comment 19 by Raymond Camden posted on 10/24/2007 at 2:03 AM

Well, I think you can just combine both XML strings. You would want to remove the <?xml ?> declaration and the opening <urlset> tag from the second one, and the closing </urlset> from the first. Not exactly sure - but it's definitely possible.
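
One possible way to do the merge, sketched under the assumption that both strings come straight from generateSiteMap and so begin with the same header and end with </urlset>. The variables urls and qurls are the ones from the question above; first, second, and combined are names made up for this sketch.

```cfml
<cfset first = generateSiteMap(data=urls, changefreq="daily", priority=1.0, lastmod=now())>
<cfset second = generateSiteMap(qurls)>

<!--- Drop the closing </urlset> from the first document --->
<cfset first = replace(first, "</urlset>", "")>

<!--- Drop the XML declaration and the opening <urlset ...> tag from the second --->
<cfset second = reReplace(second, "^<\?xml[^>]*\?><urlset[^>]*>", "")>

<!--- What remains is one header, all <url> entries, and one closing tag --->
<cfset combined = first & second>
<cfdump var="#xmlParse(combined)#">
```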

Comment 20 by m van den oever posted on 10/24/2007 at 3:18 AM

of course, i was thinking the hard way as usual, thx

Comment 21 by Adam posted on 1/19/2008 at 12:49 AM

Could someone break this down for me? I've read this post over and over and looked at the CFLib documentation. I don't know if I'm missing something or (most likely) I just don't know what I'm doing. Any shove in the right direction is greatly appreciated.

Comment 22 by TonyD posted on 7/9/2009 at 9:10 AM

Hi Ray, many retrospective thanks for other tips.

I'm with Adam on this. As indicated at Sitemaps.org, it is clear that the search engine will look for an XML document called sitemap.

That the generateSiteMap function returns this XML code for the URLs provided is also clear. That the xmlParse function turns this into an XML object (and the XML code is visible in code view on the web page) is clear.

So instead of using cfdump, you pick whichever of the three options you prefer and then just surround the xmlparse with cfoutput tags and this hands the xml to the search engine?

It's one of those things where you could end up ten years later in a bar discussing XML and girls only to find out that you've been setting up invalid site maps for ten years.

Comment 23 by Raymond Camden posted on 7/9/2009 at 4:32 PM

Yeah, take the result and just output it.
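
A minimal sketch of what "just output it" could look like. The URL list and the sitemap.xml file name are assumptions for illustration. Since the UDF already returns a string, xmlParse is not needed for delivery.

```cfml
<!--- urls is a hypothetical comma-delimited list of page addresses --->
<cfset urls = "http://www.example.com/,http://www.example.com/about.cfm">
<cfset siteMapXML = generateSiteMap(data=urls)>

<!--- Option 1: write a static file next to this template, useful when
      generating the map on every request is too slow --->
<cffile action="write" file="#expandPath('./sitemap.xml')#" output="#siteMapXML#" charset="utf-8">

<!--- Option 2: serve the string directly; reset="true" discards any
      whitespace generated earlier in the request --->
<cfcontent type="text/xml" reset="true"><cfoutput>#siteMapXML#</cfoutput>
```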

Comment 24 by Karan Joshi posted on 2/19/2010 at 8:42 AM

Used bits n pieces of your code...works like a charm...thanks.

Comment 25 by Custard Pie posted on 8/23/2010 at 6:56 PM

How are you using this? I'm able to call it and it seems to output the header and closing markup, but I haven't made the connection as to how it gets all the site directories to produce a complete site map. I can see I can pass the data argument to it, but what are you using to traverse the site directories so that this component can wrap it in XML?

Comment 26 by Raymond Camden posted on 8/24/2010 at 4:02 PM

I'm not traversing the site. You need to figure that part out yourself. Every site is different. So for example, on this blog, I can make a map by getting all the entries. For a news site, you would get all the news articles. Amazon would get all their products. Essentially - look at your site data and set up a query that represents the content.
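
As a rough sketch of that idea, the query can be assembled by hand from whatever identifies your content. The slugs and the example.com address pattern here are invented for illustration; on a real site they would come from your own database. The column names are the ones the UDF looks for.

```cfml
<!--- Hypothetical content identifiers; normally these come from your own data --->
<cfset slugs = "first-post,second-post">

<!--- url is required by the UDF; the other three columns are optional --->
<cfset siteData = queryNew("url,lastmod,changefreq,priority")>

<cfloop index="slug" list="#slugs#">
	<cfset queryAddRow(siteData)>
	<!--- querySetCell without a row number writes to the last row --->
	<cfset querySetCell(siteData, "url", "http://www.example.com/blog/" & slug)>
	<cfset querySetCell(siteData, "lastmod", now())>
	<cfset querySetCell(siteData, "changefreq", "weekly")>
	<cfset querySetCell(siteData, "priority", 0.5)>
</cfloop>

<cfset siteMapXML = generateSiteMap(siteData)>
```

Note that the UDF validates changefreq and priority only when they are passed as arguments, so values placed in query columns should be checked before they go in.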