Hello! For some reason, this *very* old blog post gets a lot of traffic. Like an insane amount. If you don't mind, send me a quick email (raymondcamden@gmail.com), or DM on Twitter (@raymondcamden), telling me how you got here. I'm just incredibly curious as to where the traffic is coming from. Thank you! First person to answer the mystery gets mega brownie points!
Earlier today Yahoo and Google announced their collaboration on Sitemaps.org. Sitemaps provide a way to describe to a search engine what pages make up your web site. I've had sitemap support in BlogCFC for a while, but today I wrote a little UDF you can use to generate sitemap XML. It will take either a list of URLs or a query of URLs. Enjoy. I'll post it to CFLib later in the week.
<cffunction name="generateSiteMap" output="false" returnType="xml">
<cfargument name="data" type="any" required="true">
<cfargument name="lastmod" type="date" required="false">
<cfargument name="changefreq" type="string" required="false">
<cfargument name="priority" type="numeric" required="false">
<cfset var header = "<?xml version=""1.0"" encoding=""UTF-8""?><urlset xmlns=""http://www.sitemaps.org/schemas/sitemap/0.9"">">
<cfset var result = header>
<cfset var aurl = "">
<cfset var item = "">
<cfset var validChangeFreq = "always,hourly,daily,weekly,monthly,yearly,never">
<cfset var newDate = "">
<cfset var tz = getTimeZoneInfo().utcHourOffset>
<cfif structKeyExists(arguments, "changefreq") and not listFindNoCase(validChangeFreq, arguments.changefreq)>
<cfthrow message="Invalid changefreq (#arguments.changefreq#) passed. Valid values are #validChangeFreq#">
</cfif>
<cfif structKeyExists(arguments, "priority") and (arguments.priority lt 0 or arguments.priority gt 1)>
<cfthrow message="Invalid priority (#arguments.priority#) passed. Must be between 0.0 and 1.0">
</cfif>
<!--- reformat datetime as w3c datetime / http://www.w3.org/TR/NOTE-datetime --->
<cfif structKeyExists(arguments, "lastmod")>
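<cfset newDate = dateFormat(arguments.lastmod, "yyyy-mm-dd") & "T" & timeFormat(arguments.lastmod, "HH:mm")>
<!--- utcHourOffset is positive west of UTC and negative east of it, so flip the sign and zero-pad the hour --->
<cfif tz gt 0>
<cfset newDate = newDate & "-" & numberFormat(tz, "00") & ":00">
<cfelse>
<cfset newDate = newDate & "+" & numberFormat(abs(tz), "00") & ":00">
</cfif>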
</cfif>
<!--- Support either a query or list of URLs --->
<cfif isSimpleValue(arguments.data)>
<cfloop index="aurl" list="#arguments.data#">
<cfsavecontent variable="item">
<cfoutput>
<url>
<loc>#xmlFormat(aurl)#</loc>
<cfif structKeyExists(arguments,"lastmod")>
<lastmod>#newDate#</lastmod>
</cfif>
<cfif structKeyExists(arguments,"changefreq")>
<changefreq>#arguments.changefreq#</changefreq>
</cfif>
<cfif structKeyExists(arguments,"priority")>
<priority>#arguments.priority#</priority>
</cfif>
</url>
</cfoutput>
</cfsavecontent>
<cfset item = trim(item)>
<cfset result = result & item>
</cfloop>
<cfelseif isQuery(arguments.data)>
<cfloop query="arguments.data">
<cfsavecontent variable="item">
<cfoutput>
<url>
<loc>#xmlFormat(url)#</loc>
<cfif listFindNoCase(arguments.data.columnlist,"lastmod")>
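<cfset newDate = dateFormat(lastmod, "yyyy-mm-dd") & "T" & timeFormat(lastmod, "HH:mm")>
<!--- utcHourOffset is positive west of UTC and negative east of it, so flip the sign and zero-pad the hour --->
<cfif tz gt 0>
<cfset newDate = newDate & "-" & numberFormat(tz, "00") & ":00">
<cfelse>
<cfset newDate = newDate & "+" & numberFormat(abs(tz), "00") & ":00">
</cfif>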
<lastmod>#newDate#</lastmod>
</cfif>
<cfif listFindNoCase(arguments.data.columnlist,"changefreq")>
<changefreq>#changefreq#</changefreq>
</cfif>
<cfif listFindNoCase(arguments.data.columnlist,"priority")>
<priority>#priority#</priority>
</cfif>
</url>
</cfoutput>
</cfsavecontent>
<cfset item = trim(item)>
<cfset result = result & item>
</cfloop>
</cfif>
<cfset result = result & "</urlset>">
<cfreturn result>
</cffunction>
Archived Comments
How does it actually work?
You pass in either a list of URLs or a query. I added it to CFLib last night and there is a bit more documentation there.
http://www.cflib.org/udf.cf...
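As a rough illustration of the list form, a call might look like the sketch below. It assumes the generateSiteMap UDF from this post is already defined or included on the page; the example.com URLs are placeholders, not anything from the original post.

```cfml
<!--- Hypothetical URLs; substitute the pages that make up your own site. --->
<cfset urls = "http://www.example.com/,http://www.example.com/about.cfm,http://www.example.com/contact.cfm">

<!--- lastmod, changefreq, and priority are optional. In the list form, the same values are applied to every URL. --->
<cfset siteMapXML = generateSiteMap(data=urls, lastmod=now(), changefreq="weekly", priority=0.5)>

<!--- Parse and dump the result to confirm the XML is well formed. --->
<cfdump var="#xmlParse(siteMapXML)#">
```

If each URL needs its own lastmod, changefreq, or priority, the query form is the better fit, since the UDF reads those values per row.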
Do you think it would be hard to build a site crawler and link parser in CF to use with this UDF?
BL: Sure, I'll make it a Friday test. ;)
nice. you feelin a little regexy?
Can I offer a couple of amendments in the light of my experience of using this UDF to submit to Google?
Code changes occur after the comment "reformat datetime as w3c datetime / http://www.w3.org/TR/NOTE-d...".
1. Change the test of tz to be "gt" rather than "gte". To be honest this is really just a personal style thing, +00:00 looks better than -00:00 to me, and it doesn't seem to affect Google.
2. Make the hour number format "00" for the newDate offset, e.g. numberFormat(tz,"00"). So the lines should read newDate = newDate & "-" & numberFormat(tz,"00") & ":00" and newDate = newDate & "+" & numberFormat(tz,"00") & ":00"
HTH
I've made both changes. I've also changed the UDF to use stringbuffer, this makes it quicker. Unfortunately - CFLIB is causing me fits now - so it's not hooked up yet. I will refresh it later.
I know you most likely thought of it but I should have mentioned that the same changes need to be applied when the data is supplied as a query.
Hi there, I was just tasked to produce an XML sitemap. I noticed that you mention that you would make a Friday test out of the idea of making a site crawler. I searched for that term in your blog and didn't see any results. Did this ever occur?
Not yet - no.
I started a solution to create a map on the server side - but it isn't a crawler. I'll post the code after I get it cleaned up enough and some of the kinks worked out.
I am having issues with cfdirectory w/recursion at webroot. I get the pesky null pointer error, which I am attributing to archived directories, etc bloating the query. I still need to prove that is the cause.
Ruth, don't forget that if you confirm a bug, you can report it at:
http://www.adobe.com/go/wish
Ray, the udf on cflib although dated March 9 2007 doesn't have the changes you mentioned you'd added to timezone and the use of stringbuffer. Also the getTimeZoneInfo().utcHourOffset assigned to tz returns (for me at least) an offset value, eg "-1" for UK, so the test which adds the "+" or "-" later is unnecessary and makes the date format invalid (eg 2007-08-04T19:24+-1:00).
I'm updating it in 10 seconds. Will you please give it a try?
Yes that works great, thanks Ray.
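The offset problems raised in this thread (missing zero padding, and the doubled sign when utcHourOffset is negative) can be handled in one small helper. This is only a sketch: the function name w3cOffset is made up for illustration, it assumes utcHourOffset is positive for zones west of UTC and negative for zones east of it, and it ignores zones with a non-zero minute offset.

```cfml
<cffunction name="w3cOffset" output="false" returnType="string">
    <!--- hourOffset is expected to be getTimeZoneInfo().utcHourOffset --->
    <cfargument name="hourOffset" type="numeric" required="true">
    <cfset var sign = "+">
    <!--- A positive utcHourOffset means local time is behind UTC, which W3C datetime writes with a minus sign. --->
    <cfif arguments.hourOffset gt 0>
        <cfset sign = "-">
    </cfif>
    <!--- abs() avoids output like "+-1:00"; numberFormat pads the hour to two digits. --->
    <cfreturn sign & numberFormat(abs(arguments.hourOffset), "00") & ":00">
</cffunction>

<!--- w3cOffset(5) gives "-05:00", w3cOffset(-1) gives "+01:00", w3cOffset(0) gives "+00:00" --->
<cfset offset = w3cOffset(getTimeZoneInfo().utcHourOffset)>
<cfoutput>#dateFormat(now(), "yyyy-mm-dd")#T#timeFormat(now(), "HH:mm")##offset#</cfoutput>
```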
You say blogcfc has had sitemap support for some time, in what way?
Is it supposed to generate a sitemap.xml file ?
And if so, how? I can't find any option to do this.
Or do I need to update, as I am on BlogCFC 5.5?
There should be a file named sitemap or googlesitemap.cfm in the root directory.
Hi, I can't figure out how to combine these values into one XML output:
<cfset siteMapXML = generateSiteMap(data=urls,changefreq="daily",priority="1.0", lastmod=now())>
<cfdump var="#xmlParse(siteMapXML)#">
<cfset siteMapXML = generateSiteMap(qurls)>
<cfdump var="#xmlParse(siteMapXML)#">
I want these combined, as I need to put it all into one XML sitemap. The .cfm sitemap takes too long to load because it's a big sitemap.
thanks
Well, I think you can just combine both XML files. You would want to remove the XML declaration (the <?xml ...?> header) from the second one though. Not exactly sure - but it's definitely possible.
of course, i was thinking the hard way as usual, thx
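One way to avoid stitching two XML strings together is to merge the inputs first and call the UDF once. The sketch below assumes urls is a comma-delimited list and qurls is a query with a url column, as in the question above. It is a simplification: any per-row lastmod, changefreq, or priority values in the query are dropped, and the values passed in apply to every URL.

```cfml
<!--- Flatten the query's url column into a list and append it to the existing list. --->
<cfset allUrls = listAppend(urls, valueList(qurls.url))>

<!--- A single call produces one <urlset> with one XML declaration. --->
<cfset siteMapXML = generateSiteMap(data=allUrls, changefreq="daily", priority=1.0, lastmod=now())>

<cfdump var="#xmlParse(siteMapXML)#">
```

If the per-row values matter, the alternative is to go the other way and add the list URLs to the query as extra rows before calling the UDF.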
Could someone break this down for me? I've read this post over and over and looked at the CFLib documentation. I don't know if I'm missing something or (most likely) I just don't know what I'm doing. Any shove in the right direction is greatly appreciated.
Hi Ray, many retrospective thanks for other tips.
I'm with Adam on this. As indicated at sitemaps.org, it is clear that the search engine will look for an XML document called sitemap.
That the generateSiteMap function returns this XML code for the URLs provided is also clear. That the xmlParse function turns this into an XML object (and the XML code is visible in code view on the webpage) is clear.
So instead of using cfdump, you pick whichever of the three options you prefer and then just surround the xmlparse with cfoutput tags and this hands the xml to the search engine?
It's one of those things where you could end up ten years later in a bar discussing XML and girls only to find out that you've been setting up invalid site maps for ten years.
Yeah, take the result and just output it.
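For anyone unsure what "just output it" looks like in practice, here is a minimal sketch of two options. Note that the raw string returned by generateSiteMap is output, not the xmlParse result. The URL list and the sitemap.xml file name are assumptions for illustration, and the UDF is assumed to be defined on the page.

```cfml
<!--- Hypothetical URL list; build yours from your own site data. --->
<cfset urls = "http://www.example.com/,http://www.example.com/about.cfm">
<cfset siteMapXML = generateSiteMap(data=urls, changefreq="daily")>

<!--- Option 1: write a static file next to this template, so crawlers never wait on a slow .cfm page. --->
<cffile action="write" file="#expandPath('./sitemap.xml')#" output="#siteMapXML#" charset="utf-8">

<!--- Option 2: serve the XML directly. reset="true" discards any output generated earlier in the request. --->
<cfcontent type="text/xml; charset=utf-8" reset="true"><cfoutput>#siteMapXML#</cfoutput>
```

Writing the file from a scheduled task is one way to handle a large sitemap that takes too long to generate on each request.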
Used bits n pieces of your code...works like a charm...thanks.
How are you using this? I'm able to call it and it seems to output the header and closing markup, but I haven't made the connection as to how it gets all the site directories to produce a complete site map. I can see I can pass the data argument to it, but what are you using to traverse the site directories so that this component can wrap it in XML?
I'm not traversing the site. You need to figure that part out yourself. Every site is different. So for example, on this blog, I can make a map by getting all the entries. For a news site, you would get all the news articles. Amazon would get all their products. Essentially - look at your site data and set up a query that represents the content.
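To make "a query that represents the content" concrete, here is a small sketch that builds such a query by hand. In a real site the rows would more likely come from a cfquery against your own tables. The column names url, lastmod, changefreq, and priority are the ones the UDF looks for; the URLs and values are made up.

```cfml
<!--- Build a two-row query by hand; the data is purely illustrative. --->
<cfset pages = queryNew("url,lastmod,changefreq,priority", "varchar,timestamp,varchar,double")>
<cfset queryAddRow(pages, 2)>

<cfset querySetCell(pages, "url", "http://www.example.com/", 1)>
<cfset querySetCell(pages, "lastmod", now(), 1)>
<cfset querySetCell(pages, "changefreq", "daily", 1)>
<cfset querySetCell(pages, "priority", 1.0, 1)>

<cfset querySetCell(pages, "url", "http://www.example.com/news/article.cfm?id=42", 2)>
<cfset querySetCell(pages, "lastmod", dateAdd("d", -7, now()), 2)>
<cfset querySetCell(pages, "changefreq", "monthly", 2)>
<cfset querySetCell(pages, "priority", 0.5, 2)>

<!--- In the query form, the optional values are read per row from the matching columns. --->
<cfset siteMapXML = generateSiteMap(pages)>
<cfdump var="#xmlParse(siteMapXML)#">
```

Because the UDF emits a tag whenever the matching column exists, every row should carry a value for each optional column you include. Leave a column out of the query entirely if you do not want that tag.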