Not many people know that ColdFusion ships with an HTTP spider that integrates with Verity. Unfortunately, this spider will only work against localhost. This means that if you want to spider multiple sites, you can't. Well, not without playing with your host headers. (More information on the Verity Spider and ColdFusion may be found here.)
What I worked on today was a way to work around this limitation. It turns out that if you have a sitemap, you already have a "spider" of your site. BlogCFC supports sitemaps out of the box, and in the past I've blogged a simple UDF to generate sitemaps. Let's look at how we can convert a sitemap into Verity data.
To begin with - let's take a look at some very simple sitemap data.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.foo.com/index.cfm</loc>
  </url>
  <url>
    <loc>http://www.foo.com/index2.cfm</loc>
  </url>
  <url>
    <loc>http://www.foo.com/index3.cfm</loc>
  </url>
</urlset>
This sample is missing many of the features that you can include with a sitemap, but it gives you an idea of the structure. As you could guess - a sitemap contains a collection of URLs. So let's look at how we can parse this XML. (Note - I'll be using ColdFusion 8 code throughout this demonstration, but you can easily downgrade this to CF7, 6, or even 5.)
<!--- read in xml --->
<cfset myxml = fileRead(expandPath("./sitemap.xml"))>
<!--- convert to xml --->
<cfset myxml = xmlParse(myxml)>
The first thing I do is read in my sitemap and convert it to XML.
<!--- place to store data --->
<cfset request.data = structNew()>
I'm going to be using threading, so I create a Request variable to store my information.
<!--- now loop through.... --->
<cfloop index="x" from="1" to="#min(20,arrayLen(myxml.urlset.url))#">
<cfset tname = "thread#x#">
<cfthread name="#tname#" url="#myxml.urlset.url[x].loc.xmltext#">
<cfhttp url="#attributes.url#" result="result">
<cfset request.data[attributes.url] = structNew()>
<cfset request.data[attributes.url].title = getHTMLTitle(result.filecontent)>
<cfset request.data[attributes.url].body = getHTMLBody(result.filecontent)>
<!--- remove all html from body --->
<cfset request.data[attributes.url].body = rereplace(request.data[attributes.url].body, "<.*?>", "", "all")>
<cfset headers = getMetaHeaders(result.filecontent)>
<cfset request.data[attributes.url].keywords = "">
<cfset request.data[attributes.url].description = "">
<cfset request.data[attributes.url].x = headers>
<!--- find description and keywords --->
<cfloop index="x" from="1" to="#arrayLen(headers)#">
<cfif structKeyExists(headers[x], "name")>
<cfif headers[x].name is "description">
<cfset request.data[attributes.url].description = headers[x].content>
<cfelseif headers[x].name is "keywords">
<cfset request.data[attributes.url].keywords = headers[x].content>
</cfif>
</cfif>
</cfloop>
</cfthread>
</cfloop>
Ok, I have a lot going on here, so let me take it bit by bit. First off - I'm looping over the XML packet by treating the url tag as an array. Note the use of min(). I did this simply to make my testing run quicker - my current blog's sitemap has over 2,000 URLs. (The cool thing is that it only took ColdFusion about 9 minutes to process all 2,000 when I tried it.) For each URL, I suck down the content with CFHTTP.
Now for the interesting part. I could simply provide the raw HTML to Verity. While this "works", it doesn't give Verity as much information as I would like. Also - Verity doesn't "get" that it is indexing HTML data. It thinks it is just working with simple strings. So in order to make things a bit nicer, I do some cleaning.
First off I use a few UDFs: getHTMLTitle, getHTMLBody, getMetaHeaders. The first and last are from CFLib and the middle one is a modified version of getHTMLTitle. These simply parse the HTML string for the title, body, and the meta tags. (I provide the full code at the end.)
I remove all HTML from the body. I then look at the meta tags and specifically try to find a description and keywords meta tag. All of this gets stored into the Request scope Data structure I created.
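One caveat worth flagging: all of the threads write into the same shared Request struct. Each thread writes under its own URL key, so collisions are unlikely, but if you want to be strictly safe you could serialize those writes with a named lock. A minimal sketch (the lock name is arbitrary):
<!--- inside the cfthread body - serialize writes to the shared struct --->
<cflock name="sitemapDataLock" type="exclusive" timeout="10">
  <cfset request.data[attributes.url] = structNew()>
  <cfset request.data[attributes.url].title = getHTMLTitle(result.filecontent)>
  <cfset request.data[attributes.url].body = getHTMLBody(result.filecontent)>
</cflock>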
Now I need to tell ColdFusion to wait for the threads to end:
<!--- join the threads --->
<cfthread action="join" name="#structKeyList(cfthread)#" />
I then convert my structure into a query:
<!--- make a query for the data --->
<cfset info = queryNew("url,body,title,keywords,description")>
<cfloop item="c" collection="#request.data#">
<cfset queryAddRow(info)>
<cfset querySetCell(info, "url", c)>
<cfset querySetCell(info, "body", request.data[c].body)>
<cfset querySetCell(info, "title", request.data[c].title)>
<cfset querySetCell(info, "keywords", request.data[c].keywords)>
<cfset querySetCell(info, "description", request.data[c].description)>
</cfloop>
There isn't anything too fancy there - I'm just copying from the structure into a new query.
Next I insert into the Verity collection:
<!--- insert data --->
<cfindex collection="sitemaptest" action="refresh" query="info" title="title" key="url" body="body" urlpath="url" custom1="keywords" custom2="description" status="status">
Note how I tell Verity what column means what. I stored the description and keywords into the custom columns.
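Once the data is in the collection, searching it works like any other Verity collection. A quick sketch of what a search could look like (the criteria here is just an example - note that the description comes back in the custom2 column):
<cfsearch collection="sitemaptest" criteria="coldfusion" name="results">
<cfoutput query="results">
  <p><a href="#url#">#title#</a><br>#custom2#</p>
</cfoutput>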
That's it! To make sure things worked, I dumped some information at the end:
<cfoutput>
<p>
Done indexing. Did #info.recordCount# rows. Took #totaltime# ms.
</p>
</cfoutput>
<cfdump var="#status#">
All in all this worked ok - but it has a few problems/places for improvement:
- First off - you can really tell that Verity wasn't 100% sure what to do with my data. That's why I removed the HTML. I could have considered taking the data I sucked down, saving it to an HTML file, and then running a file based index. While this would be slower, it could have resulted in better indexing.
- Second - my code, ignoring the min(), will suck down every URL and index it. As I mentioned, sitemaps can store more than just URLs. They can also store the last time each URL was modified. If I were reading my XML data once a day, it would make sense to only suck down URLs that were modified today. This would greatly improve the speed of the indexing. (A rough sketch of this idea follows below.)
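To illustrate that second point, here is a sketch of how the lastmod filter might look. It assumes each url node carries a lastmod child (my sample sitemap above doesn't have one) with a simple YYYY-MM-DD value - full W3C timestamps with time zones would need extra parsing:
<!--- collect only the URLs modified within the last day --->
<cfset urlsToIndex = arrayNew(1)>
<cfloop index="x" from="1" to="#arrayLen(myxml.urlset.url)#">
  <cfif structKeyExists(myxml.urlset.url[x], "lastmod")>
    <cfset lastMod = parseDateTime(myxml.urlset.url[x].lastmod.xmltext)>
    <cfif dateDiff("d", lastMod, now()) lte 1>
      <cfset arrayAppend(urlsToIndex, myxml.urlset.url[x].loc.xmltext)>
    </cfif>
  </cfif>
</cfloop>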
Here is the complete index file:
<cfsetting requesttimeout="600">
<cfset thetime = getTickCount()>
<cfscript>
/**
 * Parses an HTML page and returns the title.
 * @param str The HTML string to check.
 * @return Returns a string.
 * @author Raymond Camden (ray@camdenfamily.com)
 * @version 1, December 3, 2001
 */
function GetHTMLTitle(str) {
  var matchStruct = reFindNoCase("<[[:space:]]*title[[:space:]]*>([^<]*)<[[:space:]]*/title[[:space:]]*>",str,1,1);
  if(arrayLen(matchStruct.len) lt 2) return "";
  return Mid(str,matchStruct.pos[2],matchStruct.len[2]);
}
/**
 * Parses an HTML page and returns the contents of the body tag.
 * (A modified version of GetHTMLTitle.)
 */
function GetHTMLBody(str) {
  var matchStruct = reFindNoCase("<.*?body.*?>(.*?)<[[:space:]]*/body[[:space:]]*>",str,1,1);
  if(arrayLen(matchStruct.len) lt 2) return "";
  return Mid(str,matchStruct.pos[2],matchStruct.len[2]);
}
/**
 * Parses an HTML page and returns an array of structs, one per meta tag found.
 */
function GetMetaHeaders(str) {
  var matchStruct = structNew();
  var name = "";
  var content = "";
  var results = arrayNew(1);
  var pos = 1;
  var regex = "<meta[[:space:]]*(name|http-equiv)[[:space:]]*=[[:space:]]*(""|')([^""]*)(""|')[[:space:]]*content=(""|')([^""]*)(""|')[[:space:]]*/{0,1}>";
  matchStruct = REFindNoCase(regex,str,pos,1);
  while(matchStruct.pos[1]) {
    results[arrayLen(results)+1] = structNew();
    results[arrayLen(results)][Mid(str,matchStruct.pos[2],matchStruct.len[2])] = Mid(str,matchStruct.pos[4],matchStruct.len[4]);
    results[arrayLen(results)].content = Mid(str,matchStruct.pos[7],matchStruct.len[7]);
    pos = matchStruct.pos[6] + matchStruct.len[6] + 1;
    matchStruct = REFindNoCase(regex,str,pos,1);
  }
  return results;
}
</cfscript>
<!--- create collection if needed --->
<cfcollection action="list" name="mycollections">
<cfif not listFindNoCase(valueList(mycollections.name), "sitemaptest")>
  <cfoutput><p>Creating collection.</p></cfoutput>
  <cfcollection action="create" collection="sitemaptest" path="#server.coldfusion.rootdir#/collections">
</cfif>
<!--- read in xml --->
<cfset myxml = fileRead(expandPath("./sitemap.xml"))>
<!--- convert to xml --->
<cfset myxml = xmlParse(myxml)>
<!--- place to store data --->
<cfset request.data = structNew()>
<!--- now loop through.... --->
<cfloop index="x" from="1" to="#min(20,arrayLen(myxml.urlset.url))#">
<cfset tname = "thread#x#">
<cfthread name="#tname#" url="#myxml.urlset.url[x].loc.xmltext#">
<cfhttp url="#attributes.url#" result="result">
<cfset request.data[attributes.url] = structNew()>
<cfset request.data[attributes.url].title = getHTMLTitle(result.filecontent)>
<cfset request.data[attributes.url].body = getHTMLBody(result.filecontent)>
<!--- remove all html from body --->
<cfset request.data[attributes.url].body = rereplace(request.data[attributes.url].body, "<.*?>", "", "all")>
<cfset headers = getMetaHeaders(result.filecontent)>
<cfset request.data[attributes.url].keywords = "">
<cfset request.data[attributes.url].description = "">
<cfset request.data[attributes.url].x = headers>
<!--- find description and keywords --->
<cfloop index="x" from="1" to="#arrayLen(headers)#">
<cfif structKeyExists(headers[x], "name")>
<cfif headers[x].name is "description">
<cfset request.data[attributes.url].description = headers[x].content>
<cfelseif headers[x].name is "keywords">
<cfset request.data[attributes.url].keywords = headers[x].content>
</cfif>
</cfif>
</cfloop>
</cfthread>
</cfloop>
<!--- join the threads --->
<cfthread action="join" name="#structKeyList(cfthread)#" />
<!--- make a query for the data --->
<cfset info = queryNew("url,body,title,keywords,description")>
<cfloop item="c" collection="#request.data#">
<cfset queryAddRow(info)>
<cfset querySetCell(info, "url", c)>
<cfset querySetCell(info, "body", request.data[c].body)>
<cfset querySetCell(info, "title", request.data[c].title)>
<cfset querySetCell(info, "keywords", request.data[c].keywords)>
<cfset querySetCell(info, "description", request.data[c].description)>
</cfloop>
<!--- insert data --->
<cfindex collection="sitemaptest" action="refresh" query="info" title="title" key="url" body="body" urlpath="url" custom1="keywords" custom2="description" status="status">
<cfset totaltime = getTickCount() - thetime>
<cfoutput>
<p>
Done indexing. Did #info.recordCount# rows. Took #totaltime# ms.
</p>
</cfoutput>
<cfdump var="#status#">
Archived Comments
I may have missed the reason for doing this :-) You mention indexing and storing the meta info in a database. Why would you want to do this? Thanks
I only mentioned the meta info in regards to the time stamps. If URL x was updated yesterday, then there is no need to resuck it down.
Thanks, this looks really cool. I was wondering - are there any specific benefits or uses for 'Using Sitemaps with Verity'?
Well the point was that it could be used as an alternative to the spider, which only works on localhost.
CF7 version?
Hi Ray -
I am trying to implement this on a site that sits on a CF7 server. I managed to remove the cfthread references and replaced fileRead with a cffile action='read'... and it seems to work, sorta but not quite.
My sitemap.xml has about 80 nodes.
When running the modified indexing page, I get a success message saying 20 rows have been added - but nothing at all in search results, even searching for words that should be in every document. (side note: is there any way to view or dump the contents of a verity collection?)
In your notes you say 'this should be easy to downgrade to cf7' , but I am wondering what I have missed. I know this is vague but... any ideas?
Hmm. So first let's look at why you got 20 rows, not 80. If you cfdump info before you index, do you see 80 rows? How about your URL column - are the values all unique?
The only simple way to dump an entire collection is to search for *.
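For example:
<cfsearch collection="sitemaptest" criteria="*" name="everything">
<cfdump var="#everything#">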
Hi Ray -
thanks for the quick reply.
yes, the dump shows 80 rows (dumping out the same 'myxml' variable that is being parsed).
I hacked this in two ways for cf7
Took out the fileRead like this, adding cffile instead
<cffile action="read" file="#siteMapPath#" variable="myxml">
<!--- <cfset myxml = fileRead(expandPath("./sitemap.xml"))> --->
And commented out the cfthread references in 3 places.
I just sent you a direct email with links to the pages and a bit of explanation, and I am playing with the cfdump now.
I'd love to get this working for CF7, and hope it might be useful to others too.
tracing this down further,
http://garkaneenergy.com/ve...
For the moment, I have this page dumping out each of the request.data... structs for each page, along with the page name.
Super cool to see all that meaty text content in there!
Below that I am dumping the full xml variable.
it goes through the first 20 in the file, and then chokes.
The fact that I am getting the first sequential 20 makes me curious. I deleted numbers 20 and 21 in case it was bad data... then I deleted the first 20 and still only got 20.
DOH!!!!
<smacking head>
Your example code limits the cfloop to 20 rows.
< ashamed... >
<cfloop index="x" from="1" to="#min(20,arrayLen(myxml.urlset.url))#">
Changing that to the more obvious value
<cfloop index="x" from="1" to="#arrayLen(myxml.urlset.url)#">
did the trick. Duh, double duh.
So... ok ... we are past the 'why only 20' hurdle.
The page shows I am getting 58 rows from the xml file, which sounds perfect, and I can see, in my lovely cfdumps, all the meaty text content and perfectly organized meta values... awesome!
Listing my verity collection info
I see I have a DocCount of 58.
But... still no searchy.
http://garkaneenergy.com/se...
Running a search here for "garkane" should bring up 58 pages at the current time.
Now I am back to wishing I could 'dump' the contents of a verity collection. I want to see what those 58 DocCount records actually contain!
I can only think that I screwed it up somewhere by simply hacking out the cfthread tags. More investigation!
AHA!!! SUCCESS!!
I had my cfsearch set up to find content only '<in> body'... but there's no more body tag to search through.
Ray this is AWESOME, and it has been the catalyst for quite the spontaneous learning experience.
Thanks for being a sounding board... I think it works!!
Glad you got it - I was in meetings so couldn't respond.
This is so cool.
Now that I have struggled with it, I *get* it and , Ray, this rocks!
Taking the content filtering one step further, I added some regular expression code to strip out everything from the retrieved 'body' code that is not inside of my "mainCol" div. That way the first thing shown in the site summary from the cfindex is the heading of the main part of the page, then the page text - no messy menu or preceding-column junk to contend with.
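Roughly, the idea looks like this (a simplified sketch cutting on marker comments rather than my actual div regex - the pageBody variable and marker names are made up):
<!--- keep only the chunk between marker comments placed in the page template --->
<cfset startMarker = "<!-- searchStart -->">
<cfset endMarker = "<!-- searchEnd -->">
<cfset startPos = find(startMarker, pageBody)>
<cfset endPos = find(endMarker, pageBody)>
<cfif startPos and endPos gt startPos>
  <cfset pageBody = mid(pageBody, startPos + len(startMarker), endPos - startPos - len(startMarker))>
</cfif>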
This full circle code-trip means that I can now
- generate an xml site map for any visible site
- edit the sitemap.xml as an easy no-frills way to limit or extend which pages verity spiders
- use this code to create a cfcollection in *minutes*, filtering the retained content according to any inserted tag or comment in my pages' code (i.e. only main column, etc)
- run a cfsearch against the collection, resulting in a super-fast lightweight homegrown in-site magic ColdFusion search!
thanks again... I am really psyched about having this entire code set in my collection.
To be honest, I wasn't sure anyone would use it. Glad you liked it!
Not only do I like it ... I can see using this a LOT. Even with access to Verity Spider in some cases, this seems like a very customizable way to get some very clean results.
I think I will put together a demo and blog post on the full circle trip using the sitemap creator and this code, plus how I restricted the search to specific parts of the page markup... If I do that, can I include a modified copy of your code as a downloadable file (with credit given, of course)?
I didn't have a clue about any of this until yesterday... now I feel like I've been handed a shiny new toolbox that lots of folks are constantly looking for... this could be really useful to a lot of people once they see it in action!
You can include my code for
ONE MILLION DOLLARS! (finger by mouth and evil laugh)
For you... in advance!
http://mredesign.com/cfdev/...
Oh that's just mean. ;)
Ok, here's to make up for that last link
http://mredesign.com/demos/...
This is pretty neat - I'm using the sitemap generator to make the xml, then feeding it to your verity writer, all with a neat little skin.
Coolest part - download the zip, and drop the files into any site, then browse to the index page - walk through the steps and presto chango, instant site search!
Ray,
first off I visit your blog almost daily and usually find exactly what I am looking for. Thanks for this great resource.
My question about this script is:
While crawling my local website with this script, it works perfectly except that I often get timeout errors. I have adjusted the timeout to 1200 but it still occurs, even on smaller crawls.
Is there any way to debug this to find out which cfhttp calls are getting hung up, and perhaps just skip them?
Well, there are a few things to consider. First off - why not bump up the timeout even higher? Secondly - you can log each http request and how long it took.
<cfset t = getTickCount()>
<cfhttp .....>
<cfset duration = getTickCount() - t>
<cflog file="test" text="To get url #x#, it took #duration# ms">
This may flag the culprit.
Worst comes to worst - feed the code portions of the XML at a time.
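You could also give each cfhttp call its own timeout and just skip anything that doesn't come back clean. A rough sketch (the 30 second timeout is an arbitrary pick):
<!--- per-request timeout so one slow page can't hang the whole run --->
<cfhttp url="#attributes.url#" result="result" timeout="30" throwonerror="no">
<cfif result.statusCode contains "200">
  <!--- process the page as before --->
  <cfset request.data[attributes.url] = structNew()>
<cfelse>
  <!--- log the failure and move on --->
  <cflog file="sitemapindexer" text="Skipped #attributes.url# (#result.statusCode#)">
</cfif>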
Ray,
You mentioned "I could have considered taking the data I sucked down, saving it to an HTML file, and then running a file based index. While this would be slower, it could have resulted in better indexing."
What would be the easiest way to modify this code to work that way instead? I currently have all my dynamic pages saved off as HTML files using cfhttp, and have Verity indexing them for my search. I would like to switch to this instead.
Well, where I do the cfhttp and get the content, I'd just save it into a new folder. I'd then tell Verity to index that folder.
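Something like this, as a sketch (the snapshot folder is a made-up name, and hashing the URL for the filename means you'd lose the clean URL mapping - a smarter naming scheme could preserve it):
<!--- inside the thread - snapshot the raw HTML to disk --->
<cfset fileWrite(expandPath("./snapshot/" & hash(attributes.url) & ".html"), result.filecontent)>
<!--- after the threads join - run a file-based index over the folder --->
<cfindex collection="sitemaptest" action="refresh" type="path" key="#expandPath('./snapshot')#" extensions=".html" urlpath="http://www.foo.com/" status="status">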
Ray,
How would you recommend telling the function to exclude certain links? Would you just have to maintain a no follow list? Or is it kinda pointless since Google is going to follow the link no matter what?
I guess it depends on what you want to filter. Not sure what you want to do.