Not many people know that ColdFusion ships with an HTTP spider that integrates with Verity. Unfortunately, this spider will only work against localhost. This means that if you want to spider multiple sites, you can't. Well, not without playing with your host headers. (More information on the Verity Spider and ColdFusion may be found here.)
What I worked on today was a way to work around this limitation. It turns out that if you have a sitemap, you already have a "spider" of your site. BlogCFC supports sitemaps out of the box, and in the past I've blogged a simple UDF to generate sitemaps. Let's look at how we can convert a sitemap into Verity data.
To begin with - let's take a look at some very simple sitemap data.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.foo.com/index.cfm</loc>
  </url>
  <url>
    <loc>http://www.foo.com/index2.cfm</loc>
  </url>
  <url>
    <loc>http://www.foo.com/index3.cfm</loc>
  </url>
</urlset>
This sample is missing many of the features that you can include with a sitemap, but it gives you an idea of the structure. As you could guess - a sitemap contains a collection of URLs. So let's look at how we can parse this XML. (Note - I'll be using ColdFusion 8 code throughout this demonstration, but you can easily downgrade this to CF7, 6, or even 5.)
<!--- read in xml --->
<cfset myxml = fileRead(expandPath("./sitemap.xml"))>
<!--- convert to xml --->
<cfset myxml = xmlParse(myxml)>
The first thing I do is read in my sitemap and convert it to XML.
<!--- place to store data --->
<cfset request.data = structNew()>
I'm going to be using threading, so I create a Request variable to store my information.
<!--- now loop through.... --->
<cfloop index="x" from="1" to="#min(20,arrayLen(myxml.urlset.url))#">
<cfset tname = "thread#x#">
<cfthread name="#tname#" url="#myxml.urlset.url[x].loc.xmltext#">
<cfhttp url="#attributes.url#" result="result">
<cfset request.data[attributes.url] = structNew()>
<cfset request.data[attributes.url].title = getHTMLTitle(result.filecontent)>
<cfset request.data[attributes.url].body = getHTMLBody(result.filecontent)>
<!--- remove all html from body --->
<cfset request.data[attributes.url].body = rereplace(request.data[attributes.url].body, "<.*?>", "", "all")>
<cfset headers = getMetaHeaders(result.filecontent)>
<cfset request.data[attributes.url].keywords = "">
<cfset request.data[attributes.url].description = "">
<cfset request.data[attributes.url].x = headers>
<!--- find description and keywords --->
<cfloop index="x" from="1" to="#arrayLen(headers)#">
<cfif structKeyExists(headers[x], "name")>
<cfif headers[x].name is "description">
<cfset request.data[attributes.url].description = headers[x].content>
<cfelseif headers[x].name is "keywords">
<cfset request.data[attributes.url].keywords = headers[x].content>
</cfif>
</cfif>
</cfloop>
</cfthread>
</cfloop>
Ok, I have a lot going on here, so let me take it bit by bit. First off - I'm looping over the XML packet by treating the url tag as an array. Note the use of min(). I did this simply to make my testing run quicker - my current blog's sitemap has over 2,000 URLs. (The cool thing is that it only took ColdFusion about 9 minutes to process all 2,000 when I tried it.) For each URL, I suck down the content with CFHTTP.
Now for the interesting part. I could simply provide the raw HTML to Verity. While this "works", it doesn't give Verity as much information as I would like. Also - Verity doesn't "get" that it is indexing HTML data. It thinks it is just working with simple strings. So in order to make things a bit nicer, I do some cleaning.
First off I use a few UDFs: getHTMLTitle, getHTMLBody, getMetaHeaders. The first and last are from CFLib and the middle one is a modified version of getHTMLTitle. These simply parse the HTML string for the title, body, and the meta tags. (I provide the full code at the end.)
I remove all HTML from the body. I then look at the meta tags and specifically try to find a description and keywords meta tag. All of this gets stored into the Request scope Data structure I created.
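One caveat worth flagging: all of the threads write into the same shared Request struct. Each thread writes under its own URL key, so collisions are unlikely, but if you want to be strictly safe you could serialize those writes with a named lock. A minimal sketch (the lock name is arbitrary):
<!--- inside the cfthread body - serialize writes to the shared struct --->
<cflock name="sitemapDataLock" type="exclusive" timeout="10">
  <cfset request.data[attributes.url] = structNew()>
  <cfset request.data[attributes.url].title = getHTMLTitle(result.filecontent)>
  <cfset request.data[attributes.url].body = getHTMLBody(result.filecontent)>
</cflock>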
Now I need to tell ColdFusion to wait for the threads to end:
<!--- join the threads --->
<cfthread action="join" name="#structKeyList(cfthread)#" />
I then convert my structure into a query:
<!--- make a query for the data --->
<cfset info = queryNew("url,body,title,keywords,description")>
<cfloop item="c" collection="#request.data#">
<cfset queryAddRow(info)>
<cfset querySetCell(info, "url", c)>
<cfset querySetCell(info, "body", request.data[c].body)>
<cfset querySetCell(info, "title", request.data[c].title)>
<cfset querySetCell(info, "keywords", request.data[c].keywords)>
<cfset querySetCell(info, "description", request.data[c].description)>
</cfloop>
There isn't anything too fancy there - I'm just copying from the structure into a new query.
Next I insert into the Verity collection:
<!--- insert data --->
<cfindex collection="sitemaptest" action="refresh" query="info" title="title" key="url" body="body" urlpath="url" custom1="keywords" custom2="description" status="status">
Note how I tell Verity what column means what. I stored the description and keywords into the custom columns.
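Once the data is in the collection, searching it works like any other Verity collection. A quick sketch of what a search could look like (the criteria here is just an example - note that the description comes back in the custom2 column):
<cfsearch collection="sitemaptest" criteria="coldfusion" name="results">
<cfoutput query="results">
  <p><a href="#url#">#title#</a><br>#custom2#</p>
</cfoutput>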
That's it! To make sure things worked, I dumped some information at the end:
<cfoutput>
<p>
Done indexing. Did #info.recordCount# rows. Took #totaltime# ms.
</p>
</cfoutput>
<cfdump var="#status#">
All in all this worked ok - but it has a few problems/places for improvement:
- First off - you can really tell that Verity wasn't 100% sure what to do with my data. That's why I removed the HTML. I could have considered taking the data I sucked down, saving it to an HTML file, and then running a file based index. While this would be slower, it could have resulted in better indexing.
- Second - my code, ignoring the min(), will suck down every URL and index it. As I mentioned, sitemaps can store more than just URLs. They can also store the last time each URL was modified. If I were reading my XML data once a day, it would make sense to only suck down URLs that were modified today. This would greatly improve the speed of the indexing. (A rough sketch of this idea follows below.)
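To illustrate that second point, here is a sketch of how the lastmod filter might look. It assumes each url node carries a lastmod child (my sample sitemap above doesn't have one) with a simple YYYY-MM-DD value - full W3C timestamps with time zones would need extra parsing:
<!--- collect only the URLs modified within the last day --->
<cfset urlsToIndex = arrayNew(1)>
<cfloop index="x" from="1" to="#arrayLen(myxml.urlset.url)#">
  <cfif structKeyExists(myxml.urlset.url[x], "lastmod")>
    <cfset lastMod = parseDateTime(myxml.urlset.url[x].lastmod.xmltext)>
    <cfif dateDiff("d", lastMod, now()) lte 1>
      <cfset arrayAppend(urlsToIndex, myxml.urlset.url[x].loc.xmltext)>
    </cfif>
  </cfif>
</cfloop>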
Here is the complete index file:
<cfsetting requesttimeout="600">
<cfset thetime = getTickCount()>
<cfscript>
/**
 * Parses an HTML page and returns the title.
 * @param str The HTML string to check.
 * @return Returns a string.
 * @author Raymond Camden (ray@camdenfamily.com)
 * @version 1, December 3, 2001
 */
function GetHTMLTitle(str) {
  var matchStruct = reFindNoCase("<[[:space:]]*title[[:space:]]*>([^<]*)<[[:space:]]*/title[[:space:]]*>",str,1,1);
  if(arrayLen(matchStruct.len) lt 2) return "";
  return Mid(str,matchStruct.pos[2],matchStruct.len[2]);
}
/**
 * Parses an HTML page and returns the contents of the body tag.
 * (A modified version of GetHTMLTitle.)
 */
function GetHTMLBody(str) {
  var matchStruct = reFindNoCase("<.*?body.*?>(.*?)<[[:space:]]*/body[[:space:]]*>",str,1,1);
  if(arrayLen(matchStruct.len) lt 2) return "";
  return Mid(str,matchStruct.pos[2],matchStruct.len[2]);
}
/**
 * Parses an HTML page and returns an array of structs, one per meta tag found.
 */
function GetMetaHeaders(str) {
  var matchStruct = structNew();
  var name = "";
  var content = "";
  var results = arrayNew(1);
  var pos = 1;
  var regex = "<meta[[:space:]]*(name|http-equiv)[[:space:]]*=[[:space:]]*(""|')([^""]*)(""|')[[:space:]]*content=(""|')([^""]*)(""|')[[:space:]]*/{0,1}>";
  matchStruct = REFindNoCase(regex,str,pos,1);
  while(matchStruct.pos[1]) {
    results[arrayLen(results)+1] = structNew();
    results[arrayLen(results)][Mid(str,matchStruct.pos[2],matchStruct.len[2])] = Mid(str,matchStruct.pos[4],matchStruct.len[4]);
    results[arrayLen(results)].content = Mid(str,matchStruct.pos[7],matchStruct.len[7]);
    pos = matchStruct.pos[6] + matchStruct.len[6] + 1;
    matchStruct = REFindNoCase(regex,str,pos,1);
  }
  return results;
}
</cfscript>
<!--- create collection if needed --->
<cfcollection action="list" name="mycollections">
<cfif not listFindNoCase(valueList(mycollections.name), "sitemaptest")>
  <cfoutput><p>Creating collection.</p></cfoutput>
  <cfcollection action="create" collection="sitemaptest" path="#server.coldfusion.rootdir#/collections">
</cfif>
<!--- read in xml --->
<cfset myxml = fileRead(expandPath("./sitemap.xml"))>
<!--- convert to xml --->
<cfset myxml = xmlParse(myxml)>
<!--- place to store data --->
<cfset request.data = structNew()>
<!--- now loop through.... --->
<cfloop index="x" from="1" to="#min(20,arrayLen(myxml.urlset.url))#">
<cfset tname = "thread#x#">
<cfthread name="#tname#" url="#myxml.urlset.url[x].loc.xmltext#">
<cfhttp url="#attributes.url#" result="result">
<cfset request.data[attributes.url] = structNew()>
<cfset request.data[attributes.url].title = getHTMLTitle(result.filecontent)>
<cfset request.data[attributes.url].body = getHTMLBody(result.filecontent)>
<!--- remove all html from body --->
<cfset request.data[attributes.url].body = rereplace(request.data[attributes.url].body, "<.*?>", "", "all")>
<cfset headers = getMetaHeaders(result.filecontent)>
<cfset request.data[attributes.url].keywords = "">
<cfset request.data[attributes.url].description = "">
<cfset request.data[attributes.url].x = headers>
<!--- find description and keywords --->
<cfloop index="x" from="1" to="#arrayLen(headers)#">
<cfif structKeyExists(headers[x], "name")>
<cfif headers[x].name is "description">
<cfset request.data[attributes.url].description = headers[x].content>
<cfelseif headers[x].name is "keywords">
<cfset request.data[attributes.url].keywords = headers[x].content>
</cfif>
</cfif>
</cfloop>
</cfthread>
</cfloop>
<!--- join the threads --->
<cfthread action="join" name="#structKeyList(cfthread)#" />
<!--- make a query for the data --->
<cfset info = queryNew("url,body,title,keywords,description")>
<cfloop item="c" collection="#request.data#">
<cfset queryAddRow(info)>
<cfset querySetCell(info, "url", c)>
<cfset querySetCell(info, "body", request.data[c].body)>
<cfset querySetCell(info, "title", request.data[c].title)>
<cfset querySetCell(info, "keywords", request.data[c].keywords)>
<cfset querySetCell(info, "description", request.data[c].description)>
</cfloop>
<!--- insert data --->
<cfindex collection="sitemaptest" action="refresh" query="info" title="title" key="url" body="body" urlpath="url" custom1="keywords" custom2="description" status="status">
<cfset totaltime = getTickCount() - thetime>
<cfoutput>
<p>
Done indexing. Did #info.recordCount# rows. Took #totaltime# ms.
</p>
</cfoutput>
<cfdump var="#status#">
Archived Comments
I may have missed the reason for doing this :-) You mention indexing and storing the meta info in a database. Why would you want to do this? Thanks
I only mentioned the meta info in regards to the time stamps. If URL x was updated yesterday, then there is no need to resuck it down.
Thanks, this looks really cool. I was wondering - are there any specific benefits or uses for 'Using Sitemaps with Verity'?
Well the point was that it could be used as an alternative to the spider, which only works on localhost.
CF7 version?
Hi Ray -
I am trying to implement this on a site that sits on a CF7 server. I managed to remove the cfthread references and replaced fileRead with a cffile action='read'... and it seems to work, sorta but not quite.
My sitemap.xml has about 80 nodes.
When running the modified indexing page, I get a success message saying 20 rows have been added - but nothing at all in search results, even searching for words that should be in every document. (side note: is there any way to view or dump the contents of a verity collection?)
In your notes you say 'this should be easy to downgrade to cf7' , but I am wondering what I have missed. I know this is vague but... any ideas?
Hmm. So first let's look at why you got 20 rows, not 80. If you cfdump info before you index, do you see 80 rows? How about your URL column - are the values all unique?
The only simple way to dump an entire collection is to search for *.
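For example:
<cfsearch collection="sitemaptest" criteria="*" name="everything">
<cfdump var="#everything#">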
Hi Ray -
thanks for the quick reply.
yes, the dump shows 80 rows (dumping out the same 'myxml' variable that is being parsed).
I hacked this in two ways for cf7
Took out the fileRead like this, adding cffile instead
<cffile action="read" file="#siteMapPath#" variable="myxml">
<!--- <cfset myxml = fileRead(expandPath("./sitemap.xml"))> --->
And commented out the cfthread references in 3 places.
I just sent you a direct email with links to the pages and a bit of explanation, and I am playing with the cfdump now.
I'd love to get this working for CF7, and hope it might be useful to others too.
tracing this down further,
http://garkaneenergy.com/ve...
For the moment, I have this page dumping out each of the request.data... structs for each page, along with the page name.
Super cool to see all that meaty text content in there!
Below that I am dumping the full xml variable.
it goes through the first 20 in the file, and then chokes.
The fact that I am getting the first sequential 20 makes me curious. I deleted numbers 20 and 21 in case it was bad data... then I deleted the first 20 and still only got 20.
DOH!!!!
<smacking head>
Your example code limits the cfloop to 20 rows.
< ashamed... >
<cfloop index="x" from="1" to="#min(20,arrayLen(myxml.urlset.url))#">
Changing that to the more obvious value
<cfloop index="x" from="1" to="#arrayLen(myxml.urlset.url)#">
did the trick. Duh, double duh.
So... ok ... we are past the 'why only 20' hurdle.
The page shows I am getting 58 rows from the xml file, which sounds perfect, and I can see, in my lovely cfdumps, all the meaty text content and perfectly organized meta values... awesome!
Listing my verity collection info
I see I have a DocCount of 58.
But... still no searchy.
http://garkaneenergy.com/se...
Running a search here for "garkane" should bring up 58 pages at the current time.
Now I am back to wishing I could 'dump' the contents of a verity collection. I want to see what those 58 DocCount records actually contain!
I can only think that I screwed it up somewhere by simply hacking out the cfthread tags. More investigation!
AHA!!! SUCCESS!!
I had my cfsearch set up to find content only '<in> body'... but there's no more body tag to search through.
Ray this is AWESOME, and it has been the catalyst for quite the spontaneous learning experience.
Thanks for being a sounding board... I think it works!!
Glad you got it - I was in meetings so couldn't respond.
This is so cool.
Now that I have struggled with it, I *get* it and , Ray, this rocks!
Taking the content filtering one step further, I added some regular expression code to strip out everything from the retrieved 'body' code that is not inside of my "mainCol" div. That way the first thing shown in the site summary from the cfindex is the heading of the main part of the page, then the page text - no messy menu or preceding-column junk to contend with.
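Roughly, the idea looks like this (a simplified sketch cutting on marker comments rather than my actual div regex - the pageBody variable and marker names are made up):
<!--- keep only the chunk between marker comments placed in the page template --->
<cfset startMarker = "<!-- searchStart -->">
<cfset endMarker = "<!-- searchEnd -->">
<cfset startPos = find(startMarker, pageBody)>
<cfset endPos = find(endMarker, pageBody)>
<cfif startPos and endPos gt startPos>
  <cfset pageBody = mid(pageBody, startPos + len(startMarker), endPos - startPos - len(startMarker))>
</cfif>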
This full circle code-trip means that I can now
- generate an xml site map for any visible site
- edit the sitemap.xml as an easy no-frills way to limit or extend which pages verity spiders
- use this code to create a cfcollection in *minutes*, filtering the retained content according to any inserted tag or comment in my pages' code (i.e. only main column, etc)
- run a cfsearch against the collection, resulting in a super-fast lightweight homegrown in-site magic ColdFusion search!
thanks again... I am really psyched about having this entire code set in my collection.
To be honest, I wasn't sure anyone would use it. Glad you liked it!
Not only do I like it ... I can see using this a LOT. Even with access to Verity Spider in some cases, this seems like a very customizable way to get some very clean results.
I think I will put together a demo and blog post on the full circle trip using the sitemap creator and this code, plus how I restricted the search to specific parts of the page markup... If I do that, can I include a modified copy of your code as a downloadable file (with credit given, of course)?
I didn't have a clue about any of this until yesterday... now I feel like I've been handed a shiny new toolbox that lots of folks are constantly looking for... this could be really useful to a lot of people once they see it in action!
You can include my code for
ONE MILLION DOLLARS! (finger by mouth and evil laugh)
For you... in advance!
http://mredesign.com/cfdev/...
Oh that's just mean. ;)
Ok, here's to make up for that last link
http://mredesign.com/demos/...
This is pretty neat - I'm using the sitemap generator to make the xml, then feeding it to your verity writer, all with a neat little skin.
Coolest part - download the zip, and drop the files into any site, then browse to the index page - walk through the steps and presto chango, instant site search!
Ray,
first off I visit your blog almost daily and usually find exactly what I am looking for. Thanks for this great resource.
My question about this script is:
While crawling my local website with this script, it works perfectly except that I often get timeout errors. I have adjusted the timeout to 1200 but it still occurs, even on smaller crawls.
Is there any way to debug this to find out which cfhttp calls are getting hung up, and perhaps just skip them?
Well, there are a few things to consider. First off - why not bump up the timeout even higher? Secondly - you can log each http request and how long it took.
<cfset t = getTickCount()>
<cfhttp .....>
<cfset duration = getTickCount() - t>
<cflog file="test" text="To get url #x#, it took #duration# ms">
This may flag the culprit.
Worst comes to worst - feed the code portions of the XML at a time.
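You could also give each cfhttp call its own timeout and just skip anything that doesn't come back clean. A rough sketch (the 30 second timeout is an arbitrary pick):
<!--- per-request timeout so one slow page can't hang the whole run --->
<cfhttp url="#attributes.url#" result="result" timeout="30" throwonerror="no">
<cfif result.statusCode contains "200">
  <!--- process the page as before --->
  <cfset request.data[attributes.url] = structNew()>
<cfelse>
  <!--- log the failure and move on --->
  <cflog file="sitemapindexer" text="Skipped #attributes.url# (#result.statusCode#)">
</cfif>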
Ray,
You mentioned "I could have considered taking the data I sucked down, saving it to an HTML file, and then running a file based index. While this would be slower, it could have resulted in better indexing."
What would be the easiest way to modify this code to work that way instead? I currently have all my dynamic pages saved off as HTML files using cfhttp, and have Verity indexing them for my search. I would like to switch to this instead.
Well, where I do the cfhttp and get the content, I'd just save it into a new folder. I'd then tell Verity to index that folder.
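Something like this, as a sketch (the snapshot folder is a made-up name, and hashing the URL for the filename means you'd lose the clean URL mapping - a smarter naming scheme could preserve it):
<!--- inside the thread - snapshot the raw HTML to disk --->
<cfset fileWrite(expandPath("./snapshot/" & hash(attributes.url) & ".html"), result.filecontent)>
<!--- after the threads join - run a file-based index over the folder --->
<cfindex collection="sitemaptest" action="refresh" type="path" key="#expandPath('./snapshot')#" extensions=".html" urlpath="http://www.foo.com/" status="status">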
Ray,
How would you recommend telling the function to exclude certain links? Would you just have to maintain a no follow list? Or is it kinda pointless since Google is going to follow the link no matter what?
I guess it depends on what you want to filter. Not sure what you want to do.