Quick example of cleaning up Verity results

Christian Ready pinged me a few days ago about an interesting problem he was having at one of his web sites. His search (Verity-based on CFMX7) was returning HTML. The HTML was escaped so the user literally saw stuff like this in the results:

Hi, my name is <b>Bob</b> and I'm a rabid developer!

I pointed out that the regex used to remove HTML would also work for escaped HTML:

<cfset cleaned = rereplace(str, "&lt;.*?&gt;", "", "all")>

In English, this regex matches the escaped less-than sign (&lt;), then any characters (non-greedy, more on that in a bit), and then the escaped greater-than sign (&gt;). The "non-greedy" part means to match as little as possible. Without it, a single match would run from the first &lt; to the last &gt;, removing the tag and everything inside of it! We just want to remove the tags themselves.
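
To make the greedy versus non-greedy difference concrete, here is a small sketch that runs both patterns over the sample string from above. The variable names (str, nonGreedy, greedy) are only for illustration, and the sketch assumes the result text really contains the entities &lt; and &gt; rather than raw angle brackets.

```cfml
<!--- Sample result text, assumed to contain escaped tags --->
<cfset str = "Hi, my name is &lt;b&gt;Bob&lt;/b&gt; and I'm a rabid developer!">

<!--- Non-greedy: each escaped tag is matched and removed on its own --->
<cfset nonGreedy = rereplace(str, "&lt;.*?&gt;", "", "all")>

<!--- Greedy: a single match runs from the first &lt; to the last &gt; --->
<cfset greedy = rereplace(str, "&lt;.*&gt;", "", "all")>

<cfoutput>
nonGreedy: #nonGreedy#<br>
greedy: #greedy#
</cfoutput>
```

The non-greedy version should leave "Hi, my name is Bob and I'm a rabid developer!", while the greedy version also swallows "Bob" along with the tags.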

This worked - but then exposed another problem. Verity was returning text with incomplete HTML tags. As an example, consider this text block:

ul&gt;This is some &lt;b&gt;bold&lt;/b&gt; html with &lt;i&gt;markup&lt;/i&gt; in it. Here is &lt;b

Notice the incomplete HTML tag at the beginning and end of the string. Luckily regex provides us with a simple way to look for patterns at either the beginning or end of a string. Consider these two lines:

<cfset cleaned = rereplace(cleaned, "&lt;.*?$", "", "all")>
<cfset cleaned = rereplace(cleaned, "^.*?&gt;", "", "all")>

The first line looks for an &lt; at the end of the string, along with whatever partial tag follows it. The next line looks for a &gt; at the beginning of the string, along with whatever partial tag precedes it.

So all together this is the code I gave him:

<cfset cleaned = rereplace(str, "&lt;.*?&gt;", "", "all")>
<cfset cleaned = rereplace(cleaned, "&lt;.*?$", "", "all")>
<cfset cleaned = rereplace(cleaned, "^.*?&gt;", "", "all")>
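
If the cleanup is needed in more than one place, the three steps could be wrapped in a small function. This is only a sketch: stripEscapedTags is a made-up name, not a built-in ColdFusion function, and the sample is the truncated text block from earlier with its tags in escaped form.

```cfml
<cfscript>
// Hypothetical helper (not built in): applies the three steps in order.
function stripEscapedTags(str) {
    // 1. Remove every complete escaped tag.
    var cleaned = rereplace(str, "&lt;.*?&gt;", "", "all");
    // 2. Remove an unfinished tag left hanging at the end of the string.
    cleaned = rereplace(cleaned, "&lt;.*?$", "", "all");
    // 3. Remove the tail end of a tag left over at the start of the string.
    cleaned = rereplace(cleaned, "^.*?&gt;", "", "all");
    return cleaned;
}

sample = "ul&gt;This is some &lt;b&gt;bold&lt;/b&gt; html with &lt;i&gt;markup&lt;/i&gt; in it. Here is &lt;b";
writeOutput(stripEscapedTags(sample));
</cfscript>
```

The order matters here. The complete tags have to go first, because run on the untouched string, the end-anchored pattern would match from the first &lt; all the way to the end and take real text with it.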

Most likely this could be done in one regex instead.
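
One possible way to fold the three steps into a single call is alternation. Treat this as a sketch and not a drop-in replacement: it assumes the leftover fragment at the start of the string contains no other & characters (no &amp; or &quot; inside the cut-off tag), which is a stricter assumption than the three-step version makes.

```cfml
<!--- Alternatives: leading fragment | complete tag | trailing fragment.
      [^&]* keeps the first alternative from running past an opening &lt;. --->
<cfset cleaned = rereplace(str, "^[^&]*&gt;|&lt;.*?&gt;|&lt;.*$", "", "all")>
```

Simply joining the three original patterns with | would not be safe. For a string such as "Hello &lt;b&gt;Bob", the start-anchored ^.*?&gt; would match at position zero and remove "Hello " along with the tag, since the complete tags have not been stripped yet.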

Archived Comments

Comment 1 by Tom Mollerus posted on 4/17/2007 at 8:53 PM

C'mon, Ray, make your examples a little more realistic. We all know that no one with *any* self-respect would refer to themselves as a "rabid" developer. ;)

Comment 2 by Jim Priest posted on 4/17/2007 at 9:14 PM

Just curious - was he using the spider? I used to have a lot of issues with odd code being displayed in my results until I started using vspider. That worked MUCH better to index the HTML content of my site - and it was fairly easy to combine the results of database searches.

Comment 3 by Christian Ready posted on 4/17/2007 at 9:22 PM

Thanks again for your help on this, Ray. It's always appreciated.

Comment 4 by Christian Ready posted on 4/17/2007 at 9:24 PM

@Jim: No, I wasn't using the spider. All content for the site comes out of a database so I just built collections based on db queries.

That said, I've never used the spider. Does that come with the OEM version of Verity in CF?

Comment 5 by Jim Priest posted on 4/17/2007 at 9:38 PM

Yes - vspider is included (though poorly documented)... and may not be installed by default.

For some people it works easily - others have issues. Peter Bell has a good blog post if you are interested:

http://www.pbell.com/index....

And I have a few things on my wiki (which I need to update)
http://www.thecrumb.com/wik...

It works exactly like the Google or Yahoo spiders - it hits your homepage (or wherever you direct it) and 'spiders' each page by hitting the links in each page - so it only sees what your visitors see.

Comment 6 by Hapex posted on 4/18/2007 at 6:17 PM

I've had this issue in the past. One way around it (for me) was to strip the HTML before it was indexed. That way you don't need to do it on display; you use the regex while running the cfindex step that builds the collection.

Comment 7 by Dylan posted on 5/8/2007 at 11:59 PM

the regex, would this do it?

text = ReReplace(text, "<[^>]*>", "", "all");

Comment 8 by Jim posted on 6/7/2007 at 9:38 PM

Ok, we can strip HTML, but how could you make sure that text included within a <cfquery> doesn't show up within the results of <cfsearch>?

For example in my tests, when I search on "job" I could receive the following back as a result:

"Select position, job, from table where id = 5"

Any ideas?

Comment 9 by Raymond Camden posted on 6/7/2007 at 10:02 PM

Are you using Verity to index the CFM files? If so then it is expected. You shouldn't be doing that. You should either use the spider - or index just the data, not the CFM files.

Comment 10 by Jim Priest posted on 6/7/2007 at 10:23 PM

Yes - use vspider.

Speaking of which - I didn't hear about any changes in CF8 to search? Nothing new or nothing announced yet? It would be really nice if they made vspider easier to work with...

Comment 11 by Raymond Camden posted on 6/7/2007 at 10:28 PM

I believe there was a general Verity engine update, but no new functionality.

Comment 12 by Jim posted on 6/8/2007 at 12:19 AM

Ok, thanks...thought that's what you would say. Was hoping not to have to 'vspider' (what with all the horror stories). :-)

Comment 13 by Raymond Camden posted on 6/8/2007 at 12:38 AM

Well wait Jim. What are you trying to search here? The entire web page of your site? Or just the data?

Comment 14 by Jim posted on 6/8/2007 at 1:23 AM

I am searching on both. We need to return both content that is found within the individual .cfm and .html pages and also content from our .pdf and .doc files.

Comment 15 by Raymond Camden posted on 6/8/2007 at 1:35 AM

Well, if the content on the CFM pages is from the DB, you should just index the DB. You can then index the pdf/doc files. You can also index your HTML files as just that - files.

So it _sounds_ like you don't need the spider at all.

Comment 16 by Jim posted on 6/8/2007 at 7:59 PM

To clarify...the content on our .cfm pages contains both dynamic AND static content that we would like to account for. So not all the content is from a DB.

Also, we have multiple sites within IIS on a single server.
Since VSPIDER can only search on localhost, I am having difficulty figuring out how this will work with our current setup. When I create the collection, it embeds the "http://localhost/..." URL within the collection, so how does one get it to reflect the true site URL of, say, "http://mysite.com"? Sorry for the ignorance here, I must be doing something wrong...

Comment 17 by Raymond Camden posted on 6/8/2007 at 9:02 PM

When displaying the results, you could just use replace().

Comment 18 by Jim Priest posted on 6/8/2007 at 9:17 PM

Unfortunately with vspider you really have to RTFM.

In the past I set up virtual domains off localhost:

http://localhost/siteone
http://localhost/sitetwo

Depending on how you have things pathed - sometimes images and CSS wouldn't work - but the spider doesn't care. Just as long as the links work you are set.

I think you can also spider file system paths and use -prefixmap to replace it with a URL.

Check out Peter Bell's site (http://www.pbell.com); he has a few good vspider posts.

vspider is great IF you can get it working and that unfortunately seems very hit or miss.

Maybe now that ColdFusion has an image tag (after an eternity!) they can fix up search in ColdFusion 9! :)

Comment 19 by Jim Stout posted on 6/22/2007 at 12:15 AM

Well, got it working. :-)

Jim, like you said, when it works, it works great!

I ended up mixing and matching vspider collections, with regular collections to meet my needs.

Ray, I took your advice and used the replace()...works great.

Thank you all for your help; see you at CFUNITED?

Comment 20 by Raymond Camden posted on 6/22/2007 at 12:39 AM

I'll be there.

Comment 21 by Connie DeCinko posted on 7/24/2009 at 2:06 AM

How would one go about removing HTML tags from a query before it gets passed to CFIndex? I don't want the tags to be indexed in the collection, nor do I want them to display in the results.

Comment 22 by Raymond Camden posted on 7/24/2009 at 2:29 AM

Loop over the query, use querySetCell to modify the contents, and strip the HTML with:

s = rereplace(s, "<.*?>", "", "all")