Christian Ready pinged me a few days ago about an interesting problem he was having at one of his web sites. His search (Verity-based on CFMX7) was returning HTML. The HTML was escaped so the user literally saw stuff like this in the results:
Hi, my name is <b>Bob</b> and I'm a rabid developer!
I pointed out that the regex used to remove HTML would also work for escaped HTML:
<cfset cleaned = rereplace(str, "<.*?>", "", "all")>
In English, this regex matches a less-than sign (<), then any characters (non-greedy, more on that in a bit), and then a greater-than sign (>). When the HTML is escaped, the same pattern applies with the entities &lt; and &gt; in place of the two signs. "Non-greedy" means the regex takes the shortest match it can. Without it, a single match would run from the first tag's < to the last tag's >, removing the HTML tags and everything between them! We just want to remove the tags themselves.
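To make the greedy versus non-greedy difference concrete, here is a minimal sketch that runs the sample string from above through both forms of the pattern. The variable names are only for illustration.

```cfml
<cfset str = "Hi, my name is <b>Bob</b> and I'm a rabid developer!">

<!--- Non-greedy: each match stops at the first > it reaches, so only the tags go. --->
<cfset lazy = rereplace(str, "<.*?>", "", "all")>

<!--- Greedy: one match runs from the first < to the last >, taking "Bob" with it. --->
<cfset greedy = rereplace(str, "<.*>", "", "all")>

<cfoutput>
#lazy#<br>
#greedy#
</cfoutput>
```

The first line of output keeps "Bob"; the second loses it along with the tags.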
This worked - but then exposed another problem. Verity was returning text with incomplete HTML tags. As an example, consider this text block:
ul>This is some <b>bold</b> html with <i>markup</i> in it.
Here is <b
Notice the incomplete HTML tag at the beginning and end of the string. Luckily regex provides us with a simple way to look for patterns at either the beginning or end of a string. Consider these two lines:
<code>
<cfset cleaned = rereplace(cleaned, "<.*?$", "", "all")>
<cfset cleaned = rereplace(cleaned, "^.*?>", "", "all")>
</code>
The first line matches a < and everything after it up to the end of the string. The second line matches everything from the beginning of the string up to the first >. Between them they catch the leftover bits of a tag that was cut off at either end of the text.
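As a rough illustration, here is what the two anchored patterns do to the sample text once its complete tags have already been stripped. The variable names are hypothetical, and chr(10) stands in for the line break in the sample.

```cfml
<!--- The sample text after the complete tags have been removed. --->
<cfset cleaned = "ul>This is some bold html with markup in it." & chr(10) & "Here is <b">

<!--- Trailing fragment: a < and whatever follows it up to the end of the string. --->
<cfset noTail = rereplace(cleaned, "<.*?$", "", "all")>

<!--- Leading fragment: everything from the start of the string up to the first >. --->
<cfset noHead = rereplace(noTail, "^.*?>", "", "all")>

<!--- Escape the result so it displays as text rather than being rendered. --->
<cfoutput><pre>#htmlEditFormat(noHead)#</pre></cfoutput>
```

Note that the ^.*?> pattern relies on the complete tags being gone already. Run against text that still contains tags, it would remove everything from the start of the string through the first tag's closing >. It would also clip any text that legitimately contains a > character.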
So all together this is the code I gave him:
<code>
<cfset cleaned = rereplace(str, "<.*?>", "", "all")>
<cfset cleaned = rereplace(cleaned, "<.*?$", "", "all")>
<cfset cleaned = rereplace(cleaned, "^.*?>", "", "all")>
</code>
Most likely this could be done in one regex instead.
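One possible single-pattern version, offered as a sketch rather than as the code that was actually handed over: an alternation of three patterns built on the character class [^<>] instead of .*?. The swap is deliberate. In a single pass the complete tags are still present when the leading-fragment alternative is tried, so ^.*?> would swallow ordinary text up to the first tag. Restricting the fragment to characters that are neither < nor > avoids that.

```cfml
<cfset sample = "ul>This is some <b>bold</b> html with <i>markup</i> in it." & chr(10) & "Here is <b">

<!---
  Three alternatives:
    ^[^<>]*>   leading fragment: no < or > in it, then a >, at the very start
    <[^<>]*>   a complete tag
    <[^<>]*$   trailing fragment: a < with no > before the end of the string
--->
<cfset cleaned = rereplace(sample, "^[^<>]*>|<[^<>]*>|<[^<>]*$", "", "all")>

<!--- Escape the result so it displays as text rather than being rendered. --->
<cfoutput><pre>#htmlEditFormat(cleaned)#</pre></cfoutput>
```

On this sample the pattern is meant to give the same result as the three separate passes. Like them, it treats a stray > near the start of ordinary text as a tag fragment.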
Archived Comments
C'mon, Ray, make your examples a little more realistic. We all know that no one with *any* self-respect would refer to themselves as a "rabid" developer. ;)
Just curious - was he using the spider? I used to have a lot of issues with odd code being displayed in my results until I started using vspider. That worked MUCH better to index the HTML content of my site - and it was fairly easy to combine the results of database searches.
Thanks again for your help on this, Ray. It's always appreciated.
@Jim: No, I wasn't using the spider. All content for the site comes out of a database so I just built collections based on db queries.
That said, I've never used the spider. Does that come with the OEM version of Verity in CF?
Yes - vspider is included (though poorly documented)... and may not be installed by default.
For some people it works easily - others have issues. Peter Bell has a good blog post if you are interested:
http://www.pbell.com/index....
And I have a few things on my wiki (which I need to update)
http://www.thecrumb.com/wik...
It works exactly like the Google or Yahoo spiders - it hits your homepage (or wherever you direct it) and 'spiders' each page by hitting the links in each page - so it only sees what your visitors see.
I've had this issue in the past, one way around it (for me) was to strip HTML before it was indexed. That way you don't need to do it on display, you use the regex while running the cfindex step building the collection.
the regex, would this do it?
text = ReReplace(text, "<[^>]*>", "", "all");
Ok, we can strip HTML, but how could you make sure that text included within a <cfquery> doesn't show up within the results of <cfsearch>?
For example in my tests, when I search on "job" I could receive the following back as a result:
"Select position, job, from table where id = 5"
Any ideas?
Are you using Verity to index the CFM files? If so then it is expected. You shouldn't be doing that. You should either use the spider - or index just the data, not the CFM files.
Yes - use vspider.
Speaking of which - I didn't hear about any changes in CF8 to search? Nothing new or nothing announced yet? It would be really nice if they made vspider easier to work with...
I believe there was a general Verity engine update, but no new functionality.
Ok, thanks...thought that's what you would say. Was hoping not to have to 'vspider' (what with all the horror stories). :-)
Well wait Jim. What are you trying to search here? The entire web page of your site? Or just the data?
I am searching on both. We need to return both content that is found within the individual .cfm and .html pages and also content from our .pdf and .doc files.
Well, if the content on the CFM pages is from the DB, you should just index the DB. You can then index the pdf/doc files. You can also index your HTML files as just that - files.
So it _sounds_ like you don't need the spider at all.
To clarify...the content on our .cfm pages contains both dynamic AND static content that we would like to account for. So not all the content is from a DB.
Also, we have multiple sites within IIS on a single server.
Since VSPIDER can only search on localhost, I am having difficulty figuring out how this will work with our current setup. When I create the collection, it embeds the "http://localhost/..." URL within the collection, so how does one get it to reflect the true site URL of say, "http://mysite.com"? Sorry for the ignorance here, I must be doing something wrong...
When displaying the results, you could just use replace().
Unfortunately with vspider you really have to RTFM.
In the past I set up virtual domains off localhost:
http://localhost/siteone
http://localhost/sitetwo
Depending on how you have things pathed - sometimes images and CSS wouldn't work - but the spider doesn't care. Just as long as the links work you are set.
I think you can also spider file system paths and use -prefixmap to replace it with a URL.
Check out Peter Bell's site (http://www.pbell.com); he has a few good vspider posts.
vspider is great IF you can get it working and that unfortunately seems very hit or miss.
Maybe now that ColdFusion has an image tag (after an eternity!) they can fix up search in ColdFusion 9! :)
Well, got it working. :-)
Jim, like you said, when it works, it works great!
I ended up mixing and matching vspider collections, with regular collections to meet my needs.
Ray, I took your advice and used the replace()...works great.
Thank you all for your help; see you at CFUNITED?
I'll be there.
How would one go about removing HTML tags from a query before it gets passed to CFIndex? I don't want the tags to be indexed in the collection, nor do I want them to display in the results.
Loop over the query, use querySetCell to modify the contents, and strip the HTML with:
s = rereplace(s, "<.*?>", "", "all")
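A minimal sketch of that loop, assuming a query with id, title, and body columns. The query, its column names, and the collection name are all hypothetical placeholders, and the query is built by hand here only so the example stands alone.

```cfml
<!--- Hypothetical query standing in for the real database result. --->
<cfset articles = queryNew("id,title,body")>
<cfset queryAddRow(articles)>
<cfset querySetCell(articles, "id", 1)>
<cfset querySetCell(articles, "title", "About Bob")>
<cfset querySetCell(articles, "body", "Hi, my name is <b>Bob</b>!")>

<!--- Strip the tags row by row before the query is handed to cfindex. --->
<cfloop query="articles">
    <cfset stripped = rereplace(articles.body, "<.*?>", "", "all")>
    <cfset querySetCell(articles, "body", stripped, articles.currentRow)>
</cfloop>

<!--- "mycollection" is a placeholder; the collection must already exist. --->
<cfindex collection="mycollection" action="update" type="custom"
         query="articles" key="id" title="title" body="body">
```

Because the tags are removed before indexing, nothing has to be cleaned up when the search results are displayed.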