Simon asked:
I've noticed in my search results that my SQL is showing in the results. My collection indexes all my cfm pages. Is there a tag I can add to a page to prevent Verity indexing the page, or is it just a case of putting those sorts of files in a different directory that doesn't get indexed?
First - you should know that if you are using ColdFusion 9 you really should be using Solr instead. If this is legacy code or an older version of ColdFusion, then by all means, keep using Verity, but keep in mind that ColdFusion has moved to Solr for supporting search. That being said, what you describe would have happened in Solr as well.
Basically, you told Verity to index your CFM files. Verity (and again, Solr) has no idea what CFML is. Or SQL. Or any other code really. So it doesn't execute or strip out your CFML and SQL; it treats them as plain text and makes them part of your indexed data. That's why normally you don't use file based indexes of your CFML pages. Instead, you either index the data itself or you use a spider to index your CFML pages. A spider acts like Google's crawler and hits your CFM pages via HTTP. That way it gets just the rendered output of your pages and not the code itself. Verity under ColdFusion ships with a spider, Solr unfortunately does not. There are solutions for that (like Nutch), but I've not gotten into that yet.
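To make the "index the data" option concrete, here is a minimal sketch using cfindex with type="custom", which indexes the rows of a query rather than files on disk. The datasource (blog), table and column names (entries, id, title, body), collection name (blogentries), and the search term are all made up for illustration, and the collection is assumed to already exist. Treat it as a starting point, not a drop-in solution.

```cfml
<!--- Hypothetical datasource and table: swap in your own --->
<cfquery name="entries" datasource="blog">
    SELECT id, title, body
    FROM entries
</cfquery>

<!--- type="custom" indexes query rows, so no CFML or SQL source ever reaches the collection.
      key, title, and body name columns in the query above. --->
<cfindex collection="blogentries"
         action="refresh"
         type="custom"
         query="entries"
         key="id"
         title="title"
         body="body">

<!--- Search the collection; "coldfusion" is just an example term --->
<cfsearch collection="blogentries" criteria="coldfusion" name="results">

<cfoutput query="results">
    #results.title# (score: #results.score#)<br>
</cfoutput>
```

Because only the columns you name end up in the collection, the search results contain your content and nothing else. The key column comes back in the search results as well, so you can use it to link each hit to the right page.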
Archived Comments
So, it seems that Solr is useless for a public CFML site, unless you use no CFML, or don't mind showing your CFML in search results, or store all of your content in a database.
What??? I could not disagree more. Just because Solr doesn't ship with a spider like Verity did does NOT make it useless. I can't remember the last time I needed to use a spider vs just indexing data. I'm not saying there aren't times when folks would rather use a spider, but I don't think it is close to being the majority.
And - if you want - you can simply download Nutch, the open source spider - and run it yourself.
Doesn't that mean it is still useless for indexing public CFML pages, unless you are using the Nutch crutch to stop search results like "Hello #firstName#"?
A good presentation the other night too (apart from the sound quality), although I feel I have just wasted a day on something I have no current use for.
How many people prefer indexing their web pages versus indexing pure data? Take this blog for example. I'd much rather index the blog entry text from the DB than index the entire HTML page. The HTML has menu crap, advertising, other stuff. The DB _just_ has the blog entry.
Again - I'm not saying it is unheard of to use a spider - Google does of course. But for your own search engine, I think most people would rather index just the particular data they want.
A very good and fair point, although I had hoped to no longer need the free and paid Google search facilities on sites that don't use database content.
Take a look at Nutch. Or wait for me to blog on it. ;) I've been meaning to for a year or so now. Once I get past MAX I'll have more time. In theory, it shouldn't be difficult to make use of Nutch and still use cfsearch to search the content.
Had a quick look, but being predominantly an IIS user, nothing makes much sense. I think at the moment, I will just look forward to you blogging on it, and save myself the possibility of a brain haemorrhage :)