Twitter: raymondcamden


Address: Lafayette, LA, USA

ColdFusion Lucene Test

09-30-2007 6,845 views ColdFusion 8 Comments

This morning I woke up rather early (I rarely sleep well away from home), and decided to play a bit more with Lucene. I've learned a few things about the project that I'm going to list out here. If I get them wrong, please correct me. I've only spent a few hours on this so I'm far from being an expert.

First off - Lucene does not, by itself, have any support for binary data. That means you can't index PDF, Word Documents, MP3 files, etc. People have written libraries for it - but the important thing to note is that "out of the box", you just have string data and that's it.

Secondly - Lucene lets you build an index with any set of fields, which is pretty cool. CFVerity (I'll use CFVerity to refer to the built-in Verity support in ColdFusion) has a fixed set of fields that you are forced to use. Now it is a pretty big set, but with Lucene you can create whatever fields you want.

That is good - but a bit of a problem as well. One nice thing about the CFVerity stuff is that you can have multiple, disparate sources that all must feed into a certain set of fields. You then get the same fields out. When searching an index in Lucene, on the other hand, you have to dynamically introspect the results to see what fields exist. That isn't hard to do at all - but I guess I'm saying it is both good and bad that Lucene is so open (more good than bad).

So I wrote up a simple file-based demo. This is based heavily on Lucene's own demo code, and the CFLucene project. My demo code only lets you index files. But it does try to mimic Verity a bit. So for example, you can define an index with just a one-word name, and it uses a folder under CFROOT/lucenecollections. Ditto for searching. Here is the sample code I've used:

<cf_index directory="/Library/WebServer/Documents" indexdirectory="test2"
filter="*.cfm,*.html,*.txt" recurse="true">

So the directory attribute is my source. Indexdirectory is where the index is stored. Filter is an optional list of file filters (duh), and recurse tells the tag to search recursively through the folder named in the directory attribute. Searching is rather simple:

<cf_search index="test2" term="#form.search#">
<cfdump var="#result#">
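Since a Lucene index can carry whatever fields it was built with, the results have to be inspected at run time rather than read from a known column list. Here is a minimal sketch of that idea. It assumes the result variable set by cf_search is a ColdFusion query object, which this post does not confirm; if it is an array or struct, the loop would need to change accordingly.

```cfm
<!--- Assumption: "result" is a query object. The post does not say what cf_search returns. --->
<cf_search index="test2" term="#form.search#">

<cfoutput>
<p>Fields found in this index: #result.columnList#</p>

<!--- One list per hit, one item per field, with no field names hard-coded. --->
<cfloop query="result">
    <ul>
    <cfloop list="#result.columnList#" index="field">
        <li>#field#: #htmlEditFormat(result[field][result.currentRow])#</li>
    </cfloop>
    </ul>
</cfloop>
</cfoutput>
```

The point is only that the display code never names a field itself, so the same page would work against indexes built with different field sets.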

I've included my demo files and custom tags in a zip attached to this blog entry. I will most likely not continue this code as I'm thinking of this as a first draft only.

What I'm thinking of for the final version is something along these lines:

a) We need a generic wrapper to index data. This wrapper takes in data (again, string only) and handles telling Lucene where to store the index.

b) Above this we need 'helpers' I think. The file I've included in my zip that indexes files could be considered a helper. I.e., it is simply a utility to easily send crap to the utility defined in A. I can imagine 2 basic helpers - a File helper and a Query helper. Again - I'm thinking of Verity as my base here. The File helper could be an interesting project. As I mentioned, Lucene supports strings only, so I could build a File helper that itself takes plugins. This way folks could add PDF support (maybe using my pdfutils), Word support, etc.

c) Search - This one is rather easy actually. Since Lucene lets you get the fields back dynamically, it wouldn't be hard to modify the search.cfm I have already.

d) Lastly - I want to build a CF Admin page that will let you do what you can do in Verity admin. You can index folders/files and see your indexes (again, assuming we use a default root folder). You can optimize - etc. And if I read the docs right - you can loop through the data, delete, etc, so a browser could be built to let you see what is in the index. (To be fair, CFVerity supports this as well, you just need to search for %.)
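To make the plugin idea in (b) a bit more concrete, here is a rough sketch of how a File helper might pick a text extractor by file extension. None of this is in the attached zip: the names readPlainText, extractors, and extractText are hypothetical, and only plain-text types are handled. A PDF or Word plugin would be one more function registered under its own extension.

```cfm
<cfscript>
// Hypothetical plugin: plain-text files need no conversion, so just read them.
function readPlainText(path) {
    return fileRead(path);
}

// Registry of plugins keyed by file extension.
extractors = structNew();
extractors["txt"] = readPlainText;
extractors["cfm"] = readPlainText;
extractors["html"] = readPlainText;

// Returns the text of a file as a string, or an empty string
// when no plugin is registered for its extension.
function extractText(path) {
    var ext = lCase(listLast(path, "."));
    var extractor = "";
    if (not structKeyExists(extractors, ext)) {
        return "";
    }
    extractor = extractors[ext];
    return extractor(path);
}
</cfscript>
```

Whatever string extractText() returns is what would get handed to the generic wrapper from (a), so the wrapper itself never needs to know about file types.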

Download attached file

8 Comments

  • Commented on 10-01-2007 at 9:52 AM
    If you're going to the trouble of using Lucene, you probably want to look at Solr -- which is a RESTful wrapper around Lucene that makes it dead simple to use. It speaks HTML, JSON, etc and removes a lot of the hassle of managing Lucene.
  • Joseph Lamoree #
    Commented on 10-23-2007 at 5:24 AM
    Hi Ray.

    I wrote the code on CFLucene a long time ago. There are lots of areas that need improvement, but I haven't been active on it. I would be interested to see what you do with Lucene and ColdFusion.

    -joseph
  • Commented on 10-23-2007 at 9:17 AM
    I've got some sample code ready - just not posted yet.
  • Commented on 10-25-2007 at 1:16 PM
    Guys, please see:

    http://www.coldfusionjedi.com/index.cfm/2007/10/24...
  • barry.b #
    Commented on 11-29-2007 at 9:42 PM
    looks like the CFLucene project site has an expired domain.

    I was just after seeing how people have wrapped CF around Lucene.
  • Commented on 11-30-2007 at 8:25 AM
    Barry, have you looked at my Seeker project?
  • Jeff #
    Commented on 12-01-2007 at 3:11 PM
    Ray, is there a way to get the indexed data for a single file back? So let's say I run cf_index on test.txt. Is there a way to grab what Lucene indexed for test.txt back as a variable so CF can play with it?
  • Commented on 12-03-2007 at 10:17 AM
    I do not think so. If the Lucene API supports this, it is something I could add to my code. Note that I only learned a tiny bit about Lucene.
