I had to spend a few minutes on hold today waiting for a client, so I decided to spend a bit more time working on the ColdFusion Lucene code. One of the things I found about Lucene was it had no support for binary files. Basically you are responsible for finding your own plugins to get the text out of various file formats.
So I've updated my code to allow for plugins. What do I mean? The previous version of the code simply used a fileRead() call to read in the file contents and pass them to Lucene. Now I call a new CFC, filereader.cfc. This CFC, when created, scans a subdirectory named readers. Each CFC (except the core CFC the other ones extend) represents a 'reader' for a type of file. Consider plaintext.cfc:
<cfcomponent output="false" hint="Plain text reader." extensions="xml,txt,html,cfm,cfc" extends="reader">
	<!--- Return the raw contents of a text-based file so they can be passed to Lucene. --->
	<cffunction name="read" access="public" returnType="string" output="false">
		<cfargument name="file" type="string" required="true">
		<cfset var result = "">
		<cffile action="read" file="#arguments.file#" variable="result">
		<cfreturn result>
	</cffunction>
</cfcomponent>
There are two important things going on here. First off, notice the extensions attribute. Don't forget that the cfcomponent tag allows you to add any ole attribute you like, and my filereader.cfc makes note of this one. Second, if it is reading a file with a given extension, and there is a reader that says it handles that extension, then that reader's read method is called. For the plaintext CFC, I simply read in the file and return the result. Easy. So to "plug in" PDF support, I wrote a pdf.cfc. I stole my code from pdfutils. Now this code only works in ColdFusion 8, but someone else (that's you guys) could write a CF7 PDF reader. Someone else could write an MP3 reader. Etc.
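To make the dispatching idea concrete, here is a minimal sketch of how a CFC like filereader.cfc could scan the readers folder and route a file to the right reader. This is not the project's actual code. The init and read method names, the variables.readers struct, and the assumption that the base CFC is named reader.cfc and lives in the readers folder are all mine, for illustration only.

```cfm
<!--- Hypothetical sketch of a dispatcher; names and layout are assumptions. --->
<cfcomponent output="false" hint="Routes files to reader CFCs by extension.">

	<!--- Maps a file extension to the reader instance that handles it. --->
	<cfset variables.readers = structNew()>

	<cffunction name="init" access="public" returnType="any" output="false">
		<cfset var dir = getDirectoryFromPath(getCurrentTemplatePath()) & "readers">
		<cfset var files = "">
		<cfset var name = "">
		<cfset var cfc = "">
		<cfset var md = "">
		<cfset var ext = "">

		<cfdirectory action="list" directory="#dir#" filter="*.cfc" name="files">

		<cfloop query="files">
			<cfset name = listFirst(files.name, ".")>
			<!--- Skip the base CFC (assumed to be reader.cfc) that the readers extend. --->
			<cfif name neq "reader">
				<cfset cfc = createObject("component", "readers.#name#")>
				<cfset md = getMetaData(cfc)>
				<!--- Custom cfcomponent attributes show up as keys in the metadata. --->
				<cfif structKeyExists(md, "extensions")>
					<cfloop list="#md.extensions#" index="ext">
						<cfset variables.readers[trim(ext)] = cfc>
					</cfloop>
				</cfif>
			</cfif>
		</cfloop>

		<cfreturn this>
	</cffunction>

	<cffunction name="read" access="public" returnType="string" output="false">
		<cfargument name="file" type="string" required="true">
		<cfset var ext = listLast(arguments.file, ".")>
		<cfset var reader = "">

		<cfif not structKeyExists(variables.readers, ext)>
			<cfthrow message="No reader registered for .#ext# files.">
		</cfif>

		<cfset reader = variables.readers[ext]>
		<cfreturn reader.read(arguments.file)>
	</cffunction>

</cfcomponent>
```

With something like this in place, adding support for a new file type would just mean dropping another CFC with an extensions attribute and a read method into the readers folder. The sketch sticks to createObject and getMetaData on an instance, so it does not depend on anything specific to ColdFusion 8.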
Make sense? Cool? This change also removes the CF8 requirement for my code. (I mean outside of the PDF reader.) In theory, it completes the support for file-based indexing (although the code is still a bit ugly). All I would need to add next is support for manual updates so folks can index database information.
Archived Comments
How is that you "steal" your own code and "repurpose" other people's code? ;)
I learned from MS.
About a year ago I wrote a complete Lucene 2 wrapper for our CMS which contains all the functionality of Verity and a lot more. We mainly went to Lucene for the more flexible multilingual support and the ability to write our own analyzers. The code is wrapped up in a jar and is part of a commercial product, so it is not open source. We used the following open source third-party Java libraries to get it to work with a number of binary file formats:
PDFBOX - PDF
Text Mining Extractors (http://www.textmining.org) - MS Office documents
Java - Text, Images
We have documented how we call things here:
http://www.shadocms.com/sha...
Grant, there is a small layout issue with your URL - an unclosed italics tag. Would you be able to share any code with the project?
Hi Ray,
I'll have a chat with the team on Monday, but we could certainly look at something like a free version for non-commercial use and $100 or something close to that for commercial use on a project. That way we can release the full code set, as we have probably ported all the functionality someone would have ever used in Verity. It's all wrapped up in a simple CF API with a .jar file to add to the server. The main benefits of Lucene are:
1. It runs on a Mac (multi-platform support is much better than Verity - not that it would be hard to beat!)
2. Multilingual support is very powerful and easy to extend
3. It's easy to cluster (just writes an index to the file system you can easily open and close, or copy to multiple servers)
4. It's easy to move (just copy the created index folder)
5. You can create your own analyzers. For example, the simplest analyzer is a whitespace one, which simply separates words based on whitespace. You could easily write a CF analyzer that would search based on "cfset", for example, and then search all attributes etc.
6. The query syntax is easy to work with and very powerful
7. It is very fast
We're pretty busy but I'll see if we can get something out in the next couple of weeks that would let others use the functionality we've developed.
Hey Ray - I've been looking at using your code but am having problems setting up Lucene - can this just be added as a Classpath in CF, or do you have to have it run on a java app server by itself?
Thanks in advance for any help...:-)
You should be able to just add it to your classpath.
FYI, another user sent some bug reports pre-cfobjective. I've been swamped, but hopefully I'll have an update to release sometime soon.