I just pushed up an update to Seeker, my ColdFusion Lucene project. I added support for MS Word documents and MS Excel files. This was incredibly easy using JavaLoader from Mark Mandel and the POI project.
Todd Sharp gets credit for pushing both these ideas to me. He also made a good suggestion for how to use JavaLoader within Seeker.
Seeker makes use of various "reader" CFCs. Each CFC is responsible for one or more file types. A CFC 'registers' itself using metadata. So here is what plaintext.cfc looks like:
<cfcomponent output="false" hint="Plain text reader." extensions="xml,txt,html,htm,cfm,cfc" extends="reader">

<cffunction name="read" access="public" returnType="string" output="false">
	<cfargument name="file" type="string" required="true">
	<cfset var result = "">

	<cffile action="read" file="#arguments.file#" variable="result">
	<cfreturn result>
</cffunction>

</cfcomponent>
Note the extensions attribute. It tells Seeker to use this reader for all of the plain text file types. Todd's suggestion was to use a similar approach for the Java classes. I'm not terribly happy with the names, but this is what I did.
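To make the extensions idea a bit more concrete, here is a rough sketch of how a registry could be built from that metadata. This is not the actual Seeker code. The function name buildReaderMap and the readerPaths argument are made up for illustration. The one thing it relies on is that custom attributes on cfcomponent show up as keys in the struct returned by getComponentMetaData().

```cfml
<!--- Hypothetical sketch, not Seeker's actual code: map each file extension to a reader CFC. --->
<cffunction name="buildReaderMap" access="public" returnType="struct" output="false">
	<!--- Assumed input: dot-notation paths to reader CFCs, e.g. "readers.plaintext". --->
	<cfargument name="readerPaths" type="array" required="true">
	<cfset var map = structNew()>
	<cfset var md = "">
	<cfset var ext = "">
	<cfset var i = 0>

	<cfloop index="i" from="1" to="#arrayLen(arguments.readerPaths)#">
		<!--- Custom cfcomponent attributes (like extensions) appear as keys in the metadata. --->
		<cfset md = getComponentMetaData(arguments.readerPaths[i])>
		<cfif structKeyExists(md, "extensions")>
			<cfloop list="#md.extensions#" index="ext">
				<!--- Normalize so "TXT" and " txt" both match the same reader. --->
				<cfset map[lCase(trim(ext))] = arguments.readerPaths[i]>
			</cfloop>
		</cfif>
	</cfloop>

	<cfreturn map>
</cffunction>
```

With something like this in place, picking a reader for a file is just a struct lookup on the file's extension.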
When you add requires= to your reader CFC, you specify a list of Java classes. Like so:
<cfcomponent output="false" hint="MS Office format reader." extensions="doc,xls" requires="org.apache.poi.hwpf.HWPFDocument, org.apache.poi.hwpf.extractor.WordExtractor, org.apache.poi.hssf.extractor.ExcelExtractor, org.apache.poi.hssf.usermodel.HSSFWorkbook" extends="reader">
(Spaces were added by me.) When Seeker runs, it will notice these requirements and use JavaLoader to load them. There is a JARs folder that is autoloaded, and it is expected that if your CFC needs a jar, you will put it in that folder. Since I'm using JavaLoader, all of these JARs are plug and play. No need to restart ColdFusion. Working with the classes is simple as well:
<cfset var doc = getRequirement("org.apache.poi.hwpf.HWPFDocument")>
This calls a method in the inherited CFC that gets the class that was loaded by JavaLoader and injected by the core Seeker code. I'm not happy with that method name there, but it works.
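To give a feel for how a reader might use those injected classes, here is a minimal sketch of pulling the text out of a Word document. Treat it as an illustration rather than the code that ships with Seeker. The function name readWord is made up, and I'm assuming getRequirement() hands back the class the same way JavaLoader's create() does, so you still call init() on it. The POI calls themselves (the HWPFDocument constructor that takes an input stream, the WordExtractor constructor that takes the document, and getText()) are part of POI.

```cfml
<!--- Hypothetical sketch: extract plain text from a .doc file using the injected POI classes. --->
<cffunction name="readWord" access="private" returnType="string" output="false">
	<cfargument name="file" type="string" required="true">
	<cfset var stream = createObject("java", "java.io.FileInputStream").init(arguments.file)>
	<cfset var doc = "">
	<cfset var extractor = "">

	<cftry>
		<!--- Assumption: getRequirement() returns the uninstantiated class, so init() calls the constructor. --->
		<cfset doc = getRequirement("org.apache.poi.hwpf.HWPFDocument").init(stream)>
		<cfset stream.close()>
		<cfcatch type="any">
			<!--- Do not leave the file handle open if POI cannot parse the document. --->
			<cfset stream.close()>
			<cfrethrow>
		</cfcatch>
	</cftry>

	<cfset extractor = getRequirement("org.apache.poi.hwpf.extractor.WordExtractor").init(doc)>
	<cfreturn extractor.getText()>
</cffunction>
```

An Excel version would follow the same pattern with HSSFWorkbook and ExcelExtractor.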