I just pushed up an update to Seeker, my ColdFusion Lucene project. I added support for MS Word documents and MS Excel files. This was incredibly easy using JavaLoader from Mark Mandel and the POI project.
Todd Sharp gets credit for pushing both these ideas to me. He also made a good suggestion for how to use JavaLoader within Seeker.
Seeker makes use of various "reader" CFCs. Each CFC is responsible for one or more file types. A CFC 'registers' itself using metadata. So here is what plaintext.cfc looks like:
<cfcomponent output="false" hint="Plain text reader." extensions="xml,txt,html,htm,cfm,cfc" extends="reader">
<cffunction name="read" access="public" returnType="string" output="false">
<cfargument name="file" type="string" required="true">
<cfset var result = "">
<cffile action="read" file="#arguments.file#" variable="result">
<cfreturn result>
</cffunction>
</cfcomponent>
Note the extensions attribute. It tells Seeker that this reader handles every plain text file type in the list. Todd suggested using a similar approach for the Java classes. I'm not terribly happy with the names, but this is what I did.
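As a rough illustration of why this registration trick works: custom attributes on the cfcomponent tag show up in the component's metadata, so they can be read back with getMetaData(). The helper below is a minimal sketch, not Seeker's actual code. The function name getReaderExtensions and the component path in the usage line are hypothetical.

```cfml
<!--- Hypothetical helper, not part of Seeker: returns the extensions list a reader CFC declares. --->
<cffunction name="getReaderExtensions" access="public" returnType="string" output="false">
	<cfargument name="componentPath" type="string" required="true">
	<!--- Custom attributes on the cfcomponent tag appear as keys in the metadata struct. --->
	<cfset var md = getMetaData(createObject("component", arguments.componentPath))>
	<cfif structKeyExists(md, "extensions")>
		<cfreturn md.extensions>
	</cfif>
	<!--- A reader with no extensions attribute registers for nothing. --->
	<cfreturn "">
</cffunction>

<!--- Usage: "readers.plaintext" is an assumed path, adjust to wherever the CFC lives. --->
<cfset exts = getReaderExtensions("readers.plaintext")>
```

The same lookup would work for any other attribute name, which is what makes the approach easy to extend.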
When you add requires= to your reader CFC, you specify a list of Java classes. Like so:
<cfcomponent output="false" hint="MS Office format reader." extensions="doc,xls" requires="org.apache.poi.hwpf.HWPFDocument, org.apache.poi.hwpf.extractor.WordExtractor, org.apache.poi.hssf.extractor.ExcelExtractor, org.apache.poi.hssf.usermodel.HSSFWorkbook" extends="reader">
(Spaces were added by me.) When Seeker runs, it will notice these requirements and use JavaLoader to load them. There is a JARs folder that is autoloaded, and it is expected that if your CFC needs a JAR, you will put it in that folder. Since I'm using JavaLoader, all of these JARs are plug and play. No need to restart ColdFusion. Working with the classes is simple as well:
<cfset var doc = getRequirement("org.apache.poi.hwpf.HWPFDocument")>
This calls a method in the inherited CFC that gets the class that was loaded by JavaLoader and injected by the core Seeker code. I'm not happy with that method name there, but it works.
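To give a feel for how the pieces fit together, here is a sketch of what a read method for Word documents could look like. It assumes getRequirement() returns the uninstantiated class (the way JavaLoader's create() does), so init() still has to be called on it. The method body is illustrative rather than the code that ships with Seeker.

```cfml
<!--- Sketch only: assumes getRequirement() hands back a class that still needs init(). --->
<cffunction name="read" access="public" returnType="string" output="false">
	<cfargument name="file" type="string" required="true">
	<cfset var result = "">
	<cfset var doc = "">
	<cfset var extractor = "">
	<cfset var fis = createObject("java", "java.io.FileInputStream").init(arguments.file)>

	<cftry>
		<!--- HWPFDocument parses the .doc file from the input stream. --->
		<cfset doc = getRequirement("org.apache.poi.hwpf.HWPFDocument").init(fis)>
		<!--- WordExtractor pulls the plain text out of the parsed document. --->
		<cfset extractor = getRequirement("org.apache.poi.hwpf.extractor.WordExtractor").init(doc)>
		<cfset result = extractor.getText()>
		<cfcatch type="any">
			<!--- Close the stream before passing the error along. --->
			<cfset fis.close()>
			<cfrethrow>
		</cfcatch>
	</cftry>

	<cfset fis.close()>
	<cfreturn result>
</cffunction>
```

The cfcatch/cfrethrow pair stands in for a finally block so the file handle is released even when POI cannot parse the document.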
Archived Comments
taking that one step further - someone needs to take this and create a word/excel/powerpoint to pdf converter.
Is there currently a way for CF8 to process a .docx or .xlsx file uploaded and show it back - or even store it for later editing?
Hi Ray - Lucene VS Verity - thoughts?
@Peter - Not sure. I think the POI folks may add support for those new formats, but I've heard they are pretty different.
@Nick - Rough thoughts:
So obviously if you aren't on a Mac, then Verity is built in. No need to 'install' Seeker. I also like the category and suggestions support for Verity. You could probably duplicate category support, but it would be more difficult to do suggestions. (As far as I know, anyway. I'm still learning Lucene.)
The big plus for Lucene is index size. It has no license limit on index size, unlike Verity (250k). As I blogged about earlier, I tested w/ an index of 25 million records.
@ray - thanks. Well as i've moved over to the 'mac side' i'll give it a bash. Thanks for your thoughts. I presume you only use this locally when wanting to have a search service?
Correct. I'm still using Verity for production.
Hi Ray,
does seeker work with Coldfusion 7?
I haven't tested - but it should. Try it. ;)
Not sure the instructions are there to add cfadmin pages. You need to add a page for the menu item to show up, correct?
There is a file in the root of the CF Admin named custommenu.xml. If you open that up, you will see instructions on how to modify the CF Admin to show new links.
I just created a file extensionscustom.cfm, and added:
<a href="seeker/index.cfm" target="content">Seeker</a><br> to it. Is the XML the new approach?
Yes. Either works, but the XML way is the 'new' way.
I see, there needs to be multiple links, not just to index.cfm. So searchtool.cfm and index.cfm are the two links?
Or is index.cfm the only file needed?
index.cfm is the only one you need.
Blogged install instructions. Will be eval'ing Seeker.
Hey Ray,
Any chance you could add snippets on the next version (similar to verity where it highlights text etc).
Would also be cool if you could link to pages within a framework...although I'm baffled how one would accomplish this.
Keep up the good work! Being on a mac, I'm finding this tool invaluable!
I don't believe Lucene supports this. If I'm wrong, I'd be happy to add it.
Also, you need to clearly differentiate between snippets and context. A snippet could be from anywhere in the document, but helps identify the document, whereas context shows you the match. So I'm sure you mean context.
That's a shame, it's a cool feature. Yeah my bad, I meant context. Mental note made :)
hi ray
im wondering if seeker allows for more than one index to be created?
i currently need to index lots of different tables with different columns. Instead of trying to collate them into one index, thought it may be easier to create more than one index?
thanks
Like Verity, Seeker works with multiple indexes. So yes, you can do multiple indexes. :)
hi ray.. thanks for letting me know.
if i have this...
<cf_indexquery directory="#index_folder#" indexdirectory="#index_folder#"
query="#arguments.index_qry#"
storecolumns="id,title,content,type,link" indexcolumns="id,title,content">
where would the name of the index file go?
seeker has been a lifesaver as im on a mac.
thanks
It's directory based. So give it a new indexdirectory value.
brilliant... thanks for your help.
Hi Ray,
It doesn't seem like stemming is working when I use Seeker. Is there something I need to do to get it working?
Thanks
Forgive me - but I'm a bit rusty on my code base. I believe you would need to change the analyzer used to parse data. I'd have to look into which are available and what could easily be used with Seeker. You can file an ER for this at RIAForge, but no promise on when I'd have time to get it working.
Hi,
what are you doing with MS word math equations? could you insert formula to database via rich text box or reading from doc files?
Please help me.
regards
farshid
Hi
Maybe a stupid question but how would I search 2 indexes at the same time, one a file index the other a DB index? Or can I join the two together?
Thanks
Peter
Yes. Just provide a list to the collection attribute. This is only allowed if you aren't doing a category search.