Seeker updated to support Word docs and Excel files

This post is more than 2 years old.

I just pushed up an update to Seeker, my ColdFusion Lucene project. I added support for MS Word documents and MS Excel files. This was incredibly easy using JavaLoader from Mark Mandel and the POI project.

Todd Sharp gets credit for pushing both these ideas to me. He also made a good suggestion for how to use JavaLoader within Seeker.

Seeker makes use of various "reader" CFCs. Each CFC is responsible for one or more file types. A CFC 'registers' itself using metadata. So here is what plaintext.cfc looks like:

<cfcomponent output="false" hint="Plain text reader." extensions="xml,txt,html,htm,cfm,cfc" extends="reader">

<cffunction name="read" access="public" returnType="string" output="false">
	<cfargument name="file" type="string" required="true">
	<cfset var result = "">

	<cffile action="read" file="#arguments.file#" variable="result">
	<cfreturn result>
</cffunction>

</cfcomponent>

Note the extensions attribute. It tells Seeker that this reader handles all of the plain text file types. What Todd suggested was using a similar method for the Java classes. I'm not terribly happy with the names, but this is what I did.
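As an aside, here is a rough sketch of how this kind of metadata registration could be discovered. It is not Seeker's actual code: the function name buildReaderMap and the readerPaths argument are made up for illustration. It relies only on the fact that custom attributes on a cfcomponent tag show up as keys in the struct returned by getComponentMetaData().

```cfml
<!--- Hypothetical helper, not Seeker's real implementation: maps file extensions to reader CFCs. --->
<cffunction name="buildReaderMap" access="public" returnType="struct" output="false">
	<!--- Assumed input: dot-path names of reader CFCs, e.g. an array containing "readers.plaintext" --->
	<cfargument name="readerPaths" type="array" required="true">
	<cfset var readerMap = structNew()>
	<cfset var md = "">
	<cfset var ext = "">
	<cfset var i = 0>

	<cfloop index="i" from="1" to="#arrayLen(arguments.readerPaths)#">
		<!--- Custom attributes on the cfcomponent tag appear as keys in the metadata struct --->
		<cfset md = getComponentMetaData(arguments.readerPaths[i])>
		<cfif structKeyExists(md, "extensions")>
			<!--- One map entry per extension in the comma-separated list --->
			<cfloop index="ext" list="#md.extensions#">
				<cfset readerMap[lCase(trim(ext))] = arguments.readerPaths[i]>
			</cfloop>
		</cfif>
	</cfloop>

	<cfreturn readerMap>
</cffunction>
```

With a map like this, picking a reader for a file is just a struct lookup on the file's extension.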

When you add requires= to your reader CFC, you specify a list of Java classes. Like so:

<cfcomponent output="false" hint="MS Office format reader." extensions="doc,xls" requires="org.apache.poi.hwpf.HWPFDocument, org.apache.poi.hwpf.extractor.WordExtractor, org.apache.poi.hssf.extractor.ExcelExtractor, org.apache.poi.hssf.usermodel.HSSFWorkbook" extends="reader">

(Spaces were added by me.) When Seeker runs, it will notice these requirements and use JavaLoader to load them. There is a JARs folder that is autoloaded, and it is expected that if your CFC needs a JAR, you will put it in that folder. Since I'm using JavaLoader, all of these JARs are plug and play. No need to restart ColdFusion. Working with the classes is simple as well:
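For anyone who has not used JavaLoader before, loading a folder of JARs looks roughly like the sketch below. The "javaloader" mapping, the "jars" folder name, and the variable names are assumptions for illustration, not necessarily what Seeker does internally.

```cfml
<!--- Assumptions: JavaLoader is reachable as javaloader.JavaLoader and the JARs sit in a "jars" folder beside this template. --->
<cfset jarDir = getDirectoryFromPath(getCurrentTemplatePath()) & "jars">
<cfset jarPaths = arrayNew(1)>

<!--- Collect the full path of every JAR in the folder --->
<cfdirectory action="list" directory="#jarDir#" filter="*.jar" name="jarFiles">
<cfloop query="jarFiles">
	<cfset arrayAppend(jarPaths, jarFiles.directory & "/" & jarFiles.name)>
</cfloop>

<!--- JavaLoader builds its own class loader from those paths, so no server restart is needed --->
<cfset loader = createObject("component", "javaloader.JavaLoader").init(jarPaths)>

<!--- create() hands back the class; calling init() on it later invokes a constructor --->
<cfset wordExtractorClass = loader.create("org.apache.poi.hwpf.extractor.WordExtractor")>
```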

<cfset var doc = getRequirement("org.apache.poi.hwpf.HWPFDocument")>

This calls a method in the inherited CFC that gets the class that was loaded by JavaLoader and injected by the core Seeker code. I'm not happy with that method name there, but it works.
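To give a feel for what a reader might do with a loaded class, here is a minimal sketch of a read method for Word documents only. It assumes getRequirement() returns the class itself (so init() calls the constructor), and it uses POI's WordExtractor with an input stream. The real reader in Seeker may well be written differently.

```cfml
<cffunction name="read" access="public" returnType="string" output="false">
	<cfargument name="file" type="string" required="true">
	<cfset var result = "">
	<!--- Plain JDK class, so the regular ColdFusion class loader is fine here --->
	<cfset var fis = createObject("java", "java.io.FileInputStream").init(arguments.file)>
	<cfset var extractor = "">

	<cftry>
		<!--- Assumption: getRequirement() returns the POI class loaded by JavaLoader --->
		<cfset extractor = getRequirement("org.apache.poi.hwpf.extractor.WordExtractor").init(fis)>
		<!--- getText() returns the document body as one string --->
		<cfset result = extractor.getText()>
		<cfcatch type="any">
			<!--- Release the file handle before passing the error along --->
			<cfset fis.close()>
			<cfrethrow>
		</cfcatch>
	</cftry>
	<cfset fis.close()>

	<cfreturn result>
</cffunction>
```

An Excel version would follow the same shape, building an HSSFWorkbook from the stream and passing it to ExcelExtractor.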

About Raymond Camden

Raymond is a senior developer evangelist for Adobe. He focuses on document services, JavaScript, and enterprise cat demos. If you like this article, please consider visiting my Amazon Wishlist or donating via PayPal to show your support. You can even buy me a coffee!

Lafayette, LA https://www.raymondcamden.com

Archived Comments

Comment 1 by Scott P posted on 6/20/2008 at 11:32 PM

taking that one step further - someone needs to take this and create a word/excel/powerpoint to pdf converter.

Comment 2 by Peter Hoopes posted on 6/21/2008 at 12:07 AM

Is there currently a way for CF8 to process a .docx or .xlsx file uploaded and show it back - or even store it for later editing?

Comment 3 by nick tong posted on 6/21/2008 at 1:05 PM

Hi Ray - Lucene VS Verity - thoughts?

Comment 4 by Raymond Camden posted on 6/21/2008 at 3:45 PM

@Peter - Not sure. I think the POI folks may add support for those new formats, but I've heard they are pretty different.

@Nick - Rough thoughts:

So obviously if you aren't on a Mac, then Verity is built in. No need to 'install' Seeker. I also like the category and suggestions support for Verity. You could probably duplicate category support, but it would be more difficult to do suggestions. (As far as I know, I'm still learning Lucene.)

The big plus for Lucene is index size. You have no license limits like Verity (250k). As I blogged about earlier, I tested w/ an index of 25 million records.

Comment 5 by nick tong posted on 6/21/2008 at 8:04 PM

@ray - thanks. Well, as I've moved over to the 'mac side' I'll give it a bash. Thanks for your thoughts. I presume you only use this locally when wanting to have a search service?

Comment 6 by Raymond Camden posted on 6/23/2008 at 11:55 PM

Correct. I'm still using Verity for production.

Comment 7 by Medman posted on 6/24/2008 at 1:20 AM

Hi Ray,

does Seeker work with ColdFusion 7?

Comment 8 by Raymond Camden posted on 6/24/2008 at 1:23 AM

I haven't tested - but it should. Try it. ;)

Comment 9 by Sami Hoda posted on 6/30/2008 at 10:51 PM

Not sure the instructions are there to add cfadmin pages. You need to add a page for the menu item to show up, correct?

Comment 10 by Raymond Camden posted on 6/30/2008 at 11:02 PM

There is a file in the root of the CF Admin named custommenu.xml. If you open that up, you will see instructions on how to modify the CF Admin to show new links.

Comment 11 by Sami Hoda posted on 6/30/2008 at 11:04 PM

I just created a file extensionscustom.cfm, and added:
<a href="seeker/index.cfm" target="content">Seeker</a><br> to it. Is the XML the new approach?

Comment 12 by Raymond Camden posted on 6/30/2008 at 11:05 PM

Yes. Either works, but the XML way is the 'new' way.

Comment 13 by Sami Hoda posted on 6/30/2008 at 11:06 PM

I see, there needs to be multiple links, not just to index.cfm. So searchtool.cfm and index.cfm are the two links?

Comment 14 by Sami Hoda posted on 6/30/2008 at 11:07 PM

Or is index.cfm the only file needed?

Comment 15 by Raymond Camden posted on 6/30/2008 at 11:11 PM

index.cfm is the only one you need.

Comment 16 by Sami Hoda posted on 7/1/2008 at 12:53 AM

Blogged install instructions. Will be eval'ing Seeker.

Comment 17 by Will Wilson posted on 4/28/2009 at 12:09 AM

Hey Ray,

Any chance you could add snippets on the next version (similar to verity where it highlights text etc).

Would also be cool if you could link to pages within a framework...although I'm baffled how one would accomplish this.

Keep up the good work! Being on a mac, I'm finding this tool invaluable!

Comment 18 by Raymond Camden posted on 4/28/2009 at 12:59 AM

I don't believe Lucene supports this. If I'm wrong, I'd be happy to add it.

Also, you need to clearly differentiate between snippets and context. A snippet could be from anywhere in the document, but helps identify the document, whereas context shows you the match. So I'm sure you mean context.

Comment 19 by Will Wilson posted on 4/28/2009 at 11:29 PM

That's a shame, it's a cool feature. Yeah my bad, I meant context. Mental note made :)

Comment 20 by janusz posted on 6/17/2009 at 1:20 PM

hi ray

im wondering if seeker allows for more than one index to be created?

i currently need to index lots of different tables with different columns. Instead of trying to collate them into one index, thought it may be easier to create more than one index?

thanks

Comment 21 by Raymond Camden posted on 6/17/2009 at 3:16 PM

Like Verity, Seeker works with multiple indexes. So yes, you can do multiple indexes. :)

Comment 22 by janusz posted on 6/17/2009 at 3:38 PM

hi ray.. thanks for letting me know.
if i have this...

<cf_indexquery directory="#index_folder#" indexdirectory="#index_folder#"
query="#arguments.index_qry#"
storecolumns="id,title,content,type,link" indexcolumns="id,title,content">

where would the name of the index file go?

seeker has been a lifesaver as im on a mac.

thanks

Comment 23 by Raymond Camden posted on 6/17/2009 at 3:39 PM

It's directory based. So give it a new indexdirectory value.

Comment 24 by janusz posted on 6/17/2009 at 3:45 PM

brilliant... thanks for your help.

Comment 25 by rchinoy posted on 1/29/2010 at 2:48 AM

Hi Ray,

It doesn't seem like stemming is working when I use Seeker. Is there something I need to do to get it working?

Thanks

Comment 26 by Raymond Camden posted on 2/1/2010 at 11:49 PM

Forgive me - but I'm a bit rusty on my code base. I believe you would need to change the analyzer used to parse data. I'd have to look into which are available and what could easily be used with Seeker. You can file an ER for this at RIAForge, but no promise on when I'd have time to get it working.

Comment 27 by farshid posted on 2/8/2010 at 7:38 PM

Hi,
what are you doing with MS word math equations? could you insert formula to database via rich text box or reading from doc files?
Please help me.
regards
farshid

Comment 28 by web design liverpool posted on 6/4/2010 at 1:45 PM

Hi

Maybe a stupid question but how would I search 2 indexes at the same time, one a file index the other a DB index? Or can I join the two together?

Thanks

Peter

Comment 29 by Raymond Camden posted on 6/4/2010 at 3:19 PM

Yes. Just provide a list to the collection attribute. This is only allowed if you aren't doing a category search.