getAllTheTexts - simple Apache Tika wrapper

A few days ago a reader asked me if I had code that could handle extracting text from various document formats. There are multiple tools in ColdFusion that can do this. My first thought was to build a CFC that used a large switch block to hand each file type off to the appropriate utility. For some, this would be easy. CFPDF, for example, has a text extraction feature. Others would be a bit more work. You can convert PPTs and Word docs to PDF using CFDOCUMENT and then use CFPDF to extract text. Excel files can be parsed using CFSPREADSHEET. You get the idea.

Before going down that route, however, I took a look at Apache Tika. Tika supports extracting metadata and text from numerous file formats. (Complete list of supported formats.)

Turns out Tika has a pretty simple API. How simple? I was able to get the code that both extracts text and returns metadata down to fewer than 50 lines. Here's the complete code for the CFC:

As you can see, I make use of the excellent JavaLoader library (the development branch to be clear). Once you have an instance of the CFC, it is a simple matter of passing a filename to the read method. The metadata is very deep. For a PPTX I parsed I got info on the number of slides as well as the presentation template. It even returned a large amount of information on an MP3.
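To give a rough idea of the shape of such a wrapper, here is a minimal sketch. It is not the CFC from the repository. The component name `TikaSketch`, the `jarPaths` argument, and the `javaloader` mapping are all assumptions made for illustration. It leans on Tika's `Tika` facade class and its `parseToString(InputStream, Metadata)` method, and it does not attempt the class loader workaround mentioned below, which may be needed depending on your Tika and JavaLoader versions.

```cfml
// TikaSketch.cfc (hypothetical name; an illustration, not the repository code)
component {

    // jarPaths: array of absolute paths to the Tika jar(s), supplied by the caller (assumption)
    public any function init(required array jarPaths) {
        // Assumes JavaLoader is reachable under a "javaloader" mapping
        variables.loader = createObject("component", "javaloader.JavaLoader").init(arguments.jarPaths);
        return this;
    }

    // Returns a struct with two keys: text (string) and metadata (struct of name/value pairs)
    public struct function read(required string filename) {
        var result = structNew();
        var tika = variables.loader.create("org.apache.tika.Tika").init();
        var meta = variables.loader.create("org.apache.tika.metadata.Metadata").init();
        var input = createObject("java", "java.io.FileInputStream").init(arguments.filename);
        var names = "";
        var i = 0;

        // parseToString returns the extracted text and populates the Metadata object;
        // as I read the Tika javadoc, it also closes the stream it is given
        result.text = tika.parseToString(input, meta);

        // Copy the Java Metadata object into a plain ColdFusion struct
        result.metadata = structNew();
        names = meta.names();
        for (i = 1; i <= arrayLen(names); i++) {
            result.metadata[names[i]] = meta.get(names[i]);
        }

        return result;
    }
}
```

With something like this in place, you would create the component once, passing it the location of the Tika jar, and then call `read()` with the full path to a file. One thing to watch: the `Tika` facade caps the length of the returned string by default, so very large documents may come back truncated unless you raise the limit with `setMaxStringLength()`.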

You can download the code plus a small example at the GitHub repo: https://github.com/cfjedimaster/getallthetexts

Special thanks to Mark Mandel for help with a class loader issue I ran into, and to Jeff Coughlin as well.

Archived Comments

Comment 1 by Harry posted on 8/17/2012 at 3:45 PM

I am not sure why you are not using the Tika Method parseToString()?
e.g.
oTika = createObject("java", "org.apache.tika.Tika");
sContent = oTika.parseToString( FileInputStream );

Comment 2 by Raymond Camden posted on 8/17/2012 at 3:59 PM

Because I wanted the metadata too. Actually, the person who asked me to write this didn't ask for it - but I was impressed by the level of metadata Tika returned.

Comment 3 by Raz posted on 9/9/2012 at 8:28 PM

Hi, I tried your code but I'm getting an error:
The setContextClassLoader method was not found.

please see the complete stack trace at:

http://pastebin.com/UiYeREFS

Thanks in advance

Comment 4 by Raymond Camden posted on 9/9/2012 at 8:37 PM

Weird - that is coming from JavaLoader itself. What version of ColdFusion are you running?

Comment 5 by Raz posted on 9/10/2012 at 4:46 AM

CF9

Comment 6 by Raymond Camden posted on 9/10/2012 at 6:15 AM

I'm on CF10. Hmmm. Maybe try the released JavaLoader? The one I have in Github is from a development branch in JavaLoader, not the release version.

Comment 7 by Joel Stobart posted on 1/4/2013 at 7:01 PM

I can't get this to work; I get the same error. Any ideas?

Comment 8 by Raymond Camden posted on 1/4/2013 at 7:30 PM

Best I can suggest is checking w/ Mark Mandel since it is a JavaLoader issue (afaik).

Comment 9 by Kelly posted on 6/7/2013 at 2:46 AM

I am having the same issue as above. I am running CF 10. Did anyone solve this?

Comment 10 by Joel Stobart posted on 6/7/2013 at 4:33 PM

I did get it to work in the end. But I can't remember exactly how, definitely something related to the version of the JavaLoader. If I have time I'll dig it out and get back to you.
- Joel

Raymond Camden
