Twitter: raymondcamden



getAllTheTexts - simple Apache Tika wrapper

08-16-2012 ColdFusion

A few days ago a reader asked me if I had code that could extract text from various document formats. ColdFusion has multiple tools that can do this. My first thought was to build a CFC with a large switch block that handed each file off to the appropriate one. For some formats, this would be easy: CFPDF, for example, has a text extraction feature. Others would be a bit more work: you can convert PPTs and Word docs to PDF using CFDOCUMENT and then use CFPDF to extract the text, and Excel files can be parsed using CFSPREADSHEET. You get the idea.

Before going down that route, however, I took a look at Apache Tika. Tika supports extracting metadata and text from numerous document formats. (Complete list of supported formats.)

Turns out Tika has a pretty simple API. How simple? I was able to both extract text and return metadata in fewer than 50 lines of code. The complete code for the CFC is in the GitHub repo linked below.

The CFC makes use of the excellent JavaLoader library (the development branch, to be clear). Once you have an instance of the CFC, it is a simple matter of passing a filename to the read method. The metadata is quite detailed. For a PPTX I parsed, I got info on the number of slides as well as the presentation template. It even returned a large amount of information on an MP3.
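For readers who want a feel for the shape of such a wrapper, here is a minimal sketch. It is not the code from the repo. It assumes JavaLoader is installed so that `javaloader.JavaLoader` resolves, and that a Tika jar containing the parsers (for example the tika-app jar) is available. The `tikaJarPath` argument and the keys of the returned struct are illustrative choices, not part of the original CFC. Depending on your JavaLoader and Tika versions, Tika may not find its parsers unless the thread context class loader is switched; that step is left out here.

```cfml
component {

	// tikaJarPath is a hypothetical argument: the absolute path to a Tika jar
	// (for example the tika-app jar) somewhere on your server.
	public any function init(required string tikaJarPath) {
		// Assumes JavaLoader is installed where "javaloader.JavaLoader" resolves.
		variables.loader = createObject("component", "javaloader.JavaLoader").init([arguments.tikaJarPath]);
		return this;
	}

	// Returns a struct with two keys: text (string) and metadata (struct).
	public struct function read(required string filename) {
		var result = { text = "", metadata = {} };
		var tika = variables.loader.create("org.apache.tika.Tika").init();
		var meta = variables.loader.create("org.apache.tika.metadata.Metadata").init();
		var stream = createObject("java", "java.io.FileInputStream").init(arguments.filename);
		var names = "";
		var i = 0;

		try {
			// parseToString fills the Metadata object as a side effect.
			result.text = tika.parseToString(stream, meta);
		} finally {
			// Release the file handle even if parsing throws.
			stream.close();
		}

		// Copy the Java metadata into a plain ColdFusion struct.
		names = meta.names();
		for (i = 1; i <= arrayLen(names); i++) {
			result.metadata[names[i]] = meta.get(names[i]);
		}
		return result;
	}

}
```

Saved as, say, `tikasketch.cfc` (a made-up name), it could be used along the lines of `info = createObject("component", "tikasketch").init(expandPath("./tika-app.jar")).read(expandPath("./sample.pdf"))`, after which `info.text` holds the extracted text and `info.metadata` the key/value pairs Tika reported.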

You can download the code plus a small example at the GitHub repo: https://github.com/cfjedimaster/getallthetexts

Special thanks to Mark Mandel, for help with a class loader issue I ran into, and to Jeff Coughlin as well.

10 Comments

  • Commented on 08-17-2012 at 6:45 AM
    I am not sure why you are not using the Tika method parseToString().
    e.g.
    oTika = createObject("java", "org.apache.tika.Tika");
    sContent = oTika.parseToString( FileInputStream );
  • Commented on 08-17-2012 at 6:59 AM
    Because I wanted the metadata too. Actually, the person who asked me to write this didn't ask for it - but I was impressed by the level of metadata Tika returned.
  • Raz #
    Commented on 09-09-2012 at 11:28 AM
    Hi, I tried your code but I'm getting an error:
       The setContextClassLoader method was not found.

    Please see the complete stack trace at:

    http://pastebin.com/UiYeREFS

    Thanks in advance
  • Commented on 09-09-2012 at 11:37 AM
    Weird - that is coming from JavaLoader itself. What version of ColdFusion are you running?
  • Raz #
    Commented on 09-09-2012 at 7:46 PM
    CF9
  • Commented on 09-09-2012 at 9:15 PM
    I'm on CF10. Hmmm. Maybe try the released JavaLoader? The one I have in GitHub is from a development branch of JavaLoader, not the release version.
  • Commented on 01-04-2013 at 8:01 AM
    I can't get this to work; I get the same error. Any ideas?
  • Commented on 01-04-2013 at 8:30 AM
    Best I can suggest is checking w/ Mark Mandel since it is a JavaLoader issue (afaik).
  • Kelly #
    Commented on 06-06-2013 at 5:46 PM
    I am having the same issue as above. I am running CF 10. Did anyone solve this?
  • Commented on 06-07-2013 at 7:33 AM
    I did get it to work in the end. I can't remember exactly how, but it was definitely something related to the version of JavaLoader. If I have time I'll dig it out and get back to you.
    - Joel