A few days ago a reader asked me if I had code that could handle extracting text from various document formats. There are multiple tools in ColdFusion that can do this. My first thought was to build a CFC that used a large switch block to hand each file off to the appropriate utility. For some formats, this would be easy. CFPDF, for example, has a text extraction feature. Others would be a bit more work. You can convert PPTs and Word docs to PDF using CFDOCUMENT and then use CFPDF to extract the text. Excel files can be parsed using CFSPREADSHEET. You get the idea.
Before going down that route, however, I took a look at Apache Tika. Tika supports extracting metadata and text from numerous document formats. (Complete list of supported formats.)
Turns out Tika has a pretty simple API. How simple? I was able to get the code down to both extract text and return metadata in fewer than 50 lines. Here's the complete code for the CFC:
As you can see, I make use of the excellent JavaLoader library (the development branch, to be clear). Once you have an instance of the CFC, it is a simple matter of passing a filename to the read method. The metadata is very detailed. For a PPTX I parsed, I got info on the number of slides as well as the presentation template. It even returned a large amount of information on an MP3.
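For anyone reading without the repo at hand, here is a minimal sketch of what a CFC along these lines could look like. It is not the code from the repo. It assumes JavaLoader is reachable at the component path javaloader.JavaLoader and that jarPath (a hypothetical argument) points at a Tika jar on disk. It uses Tika's AutoDetectParser, BodyContentHandler, and Metadata classes, and it leaves out any thread context class loader handling, which you may need depending on your JavaLoader and ColdFusion versions.

```cfml
component {

    // jarPath is an assumption: the full path to a Tika jar on your server.
    public any function init(required string jarPath) {
        var paths = [];
        arrayAppend(paths, arguments.jarPath);
        variables.loader = createObject("component", "javaloader.JavaLoader").init(paths);
        return this;
    }

    // Returns a struct with the extracted text and a struct of metadata values.
    public struct function read(required string filename) {
        var result = { text = "", metadata = {} };
        var parser = variables.loader.create("org.apache.tika.parser.AutoDetectParser").init();
        // -1 removes the default limit on how many characters the handler collects.
        var handler = variables.loader.create("org.apache.tika.sax.BodyContentHandler").init(javaCast("int", -1));
        var metadata = variables.loader.create("org.apache.tika.metadata.Metadata").init();
        var stream = createObject("java", "java.io.FileInputStream").init(arguments.filename);
        var names = "";
        var i = 0;

        try {
            parser.parse(stream, handler, metadata);
            result.text = handler.toString();
            // Copy each metadata name/value pair into a plain CF struct.
            names = metadata.names();
            for (i = 1; i <= arrayLen(names); i++) {
                result.metadata[names[i]] = metadata.get(names[i]);
            }
        } finally {
            stream.close();
        }

        return result;
    }

}
```

Assuming the component is saved as tika.cfc (a hypothetical name), usage would be something like result = new tika(jarPath).read(expandPath("./sample.pptx")), after which result.text holds the extracted text and result.metadata holds whatever Tika reported for that file.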
You can download the code plus a small example at the GitHub repo: https://github.com/cfjedimaster/getallthetexts
Special thanks to Mark Mandel for help with a class loader issue I ran into and Jeff Coughlin as well.
Archived Comments
I am not sure why you are not using the Tika Method parseToString()?
e.g.
oTika = createObject("java", "org.apache.tika.Tika");
sContent = oTika.parseToString( FileInputStream );
Because I wanted the metadata too. Actually, the person who asked me to write this didn't ask for it - but I was impressed by the level of metadata Tika returned.
Hi, I tried your code but I'm getting an error:
The setContextClassLoader method was not found.
please see the complete stack trace at:
http://pastebin.com/UiYeREFS
Thanks in advance
Weird - that is coming from JavaLoader itself. What version of ColdFusion are you running?
CF9
I'm on CF10. Hmmm. Maybe try the released JavaLoader? The one I have in Github is from a development branch in JavaLoader, not the release version.
I can't get this to work; I get the same error. Any ideas?
Best I can suggest is checking w/ Mark Mandel since it is a JavaLoader issue (afaik).
I am having the same issue as above. I am running CF 10. Did anyone solve this?
I did get it to work in the end. But I can't remember exactly how, definitely something related to the version of the JavaLoader. If I have time I'll dig it out and get back to you.
- Joel