Reading Office documents with ColdFusion (2)

This post is more than 2 years old.

Yesterday I wrote a blog entry on reading Microsoft Office documents with ColdFusion, Apache POI, and JavaLoader. One of the commenters, Leah, shared some code that made use of the latest beta of POI. This makes the reading quite a bit simpler. I had tried this myself but ran into trouble. Thanks to Leah, I'm now able to demonstrate a new version that is quite a bit simpler.

First, make sure you have read the previous entry, as some of this won't make sense without the background information. The next thing you want to do is grab POI 3.5 (List of Mirror) and unzip it. Copy all the JARs, all the lib contents, and the ooxml-lib files, into a new subfolder called jars2. "jars2" as a name isn't required of course. My previous version of this code used the jars folder for the 3.2 files so I figured I'd use jars2 for the 3.5 code.

Our initialization code is virtually the same as before:

<!--- where the poi files are ---> <cfset jarpath = expandPath("./jars2")> <cfset paths = []> <cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">

<cfloop query="files"> <cfset arrayAppend(paths, directory & "/" & name)> </cfloop>

<!--- load javaloader ---> <cfset loader = createObject("component", "javaloader.JavaLoader").init(paths)>

Now for the cool part. Remember how we had around 8 or so specific Java classes to do our parsing? This was because each Office type we worked with (Word, Excel, Powerpoint) had their own code and APIs to get at the text. POI 3.5 makes this a bit simpler with a factory called the ExtractorFactory. Here is the rest of the file:

<!--- generic file reader doohicky ---> <cfset myfile = createObject("java","")>

<!--- get our required things loaded ---> <cfset extractorFactory = loader.create("org.apache.poi.extractor.ExtractorFactory")>

<!--- get files ---> <cfset filePath = expandPath("./testdocs")> <cfdirectory action="list" name="files" directory="#filePath#" filter=".doc|.ppt|.xls">

<cfloop query="files"> <cfset theFile = filePath & "/" & name> <cfset myfile.init(theFile)>

&lt;cfoutput&gt;Reading: #theFile#&lt;br/&gt;&lt;/cfoutput&gt;

&lt;cfset extractor = extractorFactory.createExtractor(myFile)&gt;



I made one File object and one instance of the ExtractorFactory. Once I've done that, look how darn simple the code is!

<cfset extractor = extractorFactory.createExtractor(myFile)>

The factory takes care of all the sniffing and ensuring the right extractor is returned. I then just run getText() and we're done. Simpler than a debate with Lindsey Lohan!

I've attached the code to the blog entry. Later today I'll talk about how to get at some of the metadata for Office documents. (Note, the attached zip does not have the jars from POI 3.5, they were a bit too big.)

Download attached file.

Raymond Camden's Picture

About Raymond Camden

Raymond is a senior developer evangelist for Adobe. He focuses on document services, JavaScript, and enterprise cat demos. If you like this article, please consider visiting my Amazon Wishlist or donating via PayPal to show your support. You can even buy me a coffee!

Lafayette, LA

Archived Comments

Comment 1 by Andrius posted on 7/8/2009 at 2:56 PM

Thank you for this elegant solution. It helped me a lot.
However, you could save the javaloader into the server scope and use it from there to avoid memory leaks (please refer to readme.txt within the javaloader folder).

There is a known memory leak issue when using a to load external jar files. Please find more info here: http://www.compoundtheory.c...

Comment 2 by Andrew Duvall posted on 10/2/2009 at 12:04 AM

in this article it says "Copy all the JARs, all the lib contents, and the ooxml-lib files, into a new subfolder called jars2"
I am using the now release final file and now have a total of 7 jar files (16.7 MB)

Question: is that all i need from the zip file?

Issue: I get Error Occurred While Processing Request

The error occurred in C:\Dev\libs\contactsImporter\javaloader\JavaLoader.cfc: line 94

92 : <cfargument name="className" hint="The name of the class to create" type="string" required="Yes">
93 : <cfscript>
94 : var class = getURLClassLoader().loadClass(arguments.className);
95 :
96 : return createJavaProxy(class);

Comment 3 by Andrew Duvall posted on 10/2/2009 at 1:54 AM

ok, this is my first time implementing a POI solution and to be quite frank, I've rarely used jar files. Since the final release of POI-3.5 was just a couple days ago. It seems they excluded some JAR files.
I didn't know where to find the FINAL versions jar files, but used the missing jars from beta6 found here:
4. poi
5. poi_contrib
6. poi_ooxml
7. poi-scratchpad

It seems like the POI project is using POM files; not sure if that is why certains JAR files are gone? not sure what POM files are used for either :)

Anyway, I don't doubt I'll be schooled here; but this is what I did to get my app working for now.

Comment 4 by Raymond Camden posted on 10/2/2009 at 1:56 AM

You won't be schooled by me. ;) I stumble in this area as well sometimes.

Comment 5 by Don posted on 2/19/2010 at 1:24 AM

Well, since your last entry on this I see version 3.6 is out. Can't find 3.5 anywhere. So I shall dive in head first and hope it works.
See, my problem is that I have THOUSANDS of RTF files on a server. The server is dieing so we moved the application to a new server. Now we have to read/edit the rtf files somehow. Previously it was done by simply using cfcontent. This opened Word and they did their thing. That doesn't work now since the files are on 1 server and the application on another. The app server does not have office at all on it.
Fun fun fun.
And now "for security" they want to put these files in an Oracle db. I can do that, but getting them out in an editable format is kicking my butt.

Comment 6 by Mike posted on 6/21/2011 at 8:33 PM

I downloaded your code and downloaded the POI 3.8 beta3. It works GREAT for old formats like ppt, xls, and doc. But as soon as I move a 2007-formatted file into the testdocs folder, it breaks. I checked all my log files and I can't find anything that references what the problem is. CF gives a 500 error message, and temporarily CF is unavailable, it must reboot quickly. It chokes on the line:

<cfset extractor = extractorFactory.createExtractor(myFile)>

for Office 2007 documents (can be any 2007 document - docx, pptx, xlsx) - and these files aren't anything special - just basic files with no macros or fanciness.

any suggestions?

Comment 7 by Raymond Camden posted on 6/23/2011 at 12:01 AM

If I remember right the code was only for pre-2007 files and they had a newer system for 2007+ Office extensions. Best bet is to check the site.

Comment 8 by Raz posted on 9/5/2012 at 8:10 AM

Hi Raymond and Mike, were you able to find a solution for the Office 2007 documents (can be any 2007 document - docx, pptx, xlsx) files? on how to read their metadata?

Comment 9 by Raymond Camden posted on 9/5/2012 at 2:48 PM