Yesterday I wrote a blog entry on reading Microsoft Office documents with ColdFusion, Apache POI, and JavaLoader. One of the commenters, Leah, shared some code that made use of the latest beta of POI. This makes the reading quite a bit simpler. I had tried this myself but ran into trouble. Thanks to Leah, I'm now able to demonstrate a new version that is quite a bit simpler.
First, make sure you have read the previous entry, as some of this won't make sense without the background information. The next thing you want to do is grab POI 3.5 (List of Mirror) and unzip it. Copy all the JARs, all the lib contents, and the ooxml-lib files, into a new subfolder called jars2. "jars2" as a name isn't required of course. My previous version of this code used the jars folder for the 3.2 files so I figured I'd use jars2 for the 3.5 code.
Our initialization code is virtually the same as before:
<!--- where the poi files are --->
<cfset jarpath = expandPath("./jars2")>
<cfset paths = []>
<cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">
<cfloop query="files">
<cfset arrayAppend(paths, directory & "/" & name)>
</cfloop>
<!--- load javaloader --->
<cfset loader = createObject("component", "javaloader.JavaLoader").init(paths)>
Now for the cool part. Remember how we had around 8 or so specific Java classes to do our parsing? This was because each Office type we worked with (Word, Excel, Powerpoint) had their own code and APIs to get at the text. POI 3.5 makes this a bit simpler with a factory called the ExtractorFactory. Here is the rest of the file:
<!--- generic file reader doohicky --->
<cfset myfile = createObject("java","java.io.File")>
<!--- get our required things loaded --->
<cfset extractorFactory = loader.create("org.apache.poi.extractor.ExtractorFactory")>
<!--- get files --->
<cfset filePath = expandPath("./testdocs")>
<cfdirectory action="list" name="files" directory="#filePath#" filter=".doc|.ppt|.xls">
<cfloop query="files">
<cfset theFile = filePath & "/" & name>
<cfset myfile.init(theFile)>
<cfoutput>Reading: #theFile#<br/></cfoutput>
<cfset extractor = extractorFactory.createExtractor(myFile)>
<cfoutput><pre>#extractor.getText()#</pre></cfoutput>
<p><hr/></p>
</cfloop>
I made one File object and one instance of the ExtractorFactory. Once I've done that, look how darn simple the code is!
<cfset extractor = extractorFactory.createExtractor(myFile)>
The factory takes care of all the sniffing and ensuring the right extractor is returned. I then just run getText() and we're done. Simpler than a debate with Lindsey Lohan!
I've attached the code to the blog entry. Later today I'll talk about how to get at some of the metadata for Office documents. (Note, the attached zip does not have the jars from POI 3.5, they were a bit too big.)
Archived Comments
Thank you for this elegant solution. It helped me a lot.
However, you could save the javaloader into the server scope and use it from there to avoid memory leaks (please refer to readme.txt within the javaloader folder).
There is a known memory leak issue when using a java.net.URLClassLoader to load external jar files. Please find more info here: http://www.compoundtheory.c...
in this article it says "Copy all the JARs, all the lib contents, and the ooxml-lib files, into a new subfolder called jars2"
I am using the now release final poi-3.5-final.zip file and now have a total of 7 jar files (16.7 MB)
\jars2\lib\commons-logging-1.1.jar
\jars2\lib\junit-3.8.1.jar
\jars2\lib\log4j-1.2.13.jar
\jars2\ooxml-lib\dom4j-1.6.1.jar
\jars2\ooxml-lib\geronimo-stax-api_1.0_spec-1.0.jar
\jars2\ooxml-lib\ooxml-schemas-1.0.jar
\jars2\ooxml-lib\xmlbeans-2.3.0.jar
Question: is that all i need from the zip file?
Issue: I get Error Occurred While Processing Request
org.apache.poi.extractor.ExtractorFactory
The error occurred in C:\Dev\libs\contactsImporter\javaloader\JavaLoader.cfc: line 94
92 : <cfargument name="className" hint="The name of the class to create" type="string" required="Yes">
93 : <cfscript>
94 : var class = getURLClassLoader().loadClass(arguments.className);
95 :
96 : return createJavaProxy(class);
ok, this is my first time implementing a POI solution and to be quite frank, I've rarely used jar files. Since the final release of POI-3.5 was just a couple days ago. It seems they excluded some JAR files.
I didn't know where to find the FINAL versions jar files, but used the missing jars from beta6 found here:
http://www.jarvana.com/jarv...
folders:
4. poi
5. poi_contrib
6. poi_ooxml
7. poi-scratchpad
It seems like the POI project is using POM files; not sure if that is why certains JAR files are gone? not sure what POM files are used for either :)
Anyway, I don't doubt I'll be schooled here; but this is what I did to get my app working for now.
You won't be schooled by me. ;) I stumble in this area as well sometimes.
Well, since your last entry on this I see version 3.6 is out. Can't find 3.5 anywhere. So I shall dive in head first and hope it works.
See, my problem is that I have THOUSANDS of RTF files on a server. The server is dieing so we moved the application to a new server. Now we have to read/edit the rtf files somehow. Previously it was done by simply using cfcontent. This opened Word and they did their thing. That doesn't work now since the files are on 1 server and the application on another. The app server does not have office at all on it.
Fun fun fun.
And now "for security" they want to put these files in an Oracle db. I can do that, but getting them out in an editable format is kicking my butt.
I downloaded your code and downloaded the POI 3.8 beta3. It works GREAT for old formats like ppt, xls, and doc. But as soon as I move a 2007-formatted file into the testdocs folder, it breaks. I checked all my log files and I can't find anything that references what the problem is. CF gives a 500 error message, and temporarily CF is unavailable, it must reboot quickly. It chokes on the line:
<cfset extractor = extractorFactory.createExtractor(myFile)>
for Office 2007 documents (can be any 2007 document - docx, pptx, xlsx) - and these files aren't anything special - just basic files with no macros or fanciness.
any suggestions?
If I remember right the code was only for pre-2007 files and they had a newer system for 2007+ Office extensions. Best bet is to check the site.
Hi Raymond and Mike, were you able to find a solution for the Office 2007 documents (can be any 2007 document - docx, pptx, xlsx) files? on how to read their metadata?
Try this maybe? http://www.raymondcamden.co...