Yesterday I wrote a blog entry on reading Microsoft Office documents with ColdFusion, Apache POI, and JavaLoader. One of the commenters, Leah, shared some code that uses the latest beta of POI, which makes reading the documents quite a bit simpler. I had tried this myself but ran into trouble. Thanks to Leah, I'm now able to demonstrate the new version.
First, make sure you have read the previous entry, as some of this won't make sense without the background information. The next thing you want to do is grab POI 3.5 (List of Mirror) and unzip it. Copy all the JARs, all the lib contents, and the ooxml-lib files, into a new subfolder called jars2. "jars2" as a name isn't required of course. My previous version of this code used the jars folder for the 3.2 files so I figured I'd use jars2 for the 3.5 code.
Our initialization code is virtually the same as before:
<!--- where the poi files are --->
<cfset jarpath = expandPath("./jars2")>
<cfset paths = []>
<cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">

<cfloop query="files">
	<cfset arrayAppend(paths, directory & "/" & name)>
</cfloop>

<!--- load javaloader --->
<cfset loader = createObject("component", "javaloader.JavaLoader").init(paths)>
Now for the cool part. Remember how we had around 8 or so specific Java classes to do our parsing? This was because each Office type we worked with (Word, Excel, PowerPoint) had its own code and APIs to get at the text. POI 3.5 makes this a bit simpler with a factory called the ExtractorFactory. Here is the rest of the file:
<!--- generic file reader doohicky --->
<cfset myfile = createObject("java","java.io.File")>

<!--- get our required things loaded --->
<cfset extractorFactory = loader.create("org.apache.poi.extractor.ExtractorFactory")>

<!--- get files --->
<cfset filePath = expandPath("./testdocs")>
<cfdirectory action="list" name="files" directory="#filePath#" filter="*.doc|*.ppt|*.xls">

<cfloop query="files">
	<cfset theFile = filePath & "/" & name>
	<cfset myfile.init(theFile)>

	<cfset extractor = extractorFactory.createExtractor(myFile)>
	<cfoutput><pre>#extractor.getText()#</pre></cfoutput>
</cfloop>
I made one File object and one instance of the ExtractorFactory. Once I've done that, look how darn simple the code is!
<cfset extractor = extractorFactory.createExtractor(myFile)>
The factory takes care of all the sniffing and ensuring the right extractor is returned. I then just run getText() and we're done. Simpler than a debate with Lindsay Lohan!
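Since the factory decides which extractor to use, it can also reject a file it does not recognize. Here is a rough sketch of the same loop with a try/catch around the call, so one bad document does not stop the rest. It assumes the files query, filePath, and extractorFactory variables from the code above. Creating a fresh File object per iteration and escaping the output with htmlEditFormat() are my own choices for the sketch, not something POI requires.

```cfml
<cfloop query="files">
	<cfset theFile = filePath & "/" & name>
	<cftry>
		<!--- assumption: a new java.io.File per document, rather than reusing one object --->
		<cfset currentFile = createObject("java", "java.io.File").init(theFile)>
		<cfset extractor = extractorFactory.createExtractor(currentFile)>
		<cfoutput>
			<h3>#htmlEditFormat(name)#</h3>
			<pre>#htmlEditFormat(extractor.getText())#</pre>
		</cfoutput>
		<!--- catch anything the factory or extractor throws and keep going --->
		<cfcatch type="any">
			<cfoutput><p>Could not read #htmlEditFormat(name)#: #htmlEditFormat(cfcatch.message)#</p></cfoutput>
		</cfcatch>
	</cftry>
</cfloop>
```

Catching type="any" is deliberately broad here. If you want to treat unsupported formats differently from I/O problems, you would need to inspect the exception type yourself.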
I've attached the code to the blog entry. Later today I'll talk about how to get at some of the metadata for Office documents. (Note, the attached zip does not have the jars from POI 3.5, they were a bit too big.)