Have you ever needed to get information about Office documents with ColdFusion? In this blog entry I'll talk about how we can use some open source tools along with ColdFusion to read from various types of Microsoft Office documents. I'll be focusing on retrieving the text of the documents only. You can use this code to allow users to upload Word documents and provide basic search features against the actual contents of the documents.
Now you may ask - doesn't the built in Verity engine do this? It does, but if you have any other need for the text, you are out of luck. Verity sucks in the text and stores it in it's own format. You can't ask Verity to make a copy of the complete text. Let's get started!
Our code will make use of two open source products. The first is the Apache POI product. This is a set of Java libraries that allow deep integration into Office documents, both reading and writing. Our use of the library will be fairly simple - just reading. I downloaded the latest 3.2 version which provides support for Office formats from the old 97 versions up to the Office 2007 release. For OOXML you can use POI 3.5, which is in beta. I had trouble playing with this so decided to focus on the non-OOXML version.
After you download the zip, open up the archive and copy the 3 JAR files you find in the root. I copied these to a folder called 'jars' under my web root.
The next thing we need is JavaLoader. JavaLoader is a ColdFusion project created by Mark Mandel (the creator of Transfer). This code lets you load any random JAR file on the fly. Normally you have to copy new JARs to a specific location under your ColdFusion install and you have to follow this up with a server restart. JavaLoader is a much simpler way to handle this. It also helps with another problem - class conflicts. ColdFusion itself makes use of POI (you can see a few POI jar files in the lib folder) but the version bundled is older than what we need to use to extra our text. By using JavaLoader, we can make sure everything comes directly from the JARs in play. I downloaded JavaLoader and extracted the code to a folder named javaloader under my web root.
Alright, so let's look at some code. The first thing I want to do is get JavaLoader initialized with the JAR files from POI. When you initialize JavaLoader you pass in an array of JAR files. Since I put everything in a folder, I wrote some code to simply iterate over the directory and create a list of all the JARs. I then pass this to JavaLoader:
<!--- where the poi files are ---> <cfset jarpath = expandPath("./jars")> <cfset paths = > <cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">
<cfloop query="files"> <cfset arrayAppend(paths, directory & "/" & name)> </cfloop>
<!--- load javaloader ---> <cfset variables.loader = createObject("component", "javaloader.JavaLoader").init(paths)>
Now I'm going to create a set of Java classes. This will be used by the extraction code and comes from my reading of the POI docs. Each extraction will typically involve one Java object for the file itself and one for the extractor:
<!--- generic file reader doohicky ---> <cfset myfile = createObject("java","java.io.FileInputStream")>
<!--- get our required things loaded --->
<!--- Word ---> <cfset doc = loader.create("org.apache.poi.hwpf.HWPFDocument")> <cfset wordext = loader.create("org.apache.poi.hwpf.extractor.WordExtractor")>
<!--- Excel ---> <cfset excel = loader.create("org.apache.poi.hssf.usermodel.HSSFWorkbook")> <cfset xlsext = loader.create("org.apache.poi.hssf.extractor.ExcelExtractor")>
<!--- Powerpoint ---> <cfset ppt = loader.create("org.apache.poi.hslf.HSLFSlideShow")> <cfset pptext = loader.create("org.apache.poi.hslf.extractor.PowerPointExtractor")>
Ok, now to get parsing. I created a new folder called testdocs and dumped a few Word, Excel, and Powerpoint files inside. Our code will now loop over each file and attempt to extract it based on the extension:
<!--- get files ---> <cfset filePath = expandPath("./testdocs")> <cfdirectory action="list" name="files" directory="#filePath#">
<cfoutput query="files"> <cfset theFile = filePath & "/" & name> <cfset myfile.init(theFile)>
Reading: #theFile#<br/> <cfswitch expression="#listLast(name,".")#"> <cfcase value="doc"> <cfset doc = doc.init(myfile)> <cfset wordext.init(doc)> <cfoutput>
<pre> #wordext.getText()# </pre> </cfoutput> </cfcase>
<cfcase value="xls"> <cfset excel = excel.init(myfile)> <cfset xlsext = xlsext.init(excel)> <cfoutput>
<pre> #xlsext.getText()# </pre> </cfoutput> </cfcase>
<cfcase value="ppt"> <cfset ppt = ppt.init(myfile)> <cfset pptext = pptext.init(ppt)> <cfoutput>
<pre> #pptext.getText(true,true)# </pre> </cfoutput> </cfcase> </cfswitch>
As before we use cfdirectory and then we loop over each file. Notice that I'm initializing a FileInputStream object, myfile, which will be used to seed the specific object representations of the Office documents.
For each file we check the extension, and based on the extension, use the appropriate extractor object to get the text.
That's it! If you download the zip attached to this blog entry and then run the code, you will see that the code does a reasonable good job of representing the text. For example, the text from the Excel document is laid out much like how you see it visually in MS Excel.
Sample output from an Excel sheet
Once you have the text you can store it in a database, email it, or do anything else you would like with it.
Pretty simple and powerful! Also note that you can get other data then just the text. Each extractor has it's own options and each Office object itself can do it's own unique thing. For example, with the Powerpoint support you can choose to get the text of slides or the notes, or both. Check the docs at Apache for more information.