Posted in ColdFusion | Posted on 02-04-2009 | 10,665 views
Have you ever needed to get information about Office documents with ColdFusion? In this blog entry I'll talk about how we can use some open source tools along with ColdFusion to read from various types of Microsoft Office documents. I'll be focusing on retrieving the text of the documents only. You can use this code to allow users to upload Word documents and provide basic search features against the actual contents of the documents.
Now you may ask - doesn't the built in Verity engine do this? It does, but if you have any other need for the text, you are out of luck. Verity sucks in the text and stores it in it's own format. You can't ask Verity to make a copy of the complete text. Let's get started!
Our code will make use of two open source products. The first is the Apache POI product. This is a set of Java libraries that allow deep integration into Office documents, both reading and writing. Our use of the library will be fairly simple - just reading. I downloaded the latest 3.2 version which provides support for Office formats from the old 97 versions up to the Office 2007 release. For OOXML you can use POI 3.5, which is in beta. I had trouble playing with this so decided to focus on the non-OOXML version.
After you download the zip, open up the archive and copy the 3 JAR files you find in the root. I copied these to a folder called 'jars' under my web root.
The next thing we need is JavaLoader. JavaLoader is a ColdFusion project created by Mark Mandel (the creator of Transfer). This code lets you load any random JAR file on the fly. Normally you have to copy new JARs to a specific location under your ColdFusion install and you have to follow this up with a server restart. JavaLoader is a much simpler way to handle this. It also helps with another problem - class conflicts. ColdFusion itself makes use of POI (you can see a few POI jar files in the lib folder) but the version bundled is older than what we need to use to extra our text. By using JavaLoader, we can make sure everything comes directly from the JARs in play. I downloaded JavaLoader and extracted the code to a folder named javaloader under my web root.
Alright, so let's look at some code. The first thing I want to do is get JavaLoader initialized with the JAR files from POI. When you initialize JavaLoader you pass in an array of JAR files. Since I put everything in a folder, I wrote some code to simply iterate over the directory and create a list of all the JARs. I then pass this to JavaLoader:
2<cfset jarpath = expandPath("./jars")>
3<cfset paths = []>
4<cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">
5
6<cfloop query="files">
7 <cfset arrayAppend(paths, directory & "/" & name)>
8</cfloop>
9
10<!--- load javaloader --->
11<cfset variables.loader = createObject("component", "javaloader.JavaLoader").init(paths)>
Now I'm going to create a set of Java classes. This will be used by the extraction code and comes from my reading of the POI docs. Each extraction will typically involve one Java object for the file itself and one for the extractor:
2<cfset myfile = createObject("java","java.io.FileInputStream")>
3
4<!--- get our required things loaded --->
5
6<!--- Word --->
7<cfset doc = loader.create("org.apache.poi.hwpf.HWPFDocument")>
8<cfset wordext = loader.create("org.apache.poi.hwpf.extractor.WordExtractor")>
9
10<!--- Excel --->
11<cfset excel = loader.create("org.apache.poi.hssf.usermodel.HSSFWorkbook")>
12<cfset xlsext = loader.create("org.apache.poi.hssf.extractor.ExcelExtractor")>
13
14<!--- Powerpoint --->
15<cfset ppt = loader.create("org.apache.poi.hslf.HSLFSlideShow")>
16<cfset pptext = loader.create("org.apache.poi.hslf.extractor.PowerPointExtractor")>
Ok, now to get parsing. I created a new folder called testdocs and dumped a few Word, Excel, and Powerpoint files inside. Our code will now loop over each file and attempt to extract it based on the extension:
2<cfset filePath = expandPath("./testdocs")>
3<cfdirectory action="list" name="files" directory="#filePath#">
4
5
6<cfoutput query="files">
7 <cfset theFile = filePath & "/" & name>
8 <cfset myfile.init(theFile)>
9
10 Reading: #theFile#<br/>
11
12 <cfswitch expression="#listLast(name,".")#">
13
14 <cfcase value="doc">
15 <cfset doc = doc.init(myfile)>
16 <cfset wordext.init(doc)>
17 <cfoutput>
18<pre>
19#wordext.getText()#
20</pre>
21 </cfoutput>
22 </cfcase>
23
24 <cfcase value="xls">
25 <cfset excel = excel.init(myfile)>
26 <cfset xlsext = xlsext.init(excel)>
27 <cfoutput>
28<pre>
29#xlsext.getText()#
30</pre>
31 </cfoutput>
32 </cfcase>
33
34 <cfcase value="ppt">
35 <cfset ppt = ppt.init(myfile)>
36 <cfset pptext = pptext.init(ppt)>
37 <cfoutput>
38<pre>
39#pptext.getText(true,true)#
40</pre>
41 </cfoutput>
42 </cfcase>
43 </cfswitch>
44
45 <p><hr/></p>
46
47</cfoutput>
As before we use cfdirectory and then we loop over each file. Notice that I'm initializing a FileInputStream object, myfile, which will be used to seed the specific object representations of the Office documents.
For each file we check the extension, and based on the extension, use the appropriate extractor object to get the text.
That's it! If you download the zip attached to this blog entry and then run the code, you will see that the code does a reasonable good job of representing the text. For example, the text from the Excel document is laid out much like how you see it visually in MS Excel.
Sample output from an Excel sheet

Once you have the text you can store it in a database, email it, or do anything else you would like with it.
Pretty simple and powerful! Also note that you can get other data then just the text. Each extractor has it's own options and each Office object itself can do it's own unique thing. For example, with the Powerpoint support you can choose to get the text of slides or the notes, or both. Check the docs at Apache for more information.
How are people using POI in the wild? Ben Nadel has his POI Utility which enables for reading/writing Excel docs. Todd Sharp used to use it for SlideSix but has since moved on to using OpenOffice.


Good post Ray!
I see envision lot's of opportunities with this ...
Great post Ray ... :)
So it may simply be that there isn't that much for POI to hook into.
Now I'm trying to switch to Tika (http://lucene.apache.org/tika/) to support more types of document.
> I had trouble playing with this
Just curious, but what problems did you have? I just started using the beta this week and have had pretty good results so far. Though, it does require more jars than previous versions.
But I am very glad you posted this - because it made discover there is a handy factory class for extracting text. You can pass in a file and the factory returns the correct type of extractor. I have not tried it with PPT, but it seems to be working well with excel and word.
ExtractorFactory = javaLoader.create("org.apache.poi.extractor.ExtractorFactory");
inputFile = createObject("java", "java.io.File").init( "c:\myFiles\testExcel2007.xlsx");
excelExt = ExtractorFactory.createExtractor(inputFile);
WriteOutput("excelExt text=<hr>"& excelExt.getText() &"<hr>");
Leigh
Darn fat fingers.
So my next step is to delete your comment, ban your IP, and take credit! Muahahahahahahahahhahahahaha!
(Sorry! ;)
Reading Office 2007 files is very usefull. Especially when Verity does not handle docx and xlsx files.
Now I can read the text from the Office 2007 files and index the text with CFINDEX.
- Mikkel
So question: Does it make sense for me to build a generic reader CFC for Office files. Something that will get text and get MD properties? I don't want to duplicate Ben's work, but his is focused on Excel only. Would a CFC to just read various office formats be worth the time/effort? Should I ping Ben about possibly expanding his POIUtils to cover more Office formats?
And to me it is out-dated. Latest Word format is Word 2002, PDF version is 1.4 (acrobat 5).
If Adobe wants to use Verity as a sales parameter. Then they have to update the supported file types.
You could wrap it in <pre> and put that in cfdocument to get started.
You could also use OpenOffice to do it, but that requires OpenOffice running as a service on the server.
I've been meaning to blog on that topic for about 6 months now. Maybe some day...
I would be very interested in the OpenOffice method.
Very much so.
> I also have another update showing getting some basic metadata.
Cool. I look forward it. I have not gotten around to exploring metadata with the newer format yet.
PS: My response was delayed because I think my ip was banned (okay, I forgot to subscribe .. but being banned makes me sounds dangerous, like an outlaw ;-)
I am trying to use a Word document as a template, read it in using POIFS, modify some content, and create a new Word document WITH the styles from the original document.
Can you clue me in as to how to do this or where there is good example code for ColdFusion/POIFS Word document creation?
Thanks!
I am trying to convert PPTX files to PPT files because CF doesn't like to convert PPTX files to a PDF. I am using the Open Office ODFConverter to convert my PPTX files to an ODP file before I convert it to a PDF. This works great.
When I try to use CFDocument to convert that ODP file to a PDF I get a bunch of gibberish on the page.
Do you know how to convert these ODP files to a PDF?
<cfdocument
format="pdf"
srcfile="#directory#\sourceFile.odp"
filename="#directory#\destinationFile.pdf"
overwrite="yes">
</cfdocument>
Thanks for the help
@Todd... I'll check out the JODConverter and see how it goes. Thank you for the suggestion.
<cfset docxExt = extractorFactory.createExtractor(fileIo)>
Any output after this tag doesn't show, I can confirm everything works until then, and doc files work fine.
fileIo is a fileInputStream, I've tried a java.io.file too, no difference.
After running my test script, the docx file is still locked since any attempts to change the filename returns an error saying Jrun still has it.
So basically it goes into this method and never returns, any ideas?
[Add Comment] [Subscribe to Comments]