Working with Office Metadata

I’ve had a chance to play a bit more with the Apache POI project and I thought I’d share a bit of code demonstrating how to read Office document metadata. Unfortunately I was not able to get this working with Office 2007, but maybe I’ll get lucky with reader Leah again! Anyway, the code:

<!--- where the poi files are --->
<cfset jarpath = expandPath("./jars2")>
<cfset paths = []>
<cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">

<cfloop query="files">
    <cfset arrayAppend(paths, directory & "/" & name)>
</cfloop>

<!--- load javaloader --->
<cfset variables.loader = createObject("component", "javaloader.JavaLoader").init(paths)>

This is the exact same code I’ve used in my previous two blog entries so I won’t explain it again.

<!--- read in my Word doc --->
<cfset myfile = createObject("java","java.io.FileInputStream").init(expandPath("./testdocs/Testing Reading Word Docs.doc"))>

Unlike my previous entries where I looped over a folder of documents, in this example I’m just using one specific Word document.

<!--- Word Support --->
<cfset doc = loader.create("org.apache.poi.hwpf.HWPFDocument")>
<!--- init it with my java file input stream set to my test file --->
<cfset doc = doc.init(myfile)>

Next I create an instance of HWPFDocument. This is the specific class used for Word documents. You would use a different class for PowerPoint or Excel files. Once I have the object, I initialize it with the file input stream I opened earlier.
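As a rough sketch of what "something different" could look like for Excel, the snippet below swaps in HSSFWorkbook, POI's class for .xls files. It assumes the jars loaded earlier include POI's HSSF code, and the file name test.xls is made up for illustration. I have not tried this against every POI version, so treat it as a starting point rather than tested code.

```cfml
<!--- hypothetical test file; point this at a real .xls --->
<cfset xlsfile = createObject("java","java.io.FileInputStream").init(expandPath("./testdocs/test.xls"))>

<!--- Excel Support: HSSFWorkbook takes the place of HWPFDocument --->
<cfset book = loader.create("org.apache.poi.hssf.usermodel.HSSFWorkbook").init(xlsfile)>
<cfset xlsfile.close()>

<!--- same kind of summary object as with a Word document --->
<cfset xlsSummary = book.getSummaryInformation()>
<cfoutput>Title=#xlsSummary.getTitle()#<br/></cfoutput>
```

Note that a workbook with no title set returns a Java null from getTitle(), which ColdFusion treats as undefined, so real code would want to guard that output.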

Ok, now for the fun part:

<cfset summary = doc.getSummaryInformation()>

This code retrieves the summary data from the document. The result is itself a Java object (POI’s SummaryInformation) with a set of methods to get, set, and remove document metadata. As an example:

<cfoutput>
Title=#summary.getTitle()#<br/>
Page Count=#summary.getPageCount()#<br/>
Word Count=#summary.getWordCount()#<br/>
Application=#summary.getApplicationName()#<br/>
Author=#summary.getAuthor()#<br/>
Comments=#summary.getComments()#<br/>
CreateDateTime=#summary.getCreateDateTime()#<br/>
Edit Time=#summary.getEditTime()#<br/>
Keywords=#summary.getKeywords()#<br/>
Last Author=#summary.getLastAuthor()#<br/>
Last Printed=#summary.getLastPrinted()#<br/>
Last SaveDateTime=#summary.getLastSaveDateTime()#<br/>
Revision Number=#summary.getRevNumber()#<br/>
Security=#summary.getSecurity()#<br/>
Subject=#summary.getSubject()#<br/>
Template=#summary.getTemplate()#<br/>
</cfoutput>

Pretty much all of those methods should be self-explanatory, but I’ll point out some interesting ones. The getEditTime() function returns the total time spent editing the document. I assume that is related to how long I have the document open in my application. Not sure how I’d use that but it’s cool nonetheless. getSecurity() returns an integer that defines what type of security the document has (duh); the possible values are documented in the POI API docs. (I’ve copied the values to my test CFM file attached to this entry.)

Another method that I didn’t actually demonstrate above is getThumbnail(). This returns binary data for a thumbnail, in either WMF or BMP format. CF can work with BMP, but my test document must have had a WMF thumbnail. I was able to save the bytes to the file system but wasn’t able to actually do anything with the file. My Mac wanted to use Adobe Illustrator to view it, but AI complained that the file wasn’t valid. If we can get that working, it would be cool!
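To make the getSecurity() value a bit friendlier, here is a small sketch that turns the integer into labels. The values (0, 1, 2, 4, 8) are the ones listed in the POI docs for SummaryInformation. The function name describeSecurity is my own invention, and treating the values as combinable bit flags is an assumption on my part.

```cfml
<!--- hypothetical helper: decode the integer from summary.getSecurity() --->
<cffunction name="describeSecurity" returntype="array" output="false">
    <cfargument name="code" type="numeric" required="true">
    <cfset var labels = arrayNew(1)>
    <!--- assumption: the documented values can be combined as bit flags --->
    <cfif bitAnd(arguments.code, 1)><cfset arrayAppend(labels, "password protected")></cfif>
    <cfif bitAnd(arguments.code, 2)><cfset arrayAppend(labels, "read-only recommended")></cfif>
    <cfif bitAnd(arguments.code, 4)><cfset arrayAppend(labels, "read-only enforced")></cfif>
    <cfif bitAnd(arguments.code, 8)><cfset arrayAppend(labels, "locked for annotations")></cfif>
    <!--- nothing matched, so report the no-security case --->
    <cfif arrayLen(labels) eq 0><cfset arrayAppend(labels, "no security")></cfif>
    <cfreturn labels>
</cffunction>

<cfoutput>Security=#arrayToList(describeSecurity(summary.getSecurity()), ", ")#<br/></cfoutput>
```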

So how hard is it to update the metadata?

<cfset summary.setTitle("Ray changed this doc #randRange(1,100)#")>

<!--- open the same Word doc for writing --->
<cfset myfile2 = createObject("java","java.io.FileOutputStream").init(expandPath("./testdocs/Testing Reading Word Docs.doc"))>
<cfset doc.write(myfile2)>

I set a new, random title. I then create a FileOutputStream using the same file name and simply run the write method of the doc object. This works because the summary object points back to the original document, so even though I only modified summary, the doc object picks up the change as well.
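One thing the code above leaves out is closing the streams. Here is a minimal sketch, assuming myfile and myfile2 are the input and output streams created earlier. Both are plain java.io streams, so close() is available on each.

```cfml
<!--- flush and release the output stream once write() has finished --->
<cfset myfile2.close()>
<!--- the input stream from the original read can be released as well --->
<cfset myfile.close()>
```

Without this, the file handle may stay open until the JVM cleans it up, which can leave the document locked on some systems.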

Pretty simple, I think. One could of course wrap this up into a nice CFC and make it even simpler.

Download attached file.
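For what it’s worth, here is one way such a CFC might look. This is only a sketch: the component name WordMetadata, the method getSummary, and the choice of fields are all made up for illustration, and it assumes you hand it a JavaLoader instance that already knows about the POI jars.

```cfml
<!--- WordMetadata.cfc: hypothetical wrapper, all names are illustrative --->
<cfcomponent output="false">

    <cffunction name="init" access="public" returntype="any" output="false">
        <!--- a JavaLoader instance already pointed at the POI jars --->
        <cfargument name="loader" type="any" required="true">
        <cfset variables.loader = arguments.loader>
        <cfreturn this>
    </cffunction>

    <cffunction name="getSummary" access="public" returntype="struct" output="false">
        <!--- absolute path to a .doc file --->
        <cfargument name="path" type="string" required="true">
        <cfset var result = structNew()>
        <cfset var input = createObject("java","java.io.FileInputStream").init(arguments.path)>
        <cfset var doc = variables.loader.create("org.apache.poi.hwpf.HWPFDocument").init(input)>
        <cfset var summary = doc.getSummaryInformation()>
        <cfset input.close()>

        <!--- copy a few values into a plain struct;
              a Java null leaves the key undefined, so callers should use structKeyExists --->
        <cfset result.title = summary.getTitle()>
        <cfset result.author = summary.getAuthor()>
        <cfset result.pageCount = summary.getPageCount()>
        <cfset result.wordCount = summary.getWordCount()>
        <cfreturn result>
    </cffunction>

</cfcomponent>
```

Usage would then be something like:

```cfml
<cfset meta = createObject("component","WordMetadata").init(variables.loader)>
<cfset info = meta.getSummary(expandPath("./testdocs/Testing Reading Word Docs.doc"))>
<cfdump var="#info#">
```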

About Raymond Camden

Raymond is a developer advocate. He focuses on JavaScript, serverless and enterprise cat demos.

Lafayette, LA