I've had a chance to play a bit more with the Apache POI project and I thought I'd share a bit of code demonstrating how to read Office document metadata. Unfortunately I was not able to get this working with Office 2007, but maybe I'll get lucky with reader Leah again!
Anyway, the code:
<!--- where the poi files are --->
<cfset jarpath = expandPath("./jars2")>
<cfset paths = []>
<cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">
<cfloop query="files">
<cfset arrayAppend(paths, directory & "/" & name)>
</cfloop>
<!--- load javaloader --->
<cfset variables.loader = createObject("component", "javaloader.JavaLoader").init(paths)>
This is the exact same code I've used in my previous two blog entries so I won't explain it again.
<!--- read in my Word doc --->
<cfset myfile = createObject("java","java.io.FileInputStream").init(expandPath("./testdocs/Testing Reading Word Docs.doc"))>
Unlike my previous entries where I looped over a folder of documents, in this example I'm just using one specific Word document.
<!--- Word Support --->
<cfset doc = loader.create("org.apache.poi.hwpf.HWPFDocument")>
<!--- init it with my java file input stream set to my test file --->
<cfset doc = doc.init(myfile)>
Next we create an instance of HWPFDocument. This is the specific class used for Word documents. You would use something different for PPT or Excel files. Once I create the class I pass in the file I specified earlier.
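For what it's worth, here is a rough sketch of what "something different" might look like for a binary Excel file. It assumes POI's HSSFWorkbook class (the binary .xls counterpart to HWPFDocument), reuses the loader variable from above, and points at a made-up example.xls file, so treat the names as illustrative rather than tested.

```cfml
<!--- Sketch only: example.xls is a hypothetical file name --->
<cfset xlsStream = createObject("java", "java.io.FileInputStream").init(expandPath("./testdocs/example.xls"))>
<!--- HSSFWorkbook handles binary Excel workbooks, the way HWPFDocument handles Word --->
<cfset workbook = loader.create("org.apache.poi.hssf.usermodel.HSSFWorkbook").init(xlsStream)>
<!--- the workbook has been read into memory, so the stream can be closed --->
<cfset xlsStream.close()>
```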
Ok, now for the fun part:
<cfset summary = doc.getSummaryInformation()>
This code retrieves a set of summary data from the document. This is another Java object itself with a set of methods to get, set, and remove document metadata. As an example:
<cfoutput>
Title=#summary.getTitle()#<br/>
Page Count=#summary.getPageCount()#<br/>
Word Count=#summary.getWordCount()#<br/>
Application=#summary.getApplicationName()#<br/>
Author=#summary.getAuthor()#<br/>
Comments=#summary.getComments()#<br/>
CreateDateTime=#summary.getCreateDateTime()#<br/>
Edit Time=#summary.getEditTime()#<br/>
Keywords=#summary.getKeywords()#<br/>
Last Author=#summary.getLastAuthor()#<br/>
Last Printed=#summary.getLastPrinted()#<br/>
Last SaveDateTime=#summary.getLastSaveDateTime()#<br/>
Revision Number=#summary.getRevNumber()#<br/>
Security=#summary.getSecurity()#<br/>
Subject=#summary.getSubject()#<br/>
Template=#summary.getTemplate()#<br/>
</cfoutput>
Pretty much all of those methods should be self-explanatory, but I'll point out some interesting ones. The getEditTime() function actually returns how long the document has been edited. I assume that is related to how long I have the document open in my application. Not sure how I'd use that but it's cool nonetheless. getSecurity returns an integer that defines what type of security the document has (duh), and is documented in the POI API docs. (I've copied the values to my test CFM file attached to this entry.) Another method that I didn't actually demonstrate above is getThumbnail(). This returns binary data for a thumbnail. The data is either in WMF or BMP format. CF can work with BMP, but my test document must have had a WMF thumbnail. I was able to save the bits to the file system but wasn't able to actually do anything with it. My Mac wanted to use Adobe Illustrator to view it, but AI complained that the file wasn't valid. If we can get that working, it would be cool!
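As a small illustration of the getSecurity() value, the sketch below turns the integer into readable labels. The values (1, 2, 4, 8) are the ones I recall from the POI SummaryInformation docs, and treating them as combinable bit flags is my assumption, so check the API docs for your POI version before relying on it. It reuses the summary variable from above.

```cfml
<!--- Sketch: decode the getSecurity() integer into labels (values assumed from the POI docs) --->
<cfset securityFlags = summary.getSecurity()>
<cfset securityLabels = []>
<cfif bitAnd(securityFlags, 1)><cfset arrayAppend(securityLabels, "password protected")></cfif>
<cfif bitAnd(securityFlags, 2)><cfset arrayAppend(securityLabels, "read-only recommended")></cfif>
<cfif bitAnd(securityFlags, 4)><cfset arrayAppend(securityLabels, "read-only enforced")></cfif>
<cfif bitAnd(securityFlags, 8)><cfset arrayAppend(securityLabels, "locked for annotations")></cfif>
<!--- 0 means no security, or no security field at all --->
<cfif arrayLen(securityLabels) eq 0><cfset arrayAppend(securityLabels, "none")></cfif>
<cfoutput>Security=#arrayToList(securityLabels, ", ")#<br/></cfoutput>
```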
So how hard is it to update the metadata?
<cfset summary.setTitle("Ray changed this doc #randRange(1,100)#")>
<!--- write the modified doc back to the same file --->
<cfset myfile2 = createObject("java","java.io.FileOutputStream").init(expandPath("./testdocs/Testing Reading Word Docs.doc"))>
<cfset doc.write(myfile2)>
<!--- close the stream so the file is flushed and released --->
<cfset myfile2.close()>
I set a new, random title. I then create a FileOutputStream using the same file name and simply run the write method of the doc object. This works because the summary object points back to the original document, so even though I modified summary, it updated the doc object as well.
Pretty simple I think. One could wrap this up into a nice CFC and make it even simpler of course.
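To give a feel for what such a wrapper might look like, here is a minimal sketch. The component name WordMeta.cfc and its method names are hypothetical, it only handles binary .doc files, and it expects to be handed a JavaLoader instance that already has the POI jars loaded.

```cfml
<!--- WordMeta.cfc: hypothetical wrapper, binary .doc files only --->
<cfcomponent output="false">

    <cffunction name="init" access="public" returntype="any" output="false">
        <cfargument name="loader" type="any" required="true" hint="JavaLoader instance with the POI jars loaded">
        <cfset variables.loader = arguments.loader>
        <cfreturn this>
    </cffunction>

    <cffunction name="getSummary" access="public" returntype="any" output="false">
        <cfargument name="path" type="string" required="true" hint="Absolute path to a .doc file">
        <cfset var stream = createObject("java", "java.io.FileInputStream").init(arguments.path)>
        <cfset var doc = variables.loader.create("org.apache.poi.hwpf.HWPFDocument").init(stream)>
        <!--- the document is in memory now, so release the file --->
        <cfset stream.close()>
        <cfreturn doc.getSummaryInformation()>
    </cffunction>

</cfcomponent>
```

Usage would then be something along these lines:

```cfml
<cfset meta = createObject("component", "WordMeta").init(variables.loader)>
<cfset summary = meta.getSummary(expandPath("./testdocs/Testing Reading Word Docs.doc"))>
```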
Archived Comments
The shortest route I found was to use the ExtractorFactory. Since it can handle both formats, I used it as a generic shortcut to grab the underlying properties. (Unfortunately, I could not find a ReaderFactory.)
The ooxml format has different property types: core, extended and custom. So to determine the file format, I used java's isInstance method. CF's IsInstanceOf method did not seem to work with the objects I created using the javaLoader.
ExtractorFactory = javaLoader.create("org.apache.poi.extractor.ExtractorFactory");
inputFile = createObject("java", "java.io.File").init( "c:\myFiles\testExcel2007.xlsx" );
extractor = ExtractorFactory.createExtractor( inputFile );
// Determine the format
POIXMLTextExtractor = javaLoader.create("org.apache.poi.POIXMLTextExtractor");
isFormatOOXML = POIXMLTextExtractor.getClass().isInstance( extractor );
if (isFormatOOXML)
{
// extract core properties (author, title, etcetera...)
coreProp = extractor.getCoreProperties().getUnderlyingProperties();
WriteOutput("Creator = " & coreProp.getCreatorProperty().getValue() & "<br>");
WriteOutput("Title = " & coreProp.getTitleProperty().getValue() & "<br/>");
// ...
}
else
{
summary = extractor.getSummaryInformation();
WriteOutput("Title="& summary.getTitle() &"<br/>");
WriteOutput("Author="& summary.getAuthor() &"<br/>");
// ....
}
Nice, I tried like heck to figure out how to get props for 2007 and I just couldn't figure it out. Thank you!
You are welcome. Sometimes the api only gets you so far... Eclipse's debug mode for java code often fills in the gaps. (It definitely did this time ;-)
So I asked this question on the earlier post. Does it make sense to turn this into a CFC? There is a metadata project at RIAForge, but I believe it hasn't been updated since 06. Also, I don't want to step on Ben Nadel's toes. His code does everything Excel related. But maybe a CFC that _just_ does text and metadata read/writes would be useful?
No question there would be some overlap. But a separate cfc might be useful. I could see cases where you might want to extract the text or metadata from _any_ office document, not Excel specifically.
I've spoken with Ben and he agrees. I'll see what I can whip up. I assume I have full rights to steal I mean innovate from you, Leah? ;) I'll credit you by URL to help respect your privacy.
Well, it is okay with me. But you will have to ask Leah too ;-)
Ray, thanks so much for the post. I'm trying to batch extract about 4,000 Word docs and POI is doing a great job until it hits a *.doc file that it determines is actually an RTF file. I know POI doesn't support RTF, but I've run into a wall trying to determine the best way to ignore or possibly reclassify the file and continue processing. Any ideas?
Why not just try/catch the call?
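A rough sketch of that try/catch idea, assuming the documents sit in a ./testdocs folder and that loader is the JavaLoader instance from the post. The docFiles, skipped, and currentPath names are made up for the example.

```cfml
<!--- Sketch: skip files POI cannot open and keep processing the rest --->
<cfset skipped = []>
<cfdirectory action="list" name="docFiles" directory="#expandPath('./testdocs')#" filter="*.doc">
<cfloop query="docFiles">
    <cfset currentPath = docFiles.directory & "/" & docFiles.name>
    <cfset stream = createObject("java", "java.io.FileInputStream").init(currentPath)>
    <cftry>
        <cfset doc = loader.create("org.apache.poi.hwpf.HWPFDocument").init(stream)>
        <!--- ... extract text or metadata here ... --->
        <cfcatch type="any">
            <!--- not a real Word binary (RTF, corrupt file, etc.): remember it and move on --->
            <cfset arrayAppend(skipped, currentPath & ": " & cfcatch.message)>
        </cfcatch>
    </cftry>
    <!--- always release the file before moving to the next one --->
    <cfset stream.close()>
</cfloop>
```

After the loop, the skipped array holds the files that failed along with the error message, which makes it easier to review or reclassify them later.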
I'm using the POI tools to successfully read the text in from a Word 2007 (docx) file, but for some reason when it completes reading the file, it will not release the document so that I can delete it. Even when I try to manually browse to the file, Windows tells me that it cannot delete the file because jrun.exe is still using it. My Word 2003 (doc) files do not exhibit this behaviour. Any clues? Some code is below:
<!--- where the poi files are --->
<cfset jarpath = ABSPath & "\jars">
<cfset paths = []>
<cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">
<!--- Get all the jar files to load --->
<cfloop query="files">
<cfset arrayAppend(paths, directory & "/" & name)>
</cfloop>
<!--- Load javaloader --->
<cfset server.loader = createObject("component", "#CFHome#/components.javaloader.JavaLoader").init(paths)>
<!--- Generic file reader --->
<cfset myfile = createObject("java","java.io.File")>
<cfset myfile.init("#UploadPath##currFileName#")>
<!--- Init the extractor factory --->
<cfset extractorFactory = server.loader.create("org.apache.poi.extractor.ExtractorFactory")>
<!--- Create Extractor --->
<cfset extractor = extractorFactory.createExtractor(myFile)>
<!--- Get Summary Info
<cfset summary = extractor.getSummaryInformation()>
<cfoutput><pre>#summary.getPageCount()#</pre></cfoutput>
--->
<!--- Get our page count --->
<cfset PagesFound = REMatch(Session.KeyPhrase,extractor.getText())>
<cfset PageCounter = ArrayLen(PagesFound)>
NOTE: I added this to see whether releasing the objects would let the file go, but it made no difference.
<!--- Close the file? --->
<cfset extractor = "">
<cfset extractorFactory = "">
<cfset myfile = "">
Weird - I looked around for a method that would possibly close the connection, but I do not see one, and I see no reason why the extractor would even need to keep it open. Maybe it is an Office 2007 versus earlier issue?
It seems like it has something to do with the underlying java code. It creates a PushbackInputStream to read the first few bytes of the file to determine if it is binary or ooxml. Try using the createExtractor() method that accepts an InputStream:
...
<cfset fis = createObject("java", "java.io.FileInputStream").init(myFile)>
<cfset extractorFactory = server.loader.create("org.apache.poi.extractor.ExtractorFactory")>
<cfset extractor = extractorFactory.createExtractor(fis)>
...etcetera ..
<cfset fis.close()>
Ray/Leigh,
Thanks for your help! Changing to the FileInputStream did the trick and it was able to purge the file after reading it.
Thanks again!
Hey there - great post, and hoping this is the answer to what I'm looking to do. Trying to extract summary information from DOCX files on CF9, using (what I think are) the latest POI jars. Getting "The getSummaryInformation method was not found" error when trying to get summary info. I'm specifically trying to extract the create date/time from the docs. Any insight? I'm currently saying:
<cfset myfile = createObject("java","java.io.FileInputStream").init(expandPath("UPLOAD_DOCS/#DOCMT_URL_TXT#"))>
<cfset extractorFactory = createObject("java","org.apache.poi.extractor.ExtractorFactory")>
<cfset extractor = extractorFactory.createExtractor(myFile)>
<cfset summary = extractor.getSummaryInformation()>
<cfoutput><pre>#summary.getCreateDateTime()#</pre></cfoutput>
> using (what I think are) the latest POI jars.
Not quite, but they are close enough (3.5-beta). The metadata is slightly different for binary and ooxml files. Take a look at the first comment above. It utilizes different properties for the two formats.
For ooxml documents, try using the extractor's core properties.
ie
core = extractor.getCoreProperties();
created = core.getCreated();
title = core.getTitle();
etcetera ...
HTH
-Leigh
I'm having trouble setting custom metadata on a Word document using the ExtractorFactory method, after not much luck with the. I thought the following would work, and it's not throwing any errors, but I can't help thinking I'm missing something. Anyone have any ideas?
Thanks in advance
http://pastebin.com/bpff6cmY
In case anyone is wondering. I have managed to solve my earlier problem. I posted a solution on my blog:
http://www.rbrasier.com/201...
Very cool - thanks for sharing your solution.
Anyone find a solution to updating properties for Office 07 documents yet?
I've read about org.apache.poi.xwpf.usermodel.XWPFDocument and tried to implement it, but I'm having trouble initializing it.
<cfset javaloader = createObject("component", "home.cfcs.javaloader.JavaLoader").init(paths)>
<cfset fis = createObject("java","java.io.FileInputStream")>
<cfset theFile = fileQuery.directory & "/" & fileQuery.name>
<cfset fis.init(theFile)>
<cfset docx = javaloader.create("org.apache.poi.xwpf.usermodel.XWPFDocument").init(fis)>
The last code gives me an "Object instantiation exception" error.
The last code - so you mean the last line?
Yes, the last line. I also tried
<cfset docx = createObject("java","org.apache.poi.xwpf.usermodel.XWPFDocument").init(createObject("java","java.io.FileInputStream").init(theFileJava))>
but it gives me an "Unable to find a constructor for class org.apache.poi.xssf.usermodel.XSSFWorkbook that accepts parameters of type ( java.io.FileInputStream ).". I'm stuck on this because I know it should accept the FileInputStream because it's a child of InputStream. I'm using CF9, not sure if it matters.
Thanks in advance.
Did you verify you are passing in a valid file path ie FileExists(theFile)?
<i>I know it should accept the FileInputStream because it's a child of InputStream</i>
True, but it was created by a different class loader ie createObject versus javaLoader which can cause problems in some cases. Try creating both objects with the javaLoader.
Yes, it's a valid file.
I tried this:
<cfset docx = javaloader.create("org.apache.poi.xwpf.usermodel.XWPFDocument").init(fis)>
but it's giving me an "Object instantiation exception"
That is just a generic error you get whenever something goes wrong with a java object. Look in the stack trace. That is where you will find the real exception.
Thank you so much for a very prompt reply. I do appreciate it.
I modified the code a bit to make it clear:
<cfscript>
theFileCF = fileQuery.directory & "\" & fileQuery.name;
docx = createObject("java","org.apache.poi.xwpf.usermodel.XWPFDocument").init(createObject("java","java.io.FileInputStream").init(theFileCF));
</cfscript>
this still gives me an "Unable to find a constructor for class org.apache.poi.xwpf.usermodel.XWPFDocument that accepts parameters of type ( java.io.FileInputStream ). "
tried to use the javaloader:
<cfscript>
theFileCF = fileQuery.directory & "\" & fileQuery.name;
docx = javaloader.create("org.apache.poi.xwpf.usermodel.XWPFDocument").init(javaloader.create("java.io.FileInputStream").init(theFileCF));
</cfscript>
but it gives me an error "An exception occurred while instantiating a Java object. The class must not be an interface or an abstract class."
I use the poi jars from poi-bin-3.8-20120326
Forget about the "..class must not be an interface or an abstract class" error. It is usually meaningless, just a throwaway header to say "oops, something went wrong". To find out what that "something" is, you have to review the stack trace.
http://help.adobe.com/en_US...
here's the stack trace:
Stack Trace
at cffilename_search_action2ecfm454045983.runPage(D:\Websites\Home\Employee Services\Policies & Procedures\filename_search_action.cfm:153)
coldfusion.runtime.java.JavaProxy$NoSuchConstructorException: Unable to find a constructor for class org.apache.poi.xwpf.usermodel.XWPFDocument that accepts parameters of type ( java.io.FileInputStream ).
at coldfusion.runtime.java.JavaProxy.CreateObject(JavaProxy.java:178)
at coldfusion.runtime.java.JavaProxy.invoke(JavaProxy.java:80)
at coldfusion.runtime.CfJspPage._invoke(CfJspPage.java:2360)
at cffilename_search_action2ecfm454045983.runPage(D:\Websites\Home\Employee Services\Policies & Procedures\filename_search_action.cfm:153)
at coldfusion.runtime.CfJspPage.invoke(CfJspPage.java:231)
at coldfusion.tagext.lang.IncludeTag.doStartTag(IncludeTag.java:416)
at coldfusion.filter.CfincludeFilter.invoke(CfincludeFilter.java:65)
at coldfusion.filter.ApplicationFilter.invoke(ApplicationFilter.java:363)
at coldfusion.filter.RequestMonitorFilter.invoke(RequestMonitorFilter.java:48)
at coldfusion.filter.MonitoringFilter.invoke(MonitoringFilter.java:40)
at coldfusion.filter.PathFilter.invoke(PathFilter.java:87)
at coldfusion.filter.LicenseFilter.invoke(LicenseFilter.java:27)
at coldfusion.filter.ExceptionFilter.invoke(ExceptionFilter.java:70)
No, sorry I meant the stack trace from the other example. (If the full trace is too long for comments, use pastebin).
please see
http://pastebin.com/EX97PSxS
Perfect, thanks. When you create the javaLoader, try setting the loadColdFusionClassPath property to true:
javaLoader.init(loadPaths=paths, loadColdFusionClassPath=true);
Hello, I tried the code but I'm still getting the same error
here's my code http://pastebin.com/RReHJ77m
and here's the stack trace http://pastebin.com/24zXEcQQ
Thanks in advance for your patience.
Basically the problem is the dom4j jar which is used by POI and CF. It is notorious for causing problems when multiple class loaders are used. In short, omit that jar and use the loadColdFusionClassPath=true and it should work. See example here: http://pastebin.com/iEafTZqq
Though if all you need is the properties, it is simpler to use the ExtractorFactory. Then you will not need separate code for the different types of POI files. There is an example above in the initial comments.
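For readers who cannot reach the pastebin link, here is a rough sketch of the approach described above: build the jar list without dom4j, then create the JavaLoader with loadColdFusionClassPath=true so ColdFusion's own copy is used. The ./jars2 folder and the javaloader.JavaLoader component path are borrowed from the original post, and matching on "dom4j" in the file name is an assumption about how the jar is named.

```cfml
<!--- Sketch: load every POI jar except dom4j, and fall back to CF's class path --->
<cfset jarpath = expandPath("./jars2")>
<cfset paths = []>
<cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">
<cfloop query="files">
    <!--- skip any dom4j jar so only ColdFusion's copy is ever loaded --->
    <cfif NOT findNoCase("dom4j", files.name)>
        <cfset arrayAppend(paths, files.directory & "/" & files.name)>
    </cfif>
</cfloop>
<cfset javaLoader = createObject("component", "javaloader.JavaLoader").init(loadPaths=paths, loadColdFusionClassPath=true)>
<!--- create POI objects through the same loader from here on --->
<cfset extractorFactory = javaLoader.create("org.apache.poi.extractor.ExtractorFactory")>
```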
Thank you so much, Leigh!!! It worked!
I omitted the dom4j jar, used loadPaths=paths and loadColdFusionClassPath=true, and used the ExtractorFactory.
Great! Glad to hear it is all sorted out :)