Recently Ryan Stille (one of the new ColdFusion ACPs) posted a comment on my blog entry, Some Basic Solr/Verity Differences. In that comment he pointed out that he was noticing differences in the results returned by Verity and Solr. No big surprise there - but what was surprising was the lack of data returned by Solr. Spurred on by his comment I did some testing of my own and I have to say - I'm pretty disappointed. What follows are some findings from testing file based collections in Solr and Verity. I'll point out that all of this has been brought to Adobe, so I'm not just complaining but actively trying to get the problem fixed for ColdFusion 9.X.X (i.e., whatever comes next).
Before going further, please be sure you note the qualification I made above. These issues refer to file based collections of data. In other words, cases where you ask Verity/Solr to index files, like Word Docs, PDFs, and other binary formats. It does not refer to a collection that is built from your database.
For my testing I used Windows XP and a folder of 8 documents. This folder included 1 MP3, 4 PDFs, 2 Word docs, and 1 text file. I indexed the folder into both a Verity and a Solr collection using the ColdFusion Administrator. My tests were done using CF Admin Searcher, a ColdFusion Admin extension that lets you perform ad hoc queries against Verity and Solr collections. I basically opened the tool up in two tabs and performed the same search in both to do my comparisons. Here is what I found:
- The Summary field in results for Solr contained binary "junk". Verity cleaned this up. The result where I noticed it came from an MP3, which you expect to be 'dirty', but Verity correctly clears this from the summary. I also see these characters in PDF files. Word docs seemed fine though.
- Solr failed to pick up the TITLE value for any binary file. Verity got them all. I also saw this in other metadata fields; Solr missed the author field, for example.
- Solr failed to return any context values for results while Verity had no trouble. You should note that you have to perform a file edit to enable context with Solr (details here: http://help.adobe.com/en_US/ColdFusion/9.0/Developing/WSe9cbe5cf462523a0-5bf1c839123792503fa-8000.html) but even with this change there was no additional text in my context.
- While I wouldn't classify this as serious testing in any way, in every test, Verity searches were about twice as fast. Now we are talking about 15 versus 30 ms, which is not something to be concerned about, but Solr was supposed to be quite a bit faster. (To be fair, my test suites are so small as to not really be relevant.)
- The TYPE value for results is correct for Verity, but comes back as application/octet-stream for Solr. You don't need this column of course, you can sniff the extension, but still...
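Sniffing the extension could look something like the following. This is only a minimal sketch: the `mimeByExt` lookup table and the `guessType` function are hypothetical names made up for illustration, and `results` is assumed to be the query returned by an earlier cfsearch call, where the KEY column holds the file path for a file based collection.

```cfm
<cfscript>
// Hypothetical lookup table (an assumption, not part of ColdFusion);
// extend it with whatever file types your collection holds.
mimeByExt = {
    pdf = "application/pdf",
    doc = "application/msword",
    mp3 = "audio/mpeg",
    txt = "text/plain"
};

// Return a MIME type based on the extension of a file path,
// falling back to the generic type when the extension is unknown.
function guessType(path) {
    var ext = lCase(listLast(arguments.path, "."));
    if (structKeyExists(mimeByExt, ext)) {
        return mimeByExt[ext];
    }
    return "application/octet-stream";
}
</cfscript>

<!--- "results" is assumed to come from an earlier cfsearch call. --->
<cfoutput query="results">
    #htmlEditFormat(results.key)#: #guessType(results.key)#<br>
</cfoutput>
```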
All in all, this is disappointing. I don't think I can recommend Solr (specifically, Solr as bundled in ColdFusion 9) for production use... at least for a collection that is heavily file based. You can, of course, do post-processing of search results to get metadata. ColdFusion 9 supports getting metadata from Word docs now, but you have to convert it to PDF first. That's something you would definitely want to cache though.
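As a rough sketch of that kind of post-processing, the code below fills in a missing title for PDF results using cfpdf's getInfo action and keeps the metadata in the application scope so each file is only read once. The collection name `docs`, the `form.criteria` field, and the `application.pdfMeta` cache are assumptions made up for this example, and it only covers PDFs.

```cfm
<!--- Assumes the application scope is enabled for this site. --->
<cfparam name="application.pdfMeta" default="#structNew()#">

<!--- "docs" and form.criteria are hypothetical names. --->
<cfsearch collection="docs" criteria="#form.criteria#" name="results">

<cfoutput query="results">
    <cfset docTitle = results.title>
    <cfif NOT len(trim(docTitle)) AND listLast(results.key, ".") EQ "pdf">
        <!--- Read the PDF metadata once per file, then reuse the cached copy. --->
        <cfif NOT structKeyExists(application.pdfMeta, results.key)>
            <cfpdf action="getInfo" source="#results.key#" name="info">
            <cfset application.pdfMeta[results.key] = info>
        </cfif>
        <cfset docTitle = application.pdfMeta[results.key].Title>
    </cfif>
    <!--- Fall back to the file name when no title is available. --->
    <cfif NOT len(trim(docTitle))>
        <cfset docTitle = getFileFromPath(results.key)>
    </cfif>
    #htmlEditFormat(docTitle)#<br>
</cfoutput>
```

A simple struct is used as the cache here to keep the example short; anything that survives between requests would do.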
As I said, both Ryan and I shared our findings with Adobe so I'm sure it will get corrected in the future. People making a decision about search support should consider carefully though. I don't think anyone thinks Verity will last much longer. Solr is definitely the future. But we've got a few bumps in the road to get past first.
Archived Comments
Isn't it true that Verity doesn't support the new Office 2007 file formats? This was a deal-breaker for me. It made Solr wonderful for me compared to Verity despite other faults.
Fire me off an example and I'll add it to my test. Be sure to send me a snippet of text in case I can't open it.
Thanks for taking the time to test and post your findings. On one of my sites I have a large .pdf file based collection that I planned to move from verity to solr. I think I will now wait as these issues will cause headaches for me that I would rather not have right now.
@Josh - Thanks for the test file. As you expected, Verity did NOT parse the DOCX. Solr did. Like before though it missed all the metadata. It did, though, correctly match inside. Oh - one change - it DID do context this time.
I will point out that most of the problems you've encountered are with Adobe's CF implementation of Solr, not with Solr itself. Ditching cfindex/cfsearch and using Solr's API directly ends up being far more useful, though nowhere near as easy.
I *really* need to clean up my cfsolr library and get that released :(
Why not just build off of my Seeker OS project? ;) Although it is Lucene, not Solr.
Second what Shannon says, this is ColdFusion, not Solr.
We've gone the standalone solr route and couldn't be happier. We use Tika for binary text extraction and don't have the issues described above.
Good points on the "It isn't Solr's fault" thread. Do you guys think I don't make that clear enough? The comments should clear it up of course, but if it needs clarifying even more, I can add a note to the end of the entry.
@Ray: I think when you read the blog and come across statements like "I don't think I can recommend Solr for production user" - it means you're blaming Solr. :)
Good point - I added a qualifier and I fixed the typo.
Is there any problem continuing to use Verity when upgrading to ColdFusion 9?
@Sid: I don't believe so. Again - note that Adobe says Solr is much faster. As I said above, I do not think my speed tests are very accurate. I think, most likely, you will find Solr faster. That being said, Verity should be "good enough". I do think Verity will be removed soon. But it's not like you will be surprised by this. If Verity is removed in CF10, you will have plenty of time to migrate.
Hi @Shannon or @Peter (or anyone else) - do you have any documentation and/or CFC wrappers to share for a Standalone SOLR implementation with CF? We're on CF9, but have over 700,000 documents (think CV's and resumes), so we don't want to break old search functionality when upgrading from Verity to SOLR.
@aaron
http://cfsolrlib.riaforge.org/
It seems like some of your issues were addressed in 9.0.1 according to the doc: http://www.adobe.com/suppor...
I'd be interested in a followup post.
Yep, if you saw my post on the 9.0.1 release, I mentioned this specifically. A follow up is planned.
CF901 blog post: http://www.coldfusionjedi.c...
Is the lack of MS Office 2007 support in Verity new to CF9? We are currently on CF8 with a pretty large doc store 280k+ docs indexed and working fine with Word 2007 files. As one of the other comments said... think CV's indexed and search-able so I know I would hear about it if Word 2007 files did not show up. :-)
Basically Verity hasn't been updated in 100 years. (Well the embedded version.) I'd consider testing in Solr and only Solr.
I realize this is pretty old, but is there any easy way to include a PDF or Excel or Word file creation date in the initial solr CFINDEX? Is it possible to do this with a custom field?
If you create the index manually - sure - but then you don't get the auto text extraction. Actually wait - if you do a file/dir based index, and then follow it up with manual updates, then yes, that should work fine.
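A rough, untested sketch of one way to try this for PDFs only. Note that it is a variation on the idea above: instead of a directory index followed by manual updates, it indexes each file individually with type="file" so a value can be passed in custom1 at the same time. The folder path and the collection name `docs` are hypothetical, and the creation date is stored exactly as cfpdf reports it.

```cfm
<!--- Hypothetical folder and collection names; adjust for your setup. --->
<cfset docRoot = "C:\docs">
<cfdirectory action="list" directory="#docRoot#" filter="*.pdf" name="pdfs">

<cfloop query="pdfs">
    <cfif pdfs.type EQ "File">
        <cfset path = pdfs.directory & "/" & pdfs.name>
        <!--- Read the creation date from the PDF metadata; leave it blank on failure. --->
        <cftry>
            <cfpdf action="getInfo" source="#path#" name="info">
            <cfset created = info.Created>
            <cfcatch type="any">
                <cfset created = "">
            </cfcatch>
        </cftry>
        <!--- Index the file itself and store the date in custom1. --->
        <cfindex action="update" collection="docs" type="file"
                 key="#path#" custom1="#created#">
    </cfif>
</cfloop>
```

Word and Excel files would need a different source for the date, since cfpdf only reads PDFs.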
In reference to Ray's comments regarding the context field not displaying any additional information, I found the same situation. I altered the XML files as described but no new information showed up in the context field. Verity had useful information to display on the search results page. Any suggestions on how to retrieve the same sort of information in the context field?
Ronnie, a few things:
a) Are you running 901+the latest hot fixes?
b) Did you remember to _ask_ for context with the contextPassages attribute?
c) Did you reindex your data?
Ray,
1. /opt/coldfusion9/lib/updates/hf901-00002.jar
2. <cfsearch collection="#coll#" name="Getresults" startrow="#startRow#" maxrows="#maxRow#" contextpassages="1"
contextbytes="500" contexthighlightbegin="<strong>" contextHighlightEnd="</strong>"
criteria="#searchFor#" suggestions="Always" status="info" language="English" >
3. Not explicitly. Each time the scheduled task that builds the collection (weekly) runs, it purges the collection and then rebuilds it. I assumed the rebuild indexed everything, so it would never need to be reindexed per se.
So yeah, if you purge and reindex it should be refreshed. Odd. At this time I'll have to punt and suggest calling Adobe for support. You mentioned you altered the XML files. If you try with a new collection, and don't change the XML, do you see anything different?
Ray, first off, I was able to fix the previous problem by making the changes to the config files. It didn't happen right away but I think that after our SAs bounced our server and I reran the collection build all was well. Now for the next problem. I posted this to another of your blog entries, so forgive me if it's a duplicate.
Our team has been in the process of moving our site search from Verity to Solr. We are running CF9. I am able to build the collections and search them but have run into a problem. Hopefully, someone can shed some light on a solution. We are using cfincludes, which can be either cfm or html files. When the search results are displayed and a user clicks a link that points to one of the included files, none of the CSS is applied, since the CSS normally comes from the file containing the cfinclude. Is there any way to have Solr display the link to the file that contains the cfinclude, with all the CSS displayed properly, as opposed to the included file?
I responded to your question on the other comment you posted.
Is there an issue with indexing .zip files and password encrypted PDFs?
I thought I had read a few months ago (most likely longer) that password encrypted PDFs were an issue.
Garth
Not sure. The easiest thing to do though is to write a quick test and report back to us. :)
I was unable to index password encrypted PDFs. I wasn't sure if there was a setting or Solr config file I could change.
In the meantime I found some information regarding this on http://wiki.apache.org/solr... :)
It appears that you cannot with the version (1.3) that comes with CF 9.
I tried to index zip files via the CF Admin interface. This errored out.
My thought was the same as above... maybe I needed to add something to the config file.
I haven't found anything yet on http://wiki.apache.org/solr, but will keep looking.
Thanks