Posted in ColdFusion | Posted on 07-19-2010 | 5,525 views
Back in January of this year I wrote a blog entry about just how poorly Solr was integrated into ColdFusion 9. I'll take partial blame for that as I didn't test it during the betas, but at the end of the day, the support for binary files was so poor there was no way I could recommend it. Database content - sure. But for any collection of files the integration failed miserably.
This is one of the areas that was touched in the 901 updater, and I'm very happy to see that Adobe did indeed do a great job improving their support for binary files. I did a quick test where I dumped some MP3, PDF, and Word docs into a folder and then indexed it. (I also tried an AVI for the heck of it - it didn't work though.) After the collection was created I did a few quick tests. Here is the result of a search for e*:

As you can see, it picked up quite a bit of information. Check out the summary on the MP3. It isn't clear unless you view source, but the data is line delimited and could be parsed if you wanted to.
I guess it shouldn't come as any surprise that "Yes, it did get fixed," but I'm pretty darn happy. Let me leave you with two quick notes.
First - the application I used to test searching against the collection is my CF Admin extension, CF Admin Searcher. It is a great tool for quickly testing ad hoc queries against your collections.
Secondly - on my Mac I ran into an issue the first time I tried to use Solr after my 901 update. Vinu (from Adobe) suggested checking ColdFusion9/solr/multicore/solr.xml to see if I had any collections there that didn't exist on the file system. I simply removed them all (except core0!), restarted Solr, and everything worked after that.


Have you had any issues with Solr and large files? I was doing some tests similar to yours yesterday and ran into a situation where Solr seems to be choking on one particular PDF file. The browser cranks for a while and eventually it stops and the center frame in the CF admin is blank. The document count on the collection screen remains the same (zero if it is a new collection). The PDF file in question is the coldfusion_9_cfmlref.pdf.
Note that Verity indexes the file with no problem.
Thanks for the CF Admin Searcher extension. It's been very handy!
Ray
"Error","jrpp-744","07/20/10","08:28:43","cfadmin","GC overhead limit exceeded The specific sequence of files included or processed is: C:\ColdFusion9\wwwroot\CFIDE\administrator\verity\indexcollection.cfm, line: 111 "
java.lang.OutOfMemoryError: GC overhead limit exceeded
I noticed that when I updated to CF901, my JVM min/max memory settings were reset to 512M as the max. I just changed that to 512/1024. I'm running the index again and will let you know. It's still ongoing.
When I got my collection created I tried an index. It took 60 seconds. So I've got an open request w/ Adobians to see what the heck. There is no reason why Solr should take that long.
First, to the core0 question: core0 was the default collection shipped earlier with CF9. You should be able to delete it now. The restriction to have at least one collection is no longer there.
Second, to the PDF: I think the issue is with that particular PDF. I am able to repro this. I was able to index 55 PDFs (excluding the CFML ref) of 90MB size in 3 minutes with the default Solr memory config of 256M. For indexing PDFs, we use the same APIs as CFPDF, and running CFPDF extracttext on this file causes an OOM (in ColdFusion and not Solr). We will investigate the issue with this particular PDF. Please exclude this file and let me know if you are able to index the other files.
Also, to tune Solr, you can increase the Solr buffer limit (default is 40) based on the Solr memory parameters, i.e. 80 for 512MB and so on. This will speed up the whole process.
I also just indexed 341 PDFs in approximately 5 to 8 minutes.
Renaming the extension to lower-case .pdf will fix the problem.
Error opening new_searcher exceeded limit of maxWarmingSearchers4_try_again_later Error_opening_new_searcher_exceeded_limit_of_maxWarmingSearchers4_try_again_later request: http://localhost:8983/solr/casecollection/update?c...,
If I reduce the number to 4 records at a time, no errors. If I increase it to 5 records, I start to generate the errors.
I need to index 60k+ files, so this is a bit of a concern.
Any suggestions?
Open solrconfig.xml under <collectiondirectory>/conf and look for maxWarmingSearchers. You can increase the value of that, but make sure Solr has enough memory allocated. Restart Solr. The default may not be sufficient for Solr to handle load in production. (http://help.adobe.com/en_US/ColdFusion/9.0/Develop...)
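For anyone hunting for the setting, here is a rough sketch of what the element looks like. It assumes the stock Solr layout, where maxWarmingSearchers sits inside the query section of solrconfig.xml; the value 8 is purely illustrative, not a recommendation, and the right number depends on how much memory Solr has.

```xml
<!-- solrconfig.xml (fragment). The value 8 is an illustrative assumption. -->
<query>
  <!-- other query settings left unchanged -->
  <maxWarmingSearchers>8</maxWarmingSearchers>
</query>
```

Restart Solr after saving the file so the new limit is picked up.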
It's odd because the issue occurs on a Windows 2003 server, yet the same set of documents indexes as expected on my XP dev machine.
I guess the next step is to open a ticket with Adobe.
I did change the extension to lower case as stated in an earlier post. It worked, but it seems odd that you would need to do this.
I put a try/catch around the cfindex tag and found that about 2 out of every 100 files generated that issue. So I put a cfthread sleep after each index to delay the calls to cfindex, and it fixed the issue. It's almost as if Solr couldn't handle so many 'index' requests hitting it one after the other.
<cfindex
action="update"
collection="#arguments.collection#"
type="file"
query="tmpQuery"
key="file"
URLpath="urlpath"
categoryTree="categorytree"
custom1="custom1"
>
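The try/catch plus sleep approach described above could look roughly like this. It is only a sketch under a few assumptions: fileList is a hypothetical query with a "file" column of absolute paths, myCollection is a placeholder collection name, and the 250 ms pause is a guess that would need tuning. It uses the sleep() function rather than cfthread action="sleep"; either should behave the same here.

```cfm
<!--- Assumed names: fileList (query with a "file" column) and myCollection are placeholders. --->
<cfset pauseMs = 250>
<cfset failures = []>

<cfloop query="fileList">
    <cftry>
        <!--- Index one file per call --->
        <cfindex action="update"
                 collection="myCollection"
                 type="file"
                 key="#fileList.file#">
        <cfcatch type="any">
            <!--- Record the failure and carry on with the next file --->
            <cfset arrayAppend(failures, fileList.file & ": " & cfcatch.message)>
        </cfcatch>
    </cftry>
    <!--- Pause before the next cfindex call so updates do not arrive back to back --->
    <cfset sleep(pauseMs)>
</cfloop>
```

Anything left in failures after the loop can be logged or indexed again later.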
http://cfbugs.adobe.com/cfbugreport/flexbugui/cfbu...
C:\ColdFusion9\solr\solr\conf
Vinu - do you *know* if this is the source for the solrconfig.xml that winds up in your collection directory? If not...what is the source? Is it accessible to developers somewhere?
So it turns out that with Solr, we can work around this by setting cfindex action="refresh". It has the same effect of emptying out the collection and rebuilding it, but this way, it doesn't stomp on its own configuration file. Hope this proves helpful to anyone else with this problem.
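For illustration, a refresh of a directory-based collection might look like the following. The collection name, directory, and extension list are placeholders I made up, so swap in your own.

```cfm
<!--- Assumed values: myCollection, the key path, and the extensions are placeholders. --->
<cfindex action="refresh"
         collection="myCollection"
         type="path"
         key="C:\docs\to\index"
         extensions=".pdf,.doc"
         recurse="true">
```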
<cfoutput>#name#</cfoutput>
which is meaningless.
The maxWarmingSearchers issue continues to be a major problem. Due to the design of our system, we cannot simply index the documents in a directory path; rather, we must call CFIndex for each individual file based upon language, etc., from the database.
Solr has a lot of configuration options; however, it appears that CFIndex, when it invokes Solr, specifies that each invocation should perform a commit, not wait to flush (wish I could teach the 13-year old that), and not wait for a searcher. See the line below:
Error_opening_new_searcher_exceeded_limit_of_maxWarmingSearchers32_try_again_later Error_opening_new_searcher_exceeded_limit_of_maxWarmingSearchers32_try_again_later request: http://localhost:8983/solr/indure132/update?commit...
Since it appears a commit is being forced on each index operation, the larger the collection, the worse the problem.
We have built a very poor workaround by sleeping for 1000ms between each document and trapping for the error, then sleeping for 60 seconds, but this is not a good long-term solution.
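A sketch of that workaround, for anyone stuck in the same spot. The function name indexWithRetry and its arguments are hypothetical, and matching on the text "maxWarmingSearchers" in the exception is an assumption based on the error shown above. The 1000 ms and 60 second pauses are the ones described; the 60 second sleep can easily run into the request timeout, so this likely belongs in a scheduled task or a request with a raised timeout.

```cfm
<!--- Hypothetical helper: index one file, backing off when Solr reports too many warming searchers. --->
<cffunction name="indexWithRetry" returntype="boolean" output="false">
    <cfargument name="collection" type="string" required="true">
    <cfargument name="filePath" type="string" required="true">
    <cfargument name="maxAttempts" type="numeric" default="3">
    <cfset var attempt = 0>

    <cfloop from="1" to="#arguments.maxAttempts#" index="attempt">
        <cftry>
            <cfindex action="update"
                     collection="#arguments.collection#"
                     type="file"
                     key="#arguments.filePath#">
            <!--- Short pause after a success so commits do not pile up --->
            <cfset sleep(1000)>
            <cfreturn true>
            <cfcatch type="any">
                <!--- Only retry the warming-searcher error; rethrow anything else --->
                <cfif findNoCase("maxWarmingSearchers", cfcatch.message & " " & cfcatch.detail) EQ 0>
                    <cfrethrow>
                </cfif>
                <!--- Long pause to let Solr finish warming before trying again --->
                <cfset sleep(60000)>
            </cfcatch>
        </cftry>
    </cfloop>

    <!--- Every attempt hit the warming-searcher limit --->
    <cfreturn false>
</cffunction>
```

Calling indexWithRetry("myCollection", somePath) for each database row returns false for any file that never got through, so those can be queued for a later pass.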
Any other ideas besides giving Adobe a swift kick in the #@$ to get a viable solution?
Cheers!