Have you noticed that indexed PDFs don't seem to contain all the content they should? It turns out this is a performance setting in Solr. Credit for the tip below goes to Uday Ogra of Adobe:
Solr has a default upper limit of 10000 on the number of words it will index per document, which works out to roughly 20-40 pages.
We can change this default value for each collection. Suppose the collection's name is newcollection.
Open the file COLDFUSION_COLLECTIONS_PATH/newcollection/conf/solrconfig.xml
Here COLDFUSION_COLLECTIONS_PATH is the path you would have configured while creating the collection.
In this file, search for the <mainIndex> tag. Inside this tag there is a sub-tag, <maxFieldLength>, which has a default value of 10000.
You can change it to a value which will suit your indexing.
(There is one more <maxFieldLength> tag directly under the <indexDefaults> tag; do not change it.)
In your case I would recommend changing it to 100000.
By the way, on average a single PDF page has around 200-500 words, so a 100-page PDF holds at most about 50,000 words. Setting this value to 100000 should therefore be safe enough.
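To make the change concrete, the relevant part of solrconfig.xml might look roughly like this after the edit. This is only a sketch based on the description above: the real file contains many other settings, and the enclosing <config> element and exact layout are assumptions, so check them against your own collection's file.

```xml
<!-- Sketch only: all other settings in solrconfig.xml are omitted. -->
<config>
  <indexDefaults>
    <!-- Leave this one at its default, as the tip advises. -->
    <maxFieldLength>10000</maxFieldLength>
  </indexDefaults>

  <mainIndex>
    <!-- Raised from the default of 10000 so longer PDFs are indexed in full. -->
    <maxFieldLength>100000</maxFieldLength>
  </mainIndex>
</config>
```

Documents that were already indexed under the old limit will most likely need to be reindexed before the extra content becomes searchable.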
Archived Comments
Coolio. This answers the question I've been looking at over here: http://forums.adobe.com/mes....
It'd be good if this made its way into the docs! (the *CF* docs, that is).
I also think it would be good to have this as an (optional) setting on CFCOLLECTION & the UI in CF Admin, rather than hacking config files. This *is* CF we're talking about after all!
I should perhaps raise an issue in the bug tracker...
--
Adam
To piggyback on Adam's comment: one can only hope that, with the removal of Verity, a new Admin area will be built to address this for Solr. Ray, any chance you can ping one of your co-workers to see if we need to add this as a feature request?
Adam/Steve: I think with all the options you can do in Solr, it would be difficult to create a web based admin to handle them all. That being said, it's not out of the realm of possibility to add _more_ stuff for some settings, like this one. Steve - I'd let Adam file the ER, let him then post the URL here, then you (and others) can vote on it.
Done:
http://cfbugs.adobe.com/cfb...
--
Adam
Is Solr part of CF Pro as well or just Enterprise? (I imagine in Pro it has limitations like Verity does/did)?
Just used to using Enterprise these days.
Further to my last comment, what a pain: CF7, CF8 and CF9 servers.
It's not easy to get my Porsche-driving boss to upgrade every licence (obviously); the coding changes alone are not worth it for a site that works fine. But...
"Fixing" a site that someone else made, getting it singing like a songbird, and then finding it breaks cos the server is CF8 or 7 and not 9 is kinda crushing. And it makes you feel like an idiot.
I will give my employer credit in that they, like me, have been using ColdFusion since at least 1995 when I started working there. And they still are. Hence the 7, 8 and 9. Probably a few 5 and 6s they haven't told me about yet.
Why did they switch to Cold Fusion (as it was known back then)? ASP (as it was also known back then) "just broke". And my boss, who hasn't changed a bit in that regard, needed it fixed fast.
Largest ISP in a capital city of a major nation. Gotta have a dynamic web site that works.
Within a working day (total novice to even ASP) I converted the ASP site to CF and neither of us have looked back since. CF was a floppy or quick download back then and it still amazes me how much better as a platform it has become.
Sixteen years later I still love ColdFusion!
Cheers!
Peter: It is part of all versions of CF. It has NO restrictions. It is the Awesome.
Do CF services need to be restarted after making this minor adjustment?
Just the CF Solr service.
Thanks for this fix.
I have a document section on our work site where the user can create a category to upload documents to. When they do this, it creates a new collection for that category. I won't know the user did this, and they won't have access to the server.
I read in your instructions that I can edit each individual collection. Is there a way to make the edit sitewide?
I'm using the Solr built in to CF9.
Thanks!
I don't believe so. Best I can recommend is to check the Solr docs, and see if they support server-wide directives/overrides (if that makes sense).
I'm not able to get Solr on CF10 to index pdf extensions through cfadmin or cfindex. It works fine on CF8. Any ideas on this? I noticed you're in Lafayette. I went to LSU and lived in Houma. I know lots of folks in your area. I'm in NC now.
I'm currently working with someone who is also having an issue with PDFs. Is there any chance you could send me one PDF that doesn't index?
I got my degree at USL (now ULL). Geaux Cajuns! :)
I sent the pdf to your gmail account.
Got it. I'm a bit behind, so it may be a while. I have to warn you - I've seen PDFs that refuse to index... with no error. I reported it about 2 weeks ago but haven't gotten a response yet.
Thank you for your reply. I just wanted to be sure you received it.
Ray, when I try to unsubscribe from this thread I get:
{quote}
Unsubscribe
You have not been unsubscribed from the thread. Please ensure you correctly copied the URL from the email.
{quote}
This is with URL http://www.raymondcamden.co..., which is the one that is in the footer of the email, and reflects the correct email addy for when I first subscribed.
Can you pls unsub me?
Cheers.
--
Adam
Odd... not sure why. Well, you were unsubbed.