Indexing PDFs with Solr? Read this tip.

This post is more than 2 years old.

Have you noticed that indexed PDFs don't seem to contain all the content they should? It turns out this is caused by a performance setting in Solr. The tip below comes courtesy of Uday Ogra of Adobe:

Solr has a default upper limit of 10000 on the number of words it will index per document, which works out to roughly 20-40 pages.

We can change this default value for each collection. Suppose the collection's name is newcollection.

Open the file COLDFUSION_COLLECTIONS_PATH/newcollection/conf/solrconfig.xml

Here COLDFUSION_COLLECTIONS_PATH is the path you would have configured while creating the collection.

In this file, search for the <mainIndex> tag. Inside it there is a sub-tag, <maxFieldLength>, which has a default value of 10000.

You can change it to a value which will suit your indexing.

(There is one more <maxFieldLength> tag directly under the <indexDefaults> tag. Do not change that one.)

In your case I would recommend changing it to 100000.
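For anyone who would rather script the change than edit by hand, here is a minimal Python sketch of the same edit. It works on a trimmed-down sample string rather than a real solrconfig.xml; the sample layout, the function name raise_main_index_limit, and the <mainIndex> capitalization are assumptions for illustration, so check them against your own file first. Note that ElementTree drops XML comments when it parses, so back up the real file before rewriting it this way.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a collection's solrconfig.xml (assumed layout,
# trimmed to the two sections mentioned in the tip).
SAMPLE_CONFIG = """<config>
  <indexDefaults>
    <maxFieldLength>10000</maxFieldLength>
  </indexDefaults>
  <mainIndex>
    <maxFieldLength>10000</maxFieldLength>
  </mainIndex>
</config>"""


def raise_main_index_limit(config_xml: str, new_limit: int) -> str:
    """Return config_xml with <mainIndex>/<maxFieldLength> set to new_limit.

    The <maxFieldLength> under <indexDefaults> is deliberately left alone.
    """
    root = ET.fromstring(config_xml)
    node = root.find("./mainIndex/maxFieldLength")
    if node is None:
        raise ValueError("no <maxFieldLength> found under <mainIndex>")
    node.text = str(new_limit)
    return ET.tostring(root, encoding="unicode")


updated = raise_main_index_limit(SAMPLE_CONFIG, 100000)
print(updated)
```

Only the value under <mainIndex> changes; the one under <indexDefaults> stays at 10000.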

By the way, on average a single PDF page has around 200-500 words, so for a PDF with 100 pages, setting this value to 100000 should be safe enough.
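The arithmetic behind that estimate is easy to check. This is a rough sketch only: it uses the 200-500 words per page figure quoted above and treats each word as one unit counted against the limit. The helper names are made up for illustration.

```python
def pages_covered(max_field_length: int, words_per_page: int) -> float:
    """Roughly how many pages fit under a given maxFieldLength."""
    return max_field_length / words_per_page


def words_needed(pages: int, words_per_page: int) -> int:
    """Roughly how many words a PDF with this many pages contains."""
    return pages * words_per_page


for limit in (10000, 100000):
    dense = pages_covered(limit, 500)   # wordy pages
    sparse = pages_covered(limit, 200)  # light pages
    print(f"limit {limit}: about {dense:.0f} to {sparse:.0f} pages")

# Worst case for a 100-page PDF at 500 words per page.
print(words_needed(100, 500))
```

At 500 words per page, a 100-page PDF needs about 50000 words, so a limit of 100000 leaves comfortable headroom.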


About Raymond Camden

Raymond is a senior developer evangelist for Adobe. He focuses on document services, JavaScript, and enterprise cat demos.

Lafayette, LA

Archived Comments

Comment 1 by Adam Cameron posted on 8/22/2011 at 3:46 PM

Coolio. This answers the question I've been looking at over here:

It'd be good if this made its way into the docs! (the *CF* docs, that is).

I also think it would be good to have this as an (optional) setting on CFCOLLECTION & the UI in CF Admin, rather than hacking config files. This *is* CF we're talking about after all!

I should perhaps raise an issue in the bug tracker...


Comment 2 by Steve W posted on 8/22/2011 at 4:10 PM

To piggyback on Adam's comment. One can only hope that with the removal of Verity, that a new Admin area will be built to address this for Solr. Ray, any chance you can ping one of your co-workers to see if we need to add this as a feature request?

Comment 3 by Raymond Camden posted on 8/22/2011 at 4:54 PM

Adam/Steve: I think with all the options you can do in Solr, it would be difficult to create a web based admin to handle them all. That being said, it's not out of the realm of possibility to add _more_ stuff for some settings, like this one. Steve - I'd let Adam file the ER, let him then post the URL here, then you (and others) can vote on it.

Comment 4 by Adam Cameron posted on 8/22/2011 at 4:58 PM

Comment 5 by Peter Tilbrook posted on 8/23/2011 at 2:28 PM

Is Solr part of CF Pro as well or just Enterprise? (I imagine in Pro it has limitations like Verity does/did)?

Just used to using Enterprise these days.

Comment 6 by Peter Tilbrook posted on 8/23/2011 at 2:38 PM

Further to my last comment what a pain. CF7 and CF8 and CF9 servers.

Not easy to get my Porsche driving boss to upgrade every licence (obviously) just the coding changes are not worth it for a site that works fine. But...

When "fixing" a site that someone else made and have it singing like a songbird and then find it breaks cos the server is CF8 or 7 and not 9 is kinda crushing. And makes you feel like an idiot.

I will give my employer credit in that they, like me, have been using ColdFusion since at least 1995 when I started working there. And they still are. Hence the 7, 8 and 9. Probably a few 5 and 6s they haven't told me about yet.

Why did they switch to Cold Fusion (as it was known back then)? ASP (as it was also known back then) "just broke". And my boss, hasn't changed a bit in that regard, needed it fixed fast.

Largest ISP in a capital city of a major nation. Gotta have a dynamic web site that works.

Within a working day (total novice to even ASP) I converted the ASP site to CF and neither of us have looked back since. CF was a floppy or quick download back then and it still amazes me how much better as a platform it has become.

Sixteen years later I still love ColdFusion!


Comment 7 by Raymond Camden posted on 8/23/2011 at 3:17 PM

Peter: It is part of all versions of CF. It has NO restrictions. It is the Awesome.

Comment 8 by WolfShade posted on 3/7/2013 at 9:56 PM

Do CF services need to be restarted after making this minor adjustment?

Comment 9 by Raymond Camden posted on 3/8/2013 at 8:44 AM

Just the CF Solr service.

Comment 10 by Cliff posted on 5/8/2013 at 7:33 PM

Thanks for this fix.
I have a document section on our work site where the user can create a category to upload documents to. When they do this, it creates a new collection for that category. I won't know the user did this, and they won't have access to the server.
I read in your instructions that I can edit each individual collection, is there a way to make an edit sitewide?
I'm using the Solr built in to CF9.


Comment 11 by Raymond Camden posted on 5/8/2013 at 7:34 PM

I don't believe so. Best I can recommend is to check the Solr docs, and see if they support server-wide directives/overrides (if that makes sense).

Comment 12 by Robin posted on 5/19/2014 at 7:47 PM

I'm not able to get Solr on CF10 to index pdf extensions through cfadmin or cfindex. It works fine on CF8. Any ideas on this? I noticed you're in Lafayette. I went to LSU and lived in Houma. I know lots of folks in your area. I'm in NC now.

Comment 13 by Raymond Camden posted on 5/19/2014 at 7:50 PM

I'm currently working with someone who is also having an issue with PDFs. Is there any chance you could send me one PDF that doesn't index?

I got my degree at USL (now ULL). Geaux Cajuns! :)

Comment 14 by Robin posted on 5/20/2014 at 5:26 PM

I sent the pdf to your gmail account.

Comment 15 by Raymond Camden posted on 5/20/2014 at 5:27 PM

Got it - a bit behind so it may be a bit. I have to warn you - I've seen PDFs that refuse to index... with no error. I reported it about 2 weeks ago but haven't gotten a response yet.

Comment 16 by Robin posted on 5/20/2014 at 5:31 PM

Thank you for your reply. I just wanted to be sure you received it.

Comment 17 by Adam Cameron posted on 5/20/2014 at 5:31 PM

Ray, when I try to unsubscribe from this thread I get:

You have not been unsubscribed from the thread. Please ensure you correctly copied the URL from the email.

This is with URL, which is the one that is in the footer of the email, and reflects the correct email addy for when I first subscribed.

Can you pls unsub me?



Comment 18 by Raymond Camden posted on 5/20/2014 at 5:48 PM

Odd... not sure why. Well, you were unsubbed.