Indexing PDFs with Solr? Read this tip.

Have you noticed that indexed PDFs don’t seem to contain all the content they should? It turns out this is caused by a performance setting in Solr. Credit for the tip below goes to Uday Ogra of Adobe:

Solr has a default upper limit of 10000 on the number of words that can be indexed per document field, which works out to roughly 20-40 pages of text.

We can change this default value for each collection. Suppose the collection’s name is newcollection.

Open file COLDFUSION_COLLECTIONS_PATH/newcollection/conf/solrconfig.xml

Here COLDFUSION_COLLECTIONS_PATH is the path you configured when creating the collection.

In this file, search for the <mainIndex> tag. Inside this tag there is a sub-tag, <maxFieldLength>, which has a default value of 10000.

You can change it to a value that suits your indexing needs.

(There is one more <maxFieldLength> tag directly under the <indexDefaults> tag. Do not change that one.)

In your case I would recommend changing it to 100000.
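The steps above can be sketched in a few lines of Python. This is only an illustration, not part of the original tip: the XML string is a hypothetical, trimmed-down stand-in for a real solrconfig.xml, and the helper name set_main_index_max_field_length is made up for this example. It shows the change touching the <maxFieldLength> under <mainIndex> while leaving the one under <indexDefaults> alone.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for a collection's solrconfig.xml.
# A real file contains many more elements; only the two relevant tags are shown.
SAMPLE_CONFIG = """<config>
  <indexDefaults>
    <maxFieldLength>10000</maxFieldLength>
  </indexDefaults>
  <mainIndex>
    <maxFieldLength>10000</maxFieldLength>
  </mainIndex>
</config>"""


def set_main_index_max_field_length(root, new_limit):
    """Set <maxFieldLength> under <mainIndex> only and return the old value."""
    node = root.find("./mainIndex/maxFieldLength")
    if node is None:
        raise ValueError("no <mainIndex>/<maxFieldLength> element found")
    old = node.text
    node.text = str(new_limit)
    return old


root = ET.fromstring(SAMPLE_CONFIG)
old_value = set_main_index_max_field_length(root, 100000)

# Read both values back to confirm only the <mainIndex> one changed.
main_value = root.find("./mainIndex/maxFieldLength").text
defaults_value = root.find("./indexDefaults/maxFieldLength").text
print(old_value, "->", main_value, "| indexDefaults still", defaults_value)
```

To apply the same idea to a real file you could load it with ET.parse(path) and save it with tree.write(path), but note that ElementTree discards XML comments when parsing by default, so editing the file by hand is probably the safer route.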

By the way, on average a single PDF page has around 200-500 words, so a 100-page PDF holds at most about 50000 words. Setting this value to 100000 should be safe enough for a PDF of that size.
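The arithmetic behind that estimate can be checked with a short Python sketch. The function name estimate_word_range is made up for this example, and the 200-500 words-per-page range is the rough figure quoted above, not a measured value. Solr counts tokens after analysis, which may differ somewhat from a plain word count, so treat the result as a ballpark.

```python
def estimate_word_range(pages, words_per_page=(200, 500)):
    """Return (low, high) estimated word counts for a PDF with this many pages."""
    low, high = words_per_page
    return pages * low, pages * high


proposed_limit = 100000  # value suggested for <maxFieldLength>
low_estimate, high_estimate = estimate_word_range(100)

# True when even the dense end of the estimate fits under the limit.
fits = high_estimate <= proposed_limit
print(low_estimate, high_estimate, fits)
```

For much longer PDFs, the same calculation gives a starting point for picking a larger limit.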


About Raymond Camden

Raymond is a developer advocate. He focuses on JavaScript, serverless and enterprise cat demos.

Lafayette, LA