Twitter: raymondcamden


Indexing PDFs with Solr? Read this tip.

08-22-2011 3,680 views ColdFusion 11 Comments

Have you noticed that indexed PDFs don't seem to contain all the content they should? It turns out this is a performance setting in Solr. Credit for the tip below goes to Uday Ogra of Adobe:

Solr has a default upper limit of 10000 on the number of words it will index per document, which works out to roughly 20-40 pages.

We can change this default value for each collection. Suppose the collection's name is newcollection.

Open the file COLDFUSION_COLLECTIONS_PATH/newcollection/conf/solrconfig.xml

Here COLDFUSION_COLLECTIONS_PATH is the path you configured when creating the collection.

In this file, find the <mainIndex> tag. Inside it there is a <maxFieldLength> sub-tag, which has a default value of 10000.

You can change it to a value that suits your indexing needs.

(There is one more <maxFieldLength> tag directly under the <indexDefaults> tag; do not change it.)

In your case I would recommend changing it to 100000.
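If you would rather script the change than hand-edit the file, here is a minimal Python sketch of the edit described above. It is not an official tool. The helper name `raise_main_index_limit` is made up for illustration, and the tag names are matched case-insensitively because the exact casing in your solrconfig.xml is an assumption. It uses a regular expression rather than an XML parser so that comments and formatting elsewhere in the file are left untouched. It does not skip commented-out XML, so check the result and back up the file first.

```python
import re

# Matches the <mainIndex>...</mainIndex> block. Casing is matched loosely
# because the exact tag casing in a given file is an assumption.
MAIN_INDEX = re.compile(r"(<mainIndex\b[^>]*>)(.*?)(</mainIndex>)",
                        re.IGNORECASE | re.DOTALL)
MAX_FIELD = re.compile(r"(<maxFieldLength>)\s*\d+\s*(</maxFieldLength>)",
                       re.IGNORECASE)


def raise_main_index_limit(config_text, new_limit):
    """Return config_text with the mainIndex maxFieldLength set to new_limit.

    Only the copy inside <mainIndex> is changed; the one under
    <indexDefaults> is left alone, as the tip advises.
    """
    def fix_block(match):
        opening, body, closing = match.groups()
        # A function replacement avoids backreference ambiguity with digits.
        body, count = MAX_FIELD.subn(
            lambda m: f"{m.group(1)}{new_limit}{m.group(2)}", body)
        if count == 0:
            raise ValueError("no <maxFieldLength> inside <mainIndex>")
        return opening + body + closing

    new_text, blocks = MAIN_INDEX.subn(fix_block, config_text)
    if blocks == 0:
        raise ValueError("no <mainIndex> block found")
    return new_text


# Small in-memory demo; a real run would read and write solrconfig.xml.
SAMPLE = """<config>
  <indexDefaults>
    <maxFieldLength>10000</maxFieldLength>
  </indexDefaults>
  <mainIndex>
    <maxFieldLength>10000</maxFieldLength>
  </mainIndex>
</config>"""

print(raise_main_index_limit(SAMPLE, 100000))
```

To apply it for real, read the text of COLDFUSION_COLLECTIONS_PATH/newcollection/conf/solrconfig.xml, pass it through the function, and write the result back.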

By the way, on average a single PDF page has around 200-500 words. A 100-page PDF therefore has at most about 50,000 words, so setting this value to 100000 should be safe enough.
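The sizing argument above can be written out as a quick calculation. This is only a rough sketch: the 200-500 words-per-page range is the estimate from the tip, not a Solr constant, and `words_upper_bound` and `pages_covered` are hypothetical helpers named for illustration.

```python
# Rough sizing for maxFieldLength, using the tip's estimate of
# 200-500 words per PDF page (an approximation, not a Solr constant).
WORDS_PER_PAGE_LOW = 200
WORDS_PER_PAGE_HIGH = 500


def words_upper_bound(pages):
    """Worst-case word count for a PDF with the given number of pages."""
    return pages * WORDS_PER_PAGE_HIGH


def pages_covered(max_field_length):
    """Pages indexed in full at a given limit: (dense pages, sparse pages)."""
    return (max_field_length // WORDS_PER_PAGE_HIGH,
            max_field_length // WORDS_PER_PAGE_LOW)


print(words_upper_bound(100))   # 50000, comfortably under a limit of 100000
print(pages_covered(10000))     # (20, 50)
print(pages_covered(100000))    # (200, 500)
```

With the default of 10000, this estimate gives somewhere between 20 and 50 pages, which is roughly in line with the 20-40 pages quoted in the tip.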

11 Comments

  • Adam Cameron
    Commented on 08-22-2011 at 6:46 AM
    Coolio. This answers the question I've been looking at over here: http://forums.adobe.com/message/3875518.

    It'd be good if this made its way into the docs! (the CF docs, that is).

    I also think it would be good to have this as an (optional) setting on CFCOLLECTION & the UI in CF Admin, rather than hacking config files. This is CF we're talking about after all!

    I should perhaps raise an issue in the bug tracker...

    --
    Adam
  • Commented on 08-22-2011 at 7:10 AM
    To piggyback on Adam's comment: one can only hope that, with the removal of Verity, a new Admin area will be built to address this for Solr. Ray, any chance you can ping one of your co-workers to see if we need to add this as a feature request?
  • Commented on 08-22-2011 at 7:54 AM
    Adam/Steve: I think with all the options you can do in Solr, it would be difficult to create a web based admin to handle them all. That being said, it's not out of the realm of possibility to add more stuff for some settings, like this one. Steve - I'd let Adam file the ER, let him then post the URL here, then you (and others) can vote on it.
  • Adam Cameron
    Commented on 08-22-2011 at 7:58 AM
    Done:
    http://cfbugs.adobe.com/cfbugreport/flexbugui/cfbu...

    --
    Adam
  • Commented on 08-23-2011 at 5:28 AM
    Is Solr part of CF Pro as well or just Enterprise? (I imagine in Pro it has limitations like Verity does/did)?

    Just used to using Enterprise these days.
  • Commented on 08-23-2011 at 5:38 AM
    Further to my last comment what a pain. CF7 and CF8 and CF9 servers.

    Not easy to get my Porsche driving boss to upgrade every licence (obviously) just the coding changes are not worth it for a site that works fine. But...

    When "fixing" a site that someone else made and have it singing like a songbird and then find it breaks cos the server is CF8 or 7 and not 9 is kinda crushing. And makes you feel like an idiot.

    I will give my employer credit in that they, like me, have been using ColdFusion since at least 1995 when I started working there. And they still are. Hence the 7, 8 and 9. Probably a few 5 and 6s they haven't told me about yet.

    Why did they switch to Cold Fusion (as it was known back then)? ASP (as it was also known back then) "just broke". And my boss, hasn't changed a bit in that regard, needed it fixed fast.

    Largest ISP in a capital city of a major nation. Gotta have a dynamic web site that works.

    Within a working day (total novice to even ASP) I converted the ASP site to CF and neither of us have looked back since. CF was a floppy or quick download back then and it still amazes me how much better as a platform it has become.

    Sixteen years later I still love ColdFusion!

    Cheers!
  • Commented on 08-23-2011 at 6:17 AM
    Peter: It is part of all versions of CF. It has NO restrictions. It is the Awesome.
  • WolfShade
    Commented on 03-07-2013 at 10:56 AM
    Do CF services need to be restarted after making this minor adjustment?
  • Commented on 03-07-2013 at 9:44 PM
    Just the CF Solr service.
  • Cliff
    Commented on 05-08-2013 at 10:33 AM
    Thanks for this fix.
    I have a document section on our work site where the user can create a category to upload documents to. When they do this, it creates a new collection for that category. I won't know the user did this, and they won't have access to the server.
    I read in your instructions that I can edit each individual collection, is there a way to make an edit sitewide?
    I'm using the Solr built in to CF9.

    Thanks!
  • Commented on 05-08-2013 at 10:34 AM
    I don't believe so. Best I can recommend is to check the Solr docs, and see if they support server-wide directives/overrides (if that makes sense).
