Twitter: raymondcamden


jsoup adds jQuery-like parsing in Java

Posted 04-06-2012. Tags: jQuery, ColdFusion.

Earlier this week James Moberg introduced me to a cool little Java utility - jsoup. jsoup brings jQuery-like HTML manipulation to your server. Given a string or a URL, you can find all the images, look for links to a PDF, and so on. Basically - jQuery for the server. I thought I'd whip up a quick ColdFusion-based demo to see how well it works.

I began by downloading the jar file and dropping it into a folder called jars. Then, using ColdFusion 10, it was trivial to make it available to my code:
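A minimal sketch of that setup, not the original listing. It assumes the `jars` folder sits next to Application.cfc and uses ColdFusion 10's per-application `this.javaSettings`; the application name is made up for illustration.

```cfml
// Application.cfc (sketch; only the "jars" folder name comes from the post)
component {
    this.name = "jsoupDemo"; // hypothetical application name

    // ColdFusion 10 per-application Java loading: jars found in the
    // listed paths are added to this application's class path.
    // An absolute path is built from this file's location so it does not
    // depend on which template was requested.
    this.javaSettings = {
        loadPaths = [ getDirectoryFromPath(getCurrentTemplatePath()) & "jars" ],
        reloadOnChange = false
    };
}
```

With this in place, `createObject("java", "org.jsoup.Jsoup")` should resolve without touching the server-wide class path.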

I then whipped up a demo that loaded (and cached) CNN's html. I create an instance of jsoup, parse the HTML, and then run a "select" using my selector, in this case, just 'img':
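A sketch of what such a demo might look like, not the original listing. The cache key `cnnhtml` and the variable names are assumptions; `Jsoup.parse()`, `select()`, and `attr()` are standard jsoup calls, and `cacheIdExists()` is the ColdFusion 10 function mentioned in the comments.

```cfml
<!--- Fetch the page once and cache it under an assumed key, "cnnhtml" --->
<cfif not cacheIdExists("cnnhtml")>
    <cfhttp url="http://www.cnn.com">
    <cfset cachePut("cnnhtml", cfhttp.fileContent)>
</cfif>
<cfset cnnhtml = cacheGet("cnnhtml")>

<!--- Jsoup.parse() is a static method, so createObject() without init() is enough --->
<cfset jsoup = createObject("java", "org.jsoup.Jsoup")>
<cfset doc = jsoup.parse(cnnhtml)>

<!--- select() takes a CSS selector and returns the matching elements --->
<cfset images = doc.select("img")>

<cfoutput>
<cfloop array="#images#" index="img">
    <!--- attr() reads an attribute from each matched element --->
    #encodeForHTML(img.attr("src"))#<br>
</cfloop>
</cfoutput>
```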

Notice how I can loop over the matches and grab attributes from each one. Again, very jQuery-like. I wanted to play with this a bit more free form so I created an application that lets me supply any URL and any selector. Here's that code - minus the UI cruft around it:
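A sketch of such a tester, not the original listing. The form field names `url` and `selector` are assumptions, and the surrounding form markup is omitted.

```cfml
<cfparam name="form.url" default="">
<cfparam name="form.selector" default="">

<cfif len(trim(form.url)) and len(trim(form.selector))>
    <!--- Grab the raw HTML; cfhttp does not execute any JavaScript on the page --->
    <cfhttp url="#form.url#">

    <cfset jsoup = createObject("java", "org.jsoup.Jsoup")>
    <!--- Passing the URL as the base URI lets jsoup resolve relative links --->
    <cfset doc = jsoup.parse(cfhttp.fileContent, form.url)>
    <cfset found = doc.select(form.selector)>

    <cfoutput>
    <p>#found.size()# match(es)</p>
    <cfloop array="#found#" index="el">
        <!--- outerHtml() returns the element's own markup; encode it for display --->
        <pre>#encodeForHTML(el.outerHtml())#</pre>
    </cfloop>
    </cfoutput>
</cfif>
```

An invalid selector makes `select()` throw, so a real tool would probably wrap that call in a try/catch and show the message to the user.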

You can run this yourself by hitting the demo below. All in all - a very interesting Java library. Sure, you could do all of this with regular expressions, but I find this syntax a heck of a lot more friendly. (And that's with me having used regex for the past 15 years.)

Talk about synchronicity - within 10 minutes of each other, both Ben Nadel and I posted on the same topic! His post: "Parsing, Traversing, And Mutating HTML With ColdFusion And jSoup".

35 Comments

  • MikeZ #
    Commented on 04-06-2012 at 8:33 AM
    Quite interesting. I often need to parse data from various HTML files and this would actually make it a lot more comfortable.

    But does it require valid XML or is it able to handle formatting errors up to a certain degree?
  • Commented on 04-06-2012 at 8:34 AM
    According to their site, they specifically support messy HTML. Try a site that you know is messy with my tester app.
  • MikeZ #
    Commented on 04-06-2012 at 8:53 AM
    Thanks, I'll give it a try.
  • John Lang #
    Commented on 04-06-2012 at 8:55 AM
    This is awesome, thanks for posting it Ray!
  • James Moberg #
    Commented on 04-06-2012 at 10:45 AM
    Using CFDUMP when reviewing what jsoup returns is extremely beneficial.

    I used jsoup's whitelist to automatically remove invalid/bloated Microsoft markup from HTML in an RSS feed.
    http://pastebin.com/rvkt2GCC
  • Commented on 04-06-2012 at 10:52 AM
    Yeah - using cfdump on Java objects is a quick/dirty way to see the API. Of course, jsoup does have their JavaDocs online too. :)
  • MikeZ #
    Commented on 04-06-2012 at 11:03 AM
    Thanks James, I almost forgot about the most obvious use for this thing. Recently encountered a case where I needed to remove markup junk from copy&pasted Outlook mail content.
  • Commented on 04-09-2012 at 6:24 PM
    Ray,

    I noticed you used the new ColdFusion 10 cacheIDExists() function in your code. I wanted to point out that I'm not a fan of the function for the most part as it actually adds an additional cache lookup in the event of a cache hit the way the documentation shows how to use it:

    1st call - does it exist?
    2nd call - it exists, so call the cache and get it (the 2nd lookup) or it doesn't exist, so get it then put it in the cache

    Now consider this code:

    <cfset cnnhtml = cacheGet("cnnhtml")>
    <cfif isNull(cnnhtml)>
       <cfhttp url="http://www.cnn.com">
       <cfset cnnhtml = cfhttp.filecontent>
       <cfset cachePut("cnnhtml",cnnhtml)>
    </cfif>

    Notice that in the case of a get, if the item is in the cache, that's it - just one call. If the item doesn't exist then you grab it and put it in the cache. This might seem like a small deal, but in a large scale system, it could be significant.
  • Commented on 04-10-2012 at 10:27 AM
    Are you saying ehcache doesn't provide a nicer way of checking for something than asking for it and noticing it is null? Or that CF doesn't have a way of using a nicer API? Is the check that expensive?
  • Commented on 04-10-2012 at 10:45 AM
    I'm saying that using cacheIDExists() is less efficient. It's not expensive for small apps. It is expensive at scale.
  • Commented on 04-10-2012 at 10:47 AM
    Is this a CF wrapper issue or just a fact of life with ehcache?
  • Commented on 05-02-2012 at 4:53 AM
    That's a really useful tool. I can immediately think of dozens of places where it would make my life so much easier.
  • MikeZ #
    Commented on 05-02-2012 at 6:50 AM
    Gave it a try on several systems.
    It's certainly interesting, but also so much slower than extracting the data using customized RegEx-based patterns that it's of rather limited use.
  • James Moberg #
    Commented on 05-02-2012 at 7:55 AM
    If you know exactly what you are searching for and using RegEx to extract known information, then RegEx is your solution.

    I use jsoup to clean up unknown HTML & remove unwanted elements in a couple of lines. Those elements would be difficult to identify in advance, which you would need to do in order to write multiple regex statements. I'm also using it to add additional attributes & properties to certain existing HTML elements when optimizing for email messages (limited CSS support).
  • James Moberg #
    Commented on 05-02-2012 at 1:27 PM
    Ben Nadel wrote a sweet ColdFusion 10 script that uses jsoup to convert CSS style blocks to inline styles for Google-Mail-compliant email messages.
    http://www.bennadel.com/blog/2372-Best-Of-ColdFusi...

    This utility script is extremely useful and I'm not sure how it would be accomplished using RegEx-based patterns (if at all).
  • Commented on 06-19-2012 at 12:56 PM
    Hey Ray,
    Great find..
    How do you set this up with CF902?
    I tried with JavaLoader but keep getting some instantiation errors.
  • Commented on 06-19-2012 at 1:27 PM
    JavaLoader should work. What error do you get?
  • Commented on 06-20-2012 at 10:17 AM
    I am an idiot :)
    jsoup does not need init(); just do the create().
  • Alan #
    Commented on 06-22-2012 at 1:13 PM
    What about sites that load with AJAX? Is there any way to tell when the page is done and start caching? I tried buy dot com and couldn't get past the loading stage.
  • Commented on 06-22-2012 at 1:16 PM
    Alan: I'm a bit confused. Your question doesn't seem to have anything to do with this blog entry. This library works with HTML on the server. If you are using JS to load data remotely, you don't need this. Just use jQuery.
  • Alan #
    Commented on 06-22-2012 at 1:33 PM
    You are right, you can do it with jQuery :) The same question for you: why can't you do this HTML parsing with jQuery?
  • Commented on 06-22-2012 at 1:37 PM
    Um... no one said you couldn't. The idea was that if you needed to do it server-side, this was a solution. Imagine a server-side process that examines HTML files that are dynamically generated by some process.
  • Alan #
    Commented on 06-22-2012 at 1:51 PM
    I was trying to extract product information from certain websites, some of which require a login and some not (I have to deal with it on the server; jQuery is not an option).

    You are right on that; this is not a jsoup problem, but it is very related and I would appreciate it if you could advise me. I ran into this dynamic page loading issue. If I can't grab the correct content with cfhttp, I can't start playing with jsoup.

    thanks in advance!
  • Commented on 06-22-2012 at 3:43 PM
    I'm not quite sure what you are saying then. You said you want it server side but can't grab it with cfhttp? Why not?
  • Alan #
    Commented on 06-23-2012 at 11:28 AM
    I am trying to get past the loading stage. The site brings back Ajax search results and cfhttp grabs only the loading stage.
  • Commented on 06-23-2012 at 11:31 AM
    So you have some URL, let's say foo.com/index.html, which is a page that uses Ajax to load stuff in, let's say random database content.

    You are trying to use cfhttp+jsoup on it.

    If so - no - you can't do this. cfhttp will load the HTML/JS; it will not "run" it. If you want to run jsoup on the content loaded via Ajax, determine what THAT URL is.
  • Alan #
    Commented on 06-23-2012 at 3:59 PM
    Sorry for the poor explanation. Can you tell me how I can get the content of this page with cfhttp?

  • Commented on 06-23-2012 at 11:59 PM
    Are you asking how to use cfhttp? Did you check the CFML Reference?
  • Alan #
    Commented on 06-25-2012 at 1:27 PM
    Yeah, right, that is my question :) I think it removed the link, which could have shed some light on my question. Never mind ;)
  • Mike #
    Commented on 02-24-2013 at 4:45 PM
    Ray,

    How do I make the .jar available for this?
    "I began by downloading the jar file and dropping into a folder called jars. Then, using ColdFusion 10, it was trivial to make it available to my code"
  • Commented on 02-24-2013 at 4:52 PM
    The docs for this begin here: http://help.adobe.com/en_US/ColdFusion/10.0/Develo...
  • James Moberg #
    Commented on 07-23-2013 at 9:13 AM
    FYI: This demo doesn't work anymore as the website can't be found. The demo at:
    http://www.raymondcamden.com/index.cfm/2012/5/11/E...
    is broken too.
  • Colin Mac #
    Commented on 06-17-2014 at 3:06 PM
    So this is pretty great ... when it works. For producing a page that downloads a report via a CSV file, I'm collecting the href attribute of anchor tag nodes and putting the resulting URL in parentheses after the linked text, then stripping out all of the html. Only it's not working in my component. MyElement.attr("href") works fine in a vanilla cfm, but for some reason in a cfc within my framework it spits out the empty string. MyElement.text() works for getting the link text, but attr("href") produces nothing. writeDump( myElement ) also shows a fully formed Element object.

    I can't find anything but the most basic examples of using jsoup with ColdFusion, but I suspect something in here is breaking when it gets more complex.
  • Commented on 06-17-2014 at 3:08 PM
    So to be clear, you can take a block of code and run it in a CFM and it works ok. You take the exact same code in a CFC and it breaks? Or did you change something?
  • Colin Mac #
    Commented on 06-18-2014 at 10:43 AM
    Just to be sure, I created a cfc to put the jsoup code into, and called it from the cfm. It worked. The code below is from the cfc that doesn't work. Toward the end I experiment with cleaning the doc of all html, but I suspect I probably could have just used doc.text(), assuming my shenanigans with the href were kosher. Kosher shenanigans. Actually, just ran it and found that the doc variable retains enclosing html and body tags, even after a cleaning. Or maybe as part of the toString() method. So it would have to be text().

    //attempting to take the href and put in parens beside the link text, then remove the a tag.
    local.doc = local.jsoup.parse( arguments.html, local.baseUrl );
    local.elements = local.doc.select("a");
    for (local.el in local.elements) {
       writeDump( "href:" & local.el.attr("href") & " text:" & local.el.text() );
       local.el.appendText( local.el.text() & ' (' & local.el.attr("href") & ') ');
    }
    local.cleaner = createObject( "java", "org.jsoup.safety.Cleaner" );
    local.cl = local.cleaner.init( local.whitelist.none() );
    local.doc = local.cl.clean( local.doc );
    return local.doc.toString();

    Result: For input
    arguments.html = <a href="http://yahoo.com">yahoo</a>;
    writeDump just prints "href: text: yahoo".
