Earlier this week James Moberg introduced me to a cool little Java utility – jsoup. jsoup brings jQuery-like HTML manipulation to your server. Given a string or a URL, you can do things like find all the images, look for links to a PDF, and so on. Basically – jQuery for the server. I thought I’d whip up a quick ColdFusion-based demo of this so I could see how well it works.
I began by downloading the jar file and dropping it into a folder called jars. Then, using ColdFusion 10, it was trivial to make it available to my code:
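A minimal sketch of what that setup could look like, assuming the jars folder sits next to Application.cfc. The application name is made up for illustration; `this.javaSettings` is the ColdFusion 10 setting for loading jars per application.

```cfml
component {
    // Hypothetical application name, used only for this sketch
    this.name = "jsoupdemo";

    // ColdFusion 10+: load any jar found in the "jars" folder
    // (path assumed to be relative to this Application.cfc)
    this.javaSettings = {
        loadPaths = ["./jars"]
    };
}
```

With that in place, `createObject("java", "org.jsoup.Jsoup")` should resolve without touching the server-wide class path.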
I then whipped up a demo that loaded (and cached) CNN’s HTML. I create an instance of jsoup, parse the HTML, and then run a “select” using my selector, in this case, just ‘img’:
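Roughly, a demo along those lines might look like the sketch below. The cache key `cnn_html`, the 30-minute lifetime, and the variable names are assumptions made for illustration; `Jsoup.parse()`, `select()`, and `attr()` are jsoup's own methods.

```cfml
<!--- "cnn_html" is a made-up cache key; 30 minutes is an arbitrary lifetime --->
<cfset html = cacheGet("cnn_html")>
<cfif isNull(html)>
    <cfhttp url="http://www.cnn.com" result="httpResult">
    <cfset html = httpResult.fileContent>
    <cfset cachePut("cnn_html", html, createTimeSpan(0, 0, 30, 0))>
</cfif>

<!--- Jsoup.parse() is a static method, so no init() call is needed --->
<cfset jsoup = createObject("java", "org.jsoup.Jsoup")>
<cfset doc = jsoup.parse(html)>

<!--- select() takes a CSS-style selector and returns the matching elements --->
<cfset images = doc.select("img")>

<cfloop index="img" array="#images#">
    <cfoutput>#encodeForHTML(img.attr("src"))#<br></cfoutput>
</cfloop>
```

The collection returned by `select()` is a Java list, which is why it can be handed straight to an array loop.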
Notice how I can loop over the matches and grab attributes from each one. Again, very jQuery-like. I wanted to play with this in a more free-form way, so I created an application that lets me supply any URL and any selector. Here’s that code – minus the UI cruft around it:
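One way such a page could be put together is sketched here. The form field names `form.url` and `form.selector` are hypothetical, and this version lets jsoup fetch the page itself via `connect().get()` rather than going through cfhttp.

```cfml
<!--- Hypothetical form fields posted by the surrounding UI --->
<cfparam name="form.url" default="">
<cfparam name="form.selector" default="">

<cfif len(trim(form.url)) and len(trim(form.selector))>
    <cfset jsoup = createObject("java", "org.jsoup.Jsoup")>
    <cftry>
        <!--- connect().get() fetches the URL and parses it in one step --->
        <cfset doc = jsoup.connect(trim(form.url)).get()>
        <cfset matches = doc.select(trim(form.selector))>

        <cfoutput><p>Found #matches.size()# matching element(s).</p></cfoutput>
        <cfloop index="match" array="#matches#">
            <!--- outerHtml() returns the element's own markup; encode it for display --->
            <cfoutput><pre>#encodeForHTML(match.outerHtml())#</pre></cfoutput>
        </cfloop>

        <cfcatch type="any">
            <!--- A bad URL or an invalid selector both end up here --->
            <cfoutput><p>Could not process that request: #encodeForHTML(cfcatch.message)#</p></cfoutput>
        </cfcatch>
    </cftry>
</cfif>
```

Since this fetches whatever URL it is given, anything beyond a throwaway demo would want to restrict which hosts it is allowed to request.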
You can run this yourself by hitting the demo below. All in all – a very interesting Java library. Sure, you could do all of this with regular expressions, but I find this syntax a heck of a lot more friendly. (And that’s with me having used regex for the past 15 years.)
Talk about synchronicity – within 10 minutes of each other, both Ben Nadel and I posted on the same topic! His post is “Parsing, Traversing, And Mutating HTML With ColdFusion And jSoup.”