jsoup adds jQuery-like parsing in Java
Earlier this week James Moberg introduced me to a cool little Java utility - jsoup. jsoup brings jQuery-like HTML manipulation to your server. Given a string or a URL, you can find all the images, look for links to a PDF, and so on. Basically - jQuery for the server. I thought I'd whip up a quick ColdFusion-based demo so I could see how well it works.
I began by downloading the jar file and dropping it into a folder called jars. Then, using ColdFusion 10, it was trivial to make it available to my code:
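A minimal sketch of what that setup might look like, assuming the jars folder sits next to Application.cfc. The application name is hypothetical, and this is a reconstruction rather than the exact code from the demo. ColdFusion 10 reads this.javaSettings from Application.cfc and loads any jar files found in the listed paths:

```cfml
component {

    // Hypothetical application name, used only for this sketch
    this.name = "jsoupDemo";

    // ColdFusion 10+: load every jar found in the "jars" folder.
    // The relative path is an assumption about where the folder lives.
    this.javaSettings = {
        loadPaths = ["./jars"],
        loadColdFusionClassPath = true
    };

}
```

With this in place, createObject("java", "org.jsoup.Jsoup") should resolve without touching the server-wide class path.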
I then whipped up a demo that loaded (and cached) CNN's HTML. I create an instance of jsoup, parse the HTML, and then run a "select" using my selector - in this case, just 'img':
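Roughly, the parse-and-select steps could be sketched like this. Caching is left out to keep it short, the variable names are my own, and it assumes the jsoup jar is already loadable:

```cfml
<!--- Fetch the raw HTML (caching omitted in this sketch) --->
<cfhttp url="http://www.cnn.com">

<!--- Jsoup.parse() is a static method, so no init() call is needed --->
<cfset jsoup = createObject("java", "org.jsoup.Jsoup")>
<cfset doc = jsoup.parse(cfhttp.filecontent)>

<!--- select() takes a CSS-style selector and returns the matching elements --->
<cfset images = doc.select("img")>

<!--- The result is a Java list, so cfloop can iterate it like an array --->
<cfloop index="image" array="#images#">
    <cfoutput>#encodeForHTML(image.attr("src"))#<br></cfoutput>
</cfloop>
```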
Notice how I can loop over the matches and grab attributes from each one. Again, very jQuery-like. I wanted to play with this in a more free-form way, so I created an application that lets me supply any URL and any selector. Here's that code - minus the UI cruft around it:
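A sketch of how such a page might work, not the original code. The form field names (form.url and form.selector) are hypothetical, and the form markup itself is left out:

```cfml
<!--- Hypothetical form fields; the real demo's names may differ --->
<cfparam name="form.url" default="">
<cfparam name="form.selector" default="">

<cfif len(trim(form.url)) and len(trim(form.selector))>
    <cfhttp url="#form.url#">

    <!--- Passing the URL as the base URI lets jsoup resolve relative links --->
    <cfset doc = createObject("java", "org.jsoup.Jsoup").parse(cfhttp.filecontent, form.url)>
    <cfset matches = doc.select(form.selector)>

    <cfoutput>
    <p>Found #matches.size()# match(es).</p>
    <cfloop index="match" array="#matches#">
        <!--- Show each matched element's markup, escaped for display --->
        <pre>#encodeForHTML(match.outerHtml())#</pre>
    </cfloop>
    </cfoutput>
</cfif>
```

An invalid selector makes select() throw, so a real version would probably wrap that call in cftry/cfcatch.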
You can run this yourself by hitting the demo below. All in all - a very interesting Java library. Sure, you could do all of this with regular expressions, but I find this syntax a heck of a lot more friendly. (And that's with me having used regex for the past 15 years.)
Talk about synchronicity - within 10 minutes of each other, both Ben Nadel and I posted on the same topic! His post: Parsing, Traversing, And Mutating HTML With ColdFusion And jSoup

But does it require valid XML or is it able to handle formatting errors up to a certain degree?
I used jsoup's whitelist to automatically remove invalid/bloated Microsoft markup from HTML in an RSS feed.
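For readers who can't reach the pastebin, here is a minimal sketch of the general idea. It is not the linked code, and the sample markup is made up. It assumes a jsoup version that still ships org.jsoup.safety.Whitelist (newer releases rename it to Safelist):

```cfml
<cfset jsoup = createObject("java", "org.jsoup.Jsoup")>

<!--- basic() permits simple text formatting tags and drops everything else --->
<cfset whitelist = createObject("java", "org.jsoup.safety.Whitelist").basic()>

<!--- Made-up sample of Word-style markup --->
<cfset dirty = '<p class="MsoNormal" style="mso-bidi-font-size:10pt">Hello <o:p></o:p><b>world</b></p>'>

<!--- clean() returns only the markup that the whitelist allows --->
<cfset clean = jsoup.clean(dirty, whitelist)>

<cfoutput>#encodeForHTML(clean)#</cfoutput>
```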
http://pastebin.com/rvkt2GCC
I noticed you used the new ColdFusion 10 cacheIDExists() function in your code. I'm not a fan of the function for the most part: used the way the documentation shows, it adds an extra cache lookup whenever there is a cache hit:
1st call - does it exist?
2nd call - it exists, so call the cache and get it (the 2nd lookup) or it doesn't exist, so get it then put it in the cache
Now consider this code:
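<!--- One lookup: cacheGet() returns null when the key is not in the cache --->
<cfset cnnhtml = cacheGet("cnnhtml")>
<cfif isNull(cnnhtml)>
    <!--- Cache miss: fetch the page and store it for next time --->
    <cfhttp url="http://www.cnn.com">
    <cfset cnnhtml = cfhttp.filecontent>
    <cfset cachePut("cnnhtml", cnnhtml)>
</cfif>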
Notice that in the case of a get, if the item is in the cache, that's it - just one call. If the item doesn't exist then you grab it and put it in the cache. This might seem like a small deal, but in a large scale system, it could be significant.
It's certainly interesting, but it's also so much slower than extracting the data with a customized RegEx-based pattern that it's of rather limited use.
I use jsoup to clean up unknown HTML and remove unwanted elements in a couple of lines. Those elements would be difficult to identify in advance in order to write multiple regex statements. I'm also using it to add attributes and properties to certain existing HTML elements when optimizing for email messages (limited CSS support).
http://www.bennadel.com/blog/2372-Best-Of-ColdFusi...
This utility script is extremely useful and I'm not sure how it would be accomplished using RegEx-based patterns (if at all).
Great find..
How do you set this up with CF902?
I tried with JavaLoader but keep getting some instantiation errors.
jsoup does not need init(); just do the create().
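Something along these lines, as a rough sketch for ColdFusion 9. The jar file name and the javaloader.JavaLoader component path are assumptions about your install:

```cfml
<!--- Paths are assumptions: adjust to wherever the jar and JavaLoader live --->
<cfset paths = [expandPath("./jars/jsoup.jar")]>
<cfset loader = createObject("component", "javaloader.JavaLoader").init(paths)>

<!--- create() only: the methods used here are static, so there is nothing to construct --->
<cfset jsoup = loader.create("org.jsoup.Jsoup")>

<cfset doc = jsoup.parse("<p>Hello <b>world</b></p>")>
<cfoutput>#encodeForHTML(doc.select("b").text())#</cfoutput>
```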
You are right on that; this is not a jsoup problem, but it is closely related, and I would appreciate any advice. I ran into this dynamic page loading issue. If I can't grab the correct content with cfhttp, I can't even start playing with jsoup.
Thanks in advance!
So you are trying to use cfhttp+jsoup on it?
If so - no - you can't do this. cfhttp will load the HTML/JS, but it will not "run" it. If you want to run jsoup on the content loaded via Ajax, determine what THAT URL is.
How do I make the .jar available for this?
"I began by downloading the jar file and dropping into a folder called jars. Then, using ColdFusion 10, it was trivial to make it available to my code"