jsoup adds jQuery-like parsing in Java


Earlier this week James Moberg introduced me to a cool little Java utility - jsoup. jsoup brings jQuery-like HTML manipulation to your server. Given a string or a URL, you can do things like find all the images, look for links to a PDF, and so on. Basically - jQuery for the server. I thought I'd whip up a quick ColdFusion-based demo of this so I could see how well it works.

I began by downloading the jar file and dropping it into a folder called jars. Then, using ColdFusion 10, it was trivial to make it available to my code:

I then whipped up a demo that loaded (and cached) CNN's HTML. I create an instance of jsoup, parse the HTML, and then run a "select" using my selector, in this case, just 'img':

Notice how I can loop over the matches and grab attributes from each one. Again, very jQuery-like. I wanted to play with this a bit more free form so I created an application that lets me supply any URL and any selector. Here's that code - minus the UI cruft around it:

You can run this yourself by hitting the demo below. All in all - a very interesting Java library. Sure you could do all of this with regular expressions, but I find this syntax a heck of a lot more friendly. (And that's with me having used regex for the past 15 years.)
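For comparison, here is roughly what the regular-expression route looks like for the 'img' case, as a minimal plain-Java sketch. The class and method names are hypothetical, the pattern only handles quoted src attributes, and it makes no attempt to cope with the messy markup a real parser is built for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls src values out of <img> tags with a regex.
final class RegexImgSrc {
    // Matches <img ... src="..."> or src='...'; unquoted values are not handled.
    private static final Pattern IMG_SRC = Pattern.compile(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*(['\"])(.*?)\\1",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> imageSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(2)); // group 2 is the text between the quotes
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<p><img src=\"a.png\" alt=\"A\"><IMG class='x' SRC='b.jpg'></p>";
        System.out.println(imageSources(html)); // prints [a.png, b.jpg]
    }
}
```

Every new question (links to PDFs, images inside a particular div) needs its own pattern here, whereas with a selector engine it is a one-word change to the selector.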

Talk about synchronicity - within 10 minutes of each other, both Ben Nadel and I posted on the same topic! His post: "Parsing, Traversing, And Mutating HTML With ColdFusion And jSoup".


About Raymond Camden

Raymond is a senior developer evangelist for Adobe. He focuses on document services, JavaScript, and enterprise cat demos. If you like this article, please consider visiting my Amazon Wishlist or donating via PayPal to show your support. You can even buy me a coffee!

Lafayette, LA https://www.raymondcamden.com

Archived Comments

Comment 1 by MikeZ posted on 4/6/2012 at 5:33 PM

Quite interesting. I often need to parse data from various HTML files and this would actually make it a lot more comfortable.

But does it require valid XML or is it able to handle formatting errors up to a certain degree?

Comment 2 by Raymond Camden posted on 4/6/2012 at 5:34 PM

According to their site, they specifically support messy HTML. Try a site that you know is messy with my tester app.

Comment 3 by MikeZ posted on 4/6/2012 at 5:53 PM

Thanks, I'll give it a try.

Comment 4 by John Lang posted on 4/6/2012 at 5:55 PM

This is awesome, thanks for posting it Ray!

Comment 5 by James Moberg posted on 4/6/2012 at 7:45 PM

Using CFDUMP when reviewing what jsoup returns is extremely beneficial.

I used jsoup's whitelist to automatically remove invalid/bloated Microsoft markup from HTML in an RSS feed.
http://pastebin.com/rvkt2GCC

Comment 6 by Raymond Camden posted on 4/6/2012 at 7:52 PM

Yeah - using cfdump on Java objects is a quick/dirty way to see the API. Of course, jsoup does have their JavaDocs online too. :)

Comment 7 by MikeZ posted on 4/6/2012 at 8:03 PM

Thanks James, I almost forgot about the most obvious use for this thing. Recently encountered a case where I needed to remove markup junk from copy&pasted Outlook mail content.

Comment 8 by Rob Brooks-Bilson posted on 4/10/2012 at 3:24 AM

Ray,

I noticed you used the new ColdFusion 10 cacheIDExists() function in your code. I wanted to point out that I'm not a fan of the function for the most part as it actually adds an additional cache lookup in the event of a cache hit the way the documentation shows how to use it:

1st call - does it exist?
2nd call - it exists, so call the cache and get it (the 2nd lookup) or it doesn't exist, so get it then put it in the cache

Now consider this code:

<cfset cnnhtml = cacheGet("cnnhtml")>
<cfif isNull(cnnhtml)>
<cfhttp url="http://www.cnn.com">
<cfset cnnhtml = cfhttp.filecontent>
<cfset cachePut("cnnhtml",cnnhtml)>
</cfif>

Notice that in the case of a get, if the item is in the cache, that's it - just one call. If the item doesn't exist then you grab it and put it in the cache. This might seem like a small deal, but in a large scale system, it could be significant.
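To make the lookup counting concrete, here is a minimal Java sketch of the two patterns. CountingCache is a hypothetical stand-in, not ehcache or the ColdFusion cache functions; all it does is count how many times the backing store is consulted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a cache; counts every trip to the backing store.
final class CountingCache {
    private final Map<String, String> store = new HashMap<>();
    int lookups = 0;

    boolean exists(String key) { lookups++; return store.containsKey(key); }
    String get(String key)     { lookups++; return store.get(key); } // null on a miss
    void put(String key, String value) { store.put(key, value); }

    // Pattern 1: ask whether the key exists, then fetch it (two lookups on a hit).
    String checkThenGet(String key, Supplier<String> loader) {
        if (exists(key)) {
            return get(key);
        }
        String value = loader.get();
        put(key, value);
        return value;
    }

    // Pattern 2: fetch first and test for null (one lookup on a hit).
    String getThenCheck(String key, Supplier<String> loader) {
        String value = get(key);
        if (value == null) {
            value = loader.get();
            put(key, value);
        }
        return value;
    }

    public static void main(String[] args) {
        CountingCache cache = new CountingCache();
        cache.put("cnnhtml", "<html>...</html>");

        cache.checkThenGet("cnnhtml", () -> "fetched");
        System.out.println("check-then-get lookups on a hit: " + cache.lookups); // 2

        cache.lookups = 0;
        cache.getThenCheck("cnnhtml", () -> "fetched");
        System.out.println("get-then-check lookups on a hit: " + cache.lookups); // 1
    }
}
```

On a miss both patterns cost a single lookup, so the difference only shows up on hits, which in a well-used cache is the common case.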

Comment 9 by Raymond Camden posted on 4/10/2012 at 7:27 PM

Are you saying ehcache doesn't provide a nicer way of checking for something than asking for it and noticing it is null? Or that CF doesn't have a way of using a nicer API? Is the check that expensive?

Comment 10 by Rob Brooks-Bilson posted on 4/10/2012 at 7:45 PM

I'm saying that using cacheIDExists() is less efficient. It's not expensive for small apps. It is expensive at scale.

Comment 11 by Raymond Camden posted on 4/10/2012 at 7:47 PM

Is this a CF wrapper issue or just a fact of life with ehcache?

Comment 12 by Pete posted on 5/2/2012 at 1:53 PM

That's a really useful tool. I can immediately think of dozens of places where it would make my life so much easier.

Comment 13 by MikeZ posted on 5/2/2012 at 3:50 PM

Gave it a try on several systems.
It's certainly interesting, but also so much slower than extracting the data using a customized RegEx-based pattern that it's of rather limited use.

Comment 14 by James Moberg posted on 5/2/2012 at 4:55 PM

If you know exactly what you are searching for and using RegEx to extract known information, then RegEx is your solution.

I use jsoup to clean up unknown HTML & remove unwanted elements in a couple of lines. Those elements would be difficult to identify in advance in order to generate multiple regex statements. I'm also using it to add additional attributes & properties to certain existing HTML elements when optimizing for email messages (limited CSS support).

Comment 15 by James Moberg posted on 5/2/2012 at 10:27 PM

Ben Nadel wrote a sweet ColdFusion 10 script that uses jsoup to convert CSS style blocks to inline styles for Google-Mail-compliant email messages.
http://www.bennadel.com/blo...

This utility script is extremely useful and I'm not sure how it would be accomplished using RegEx-based patterns (if at all).

Comment 16 by Tim Garver posted on 6/19/2012 at 9:56 PM

Hey Ray,
Great find..
How do you set this up with CF902?
I tried with JavaLoader but keep getting some instantiation errors.

Comment 17 by Raymond Camden posted on 6/19/2012 at 10:27 PM

JavaLoader should work. What error do you get?

Comment 18 by Tim Garver posted on 6/20/2012 at 7:17 PM

I am an idiot :)
jsoup does not need init(); just do the create().

Comment 19 by Alan posted on 6/22/2012 at 10:13 PM

What about sites that load with AJAX? Is there any way to tell when the page is done and start caching? I tried buy dot com and couldn't get past the loading stage.

Comment 20 by Raymond Camden posted on 6/22/2012 at 10:16 PM

Alan: I'm a bit confused. Your question doesn't seem to have anything to do with this blog entry. This library works with HTML on the server. If you are using JS to load data remotely, you don't need this. Just use jQuery.

Comment 21 by Alan posted on 6/22/2012 at 10:33 PM

You are right, you can do it with jQuery :) The same question for you: why can't you do this HTML parsing with jQuery?

Comment 22 by Raymond Camden posted on 6/22/2012 at 10:37 PM

Um... no one said you couldn't. The idea was that if you _needed_ to do it server-side, this was a solution. Imagine a server-side process that examines HTML files that are dynamically generated by some process.

Comment 23 by Alan posted on 6/22/2012 at 10:51 PM

I was trying to extract product information from certain websites, some of which require a login and some not (I have to do this on the server; jQuery is not an option).

You are right on that; this is not a jsoup problem, but it is closely related, and I would appreciate it if you could advise me. I ran into this dynamic page loading issue. If I can't grab the correct content with cfhttp, I can't start playing with jsoup.

thanks in advance!

Comment 24 by Raymond Camden posted on 6/23/2012 at 12:43 AM

I'm not quite sure what you are saying then. You said you want it server side but can't grab it with cfhttp? Why not?

Comment 25 by Alan posted on 6/23/2012 at 8:28 PM

I am trying to get past the loading stage. The site brings in its search results via Ajax, and cfhttp grabs only the loading page.

Comment 26 by Raymond Camden posted on 6/23/2012 at 8:31 PM

So given some URL, let's say foo.com/index.html, which is a page that uses Ajax to load stuff in, let's say random database content.

You are trying to use cfhttp+jsoup on it.

If so- no - you can't do this. cfhttp will load the HTML/JS, it will not "run" it. If you want to run jsoup on the content loaded via Ajax, determine what THAT url is.

Comment 27 by Alan posted on 6/24/2012 at 12:59 AM

Sorry for the poor explanation. Can you tell me how I can get the content of this page with cfhttp?

Comment 28 by Raymond Camden posted on 6/24/2012 at 8:59 AM

Are you asking how to use cfhttp? Did you check the CFML Reference?

Comment 29 by Alan posted on 6/25/2012 at 10:27 PM

yeah right that is my question:) i think it removed the link which could bring some light to my question. never mind ;)

Comment 30 by Mike posted on 2/25/2013 at 3:45 AM

Ray,

How do I make the .jar available for this?
"I began by downloading the jar file and dropping into a folder called jars. Then, using ColdFusion 10, it was trivial to make it available to my code"

Comment 31 by Raymond Camden posted on 2/25/2013 at 3:52 AM

The docs for this begin here: http://help.adobe.com/en_US...

Comment 32 by James Moberg posted on 7/23/2013 at 6:13 PM

FYI: This demo doesn't work anymore as the website can't be found. The demo at:
http://www.raymondcamden.co...
is broken too.

Comment 33 by Colin Mac posted on 6/18/2014 at 12:06 AM

So this is pretty great ... when it works. For producing a page that downloads a report via a CSV file, I'm collecting the href attribute of anchor tag nodes and putting the resulting URL in parentheses after the linked text, then stripping out all of the html. Only it's not working in my component. MyElement.attr("href") works fine in a vanilla cfm, but for some reason in a cfc within my framework it spits out the empty string. MyElement.text() works for getting the link text, but attr("href") produces nothing. writeDump( myElement ) also shows a fully formed Element object.

I can't find anything but the most basic examples of using jsoup with ColdFusion, but I suspect something in here is breaking when it gets more complex.

Comment 34 by Raymond Camden posted on 6/18/2014 at 12:08 AM

So to be clear, you can take a block of code and run it in a CFM and it works ok. You take the *exact* same code in a CFC and it breaks? Or did you change something?

Comment 35 by Colin Mac posted on 6/18/2014 at 7:43 PM

Just to be sure, I created a cfc to put the jsoup code into, and called it from the cfm. It worked. The code below is from the cfc that doesn't work. Toward the end I experiment with cleaning the doc of all html, but I suspect I probably could have just used doc.text(), assuming my shenanigans with the href were kosher. Kosher shenanigans. Actually, just ran it and found that the doc variable retains enclosing html and body tags, even after a cleaning. Or maybe as part of the toString() method. So it would have to be text().

//attempting to take the href and put in parens beside the link text, then remove the a tag.
local.doc = local.jsoup.parse( arguments.html, local.baseUrl );
local.elements = local.doc.select("a");
for (local.el in local.elements) {
writeDump( "href:" & local.el.attr("href") & " text:" & local.el.text() );
local.el.appendText( local.el.text() & ' (' & local.el.attr("href") & ') ');
}
local.cleaner = createObject( "java", "org.jsoup.safety.Cleaner" );
local.cl = local.cleaner.init( local.whitelist.none() );
local.doc = local.cl.clean( local.doc );
return local.doc.toString();

Result: For input
arguments.html = <a href="http://yahoo.com">yahoo</a>
writeDump just prints "href: text: yahoo".