Earlier this week James Moberg introduced me to a cool little Java utility - jsoup. jsoup brings jQuery-like HTML manipulation to your server. Given a string or a URL, you can find all the images, look for links to a PDF, and so on. Basically - jQuery for the server. I thought I'd whip up a quick ColdFusion-based demo of this so I could see how well it works.
I began by downloading the jar file and dropping it into a folder called jars. Then, using ColdFusion 10, it was trivial to make it available to my code:
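The original listing isn't reproduced here, so what follows is a minimal sketch rather than the exact code. It assumes an Application.cfc sitting next to the jars folder and uses ColdFusion 10's per-application javaSettings; the application name is made up.

```cfml
component {
    // Hypothetical application name, used only for this sketch
    this.name = "jsoupDemo";

    // ColdFusion 10+: load every jar found in the jars folder next to this file
    this.javaSettings = {
        loadPaths = [ "./jars" ],
        loadColdFusionClassPath = true,
        reloadOnChange = false
    };
}
```

With this in place, createObject("java", "org.jsoup.Jsoup") should resolve without touching the server-wide class path.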
I then whipped up a demo that loaded (and cached) CNN's html. I create an instance of jsoup, parse the HTML, and then run a "select" using my selector, in this case, just 'img':
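The demo listing is also missing from this copy, so here is a rough sketch of the steps just described, not the original code. Jsoup.parse(), select(), size() and attr() are jsoup's own methods; the cache key and variable names are illustrative.

```cfml
<!--- Fetch CNN's HTML once and keep it in the cache --->
<cfif cacheIdExists("cnnhtml")>
    <cfset cnnhtml = cacheGet("cnnhtml")>
<cfelse>
    <cfhttp url="http://www.cnn.com">
    <cfset cnnhtml = cfhttp.fileContent>
    <cfset cachePut("cnnhtml", cnnhtml)>
</cfif>

<!--- Jsoup only has static methods, so no init() call is needed --->
<cfset jsoup = createObject("java", "org.jsoup.Jsoup")>
<cfset doc = jsoup.parse(cnnhtml)>

<!--- select() takes a CSS-style selector and returns a list of elements --->
<cfset images = doc.select("img")>

<cfoutput>
<p>Found #images.size()# images.</p>
<cfloop index="img" array="#images#">
    <p>#encodeForHTML(img.attr("src"))#</p>
</cfloop>
</cfoutput>
```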
Notice how I can loop over the matches and grab attributes from each one. Again, very jQuery-like. I wanted to play with this a bit more free form so I created an application that lets me supply any URL and any selector. Here's that code - minus the UI cruft around it:
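That listing didn't survive in this copy either. A minimal sketch of the idea, assuming two hypothetical form fields named targetUrl and selector, might look like this:

```cfml
<cfparam name="form.targetUrl" default="">
<cfparam name="form.selector" default="">

<cfif len(trim(form.targetUrl)) and len(trim(form.selector))>
    <cfhttp url="#form.targetUrl#" result="httpResult">

    <!--- The second argument is the base URI jsoup uses to resolve relative links --->
    <cfset doc = createObject("java", "org.jsoup.Jsoup").parse(httpResult.fileContent, form.targetUrl)>
    <cfset matches = doc.select(form.selector)>

    <cfoutput>
    <p>#matches.size()# match(es) for #encodeForHTML(form.selector)#</p>
    <cfloop index="match" array="#matches#">
        <pre>#encodeForHTML(match.outerHtml())#</pre>
    </cfloop>
    </cfoutput>
</cfif>
```

An invalid selector makes select() throw, so a real version would probably wrap that call in cftry and also limit which URLs it is willing to fetch.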
You can run this yourself by hitting the demo below. All in all - a very interesting Java library. Sure, you could do all of this with regular expressions, but I find this syntax a heck of a lot more friendly. (And that's with me having used regex for the past 15 years.)
Talk about synchronicity - within 10 minutes of each other, both Ben Nadel and I posted on the same topic! His post: Parsing, Traversing, And Mutating HTML With ColdFusion And jSoup
Archived Comments
Quite interesting. I often need to parse data from various HTML files and this would actually make it a lot more comfortable.
But does it require valid XML or is it able to handle formatting errors up to a certain degree?
According to their site, they specifically support messy HTML. Try a site that you know is messy with my tester app.
Thanks, I'll give it a try.
This is awesome, thanks for posting it Ray!
Using CFDUMP when reviewing what jsoup returns is extremely beneficial.
I used jsoup's whitelist to automatically remove invalid/bloated Microsoft markup from HTML in an RSS feed.
http://pastebin.com/rvkt2GCC
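For anyone who doesn't want to follow the link, here is a minimal sketch of whitelist-based cleaning. It is not the pastebin code: the sample markup is made up, and it assumes a jsoup version that still ships org.jsoup.safety.Whitelist (later releases renamed it Safelist).

```cfml
<cfscript>
// Made-up sample of Word/Outlook-style markup
dirtyHtml = '<p class="MsoNormal" style="margin:0in"><o:p></o:p><b>Hello</b> <span lang="EN-US">world</span></p>';

jsoup = createObject("java", "org.jsoup.Jsoup");
whitelist = createObject("java", "org.jsoup.safety.Whitelist");

// Jsoup.clean() keeps only the tags and attributes the whitelist allows;
// Whitelist.basic() permits simple text formatting tags and links
cleanHtml = jsoup.clean(dirtyHtml, whitelist.basic());

writeOutput(encodeForHTML(cleanHtml));
</cfscript>
```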
Yeah - using cfdump on Java objects is a quick/dirty way to see the API. Of course, jsoup does have their JavaDocs online too. :)
Thanks James, I almost forgot about the most obvious use for this thing. Recently encountered a case where I needed to remove markup junk from copy&pasted Outlook mail content.
Ray,
I noticed you used the new ColdFusion 10 cacheIDExists() function in your code. I wanted to point out that I'm not a fan of the function for the most part. Used the way the documentation shows, it adds an extra cache lookup whenever there is a cache hit:
1st call - does it exist?
2nd call - it exists, so call the cache and get it (the 2nd lookup) or it doesn't exist, so get it then put it in the cache
Now consider this code:
<cfset cnnhtml = cacheGet("cnnhtml")>
<cfif isNull(cnnhtml)>
<cfhttp url="http://www.cnn.com">
<cfset cnnhtml = cfhttp.filecontent>
<cfset cachePut("cnnhtml",cnnhtml)>
</cfif>
Notice that in the case of a get, if the item is in the cache, that's it - just one call. If the item doesn't exist then you grab it and put it in the cache. This might seem like a small deal, but in a large scale system, it could be significant.
Are you saying ehcache doesn't provide a nicer way of checking for something than asking for it and noticing it is null? Or that CF doesn't have a way of using a nicer API? Is the check that expensive?
I'm saying that using cacheIDExists() is less efficient. It's not expensive for small apps. It is expensive at scale.
Is this a CF wrapper issue or just a fact of life with ehcache?
That's a really useful tool. I can immediately think of dozens of places where it would make my life so much easier.
Gave it a try on several systems.
It's certainly interesting, but it's also so much slower than extracting the data with a customized RegEx-based pattern that it's of rather limited use.
If you know exactly what you are searching for and using RegEx to extract known information, then RegEx is your solution.
I use jsoup to clean up unknown HTML and remove unwanted elements in a couple of lines. Those elements would be difficult to identify in advance in order to write multiple regex statements. I'm also using it to add attributes and properties to certain existing HTML elements when optimizing for email messages (limited CSS support).
Ben Nadel wrote a sweet ColdFusion 10 script that uses jsoup to convert CSS style blocks to inline styles for Google-Mail-compliant email messages.
http://www.bennadel.com/blo...
This utility script is extremely useful and I'm not sure how it would be accomplished using RegEx-based patterns (if at all).
Hey Ray,
Great find..
How do you set this up with CF902?
I tried with JavaLoader but keep getting some instantiation errors.
JavaLoader should work. What error do you get?
I am an idiot :)
jsoup does not need init(); just call create().
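For anyone else on CF9, a minimal sketch of that setup, assuming JavaLoader is installed under a javaloader mapping; the jar file name and path are hypothetical.

```cfml
<cfscript>
// Hypothetical location of the downloaded jsoup jar
paths = [ expandPath("./jars/jsoup.jar") ];
loader = createObject("component", "javaloader.JavaLoader").init(paths);

// create() alone is enough: Jsoup only exposes static methods,
// so there is no public constructor for init() to call
jsoup = loader.create("org.jsoup.Jsoup");

doc = jsoup.parse("<p>Hello <a href='http://example.com'>link</a></p>");
writeOutput(doc.select("a").first().attr("href"));
</cfscript>
```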
What about sites that load with AJAX? Is there any way to tell when the page is done, and then start caching? I tried buy dot com and couldn't get past the loading stage.
Alan: I'm a bit confused. Your question doesn't seem to have anything to do with this blog entry. This library works with HTML on the server. If you are using JS to load data remotely, you don't need this. Just use jQuery.
You are right, you can do it with jQuery :) The same question for you: why can't you do this HTML parsing with jQuery?
Um... no one said you couldn't. The idea was that if you _needed_ to do it server-side, this was a solution. Imagine a server-side process that examines HTML files that are dynamically generated by some process.
I was trying to extract product information from certain websites. Some require a login and some don't (I have to handle this on the server; jQuery is not an option).
You are right on that; this is not a jsoup problem, but it is closely related, and I would appreciate it if you could advise me. I ran into this dynamic page loading issue. If I can't grab the correct content with cfhttp, I can't start playing with jsoup.
thanks in advance!
I'm not quite sure what you are saying then. You said you want it server side but can't grab it with cfhttp? Why not?
I am trying to get past the loading stage. The site brings in its search results via Ajax, and cfhttp grabs only the loading page.
So given some URL, let's say foo.com/index.html, which is a page that uses Ajax to load stuff in, let's say random database content.
You are trying to use cfhttp+jsoup on it.
If so - no - you can't do this. cfhttp will load the HTML/JS, but it will not "run" it. If you want to run jsoup on the content loaded via Ajax, determine what THAT URL is and request it directly.
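A rough sketch of what that looks like. The endpoint below is hypothetical: it stands in for whatever URL the page's JavaScript actually requests, which you can usually spot in the browser's network tools.

```cfml
<!--- Hypothetical Ajax endpoint; replace with the URL the page's JavaScript calls --->
<cfset ajaxUrl = "http://foo.com/ajax/results?q=widgets">
<cfhttp url="#ajaxUrl#" result="ajaxResult">

<!--- Assumes the endpoint returns an HTML fragment; a JSON response
      would go through deserializeJSON() instead of jsoup --->
<cfset jsoup = createObject("java", "org.jsoup.Jsoup")>
<cfset fragment = jsoup.parseBodyFragment(ajaxResult.fileContent)>
<cfset links = fragment.select("a")>

<cfoutput>#links.size()# links found</cfoutput>
```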
Sorry for the poor explanation. Can you tell me how I can get the content of this page with cfhttp?
Are you asking how to use cfhttp? Did you check the CFML Reference?
Yeah, right, that is my question :) I think the link that could have shed some light on my question was removed. Never mind ;)
Ray,
How do I make the .jar available for this?
"I began by downloading the jar file and dropping into a folder called jars. Then, using ColdFusion 10, it was trivial to make it available to my code"
The docs for this begin here: http://help.adobe.com/en_US...
FYI: This demo doesn't work anymore as the website can't be found. The demo at:
http://www.raymondcamden.co...
is broken too.
So this is pretty great ... when it works. For producing a page that downloads a report via a CSV file, I'm collecting the href attribute of anchor tag nodes and putting the resulting URL in parentheses after the linked text, then stripping out all of the html. Only it's not working in my component. MyElement.attr("href") works fine in a vanilla cfm, but for some reason in a cfc within my framework it spits out the empty string. MyElement.text() works for getting the link text, but attr("href") produces nothing. writeDump( myElement ) also shows a fully formed Element object.
I can't find anything but the most basic examples of using jsoup with ColdFusion, but I suspect something in here is breaking when it gets more complex.
So to be clear, you can take a block of code and run it in a CFM and it works ok. You take the *exact* same code in a CFC and it breaks? Or did you change something?
Just to be sure, I created a cfc to put the jsoup code into, and called it from the cfm. It worked. The code below is from the cfc that doesn't work. Toward the end I experiment with cleaning the doc of all html, but I suspect I probably could have just used doc.text(), assuming my shenanigans with the href were kosher. Kosher shenanigans. Actually, just ran it and found that the doc variable retains enclosing html and body tags, even after a cleaning. Or maybe as part of the toString() method. So it would have to be text().
//attempting to take the href and put in parens beside the link text, then remove the a tag.
local.doc = local.jsoup.parse( arguments.html, local.baseUrl );
local.elements = local.doc.select("a");
for (local.el in local.elements) {
writeDump( "href:" & local.el.attr("href") & " text:" & local.el.text() );
local.el.appendText( local.el.text() & ' (' & local.el.attr("href") & ') ');
}
local.cleaner = createObject( "java", "org.jsoup.safety.Cleaner" );
local.cl = local.cleaner.init( local.whitelist.none() );
local.doc = local.cl.clean( local.doc );
return local.doc.toString();
Result: For input
arguments.html = <a href="http://yahoo.com">yahoo</a>
writeDump just prints "href: text: yahoo".