Posted in ColdFusion | Posted on 03-13-2009
For today's Friday Puzzler (yes, I know it's been a while), I have something of a doozy. It may not be a 5-minute puzzle, but it could still be fun, and most of all, it will be helpful to the Model-Glue team. The Model-Glue docs (http://docs.model-glue.com/) were written using RoboHelp, and unfortunately the original source files are missing. We need an "export" of the docs so that they can be republished in a new format. Your task, if you choose to accept it, is to write a scraper that downloads and stores each page of the documentation. It needs to keep the HTML for layout purposes, so plain text alone won't do.
Anyone up for that challenge?


http://softbytelabs.com/us/bw/
I did just that when a competitor of ours "stole" content off our site.
http://www.gnu.org/software/wget/
The other view is - I'd like to see a CF solution too, just for fun. :)
I never knew about the recursive option. Amazing.
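For anyone curious what a recursive download boils down to, here is a rough Python sketch of the idea: fetch a page, collect its links, and follow only those on the same host. This is not how wget is implemented, and the names (LinkCollector, extract_links, fetch_page, crawl) and the limit parameter are made up for illustration. It also assumes the docs link their pages through plain <a href> or <frame src> tags; a table of contents built by JavaScript would not be discovered this way.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects link targets from <a href>, <frame src> and <iframe src>."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        wanted = {"a": "href", "frame": "src", "iframe": "src"}.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for every link in html, with #fragments removed."""
    collector = LinkCollector()
    collector.feed(html)
    return [urldefrag(urljoin(base_url, link)).url for link in collector.links]


def fetch_page(url):
    """Download one page and return its markup as text (no error handling)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_page, limit=500):
    """Breadth-first crawl that stays on the start URL's host.

    Returns a dict mapping each visited URL to its raw HTML.
    `limit` is an arbitrary safety cap on the number of pages.
    """
    host = urlparse(start_url).netloc
    pages = {}
    queue = deque([start_url])
    while queue and len(pages) < limit:
        url = queue.popleft()
        if url in pages:
            continue
        pages[url] = fetch(url)
        for link in extract_links(pages[url], url):
            if urlparse(link).netloc == host and link not in pages:
                queue.append(link)
    return pages
```

Calling crawl("http://docs.model-glue.com/") would return the raw HTML keyed by URL, which could then be written to disk one file per page. The fetch argument is there so the crawl logic can be tried against a canned dictionary of pages before pointing it at a live site.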
I learned something new today: Thanks John Lyons (and Ray for issuing the challenge to begin with).
Check out http://scrubyt.org. It lets you grab parts of pages (and even interact with them),
for example:
ebay_data = Scrubyt::Extractor.define do
  fetch 'http://www.ebay.com/'
  fill_textfield 'satitle', 'ipod'
  submit
  click_link 'Apple iPod'
  record do
    item_name 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER'
    price '$71.99'
  end
  next_page 'Next >', :limit => 5
end
If so, I have a page up at the link below that will assist in reverse engineering things.
http://tinyurl.com/2g8kd6
If anyone creates a utility to grab this content and convert it to basic HTML pages, I'd love to know about it so I may steer folks to obtain it if needed. It would be a nice one to see.
Cheers all... Rick :)
I kept the code simple using CFHTTP calls and looping through the pages using FindNoCase() to strip out the desired content.
Is this what you were looking for?
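The approach described above (fetch each page, then use a case-insensitive find to cut out the wanted part) can be sketched roughly as follows. This is written in Python rather than CFML and is not the commenter's actual code. The helper names and the <body> markers are assumptions chosen for illustration. Note that ColdFusion's FindNoCase() returns a 1-based position and 0 when nothing is found, while this helper is 0-based and returns -1.

```python
import re


def find_no_case(substring, text, start=0):
    """Case-insensitive search; returns a 0-based index, or -1 if absent."""
    match = re.compile(re.escape(substring), re.IGNORECASE).search(text, start)
    return match.start() if match else -1


def extract_between(html, start_marker, end_marker):
    """Return the markup between two markers, or None if either is missing.

    The markers themselves are excluded; the HTML in between is kept as-is
    so the layout survives.
    """
    start = find_no_case(start_marker, html)
    if start == -1:
        return None
    start += len(start_marker)
    end = find_no_case(end_marker, html, start)
    if end == -1:
        return None
    return html[start:end]


# Hypothetical page: the real docs pages would need their own markers.
page = "<HTML><Body><p>Hello</p></BODY></html>"
content = extract_between(page, "<body>", "</body>")
```

Matching through a regular expression with re.IGNORECASE, instead of lowercasing both strings, keeps the returned index valid for the original text even when lowercasing would change a string's length.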