Raymond Camden's Blog

Friday Puzzler - Helping the Model-Glue Team

Posted in ColdFusion | Posted on 03-13-2009 | 3,610 views

For today's Friday Puzzler (yes, I know it's been a while), I have something of a doozy. It may not be a five-minute puzzle, but it could still be fun, and most of all, it will be helpful to the Model-Glue team. The Model-Glue docs (http://docs.model-glue.com/) were written using RoboHelp, and unfortunately, the original files are missing. We need to get an "export" of the docs so that they can be republished in a new format. Your task, if you choose to accept it, is to write a scraper that can download and store each page of the documentation. It needs to keep the HTML for layout purposes, so plain text alone won't do.

Anyone up for that challenge?

Comments

Why not use BlackWidow...?

http://softbytelabs.com/us/bw/
It probably defeats the purpose of the challenge, I know, but why not just use some freeware to do it? (No names mentioned.)

I did just that when a competitor of ours "stole" content off our site.
Does it have to be in CF?
Use the Windows port of wget and download the whole thing.

http://www.gnu.org/software/wget/
I look at this from two views. The practical need is for the MG team. If you use black magic voodoo to get the content, I'm happy and they are happy (by "they" I mean me too; I'm on the team :)).

The other view is: I'd like to see a CF solution too, just for fun. :)
I just tried the wget thing (no need to grab the Windows version; the Linux one did just fine).

I never knew about the recursive option. Amazing.

I learned something new today: Thanks John Lyons (and Ray for issuing the challenge to begin with).
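
For anyone curious what a recursive download (wget's -r / --recursive option) amounts to, here is a minimal sketch in Ruby of the same idea: fetch a page, collect its same-host links, and repeat until nothing new turns up. This is not how wget is implemented. The extract_links, resolve, and crawl helpers, the regex-based link matching, and the injected fetch callable are all assumptions made for illustration.

```ruby
require 'uri'
require 'set'

# Hypothetical helper: pull href values out of raw HTML with a simple regex,
# dropping any #fragment. A real scraper would use an HTML parser.
def extract_links(html)
  html.scan(/href\s*=\s*["']([^"'#]+)/i).flatten
end

# Hypothetical helper: resolve href against the page it appeared on.
# Returns nil when the link cannot be parsed as a URI.
def resolve(base, href)
  URI.join(base, href).to_s
rescue URI::Error
  nil
end

# Breadth-first crawl from start_url, staying on the starting host.
# fetch is any callable that takes a URL string and returns HTML (or nil).
# Returns a hash of url => html for every page that was fetched.
def crawl(start_url, fetch, limit: 100)
  host  = URI(start_url).host
  seen  = Set.new([start_url])
  queue = [start_url]
  pages = {}

  until queue.empty? || pages.size >= limit
    url  = queue.shift
    html = fetch.call(url)
    next if html.nil?
    pages[url] = html

    extract_links(html).each do |href|
      absolute = resolve(url, href)
      next if absolute.nil?
      next unless URI(absolute).host == host   # skip off-site links
      queue << absolute if seen.add?(absolute) # add? is nil if already seen
    end
  end

  pages
end

# Possible usage against a live site (assumes require 'net/http'):
#   pages = crawl('http://docs.model-glue.com/', ->(u) { Net::HTTP.get(URI(u)) })
```

Passing the fetcher in as a callable keeps the crawl logic separate from the network call, so it can be tried against a canned set of pages before being pointed at a real server.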
wget is nice, but what if you actually want to do something with the data, or only grab parts of the page (probably a more likely scenario)?

Check out http://scrubyt.org, which lets you grab parts of pages (and even interact with them).

For example:

ebay_data = Scrubyt::Extractor.define do
  fetch 'http://www.ebay.com/'
  fill_textfield 'satitle', 'ipod'
  submit
  click_link 'Apple iPod'

  record do
    item_name 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER'
    price '$71.99'
  end
  next_page 'Next >', :limit => 5
end
The question in my mind is whether RoboHelp will be used for the docs in the future.

If so, I have a page up at the link below that will assist in reverse engineering things.
http://tinyurl.com/2g8kd6

If anyone creates a utility to grab this content and convert it to basic HTML pages, I'd love to know about it so I may steer folks to obtain it if needed. It would be a nice one to see.

Cheers all... Rick :)
My solution is posted at http://edbartram.com/blog/2009/03/converting-model... with a zip file containing the Model-Glue docs in HTML format.

I kept the code simple using CFHTTP calls and looping through the pages using FindNoCase() to strip out the desired content.

Is this what you were looking for?
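
The "find the markers, keep what sits between them" step described above can be sketched roughly as follows. This is Ruby rather than CFML, and it is not the code from the linked post: the extract_between helper and the marker strings in the usage line are hypothetical, and the case-insensitive regex only loosely stands in for what FindNoCase() would do.

```ruby
# Return the text between the first start_marker and the nearest following
# end_marker, matching both case-insensitively. Returns nil if either marker
# is missing. The markers are treated as literal text, not as patterns.
def extract_between(html, start_marker, end_marker)
  pattern = /#{Regexp.escape(start_marker)}(.*?)#{Regexp.escape(end_marker)}/mi
  match = html.match(pattern)
  match && match[1]
end

# Hypothetical usage: these markers are made up for illustration; the real
# ones depend on how the RoboHelp pages happen to be laid out.
page = '<HTML><BODY><div id="Content"><h1>Title</h1></DIV></BODY></HTML>'
content = extract_between(page, '<div id="content">', '</div>')
```

Escaping the markers means characters such as "?" or "." inside them are matched literally, and the non-greedy match stops at the first closing marker rather than the last.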
@Ed (and all), Dan Wilson is monitoring this post as he is the one trying to get the docs.
