Using DevTools to Scrape Web Content

So yesterday I blogged a demo that was - by my own admission - somewhat silly and not really worth your time to read. However, I was thinking later that one particular aspect of how I built that demo may actually be useful.

While I was creating the demo, I needed to get a list of all the songs the Cure recorded. I found this quickly enough on Wikipedia:

Screen shot of Wikipedia page

So there's only 67 songs there - in theory I could have typed that in about 5 minutes. But why do something by hand when you can use code?!?!?

I began by right-clicking on the first link and selecting "Inspect Element." (As a quick FYI, I'm using Firefox for this, but nearly everything I'm showing should work in every modern browser. And shoot - I just tested and one piece, the copy command at the very end, is not supported in Edge. Tsk tsk.)

Screen shot of devtools focused on the link tag

It may be a bit hard to see in the screen shot, but I noticed two things here. First, the link used a title attribute with the name of the song. Second, I noticed there was a div with the class mw-category that appeared to "wrap" all the links. I figured this out by mousing over the div in the Inspector panel and noticing the highlight on the page above.

Screen shot of devtools showing the div highlighted

Cool. So now I switched to the Console. For my first command, I wanted to grab all the links within that div:

links = document.querySelectorAll('.mw-category a');

When it was done, I tested to see if it seemed right by checking the length:

Confirming I got the right data

Notice how I got 67 items, which matches what the Wikipedia page says as well. Cool! So, now I've got a NodeList of data that I can iterate over like an array. (It isn't a true array, but it supports forEach, which is all I need here.) So first I made a new array:

titles = [];

And then I populated it:

links.forEach((a) => titles.push(a.title));
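If you'd rather do the select-and-collect steps in one go, here's a minimal sketch. The helper name collectTitles is my own invention for illustration, not a browser API. It takes a root object as a parameter (in the console you'd pass document) and uses Array.from to turn the NodeList into a real array while pulling out each title.

```javascript
// Sketch: collect the title attribute of every element matching a selector.
// `collectTitles` is a hypothetical helper name, not part of any browser API.
function collectTitles(root, selector) {
  // querySelectorAll returns a NodeList; Array.from converts it to a real
  // array and applies the mapping function to each element along the way.
  return Array.from(root.querySelectorAll(selector), (a) => a.title);
}

// In the DevTools console you would call it like this:
// titles = collectTitles(document, '.mw-category a');
```

Because the result is a real array, methods like map, filter, and sort are available on it afterwards.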

And when done, I took a quick look to ensure it seemed ok:

Testing the titles value
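Eyeballing the output worked fine for a short list. For a longer one, a rough check like the following could flag entries that need cleanup. This is only a sketch: findSuspectTitles is a hypothetical helper name, and it assumes every entry in the array is a string.

```javascript
// Sketch: flag blank and repeated entries in an array of title strings.
// `findSuspectTitles` is a hypothetical helper name.
function findSuspectTitles(titles) {
  const seen = new Set();
  const empty = [];      // indexes of blank titles
  const duplicates = []; // each repeated occurrence of a title already seen
  titles.forEach((title, index) => {
    if (title.trim() === '') {
      empty.push(index);
    } else if (seen.has(title)) {
      duplicates.push(title);
    } else {
      seen.add(title);
    }
  });
  return { empty, duplicates };
}

// In the DevTools console: findSuspectTitles(titles)
```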

Cool! And for the final operation, I simply copied it to my clipboard using:

copy(titles)

This is the only part that is not supported by Edge. Hopefully they add that soon. The end result is a string version of the array I was able to drop right into my editor and go to town with.
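If your browser's console doesn't have copy, one fallback is to turn the array into a string yourself, print it, and copy the text by hand. Here's a sketch of that, where toJsonText is a name I made up for illustration:

```javascript
// Sketch: turn an array into a formatted JSON string for manual copying.
// `toJsonText` is a hypothetical helper name.
function toJsonText(values) {
  // The third argument to JSON.stringify is the indent width,
  // which puts each entry on its own line.
  return JSON.stringify(values, null, 2);
}

// In the DevTools console: console.log(toJsonText(titles));
// Then select the printed text and copy it by hand.
```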

If any of the above didn't make sense, I've created a quick video showing the process I went through.
