Scraping a web page in Node with Cheerio

In yet another example of “I will build the most stupid crap ever if bored”, this week I worked on a Node script for the sole purpose of gathering data about SiriusXM. I’m a huge fan of the radio service (mostly because 99% of my local radio stations are absolute garbage, except for KRZS), and I was curious if the service had an API of some sort. I was not able to find one, but I did find this page:

http://xmfan.com/guide.php

This page has a constantly updating list of what’s playing. I reached out to the site to ask how they were getting their data, but I never heard back, so I figured: why not simply scrape the data myself locally?

In order to do this, I decided to try Cheerio, a jQuery-like library built specifically for the server. It lets you perform jQuery-style operations against HTML in your Node apps. I first heard about it from one of my new coworkers, Erin McKean, who joined us on the LoopBack team at IBM a few weeks back.

My script was rather simple, so here is the entire module I built.


let cheerio = require('cheerio');
let request = require('request');

// Fetch the guide page and scrape channel/artist/title out of its tables.
function getData() {
    return new Promise(function(resolve, reject) {
        request('http://xmfan.com/guide.php', function(err, response, body) {
            if(err) return reject(err);
            if(response.statusCode !== 200) {
                return reject('Invalid status code: ' + response.statusCode);
            }

            let $ = cheerio.load(body);
            // The channel name lives in the 140px-wide cell; the artist and
            // title are the next two cells in the same row.
            let channelList = $('td[width=140]');
            let channels = [];

            for(let i = 0; i < channelList.length; i++) {
                let t = channelList.get(i);
                let channel = $(t).text();
                let artistNode = $(t).next();
                let artist = $(artistNode).text();
                let title = $(artistNode).next().text();
                channels.push({channel:channel, artist:artist, title:title});
            }

            resolve(channels);
        });
    });
}

module.exports = getData;

Essentially - I suck down the contents of the HTML and then use a selector to grab the left-hand column of the table cells used to represent the music data. This is - obviously - brittle. But let’s carry on. After I have those nodes, I can iterate over them and find the nodes next to them in the same table row. This is all very much like any other jQuery demo, but I’m running it completely server-side. The end result is an array of objects containing a channel, artist, and title.

To use this, I set up a simple script to run my module and then insert the data into Mongo. In order to ensure I don’t get duplicate data, I store a timestamp with each record, and first see if a matching record within five minutes was stored. Here’s my code:


let sucker = require('./sucker.js');
let MongoClient = require('mongodb').MongoClient;

let url = 'mongodb://localhost:27017/siriusxm';

MongoClient.connect(url).then(function(db) {
	console.log('connected like a boss');

	let data = db.collection('data');

	sucker().then(function(channels) {
		let toProcess = channels.length;
		let done = 0;
		let inserted = 0;
		console.log('got my result, size is ' + toProcess);

		/*
		The logic is as follows:
			iterate over each result
			look for a match within a five-minute time frame
			insert only when no match is found
		*/
		let dateFilter = new Date(new Date().getTime() - 5*60000);

		channels.forEach(function(channel) {
			channel.timestamp = new Date();

			data.find({
				'title':channel.title,
				'channel':channel.channel,
				'artist':channel.artist,
				'timestamp':{
					'$gte':dateFilter
				}
			}).toArray(function(err, docs) {
				if(err) console.log('Err', err);
				if(docs && docs.length === 0) {
					data.insert(channel, function(err, result) {
						if(err) throw err;
						inserted++;
						done++;
						if(done === toProcess) {
							console.log('Total inserted: ', inserted);
							db.close();
						}
					});
				} else {
					// Duplicate within the window - skip it.
					done++;
					if(done === toProcess) {
						console.log('Total inserted: ', inserted);
						db.close();
					}
				}
			});

		});
	}).catch(function(err) {
		console.log('unhandled error', err);
		db.close();
	});

}).catch(function(err) {
	console.log('mongodb err', err);
});

I figure this isn’t too interesting, but I will point out one bit I don’t like. I’m not running this as a server, just a script, so I needed a way to close the connection when done. Since everything is async, I could have used Promises, but I went the lame way out and simply kept track of how many results I had processed. This means there’s a bit of duplication in the two blocks that handle closing the connection.
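For what it’s worth, the counter could be replaced with `Promise.all`, which fires the close logic exactly once. Here is a sketch with the Mongo calls stubbed out - `maybeInsert` is a hypothetical stand-in for the find/insert pair, resolving to 1 when a record would be inserted and 0 when it’s a duplicate:

```javascript
// maybeInsert stands in for the find-then-insert against Mongo; it
// resolves to 1 for a new record, 0 for a duplicate.
function maybeInsert(channel) {
	return new Promise(function(resolve) {
		// Pretend every other record is a duplicate.
		const isNew = channel.id % 2 === 0;
		setImmediate(function() { resolve(isNew ? 1 : 0); });
	});
}

const channels = [{id: 0}, {id: 1}, {id: 2}, {id: 3}];

// One promise per channel; one place to report and close the connection.
Promise.all(channels.map(maybeInsert)).then(function(results) {
	const inserted = results.reduce(function(a, b) { return a + b; }, 0);
	console.log('Total inserted:', inserted); // db.close() would go here
});
```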

I’m thinking that the next step will be to add Node Cron, which is fairly easy to install, but always takes me forever to figure out the right syntax. I’ll then let it run for a month or so and see if I can get some interesting analytics. For example, how often is the Cure played? This is important stuff, people!

Here’s an example of it working - you can see how, on the second run, it ignored a bunch of songs that had already been recorded:

[Screenshot: Da Script]

And here are a few rows in the database:

[Screenshot: Da data]

You can take a look at the full code (currently anyway) here: https://github.com/cfjedimaster/NodeDemos/tree/master/siriusxmparser


About Raymond Camden

Raymond is a developer advocate for Extend by Auth0. He focuses on serverless and enterprise cat demos. If you like this article, please consider visiting his Amazon Wishlist or donating via PayPal to show your support.

Lafayette, LA https://www.raymondcamden.com
