Earlier this week I got a pretty interesting request from a client: He wanted to use a service that would make a voice call to a number, ask them to record a message, and then store the result. He had an API already in mind, Nexmo.

Nexmo has a suite of APIs related to communication, of which voice processing is just one of them. Their Voice API covers a variety of different aspects including the ability to design a complete "phone tree" - you know - the "Dial 1 if you want X, Dial 2 if you want Y" thing we all love. Their docs are good too, but I had some difficulty wrapping my head around the process first.

So given our need (given a phone number, call me and get a recording), the process works like so:

  1. Prompt the user for the phone number.

  2. Your server code makes a HTTP request to Nexo saying, "Hey, I want you to call this number, and I want you to execute the script at URL so and so."

This is the part that confused me. Instead of just telling the API want to do, you have to set up a script on your server that includes the directions for the call. This is written in "VoiceXML", which I had never heard of before but anyone can write XML.

  1. Ok, so you have to build that XML, and the XML is essentially the script that Nexmo will use when calling you. This is basic TTS (Text to Speech) stuff so you want to be a bit careful what you type. My client had a web 2.0 name, and you know what those look like, so I spelled it phonetically instead. You also include instructions telling Nexmo where to send the recording. You can also include other information, like a session ID, etc, to help associate the recording with the file.

All in all not too terribly complex. So how about a demo? First off - before you use this code you will need to sign up for an account. This will give you your account keys as well as setup a number that is usable for receiving calls. You get 10 English pounds of credit which I think is like 5 million American dollars or so. No idea. In all my testing though I've only used about 25% of my credit so it has been enough so far. I'll skip over the code that is boilerplate Node/Express, but if anyone wants a complete zip, just ask.

So my first page is just a regular HTML page:


<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		<title></title>
		<meta name="description" content="">
		<meta name="viewport" content="width=device-width">
	</head>
	<body>

	<h2>Enter Your Phone Number</h2>

	<form id="telForm">
		<input type="tel" id="tel" placeholder="Phone Number" value="3374128987" required>
		<input type="submit" value="Call">
	</form>
	<div id="result"></div>
	
	<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js"></script>
	<script src="js/app.js"></script>
	</body>
</html>

Nothing too scary here. Note that I used both a placeholder and a value attribute for the code, which doesn't make sense, but whenever I'm working on code that includes a form, I almost always hard code crap into my forms so I don't have to type. The idea being that I'd remove those hard coded values before going into production. The JavaScript is literally nothing but "listen for the form submit, get the phone number, and do an Ajax post to /ring. Again, I'll happily share it if folks want to see it. Ok, so let's take a look at the Node side. Here is a snippet of app.js:


var nexmoAPI = require('./nexmo');

app.set('domain', 'http://2d518ed0.ngrok.io');

var nexmo = new nexmoAPI('my key','some other value','19418773508');

app.post('/ring', function(req, res) {
	var number = req.body.number;
	console.log('begin ring process to '+number);
	nexmo.call(number,app.get('domain') + '/response', function(result) {
		res.send('Done');
	});
});

So a few things here. First, I'm getting a copy of my little Nexmo "library", which right now is all of one method that I'll show in a second. I initialize it with my account keys and a "from" number that will be used when dialing. My /ring handler looks for a form POST with the number to dial and then runs a method on my library called call. I pass in the phone number to call as well as the URL Nexmo should use to get the VoiceXML stuff I mentioned before.

This is where we do a quick digression. For Nexmo to work, it needs to hit your server, which in development can be a problem. But then I remembered the awesome ngrok. This is a free service that creates a tunnel on the Internet to your local machine. I even wrote about the service way back in 2014 (Expose Yourself with ngrok).

At the command line, I simply typed ngrok http 3000 and I had my tunnel set up:

ngrok

Along with a nice 'report' in the terminal, you also get a cool web based admin which lets you inspect the calls to and from and your server. For something like the Nexmo API, this was crucial and incredibly helpful. While I don't have it in my demo here, the original demo I did for my client had dynamic aspects to the VoiceXML and ngrok was able to show me the XML I was sending back to Nexmo. In case you're curious, here is how it looked for my testing.

Ok, so once ngrok gave me an 'external' domain, I was able to use it in my code. You see it in use in the code above, now let's look at my nexmo.call method.


var request = require('request');

var Nexmo = function(key,secret,defaultFrom) {
	this.config = {
		key:key,
		secret:secret,
		defaultFrom:defaultFrom
	};
	return this;
};

Nexmo.prototype.call = function(number,responseUrl,cb) {
	var theUrl = `https://rest.nexmo.com/call/json?api_key=${this.config.key}&api_secret=${this.config.secret}&to=${number}&from=${this.config.defaultFrom}&answer_url=${encodeURI(responseUrl)}`;
	console.log('theUrl',theUrl);

	request(theUrl, function(error, response, body) {
		var response = JSON.parse((body));
		console.log(body);
		cb(1);
	});
}

module.exports = Nexmo;

As I said, I've only built one method, and to be clear, the APIs at Nexmo are quite a bit more involved, but this was sufficient to do what my client needed. As you can see, all of the arguments passed to call basically just get appended to a URL. That includes the responseUrl thats a route on my local app that ngrok will proxy access for me. Whew. Got all that? And yeah - I'm not doing any error handling in this demo. Ok, so let's look at /response:


/*
I handle the vxml response. I need to replace a grand total of one variable. I could
use Handlebars/etc, but that seemed silly for a simple replacement.

Using a sync file read since it is read one time.
*/
var responseVXML = fs.readFileSync('./response.vxml').toString();
app.get('/response', function(req, res) {
	var saveURL = app.get('domain') + '/save';
	var toSend = responseVXML.replace('{{saveURL}}',saveURL);
	res.send(toSend);
});

The comments in the handler mostly explain what I'm doing. I wrote my VoiceXML response as a file but I needed one small part of it to be dynamic. I could have used a templating language like Handlebars, but since it was literally one value, I just went with a quick string replace. Most likely that's something I'd change later on if this project got more complex. Here is the XML:


<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1">
    <form>
	<record name="recording" beep="true" dtmfterm="true" maxtime="100s">
        <prompt>Hello from a Node.js app. Leave me a message.</prompt>
        <prompt>When done, please hit the pound key.</prompt>
		<filled>
			<prompt>
				Your recording was saved. Thank you.
			</prompt>
			<submit next="{{saveURL}}" method="post" namelist="recording" enctype="multipart/form-data"/>
		</filled>
	</record>
    </form>
</vxml>

So, VoiceXML is rather complex (see the docs over on Nexmo for more examples), but what we've got here is a packet that includes two sentences that will be read to the user over the phone. You can tweak the voice if you want. I specify where Nexmo should send the recording, and if I wanted to include additional information, I'd add it to my XML. Most likely you will need to so you can pass along a unique identifier for the request that you can use in the final step, where the recording data is sent back.

As an aside, the system worked really well in terms of pronunciation. As I said, I had to handle a "weird web name" but outside of that it was fine. Also, in one test I kinda mumbled real soft, and the system prompted again for the recording automatically. All in all it felt very polished and professional. And look - I actually recording a call from my iPhone. Pardon the kids who - even though I asked them to be quiet - decided it wasn't time for that. (The phone call comes in around 10 seconds into the video. Sorry for not trimming.)

Alright - so let's look at the final aspect - saving the recording. Remember, in the previous code, we told Nexmo what to say and where to send it. It can send both regular form fields and the audio all in one post. I used Formidable to make processing this simple.


app.post('/save', function(req, res) {
	console.log('about to save like a boss');
	var form = new formidable.IncomingForm();
	form.parse(req, function(err, fields, files) {
		console.log('fields', fields);
		console.log('files',files);
		if(files.recording) {
			console.log('Yes, we have a recording.');
			fs.createReadStream(files.recording.path).pipe(fs.createWriteStream('./recordings/'+files.recording.name));
		}
	});
	res.send(`
<?xml version="1.0" encoding="UTF-8"?>
<vxml version = "2.1" >
   <form>
      <block>
         <exit/>
      </block>
   </form>
</vxml>
		`);
});

Yeah, that's a big ugly XML packet I'm sending back there in the end. Nexmo seems to work fine withouth a 'formal' XML response, but I believe this is what you should do and so I did it. (Although again, the code here could be cleaner.)

In case your curious, Nexmo sent the recording very fast. I saw it show up on my server before the "Your recording was saved." text was spoken to me. You can listen to my last test here:

recording-1465500970679.wav

Hopefully this was helpful. Let me know if you have any questions by leaving a comment below.