Building Your Own Serverless Search Engine with OpenWhisk

May 2, 2017 serverless openwhisk

This is a demo I've been working on for some time. It isn't necessarily that complex (or cool), but it's just taken me a while to get the parts together. As you know, I'm a huge proponent of static site generators. My own site is run on one and I recently released a book on the topic with Brian Rinaldi.

One of the things I cover in the book is how to "bring dynamic back" to a static site. That includes things like forms, comments, and search. In the book I recommend Google's Custom Search Engine feature. It's what I use for search here and it works well.

But I got to thinking - how difficult would it be to set up a similar system with OpenWhisk? All I would need is two parts:

A way to index my content.
A way to search my content via an API.

Turns out, there's a pretty cool service that does this already - Tapir. Tapir lets you specify a site and it will begin indexing it for you automatically. You then simply use an API endpoint in your code to perform searches. It's a cool service, but while it still "runs", the folks behind it no longer support it, so I can't recommend it. It does serve as a good basis for what I'd like to build in OpenWhisk, though, so that's how I got started!

Indexing

To handle indexing, I needed a few components. First, I needed a way to parse RSS entries. That's easy enough. I built an action for this a few months ago. You can see it here: https://github.com/cfjedimaster/Serverless-Examples/tree/master/rss. Here's the code:

const request = require('request');
const parseString = require('xml2js').parseString;

function main(args) {

	return new Promise((resolve, reject) => {

		if(!args.rssurl) {
			// return here so we don't fall through and request an undefined URL
			return reject({error:"Argument rssurl not passed."});
		}

		request.get(args.rssurl, function(error, response, body) {
			if(error) return reject(error);

			parseString(body, {explicitArray:false}, function(err, result) {
				if(err) return reject(err);
				resolve({entries:result.rss.channel.item});
			});

		});

	});
}

exports.main = main;

Pretty trivial. I'm using xml2js to handle the parsing and then filtering down the result to just the items. Done.
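
One thing to watch with the explicitArray:false option: xml2js only builds an array when there is more than one child, so a feed with a single item hands back a plain object, and an empty feed hands back nothing at all. A minimal sketch of a guard for that, assuming a hypothetical helper named toEntryArray (it is not part of the action above):

```javascript
// Hypothetical helper, not part of the original action.
// Normalizes whatever the parser returned for channel.item into an array:
//   - several <item> elements -> already an array, returned as-is
//   - exactly one <item>      -> a plain object, wrapped in an array
//   - no <item> at all        -> undefined, turned into an empty array
function toEntryArray(item) {
	if (item === undefined || item === null) return [];
	return Array.isArray(item) ? item : [item];
}

// Made-up shapes mimicking what the parser could hand back.
const many = toEntryArray([{ title: 'A' }, { title: 'B' }]);
const one = toEntryArray({ title: 'Only post' });
const none = toEntryArray(undefined);

console.log(many.length, one.length, none.length); // prints: 2 1 0
```

In the action, this could wrap result.rss.channel.item before resolving, so later steps can always call map on the entries.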

The next thing I needed was a way to work with ElasticSearch. IBM Bluemix lets you provision a new instance in seconds, so I did that. Once I had it provisioned, I made use of a package I created to work with ElasticSearch. You can find the code here: https://github.com/cfjedimaster/Serverless-Examples/tree/master/elasticsearch

It's also shared on OpenWhisk itself so you can bind your own copy at "/rcamden@us.ibm.com_My Space/elasticsearch". The package has actions for adding items to your ElasticSearch instance, performing bulk operations, and doing searches. Obviously ElasticSearch supports more, but I built only what I needed. For my usage, I decided to use the bulk operation. I figured I'd take the RSS items and insert them all at once.

To make this work, I made a new action to sit between them. I touched on this in my previous post about using sequences as a way to massage input/output for actions. In that post I was focused on input/output, but a sequence can just as well be two "pure" actions with one in the middle that massages the data from one to the other. In my case, the action was called flattenRSSEntriesForSearch. Here's the code.

function main(args) {

	/*
	create a new array of:
		url
		body (the RSS description)
		published (the RSS pubDate)
		title
	*/
	let entries = args.entries.map( (entry) => {
		return {
			url:entry.link,
			body:entry.description,
			published:entry.pubDate,
			title:entry.title
		};
	});

	/*
	ok, now we need to prep it for the bulk action
	PREPARE THE BULK!!!

	ex: 
	{ index:  { _index: 'myindex', _type: 'mytype', _id: 1 } },
     // the document to index
    { title: 'foo' },
	*/
	let bulk = [];

	entries.forEach( (e) => {
		let action = {"index":{"_type":"entry", "_id":e.url}};
		let document = e;
		bulk.push(action);
		bulk.push(document);
	});

	return {
		body:bulk,
		index:'blogcontent'
	}

}

exports.main = main;

For the most part, this is just "convert one array to another", and frankly, I just read the docs on bulk inserts and followed their direction. One cool thing about this setup is that each "index" action carries an explicit ID, so ElasticSearch replaces an item I already created instead of adding a duplicate. Notice I'm using the URL as the ID. Since URLs are unique, they work great as a primary key for my ElasticSearch data.
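
To make the bulk format concrete, here is a small sketch that runs the same pairing logic over two made-up entries and prints the result. The helper name toBulk and the example.com URLs are placeholders for illustration only.

```javascript
// Same pairing logic as the action above, condensed into one helper.
// Every document is preceded by an "index" action line that names its ID.
function toBulk(entries) {
	const bulk = [];
	entries.forEach((e) => {
		bulk.push({ index: { _type: 'entry', _id: e.url } });
		bulk.push(e);
	});
	return bulk;
}

// Made-up sample entries; URLs, dates, and titles are placeholders.
const sample = [
	{ url: 'https://example.com/post-1', body: '<p>One</p>', published: 'Mon, 01 May 2017 10:00:00 GMT', title: 'Post 1' },
	{ url: 'https://example.com/post-2', body: '<p>Two</p>', published: 'Tue, 02 May 2017 10:00:00 GMT', title: 'Post 2' }
];

const bulk = toBulk(sample);

// Two entries become four array items: action, document, action, document.
console.log(JSON.stringify(bulk, null, 2));
```

Running the sequence again with the same feed produces the same IDs, which is why a repeat run updates entries rather than piling up copies.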

So I took my RSS action, my 'joiner' action above, and my bulk action, and made a sequence called rssToES. Here's how I'd call it from the command line:

wsk action invoke -b -r rssToES --param rssurl http://feeds.feedburner.com/raymondcamden

Then all I needed to do was make a trigger to call this once a day. Bam. Done. (Ok, I lie. I didn't bother making the scheduled task because I'd probably forget about it and it's just a demo so there's no need for it, but I could. Honestly.)

Search

Ok, so how do we handle search? First, we need to support a search string, obviously. ElasticSearch has a hella-long list of ways to search, but they also support a simple "just give me a damn string" style search which is what I'll use now. It's supported by the search action of my package, so that's good to go. But - that's not what I want to expose to the web.

So once again, I rely on the technique in the last post of using a sequence to massage my data. First, I created an "input" action called rssSearchEntry:

let index = 'blogcontent';
let type = 'entry';

function main(args) {

    //args.q required - the search

    return {
        index:index,
        type:type,
        q:args.q
    }

}

My ElasticSearch search action needs the search term as well as the index and type. I set index and type to hard-coded values and then just pass on the search string. Search isn't too exciting but you can check out the code in the repo here.
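
For a sense of what a search call with those three values can look like, here is a minimal sketch assuming ElasticSearch's URI search, where the term travels in the q query string parameter. The helper name buildSearchPath is hypothetical, and this is not the code of the packaged search action, which may talk to ElasticSearch differently.

```javascript
// Hypothetical helper: builds the path for a URI search against one
// index and type. The search term is URL-encoded so spaces and
// punctuation survive the trip.
function buildSearchPath(index, type, q) {
	return `/${encodeURIComponent(index)}/${encodeURIComponent(type)}/_search?q=${encodeURIComponent(q)}`;
}

// The same values the input action hard codes, plus a sample term.
const path = buildSearchPath('blogcontent', 'entry', 'open whisk');

console.log(path); // prints: /blogcontent/entry/_search?q=open%20whisk
```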

Once the search action is done, I've got a result that includes metadata as well as matched documents. So I built a third action, rssSearchExit, to massage that into a simple array.

// I remove metadata from ES I don't care about
function main(args) {

   //credit for regex: http://stackoverflow.com/a/822464/52160
   let result = args.hits.hits.map((entry) => {
    return {
        url:entry._id,
        title:entry._source.title,
        published:entry._source.published,
        context:entry._source.body.replace(/<(?:.|\n)*?>/gm, '').substr(0,250),
        score:entry._score
    }  
   });
 
    return {
        headers:{
            'Access-Control-Allow-Origin':'*',
            'Content-Type':'application/json'
        },
        statusCode:200,
        // Buffer.from replaces the deprecated new Buffer() constructor
        body:Buffer.from(JSON.stringify(result)).toString('base64')
    };

}

Note that I also replace the body of the match, which includes the full HTML of the blog entry, with a shorter 'context' value that has HTML removed. This seemed like a good idea to me, but obviously you could leave that up to the client-side code if you wanted.
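
Here is the context trimming on its own, run against a made-up snippet of HTML. The helper name toContext is an assumption for illustration; the regex is the one used in the action above.

```javascript
// Strips anything that looks like a tag, then keeps the first
// maxLength characters of what is left.
function toContext(html, maxLength = 250) {
	return html.replace(/<(?:.|\n)*?>/gm, '').slice(0, maxLength);
}

// Made-up sample body.
const html = '<p>Hello <a href="https://example.com">OpenWhisk</a> world!</p>\n<p>Second paragraph.</p>';

console.log(JSON.stringify(toContext(html)));
// prints: "Hello OpenWhisk world!\nSecond paragraph."
```

Note that this only removes tags. HTML entities such as &amp;amp; stay encoded, and the cut at 250 characters can land mid-word.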

The last part of the action simply enables CORS and returns my result.
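
To show what the caller ends up with, this sketch builds the same kind of response object and then decodes the base64 body back into JSON. The helper name toWebResponse and the sample result are made up for illustration.

```javascript
// Mirrors the return value of the exit action: CORS header, JSON
// content type, and the payload as a base64-encoded JSON string.
function toWebResponse(result) {
	return {
		headers: {
			'Access-Control-Allow-Origin': '*',
			'Content-Type': 'application/json'
		},
		statusCode: 200,
		body: Buffer.from(JSON.stringify(result)).toString('base64')
	};
}

// Made-up search result.
const response = toWebResponse([{ url: 'https://example.com/post-1', title: 'Post 1', score: 1.5 }]);

// Decoding the body gives back the original array.
const decoded = JSON.parse(Buffer.from(response.body, 'base64').toString('utf8'));

console.log(decoded[0].title); // prints: Post 1
```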

So with those actions done, I made a new sequence that simply wrapped the three together and enabled web action support. Woot!

The Front End

Almost done with our Google Replacement. I'll be taking phone calls from Silicon Valley investors soon, I can just feel it! I whipped up a simple HTML form for the search:

<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		<title></title>
		<meta name="description" content="">
		<meta name="viewport" content="width=device-width">
	</head>
	<body>

		<h1>Search</h1>
		<p>
		<input type="search" id="search"> <input type="button" id="searchBtn" value="Search!">
		</p>

		<div id="results"></div>

		<script src="app.js"></script>
	</body>
</html>

Basically a search box, button, and empty DIV for the results. Now for the JavaScript. Before I share this code, note that I'm using some ES6 stuff here and that is completely arbitrary. It is not a requirement for working with OpenWhisk APIs. I just did it because I like the shiny.

document.addEventListener('DOMContentLoaded', init, false);

let $search, $searchBtn, $results;
let searchAPI = 'https://openwhisk.ng.bluemix.net/api/v1/web/rcamden@us.ibm.com_My%20Space/default/rssSearch.http?q=';

function init() {
	console.log('ready to listen to your every need...');
	$search = document.querySelector('#search');
	$searchBtn = document.querySelector('#searchBtn');
	$results = document.querySelector('#results');

	$searchBtn.addEventListener('click', doSearch, false);
}

function doSearch() {
	//clear results always
	$results.innerHTML = '';

	let value = $search.value.trim();
	if(value === '') return;
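	fetch(searchAPI + encodeURIComponent(value)).then( (resp) => {
		// returning the promise keeps the chain flat
		return resp.json();
	}).then((results) => {
		console.log(results);
		if(!results.length) {
			$results.innerHTML = '<p>Sorry, I found nothing for that.</p>';
			return;
		}

		let result = '<ul>';
		results.forEach((entry) => {
			result += `
<li><a href="${entry.url}">${entry.title}</a><br/>
${entry.context}</li>
			`;
		});
		result += '</ul>';
		$results.innerHTML = result;
	}).catch((err) => {
		// a network failure or a non-JSON response ends up here
		console.error(err);
		$results.innerHTML = '<p>Sorry, the search failed. Please try again.</p>';
	});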
}

Alright, so outside of the fancy ES6 stuff, this should be like every other AJAX-search engine built in the past decade. I've pointed to my endpoint (see searchAPI) and just do an AJAX call to get my results. And yeah... that's it. Want to see it live? (Keep in mind I am not actually updating my content in ElasticSearch via a schedule, so it's only got about 10 blog entries in it. I'd search for 'openwhisk'.)

https://cfjedimaster.github.io/Serverless-Examples/rss_search/frontend/
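
One possible hardening step for the rendering loop in doSearch: titles and context strings come from feed content, so a stray "<" or "&" can break the markup when it is dropped into innerHTML. A minimal sketch, assuming two hypothetical helpers (escapeHtml and renderResults) that are not part of the demo:

```javascript
// Hypothetical helper: escapes the characters that have special
// meaning in HTML text and attribute values.
function escapeHtml(text) {
	return String(text)
		.replace(/&/g, '&amp;')
		.replace(/</g, '&lt;')
		.replace(/>/g, '&gt;')
		.replace(/"/g, '&quot;')
		.replace(/'/g, '&#39;');
}

// Hypothetical helper: builds the same list markup as doSearch,
// but escapes every value before it goes into the string.
function renderResults(results) {
	if (!results.length) return '<p>Sorry, I found nothing for that.</p>';
	const items = results.map((entry) =>
		`<li><a href="${escapeHtml(entry.url)}">${escapeHtml(entry.title)}</a><br/>${escapeHtml(entry.context)}</li>`
	);
	return `<ul>${items.join('')}</ul>`;
}

// Made-up result with characters that would otherwise be read as markup.
console.log(renderResults([
	{ url: 'https://example.com/a?x=1&y=2', title: 'Tips & <b>Tricks</b>', context: 'ok' }
]));
```

With something like this in place, doSearch could set $results.innerHTML to the return value of renderResults(results).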

Wrap Up

So, I think this process is actually pretty cool. I've got complete control over how my content is indexed and I've got complete control over the search API. I also have more control over how it gets embedded in a static site. If I wanted to, I could go even further. Tapir provided endpoints to add older content (i.e. stuff not currently in your RSS feed) as well as to allow for updates and deletes. OpenWhisk isn't free, but I'm only going to be charged when indexing runs and when someone searches. Thoughts?