About a year or so ago I was lucky enough to catch a presentation from Marc Esher where he discussed NoSQL and MongoDB in particular. I found it interesting - but I filed it away as one of those cool technologies that I probably wouldn't ever have a good reason to actually use. Fast forward a year or so and I found myself thinking more about OpenAmplify and their API. Readers here know I've done a few blog posts on their text analysis API. One of the things I normally mention in my posts is that their API could be critical for a site that contains a lot of user generated content. It occurred to me that I know a site like that... this one.

Now - my blog does not get a lot of traffic in the grand scheme of things. But I have been running it for over 9 years. So while I don't get a lot of content every day (I average 12 or so comments per day), I do have a large set of data to look at (44000+ comments). As an experiment, I thought it would be interesting to run OpenAmplify against all 44000+ comments and see what kind of text analysis I could get from it. This quest involved two main aspects:

  1. Getting the analysis. OpenAmplify has an incredibly simple API. Basically, you send it text and get data back. I wasn't worried about this part.
  2. Reporting on the stats. That would be easy - once I had my data stored. Right?

It turned out the second item was a bit difficult. When you send your text to OpenAmplify, you get a lot of data back. I mean a lot of data. How big? Check out this screenshot. It's a portion of the result. If you click it, it will take you to a 1 meg picture of the entire cfdump.

So the immediate issue I had was... how do I store this? Storing a structure in a database isn't impossible. You can simply flatten it. So a struct like this one (the values here are just illustrative):
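<!--- a simple nested struct (illustrative values) --->
<cfset person = {
	age = 38,
	gender = "male",
	name = {
		firstname = "Raymond",
		lastname = "Camden"
	}
}>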

Could be stored in a database table with these columns: age, gender, name_firstname, and name_lastname. But that struct was simple. What happens when you have deeply nested structs? Arrays? Or the monster you see above? You get the idea. We've got a storage problem.
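To make that concrete, even the simple case means hand-mapping every key to a column (a sketch - the table name and datasource here are hypothetical):

<!--- flattening by hand: every nested key becomes a column --->
<cfquery datasource="#dsn#">
	insert into people( age, gender, name_firstname, name_lastname )
	values(
		<cfqueryparam cfsqltype="cf_sql_integer" value="#person.age#">,
		<cfqueryparam cfsqltype="cf_sql_varchar" value="#person.gender#">,
		<cfqueryparam cfsqltype="cf_sql_varchar" value="#person.name.firstname#">,
		<cfqueryparam cfsqltype="cf_sql_varchar" value="#person.name.lastname#">
	)
</cfquery>

And that approach falls apart completely for the OpenAmplify result. Enter MongoDB.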

One of the most interesting claims about the NoSQL space is that these databases let you store arbitrarily complex data. That's what I was told, anyway. I thought I'd take a look at how real that claim was. I began by downloading MongoDB and setting it up. Their docs are really well done and they have specific instructions for each operating system. I was up and running in about five minutes. Next was the ColdFusion aspect. For a while now, Marc Esher has worked on a wrapper for MongoDB called CFMongoDB. (He really should come up with some kind of a neater name. Maybe something with Cold in it!) I downloaded his code as well and took it for a spin. I had his sample code up and running rather quickly, so installation of the bits was no longer a problem.

Ok - so let's back up a bit and talk about the process. I began by creating a CFM script that I could schedule. It would take a few hundred blog comments that had not yet been processed, run OpenAmplify on them, and store the result as a JSON string in the database. Here is that script:

<cfsetting requesttimeout="999">
<cfset dsn = "myblog">
<cfset oakey = "moo">

<!--- get total count --->
<cfquery name="getCommentCount" datasource="#dsn#">
select count(id) as total
from tblblogcomments
</cfquery>

<cfoutput>
There are #numberFormat(getCommentCount.total)# comments.<br/>
</cfoutput>

<!--- get what we need to analyze --->
<cfquery name="getCommentCount" datasource="#dsn#">
select count(id) as total
from tblblogcomments
where analysis is null
</cfquery>

<cfoutput>
There are #numberFormat(getCommentCount.total)# comments to be analyzed.<br/>
</cfoutput>
<cfflush>

<!--- get a sub set --->
<cfquery name="getComments" datasource="#dsn#">
select id, comment
from tblblogcomments
where analysis is null
and comment is not null
and comment != ''
limit 0,600
</cfquery>

<cfset oa = new openamplify(oakey)>

<cfloop query="getComments">
	<cftry>
		<!--- run the analysis and store the result as a JSON string --->
		<cfset analysis = oa.parse(text=comment, analysis="all")>
		<cfset json = serializeJSON(analysis)>
		<cfquery datasource="#dsn#">
		update tblblogcomments
		set analysis = <cfqueryparam cfsqltype="cf_sql_longvarchar" value="#json#">
		where id = <cfqueryparam cfsqltype="cf_sql_varchar" value="#id#">
		</cfquery>
		<cfif currentRow mod 10 is 0>
			<cfoutput>Processed row #currentRow#<br/></cfoutput>
			<cfflush>
		</cfif>
		<cfcatch>
			<cfoutput>Skipped #id# because: #cfcatch.message#<br/></cfoutput>
			<!--- todo - save as {}? --->
		</cfcatch>
	</cftry>
</cfloop>

<cfoutput>
<p>
Done. Processed #getComments.recordCount# items.
</cfoutput>
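As an aside, since this is just a CFM page, putting it on a schedule is one tag (a sketch - the task name, URL, and interval here are made up; setting it up in the ColdFusion Administrator works just as well):

<!--- hypothetical task name and URL; interval is in seconds --->
<cfschedule action="update"
	task="processComments"
	operation="HTTPRequest"
	url="http://localhost/processcomments.cfm"
	startDate="06/01/2011"
	startTime="00:00"
	interval="3600">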

So why did I store the string in the database? I wanted a temporary solution while I was still learning MongoDB. I also knew it would take a few days to process all 44K entries. The OpenAmplify folks were kind enough to increase my usage limit, but I still planned for this to run over a week. (I believe it only took 3 days though.) Essentially, I used the database here as a simple holding area for large JSON strings.
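One small prerequisite I glossed over: the tblblogcomments table needs a nullable text column to hold the JSON. My actual column definition isn't shown in this post, but on MySQL something like "alter table tblblogcomments add analysis mediumtext null" would do it.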

The flip side then was to take that JSON and store it into MongoDB. MongoDB takes any data, right?

<cfsetting requesttimeout="999">
<cfscript>

	dbName = "deepcomment";
	javaloaderFactory = createObject('component','cfmongodb.core.JavaloaderFactory').init();
	mongoConfig = createObject('component','cfmongodb.core.MongoConfig').init(dbName=dbName, mongoFactory=javaloaderFactory);
	mongo = createObject('component','cfmongodb.core.Mongo').init(mongoConfig);

	collectionName = "comments";
	comments = mongo.getDBCollection( collectionName );

	//wipe out and start over
	comments.remove({});

	numComments = comments.count();

	//get comments with analysis
	q = new com.adobe.coldfusion.query();
	q.setDatasource("myblog");
	q.setSQL("select id, entryidfk, name, email, comment, analysis, posted from tblblogcomments where analysis is not null");
	commentQuery = q.execute().getResult();

	writeOutput("Found #numberFormat(numComments)# comments<br>");
	writeOutput("Found #numberFormat(commentQuery.recordCount)# comments in the database<br>");

	//populate the comments collection if we need to
	if( commentQuery.recordCount gt 0 ){
		all = [];

		for( i = 1; i lte commentQuery.recordCount; i++ ){

			comment = {
				id = commentQuery.id[i],
				entryidfk = commentQuery.entryidfk[i],
				name = commentQuery.name[i],
				email = commentQuery.email[i],
				posted = commentQuery.posted[i],
				analysis = deserializeJSON(commentQuery.analysis[i])
			};
			arrayAppend( all, comment );

			//save in batches of 1000 to keep memory use down
			if( i mod 1000 == 0 ){
				comments.saveAll( all );
				writeLog(file="application", text="inserted #i#, #arrayLen(all)#");
				all = [];
			}

		}

		//save whatever is left over
		comments.saveAll( all );
		writeOutput("inserted #i-1# comments<br>");
	}

	//get an idea of what the data looks like
	first = comments.find( sort={"ID"=-1}, limit=5 );
	writeDump( var=first.asArray(), label="first 5 comments", expand="false" );

	mongo.close();

</cfscript>

If you've never seen any MongoDB (or CFMongoDB) code before, let me explain it. (And let me be very clear here. I'm new to this whole area. If I screw anything up terminology-wise, I apologize in advance!) I begin by initializing the CFMongoDB wrappers. That would normally be within an Application.cfc (and in fact, I moved it there a bit later). But you can think of that as simply the setup of the connection between ColdFusion and MongoDB. By the way, MongoDB suffers from the same serious fault as ColdFusion. It doesn't work well if you forget to run it. Yes - I did that a lot.
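For the curious, moving that setup into Application.cfc looks something like this (a sketch - the application name is made up, but the init calls are the same ones from the script above, and the later report templates use these application variables):

component {

	this.name = "deepcommentdemo";

	public boolean function onApplicationStart() {
		var javaloaderFactory = createObject('component','cfmongodb.core.JavaloaderFactory').init();
		var mongoConfig = createObject('component','cfmongodb.core.MongoConfig').init(dbName="deepcomment", mongoFactory=javaloaderFactory);
		//stash the connection and collection name for the report templates
		application.mongo = createObject('component','cfmongodb.core.Mongo').init(mongoConfig);
		application.collectionName = "comments";
		return true;
	}

}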

I created an arbitrary collection to store my data and called it comments. The remove call there is only because my code was nuking and recreating the data while I tested. You do not have to do this in production. (And I assume you should not.)

Next I ran a query to get my comments (the ones with analysis). I loop over the results and for each one, create a structure that contains information about the comment as well as the deserialized analysis structure from OpenAmplify. Each struct is added to an array which then gets passed to MongoDB.

And it worked. Just like that. That ginormous set of data just... got stored. I didn't have to translate it into a table at all. The only real issue I ran into was memory management. You can see I nuke the array every 1000 items. That was ColdFusion/RAM related, not MongoDB's fault.

Finally at the end there you can see me getting a set of data just to look at and ensure things are kosher. "Getting" stuff is where things get interesting.

So - MongoDB supports a few interesting ways to fetch your data. You can do a find command based on a key. For example:

var cursor = db.comments.find({"ANALYSIS.Topics.TopTopics.Topic.Name":"grid"})
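A few more finds in the same vein, from the mongo shell (a sketch - note the uppercase key names, which is how ColdFusion serialized my structs):

// total documents in the collection
db.comments.count()
// count only the documents matching a criteria
db.comments.find({"ANALYSIS.Topics.TopTopics.Topic.Name":"grid"}).count()
// every distinct topic name
db.comments.distinct("ANALYSIS.Topics.TopTopics.Topic.Name")
// five most recent comments
db.comments.find().sort({POSTED:-1}).limit(5)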

That long key name is straight from OpenAmplify. MongoDB then knows to search for this particular key in the data, and it handles the fact that not every document will have that key. You can do counts, distinct counts, and sorted gets. There's an interesting SQL to Mongo chart that gives a lot of examples of this. Once you wrap your head around it, it almost feels a bit more natural than SQL. Here is an example that allows me to enter a keyword and get a report on how many times it showed up in comments:

<cfparam name="form.keyword" default="">

<cfoutput>
<form method="post">
Keyword: <input type="text" name="keyword" value="#form.keyword#">
<input type="submit" value="Look Up" />
</form>
</cfoutput>

<cfif len(trim(form.keyword))>

	<cfset comments = application.mongo.getDBCollection(application.collectionName)>
	<cfset criteria = { "ANALYSIS.Topics.TopTopics.Topic.Name" = form.keyword }>

	<!--- case sensitive --->
	<!--- <cfset res = comments.count(criteria=criteria)> --->

	<!--- case insensitive --->
	<cfset res = comments.query().regex("ANALYSIS.Topics.TopTopics.Topic.Name", "(?i)#form.keyword#").count()>

	<cfoutput>The keyword #form.keyword# comes up #res# time(s).</cfoutput>

</cfif>

Note how I point into the key path even though TopTopics is actually an array - MongoDB handles that just fine. The only issue I ran into was case sensitivity. You can see my original attempt (commented out) was a bit simpler. To make the search case insensitive I had to switch to a regex.
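One thing worth hedging about that regex: unanchored, it matches substrings, so a keyword like "grid" would also count topics such as "datagrid". Anchoring it keeps the match exact while staying case insensitive:

<!--- anchored version: exact (but still case insensitive) topic match --->
<cfset res = comments.query().regex("ANALYSIS.Topics.TopTopics.Topic.Name", "(?i)^#form.keyword#$").count()>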

So that's all well and good - but for deeper work you need to use what's called MapReduce. This is... I don't know. It's insane Ninja Voodoo to me. My gut feeling is that it's something along the lines of...

  • Here is an arbitrary function to run on each document...
  • It emits a basic stat
  • Those stats get combined and stored
  • And then I query against the stored stats

That's a pretty poor explanation. Essentially it is a two-step process. My understanding is that the "reduce" part is normally done behind the scenes on a schedule, since it involves the most work. The reporting is then done against the reduced data so that it runs a lot quicker. Here is an example. (And thanks again to Marc Esher for producing it for me.)

<cfscript>

	comments = application.mongo.getDBCollection( application.collectionName );

	//emit a count of 1 for every topic name found in a comment
	map = "
		function() {
			if(""ANALYSIS"" in this && ""Topics"" in this.ANALYSIS) {
				for(var i = 0; i < this.ANALYSIS.Topics.TopTopics.length; i++) {
					var topicOb = this.ANALYSIS.Topics.TopTopics[i];
					if(topicOb.Topic != null) {
						emit(topicOb.Topic.Name, {count:1});
					}
				}
			}
		}
	";

	//add up the emitted counts for each topic name
	reduce = "
		function(key, emits) {
			var total = 0;
			for( var i in emits ){
				total += emits[i].count;
			}
			return {count: total};
		}
	";

	result = comments.mapReduce( map=map, reduce=reduce, outputTarget="comment_topic_rank", options={} );

	//the reduced data lives in its own collection; grab the top 20
	ranks = application.mongo.getDBCollection("comment_topic_rank");
	sorted = ranks.find( sort={"value.count"=-1}, limit=20 );

</cfscript>

<h2>Keyword Report</h2>
<table border="1">
<tr>
	<th>Keyword</th>
	<th>Count</th>
</tr>

<cfloop index="item" array="#sorted.asArray()#">
	<cfoutput>
	<tr>
		<td>#item._id#</td>
		<td>#numberFormat(item.value["count"])#</td>
	</tr>
	</cfoutput>
</cfloop>

</table>

This template reports on all the topics in my database. While I had 44K comments, each comment had 1-N topics. The map function loops over each comment, loops over each of its topics, and emits a count; the reduce function adds those counts up. The totals get stored into comment_topic_rank (the reduced data, if I have the terminology right), and once I have that, it's trivial to sort against. For folks curious about my stats, here is the report:

This takes about 8 seconds to generate - but again, I'm regenerating the reduced data when I don't need to. And frankly, considering the amount of data it's parsing, that still seems rather fast.
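Since the reporting query only touches the output collection, a template that skips the mapReduce call would be just this (a sketch, assuming comment_topic_rank was already generated):

<cfscript>
	//no mapReduce call - just read the previously reduced collection
	ranks = application.mongo.getDBCollection("comment_topic_rank");
	sorted = ranks.find( sort={"value.count"=-1}, limit=20 );
</cfscript>

Here is another example that looks at the Mood data (how happy/sad are my comments):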

<cfscript>

	comments = application.mongo.getDBCollection( application.collectionName );

	//emit a count of 1 for each comment's overall mood
	map = "
		function() {
			if(""ANALYSIS"" in this && ""Styles"" in this.ANALYSIS) {
				emit(this.ANALYSIS.Styles.Polarity.Mean.Name, {count:1});
			}
		}
	";

	//add up the emitted counts for each mood
	reduce = "
		function(key, emits) {
			var total = 0;
			for( var i in emits ){
				total += emits[i].count;
			}
			return {count: total};
		}
	";

	result = comments.mapReduce( map=map, reduce=reduce, outputTarget="comment_mood_rank", options={} );

	ranks = application.mongo.getDBCollection("comment_mood_rank");
	sorted = ranks.find( sort={"value.count"=-1}, limit=20 );

</cfscript>

<h2>Mood Report</h2>
<table border="1">
<tr>
	<th>Mood</th>
	<th>Count</th>
</tr>

<cfloop index="item" array="#sorted.asArray()#">
	<cfoutput>
	<tr>
		<td>#item._id#</td>
		<td>#numberFormat(item.value["count"])#</td>
	</tr>
	</cfoutput>
</cfloop>

</table>

Which results in...

All in all - pretty fascinating, I think. Imagine how useful that would be to a company like Sony: use OpenAmplify to parse comments in real time as they come in, store them in MongoDB, and provide a live report of the mood on their forums. Even if a bit delayed, it could very well be a critical way to notice when something swings the mood. If customer service reps see a big drop, it could mean a breaking story (oh, like a security issue) that they need to address.

Anyway - I'm pretty darn impressed by MongoDB and the CFMongoDB wrapper. Do I think I'll be using it soon in production? I don't know. I think it was incredibly well suited to this use case. I would have shot myself before attempting to convert that large structure into a set of tables. It's nice to know that when such a problem does arise again, I'll have a good tool to make use of.

Some Final Notes

  • Stalk Marc Esher. No, seriously, do it. I've seen him present 3 times now. The first one was on continuous integration and Hudson. It changed my life. Seriously. The second time was this crap. Epic. Last time it was on ORM (and the issues you encounter). Not new to me, but his presentation did a damn good job of summarizing the issues. I've yet to see him give a bad presentation. Of course, now that I've said this I've probably created unrealistic expectations. ;)
  • Run MongoDB at the command line and pay attention to it. It outputs debug messages to the console while you work, and I found that very helpful while testing. (See the one-liner after this list.)
  • I've already linked to the MongoDB docs - but also check out their excellent cookbook.
  • Don't forget to report any issues you have with CFMongoDB. I found a few small bugs while running it, so I made sure to actually report them on the GitHub site. It's amazing how few people bother to do so. As an open source developer, let me make it clear: we can't fix bugs we don't know about.
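To expand on the second note, running the server in the foreground with verbose logging is one command (a sketch - the data path is whatever you picked during setup):

mongod --dbpath /path/to/your/data -v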