Linking dynamic content to Wikipedia

This morning while in the shower (don’t ask - I don’t know why these things come up when they do) I thought it might be interesting to write up a quick demo of hyperlinking dynamic content to Wikipedia pages. I’ve seen a few sites that will link certain keywords to Wikipedia so folks can get more information about a particular concept or idea. I’m not sure if this is going to be actually useful, but here is what I came up with.

Let's begin by creating some simple static variables. So for example, here is our "article":

<!--- This represents our database content. --->
<cfsavecontent variable="body">
This is a body of text. It will mention keywords like moonpies and lava. It will also include other things like beer, Belgium, and Klaus.
</cfsavecontent>

Next, here are the keywords we will be looking for. Now you may ask - why not simply link any keyword in the text? Well, this would require Wikipedia having an API that can take a block of text and return "recognized" words. However, even if Wikipedia did have that, I probably wouldn't use it. You wouldn't want most of your content to turn into links. You really want to focus on the critical, important words in your text and not things that don't really relate to the concepts at hand. This is where your SMEs (subject matter experts) come in and tell you what words make sense to auto-link. Also - I'd probably imagine a system where you have a global list of keywords (things that are always linked) as well as article specific content. As an example, "Ewok" may not be globally linked on a Star Wars site, but would be linked on an article concerning Return of the Jedi. Obviously there are many different alternatives here. For now though, I just have a list:

<!--- This represents our list of keywords. --->
<cfset keywords = "moonpies,lava,beer,belgium">
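The global-plus-article-specific idea mentioned above could be sketched roughly like this. The variable names (globalKeywords, articleKeywords, allKeywords) are made up for illustration and are not part of the demo; in a real system both lists would presumably come from your database.

```cfml
<!--- Hypothetical: keywords that are linked on every page of the site. --->
<cfset globalKeywords = "moonpies,lava">
<!--- Hypothetical: keywords that only make sense for this one article. --->
<cfset articleKeywords = "beer,belgium,lava">

<!--- Start with the global list, then add article keywords not already present. --->
<cfset allKeywords = globalKeywords>
<cfloop index="kw" list="#articleKeywords#">
    <!--- listFindNoCase returns 0 when the item is not in the list. --->
    <cfif not listFindNoCase(allKeywords, kw)>
        <cfset allKeywords = listAppend(allKeywords, kw)>
    </cfif>
</cfloop>

<!--- Outputs: moonpies,lava,beer,belgium --->
<cfoutput>#allKeywords#</cfoutput>
```

The merged list can then be passed to the UDF in place of the single keywords variable. Skipping duplicates matters here, since linking the same keyword twice would wrap an existing link in a second one.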

Ok, finally, let's write a simple UDF to handle our links:

<cffunction name="wikify" output="false" returnType="string">
    <cfargument name="str" type="string" required="true">
    <cfargument name="keywords" type="string" required="true">
    <cfset var keyword = "">
    <cfset var link = "">
    <cfloop index="keyword" list="#arguments.keywords#">
        <cfset link = '<a href="http://wikipedia.org/wiki/#urlEncodedFormat(keyword)#">'>
        <cfset arguments.str = replaceNoCase(arguments.str, keyword, link & keyword & "</a>", "all")>
    </cfloop>
    <cfreturn arguments.str>
</cffunction>

This UDF isn't terribly complex. All it does is loop through the keywords and create a link to each one. I followed the "syntax" I saw on Wikipedia where /wiki/X always works, even if a term isn't found. So if we actually call and output our result...

<!--- Call our UDF to wikify our keywords. --->
<cfset content = wikify(body, keywords)>
<cfoutput>#content#</cfoutput>

We get:

This is a body of text. It will mention keywords like moonpies and lava. It will also include other things like beer, belgium, and Klaus.

Close, but notice how Belgium became belgium. I bet they won't like that. Let's update our UDF to version 2:

<cffunction name="wikify2" output="false" returnType="string">
    <cfargument name="str" type="string" required="true">
    <cfargument name="keywords" type="string" required="true">
    <cfset var keyword = "">
    <cfset var link = "">
    <cfloop index="keyword" list="#arguments.keywords#">
        <cfset link = '<a href="http://wikipedia.org/wiki/#urlEncodedFormat(keyword)#">'>
        <cfset arguments.str = reReplaceNoCase(arguments.str, "(" & keyword & ")", link & "\1" & "</a>", "all")>
    </cfloop>
    <cfreturn arguments.str>
</cffunction>

Version 2 makes one critical change. Instead of a simple replaceNoCase, I switched to reReplaceNoCase, the regex version. This allows me to match the keyword and use it, case preserved, in the replacement. Woot. Here is the result.

This is a body of text. It will mention keywords like moonpies and lava. It will also include other things like beer, Belgium, and Klaus.
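To see the difference between the two functions in isolation, here is a tiny side-by-side sketch. The sample string and the square brackets (standing in for the link markup) are just for illustration.

```cfml
<cfset sample = "I like Belgium.">

<!--- replaceNoCase inserts the replacement text exactly as typed, so the casing of the original word is lost. --->
<cfset plain = replaceNoCase(sample, "belgium", "[belgium]", "all")>

<!--- reReplaceNoCase with a capture group: \1 is whatever text actually matched, so the original casing survives. --->
<cfset regex = reReplaceNoCase(sample, "(belgium)", "[\1]", "all")>

<cfoutput>
#plain#<br>   <!--- I like [belgium]. --->
#regex#       <!--- I like [Belgium]. --->
</cfoutput>
```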

Nice - but I wanted to make it better. I noticed that if I included a term like St. Louis, the result didn't end up on the right St. Louis page. As I said earlier, Wikipedia won't throw an error, it just says it doesn't know about the content. While I could form the keyword in such a way that it worked, I decided to take a look at the search form, which seems to always work nicely. I did what any self-respecting web developer would do - I viewed source. I noticed their form was not using POST. That meant I could link to their search directly using a simple link. I copied over the hidden form field they had, and then simply updated my link like so.

<cffunction name="wikify3" output="false" returnType="string">
    <cfargument name="str" type="string" required="true">
    <cfargument name="keywords" type="string" required="true">
    <cfset var keyword = "">
    <cfset var link = "">
    <cfloop index="keyword" list="#arguments.keywords#">
        <cfset link = '<a href="http://wikipedia.org/w/index.php?title=Special:Search&search=#replace(keyword,' ','+','all')#">'>
        <cfset arguments.str = reReplaceNoCase(arguments.str, "(" & keyword & ")", link & "\1" & "</a>", "all")>
    </cfloop>
    <cfreturn arguments.str>
</cffunction>

In this version, the only thing new is the link. Notice the replace call there. I found that when I used urlEncodedFormat, it seemed to go a bit too far in escaping, specifically the "." in St. Louis. Simply replacing spaces seemed to work fine for me. The end result seems to work great:

This is a body of text. It will mention keywords like moonpies and lava. It will also include other things like beer, Belgium, and St. Louis.
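One thing worth keeping in mind with a keyword like "st. louis": the keyword is dropped straight into a regular expression, and a period there means "any character", so the pattern is a bit looser than it looks. Below is a minimal sketch of one way to escape keywords first. The escapeRegex helper is hypothetical (it is not part of the demo or a built-in function), and it relies on replaceList working through the list in order so the backslash is handled first.

```cfml
<!--- Hypothetical helper: backslash-escape characters that have special meaning in a regex. --->
<cffunction name="escapeRegex" output="false" returnType="string">
    <cfargument name="str" type="string" required="true">
    <!--- The backslash comes first so the backslashes added for later characters are not escaped again. --->
    <cfset var special = "\,+,*,?,.,[,],^,$,(,),{,},|">
    <cfset var escaped = "\\,\+,\*,\?,\.,\[,\],\^,\$,\(,\),\{,\},\|">
    <cfreturn replaceList(arguments.str, special, escaped)>
</cffunction>

<cfset sample = "St. Louis is not Stx Louis.">

<!--- Unescaped: the period matches any character, so both phrases match. --->
<cfset loose = reReplaceNoCase(sample, "(st. louis)", "[\1]", "all")>

<!--- Escaped: only the literal "St. Louis" matches. --->
<cfset strict = reReplaceNoCase(sample, "(" & escapeRegex("st. louis") & ")", "[\1]", "all")>

<cfoutput>
#loose#<br>   <!--- [St. Louis] is not [Stx Louis]. --->
#strict#      <!--- [St. Louis] is not Stx Louis. --->
</cfoutput>
```

In the UDF, this would mean wrapping keyword in escapeRegex() when building the pattern, while still using the raw keyword for the link itself.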

Woot. Almost there. I figured I'd have a problem with "submatches" - in other words, using a keyword of "cat" and having something like catalog in my text. I was right. You can see this here:

This is a body of text. It will mention cats and catalogs.

Luckily - regex can help us again. There is a way to tell regular expressions to look for a word boundary, i.e., match X as a whole word, not as part of a longer word. This is done with the \b escape sequence. Here is version 4 of the UDF.

<cffunction name="wikify4" output="false" returnType="string">
    <cfargument name="str" type="string" required="true">
    <cfargument name="keywords" type="string" required="true">
    <cfset var keyword = "">
    <cfset var link = "">
    <cfloop index="keyword" list="#arguments.keywords#">
        <cfset link = '<a href="http://wikipedia.org/w/index.php?title=Special:Search&search=#replace(keyword,' ','+','all')#">'>
        <cfset arguments.str = reReplaceNoCase(arguments.str, "(\b" & keyword & "\b)", link & "\1" & "</a>", "all")>
    </cfloop>
    <cfreturn arguments.str>
</cffunction>

This correctly handles the issue. I've included the entire test script below (multiple UDFs, multiple calls, etc) if you want to download and play with it. While it runs fast enough, you probably want to consider caching the result of the update. Assuming you are using ColdFusion 9, that would be a quick cacheGet/cachePut set of calls.

<cffunction name="wikify" output="false" returnType="string">
    <cfargument name="str" type="string" required="true">
    <cfargument name="keywords" type="string" required="true">
    <cfset var keyword = "">
    <cfset var link = "">
    <cfloop index="keyword" list="#arguments.keywords#">
        <cfset link = '<a href="http://wikipedia.org/wiki/#urlEncodedFormat(keyword)#">'>
        <cfset arguments.str = replaceNoCase(arguments.str, keyword, link & keyword & "</a>", "all")>
    </cfloop>
    <cfreturn arguments.str>
</cffunction>

<cffunction name="wikify2" output="false" returnType="string">
    <cfargument name="str" type="string" required="true">
    <cfargument name="keywords" type="string" required="true">
    <cfset var keyword = "">
    <cfset var link = "">
    <cfloop index="keyword" list="#arguments.keywords#">
        <cfset link = '<a href="http://wikipedia.org/wiki/#urlEncodedFormat(keyword)#">'>
        <cfset arguments.str = reReplaceNoCase(arguments.str, "(" & keyword & ")", link & "\1" & "</a>", "all")>
    </cfloop>
    <cfreturn arguments.str>
</cffunction>

<cffunction name="wikify3" output="false" returnType="string">
    <cfargument name="str" type="string" required="true">
    <cfargument name="keywords" type="string" required="true">
    <cfset var keyword = "">
    <cfset var link = "">
    <cfloop index="keyword" list="#arguments.keywords#">
        <cfset link = '<a href="http://wikipedia.org/w/index.php?title=Special:Search&search=#replace(keyword,' ','+','all')#">'>
        <cfset arguments.str = reReplaceNoCase(arguments.str, "(" & keyword & ")", link & "\1" & "</a>", "all")>
    </cfloop>
    <cfreturn arguments.str>
</cffunction>

<cffunction name="wikify4" output="false" returnType="string">
    <cfargument name="str" type="string" required="true">
    <cfargument name="keywords" type="string" required="true">
    <cfset var keyword = "">
    <cfset var link = "">
    <cfloop index="keyword" list="#arguments.keywords#">
        <cfset link = '<a href="http://wikipedia.org/w/index.php?title=Special:Search&search=#replace(keyword,' ','+','all')#">'>
        <cfset arguments.str = reReplaceNoCase(arguments.str, "(\b" & keyword & "\b)", link & "\1" & "</a>", "all")>
    </cfloop>
    <cfreturn arguments.str>
</cffunction>

<!--- This represents our database content. --->
<cfsavecontent variable="body">
This is a body of text. It will mention keywords like moonpies and lava. It will also include other things like beer, Belgium, and Klaus.
</cfsavecontent>
<!--- This represents our list of keywords. --->
<cfset keywords = "moonpies,lava,beer,belgium">
<!--- Call our UDF to wikify our keywords. --->
<cfset content = wikify2(body, keywords)>
<cfoutput>#content#</cfoutput>

<hr/>

<cfsavecontent variable="body">
This is a body of text. It will mention keywords like moonpies and lava. It will also include other things like beer, Belgium, and St. Louis.
</cfsavecontent>
<!--- This represents our list of keywords. --->
<cfset keywords = "moonpies,lava,beer,belgium,st. louis">
<!--- Call our UDF to wikify our keywords. --->
<cfset content = wikify3(body, keywords)>
<cfoutput>#content#</cfoutput>

<hr/>

<cfsavecontent variable="body">
This is a body of text. It will mention cats and catalogs.
</cfsavecontent>
<!--- This represents our list of keywords. --->
<cfset keywords = "cat">
<!--- Call our UDF to wikify our keywords. --->
<cfset content = wikify4(body, keywords)>
<cfoutput>#content#</cfoutput>
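As for the caching suggestion, here is a rough sketch of what the cacheGet/cachePut calls might look like on ColdFusion 9. It is meant to be dropped in at the end of the script above, so it assumes body, keywords, and wikify4 are already defined. The cache key scheme and the one hour timeout are my own assumptions for illustration; a real site would more likely key on an article ID and clear the entry when the article or keyword list changes.

```cfml
<!--- Hypothetical cache key: hashing the inputs means a changed body or keyword list gets a fresh entry. --->
<cfset cacheKey = "wikified_" & hash(body & keywords)>

<!--- cacheGet returns null when nothing is stored under the key. --->
<cfset cached = cacheGet(cacheKey)>

<cfif isNull(cached)>
    <!--- Cache miss: do the work once, then store the result for one hour. --->
    <cfset cached = wikify4(body, keywords)>
    <cfset cachePut(cacheKey, cached, createTimeSpan(0, 1, 0, 0))>
</cfif>

<cfoutput>#cached#</cfoutput>
```

The first request pays for the regex work, and later requests within the hour just read the stored string.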


About Raymond Camden

Raymond is a developer advocate. He focuses on JavaScript, serverless and enterprise cat demos. If you like this article, please consider visiting my Amazon Wishlist or donating via PayPal to show your support.

Lafayette, LA https://www.raymondcamden.com
