Linking dynamic content to Wikipedia


This morning while in the shower (don't ask - I don't know why these things come up when they do) I thought it might be interesting to write up a quick demo of hyperlinking dynamic content to Wikipedia pages. I've seen a few sites that will link certain keywords to Wikipedia so folks can get more information about a particular concept or idea. I'm not sure if this is going to be actually useful, but here is what I came up with.

Let's begin by creating some simple static variables. So for example, here is our "article":

<!--- This represents our database content. --->
<cfsavecontent variable="body">
This is a body of text. It will mention keywords like moonpies and lava. It will also include other things like beer, Belgium, and Klaus.
</cfsavecontent>

Next, here are the keywords we will be looking for. Now you may ask - why not simply link any keyword in the text? Well, this would require Wikipedia having an API that can take a block of text and return "recognized" words. However, even if Wikipedia did have that, I probably wouldn't use it. You wouldn't want most of your content to turn into links. You really want to focus on the critical, important words in your text and not things that don't really relate to the concepts at hand. This is where your SMEs (subject matter experts) come in and tell you what words make sense to auto-link. Also - I'd probably imagine a system where you have a global list of keywords (things that are always linked) as well as article-specific keywords. As an example, "Ewok" may not be globally linked on a Star Wars site, but would be linked on an article concerning Return of the Jedi. Obviously there are many different alternatives here. For now though, I just have a list:

<!--- This represents our list of keywords. --->
<cfset keywords = "moonpies,lava,beer,belgium">
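As a rough illustration of the global-plus-article-specific idea mentioned above, the two lists could simply be combined before linking. This is only a sketch: the variable names globalKeywords, articleKeywords, and allKeywords, along with their values, are made up for the example and are not part of the demo script.

```cfml
<!--- Hypothetical: keywords that are linked on every page of the site. --->
<cfset globalKeywords = "Star Wars,Jedi">

<!--- Hypothetical: extra keywords picked for this one article. --->
<cfset articleKeywords = "Ewok,Endor">

<!--- listAppend joins the two comma-delimited lists into one. --->
<cfset allKeywords = listAppend(globalKeywords, articleKeywords)>

<cfoutput>#allKeywords#</cfoutput>
```

The combined list could then be passed to the linking UDF in place of the single keywords list.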

Ok, finally, let's write a simple UDF to handle our links:

<cffunction name="wikify" output="false" returnType="string">
	<cfargument name="str" type="string" required="true">
	<cfargument name="keywords" type="string" required="true">
	<cfset var keyword = "">
	<cfset var link = "">

	<cfloop index="keyword" list="#arguments.keywords#">
		<cfset link = '<a href="http://wikipedia.org/wiki/#urlEncodedFormat(keyword)#">'>
		<cfset arguments.str = replaceNoCase(arguments.str, keyword, link & keyword & "</a>", "all")>
	</cfloop>

	<cfreturn arguments.str>
</cffunction>

This UDF isn't terribly complex. All it does is loop through the keywords and create a link to each one. I followed the "syntax" I saw on Wikipedia where /wiki/X always works, even if a term isn't found. So if we actually call and output our result...

<!--- Call our UDF to wiki-fy our keywords. --->
<cfset content = wikify(body, keywords)>

<cfoutput>#content#</cfoutput>

We get:

This is a body of text. It will mention keywords like moonpies and lava. It will also include other things like beer, belgium, and Klaus.

Close, but notice how Belgium became belgium. I bet they won't like that. Let's update our UDF to version 2:

<cffunction name="wikify2" output="false" returnType="string">
	<cfargument name="str" type="string" required="true">
	<cfargument name="keywords" type="string" required="true">
	<cfset var keyword = "">
	<cfset var link = "">

	<cfloop index="keyword" list="#arguments.keywords#">
		<cfset link = '<a href="http://wikipedia.org/wiki/#urlEncodedFormat(keyword)#">'>
		<cfset arguments.str = reReplaceNoCase(arguments.str, "(" & keyword & ")", link & "\1" & "</a>", "all")>
	</cfloop>

	<cfreturn arguments.str>
</cffunction>

Version 2 makes one critical change. Instead of a simple replaceNoCase, I switched to reReplaceNoCase, the regex version. This allows me to match the keyword and use it, case preserved, in the replacement. Woot. Here is the result.

This is a body of text. It will mention keywords like moonpies and lava. It will also include other things like beer, Belgium, and Klaus.

Nice - but I wanted to make it better. I noticed that if I included a term like St. Louis, the result didn't end up on the right St. Louis page. As I said earlier, Wikipedia won't throw an error, it just says it doesn't know about the content. While I could form the keyword in such a way that it worked, I decided to take a look at the search form, which seems to always work nicely. I did what any self-respecting web developer would do - I viewed source. I noticed their form was not using POST. That meant I could link to their search directly using a simple link. I copied over the hidden form field they had, and then simply updated my link like so.

<cffunction name="wikify3" output="false" returnType="string">
	<cfargument name="str" type="string" required="true">
	<cfargument name="keywords" type="string" required="true">
	<cfset var keyword = "">
	<cfset var link = "">

	<cfloop index="keyword" list="#arguments.keywords#">
		<cfset link = '<a href="http://wikipedia.org/w/index.php?title=Special:Search&search=#replace(keyword,' ','+','all')#">'>
		<cfset arguments.str = reReplaceNoCase(arguments.str, "(" & keyword & ")", link & "\1" & "</a>", "all")>
	</cfloop>

	<cfreturn arguments.str>
</cffunction>

In this version, the only thing new is the link. Notice the replace call there. I found that when I used urlEncodedFormat, it seemed to go a bit too far in escaping, specifically the "." in St. Louis. Simply replacing spaces seemed to work fine for me. The end result seems to work great:

This is a body of text. It will mention keywords like moonpies and lava. It will also include other things like beer, Belgium, and St. Louis.
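One thing to watch with a keyword like "st. louis": the keyword is dropped straight into the regular expression, so the "." matches any character rather than only a literal period. A minimal sketch of one way around that is below. The escapeRegex helper is a hypothetical name (it is not part of the test script), and the sketch assumes that "\\" in a reReplace replacement string produces a literal backslash.

```cfml
<!--- Hypothetical helper: put a backslash in front of regex metacharacters. --->
<cffunction name="escapeRegex" output="false" returnType="string">
	<cfargument name="str" type="string" required="true">
	<!--- The character class lists the metacharacters. In the replacement,
	      \\ is assumed to emit a backslash and \1 the matched character. --->
	<cfreturn reReplace(arguments.str, "([\\\^\$\.\|\?\*\+\(\)\[\]\{\}])", "\\\1", "all")>
</cffunction>

<!--- Build the pattern from the escaped keyword instead of the raw one. --->
<cfset pattern = "(" & escapeRegex("st. louis") & ")">
<cfset linked = reReplaceNoCase("Welcome to St. Louis.", pattern, "<b>\1</b>", "all")>

<cfoutput>#htmlEditFormat(linked)#</cfoutput>
```

Inside the UDF, the same idea would mean wrapping keyword in escapeRegex() where the pattern is built, while leaving the link itself untouched.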

Woot. Almost there. I figured I'd have a problem with "submatches" - in other words, using a keyword of "cat" and having something like catalog in my text. I was right. You can see this here:

This is a body of text. It will mention cats and catalogs.

Luckily - regex can help us again. There is a way to tell regular expressions to look for a word boundary, i.e., match X as a whole word, not as part of a longer word. This is done with the \b escape sequence. Here is version 4 of the UDF.

<cffunction name="wikify4" output="false" returnType="string">
	<cfargument name="str" type="string" required="true">
	<cfargument name="keywords" type="string" required="true">
	<cfset var keyword = "">
	<cfset var link = "">

	<cfloop index="keyword" list="#arguments.keywords#">
		<cfset link = '<a href="http://wikipedia.org/w/index.php?title=Special:Search&search=#replace(keyword,' ','+','all')#">'>
		<cfset arguments.str = reReplaceNoCase(arguments.str, "(\b" & keyword & "\b)", link & "\1" & "</a>", "all")>
	</cfloop>

	<cfreturn arguments.str>
</cffunction>

This correctly handles the issue: "catalogs" is left alone. (Note that the whole-word match also skips "cats", so plurals would need to be listed as their own keywords.) I've included the entire test script below (multiple UDFs, multiple calls, etc.) if you want to download and play with it. While it runs fast enough, you probably want to consider caching the result of the update. Assuming you are using ColdFusion 9, that would be a quick cacheGet/cachePut set of calls.

<cffunction name="wikify" output="false" returnType="string">
	<cfargument name="str" type="string" required="true">
	<cfargument name="keywords" type="string" required="true">
	<cfset var keyword = "">
	<cfset var link = "">

	<cfloop index="keyword" list="#arguments.keywords#">
		<cfset link = '<a href="http://wikipedia.org/wiki/#urlEncodedFormat(keyword)#">'>
		<cfset arguments.str = replaceNoCase(arguments.str, keyword, link & keyword & "</a>", "all")>
	</cfloop>

	<cfreturn arguments.str>
</cffunction>

<cffunction name="wikify2" output="false" returnType="string">
	<cfargument name="str" type="string" required="true">
	<cfargument name="keywords" type="string" required="true">
	<cfset var keyword = "">
	<cfset var link = "">

	<cfloop index="keyword" list="#arguments.keywords#">
		<cfset link = '<a href="http://wikipedia.org/wiki/#urlEncodedFormat(keyword)#">'>
		<cfset arguments.str = reReplaceNoCase(arguments.str, "(" & keyword & ")", link & "\1" & "</a>", "all")>
	</cfloop>

	<cfreturn arguments.str>
</cffunction>

<cffunction name="wikify3" output="false" returnType="string">
	<cfargument name="str" type="string" required="true">
	<cfargument name="keywords" type="string" required="true">
	<cfset var keyword = "">
	<cfset var link = "">

	<cfloop index="keyword" list="#arguments.keywords#">
		<cfset link = '<a href="http://wikipedia.org/w/index.php?title=Special:Search&search=#replace(keyword,' ','+','all')#">'>
		<cfset arguments.str = reReplaceNoCase(arguments.str, "(" & keyword & ")", link & "\1" & "</a>", "all")>
	</cfloop>

	<cfreturn arguments.str>
</cffunction>

<cffunction name="wikify4" output="false" returnType="string">
	<cfargument name="str" type="string" required="true">
	<cfargument name="keywords" type="string" required="true">
	<cfset var keyword = "">
	<cfset var link = "">

	<cfloop index="keyword" list="#arguments.keywords#">
		<cfset link = '<a href="http://wikipedia.org/w/index.php?title=Special:Search&search=#replace(keyword,' ','+','all')#">'>
		<cfset arguments.str = reReplaceNoCase(arguments.str, "(\b" & keyword & "\b)", link & "\1" & "</a>", "all")>
	</cfloop>

	<cfreturn arguments.str>
</cffunction>

<!--- This represents our database content. --->
<cfsavecontent variable="body">
This is a body of text. It will mention keywords like moonpies and lava. It will also include other things like beer, Belgium, and Klaus.
</cfsavecontent>

<!--- This represents our list of keywords. --->
<cfset keywords = "moonpies,lava,beer,belgium">

<!--- Call our UDF to wiki-fy our keywords. --->
<cfset content = wikify2(body, keywords)>

<cfoutput>#content#</cfoutput>

<hr/>

<cfsavecontent variable="body">
This is a body of text. It will mention keywords like moonpies and lava. It will also include other things like beer, Belgium, and St. Louis.
</cfsavecontent>

<!--- This represents our list of keywords. --->
<cfset keywords = "moonpies,lava,beer,belgium,st. louis">

<!--- Call our UDF to wiki-fy our keywords. --->
<cfset content = wikify3(body, keywords)>

<cfoutput>#content#</cfoutput>

<hr/>

<cfsavecontent variable="body">
This is a body of text. It will mention cats and catalogs.
</cfsavecontent>

<!--- This represents our list of keywords. --->
<cfset keywords = "cat">

<!--- Call our UDF to wiki-fy our keywords. --->
<cfset content = wikify4(body, keywords)>

<cfoutput>#content#</cfoutput>
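The caching suggestion above could look something like the following. Treat it as a sketch under a few assumptions: it runs on ColdFusion 9 or later, it reuses the body and keywords variables and the wikify4 UDF from the script above, and the cache key format and one-hour lifetime are arbitrary choices made for the example.

```cfml
<!--- Hypothetical cache key: hashing ties the entry to this exact text and keyword list. --->
<cfset cacheKey = "wikify_" & hash(body & "|" & keywords)>
<cfset content = cacheGet(cacheKey)>

<!--- cacheGet returns null on a miss, so the regex work only happens then. --->
<cfif isNull(content)>
	<cfset content = wikify4(body, keywords)>
	<!--- Keep the linked text for one hour (arbitrary; tune to how often content changes). --->
	<cfset cachePut(cacheKey, content, createTimeSpan(0, 1, 0, 0))>
</cfif>

<cfoutput>#content#</cfoutput>
```

Because the key is derived from both the text and the keyword list, editing either one naturally produces a fresh cache entry rather than serving stale links.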


About Raymond Camden

Raymond is a senior developer evangelist for Adobe. He focuses on document services, JavaScript, and enterprise cat demos.

Lafayette, LA https://www.raymondcamden.com

Archived Comments

Comment 1 by Gary Stanton posted on 6/1/2010 at 5:13 PM

Heh! That's awesome... I always fancied doing something like this on one of my sites, but it was never important enough to seriously look into... Thanks!

Comment 2 by Zarko posted on 6/1/2010 at 5:24 PM

Good one... I always liked these "incremental" posts :)

Comment 3 by Raymond Camden posted on 6/1/2010 at 5:26 PM

Thanks @Zarko - they are hard to do unless you plan for it. ;) I knew I'd run into a few issues along the way so I made sure to work on progressively updated UDFs.

Comment 4 by Zarko posted on 6/1/2010 at 5:29 PM

Suggestion: if you have a "large" body, 1000+ chars of text, keeping it in a variable could be problematic. Maybe the next step would be to do this in jQuery? :)

Comment 5 by Raymond Camden posted on 6/1/2010 at 5:42 PM

Why would 1K+ chars be an issue? If we assume an "article" length on the web as our average size, I think this solution would work fine, and would be even faster with simple caching. I love jQuery - and I am considering a jQuery version of this, but I don't see a great _need_ to use it over server side, know what I mean? I was actually considering jQuery for something else: Highlighting the auto links with a 'marker' so folks know it goes off site.

Comment 6 by andy matthews posted on 6/1/2010 at 5:48 PM

@Zarko...

Drawback to doing it via AJAX is that search engines wouldn't see the links as it's done client side. If that's not a concern then AJAX would certainly be a good option.

Comment 7 by andy matthews posted on 6/1/2010 at 5:54 PM

Ray...

You wouldn't even need jQuery for a marker. Just use an attribute selector:

a[href*='wikipedia'] {
background: url('http://www.adinstruments.co... no-repeat;
padding-left: 18px;
line-height: 22px;
}

with this type of HTML:
<a href="http://wikipedia.org/w/inde...">something like this</a>

Comment 8 by Raymond Camden posted on 6/1/2010 at 5:56 PM

That works in CSS2 or 3?

Comment 9 by Zarko posted on 6/1/2010 at 6:03 PM

@Andy - Usually you don't want Google to crawl out from your text, especially not to wikipedia :) I'd even add to Ray's code nofollow option to try to avoid this.
But makes sense if you don't care about SEO

Comment 10 by andy matthews posted on 6/1/2010 at 6:09 PM

Ray...

Attribute selectors have been around since CSS2, but apparently the version I provided is CSS3:

http://www.w3.org/TR/css3-s...

Comment 11 by Raymond Camden posted on 6/1/2010 at 6:22 PM

@Andy - Interesting. Probably safe enough to use and not worry about the folks not supporting CSS3.

@Zarko - (btw - love your name - is that ok to say? reminds me of Ming from Flash Gordon) good point on the nofollow!

Comment 12 by Zarko posted on 6/1/2010 at 6:45 PM

@Ray - Thanks! I read your comment out loud, asked "What's Flash Gordon?" and a millisecond later heard a yell and noticed a pencil flying towards my face from the direction of my team manager Marko Simic (runner-up from the CF9 contest). Sorry! Sorry! I feel bad that I grew up a little bit later, so (again) Wikipedia helped me http://en.wikipedia.org/wik.... Žarko means glowing/bright in our language.

Comment 13 by todd sharp posted on 6/2/2010 at 12:36 AM

Nice one Ray. One possible enhancement would be to limit the auto-linking to be only the first occurrence of a given keyword. Perhaps you could add an optional third arg to the UDF to toggle that on/off?

Comment 14 by Raymond Camden posted on 6/2/2010 at 12:38 AM

Yeah - I can see using an argument to handle either/or.

Comment 15 by Jose Galdamez posted on 6/2/2010 at 2:03 AM

I was a bit hesitant to go through this blog post from top to bottom cause of the length, but like Zarko said, the progressive nature of it made it a lot easier to go through. I think the RegEx approach is pretty clever. Great work, Ray.

Re: the CSS2 vs. CSS3 selectors, one work around would be to use jQuery since it supports CSS3 selectors. The more consistent approach, IMO, would be just give the link some funky class. You know, like

<a href="http://www.wikipedia.org/foo" class="funky">Linked Text</a>

Comment 16 by Scott A. Wimmer posted on 6/12/2010 at 4:32 AM

The shower is where my best concepts originate.