November 7, 2007

Ask a Jedi: Getting all the link labels from a string in ColdFusion

coldfusion

A reader asked me how they could use regex to find all the link labels in a string. Not the links, but the label for each link. It is relatively easy to grab all the matches for a regex in ColdFusion 8. Consider the following code block:


<cfsavecontent variable="s">
This is some text. It is true that <a href="http://www.cnn.com">Harry Potter</a> is a good
magician, but the real <a href="http://www.raymondcamden.com">question</a> is how he would stand up
against Godzilla. That is what I want to <a href="http://www.adobe.com">see</a> - a Harry Potter vs Godzilla
grudge match. Harry has his wand, Godzilla has his <a href="http://www.cfsilence.com">breath</a>, it would
be <i>so</i> cool.
</cfsavecontent>

<cfset matches = reMatch("<[aA].*?>.*?</[aA]>",s)>
<cfdump var="#matches#">

I create a string with a few links in it. I then use the new reMatch function to grab all the matches. My regex says: find all HTML links. It isn't exactly perfect (it won't match a closing A tag that has an extra space in it), but you get the picture. This returns an array containing each complete link, tags and all.

But you will notice that the HTML tags are still there. How can we get rid of them? I simply looped over the array and did a second pass:


<cfset links = arrayNew(1)>
<cfloop index="a" array="#matches#">
	<cfset arrayAppend(links, rereplace(a, "<.*?>","","all"))>
</cfloop>

<cfdump var="#links#">

This gives you an array of just the link labels: Harry Potter, question, see, and breath.
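For comparison, here is a rough sketch of the same two-pass idea in JavaScript, the other language used in the comments below. The sample string is a shortened, hypothetical version of the one above, and `[\s\S]` stands in for the dot because a JavaScript dot does not match newlines.

```javascript
// Hypothetical sample input, shortened from the post's test string.
const s = 'It is true that <a href="http://www.cnn.com">Harry Potter</a> is a good ' +
  'magician, but the real <a href="http://www.raymondcamden.com">question</a> is ' +
  'whether it would be <i>so</i> cool.';

// Pass 1: grab every complete link, much like reMatch does.
// match() returns null when nothing is found, hence the fallback to [].
const matches = s.match(/<a[\s\S]*?>[\s\S]*?<\/a>/gi) || [];

// Pass 2: strip every tag from each match, leaving only the label.
const links = matches.map(function (m) {
  return m.replace(/<[\s\S]*?>/g, '');
});

console.log(links); // → ["Harry Potter", "question"]
```

The structure mirrors the ColdFusion above: one regex to collect whole links, then a second, simpler regex to remove the tags from each one.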

p.s. Running on ColdFusion 7? Try the reFindAll UDF as a replacement to reMatch.


Archived Comments

Comment 1 by todd sharp posted on 11/7/2007 at 7:40 PM

Coolness!

Here's a client side solution using JavaScript for those who may want to do the same:

<html>
<head>
<script>
getLabels = function(){
var labels = document.getElementsByTagName('a');
var container = document.getElementById('linkLabels');
for(var i = 0; i<labels.length; i++){
container.innerHTML += labels[i].innerHTML + '<br />';
}
}
</script>
</head>
<body>
<p>
This is some text. It is true that <a href="http://www.cnn.com">Harry Potter</a> is a good
magician, but the real <a href="http://www.coldfusionjedi.com">question</a> is how he would stand up
against Godzilla. That is what I want to <a href="http://www.adobe.com">see</a> - a Harry Potter vs Godzilla
grudge match. Harry has his wand, Godzilla has his <a href="http://www.cfsilence.com">breath</a>, it would
be <i>so</i> cool.
</p>
<h1>Link labels:</h1>
<div id="linkLabels"></div>
<input type="button" name="getLabelBtn" onclick="javascript:getLabels();" value="Get labels!" />
</body>
</html>

Comment 2 by Raymond Camden posted on 11/7/2007 at 7:47 PM

Does JS allow you to do a getElementsByTagName - but restrict it to crap inside a specific div/span? Ie, imagine that P was wrapped in div id="content", and I only wanted to scan that.

Doable?

Comment 3 by Doug posted on 11/7/2007 at 8:02 PM

Those link "labels" are usually called "link text" :-) I had to click through from the goog just to see if that's what you meant.

On the JS question, if you're using one of the many advanced JS libraries these days (mootools, jquery, etc), you can target the script using CSS selector notation, so it's very possible to target an element or specific elements within an element based on some pretty complex criteria.

Comment 4 by todd sharp posted on 11/7/2007 at 8:04 PM

Yep. Instead of document.getElementsByTagName you just reference it as element.getElementsByTagName like so:

<html>
<head>
<script>
getLabels = function(){
var p = document.getElementById('pContainer');
var labels = p.getElementsByTagName('a');
var container = document.getElementById('linkLabels');
for(var i = 0; i<labels.length; i++){
container.innerHTML += labels[i].innerHTML + '<br />';
}
}
</script>
</head>
<body>
<p id="pContainer">
This is some text. It is true that <a href="http://www.cnn.com">Harry Potter</a> is a good
magician, but the real <a href="http://www.coldfusionjedi.com">question</a> is how he would stand up
against Godzilla. That is what I want to <a href="http://www.adobe.com">see</a> - a Harry Potter vs Godzilla
grudge match. Harry has his wand, Godzilla has his <a href="http://www.cfsilence.com">breath</a>, it would
be <i>so</i> cool.
</p>
<p id="ignoreMe">There are <a href="http://www.coldfusionjedi.com">links</a> in <a href="http://www.adobe.com">here</a> but they are ignored.</p>
<h1>Link labels:</h1>
<div id="linkLabels"></div>
<input type="button" name="getLabelBtn" onclick="javascript:getLabels();" value="Get labels!" />
</body>
</html>

Comment 5 by todd sharp posted on 11/7/2007 at 8:06 PM

Good point Doug. Here is the Spry selector reference doc: http://labs.adobe.com/techn...

Comment 6 by Kam posted on 11/7/2007 at 8:07 PM

You can call getElementsByTagName() on any instance of a DOM element, so this is valid:

var foo = document.getElementById("foo");
var aInFoo = foo.getElementsByTagName("a");

Or using jquery, $("#foo a") :)

Comment 7 by Raymond Camden posted on 11/7/2007 at 8:11 PM

Nice guys. Thanks.

Comment 8 by H Jaber posted on 11/7/2007 at 8:14 PM

This regex <[aA].*?>.*?</[aA]> could also go like this <(a).*?>(.*?)</\1>. By placing the (a) in parentheses, we can use back referencing to match the closing tag, </\1> which refers to the (a). Also, placing the (.*?) like so will allow us to access the label without having to use the 2nd loop in your example, that would be back reference \2.

So we can just do the following: <cfset matches = rereplacenocase(s,"<(a).*?>(.*?)</\1>","\2","ALL")>

Comment 9 by Raymond Camden posted on 11/7/2007 at 8:18 PM

HJ - The problem though is that your code will return ALL the text labels mushed together. It will work fine if there is one link, but not if there are multiple.
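To make the limitation concrete, here is a small JavaScript sketch of the same replace-with-backreference idea. The sample string is made up for illustration, and ColdFusion's rereplacenocase may differ in details. In JavaScript at least, the call hands back one rewritten string with each link swapped for its label, so the labels are never available as separate items.

```javascript
// Hypothetical input with two links.
const s = 'Go <a href="http://www.cnn.com">here</a> or <a href="http://www.adobe.com">there</a>.';

// Same shape as the regex above: group 1 is the tag name, group 2 is the label.
// "$2" is JavaScript's spelling of the \2 backreference in a replacement string.
const replaced = s.replace(/<(a)[\s\S]*?>([\s\S]*?)<\/\1>/gi, '$2');

console.log(replaced);        // → "Go here or there."
console.log(typeof replaced); // → "string", not an array of labels
```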

Comment 10 by H Jaber posted on 11/7/2007 at 8:47 PM

A more complete example:

<cfset a = arraynew(1)>
<cfset start = 1>
<cfloop condition="true">
<cfset match = refindnocase("<(a).*?>(.*?)</\1>",s,start,true)>
<cfif match.pos[1] eq 0>
<cfbreak>
<cfelse>
<cfset start = match.pos[1] + match.len[1]>
<cfset lbl = mid(s,match.pos[3],match.len[3])>
<cfset arrayappend(a,lbl)>
</cfif>
</cfloop>
<cfdump var="#a#">

Comment 11 by James Allen posted on 11/7/2007 at 9:15 PM

As an alternative approach to this problem, which works in CF7 (and possibly older versions) here is the type of code I generally use:

<cfsavecontent variable="s">
This is some text. It is true that <a href="http://www.cnn.com">Harry Potter</a> is a good
magician, but the real <a href="http://www.coldfusionjedi.com">question</a> is how he would stand up
against Godzilla. That is what I want to <a href="http://www.adobe.com">see</a> - a Harry Potter vs Godzilla
grudge match. Harry has his wand, Godzilla has his <a href="http://www.cfsilence.com">breath</a>, it would
be <i>so</i> cool.
</cfsavecontent>

I didn't know reMatch in CF8 actually returns a list of the data matched, rather than the position and length as in older versions. Very cool.

Comment 12 by Lola LB posted on 11/8/2007 at 1:04 AM

The code samples are coming in a bit on the small side in Safari 3 - could you jack it up a little bit more? Thanks!

Comment 13 by Raymond Camden posted on 11/8/2007 at 1:21 AM

Confirmed. The CSS says this for font:

font: 500 1em/1.5em 'Lucida Console', 'courier new', monospace ;

That makes NO sense to me. 500.. 500 what? Any ideas folks?

Comment 14 by Steven Levithan posted on 11/8/2007 at 2:32 AM

@Raymond:

500 in this case is the font-weight. Most people just set font-weight to "normal" or "bold", but according to spec you could also use "bolder", "lighter", and the numeric values 100 - 900, where 400 is the same as normal, and 700 is the same as bold (see http://www.w3.org/TR/REC-CS... ).

As for the regex in this post, it will incorrectly match things like "<abbr>LASER</abbr>...<a>...</a>" as a single match. You can fix this by changing the leading "<[aA]" to "<[aA]\b". The regex is also quite inefficient, especially with invalid data. One easy change to avoid backtracking pitfalls when the data contains unclosed opening <a> tags would be to change "<[aA]\b.*?>" to "<[aA]\b[^>]*>".
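A quick way to see the <abbr> problem and the suggested fix is to run both patterns over the same input. This is only a sketch in JavaScript with a made-up test string; `[\s\S]` replaces the dot because a JavaScript dot does not cross newlines.

```javascript
// Hypothetical input: an <abbr> element followed by a real link.
const input = '<abbr>LASER</abbr> beams and <a href="http://www.adobe.com">see</a>';

// Original-style pattern: "<a" also matches the start of "<abbr".
const loose = /<[aA][\s\S]*?>[\s\S]*?<\/[aA]>/g;

// Suggested fix: \b demands a word boundary right after the "a",
// and [^>]* stops the opening tag at its own ">".
const strict = /<[aA]\b[^>]*>[\s\S]*?<\/[aA]>/g;

const looseMatches = input.match(loose);
const strictMatches = input.match(strict);

console.log(looseMatches);  // one match spanning from "<abbr>" to "</a>"
console.log(strictMatches); // → ['<a href="http://www.adobe.com">see</a>']
```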

Comment 15 by Raymond Camden posted on 11/8/2007 at 2:35 AM

Steven: Thanks for the CSS.

Good catch on <abbr> - but why would you say it is inefficient? If the fix is done like you said \b, are you saying it is still bad?

Comment 16 by Steven Levithan posted on 11/8/2007 at 4:11 AM

Yes, but only if efficiency is an important concern. If you are only working with small amounts of valid data, it won't make much difference either way.

Basically, in most cases where people use ".*" or ".*?" in a regex, it is not the most efficient way to accomplish their goal since it is not what they really mean (unless they're trying to match until the end of the string or line, in which case the former would be ideal). It works because of backtracking and other regex functionality which compensates for the impreciseness, but that has performance overhead. Take the equivalent patterns "<.*?>" and "<[^>]*>". The latter more accurately describes what is really meant, and will typically be faster as a result (I could break down the actual step processes your average regex engine takes when working with lazy vs. greedy quantification, and get into the nitty-gritty of backtracking and internal engine optimizations, etc., but this is probably not the best place for that). In fact, you might want to change that to "<[^<>]*>" so that when running against input like "<...<...>", it will fail invalid starting positions faster.
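The three tag patterns mentioned above can be compared side by side. This JavaScript sketch uses a made-up input containing a stray "<", and it only shows what each pattern matches; it says nothing about timing.

```javascript
// Hypothetical input with a stray "<" before a real tag.
const input = 'a < b and <i>so</i> cool';

const lazyDot = input.match(/<.*?>/g);     // lazy dot: runs from the stray "<" to the first ">"
const negated = input.match(/<[^>]*>/g);   // negated class: same matches here
const stricter = input.match(/<[^<>]*>/g); // also refuses to cross a second "<"

console.log(lazyDot);  // → ["< b and <i>", "</i>"]
console.log(negated);  // → ["< b and <i>", "</i>"]
console.log(stricter); // → ["<i>", "</i>"]
```

With the third pattern, the attempt that starts at the stray "<" is abandoned as soon as the next "<" shows up, which is the faster failure described above.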

The inner ".*?" in the regex "<a>.*?</a>" could be significantly optimized using Jeffrey Friedl's "unrolling the loop" pattern. However, that will have some impact on readability, and hence wouldn't be a good tradeoff in simple cases.
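For the curious, one possible shape of an unrolled version is sketched below in JavaScript. It follows the general "normal* (special normal*)*" form, where "normal" is any character other than "<" and "special" is a "<" that does not begin "</a>". The exact pattern and the test string are illustrative assumptions, not something taken from this thread.

```javascript
// Unrolled label match: [^<]* consumes plain text in one step, and the
// group handles each "<" that is not the start of the closing "</a>".
const unrolled = /<a\b[^>]*>([^<]*(?:<(?!\/a>)[^<]*)*)<\/a>/gi;

// Hypothetical input: one label contains nested markup.
const input = 'Godzilla has his <a href="http://www.cfsilence.com">fiery <i>breath</i></a>, ' +
  'so <a href="#">cool</a>.';

const labels = [];
let m;
// exec() with the g flag resumes from lastIndex on each call.
while ((m = unrolled.exec(input)) !== null) {
  labels.push(m[1]);
}

console.log(labels); // → ["fiery <i>breath</i>", "cool"]
```

As the comment says, the pattern is noticeably harder to read than ".*?", so it is probably only worth it on large or messy input.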

<rant>
In any case, although I love ColdFusion, I despise its implementation of regular expressions ... both the fact that it uses a weak flavor when the more powerful java.util.regex engine underlies, and the dearth of useful regex functions and functionality (pronounced by CF8's lame implementation of REMatch).
</rant>

Here are a couple ways you could implement this in JavaScript using a single iteration over the string (I'm using "[\S\s]" to match any character because in JavaScript dots don't match newlines like in ColdFusion):

var match, matches = [];
while (match = /<a\b[^>]*>([\S\s]*?)</a>/gi.exec(input)) {
matches.push(match[1]);
}

-- OR --

var matches = [];
input.replace(/<a\b[^>]*>([\S\s]*?)</a>/gi, function ($0, $1) {
matches.push($1);
});

In CF, you could do something quite similar using underlying Java objects and methods.

Comment 17 by Steven Levithan posted on 11/8/2007 at 4:18 AM

If anyone tries to run my JavaScript code, they will quickly discover that I forgot to escape my forward slashes inside the regex literals (i.e., "</a>" should be "<\/a>"). :-)
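Putting that correction together, a runnable version of both approaches might look like the sketch below. The input string is a made-up sample. One further change is needed for current JavaScript engines: the regex is hoisted out of the loop condition, because a regex literal written inside the condition is created fresh on every iteration, which resets lastIndex and makes the loop find the first link forever.

```javascript
// Hypothetical input; the second link uses uppercase tags on purpose.
const input = 'the real <a href="http://www.raymondcamden.com">question</a> is what ' +
  'I want to <A HREF="http://www.adobe.com">see</A>';

// Approach 1: exec() loop. The regex lives outside the loop so that
// lastIndex advances from one match to the next.
const linkRe = /<a\b[^>]*>([\S\s]*?)<\/a>/gi;
const matches = [];
let match;
while ((match = linkRe.exec(input)) !== null) {
  matches.push(match[1]); // group 1 is the link label
}

// Approach 2: replace() used purely for its callback. Returning $0
// leaves the text unchanged; the returned string is simply discarded.
const viaReplace = [];
input.replace(/<a\b[^>]*>([\S\s]*?)<\/a>/gi, function ($0, $1) {
  viaReplace.push($1);
  return $0;
});

console.log(matches);    // → ["question", "see"]
console.log(viaReplace); // → ["question", "see"]
```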

Comment 18 by Steven Levithan posted on 11/8/2007 at 5:35 AM

Oh, and James Allen's CF code above is also obviously very similar to the JavaScript code I posted. The main difference is that it has to mid() the match text from the full string each time it finds a link, rather than just using a backreference.

Comment 19 by Johan Steenkamp posted on 11/8/2007 at 7:52 AM

If the string content is XHTML then you could also use XML functions:

Now you can get any part of the link you want via XmlText and XmlAttributes of arrLinks.

Comment 20 by James Allen posted on 11/8/2007 at 2:10 PM

@Steven:
Thanks for your detailed clarifications and suggestions. I didn't realise you could use backreferences like that - very useful.

Comment 21 by Emmet posted on 11/10/2007 at 11:54 PM

Don't forget regexlib.com for regex n00bs like myself.

Comment 22 by Steven Levithan posted on 11/13/2007 at 10:19 AM

Better yet, forget regexlib and spend time improving your regex skillz. :-P Otherwise, how will you know which of the nearly 100 different email/e-mail regexes currently listed there are a good fit for your needs?
