January 20, 2011

Quick Regex example - matching multiple things at once

coldfusion

Here is something I've never tried to do before with regex - match multiple "rules" within a single regex. Consider, for example, password validation. Normally this requires that a string pass multiple rules:

Must be N characters long
Must contain lower case characters
Must contain upper case characters

I can do any of those rules easily enough but in the past I've done it "long" hand:

<cfset s = ["aaaa","aAa","AAAA","a9", "A9", "aA9","aaaAAA7"]>

<cfloop index="test" array="#s#">
    <cfoutput>#test# ok? </cfoutput>
    <cfif len(test) gte 7 and reFind("[a-z]", test) and reFind("[A-Z]", test)>
        yes
    <cfelse>
        no
    </cfif><br/>
</cfloop>

That works - but it seemed like there must be some way with regex to say "I want to ensure A matches, and B, and C, but I don't care where." My Google-fu failed until I came across this excellent blog post: Password Validation via Regular Expression. In this blog entry, Nilang Shah makes use of a "positive lookahead." A lookahead asserts that a pattern matches from the current position but consumes no characters, so it is not returned as part of the match.
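To make the "doesn't get returned" part concrete, here is a minimal sketch. The sample string and variable names are made up for illustration; it only assumes reFind's optional returnsubexpressions argument, which returns pos and len arrays.

```cfml
<cfset test = "abc1">

<!--- (?=.*\d) requires a digit somewhere ahead but consumes nothing;
      [a-z]+ is the only part that actually matches characters. --->
<cfset result = reFind("(?=.*\d)[a-z]+", test, 1, true)>

<!--- pos[1] and len[1] describe the overall match. --->
<cfoutput>matched text: #mid(test, result.pos[1], result.len[1])#</cfoutput>
```

The digit has to be present for the regex to match, yet the matched text is only abc: the lookahead checked for the digit without including it.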

Let me be honest - I don't quite get how this stuff works. His example though worked perfectly. I took his third example and removed the requirement for a special character and got this:

<cfset s = ["aaaa","aAa","AAAA","a9", "A9", "aA9","aaaAAA7"]>

<cfloop index="test" array="#s#">
    <cfoutput>#test# ok? </cfoutput>
    <cfset regex = "^.*(?=.{7,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$">
    <cfif reFind(regex, test)>
        yes
    <cfelse>
        no
    </cfif><br/>
</cfloop>

I don't quite get why we have to anchor it, nor do I get the .* in the lookaheads. But I can say it works great.
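One way to convince yourself that the single regex is doing the same job as separate checks is to run both side by side. This is only a sketch: the variable names longHand and combined are invented here, and the long-hand version includes a digit check so that it mirrors the \d lookahead.

```cfml
<cfset s = ["aaaa","aAa","AAAA","a9", "A9", "aA9","aaaAAA7"]>

<cfloop index="test" array="#s#">
    <!--- One separate check per rule. --->
    <cfset longHand = len(test) gte 7 and reFind("\d", test) and reFind("[a-z]", test) and reFind("[A-Z]", test)>

    <!--- The same four rules expressed as lookaheads in one regex. --->
    <cfset combined = reFind("^.*(?=.{7,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$", test) gt 0>

    <cfoutput>#test#: long hand=#yesNoFormat(longHand)#, combined=#yesNoFormat(combined)#<br/></cfoutput>
</cfloop>
```

For these sample strings the two columns should agree on every row, with only aaaAAA7 passing.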


Archived Comments

Comment 1 by James Moberg posted on 1/21/2011 at 4:38 AM

Are you able to add the ability to identify and flag non-ASCII characters in the regex or would that require an additional step?
http://lifehacker.com/57216...

Comment 2 by cygro posted on 1/21/2011 at 11:39 AM

@James: Of course you can.
RegEx supports character codes using the \x__ format (where __ is replaced by the hex code of the character, e.g. "\x41" equals "A").

To get all characters with a character code greater than or equal to 128 you may use this expression: "[^\x00-\x7F]".
Try it out by copying the following line into the Firebug console in Firefox (or the JS console of any other useful browser):
('test ?? and some other characters > x7F').match(/[^\x00-\x7F]/gi)

In CF:
<cfif reFind("[^\x00-\x7F]", "test ?? and some other characters > x7F")>

Comment 3 by cygro posted on 1/21/2011 at 11:43 AM

Hey Ray, why does your blog not support Chinese characters? :-(

Please replace the question marks in my comment above with any other characters > x7F, such as ©®¼½¾Øø

Comment 4 by Tom Eldredge posted on 1/21/2011 at 7:26 PM

The only downside to this is that you cannot tell your user what the problem is, just "I don't like your password."

PS: And, if someone tells you they are a RegEx expert, they're probably lying. I understand the lookahead (and lookback), but still get surprised by some regular expression behaviors. I've used them since my early days, and there are still things to learn.

Comment 5 by Raymond Camden posted on 1/21/2011 at 7:32 PM

@Tom: Agreed. (To both points. ;)

@cygro: Sorry about that. It may be a BlogCFC bug. I'll look into it. Cool gravatar btw. ;)

Comment 6 by cygro posted on 1/21/2011 at 9:09 PM

@Tom:
Why can't you tell the user what's wrong with the password?
By using reReplace(...) with backreferences you could flag the matching characters (for example '<span class="bad-character">\1</span>') and display a message to the user.
You can even do it in JavaScript.
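A rough sketch of that idea in CF, assuming the same "[^\x00-\x7F]" class from the earlier comment; the input string and the variable names are invented for the example:

```cfml
<!--- Sample input; chr(169) is the copyright sign, a character above x7F. --->
<cfset input = "pass" & chr(169) & "word">

<!--- Wrap every non-ASCII character in a span; \1 is the captured character. --->
<cfset flagged = reReplace(input, "([^\x00-\x7F])", '<span class="bad-character">\1</span>', "all")>

<cfoutput>#flagged#</cfoutput>
```

This flags characters that are present but unwanted. It does not help with rules about characters that are missing.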

What you say about the RegExperts is absolutely true. I know only two so far: Ben Forta and Jeffrey Friedl, whose book "Mastering Regular Expressions" is my main resource of RegEx knowledge.

@Ray: Thanks for the compliment. You may create one for yourself here: http://www.sp-studio.de/ (cool South Park style avatar generator) ;-)

Comment 7 by Tom Eldredge posted on 1/21/2011 at 9:52 PM

@cygro: Yes, but he's looking for the absence of characters, or no minimum length. True, if you were looking for "bad characters" you could flag one (or loop through and show all of them), but putting all of the checks into a single "can you find this RegEx" means you don't know WHY it was bad, just that it was not found.

This may not be a problem, just pointing out a downside.
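If per-rule feedback matters, one option is to keep the rules separate and collect a message for each one that fails. The sketch below is illustrative only: the rules array, its messages, and the 7-character minimum are assumptions, not anything from the original post.

```cfml
<cfset test = "aaaaaaa">

<!--- Hypothetical rule list: one regex and one user-facing message per rule. --->
<cfset rules = [
    {regex=".{7,}", message="Must be at least 7 characters long"},
    {regex="[a-z]", message="Must contain a lower case character"},
    {regex="[A-Z]", message="Must contain an upper case character"}
]>

<!--- Collect the message for every rule the string fails. --->
<cfset problems = []>
<cfloop index="rule" array="#rules#">
    <cfif not reFind(rule.regex, test)>
        <cfset arrayAppend(problems, rule.message)>
    </cfif>
</cfloop>

<cfif arrayLen(problems)>
    <cfloop index="problem" array="#problems#">
        <cfoutput>#problem#<br/></cfoutput>
    </cfloop>
<cfelse>
    ok
</cfif>
```

The trade-off is the one described above: a single combined regex is compact, while separate rules can say why a password was rejected.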

Comment 8 by James Moberg posted on 1/21/2011 at 11:55 PM

@cygro: Thanks for the non-ASCII regex info.

When I received the notification via email on my iPhone, the Chinese characters were viewable. Is it possible that the database sanitized them while the email was sent separately?

Do you have to add anything to a website's header in order to enable the rendering of these characters? (I had to do this on a multi-language website I once worked on.)

Comment 9 by Raymond Camden posted on 1/22/2011 at 8:18 PM

It's just a simple issue to fix - but my blog here is a bit out of date compared to stock BlogCFC.

Comment 10 by LearningCF posted on 1/25/2011 at 2:59 AM

I do not think the regular expression you are using is quite correct (but perhaps I've misunderstood something). For one, your requirements say "Must be N characters long", but the regular expression and your long-hand solution allow for 7 or more characters. Secondly, it's possible to create some input strings that pass your long-hand solution, but not your regular expression. Two examples are "aaAbbcc" and "AAAbbaa". I tried changing your regular expression to <cfset regex = "^(?=.{7,})(?=.*[a-z])(?=.*[A-Z]).*$"> and it seemed to more closely match the behavior of your long-hand solution--but I won't guarantee that it's correct! :-)
I wrote a sample program to display the results of some tests:
<cffunction name="setDebugAbort" access="private" returntype="void" output="false">
    <cfdump var="#arguments#"/><cfabort>
</cffunction>

<cfscript>
// Note: NL, functionNames, testStrings, simpleWay(), camdenWay(), and anotherWay()
// are not defined in this excerpt.
WriteOutput("<html>" & NL);
WriteOutput("<head>" & NL);
WriteOutput('<style type = "text/css">' & NL);
WriteOutput('table { border:1px solid black; border-collapse: collapse; }' & NL);
WriteOutput('th { border: 1px solid black; }' & NL);
WriteOutput('td { border: 1px solid black; }' & NL);
WriteOutput('.correct { background-color:green; }' & NL);
WriteOutput('.incorrect { background-color:red; }' & NL);
WriteOutput("</style>" & NL);
WriteOutput("</head>" & NL);
WriteOutput("<body>" & NL);
WriteOutput("<table>" & NL);
WriteOutput("<tr>" & NL);
WriteOutput("<th>Test String</th>" & NL);

// One header column per validation function.
iter = functionNames.iterator();
while (iter.hasNext()) {
    functionName = iter.next();
    WriteOutput("<th>#functionName#</th>" & NL);
}
WriteOutput("</tr>" & NL);

// One row per test string, one cell per validation function.
for (testString in testStrings) {
    correctAnswer = testStrings[testString]["valid"];
    reason = testStrings[testString]["reason"];
    answers = ArrayNew(1);

    WriteOutput("<tr>" & NL);
    WriteOutput('<td title="#reason#">#testString#</td>' & NL);

    ArrayAppend(answers, simpleWay(testString));
    ArrayAppend(answers, camdenWay(testString));
    ArrayAppend(answers, anotherWay(testString));

    // Green when the function agrees with the expected answer, red otherwise.
    answerIterator = answers.iterator();
    while (answerIterator.hasNext()) {
        answer = answerIterator.next();
        if (answer EQ correctAnswer) {
            WriteOutput('<td class="correct">#answer#</td>' & NL);
        } else {
            WriteOutput('<td class="incorrect">#answer#</td>' & NL);
        }
    }

    WriteOutput("</tr>" & NL);
}

WriteOutput("</table>" & NL);
WriteOutput("</body>" & NL);
WriteOutput("</html>" & NL);

</cfscript>

Comment 11 by Raymond Camden posted on 1/25/2011 at 3:03 AM

Are you saying that the regex fails because it allows for >7? If so - that was a miscommunication. I'd never demand -exactly- N characters. I've seen that one time before and it was a royal pain in the rear.

As to my 'hack' solution, it was just to give you the _idea_ of how I would solve the 'how do I check for N things in one string' solution. I didn't mean for it to be as fully 'locked down' as the regex.

Comment 12 by Raymond Camden posted on 1/25/2011 at 3:04 AM

Btw - cfdump supports abort as an argument. No need to write a UDF for it. :)

Comment 13 by LearningCF posted on 1/25/2011 at 3:10 AM

Yes, my CF is probably bad. :-)

As for the requirements, your first paragraph says:

Normally this requires a string pass multiple rules:

* Must be N characters long
* Must contain lower case characters
* Must contain upper case characters

The part of the regular expression in your solution that I'm talking about is {7,}, which means match at least 7 occurrences of the pattern. Since there's no number after the comma, you can have >= 7 characters. If you want to limit it, it would be, for example, {7,11}, which is equivalent to x >= 7 and x <= 11.
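One caveat if the {7,11} form goes inside a lookahead: the lookahead only needs to find 7 to 11 characters and does not care what follows, so the upper bound is enforced only when the lookahead also contains a $. A small sketch, with made-up test strings:

```cfml
<cfset tests = ["aaaAAA", "aaaAAA7", "aaaAAA7aaaAAA7"]>

<cfloop index="test" array="#tests#">
    <!--- First pattern: lookahead without $, so longer strings still pass. --->
    <!--- Second pattern: $ inside the lookahead enforces the maximum of 11. --->
    <cfoutput>#test# (#len(test)#):
        unanchored=#yesNoFormat(reFind("^(?=.{7,11}).*$", test))#
        anchored=#yesNoFormat(reFind("^(?=.{7,11}$).*$", test))#<br/></cfoutput>
</cfloop>
```

The 6-character string fails both patterns and the 7-character string passes both. The 14-character string passes the first pattern but fails the second.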

Anyways, just trying to be helpful.

Comment 14 by Raymond Camden posted on 1/25/2011 at 3:17 AM

Yeah - I coulda been more precise. I shoulda said: Must be at least N characters long.

Comment 15 by LearningCF posted on 1/25/2011 at 3:30 AM

Actually, my example erred. All you really need to match is the following:

The reason for this is that the ?= constructs are look ahead assertions. They consume no characters in the string; they merely return true or false when their assertion is evaluated.

So, there are 3 separate assertions: Firstly, that the string is 7 or more characters in length. Secondly, that the string contains at least one lower case character in the range a-z, and lastly, that the string contains at least one upper case character in the range A-Z. The original solution also had an assertion that the string contain one or more digits (the \d).

Once you've got that, there's no need for the .*$, which meant only "match zero or more characters until the end of the line". Anything modified with * is tricky since it always matches, so it is often something that can be dispensed with in the regular expression.

Comment 16 by Raymond Camden posted on 1/25/2011 at 3:36 AM

I guess my question is - why in the lookaheads is .* required? To me, I would have written it as

(?=.{7,})(?=[a-z])(?=[A-Z])

Comment 17 by LearningCF posted on 1/25/2011 at 3:47 AM

Ah! Good question. The reason for that is when you remove the .*, the assertion is essentially saying that the very next character will be *both* an uppercase letter and a lower-case letter. Since both (?=[a-z]) and (?=[A-Z]) consume no characters, they will both evaluate against the first character of the string, and at best only one will return true. The .* in this context is necessary to allow the assertion to examine the entire string.
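A quick way to see the difference is to run both forms against a string that should pass. This is a sketch only; the variable names are invented, and both patterns are anchored and end in .*$ so that nothing else differs between them.

```cfml
<cfset test = "aaaAAA7">

<!--- Without .*, both lookaheads inspect only the first character. --->
<cfset withoutStar = "^(?=.{7,})(?=[a-z])(?=[A-Z]).*$">

<!--- With .*, each lookahead may scan ahead through the rest of the string. --->
<cfset withStar = "^(?=.{7,})(?=.*[a-z])(?=.*[A-Z]).*$">

<cfoutput>
    without .*: #yesNoFormat(reFind(withoutStar, test))#<br/>
    with .*: #yesNoFormat(reFind(withStar, test))#<br/>
</cfoutput>
```

The first pattern can never match, because no single character is both lower case and upper case. The second matches aaaAAA7.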

Comment 18 by Raymond Camden posted on 1/25/2011 at 3:52 AM

Ok... I don't get that 100%. But I get it maybe 51% which is 51% more than before. :) So thank you!
