A reader of mine had an interesting question: is it possible to find all the dates in a string? In theory you could parse all the words and attempt to turn each into a date. You would need to check each word along with a "reasonable" number of words after it, perhaps up to 4. I decided to take an initial stab at a simpler solution - looking just for dates in the form of mm/dd/yyyy. (Note to all of my readers outside of America: the code I'm showing here would actually work fine in your locales as well.)
First - let's create a simple string.
<cfsavecontent variable="str">
This is some text. I plan on taking over the world on 12/1/2011. After I
do that, I plan on establishing the Beer Empire on 1/2/2012. But on 3/3/2013
I'll take a break. But this 13/91/20 is not a valid date.
</cfsavecontent>
Now let's do a regex based on Number/Number/Number.
<cfset possibilities = reMatch("\d+/\d+/\d+", str)>
This gives us an array of possible matches that we can loop over:
<cfloop index="w" array="#possibilities#">
<cfif isDate(w)>
<cfoutput>#w# is a date.<br/></cfoutput>
</cfif>
</cfloop>
Which gives us...
12/1/2011 is a date.
1/2/2012 is a date.
3/3/2013 is a date.
Any thoughts on this technique? The entire template is below.
<cfsavecontent variable="str">
This is some text. I plan on taking over the world on 12/1/2011. After I
do that, I plan on establishing the Beer Empire on 1/2/2012. But on 3/3/2013
I'll take a break. But this 13/91/20 is not a valid date.
</cfsavecontent>
<cfset possibilities = reMatch("\d+/\d+/\d+", str)>
<cfloop index="w" array="#possibilities#">
<cfif isDate(w)>
<cfoutput>#w# is a date.<br/></cfoutput>
</cfif>
</cfloop>
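If you want to reuse the technique, the same two steps (regex match, then isDate() filter) could be wrapped in a function. This is only a sketch: findDates is a hypothetical name, not a built-in, and it keeps the matched strings rather than converting them to date objects.

```cfm
<!--- Hypothetical helper (not a built-in): returns every
      Number/Number/Number match that isDate() accepts. --->
<cffunction name="findDates" returntype="array" output="false">
    <cfargument name="text" type="string" required="true">
    <cfset var result = []>
    <cfset var candidate = "">
    <!--- reMatch() returns an array of every substring matching the pattern --->
    <cfloop index="candidate" array="#reMatch('\d+/\d+/\d+', arguments.text)#">
        <cfif isDate(candidate)>
            <cfset arrayAppend(result, candidate)>
        </cfif>
    </cfloop>
    <cfreturn result>
</cffunction>

<!--- Example call with a literal string --->
<cfdump var="#findDates('Due 12/1/2011, then 1/2/2012.')#">
```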
Archived Comments
It depends on how versatile you want the whole thing to be.
The 1-12 and 1-31 limitation can already be done inside the regular expression as well, but once the source adds dates with (international) formats like YYYY-MM-DD or DD.MM.YYYY or just two-digit years there's a lot more work left to do.
Hi Michael: I can't see a way of dealing with the fact 29/2/2012 is a date but 29/2/2013 is not. So I think one is still going to need to check each match anyhow, so there's perhaps a balance to be reached between complexity of regex and expectations of false positives (which are then dealt with via the date check in the loop)?
--
Adam
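To illustrate the trade-off discussed in the two comments above, here is a rough sketch of a pattern that limits the month to 1-12 and the day to 1-31, using the post's mm/dd/yyyy order. The variable names are made up for the example, and the four-digit year is an assumption. A string like 2/29/2013 still gets through the pattern, which is why the isDate() check stays in the loop.

```cfm
<cfset sample = "On 12/1/2011 it starts, 13/91/2011 is junk, and 2/29/2013 looks fine to the regex.">

<!--- Month 1-12, day 1-31, four-digit year; \b keeps the match from
      starting or ending in the middle of a longer run of digits. --->
<cfset tightPattern = "\b(0?[1-9]|1[0-2])/(0?[1-9]|[12]\d|3[01])/\d{4}\b">
<cfset candidates = reMatch(tightPattern, sample)>

<cfloop index="c" array="#candidates#">
    <!--- The pattern cannot know about month lengths or leap years,
          so each candidate is still validated. --->
    <cfif isDate(c)>
        <cfoutput>#c# is a date.<br/></cfoutput>
    <cfelse>
        <cfoutput>#c# matched the pattern but is not a date.<br/></cfoutput>
    </cfif>
</cfloop>
```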
Plus, certainly in the past, I've had issues with ambiguous dates (such as 2/3/99) being parsed US style rather than UK style, despite the server locale.
It's a nasty area...
BTW did you mean to accept single digit years? \d{1,2}/\d{1,2}/(\d{4}|\d{2}) might be a better expression to capture just 'natural' style possible dates.
Tom - I think US/UK could be ignored if you assumed all the dates mentioned in text applied to your current locale. (Or the current locale as set by setLocale.)
Good point on the {} range.
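A sketch of how the two suggestions above might fit together: set the locale, then validate with the locale-aware lsIsDate() instead of isDate(). The "English (UK)" locale and the sample text are assumptions for illustration, and the pattern is a variant of the {1,2} idea that accepts two- or four-digit years. How an ambiguous value such as 2/3/99 is interpreted depends on the locale and the engine, so check the output on your own server.

```cfm
<cfset str = "Meet on 2/3/99 or 12/1/2011, but not 1/2/3.">

<!--- setLocale() returns the previous locale so it can be restored later --->
<cfset previousLocale = setLocale("English (UK)")>

<!--- One or two digits for day and month, then a four- or two-digit year --->
<cfset possibilities = reMatch("\b\d{1,2}/\d{1,2}/(\d{4}|\d{2})\b", str)>

<cfloop index="w" array="#possibilities#">
    <!--- lsIsDate() and lsParseDateTime() follow the current locale --->
    <cfif lsIsDate(w)>
        <cfoutput>#w# parses as #lsDateFormat(lsParseDateTime(w), "long")#<br/></cfoutput>
    </cfif>
</cfloop>

<!--- Put the original locale back --->
<cfset setLocale(previousLocale)>
```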
Thanks Ray! I had been banging my head on this problem for two full days until I got the sense to contact a Jedi. =]
For fun and out of desperation to procrastinate, I came up with this RegEx:
(\d+/\d+/\d+)|((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(t|tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s*\d+(st|nd|rd|th)?,\s*\d{4})|(\d+-\w{3}-\d+)
In addition to mm/dd/yy or mm/dd/yyyy etc, it also supports stuff like "Aug 1, 2010" and "September 3rd, 2010" as well as DD-MMM-YY.
ColdFusion will recognize all of those with the exception of the "3rd", "2nd", etc, so doing a quick REReplaceNoCase takes care of that exception:
REReplaceNoCase(w,"(st|nd|rd|th)?,",",")
Ok I guess I should get back to work.
That's pretty epic. :)
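For anyone who wants to try the longer expression from the comment above, here is a minimal sketch that plugs it into the original loop, together with the suggested REReplaceNoCase cleanup for "3rd", "2nd", and so on. The sample text and variable names are invented for the example, and which of the cleaned strings isDate() accepts may vary by engine and version.

```cfm
<cfset str = "Launch on Aug 1, 2010, party on September 3rd, 2010, cleanup on 04-Sep-10 and 9/5/2010.">

<!--- The expression from the comment: slash dates, month-name dates, DD-MMM-YY --->
<cfset pattern = "(\d+/\d+/\d+)|((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(t|tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s*\d+(st|nd|rd|th)?,\s*\d{4})|(\d+-\w{3}-\d+)">
<cfset possibilities = reMatch(pattern, str)>

<cfloop index="w" array="#possibilities#">
    <!--- Drop an ordinal suffix before the comma, e.g. "3rd," becomes "3," --->
    <cfset cleaned = reReplaceNoCase(w, "(st|nd|rd|th)?,", ",")>
    <cfif isDate(cleaned)>
        <cfoutput>#w# is a date.<br/></cfoutput>
    </cfif>
</cfloop>
```

Note that reMatch() is case-sensitive, so month names are only found when capitalized as in the pattern; reMatchNoCase() is the case-insensitive counterpart.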