So, I worked on an interesting problem today - parsing iCal feeds. My desire to parse iCal feeds stems from the fact that I want to translate an iCal feed to an RSS feed. This would let me turn a calendar into an RSS feed that can be added to my.yahoo.com for example. (Obviously you would only publish events in the future.)
So, working on this I ran into an interesting issue. I needed to parse a string that looked a bit like this:
foo=goo:zoo
Everything after the colon is data. Everything before the colon is considered params. There are two ways we can make this crazy. First off - the params side can have colons too, as long as they are inside quotes. And not just a bare ":", but colons buried inside a longer quoted value:
foo="http://www.cnn.com":CNN
Let's make things interesting again - you can have multiple params if you separate them by commas - but again, commas can be inside quotes:
kidnames="jacon,lynn,noah",foo="http://www.cn.com":CNN
Now - as we all know - ColdFusion has some nifty string parsing functions. The list functions are especially useful for cutting up strings, but in this case, are useless. What I ended up doing then was writing a UDF called: findNotInQuotes. Here it is:
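function findNotInQuotes(data, target) {
    var inQuotes = false;
    var x = 1;
    var c = "";
    // Optional third argument: the position to start searching from
    if(arraylen(arguments) gte 3) x = arguments[3];
    for(; x lte len(data); x=x+1) {
        c = mid(data,x,1);
        // A double quote toggles whether we are inside a quoted section
        if(c is """") {
            if(inQuotes) inQuotes=false;
            else inQuotes = true;
        }
        if(c is target and not inQuotes) return x;
    }
    // No unquoted match found
    return 0;
}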
This worked for me, and I was able to write a conditional loop. However - I think it could be done better. My first thought was - why not rewrite all the list functions so they ignore delimiters in quotes? That seems a bit crazy. Instead - why not simply write a function that will "split" a string into an array, using your delimiters (and why not allow multi-character delimiters, something CF's list functions don't let you do)? Then you can simply loop over the array.
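A minimal sketch of what such a split function might look like. The name splitNotInQuotes and its two arguments are hypothetical (nothing like this exists in the finished code), and it only understands double quotes:

```cfscript
<cfscript>
// Hypothetical helper: splits data on delim, ignoring any delimiter that
// sits inside double quotes. delim may be more than one character long.
function splitNotInQuotes(data, delim) {
    var result = arrayNew(1);
    var inQuotes = false;
    var start = 1;   // position where the current piece begins
    var x = 1;
    var c = "";
    var piece = "";
    // An empty delimiter can never match, so return the string untouched
    if (len(delim) is 0) {
        arrayAppend(result, data);
        return result;
    }
    while (x lte len(data)) {
        c = mid(data, x, 1);
        if (c is """") inQuotes = not inQuotes;
        // compare() is case sensitive and returns 0 on an exact match
        if (not inQuotes and compare(mid(data, x, len(delim)), delim) is 0) {
            // Close off the current piece (it may be empty) and skip the delimiter
            piece = "";
            if (x gt start) piece = mid(data, start, x - start);
            arrayAppend(result, piece);
            x = x + len(delim);
            start = x;
        }
        else {
            x = x + 1;
        }
    }
    // Whatever is left after the last delimiter is the final piece
    piece = "";
    if (start lte len(data)) piece = mid(data, start, len(data) - start + 1);
    arrayAppend(result, piece);
    return result;
}

parts = splitNotInQuotes("kidnames=""jacon,lynn,noah"",foo=""http://www.cn.com""", ",");
// parts[1] is kidnames="jacon,lynn,noah"
// parts[2] is foo="http://www.cn.com"
</cfscript>
```

Unlike the list functions, this keeps empty pieces, so two delimiters in a row produce an empty array element rather than being skipped.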
Any thoughts on this?
For those interested in the iCal code - I've got it working, but I want to convert it to a nice CFC first. There are a set of helper functions also included with it so you can parse their time formats. I'm also going to add a function so you can pass in an iCal start date and an iCal duration value and get an end date.
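As a rough idea of the end date helper, here is a sketch only. The name addICalDuration is made up for illustration, it assumes the iCal start date has already been turned into a ColdFusion date, and it assumes the duration is a well-formed value such as P1DT2H30M or -PT15M:

```cfscript
<cfscript>
// Hypothetical sketch: adds an iCal duration value to a ColdFusion date
// and returns the resulting end date.
function addICalDuration(startDate, duration) {
    var result = startDate;
    var sign = 1;
    var num = "";
    var part = "";
    var x = 1;
    var c = "";
    for (x = 1; x lte len(duration); x = x + 1) {
        c = uCase(mid(duration, x, 1));
        if (c is "-") sign = -1;
        // Collect digits until we reach the unit letter they belong to
        else if (find(c, "0123456789")) num = num & c;
        else if (listFind("W,D,H,M,S", c) and len(num)) {
            // Map the iCal unit letter to the matching DateAdd datepart.
            // iCal durations have no month unit, so M always means minutes.
            if (c is "W") part = "ww";
            else if (c is "D") part = "d";
            else if (c is "H") part = "h";
            else if (c is "M") part = "n";
            else part = "s";
            result = dateAdd(part, sign * num, result);
            num = "";
        }
        // "+", "P" and "T" carry no value of their own, so they are skipped
    }
    return result;
}
</cfscript>
```

For example, passing createDateTime(2005, 1, 10, 9, 0, 0) and "PT1H30M" should give 10:30 on the same day.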
Archived Comments
I ran into the same problem recently when trying to parse JSON (http://www.crockford.com/JS...) data in ColdFusion. I ended up implementing an almost identical solution in my CFJSON (http://jehiah.com/projects/...) functions, where I ignore comma delimiters when inside a quoted string.
... looks like you only deal with double quotes but I'm assuming you could have the same problem with single quotes.
like this? (sort of)
http://www.rsscalendar.com/...
It generates an RSS feed and can export to iCal and Outlook. I don't think you can import.
This is exactly what I have been trying to do for ages! Would you be so nice as to release it?
Thanks for the heads up, thanks man.
Rob.
So, only the params side can have colons in quotes? Why not just get the listLast() of a colon-delimited list to get the "right side" of the colon?
Then for the params, I'd think it would be cleaner to just do a listToArray, then loop the array looking for quotes, reassembling params based on the existence of quotes inside the array item inside your loop -- then you aren't looping inside of a loop with multiple conditionals.
That's exactly how I approached it, Nathan! It works like a charm. The benefit is that this method has very few iterations, and properly handles empty parameters and "=" signs inside of parameter values. Here's the code. Works like a champ and returns a tidy little structure containing the parameters. If it doesn't post correctly, I'll email it.
//The basic idea is that we create an array from the list, ignoring the quotes until
//we're rebuilding the parameters. Only when we have found a new parameter do we worry
//about quotes.
//
//This method has the advantage that it uses a very small number of iterations and relatively
//little string parsing and allocation as opposed to looping over every character in the string.
//
//Parameter: s - the *complete* string description in the format: param=val,param=val:Name
//Returns: A Struct indexed by parameter name.
function getParameters(s) {
    var data = ListLast(s, ":");
    var paramArray = ListToArray(Left(s, Len(s) - (Len(data) + 1)));
    var paramStruct = StructNew();
    var paramName = "";
    var paramVal = "";
    var paramTmp = "";
    var inQuote = false;
    var i = 1;
    //Loop over each token in the array.
    for (i = 1; i lte ArrayLen(paramArray); i = i + 1) {
        paramTmp = paramArray[i];
        //We know we have a new parameter when we find an "=" sign in the token
        //*and* we're not currently in a quoted section. In theory, "=" is
        //perfectly valid inside a value (URLs, for example http://cnn.com/?foo=1)
        if (Find("=", paramTmp) and not inQuote) {
            if (i is not 1) {
                paramStruct[paramName] = Replace(paramVal, """", "", "all");
                paramVal = "";
            }
            paramName = ListFirst(paramTmp, "=");
            paramVal = ListAppend(paramVal, ListRest(paramTmp, "="));
            //If this value opens a quote that is not closed within the same token,
            //we know to ignore any subsequent "=" until the quote has been closed
            if (Left(paramVal, 1) is """" and (Len(paramVal) is 1 or Right(paramVal, 1) is not """")) inQuote = true;
        }
        //If we didn't have a new parameter, just append the value to the rest
        //of the value string we have compiled since the last new parameter
        else {
            paramVal = ListAppend(paramVal, paramTmp);
            //If this parameter ends in a quote, we know we've hit the end of the
            //value string, and the next "=" will denote a new parameter
            if (Right(paramVal, 1) is """") inQuote = false;
        }
    }
    //Catch the results of the last iteration
    paramStruct[paramName] = Replace(paramVal, """", "", "all");
    return paramStruct;
}
Oops - there's a harmless extra step in there....when the paramName and paramValue are being reset, it just needs the following:
paramName = ListFirst(paramTmp, "=");
paramVal = ListRest(paramTmp, "=");
I had an extra ListAppend in there that was harmless, but unnecessary.
It was bugging me last night that this could utilize a regex to do the pattern matching, so I revisited it this AM. Sure enough, regexes do the trick quite nicely, even though the regex required is a beast. It's debatable whether this code is any easier to follow than the last version I posted. I'm pretty sure the performance difference between the two is negligible.
Anyway, here's the same function using RegEx:
//This function uses regular expressions to parse the string
function getParameters(s) {
    var data = ListLast(s, ":");
    var paramStruct = StructNew();
    //This regex will match the subsequent params in the string
    var regEx = ",\w*(?!"")=""{0,1}[^""]*""{0,1}[,:]{1}";
    var i = 1;
    //The offset is used to compensate for the "," which appears in subsequent
    //matches using the regex. It just simplifies the conditionals a bit.
    var offset = 0;
    var posStart = 1;
    var posNext = 1;
    //Holds the current name=value pair (declared so it stays local to the function)
    var paramTmp = "";
    //When posStart = 0, we haven't found any more matches
    while (posStart neq 0) {
        posNext = ReFind(regEx, s, posStart + 1);
        //If we found another match after this one, parse out the current
        if (posNext gt 0)
            paramTmp = Mid(s, posStart + offset, posNext - posStart - offset);
        //If we didn't find another match after this one, it was the last one
        else
            paramTmp = Mid(s, posStart + offset, Len(s) - Len(data) - posStart - offset);
        //Split the name=value pairs
        paramStruct[ListFirst(paramTmp, "=")] = Replace(ListRest(paramTmp, "="), """", "", "all");
        posStart = posNext;
        posNext = ReFind(regEx, s, posStart);
        //This is done to compensate for the comma in all matches other than the first.
        offset = 1;
    }
    return paramStruct;
}
D'Oh. I've decided to spam your blog all night.
J/K.
Actually, I forgot to remove some test code from that last one. You don't need the "var i = 1;" declaration, and you don't need the "posNext = ReFind(regEx, s, posStart);" that occurs right before the end of the loop. It was for testing and is completely redundant.
Sorry for the blog spam!
Wow, I go to sleep and lots of comments show up. :)
Nathan: Sorry - colons can also appear in the value area, if they are also wrapped in quotes. Therefore, to 'split' the string, I need to get the first non-quoted colon. Actually... wait... so the value CAN have a colon and not be wrapped. So - you must find the first non quoted colon and split there.
I'll be posting the cfc later today. I got the main guts done. I'm now just working on a 'helper' function that translates iCal duration values.
Actually Ray, all you need to worry about is the last colon, not any colons in the params/values. That's why I parsed off the last "data" chunk before splitting on the commas in my first go at this.
Actually, the more I play with the RegEx version, the more I would personally avoid it and go with the comma-based parsing. There are too many ways to trick the regex, and building a bullet-proof one is proving to be pretty nasty. Not to mention, do you really want to have to remember what that thing does when you come back to the code later? (shudder). The comma-based parsing method that Nathan suggested and I posted seems to be pretty bullet-proof.
Roland: I disagree. You could have this....
param=x:my value has a : in it
In this case, the data is:
my value has a : in it
Again, I have to find the first colon that is not in a quote.
As for Nathan's approach - it makes sense, but I don't see it as any better than mine - just different. In my approach, I search for a string not in quotes to know where to split. Both involve "work" to handle.
I think I'm missing something Ray. In your original example, you say strings can be formatted like so:
kidnames="jacon,lynn,noah",foo="http://www.cn.com":CNN
This means the generic format is:
(comma-separated list of params):(data)
Is that correct? If that assumption is correct, then the only significant colon is the last one - you can have as many colons as you want before that last one, and it won't matter because if you parse off the "data" chunk first, you no longer need to worry whether a colon is quoted or unquoted. That's easily accomplished by a ListLast(s, ":").
If I'm reading you correctly, you're saying you could have a string that looks like this:
kidnames="jacon,lynn,noah",foo="http://www.cn.com",param=x:my:CNN
If you've already parsed off the "data" segment, then the colon in the param value is completely insignificant. When parsed, you'd expect:
Data:
CNN
Parameters:
kidnames = jacon,lynn,noah
foo=http://www.cnn.com?boo=1,2,...
param = x:my
Is this correct, or am I completely misreading your post?
No, sorry, the format for the "right hand portion" _does_ allow colons as well. So again, you cut things off at the first colon not in quotes.
Here is a sample from the RFC - note the mailto value:
ATTENDEE;RSVP=TRUE;ROLE=REQ-PARTICIPANT:MAILTO:
jsmith@host.com
Ohhhhhhhhh.
Well if you're just looking for the first colon that's not enclosed in quotes, that's easy! Just use this regex:
"(?!"".*):(?!.*"")"
Then you find the first unquoted colon this way:
pos = ReFind("(?!"".*):(?!.*"")", myString)
One liner! :)
Nevermind. I give up. I read the RFC and this is perfectly valid too:
ORGANIZER;SENT-BY="MAILTO:sray@host.com":MAILTO:jsmith@host.com
which breaks the regex. :( I don't know that you're going to get any better than character by character, Ray.
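For what it's worth, the character by character approach can be sketched roughly like this. The function name splitAtFirstUnquotedColon is hypothetical, the test line is written with SENT-BY= as in the RFC's own example, and only double quotes are treated as quoting:

```cfscript
<cfscript>
// Hypothetical helper: splits a line at the first colon that is not inside
// double quotes. Returns a struct with "params" and "data" keys.
function splitAtFirstUnquotedColon(s) {
    var result = structNew();
    var inQuotes = false;
    var x = 1;
    var c = "";
    // If no unquoted colon is found, the whole line is treated as params
    result.params = s;
    result.data = "";
    for (x = 1; x lte len(s); x = x + 1) {
        c = mid(s, x, 1);
        if (c is """") inQuotes = not inQuotes;
        else if (c is ":" and not inQuotes) {
            // Everything left of the colon is params, everything right is data
            result.params = "";
            if (x gt 1) result.params = left(s, x - 1);
            if (x lt len(s)) result.data = right(s, len(s) - x);
            break;
        }
    }
    return result;
}

line = "ORGANIZER;SENT-BY=""MAILTO:sray@host.com"":MAILTO:jsmith@host.com";
parts = splitAtFirstUnquotedColon(line);
// parts.params is ORGANIZER;SENT-BY="MAILTO:sray@host.com"
// parts.data is MAILTO:jsmith@host.com
</cfscript>
```

The quoted MAILTO colon is skipped because the scan is still inside quotes at that point, and any colons after the split point simply stay in the data.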
Any progress on an effective iCal tool?
Is that to me? If so, I haven't worked on anything ical related in a while.