As a follow-up to my quick little regex post yesterday, I thought I'd share another one today. This is something I'm adding to BlogCFC later this week, but as I was working on Adobe Groups today I figured I'd test it out there first. It's a formatting trick used in many places, including Google+, and it's something so simple you probably don't even have to document it. What this code will do is convert any word surrounded with asterisks to bolded text and any word surrounded with underscores to italics. Here's the UDF:
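<cfscript>
// Convert *word* to bold and _word_ to italics.
function simpleFormat(s) {
    // \w+? matches only letters, digits, and underscores between the markers
    s = rereplace(s, "\*(\w+?)\*", "<b>\1</b>", "all");
    s = rereplace(s, "_(\w+?)_", "<i>\1</i>", "all");
    return s;
}
</cfscript>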
And here is my sample text:
This is some text I feel *very* strongly about. Maybe _you_ don't feel strongly about it, but honestly, maybe you are just slow.* Or maybe you aren't *slow* per se but you could _kinda_ slow.
* = No offense to anyone who walks slow!
Notice I intentionally used a * as a pointer to a note at the end of the string. That was to test solitary characters. Here is the output: (Note - edited to remove the blockquote as it messed up formatting.)
This is some text I feel very strongly about. Maybe you don't feel strongly about it, but honestly, maybe you are just slow.* Or maybe you aren't slow per se but you could kinda slow. * = No offense to anyone who walks slow!
Not rocket science, and I know folks are going to say "what about lists, links, etc etc", but it's nice, simple, and an effective update to plain text. By the way, if you want to handle turning vertical spaces into paragraphs, you can simply add XHTMLParagraphFormat to the code as well.
Archived Comments
When using special characters to denote markup of some kind I usually go with a super simple regex like:
s = rereplace(s, "\*(.*?)\*","<b>\1</b>","all");
so it will catch links like you mentioned above, or spaces in words. However, it runs into a problem: if someone has multiple * or _ characters in their post that are not actually related to one another, it will create undesired formatting, which is something your regex avoids.
I am sure someone has figured out the best way to do this, right?
I had .* as well, but it messed up with my single * char. I found \w worked great as it ensured alphanum between the * chars. Any spaces or punctuation would break it, which I'm fine with. So this would fail (probably, didn't test): I am in *love.*
But I'd be ok with that.
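To make that trade-off concrete, here is a small sketch that runs the UDF from the post against a few inputs. The sample strings and variable names are made up for illustration, and the results in the comments follow from how \w behaves rather than from a recorded run.

```cfml
<cfscript>
// Same UDF as in the post, repeated so this snippet stands alone.
function simpleFormat(s) {
    s = reReplace(s, "\*(\w+?)\*", "<b>\1</b>", "all");
    s = reReplace(s, "_(\w+?)_", "<i>\1</i>", "all");
    return s;
}

// A single word between asterisks is wrapped in bold tags.
a = simpleFormat("I am in *love* today.");
// a is: I am in <b>love</b> today.

// The period is not a \w character, so nothing matches and the text is unchanged.
b = simpleFormat("I am in *love.*");
// b is: I am in *love.*

// A space is not a \w character either, so multi-word phrases are left alone.
c = simpleFormat("I am in *true love* today.");
// c is: I am in *true love* today.
</cfscript>
```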
Ray, I have been reading for years. I love your writing style and your quirky, fun attitude to life. I was just teasing my soon-to-be wife about walking slow yesterday. Made me laugh. Keep it up.
~Nathan Sego~
If you catch yourself using .*? you should stop and say "what do I _really_ want to match here", because it's usually not "don't match anything if you don't have to; oh, ok, well just one of any char then move on; oh, well if I must, then two of any character; etc".
Often, it's better to do a greedy negated character class containing any characters that cannot occur, so in this case [^*]+ though sometimes it needs to be more complicated (e.g. negative lookahead).
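As a rough sketch of the negated character class idea, the bold rule could look like the following. The function name simpleFormatLoose and the sample strings are hypothetical, and the capture group is added here only so the replacement can reuse the matched text.

```cfml
<cfscript>
// Hypothetical variant of the bold rule using a negated character class.
function simpleFormatLoose(s) {
    // [^*]+ means one or more characters that are not an asterisk
    s = reReplace(s, "\*([^*]+)\*", "<b>\1</b>", "all");
    return s;
}

// Spaces and punctuation between the asterisks are now allowed.
a = simpleFormatLoose("I am in *true love* today.");
// a is: I am in <b>true love</b> today.

// The catch: a solitary asterisk pairs up with the next asterisk it finds.
b = simpleFormatLoose("just slow.* Or maybe you aren't *slow* per se");
// b is: just slow.<b> Or maybe you aren't </b>slow* per se
</cfscript>
```

The second call shows why the solitary asterisk still needs separate handling with this pattern.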
For the solo * problem, one way of solving that is using word boundaries to guarantee both the * characters are "touching" words, rather than in isolation. So \B\*\b[^*]+\b\*\B means that there must *not* be alphanumerics "outside" either of the asterisks, but there must be alphanumerics "inside" the asterisks. Still not perfect though.
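Here is a minimal sketch of that word boundary pattern dropped into a UDF. The function name simpleFormatBounded and the sample strings are hypothetical, and parentheses are added around [^*]+ (an assumption, since the pattern above has no capture group) so the replacement can refer to \1.

```cfml
<cfscript>
// Hypothetical variant of the bold rule using word boundaries.
function simpleFormatBounded(s) {
    // \B before and after: no word character touching the outside of either asterisk
    // \b inside: a word character must touch the inside of each asterisk
    s = reReplace(s, "\B\*\b([^*]+)\b\*\B", "<b>\1</b>", "all");
    return s;
}

// Multi-word phrases are formatted.
a = simpleFormatBounded("I am in *true love* today.");
// a is: I am in <b>true love</b> today.

// The solitary asterisk is followed by a space, so it is skipped.
b = simpleFormatBounded("just slow.* Or maybe you aren't *slow* per se");
// b is: just slow.* Or maybe you aren't <b>slow</b> per se

// Punctuation just inside the closing asterisk still blocks the match.
c = simpleFormatBounded("I am in *love.* today");
// c is: I am in *love.* today
</cfscript>
```

The last call is one example of the "still not perfect" cases: the period sits between the word and the closing asterisk, so the inner \b has nothing to anchor to.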
I'd probably just solve this problem by using Markdown - there's a Java version available which I've got working via CFML; just a create object and then a single method call, and it'll do the HTML conversion for you. :)