Correction to earlier ColdFusionBloggers.org post and a warning about removing HTML

You know that old joke where the person says something to the effect of, "I don't make mistakes. I thought I made a mistake once, but I was wrong"? That's what happened to me today. Earlier I posted about how I had forgotten an HTMLEditFormat in my display. Turns out there was a reason I had forgotten it. I had made the decision not to escape HTML, but to simply remove it. I was using this code:

<cfset portion = reReplace(content, "<.*?>", "", "all")>

This line uses a regular expression to remove HTML from a string. And it works fine too. Except when the string contains invalid or partial HTML. Consider this string:

I really wish I could meet Paris Hilton. She is the most intelligent person in the world. Maybe Lindsey is close though. If I met them, I'd show them my web page at <a href=

Notice how the HTML ends there at the end? Some RSS feeds provide only a portion of their feeds, and sometimes these portions end in the middle of an HTML tag. (BlogCFC doesn't have this bug. :) So I added one more line of code:

<cfset portion = reReplace(portion, "<.*$", "")>

This looks for an incomplete HTML tag at the end of the string and removes it. Once I added it, my display issue on ColdFusionBloggers was fixed.
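
For anyone who wants to try the two passes together, here is a minimal sketch that wraps them in a function. The function name stripFeedHtml and the sample string are made up for illustration and are not part of the ColdFusionBloggers code. The sample stays on one line, so the sketch does not depend on how the regex engine treats line breaks.

```cfm
<!--- Hypothetical helper: the name stripFeedHtml is an assumption, not from the original code. --->
<cffunction name="stripFeedHtml" returntype="string" output="false">
    <cfargument name="content" type="string" required="true">
    <!--- Pass 1: remove every complete tag. The non-greedy .*? stops at the first ">". --->
    <cfset var portion = reReplace(arguments.content, "<.*?>", "", "all")>
    <!--- Pass 2: remove a tag that was opened but cut off before its closing ">". --->
    <cfset portion = reReplace(portion, "<.*$", "")>
    <cfreturn portion>
</cffunction>

<!--- Made-up sample: two complete tags plus a tag truncated mid-attribute. --->
<cfset sample = 'Read <b>this</b> on my web page at <a href='>
<cfoutput>#htmlEditFormat(stripFeedHtml(sample))#</cfoutput>
```

With that sample, the first pass should leave the dangling <a href= in place (there is no closing > for it to match), and the second pass should trim it, leaving "Read this on my web page at " with a trailing space. Wrapping the result in trim() would tidy that up if it matters for your display.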

I've updated the zip. Also, Lola Beno fixed up some issues I had with my SQL script. By "fix" I mean completely rewrote, so thanks Lola!

Archived Comments

Comment 1 by Lola LB posted on 8/2/2007 at 4:05 AM

Taking another look at layout.cfm, I noticed some code at the bottom having to do with Google Analytics. This is specifically for *your* ColdFusion Bloggers site, right? If so, it'd probably be wise to advise users to remove this code or replace it with their own Google Analytics account. ;-)

Comment 2 by Toby Reiter posted on 8/2/2007 at 4:11 AM

Ray,
Wouldn't you be better off with the following?

<cfset portion = reReplace(portion, "<[^>]*$", "")>

Assuming CF uses a greedy regex search, which I think it does, wouldn't that strip out "<b>fun</b> and games!" from "Look here for <b>fun</b> and games"?

Comment 3 by Steve Bryant posted on 8/2/2007 at 4:35 AM

Ray,

I recently modified my local copy of your StripHTML() UDF to the following to solve that issue:

REReplaceNoCase(str,"(<|^)[^>]*(>|$)","","ALL")

Seems to do the trick. Meant to post that to cflib, but I haven't gotten around to it.

Comment 4 by Tom Chiverton posted on 8/2/2007 at 12:11 PM

Ray, what if I post text that says
"the cost of coldfusion is < than the cost of .Net"
?
Won't your reg. exp. swallow the whole 2nd half of the post?

Comment 5 by Raymond Camden posted on 8/2/2007 at 4:09 PM

@Lola - good point. I'll add that to the next build.

@Toby - CF has both greedy and non greedy. The form I used was non greedy.

@Steve - if you email me directly I can double check and do an update on cflib.

@Tom - Crap. I hope you are wrong, but I'll do a test later today. Of course, if a person did that, they should have escaped it, not typed a literal <. So that shouldn't happen normally, AND if it does, it just trims down their summary a bit. So I'd probably not be too worried about it.

Comment 6 by Adam Tuttle posted on 8/2/2007 at 5:35 PM

The literal "<" is actually invalid in RSS feeds. That doesn't mean that it won't show up there - but if your blog software does its job correctly, it will escape characters like that when it builds the RSS.