Correction to earlier ColdFusionBloggers.org post and a warning about removing HTML

You know that old joke where the person says something to the effect of, "I don't make mistakes. I thought I made a mistake once, but I was wrong." That's what happened to me today. Earlier I posted about how I had forgotten an HTMLEditFormat in my display. Turns out there was a reason I had forgotten it. I had made the decision not to escape HTML, but to simply remove it. I was using this code:


<cfset portion = reReplace(content, "<.*?>", "", "all")>

This line uses a regular expression to remove HTML from a string. And it works fine too. Except when the string contains invalid or partial HTML. Consider this string:

I really wish I could meet Paris Hilton. She is the most intelligent person in the world. Maybe Lindsey is close though. If I met them, I'd show them my web page at <a href=

Notice how the HTML is cut off at the end? Some RSS feeds provide only a portion of each entry, and sometimes these portions end in the middle of an HTML tag. (BlogCFC doesn't have this bug. :) So I added one more line of code:


<cfset portion = reReplace(portion, "<.*$", "")>

This looks for an incomplete HTML tag at the end of the string and removes it. Once I added it, my display issue on ColdFusionBloggers was fixed.
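
To see the two lines working together, here is a minimal sketch. The sample string is a shortened, made-up variant of the one above (the bold tag is added purely for illustration), and the variable names just follow the earlier snippets.

```cfm
<!--- Hypothetical truncated feed excerpt; it ends in the middle of an anchor tag --->
<cfset content = 'Maybe <b>Lindsey</b> is close though. I''d show them my web page at <a href='>

<!--- Step 1: remove every complete tag (non-greedy match up to the nearest closing bracket) --->
<cfset portion = reReplace(content, "<.*?>", "", "all")>

<!--- Step 2: remove whatever is left from an unclosed opening bracket to the end of the string --->
<cfset portion = reReplace(portion, "<.*$", "")>

<cfoutput>#portion#</cfoutput>
```

Step 1 leaves the dangling anchor fragment alone because it never finds a closing bracket, and step 2 then trims it off, so the output should end at "my web page at". Keep in mind that the second expression drops everything from the first leftover opening bracket onward, so a stray unescaped bracket in the text would shorten the summary as well.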

I've updated the zip. Also, Lola Beno fixed up some issues I had with my SQL script. By "fix" I mean completely rewrote, so thanks Lola!

Archived Comments

Comment 1 by Lola LB posted on 8/2/2007 at 4:05 AM

Taking another look at layout.cfm, I noticed some code at the bottom, having to do with Google Analytics. This is specifically for *your* ColdFusion Bloggers site, right? If so, it'd probably be wise to advise users to remove this code or replace it with their own Google Analytics account. ;-)

Comment 2 by Toby Reiter posted on 8/2/2007 at 4:11 AM

Ray,
Wouldn't you be better off with the following?

Assuming CF uses a greedy regex search, which I think it does, wouldn't that strip out "<b>fun</b> and games!" from "Look here for <b>fun</b> and games"?

Comment 3 by Steve Bryant posted on 8/2/2007 at 4:35 AM

Ray,

I recently modified my local copy of your StripHTML() UDF to the following to solve that issue:

REReplaceNoCase(str,"(<|^)[^>]*(>|$)","","ALL")

Seems to do the trick. Meant to post that to cflib, but I haven't gotten around to it.

Comment 4 by Tom Chiverton posted on 8/2/2007 at 12:11 PM

Ray, what if I post text that says
"the cost of coldfusion is < than the cost of .Net"
?
Won't your reg. exp. swallow the whole 2nd half of the post?

Comment 5 by Raymond Camden posted on 8/2/2007 at 4:09 PM

@Lola - good point. I'll add that to the next build.

@Toby - CF has both greedy and non-greedy. The form I used was non-greedy.

@Steve - if you email me directly I can double check and do an update on cflib.

@Tom - Crap. I hope you are wrong, but I'll do a test later today. Of course, if a person did that, they should have escaped it, not typed a literal <. So that shouldn't happen normally, AND if it does, it just trims down their summary a bit. So I'd probably not be too worried about it.

Comment 6 by Adam Tuttle posted on 8/2/2007 at 5:35 PM

The literal "<" is actually invalid in RSS feeds. That doesn't mean that it won't show up there - but if your blog software does its job correctly, it will escape characters like that when it builds the RSS.

Raymond Camden
