Did you know that xmlFormat, which is supposed to make a string safe for XML, doesn't always work? Specifically, it leaves the funky Microsoft Word characters, like smart quotes, untouched. If you are delivering dynamic content via XML, you cannot rely on xmlFormat alone. This is what I'm using now in toXML:
<cffunction name="safeText" returnType="string" access="private" output="false">
	<cfargument name="txt" type="string" required="true">
	<!--- Swap Word's curly quotes, long dashes, and ellipsis for plain ASCII before escaping. --->
	<cfset arguments.txt = replaceList(arguments.txt,chr(8216) & "," & chr(8217) & "," & chr(8220) & "," & chr(8221) & "," & chr(8212) & "," & chr(8213) & "," & chr(8230),"',',"","",--,--,...")>
	<cfreturn xmlFormat(arguments.txt)>
</cffunction>
The replaceList comes from Nathan Dintenfass's SafeText UDF. toXML, in case you don't remember, is a simple CFC that converts native ColdFusion datatypes to XML. Very useful for handing data to Spry.
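For a quick sanity check, here is a minimal sketch of calling safeText. It assumes the function is reachable from the calling code (it is declared private, so in practice that means another method inside the same CFC), and the sample string is made up for illustration.

```cfml
<!--- Hypothetical sample text: curly double quotes, a curly apostrophe, and an ellipsis,
      built with chr() so the source file's own encoding doesn't matter. --->
<cfset sample = chr(8220) & "Ben & Jerry" & chr(8217) & "s" & chr(8221) & chr(8230)>

<!--- Assumes safeText() from above is in scope, e.g. called from a method in the same CFC. --->
<cfset cleaned = safeText(sample)>

<!--- htmlEditFormat is only here so the escaped result is visible when viewed in a browser. --->
<cfoutput>#htmlEditFormat(cleaned)#</cfoutput>
```

The Word characters are swapped for plain ASCII first, so xmlFormat only ever sees characters it knows how to escape.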
Archived Comments
Out of idle curiosity, what app is choking on the unescaped quotes? I ask because the frilly double quotes shouldn't have to be escaped, right? They aren't special characters as far as XML is concerned, so they should be treated like any other Unicode sequence.
Have you tried overriding the default UTF-8 encoding? I end up having to do the Replace() trick on almost all of my XML to set everything to ISO-8859-1 or a whole bunch of stuff starts to break. Lame, but at least workable.
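For anyone who hasn't seen it, one plausible reading of that Replace() trick looks like the sketch below. This is a guess at the approach, not the commenter's actual code. The sample document and variable names are made up, and it assumes toString() on an XML object writes a declaration with encoding="UTF-8".

```cfml
<!--- Hypothetical sample document; any ColdFusion XML object would do. --->
<cfxml variable="doc">
<items><item>example</item></items>
</cfxml>

<!--- Serialize the XML object; the assumption is that the declaration says encoding="UTF-8". --->
<cfset xmlString = toString(doc)>

<!--- Swap the declared encoding. This changes only the label in the declaration,
      not the characters in the string. --->
<cfset xmlString = replaceNoCase(xmlString, 'encoding="UTF-8"', 'encoding="ISO-8859-1"')>
```

Since only the declaration changes, the response still has to be served in the matching charset for the label to be true.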
Microsoft introduced a non-standard character set known as Windows-1252. Its extra characters are often used in Office documents and are not automatically converted by xmlFormat.
http://en.wikipedia.org/wik...
So, is this a bug in xmlFormat, or just an encoding issue that developers are expected to deal with?
Speaking for myself, I call it a bug in xmlFormat. xmlFormat should clean EVERYTHING, _or_ at least throw an error saying it could not convert. Right now the behavior is bad since it silently fails.
I'm no Unicode guru, but it seems to me that this is more a bug in the XML object's ToString() method. It silently writes a header that puts everything in UTF-8 even when there are non-UTF-8 characters in the stream.
I mean, I may be reading the spec wrong, but as near as I can tell the only things that must be escaped (i.e., XMLFormat-ed) are the really obvious ones (angle brackets, ampersands, and quotes):
http://www.w3.org/TR/2006/R...
Everything else is left up to proper Unicode encoding.
-R