Here is a mystery for folks. I've updated my parsing engine for coldfusionbloggers.org. I'm using CFHTTP now so I can check Etag type stuff. I take the result text and save it to a file to be parsed by CFFEED.
But before I do that I check to ensure it's valid XML. Here is where it gets weird. Charlie Griefer's blog works with CFFEED directly, but isXML on the result returns false. But - I can xmlParse the string no problem. Simple example:
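<!--- Feed that CFFEED reads fine but isXml() rejects --->
<cfset f = "http://cfblog.griefer.com/feeds/rss2-0.cfm?blogid=30">
<cfhttp url="#f#">
<cfset text = cfhttp.filecontent>
<cfif isXml(text)>
    yes
<cfelse>
    no
    <!--- xmlParse() still succeeds on the same string --->
    <cfset z = xmlParse(text)>
    <cfdump var="#z#">
</cfif>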
If you run this, you will see "no" output, and then an XML object. If you use CFFEED on the URL directly, that works as well. So it seems like isXML is being strict about something. I can update my code to try/catch an xmlParse obviously, but I'd rather figure out why the above is happening first.
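For reference, the try/catch fallback mentioned above might look roughly like this. It is only a sketch: the function name canParseXml is made up for illustration, and it assumes that any exception thrown by xmlParse() should be treated as "not usable XML".

```cfml
<!--- Hypothetical helper: returns true when xmlParse() accepts the string --->
<cffunction name="canParseXml" returntype="boolean" output="false">
    <cfargument name="text" type="string" required="true">
    <cftry>
        <!--- The parsed object is discarded; only success or failure matters --->
        <cfset xmlParse(arguments.text)>
        <cfreturn true>
        <cfcatch type="any">
            <cfreturn false>
        </cfcatch>
    </cftry>
</cffunction>

<!--- Usage: accept the feed if either check passes --->
<cfif isXml(text) or canParseXml(text)>
    usable
<cfelse>
    not usable
</cfif>
```

This parses the document and throws the result away, so it is only worth doing when isXml() has already said no.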
Archived Comments
hi ray:
i sent this entry to jon clausen, the big brain behind cfblog. i know you said you think it's a "cf thing" more than a "cfblog thing", but i figured jon might have some insights he can offer up.
I ran that URL through the w3c XML validator and it didn't validate.
Where exactly did you validate? I tried here and it validated:
http://www.validome.org/rss...
http://www.validome.org/xml... doesn't validate it either. I was trying to use a more generic validator to simulate how isXML would behave. Is it because it's missing a DTD/DOCTYPE declaration?
Is it any surprise that it's Charlie's feed? Does that shock anyone?
It's passed at the two validators I tried:
http://validator.w3.org
http://feedvalidator.org
This is really neat! I can't see a reason this would fail to be valid XML.
hey when you roll hard core like me, you uncover problems that n00bs such as yourself don't encounter :P
um yeah, just to clarify... my previous comment was in response to todd (best not to piss off jeff, i figure) :)
If you look within the CDATA for each item there are quite a few tags that are malformed or they get chopped. If you correct these then the xml validates.
LOL! It's ok to piss me off, you have my permission.
Interesting insights, Ray. I dug into it briefly, but here's what I can tell so far:
The problem is <![CDATA[]]> in the xmlNode. I always use the W3C validator which escapes malformed HTML within the CDATA. I used CDATA on purpose because xmlFormat() doesn't always re-format correctly for valid RSS Feeds - especially when non-technical users are providing the input.
ColdFusion's isXML() doesn't appear to escape the CDATA content, however. For example, the following feed using xmlFormat() with Charlie's content returns isXML() true:
http://cfblog.griefer.com/f...
Whereas the original does not:
http://cfblog.griefer.com/f...
Interesting stuff!
So are you saying that if I had
<b>foo
In my CDATA, CF would consider it bad because I never closed the B?
O_O
did jon just call me a 'non-technical user'? :)
@charlie:
"did jon just call me a 'non-technical user'? :)"
Errr..... :-O No actually, the original change to using CDATA was from a couple of non-technically oriented blog portals like pieceoftexas.com. Users were pasting from Word and even with the WYSIWYG, xmlFormat() wasn't cleaning it up enough. There were also intermittent problems with feed readers decoding inline JavaScript like YouTube posts, etc. from users' content.
@Ray
I'm going to play around with it, but it appears that any raw HTML in CDATA will cause isXML() to fail - which is the reason for using CDATA in the first place.
Jon, please let me know asap and I will file a bug report for it. Btw, I know of the xmlFormat issue and it bugs the you know what out of me.
Ray said: "So are you saying that if I had <b>foo In my CDATA, CF would consider it bad because I never closed the B?"
In my testing, it didn't appear to be looking for parity/balance of tags, so much as it was looking for parity of brackets. That is, <b> without </b> is okay, as is <a>foo</b>, but <b (no closing bracket) is not. His feed at this moment has an A tag that has been chopped off between the tagName and its first attribute.
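A quick way to check that bracket-parity observation yourself is something like the following. The three sample strings are invented for illustration and are not taken from the actual feed. Going by the testing described above, the first two would be expected to pass and the third to fail, but I haven't confirmed that output here.

```cfml
<!--- Hypothetical samples; the CDATA bodies are invented for illustration --->
<cfset unclosedTag   = "<item><description><![CDATA[<b>foo]]></description></item>">
<cfset mismatchedTag = "<item><description><![CDATA[<a>foo</b>]]></description></item>">
<!--- Tag chopped after the tag name, with no closing bracket --->
<cfset choppedTag    = "<item><description><![CDATA[<a ]]></description></item>">

<cfoutput>
    unclosed tag: #isXml(unclosedTag)#<br>
    mismatched tag: #isXml(mismatchedTag)#<br>
    chopped tag: #isXml(choppedTag)#<br>
</cfoutput>
```

Running xmlParse() on the same three strings inside a try/catch would show whether the two functions disagree on the chopped case.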
I think parsing xml will work with what I'll call "sub xml," while isXML(value) would verify that the value is a valid xml document. The difference would be that parsing an xml doc for a substructure would allow you to query that substructure by parsing xml again, while it wouldn't still be a valid xml doc.
I've confirmed this and will write a follow-up blog entry a bit later this morning. I'm going to log the bug report right now.