Many web sites now include a simple way to autodiscover the RSS feed for the site. This is done via a LINK tag and is supported by all the modern browsers. You should see, for example, an RSS icon in the address bar on this blog because I have the following HTML in my HEAD block:
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://feedproxy.google.com/RaymondCamdensColdfusionBlog" />
I was talking to Todd Sharp today about how ColdFusion could look for this URL and I came up with the following snippet.
<cfset urls = ["http://www.raymondcamden.com", "http://www.coldfusionbloggers.org", "http://www.androidgator.com", "http://www.cfsilence.com/blog/client"]>
<cfloop index="u" array="#urls#">
    <cfoutput>Checking #u#<br/></cfoutput>
    <cfhttp url="#u#">
    <cfset body = cfhttp.fileContent>
    <!--- Find LINK tags whose type is application/rss+xml (the + must be escaped) --->
    <cfset linkTags = reMatch("<link[^>]+type=""application/rss\+xml"".*?>", body)>
    <cfif arrayLen(linkTags)>
        <cfset rssLinks = []>
        <cfloop index="ru" array="#linkTags#">
            <cfif findNoCase("href=", ru)>
                <!--- Keep only the value of the href attribute --->
                <cfset arrayAppend(rssLinks, reReplaceNoCase(ru, ".*href=""(.*?)"".*", "\1"))>
            </cfif>
        </cfloop>
        <cfdump var="#rssLinks#" label="RSS Links">
    <cfelse>
        None found.
    </cfif>
    <p/>
</cfloop>
The snippet begins with a few sample URLs I used for testing. We then loop over each and perform an HTTP GET. From the response body we use a regex to find the link tags whose type is application/rss+xml. A page can have more than one, so I create an array for my results and append to it the href values I find within them. Nice and simple, right? You could also turn this into a simple UDF:
<cfscript>
function getRSSUrl(u) {
    var h = new com.adobe.coldfusion.http();
    h.setURL(arguments.u);
    h.setMethod("get");
    h.setResolveURL(true);
    var result = h.send().getPrefix().fileContent;
    var rssLinks = [];
    var linkTags = reMatch("<link[^>]+type=""application/rss\+xml"".*?>", result);
    if(arrayLen(linkTags)) {
        for(var ru in linkTags) {
            // keep only the value of the href attribute
            if(findNoCase("href=", ru)) arrayAppend(rssLinks, reReplaceNoCase(ru, ".*href=""(.*?)"".*", "\1"));
        }
    }
    return rssLinks;
}
</cfscript>
Not sure how useful this is - but enjoy!
Archived Comments
Cool, but Node.js's ability to parse the DOM really shines at this.
Regex is really not a good tool for parsing HTML.
I wonder if this problem can be solved with CFGroovy and Rhino...
Regex is really not good for HTML? I strongly disagree. Obviously a DOM parser is easier to work with - but I don't see how you can say regex is a bad tool.
If you go on Stack Overflow, you'll see tons of questions about how to parse something out of HTML, or how to use regex to clean up HTML, and most of the top answers tell people to stop using regex and use a real DOM parser.
Wouldn't it be great if we could manipulate the DOM in CF... :)
Heh, well, regex is powerful. Not terribly easy. ;) Hence the large amount of questions probably.
Henry makes a good point... Wouldn't it be cool if we could parse a page DOM-like... I know I've tackled something like that for a project or two -- how to hierarchically address data within a page for ease of update... Maybe even using something like what Ben Nadel did a while back with an XML search/parser routine. [creative juices beginning to boil]
Very cool! At first I thought it said "regedit"..oh dear
Regex *cannot* correctly parse HTML, especially "wild" HTML, (and it should not be encouraged to do so).
Here's a trivial example of how the existing code will fail:
<!-- Temporary: -->
<link rel="alternate" title="RSS <temp until Tuesday>" type="application/rss+xml" href="http://feedproxy.google.com..." />
<!-- Previous:
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://feedproxy.google.com..." />
-->
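To make the failure concrete, here is a minimal sketch that runs the post's regex against that markup. The example.com hrefs are hypothetical stand-ins for the truncated URLs above, and the variable names are made up for illustration.

```cfml
<!--- Sample markup modelled on the example above; the hrefs are hypothetical --->
<cfsavecontent variable="sampleHtml">
<!-- Temporary: -->
<link rel="alternate" title="RSS <temp until Tuesday>" type="application/rss+xml" href="http://example.com/temp.xml" />
<!-- Previous:
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://example.com/old.xml" />
-->
</cfsavecontent>

<!--- Same pattern as the post's snippet --->
<cfset linkTags = reMatch("<link[^>]+type=""application/rss\+xml"".*?>", sampleHtml)>
<cfdump var="#linkTags#" label="Matched link tags">
```

The live tag should be missed, because `[^>]+` cannot step over the `>` inside the title attribute before it reaches `type=`. The commented-out tag has no such obstacle, so it should be the only match, and the stale feed URL is what gets reported.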
If the source code is valid XHTML, you can use XmlParse to get a DOM (and XmlSearch to obtain the href value).
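As a rough sketch of that XmlParse/XmlSearch route, assuming the page really is well-formed XHTML (the sample markup and the example.com URL are invented for illustration):

```cfml
<cfscript>
// Hypothetical sample document; xmlParse throws if the markup is not well-formed XML.
xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
      & '<title>Sample</title>'
      & '<link rel="alternate" type="application/rss+xml" title="RSS" href="http://example.com/rss" />'
      & '</head><body></body></html>';

doc = xmlParse(xhtml);

// local-name() sidesteps the XHTML default namespace, which a plain //link would not match.
nodes = xmlSearch(doc, "//*[local-name()='link'][@type='application/rss+xml'][@href]");

rssLinks = [];
for (node in nodes) {
    // Each result is an element node; its attributes are exposed as a struct.
    arrayAppend(rssLinks, node.XmlAttributes.href);
}
writeDump(var=rssLinks, label="RSS Links");
</cfscript>
```

In practice the xmlParse call would need a try/catch around it, since most pages in the wild will not parse as XML.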
If it's non-XML HTML then you need to resort to Java-based stuff. (Well, for ACF and OpenBD; Railo has an HtmlParse function.)
Considering it's a web-based language, it really is a pity there isn't (more) support built-in for proper HTML parsing and DOM manipulation.
@Peter: I'd argue that it can - it may not be 100%, but is any solution 100%? Given your specific example, would that even be valid HTML? Shouldn't the inner < and > have to be escaped? How often is HTML valid XHTML? I doubt it's more than 1/3, if that much. Given that my solution works most of the time, I'd call it a more than acceptable solution. (But as it is my solution, I'm probably biased. ;)
You can't assume what you find on the Internet is valid HTML - hence my "wild" comment.
Browsers will accept < and > in attributes, even if it's not valid, and thus people will do that (and not realise their mistake).
If you're acting on specific known and unchanging content, regex can be used and may be good enough (assuming you know the rules of both HTML and regex well enough).
If you're writing software, and/or coping with unknown HTML, trying to use a regex can quickly get complicated and still not get you what you want.
Another example - entirely valid HTML that will trip your matches up in a few ways:
<link
type = 'application/xml'
href = './path/to/rss'
/>
If I *had* to extract RSS URLs with Regex, I'd probably end up with something like this:
<cfsavecontent trim variable="MatchHtmlCommentRegex">
<!--(?s:(?!<!--|-->).)+-->
</cfsavecontent>
<cfsavecontent trim variable="MatchLinkTagRegex">
(?six)
<link
( # <attr>
\s++
[\w-]++ # name
\s*+=\s*+
(?:
[^'"]++ # unquoted value
|
'[^']*+' # single-quoted value
|
"[^"]*+" # double-quoted value
)
)+ # </attr>
\s*+
/?>
</cfsavecontent>
<cfset PossibleRssTypes = "application/rss+xml|application/xml|text/rss" />
<cfsavecontent trim variable="FilterRssLinkRegex">
(?ix)
\stype\s*+=\s*+
['"](?:#ReQuote(PossibleRssTypes)#)['"]
</cfsavecontent>
<cfsavecontent trim variable="GetHrefRegex">
(?six)
(?<=
\shref\s*+=\s*+(['"])
)
(?:(?!\1).)+
</cfsavecontent>
<cfset HtmlText = cfhttp.fileContent.replaceAll( MatchHtmlCommentRegex , '' ) />
<cfloop index="CurLinkTag" array="#rematch( MatchLinkTagRegex , HtmlText )#">
<cfif refind( FilterRssLinkRegex , CurLinkTag ) >
<cfset ArrayMerge( RssLinks , rematch( GetHrefRegex , CurLinkTag ) ) />
</cfif>
</cfloop>
And that *still* wouldn't be a 100% solution. (Even if you fix any mistakes I might have missed.)
(Btw, that won't run as-is, since it uses syntax constructs that CF's Apache ORO regex engine doesn't support.)
If those regexes were written in traditional one-line format, how many people might decipher what's going on?
<!--(?s:(?!<!--|-->).)+-->
(?si)<link(\s++[\w-]++\s*+=\s*+(?:[^'"]++|'[^']*+'|"[^"]*+"))+\s*+/?>
(?i)\stype\s*+=\s*+['"](?:#ReQuote(PossibleRssTypes)#)['"]
(?si)(?<=\shref\s*+=\s*+(['"]))(?:(?!\1).)+
Alternatively, here's something that *would* be a 100% solution:
<cfloop index="RssType" list="application/rss+xml,application/xml,text/rss">
<cfset CurHref = HtmlSearch
( cfhttp.fileContent
, '//link[@type="#RssType#"][@href]'
) />
<cfset ArrayAppend( RssLinks , CurHref ) />
</cfloop>
Unarguably that's far more maintainable than the long buggy regex solution.
And you don't even need to know XPath to have an idea what's happening.
If only we had Html~ functions equivalent to the existing Xml~ functions... :(
Finally, here's a blog post that might have been simpler than writing all the above, but I only remembered it after I'd done it all, so you can have both...
http://www.codinghorror.com...
:)
Interesting stuff there. :) Thanks - I'm still going to use my solution. Um... well, not that I have a -need- for it now. ;)
In case you didn't follow the link out of the Coding Horror post, the HTML Regex parser answer on SO is one of my favorites. Atwood's block quote doesn't do it justice:
http://goo.gl/uUO2E
(still love you Ray!)
( The unobfuscated URL for it being: http://stackoverflow.com/q/... )