Wow, what an exciting blog title there. I bet you saw "process change" and just jumped for joy. All joking aside, let me talk a bit about how ColdFusionBloggers.org used to work and what I just did to help with performance.
Let me start by saying that I think I've made some good changes here. But I'm definitely open to some feedback/criticism/suggestions on how I can do it better. First let me review how the old code worked.
I grabbed a list of blogs and sent it to Paragator. Paragator used threads to a) grab each feed, b) massage the feed, and c) join the threads and data into one large Uber query. I then looped over the results to see if each entry was new; if so, it got inserted into the database. I did all of this in one CFC method to keep things simpler.
So in general this seemed to work OK. But I've been having some issues on the box that started right when I launched the site, so I figure they have to be related. So tonight, while driving home from my brother-in-law's (with only two beers in my system, so I was still sensible), I came up with what I thought was a much improved process. Here then is CFBloggers.org V2's process mechanism:
Grab all blogs.
Loop over each blog in a thread.
Download RSS with CFFEED.
Check the URL of the first entry against an Application cache. If not new, assume feed hasn't changed and exit.
If new, run a Massage UDF which is a subset of Paragator. I removed anything I didn't really need. (Well, mostly.)
Loop over massaged query and insert into DB if entry is new.
That's it. What's interesting is that now I don't need to join my threads, and I don't have an Uber query, so my process.cfm runs in about 100 ms or so. I think this will really help a lot. The cache means that I don't have to massage (edit) the initial feed query after the first time. It also means that I'm not running a bunch of database calls to see whether an entry is new or not. In theory it's possible that someone could generate RSS where the newest entry isn't the first one in the feed, but I'm willing to take that chance.
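Here is roughly what that boils down to in process.cfm. This is a simplified sketch, not the actual file - the service calls, the massageFeed() UDF, and the column names are stand-ins - but it shows the shape of the thing:

    <!--- Assumes application.feedCache was set to an empty struct at application start. --->
    <cfset blogs = application.blogService.getBlogs()>

    <cfloop query="blogs">
        <!--- One thread per blog so a single slow feed doesn't hold up the rest. --->
        <cfthread action="run" name="blog#blogs.id#" blogid="#blogs.id#" rssurl="#blogs.rssurl#">
            <cftry>
                <!--- Download and parse the RSS with CFFEED. --->
                <cffeed source="#attributes.rssurl#" query="entries">

                <!--- Only do real work if the first entry's link differs from what we cached last time. --->
                <cfif entries.recordCount
                      and (not structKeyExists(application.feedCache, attributes.blogid)
                           or application.feedCache[attributes.blogid] neq entries.rsslink[1])>

                    <cfset application.feedCache[attributes.blogid] = entries.rsslink[1]>

                    <!--- Massage the raw feed query (the trimmed-down subset of Paragator), then insert anything new. --->
                    <cfset cleaned = massageFeed(entries)>
                    <cfloop query="cleaned">
                        <cfif not application.entryService.entryExists(attributes.blogid, cleaned.title, cleaned.link)>
                            <cfset application.entryService.addEntry(attributes.blogid, cleaned.title, cleaned.link, cleaned.posted)>
                        </cfif>
                    </cfloop>

                </cfif>

                <cfcatch type="any"><!--- One bad feed shouldn't kill the run; just skip it. ---></cfcatch>
            </cftry>
        </cfthread>
    </cfloop>

Since the page just spawns the threads and moves on (no join), none of the per-feed work counts against that ~100 ms page time.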
So - what do people think? Good change? I've updated the code zip file on the site. I will say I think process.cfm is a bit ugly now. I tried to comment a lot but it may not be the nicest thing you see come from me.
Oh - and I made stats.cfm a bit more random. That was totally useless and just for fun.
What are my plans? Well, I'm obviously taking this a bit past the "proof of concept" stage, I think. I'd like to work on the admin later this week and use it as a chance to play with cfgrid and record editing.
Archived Comments
I'm not sure of the exact workings of <cffeed, but if it involves xmlParse() in any way, it can take down a server pretty fast when you're parsing the amount of stuff that you are.
I think any amount of "breaking up" or threading the xmlParse part of the operation will help out a lot. One thing you could do is a manual <cfhttp of each feed, save the filecontent as a flat file, then do a <cffile read and a reFind or some plain-text function to check for a new url, and IF there is something new, then perform the <cffeed. I know it seems drastic, but xmlParse destroys JVM memory like nothing else. Granted, I haven't tried it with <cfthread (I smell a little test!), but if there is something I would like more than anything else, it is for CF to do some sort of 'line by line' xml parsing and action instead of having to parse an entire doc into memory.
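Something along these lines, just as a sketch - the cache path, the regex, and the feedUrl/blogId variables are made up for illustration:

    <!--- Fetch the raw feed and compare its first <link> against a cached copy on disk. --->
    <cfhttp url="#feedUrl#" method="get" result="feedHttp">
    <cfset match = reFindNoCase("<link>.*?</link>", feedHttp.fileContent, 1, true)>
    <cfset newLink = "">
    <cfif match.pos[1]>
        <cfset newLink = mid(feedHttp.fileContent, match.pos[1], match.len[1])>
    </cfif>

    <cfset cacheFile = expandPath("./cache/feed#blogId#.txt")>
    <cfset oldLink = "">
    <cfif fileExists(cacheFile)>
        <cffile action="read" file="#cacheFile#" variable="oldLink">
    </cfif>

    <cfif newLink neq oldLink>
        <cffile action="write" file="#cacheFile#" output="#newLink#" addnewline="no">
        <!--- Only now pay the xmlParse cost. --->
        <cffeed source="#feedUrl#" query="entries">
    </cfif>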
Justin's comments brought back my nightmare of using xmlParse in the past. If cffeed uses a DOM parser (which is likely, given the task) it's going to eat memory like there's no tomorrow if you're processing nearly 300 feeds with it.
I wrote a proof-of-concept SAX parser for CF a while ago and I've been tinkering with it lately, providing it as example code to accompany job applications. It's a wonderful, wonderful alternative to xmlParse if you're trying to keep memory consumption low. If I can find some time today I'll plug it into the CFBloggers code and let you know how I get on.
Interesting. I could store the result of CFHTTP into RAM. It's just a string. Not a small one, but 290 of them shouldn't be too bad. I would then only parse if the string had changed. I need to check and see if CFFEED allows you to pass in a plain string instead of a URL.
Shoot - it only accepts filenames or URLs. That kinda stinks. I'll have to file an ER on that.
Ray,
Why not just hash the result of the page request as text, then check it against the prior hash in the DB? If it changed, re-parse. You could even store the prior page hash in application scope memory to eliminate a DB check, and hash() should run tons faster than xmlParse!
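Roughly like this - feedUrl, blogId, and the application cache struct are stand-ins, and this shows the application-scope flavor rather than the DB check:

    <cfhttp url="#feedUrl#" method="get" result="feedHttp">
    <cfset newHash = hash(feedHttp.fileContent, "MD5")>

    <cfif not structKeyExists(application.feedHashCache, blogId)
          or application.feedHashCache[blogId] neq newHash>
        <cfset application.feedHashCache[blogId] = newHash>
        <!--- Content changed (or first run), so pay the parse cost now. Since CFFEED only takes a
              file or URL, this hits the feed a second time; writing feedHttp.fileContent to a temp
              file and pointing CFFEED at that would avoid the extra request. --->
        <cffeed source="#feedUrl#" query="entries">
    </cfif>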
Right - but the issue is that I'd have to save the file to the file system. By itself that isn't too bad - but I tend to have a fear of writing stuff out to the file system, especially in a case like this. But.... most likely in my 290 feeds only a small portion are updated.
Let me say this. Let's see how well my current change runs and see if we need to make this change as well.
Oh - the hash. I don't think I need to hash. I mean, won't hashing make the string even bigger?
Well, I thought you were storing the entire article in memory between pulls and comparing the new pull to the old. I was just thinking that storing a hash in memory (which would be tiny compared to a whole page of text) and comparing the hash of the new page request would allow a very fast check to see if you need to parse the xml and re-process for new posts. The largest type of hash would only generate an 88 character text string to compare.
I didn't know that. Hashes don't get bigger than 88 chars?
Yup, check it (I'm sure you have already, but here anyways)
http://www.cfquickdocs.com/...
I use hashes for quite a bit, even uuEncoding a binary image file and hashing it to ensure it has not been tampered with =) It's a great fixed size. The 88-char hash is the largest one; I use the default most of the time, which is only 32 characters I think. This would at least tell you if the page needs to be processed further, since if the returned request is identical, the hashes would match.
sorry, here:
http://www.cfquickdocs.com/...
@Justice,
You are right to say the hash will not necessarily be as long as or longer than the text being hashed. It seems, though, that the documentation is incorrect about the sizes of the various algorithms. According to the comments in the documentation:
The docs are wrong about how big the returned strings are. I ran the following code and these are the results I got.
Hash("This is a string to hash", "MD5") - 32 characters
Hash("This is a string to hash", "SHA") - 40 characters
Hash("This is a string to hash", "SHA-256") - 64 characters
Hash("This is a string to hash", "SHA-384") - 96 characters
Hash("This is a string to hash", "SHA-512") - 128 characters
DW
A bit off topic, but I have a suggestion for the site. One thing that bothers me a lot about most aggregators I've used is the duplication of entries. Specifically, the reposting of old entries. This happens a lot on mxna when somebody does something like upgrade their copy of blogCFC. All of a sudden the aggregator reposts their last 40 entries (this doesn't happen when I upgrade my blog, but that's a different topic). It would be AWESOME if you could always check a post against the blog's entire list of entries and filter reposts. You might be doing this already...
My logic checks the blog id, title, and link, so unfortunately if you change your host, it will re-grab your entries as new - but only the ones still showing in your RSS feed.
I _could_ just check blog ID, title, and date, actually. That is unique enough. Of course, then if you change your timezone settings, it will all look new to it.
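For the curious, the check is basically a query along these lines (table and column names are illustrative, not the real schema):

    <cffunction name="isNewEntry" access="public" returnType="boolean" output="false">
        <cfargument name="blogid" type="numeric" required="true">
        <cfargument name="title" type="string" required="true">
        <cfargument name="link" type="string" required="true">
        <cfset var dupeCheck = "">
        <cfquery name="dupeCheck" datasource="#application.dsn#">
            select  id
            from    entries
            where   blogid = <cfqueryparam value="#arguments.blogid#" cfsqltype="cf_sql_integer">
            and     title  = <cfqueryparam value="#arguments.title#" cfsqltype="cf_sql_varchar">
            and     link   = <cfqueryparam value="#arguments.link#" cfsqltype="cf_sql_varchar">
        </cfquery>
        <cfreturn dupeCheck.recordCount is 0>
    </cffunction>

Swap the link check for a date check and you have the blog ID + title + date variant.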
Thanks for the tip about hash. Oh wait, we are talking about CF? But really, for some reason it never occurred to me to use hash to compare changes in XML files...
Learn something new every day!
To be a well-behaved RSS consumer, shouldn't you check for an ETag or Last-Modified date in the headers first, so you don't have to 1) download and 2) process the entire feed? How often do you check each feed? I was working on a similar project and set up a staggered schedule for checking the feeds to offset the load. So basically, you run the task every few minutes but only check X number of feeds, keeping track of which ones have been checked each time and putting them at the end of the queue.
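The staggering can be as simple as a rotating queue of blog ids in the application scope - something like this sketch, where checkFeed() and the queue setup are placeholders:

    <!--- Assumes application.feedQueue was seeded with every blog id at application start. --->
    <cfset batchSize = 25>
    <cfset batch = arrayNew(1)>
    <cflock scope="application" type="exclusive" timeout="10">
        <cfloop from="1" to="#min(batchSize, arrayLen(application.feedQueue))#" index="i">
            <!--- Pull the next id off the front and put it back on the end. --->
            <cfset arrayAppend(batch, application.feedQueue[1])>
            <cfset arrayAppend(application.feedQueue, application.feedQueue[1])>
            <cfset arrayDeleteAt(application.feedQueue, 1)>
        </cfloop>
    </cflock>

    <!--- Check just this batch; everything else waits for a later run of the scheduled task. --->
    <cfloop from="1" to="#arrayLen(batch)#" index="i">
        <cfset checkFeed(batch[i])>
    </cfloop>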
Doug - I don't believe CFFEED supports that. But if I do switch to a CFHTTP approach then I could check that - if the remote feed supports it of course.
I'm currently checking every 5 minutes.
Right, I forgot this is more a proof of concept of CFFEED, but since you packaged it up as an app and people are obviously using it, it would make sense to put in a CFHTTP call for headers before executing the CFFEED. Actually downloading someone's feed every 5 minutes without even checking for freshness is very bad behavior. I would certainly ban you from my feed :-)
It would save you LOTS of overhead on processing those needlessly downloaded feeds too. The hash compare might be OK at best, but if a user made one tiny change to an article, the hashes won't match and you'll still end up processing the feed content.
Once every 5 minutes is too much? That's not a lot of traffic. I'd hate to get banned for that. ;)
So right now my plans are to keep things as is - because I'm monitoring the memory use in general. After a week or so, I'm then going to look into rewriting it.
I do have an issue with the header check though - for feeds that don't support it, I'd have to hit them twice. Although the header request should be small. What I really want is a GETIF type operation where you pass in a date field as a header: if the remote URL was updated after your date, you get a full result; if not, you get just the headers.
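As it happens, that's essentially what a conditional GET with the If-Modified-Since header does: send back the Last-Modified value from the previous fetch and, if nothing changed, the server answers 304 Not Modified with no body - one request either way. A rough CFHTTP sketch (the cache struct and feedUrl/blogId are stand-ins, and not every feed's server honors the header):

    <cfhttp url="#feedUrl#" method="get" result="feedHttp">
        <cfif structKeyExists(application.lastModCache, blogId)>
            <cfhttpparam type="header" name="If-Modified-Since" value="#application.lastModCache[blogId]#">
        </cfif>
    </cfhttp>

    <cfif feedHttp.responseHeader.status_code neq 304>
        <!--- Feed changed (or the server ignored the header): remember the new Last-Modified value. --->
        <cfif structKeyExists(feedHttp.responseHeader, "Last-Modified")>
            <cfset application.lastModCache[blogId] = feedHttp.responseHeader["Last-Modified"]>
        </cfif>
        <!--- ...then hand feedHttp.fileContent (or the URL) off to the normal feed processing here. --->
    </cfif>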