Sometimes it's the small mistakes... (2)

For a few days now the Adobe Groups site has been suffering from an odd issue. It would run as fast as normal, never throwing any odd error, and then would come to a screaming halt. Requests for the CF Admin worked fine, but any request for the actual site returned a 503 after a long delay. I assumed it was a load issue and contacted the absolute master of ColdFusion and High Availability, Mike Brunt. From time to time I get pinged for side work, jobs, etc. Any time that request is for something involving load, performance testing, or stability, I always recommend Mike. He is both an expert and an all-around good guy, so please consider his services if you have a need! Mike's direction is directly responsible for what I found last night, so I cannot thank him enough. (And let me add that we are still digging into this, so there may be more to this than what I blog today.)

I was using the excellent Server Monitor that ships with ColdFusion to watch both machines in my cluster. I noticed a set of requests that were extremely long lived. Groups being a Model-Glue site, most of the requests simply showed up as index.cfm. However, do not forget you can double click on a request to get more information. Here is an example - and as a quick aside - this is one of the "bad" requests before I pushed my fix. (And note I blurred the path and the URL variable.)

Notice that in the detail, I've got my URL parameters. When I ran the same URL on the site I noticed it was completely unresponsive. I did a quick database export/import to my local machine (and once again, I love how easy MySQL makes that), and noticed that in my local environment, I got a visible error. So the question was - why did production hang for the URL when development did not?

I began to dig into the part of the request that was throwing an error. It used CFFEED to parse an RSS URL. I looked at the data and entered the URL manually. Bam. I was prompted to authenticate. OK, so we don't support authenticated RSS feeds. But we do support handling errors. So what's the problem? On a whim I decided to try a poor man's load test. This is a dumb test, don't repeat it, but I was just curious. I hit reload on my dev machine about 30-40 times very quickly. My CPU spiked and I ended up with a 503.

Wow.

This is when I discovered a critical flaw. My code to handle RSS feeds cached its results by URL. That way if N people wanted the same feed, they would share the same cache entry. RSS feeds don't change very often, so there is certainly no need to refetch them for every request. However - while I checked to see if a URL was valid, I didn't properly handle a URL that returned something other than an RSS feed. At a high level, if the RSS feed was bad, I did handle the error. But I didn't have a granular catch of the error at the level of the feed code itself. So I added some quick logic to notice these errors, and here's the big change - I cached the failure as well. If URL X is invalid, there is no need to keep rerunning an HTTP request to grab the data. It is certainly possible that X will become valid again - and our cache will handle that fine since we only cache for 30 minutes.
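To make the idea concrete, here is a minimal sketch of caching failures alongside successes. It is written in Python rather than CFML, and it is not the actual Groups code: the names `FeedCache`, `CacheEntry`, and the injected `fetch` function are all hypothetical, chosen only for illustration. The one detail taken from the post is the 30 minute lifetime.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

CACHE_TTL_SECONDS = 30 * 60  # the post caches feeds for 30 minutes


@dataclass
class CacheEntry:
    """One cached outcome for a feed URL: either parsed items or an error."""
    expires_at: float
    items: Optional[list] = None   # set when the fetch succeeded
    error: Optional[str] = None    # set when the fetch failed


class FeedCache:
    """Caches the result of fetching a feed, including failed fetches.

    `fetch` is a hypothetical callable that takes a URL, returns the parsed
    feed items, and raises an exception for a bad or unreachable feed.
    """

    def __init__(
        self,
        fetch: Callable[[str], list],
        ttl: float = CACHE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, CacheEntry] = {}

    def get(self, url: str) -> CacheEntry:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is None or entry.expires_at <= now:
            try:
                entry = CacheEntry(now + self._ttl, items=self._fetch(url))
            except Exception as exc:
                # The important part: remember the failure too, so a bad URL
                # costs one HTTP request per TTL window, not one per page view.
                entry = CacheEntry(now + self._ttl, error=str(exc))
            self._entries[url] = entry
        return entry
```

A caller would check `entry.error` before rendering `entry.items`. Because the failure expires like any other entry, a feed that becomes valid again is picked up on the first request after the window ends. This sketch does not address concurrency: several requests arriving at the same instant for an uncached URL could each trigger a fetch, so a real implementation would likely want a lock around the miss path.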

So to summarize my mistake here - I had the forethought of caching the HTTP results when the results were good, but not when they were bad. This meant I was still issuing a large number of HTTP requests when I didn't really need to. The error never really got to me since I had error handling at a higher level.

I hope this makes sense, and if not, let me know. And as always, I hope my painful mistakes help others. ;)

Archived Comments

Comment 1 by Jason Fisher posted on 5/8/2010 at 7:43 PM

Nice catch. So often it's those little things, like this one, that are so tough to catch in development. It's easy enough to create large datasets to test long pages and complex data requests and many users and so forth, but it doesn't necessarily occur to someone to pick one small *piece* of an event and bang on it 40 times in quick succession.

Comment 2 by That's what she said posted on 5/8/2010 at 8:13 PM

@Jason Fisher

"but it doesn't necessarily occur to someone to pick one small *piece* of an event and bang on it 40 times in quick succession. "

That's what she said!

Comment 3 by Marc Esher posted on 5/9/2010 at 12:23 AM

Ray mentioned that with Model Glue all the URLs showed up as "index.cfm" in the server monitor. For folks who might not know about server monitor "aliases", Aliases solve that problem. Basically, it's an easy way for front controller frameworks to have URLs that appear distinctly in the server monitor rather than all as simply index.cfm. Charlie wrote about it here: http://www.adobe.com/devnet...

Thanks for the great post, Ray.

Comment 4 by Raymond Camden posted on 5/9/2010 at 1:02 AM

No problem Marc and thanks for reminding us about that feature. With more and more people using frameworks, it's a pretty important tip.

Raymond Camden
