The Solved Problem of Thundering Herds

[EDIT: As if fate decided to slap me in the face, the strangest entry decided to get an enormous influx of traffic right as I was running an XCloner backup while updating the wording of an entry, trying to make myself sound like a bit less of a jerk. Egg on my face as this poor little Micro server did exactly what I crowed about it not doing…dying. Mumble mumble excuse diversion]

This blog runs on an Amazon AWS EC2 t1.micro instance (~600MB of RAM and a single, heavily throttled vCPU), an intentionally restrictive choice I made years ago in an effort to practice something I oft preach: that it is unacceptable for sites to fall over and die under the slightest bit of attention, failing at what should be their moment of glory.

I’ve had a number of very heavy burst days since with absolutely no issues at all, not even a perceptible slowdown for users.

Last evening into today, for instance, saw a spike of traffic directed here from Hacker News, Reddit, and various Twitter referrals. In all there were some 50,000+ page impressions in the past 20 hours.

Spread out evenly, that is hardly impressive (under a page impression a second). However, traffic was very bursty, with extended periods exceeding 60 page impressions per second.
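As a quick back-of-the-envelope check of those figures, here is a small Python sketch. It uses only the numbers quoted above (50,000 impressions, 20 hours, bursts of 60 per second); the variable names are mine, and the 50,000 figure is a floor since the real count was "50,000+".

```python
# Figures taken from the post; names are illustrative only.
page_impressions = 50_000   # lower bound on impressions in the window
window_hours = 20           # length of the window in hours
burst_rate = 60             # impressions per second during bursts

# Average rate if the traffic had arrived perfectly evenly.
average_rate = page_impressions / (window_hours * 3600)

# How much heavier the bursts were than the evenly spread average.
burst_ratio = burst_rate / average_rate

print(f"average: {average_rate:.2f} impressions/second")
print(f"bursts were roughly {burst_ratio:.0f}x the average")
```

The average hides the bursts entirely, which is why a site sized for the average falls over when the burst arrives.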

That isn’t huge by any stretch of the imagination, but this is exactly the same sort of situation that sees the dreaded “Database connection not available” error on so many sites.

Yet there were zero database connection errors. No error 500s. Everything running smoothly without the slightest hint of trouble. Absolutely nothing was stressed at all. A run of top usually showed top or sshd as the top consumer of resources.

On this miserably tiny server. Running WordPress!

The reason, of course, is caching. With W3 Total Cache the vast majority of requests are served as if they were static requests, with the smallest dynamic wrapper to match the request with the corresponding static resource. You could take it a step further and actually generate static file resources, eliminating anything dynamic above the instance of nginx or Apache, as I did with the Name Navigator. However, that is optimizing the edges and can be an optimization too far.
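To make the idea concrete, here is a minimal Python sketch of a page cache with a thin dynamic wrapper. It is not how W3 Total Cache is implemented; the `PageCache` class, the `slow_render` function, and the 300 second lifetime are all hypothetical names and values chosen for illustration. The point is only that the expensive render (and its database work) runs once per page per lifetime, no matter how many requests arrive.

```python
import time
from typing import Callable, Dict, Tuple


class PageCache:
    """Hypothetical in-memory page cache: maps a request path to rendered HTML."""

    def __init__(self, render: Callable[[str], str],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._render = render          # the expensive, database-backed renderer
        self._ttl = ttl_seconds        # how long a rendered page stays fresh
        self._clock = clock
        self._pages: Dict[str, Tuple[float, str]] = {}
        self.render_count = 0          # counts how often the slow path ran

    def get(self, path: str) -> str:
        now = self._clock()
        entry = self._pages.get(path)
        # Fast path: a fresh copy exists, so serve it like a static resource.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Slow path: render once, then store the result for later requests.
        html = self._render(path)
        self.render_count += 1
        self._pages[path] = (now, html)
        return html


def slow_render(path: str) -> str:
    # Stand-in for a full dynamic page build (assumed, not real WordPress code).
    return f"<html><body>Rendered {path}</body></html>"


cache = PageCache(slow_render, ttl_seconds=300.0)
for _ in range(1000):
    body = cache.get("/some-post/")

print(cache.render_count)  # 1000 requests, a single render
```

This sketch is single-threaded. A real server handling concurrent requests would also need to ensure that only one request rebuilds an expired page while the others wait or are served the stale copy.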

The Thundering Herd problem is a solved issue at normal scales, without reactively firing up an army of AWS instances because you made it to the front page of a social news site. It is very unfortunate when sites die under marginal loads: users’ time is wasted, and authors of interesting content are deprived of their moment of exposure.

As an aside, one of the most rewarding experiences is when I see people who come here via posts on social news sites, but instead of simply bouncing back they continue on to read through various other pages. That it was interesting enough on a net of endless information is very gratifying.