Chaos Monkeys and SSDs

Jeff Atwood has detailed some anecdotal observations regarding the purportedly poor reliability of SSDs.

I’ve long been a flash storage booster, both from a consumer perspective and, even more so, for its incredible benefits to enterprise data platforms.

We’re currently in production with some FusionIO SLC PCI-E cards. Our I/O platform has gone from being a perpetual chokepoint to now being seemingly unbounded.

Sidenote: While flash-based storage offers enormous performance improvements, the gain often isn’t as profound as it should be, because many software products make obsolete assumptions about I/O performance. For example, the SQL Server query cost estimation engine (the part that decides how to execute a query, including whether to use an index or not) grossly overestimates the cost of I/O operations on flash devices and even high-end SANs, with no mechanism to auto-discover the real cost or even to let one override it manually. It often chooses sequential access over random access even when the net result is much slower performance, requiring explicit query hints to strong-arm it into doing the right thing.

Flash storage technology, along with the availability of mega-memory at economical prices (144GB in low-cost servers), has completely changed the game. A single server is now capable of servicing impossibly huge needs.

I do think the experiences Jeff describes are highly abnormal. SSDs usually come with lengthy warranties (both OCZ and Intel offer 3-year SSD warranties) in an attempt to offset many of the early reliability concerns. If the drives really failed every 10 months, the manufacturers issuing those warranties would suffer enormous financial losses, and would be unlikely to be as enthusiastic about the market as they are.

I doubt the anecdotal failure rates detailed are typical at all: that business model would not be sustainable for the industry. I have heard of failure rates in the 3% range within the first year, with most failures occurring soon after the drive is put into service (Intel runs at a much lower failure rate, making it your best bet if you want to improve your odds), but nothing like the seeming 100% that Jeff describes.
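To put rough numbers on the warranty argument, here is a back-of-the-envelope sketch. It assumes a constant failure rate (an exponential lifetime model), which is my simplification and not something any vendor has published; real drives tend to fail early, as noted above. The function names are made up for illustration.

```python
import math

def expected_replacements(mean_months_to_failure: float, warranty_months: float) -> float:
    """Expected failures per drive slot over the warranty period.

    Assumes a constant failure rate and that every failed drive is
    replaced by one with the same failure rate (a simplification).
    """
    return warranty_months / mean_months_to_failure

def first_year_failure_probability(mean_months_to_failure: float) -> float:
    """Chance that a drive fails within 12 months under the same model."""
    return 1.0 - math.exp(-12.0 / mean_months_to_failure)

def mean_months_from_annual_rate(annual_failure_rate: float) -> float:
    """Mean lifetime implied by a given first-year failure probability."""
    return -12.0 / math.log(1.0 - annual_failure_rate)

# Scenario A: drives that die every 10 months on average.
bad = expected_replacements(10.0, 36.0)

# Scenario B: the roughly 3% first-year failure rate I have heard quoted.
typical = expected_replacements(mean_months_from_annual_rate(0.03), 36.0)

print(f"10-month lifetime: {bad:.1f} replacements per drive sold")
print(f"3% annual rate:    {typical:.2f} replacements per drive sold")
```

Under these assumptions, a vendor would ship several free replacements for every drive sold in the first scenario, versus fewer than one in ten in the second. That gap is the whole argument.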

Bring on the Chaos Monkeys

It is nonetheless a critical practice to always assume that your storage platform is about to crash at any minute, and to have a recovery process prepped and ready to go into action.

If you don’t have it backed up, consider it gone. So many times we’ve heard the same story about the person losing months of work because they had no backup.

At this stage in the industry, losing more than minutes of data to a failure deserves to be ridiculed, meriting no sympathy at all. There are no excuses.

With the proper backup strategy, and the confidence that it is robust, a storage failure just isn’t a big deal and should be at most a minor inconvenience. So many paranoid commentators are desperately hoping to be told that it will never fail before they take the leap, betraying a fundamentally flawed operating principle: assuming that reliability is trustworthy.
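Confidence that a backup is robust only comes from actually restoring it. As a minimal sketch of that idea (not a real backup tool; the file names and helper functions here are hypothetical), the following copies a file to a backup location, restores it to a scratch directory, and compares checksums:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy the source file into backup_dir and return the copy's path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)
    return target

def restore_matches(source: Path, backup: Path, scratch_dir: Path) -> bool:
    """Restore the backup to a scratch location and compare it to the source."""
    scratch_dir.mkdir(parents=True, exist_ok=True)
    restored = scratch_dir / backup.name
    shutil.copy2(backup, restored)
    return sha256_of(restored) == sha256_of(source)

def demo() -> tuple:
    """Run one restore drill on a good backup and one on a corrupted backup."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "work.txt"
        source.write_text("months of work")
        backup = back_up(source, root / "backups")
        good = restore_matches(source, backup, root / "scratch")
        backup.write_text("silently corrupted")  # simulate a rotted backup
        bad = restore_matches(source, backup, root / "scratch")
        return good, bad
```

A real setup would keep the backup on separate hardware and run the drill on a schedule, but the principle is the same: a backup you have never restored is only a hope.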

Assume it will fail. Assume your magnetic drives will fail, because they do. Assume that your RAID controller will fry, your redundant power supplies will take each other out (this has happened to us more than a couple of times), your CPUs will commit hara-kiri, and so on. It all happens.

Which brings up the Netflix Chaos Monkey. Netflix has taken this to the next level: they not only prepare for failures, they actively cause failures to happen. It is a brilliant strategy for proving that you have a recovery/availability plan that meets your needs. Anything less is just living a delusion, one failure away from catastrophe.
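To illustrate the idea only, here is a toy simulation. It is not Netflix's tool and uses none of its code; the `Replica`, `Cluster`, and `chaos_monkey` names are invented for this sketch. It kills random replicas in a pretend cluster and then checks whether the data can still be read.

```python
import random

class Replica:
    """One pretend storage node holding a copy of the data."""
    def __init__(self, name: str):
        self.name = name
        self.alive = True
        self.data = {}

class Cluster:
    """Writes go to every live replica; reads come from any live replica."""
    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]

    def write(self, key, value):
        for replica in self.replicas:
            if replica.alive:
                replica.data[key] = value

    def read(self, key):
        for replica in self.replicas:
            if replica.alive and key in replica.data:
                return replica.data[key]
        raise RuntimeError(f"no live replica holds {key!r}")

def chaos_monkey(cluster: Cluster, rng: random.Random):
    """Kill one randomly chosen live replica; return its name, or None."""
    live = [r for r in cluster.replicas if r.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim.name

def survives_failures(cluster: Cluster, kills: int, rng: random.Random) -> bool:
    """Write a record, inject failures, then check the record is still readable."""
    cluster.write("order-1", "paid")
    for _ in range(kills):
        chaos_monkey(cluster, rng)
    try:
        return cluster.read("order-1") == "paid"
    except RuntimeError:
        return False
```

With three replicas, `survives_failures(Cluster(["a", "b", "c"]), 2, random.Random())` holds up, while three kills take the data down with them. The point of running this kind of drill against real systems is to find out which of those two situations you are actually in before a real failure tells you.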