Everything You Read About Databases Is Obsolete

Six and a half years ago I wrote (in a piece about NoSQL) –

Optimizing against slow seek times is an activity that is quickly going to be a negative return activity.

This time has long since passed, yet much of the dogma of the industry remains the same as it was back when our storage tier consisted of 100 IO/second magnetic drives. Many of our solutions still have query engines absolutely crippled by this assumption (including pgsql. mssql has improved, but for years using it on fast storage was an exercise in waste as it endlessly made the wrong assumptions about conserving IOs).

There are now TB+ consumer storage solutions with 300,000 IO/second (3000x the classic magnetic drive, while also offering sequential rates above 3.5GB/s…yes, big G) for under $600. There are enterprise solutions serving 10,000,000 IOPS.

That’s if your solution even needs to touch the storage tier. Memory is so inexpensive now, even on shared hosting like AWS, that all but the largest databases sit resident in memory much of the time. My smartphone has 3GB, and could competently host the hot area of 99.5%+ of operational databases in memory.
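If you want to check rather than guess, the database will tell you how often reads are actually served from memory. A minimal sketch, assuming PostgreSQL and the psycopg2 driver (the connection string below is a placeholder):

```python
# Rough check of how often PostgreSQL serves block reads from its buffer
# cache rather than going to storage. Assumes psycopg2 is installed and
# the DSN below is replaced with your own (placeholder values).
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
with conn.cursor() as cur:
    cur.execute("""
        SELECT datname,
               blks_hit,
               blks_read,
               round(blks_hit * 100.0 / nullif(blks_hit + blks_read, 0), 2) AS hit_pct
        FROM pg_stat_database
        WHERE datname = current_database();
    """)
    for datname, hit, read, pct in cur.fetchall():
        print(f"{datname}: {pct}% of block reads served from memory "
              f"({hit} cache hits vs {read} reads from storage)")
conn.close()
```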

For self-hosted hardware, TBs of memory are now economical for a small business, while tens of GBs are operationally inexpensive on shared hosting.

I totally made up that 99.5% stat, but it’s amazing how relatively tiny the overwhelming bulk of databases I encounter in my professional life are, yet how much people fret over them.

Obviously writes still have to write through, yet when you remove the self-defeating tactic of pre-optimizing to minimize read IO (denormalization, combined storage such as document-oriented designs, materializations and trigger-driven renderings/precomputation, excessive indexes, and so on), in most cases writes narrow to a tiny trickle.¹
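A toy illustration of the write amplification those pre-optimizations buy you, using nothing but the Python standard library’s sqlite3 (the schema and numbers are invented for the example):

```python
# Toy illustration of write amplification from denormalization, using an
# in-memory SQLite database. Schema and data are invented for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: the customer's address lives in exactly one row.
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customer(id),
                         total REAL);

    -- Denormalized: the address is copied onto every order "for read speed".
    CREATE TABLE orders_denorm (id INTEGER PRIMARY KEY, customer_name TEXT,
                                customer_address TEXT, total REAL);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme', '1 Old Street')")
for i in range(1, 1001):
    conn.execute("INSERT INTO orders VALUES (?, 1, ?)", (i, i * 1.5))
    conn.execute("INSERT INTO orders_denorm VALUES (?, 'Acme', '1 Old Street', ?)",
                 (i, i * 1.5))

# The customer moves. Normalized: one row is written.
cur = conn.execute("UPDATE customer SET address = '2 New Street' WHERE id = 1")
print("normalized rows written:", cur.rowcount)       # 1

# Denormalized: the same change rewrites every order row.
cur = conn.execute("UPDATE orders_denorm SET customer_address = '2 New Street' "
                   "WHERE customer_name = 'Acme'")
print("denormalized rows written:", cur.rowcount)     # 1000

# The read the denormalization was "saving" is an indexed seek plus a join,
# which is cheap on modern storage and usually served from memory anyway.
row = conn.execute("""
    SELECT o.id, c.address
    FROM orders o JOIN customer c ON c.id = o.customer_id
    WHERE o.id = 42
""").fetchone()
print(row)
```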

When writes shrink, not only does practical write performance increase (you’re writing much less per transaction, so the beneficial yield of each IO rises), but the effectiveness of memory and tiered caches improves as the hot area shrinks, and very high performance storage options become far more attainable (it’s a lot more economical to buy a 60GB high-performance, reliable, redundant storage tier than a 6TB system; as you scale up data volumes, performance is often sacrificed for raw capacity).

Backups shrink and systems become much more manageable. It’s easy to stream a replica across a low-grade connection when the churn of change is minimized. It’s easy to keep backups validated and up to date when they’re GBs instead of TBs.

Normalize. Don’t be afraid of seeks. Avoid the common pre-optimizations that are, on the whole, destructive to virtually every dimension of a solution on modern hardware (they hurt write performance, long-term read performance, economics, maintainability, and reliability). Validate assumptions.
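In that spirit, here’s a rough sketch of validating the load-bearing assumption yourself: measure what a random 4KB read actually costs on your storage instead of designing around 2005-era numbers. This assumes a Unix-like system (os.pread), the path and sizes are placeholders, and a serious measurement would use a dedicated tool like fio with O_DIRECT so you’re not just benchmarking the page cache:

```python
# Rough random-read micro-benchmark: how many 4KB reads per second can this
# storage actually serve? Path and parameters are placeholders; a serious
# measurement would use a dedicated tool (e.g. fio) with O_DIRECT to avoid
# measuring the OS page cache instead of the device. Unix-only (os.pread).
import os
import random
import time

PATH = "testfile.bin"        # placeholder: a file on the storage you care about
FILE_SIZE = 1 << 30          # 1 GiB
BLOCK = 4096                 # a typical database-page-sized read
READS = 20000

# Create the test file once, filled with real data (not a sparse file).
if not os.path.exists(PATH) or os.path.getsize(PATH) < FILE_SIZE:
    with open(PATH, "wb") as f:
        chunk = os.urandom(1 << 20)              # 1 MiB, written repeatedly
        for _ in range(FILE_SIZE // len(chunk)):
            f.write(chunk)

fd = os.open(PATH, os.O_RDONLY)
offsets = [random.randrange(0, FILE_SIZE - BLOCK) // BLOCK * BLOCK
           for _ in range(READS)]

start = time.perf_counter()
for off in offsets:
    os.pread(fd, BLOCK, off)                     # one random "seek" + read
elapsed = time.perf_counter() - start
os.close(fd)

print(f"{READS / elapsed:,.0f} random {BLOCK}-byte reads/second "
      f"(vs ~100/second for the classic magnetic drive)")
```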

Because almost everything written about databases, and consequently much of what you read, is perilously outdated. This post was inspired by seeing yet another database best-practices guideline make the rounds, most of its suggestions circling the very dated notion that every effort should be made to reduce IOs, the net result being an obsolete, overwrought solution out of the gate.

1 – One of the most trying aspects of writing technical blog entries is people who counter with edge cases to justify positions: Yes, Walmart has a lot of writes. So do Amazon, CERN, and Google. The NY Taxi Commission logs loads of data, serving a metro area of tens of millions.

There are many extremely large databases with very specialized needs. They don’t legitimize the choices you make, and they shouldn’t drive your technical decisions.