Responding to Joe Stump on the NoSQL Debate

Joe Stump – the former Digg lead architect with the coolest name in tech – posted a peripheral response to my recent entry about SSDs and NoSQL.

Rebuttal in tl;dr Form

The original post was motivated by claims found on Digg’s technology blog.

  • They say that the RDBMS “mindset” favours writes over reads: BLATANTLY WRONG CLAIM.
  • They show poor index and schema use: WRONG DATABASE USAGE.
  • They show that their database product can’t join: BAD DATABASE SERVER. RED FLAG.
  • They report very poor performance without adequate detail: MEANINGLESS PROPAGANDA.
  • They use this to show that the RDBMS can’t cope: SEE ABOVE.
  • They say that if you don’t use all of an RDBMS’ feature set, you’re essentially using NoSQL: ABSURD.
  • They describe scaling out issues with databases: TRUE FOR MYSQL.
  • They described their move to NoSQL: GREAT FOR THEM. THOUGH
    REALLY THEIR SOLUTION WAS EXTREME DENORMALIZATION.

And on Joe’s post.

  • You need an expensive DBA with the RDBMS, not with NoSQL: SPECIOUS, FLAWED REASONING.
  • Capital expenses suck. Services are better: BUSINESSES GENERALLY LEASE THESE DAYS.
  • $7,500 “just for disks”: FOR A SaaS BUSINESS THIS IS CHEAP.
  • 50 node cluster: 50 NODES IS A COMPENSATION FOR ABHORRENT I/O RATES.
  • SSD drives are expensive: NO THEY AREN’T. YOUR ARGUMENT IS OBSOLETE.
  • Commercial database products are pricey: VIGOROUS AGREEMENT.
  • NoSQL $/read and $/write win: MAYBE, MAYBE NOT. DIGG COULD
    LIKELY DO MORE WITH A COUPLE OF SSDs THAN THEY CAN WITH THEIR MASSIVE DENORMALIZATION.

The Non-ADD Version

Joe has been in the Web 2.0 trenches. He built a solution that powers one of the top sites on the net.

Remember when getting Slashdotted was a big deal? Getting on the front page of Digg makes a Slashdotting-at-its-peak look like a little traffic bump. There are probably a hundred PR reps busy trying to botnet their clients onto the front page of Digg for every one punished into spamming Slashdot these days.

Far more people know Joe’s all-out-of-bubblegum name than will ever know mine, and rightly so.

A Strawman Built on Cliches and Appeals To…

Joe comes out of the gate resorting to the venerable old-versus-new tactic: “It’s just those old-school DBAs upset that us kids are rewriting the rules,” he says in not so many words, while nailing himself and his peers onto a cross, seeking pity for the flames they doth receive for their unconventional, rebellious ways.

This is a bit strange, really. Barely a day goes by lately without Hacker News or Reddit’s /r/programming featuring another front-pager about how the Incredible NoSQL is rewriting the rules of, well, everything. The general demeanour is one that, I think, is far more sympathetic to completely unsupported and undemonstrated pro-NoSQL claims than it is to anything that questions the hype.

Countless NoSQL blogs have appeared, though if you browse them looking for actual content you’ll find that most feature few facts and lots of zealous punditry; advocacy seems to be the primary focus right now. Anyone involved with any sort of NoSQL initiative is spinning off their own start-up to capitalize on this sure-win formula, acting like it’s some sort of magic ingredient that will assure them of success.

It is very reminiscent of the XML heyday – I’m a very big fan of XML in its place, as an aside – when countless start-ups appeared with business models that could be boiled down to “something to do with XML”.

The big database vendors have remained quiet, largely because the minuscule-budget operations all clamouring for their piece of the NoSQL pie aren’t worth bothering with.

“But what about Google, Amazon, and Twitter!” you say. Joe resorted to that same appeal to authority by incanting the same magical trio (say it three times quickly and your TPS rate will quadruple!). Not really much to bother with there, beyond pointing out what a cargo cult is. Your bamboo headset won’t make you successful like Google. It really won’t.

Unless you are targeting the same problem space as those companies – say, providing very low performance but highly “scalable” database solutions for countless low-value start-ups – their solution choices are utterly irrelevant.

I’m not a DBA (though knowing how indexes work now strangely qualifies one for such a title). I’m just a technically curious solutions guy who has an innate need to keep asking questions and probing deeper until the Want-To-Believe fog that often hides hype dissipates.

On Rinky-Dink Operations

In Joe’s entry he focuses a lot of attention on the costs of RDBMS solutions.

One such argument is that it’s better to use computing hardware as a service than to buy, seemingly implying that while you can buy good hardware to run an RDBMS, it is better to rent less-good virtual hardware to run your NoSQL instances.

Yet leasing is what all the cool kids are doing these days, largely for the same financial reason. Writing it all off beats dealing with depreciation BS, and it makes financial planning a lot easier.

On the leasing front, $600 a month gets you an insanely powerful, makes-an-Extra-Memory-Quadruple-Extra-Large-EC2-Instance-Look-Like-A-Pile-Of-Puke server.

You’ll probably be paying 20x that for every developer you have working on your solutions. Is this really so astronomically high?

That less-than-the-cost-of-the-office-cleaners price tag gets you a server with a bank of striped SSDs that will almost certainly demolish your impressive-in-count-but-not-in-throughput big scale-out cluster, at least with a non-broken RDBMS.

No really, it will. Of course for any sort of reliable system you’d have to pay for some DB licenses (presuming you aren’t going with PostgreSQL), and then you’ll want to double everything up into mirrors or some other reliable setup, so triple the price.
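For what it's worth, that back-of-envelope arithmetic can be sketched in a few lines. The figures are the ones quoted in this post ($600 a month for the server, roughly 20x that per developer, and a tripling for licences and mirroring); the function and parameter names are made up for illustration, and none of this is an actual vendor quote.

```python
def monthly_costs(server_lease=600, developer_multiple=20, reliability_multiple=3):
    """Return rough monthly figures, in dollars, using the post's own numbers.

    All three defaults come from the text; they are rough estimates, not quotes.
    """
    # One developer is assumed to cost about 20x the server lease.
    developer = server_lease * developer_multiple
    # Licences plus mirrored hardware are assumed to triple the bare lease.
    reliable_setup = server_lease * reliability_multiple
    return {
        "single_server": server_lease,
        "reliable_setup": reliable_setup,
        "one_developer": developer,
        "setup_as_share_of_developer": reliable_setup / developer,
    }


if __name__ == "__main__":
    for name, value in monthly_costs().items():
        print(f"{name}: {value}")
```

On those assumptions the licensed, mirrored setup comes to $1,800 a month, or about 15% of what a single developer costs.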

And is the $7,500 spent by 37signals on a disk array really even worth mentioning? I suspect that sort of number ends up almost as a rounding error on their expense sheets, and given that it’s pivotal to their operation – it sits under the very foundations of their business – I doubt they spent many sleepless nights over it.

What sort of rinky-dink operations are we talking about here? Does Digg still qualify as a start-up? Don’t they have a payroll and all of that, yet they’re clamouring to wire up a collection of discount bin servers?

I posted the SSD entry because SSDs really do fascinate me, and I do think they change a lot of the rules of the game. It just happened to dovetail nicely with my investigation of the Digg scenario, where Digg solved their very real I/O issue by essentially pre-caching every possible query result for a targeted need.

Through extreme denormalization they traded storage to reduce I/O needs.
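To make that trade concrete, here is a minimal sketch of the general idea, not Digg's actual schema or code. It assumes a hypothetical "which of my friends dugg this item" query; the table names, function names, and the use of SQLite and a plain dict are all stand-ins chosen for illustration. The normalized version stores each fact once and joins at read time; the denormalized version fans every digg out to every follower at write time, so a read becomes a single key lookup.

```python
import sqlite3


def build_normalized(conn, friendships, diggs):
    """Store each fact once; reads pay for a join."""
    conn.execute("CREATE TABLE friends (user_id INTEGER, friend_id INTEGER)")
    conn.execute("CREATE TABLE diggs (user_id INTEGER, item_id INTEGER)")
    conn.execute("CREATE INDEX idx_friends_user ON friends (user_id)")
    conn.execute("CREATE INDEX idx_diggs_item_user ON diggs (item_id, user_id)")
    conn.executemany("INSERT INTO friends VALUES (?, ?)", friendships)
    conn.executemany("INSERT INTO diggs VALUES (?, ?)", diggs)


def friends_who_dugg_join(conn, user_id, item_id):
    """Answer the query at read time with an indexed join."""
    rows = conn.execute(
        "SELECT f.friend_id FROM friends f "
        "JOIN diggs d ON d.user_id = f.friend_id "
        "WHERE f.user_id = ? AND d.item_id = ? "
        "ORDER BY f.friend_id",
        (user_id, item_id),
    )
    return [row[0] for row in rows]


def build_denormalized(friendships, diggs):
    """Pre-compute every (user, item) answer; reads become one lookup.

    Each digg is copied once per follower of the digger, which is
    where the extra storage goes.
    """
    followers = {}
    for user_id, friend_id in friendships:
        followers.setdefault(friend_id, []).append(user_id)

    precomputed = {}
    for digger, item_id in diggs:
        for user_id in followers.get(digger, []):
            precomputed.setdefault((user_id, item_id), []).append(digger)

    for friend_list in precomputed.values():
        friend_list.sort()
    return precomputed
```

Both paths return the same answer for a given user and item. The difference is where the work lands: the join touches the indexes on every read, while the pre-computed map stores one copy of each digg per follower and has to be kept up to date on every write.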

This is a very important point, because it’s far more pivotal to Digg’s solution than the NoSQL versus RDBMS debate. Call up your old Digg coworkers, Joe, and have them set up a real database server with a couple of SSD drives and see how it compares with their impressive cluster. I’ll bet Dell would happily lend them a real server.

All of this is a bit humorous, really: the whole point of my original entry on this NoSQL topic was simply to say “what is good for Digg isn’t necessarily appropriate for all database needs”, so it’s a bit unfortunate that it has come to this, with Digg’s former architect justifying their decision when they were held up as a scenario where it is likely the perfect solution.

Then, after seeing the Digg case-study, I felt obligated to respond to their RDBMS claims because I saw them as flawed, indicative that the movement should really be called NoMySQL instead of NoSQL. It still doesn’t diminish the correctness of their choice.

But really, while I originally entered into this debate believing simply that NoSQL is being oversold (it is grossly inappropriate for the vast majority of non-Web 2.0 projects), the more I investigate the more I’m coming to think that it is a solution for the rapidly disappearing problem of pathetic I/O rates, at least assuming that you aren’t running on one of the cloud solutions where that is your only choice.

There are many other differences that come with NoSQL (many strongly questionable, like the oft lauded “no schema” claim for some of the solutions), but the I/O restriction is by far what sold it on the high end, and the high end is what convinced the little guy that it’s the way to go.

Oracle, DB2, SQL Server, Teradata, Vertica, Greenplum, Sybase and Friends All Cost Way Too Much

I very strongly agree with Joe about one thing: the licensing costs of the big RDBMS products are way too high.

They know that 2% of their potential customer base have giant budgets, and that they can squeeze more from that 2% than they could ever get from the other 98% who then get relegated to fighting over scraps like MySQL.

Not really sure how to solve that problem, but I concede that it is a non-trivial issue. PostgreSQL is probably the best low-to-no-cost database server, but even then quite a few performance features are missing (like real-time materialized views or SQL Server style clustered indexes).