Getting Real about NoSQL and the SQL-Isn’t-Scalable Lie

Getting Defensive

I work in the financial industry. RDBMS’ and the Structured Query Language (SQL) can be found at the nucleus of most of our solutions.

The same was true when I worked in the insurance, telecommunication, and power generation industries.

So it piqued my interest when a peer recently forwarded an article titled The
end of SQL and relational databases,
 adding the subject line “We’re living in the past”.

[Though as Michael Stonebraker points out, SQL the query language actually has remarkably little to actually to do with the debate. It would be more clearly called
NoACID]

That series focuses on NoSQL as the challenger to the throne.  It isn’t alone as the past year has yielded a bountiful crop of articles and blog entries declaring the imminent
death of the decrepit relational database at the hands of this new innovation.

Most get posted with incendiary, absolute statements against the RDBMS.

The ACIDy, Transactional, RDBMS doesn’t scale, and it needs to be relegated to the proper dustbin before it does any more
damage to engineers trying to write scalable software.

And they usually see later edits that blunt the original euphoria.

postnote: This isn’t about a complete death of the RDBMS. Just the death of the idea that it’s a tool meant for all your structured data storage needs.

Indeed.

Few hold the RDBMS as the only tool for all of your structured or unstructured data storage needs, though that strawman makes an appearance in many NoSQL advocacy pieces, adding some unintentional comedy (“irony”) given that the same entries usually call for the death of the RDBMS, with NoSQL declared the one true way to store and retrieve data.

Page 493 (as labelled by page) of the article  “The Paradoxical Success of Aspect-Oriented Programming” includes an appropriate quote and graphic from an IEEE editorial by James Bezdek in IEEE Transactions on Fuzzy Systems.

[I quote indirectly given that the original source isn’t publicly available]

Every new technology begins with naive euphoria — its inventor(s) are usually submersed in the ideas  themselves; it is their immediate colleagues that experience most of  the wild
enthusiasm. Most technologies are overpromised, more often
than not simply to generate funds to continue the work, for funding is an integral part of scientific development; without it, only the most  imaginative and revolutionary ideas make it beyond the embryonic stage. Hype is a natural handmaiden to overpromise, and most technologies build rapidly to a peak of hype. Following this, there is almost always an  overreaction to ideas that are not fully developed, and this inevitably leads to a crash of sorts, followed by a period of wallowing in the depths of cynicism. Many new technologies evolve to this point, and then fade away. The ones that survive do so because someone finds a good use (= true user benefit) for the basic ideas.

In the case of the NoSQL hype, it isn’t generally the inventors over-stating its relevance — most of them are quite brilliant, pragmatic devs — but instead it is loads and loads of
terrible-at-SQL developers who hope this movement invalidates their weakness.

Some sort of Fight Club ground zero wiping of the records, rewriting the rules of the game.

It doesn’t.

Nonetheless there is indisputably a lot of fantastic work happening among the NoSQL camp, with a very strong focus on scalability.

So what is scalability, anyways?

Scalability is a poorly-defined concept that, more often than not, is twisted to suit the speaker’s agenda. Scalability is often the excuse to engage in absurd hypotheticals to sell a particular blend of fanaticism.

Putting aside wordplay — or perhaps to engage in some of my own — scalability is pragmatically the measure of a solution’s ability to grow to the highest realistic level of usage in an achievable fashion, while maintaining acceptable service levels.

Imagine the scenario that you’ve built an internal help ticket tracking system for your branch office of Money Bags Corporation. If you had to describe the data needs in three points, they would be-

  • Data is highly interrelated (relational)
  • High-value users and transactions
  • Data consistency and reliability is a primary concern

You decide to go against the hype and build it on a classic RDBMS system.

Will it scale to the real-world requirements?

There are some real scalability concerns with old school relational database systems.
Adam Wiggins does a pretty good job of covering the techniques to scale a SQL database, though I strongly disagree with his end assertion.

You face those concerns on that glorious day the CEO calls to tell you that the board is super excited about your team’s help ticket system, built on SQL Server, and they want you to deploy it corporation wide. For data consistency purposes they want a single
instance, instead of alternative deployment scenarios like pushing out an instance (“shard”) for each division.

Can you make it work?

When Money Is No Object

Of course you can. Even on the maligned Windows platform.

From a vertical scaling perspective — it’s the easiest and often the most computationally effective way to scale (albeit being very inefficient from a cost perspective) — you have the capacity to deploy your solution on  unfathomably powerful systems with armies of powerful cores, hundreds of GBs of memory, operating against SAN arrays with ranks and ranks of SSDs.

The computational and I/O capacity possible on a single “machine” are positively enormous. The storage system, which is the biggest limiting factor on most database platforms, is ridiculously scalable, especially in the bold new world of SSDs (or flash cards like the FusionIO).

Such a platform can yield very satisfactory performance for tens or hundreds of thousands of active users in most usage and application scenarios (where generally clients talk to a farm of middleware servers).

Of course if you index poorly or create some horrendous joins you can screw it up, but with competency it will be good times for all. Even with billions upon billions of help tickets.

For the purposes of the application, the scalability requirement is completely satisfied — total scalability is achieved in the context of the application.

But it doesn’t end there.

From a horizontal scaling perspective you can partition the data across many machines, ideally configuring each machine in a failover cluster so you have complete redundancy and
availability. With Oracle RAC and Sybase ASE you can even add the classic clustering
approach.

Such a solution — even on a stodgy old RDBMS — is scalable far beyond any real world need because you’ve built a system for a large corporation, deployed in your own datacenter, with few constraints beyond the limits of technology and the platform.

Your solution will cost hundreds of thousands of dollars (if not millions) to deploy, but that isn’t a critical blocking point for most enterprises.

This sort of scaling that is at the heart of virtually every bank, trading system, energy platform, retailing system, and so on.

To claim that SQL systems don’t scale, in defiance of such obvious and overwhelming evidence, defies all reason.

And you don’t need to spend a million dollars. A mid-level Dell server can easily handle the vast majority of real-world database needs: No, your project likely isn’t going to have the needs of Twitter, Flickr, or Facebook. You can grab a four CPU Dell server hosting a total of 24 cores of latest-tech computing power, with 128GB of RAM, for around $15,000. That is beefier than the systems that ran many enterprises just a few short years ago.

Artificially Limited Scalability

Imagine that you’re a start-up building your big new Social Media site.

Obviously you don’t have your own datacenter, but instead you’re going with cloud servers to host your solution.

You don’t have the option (much less the finances) to buy and install a Unisys 7600R, or even a loaded Dell R905. You don’t have TBs of memory or massive I/O at your disposal.

Instead you have to go with the options available on a host like Amazon’s EC2, where the
most powerful choice available is the High-Memory Quadruple Extra Large (!) option at $2.40 / hour (at writing), or about $21,024 a year, which is a fairly reasonable rate given that an equivalent purchased server would run you about ten thousand dollars up
front.

This is very powerful compared to their historic maxed-out image — the puny large image that used to represent the top end — and is large compared to the max of many other cloud hosts, yet it is entry level in the RDBMS database world.

I/O on the EBS has been measured with a throughput in the 30MB/second range  with
about 72 IOPS per volume, which is one-half the speed that my Atom-based home NAS achieves. You can stripe multiple volumes into a software RAID array, but you quickly limit the I/O available to your instance.

For comparison we’re currently looking at an entry level $8K 36TB iSCSI device that would offer our database a dedicated 400MB/second throughput and about 1500 IOPS, and this is for a pretty humble low-criticality need with low-end magnetic drives.

As a speculative start-up you don’t want to commit $20K/year to have a single instance hanging around, especially given that your traffic is extremely variable and most of the time it will sit idle. You want to run the smallest database layer possible, ramping
up if the need (fingers crossed) arises.

In an ideal world you could float along on a small instance economically until that big day when you get mentioned on Digg, at which point you spool up ten extra large instances, turning them off when the need passes.

These financial and artificial limits explain the strong interest in technologies that allows you to spin up and cycle down as needed. It’s why the old guard has largely remained quiet
(because it solves a problem that they don’t have, notwithstanding any manufactured “my friend has a super-duper 512CPU Sun box and it is always overloaded!” scenarios), while a million hopeful start-ups with their small EC2 instances are loudly bleating about
the limits of scalability with SQL systems.

The Needs of a Bank Aren’t Universal

The world of financial firms and retailers and other RDBMS users is very different than the popular social media scenario usually played out.

If you had to describe your social media data needs in three points, they would be-

  • Largely unrelated islands of data
  • Very low value user/transaction value
  • Data integrity is not critical. If you lose a Status Update, or several thousand of them, it will likely go unnoticed, or at least won’t cause a major situation.

MySQL originally lacked many traditionally mandatory RDBMS elements, such as transactions, without which it is extremely difficult to maintain a high level of data integrity. That didn’t dissuade many of its boosters who declared that it was an
unnecessary cost for the purposes that they used it. They were right.  As MySQL has moved towards the values of traditional databases, it has moved away from its original bag-of-data values.

The truth is that you don’t need ACID for Facebook status updates or tweets or Slashdots comments. So long as your business and presentation layers can robustly deal with inconsistent data, it doesn’t really matter. It isn’t ideal, obviously, and preferrably you see zero data loss, inconsistency, or service interruption, however accepting data loss or inconsistency (even just temporary) as a possibility, breaking free of by far the biggest scaling “hindrance” of the RDBMS world, can yield dramatic flexibility.

This is the case for many social media sites: data integrity is largely optional, and the expense to guarantee it is an unnecessary expenditure. When you yield pennies for ad clicks after thousands of users and hundreds of thousands of transactions, you start to
look to optimize.

The same efficiency applies to highly relational schemas — if you can just serialize object graphs and that’s all you need, why bother normalizing? Many would argue that it’s a premature optimization, but if it’s all you need it might be the best choice.

Both of those decisions would be outrageously negligent in many other industries, but the rules that apply for a banking system have woefully little applicability to a social media site.

SQL is Scalable and NoSQL Isn’t For Everyone

The point is one that I think all rational people already realize: The ACID RDBMS isn’t appropriate for every need, nor is the NoSQL solution.

A social media site is not an inventory system. A banking account management system is not a social news aggregator.

Picking and choosing database terminology from the Wikipedia entry on RDBMS’ doesn’t equip the speaker with an expert level of knowledge to declare the truth about the database industry.

Scalability noise based upon the limitations of a cloud vendor’s offerings needs to be put into context: They don’t apply to most of the users of relational databases.

MySQL isn’t the vanguard of the RDBMS world. Issues and concerns with it on high load sites have remarkably little relevance to other database systems.

And of course the SQL/RDBMS world is changing (sidenote: Few love SQL, but I’ve yet to see a viable replacement). Wouldn’t it be a grand world where every desktop (platforms that spend about 99% of their time completely idle) in a corporation was a part of the
corporate cloud, all seamlessly acting as a part of the corporate information system in a reliable, redundant way? A simple SQL statement silently and transparently fulfilled by hundreds of distributed systems?

We’ll get there.

Aside: I’m currently building a solution (to fill this space) that significantly leans on Project Voldemort. I have somehow managed to remain rational.

Postnote

This is one of those rants that strangely gets attention, with several taking it as anti-NoSQL, or even pro-RDBMS, I assume because positions so often seem to be polarized. It is neither, which is quite evident if read with an unbiased mind: Defending the real world practical scalability of the maligned RDBMS merely brings accuracy to the debate. Several have asked if I’m merely attacking a strawman: Aside from several specific links that I gave above (I am remiss to add more as I’ve engaged in the blog-to-blog arguments too many times before), I find it hard to believe that these people take part in any technology discussion forum or group, where NoSQL is being quite widely, and often without question, held as successor to the RDBMS…the new evolution of database systems.

The motivation of the post is that the discussion is, by nature of the venue, hijacked by people building or hoping to build very large scale web properties (all hoping to be the next Facebook), and the values and judgments of that arena are then cast across the entire database industry — which comprises a set of solutions that absolutely dwarf the edge cases of social media — which is really…extraordinary. It’s a bit like moving to the bottom of the ocean and declaring that everyone should start using submarines to commute.

There have been edge conditions in the database world for as long as there has been an industry. High performance logging/data acquisition (often distributed), for instance, has always been a case where traditional RDBMS systems aren’t suited, and thus should
be jettisoned. The industry didn’t rewrite the rules because of those fringe cases, however, for good reason.