The Impact of SSDs on Database Performance and the Performance Paradox of Data Explodification

Fighting the NoSQL Mindset, Though This Isn’t an Anti-NoSQL Piece

Shocked by the incredibly poor database performance described on
the Digg technology blog, and baffled that they cast it as demonstrative of performance issues with RDBMSs in general, I was motivated to create a facsimile of their database problem.

While they posted that entry six months ago, they recently followed up with more statements on the NoSQL / RDBMS divide, and their experience is now heavily used as a citation of sorts.

For instance, Dare Obasanjo held Digg’s moves up as a rebuttal of my prior entry on SQL scaling (my entry actually explicitly excluded incredibly rare edge cases like Digg’s, and my core point was that the majority of database uses don’t have the needs of a site like Digg, but I’m always one to take on a challenge), which then got picked up in other blogs.

Digg’s case is an example of an entry-level RDBMS product used arguably suboptimally on under-powered hardware, and it seems questionable whether it proves anything of substance about either database technology. Yet it’s held as demonstrative of something — in particular the failing of the RDBMS — which is why I focus on it. They are different tools in the toolbox, arguably for different purposes, and that isn’t the focus of this entry.

So let’s take a look at Digg’s scenario.

I do this to evaluate their performance claims, to confirm my previous statements about indexing improvements, and to determine the impact that SSDs have on the problem space, because I strongly believe that SSDs (and cheap memory) completely change the equation.

The focus of this entry is not to question or answer whether NoSQL is the right choice for Digg — though there are some ramifications as SSDs take over, which is, I think, an interesting aside — or whether Google or Amazon or anyone else should use it, etc.

SQL Server 2008 Developer Edition on Windows 7, itself viewed by many as almost a training-wheels RDBMS, was the most convenient platform for me when I ran this test, so I created a quick script to build what I think is a pretty accurate reproduction of the database described in their blog entry—

  • 500,000 users, having…
  • 10,000,000 friend relationships (using a power law distribution)
  • …and 500,000,000 “Diggs”, randomly distributed among 500,000 virtual “items” (which might be comments, submissions, etc.) with a date range covering four years.
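To make the shape of the test data concrete, here is a heavily scaled-down sketch of the setup script, using SQLite in place of SQL Server. The table and column names are my own invention for illustration, not Digg’s actual schema, and the counts are a tiny fraction of the real run:

```python
import random
import sqlite3

# Scaled-down sketch of the test schema; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users   (user_id INTEGER PRIMARY KEY);
    CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
    CREATE TABLE diggs   (user_id INTEGER, item_id INTEGER, dug_at TEXT);
""")

random.seed(1)
NUM_USERS, NUM_ITEMS, NUM_DIGGS = 500, 500, 5000  # tiny scale for illustration
conn.executemany("INSERT INTO users VALUES (?)",
                 [(u,) for u in range(NUM_USERS)])

# Power-law-ish friend distribution: most users have few friends,
# a handful have very many.
friend_rows = []
for u in range(NUM_USERS):
    friend_count = min(NUM_USERS - 1, int(random.paretovariate(1.2)))
    for f in random.sample(range(NUM_USERS), friend_count):
        if f != u:
            friend_rows.append((u, f))
conn.executemany("INSERT INTO friends VALUES (?, ?)", friend_rows)

# Diggs randomly distributed across users, items, and a four-year date range.
conn.executemany(
    "INSERT INTO diggs VALUES (?, ?, date('2006-01-01', '+' || ? || ' days'))",
    [(random.randrange(NUM_USERS), random.randrange(NUM_ITEMS),
      random.randrange(4 * 365)) for _ in range(NUM_DIGGS)])
conn.commit()
```

The real script followed the same structure, just with the full row counts.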

The database weighed in at a svelte 30GB.

I ran this on a two-year-old desktop machine with a Q6600 processor and 6GB of RAM, on a standard 7200 RPM consumer drive. You can easily find laptops with more processing and I/O power.

I opted against running it on a real server (you know, like a 24-core, 128GB, banks-of-SSDs monster that many real databases run on) simply because I knew it wasn’t necessary, and it would have run contrary to the demonstration that even a mediocre machine can beat their results.

DISCLAIMER: This is not a high-fidelity reproduction of Digg’s situation, as is pointed out many times in many ways in this post. However Digg took the time to post metrics to support their claims that they are some sort of extreme case, at the edge of database limits, and I simply don’t believe that is true. Digg’s data quantities are relatively small and lend themselves to sharding. The second point, which again is hammered home many times, is that SSDs present a solution that changes the equation, and, I think, provides some interesting inputs to the situation.

The First Clue That Something Isn’t Right: You Can’t Do a Simple Join

The Digg blog entry detailed how they had to manually build an IN clause given their selected database product’s inability to adequately run a trivial join, with the resulting query taking 14 seconds to find the Diggs for a given user’s friends against a single selected item.

This yielded a results return rate of 0.07 per second.

If manually building an IN clause solves a problem that a functionally equivalent join cannot, there is a much larger underlying issue that needs to be dealt with. I’m not a MySQL user, but apparently it offers minimal plan-investigation tools, so there aren’t the options to fully flesh out what the query engine is doing. Nonetheless, it is a warning sign of a foundational product issue.

I ran a similar query in SQL Server, albeit without the hand-coded SQL builder, looking up friend Diggs for randomly selected combinations of users and items. It returned so quickly relative to Digg’s experience, even from a cold cache, that I had to up the iteration count to 1000 to get good test durations.

SQL Server was returning a fairly constant 36 result sets per second, probing the friend table’s ten million relationships to find the selected user’s friends, and then probing the five hundred million Diggs for the pertinent records, sorting them in the manner that Digg sorted their results. The query needed to draw data from all over each of the respective table populations, ensuring that it wouldn’t benefit from localized hot-spot caching. To prove this, I limited SQL Server to only 1GB of memory, which had a negligible impact on the performance. CPU usage was marginal, with the limiting factor being the slovenly I/O of the lowly magnetic disk. The iterations were run sequentially, as parallel runs yielded no net benefit, the magnetic disk moving as quickly as it possibly could already.
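The query in question is a trivial two-table join. A minimal, self-contained sketch of its shape, again using SQLite and my own hypothetical schema rather than Digg’s actual tables:

```python
import sqlite3

# Minimal sketch of the "Diggs by my friends on this item" query.
# Schema and data are illustrative, not Digg's actual tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
    CREATE TABLE diggs   (user_id INTEGER, item_id INTEGER, dug_at TEXT);
    INSERT INTO friends VALUES (1, 2), (1, 3), (1, 4);
    INSERT INTO diggs   VALUES (2, 99, '2010-03-01'),
                               (3, 99, '2010-03-02'),
                               (5, 99, '2010-03-03');
""")

# The join that reportedly had to be hand-expanded into an IN clause:
# find Diggs on a given item made by a given user's friends.
rows = conn.execute("""
    SELECT d.user_id, d.dug_at
    FROM friends f
    JOIN diggs d ON d.user_id = f.friend_id
    WHERE f.user_id = ? AND d.item_id = ?
    ORDER BY d.dug_at DESC
""", (1, 99)).fetchall()
print(rows)  # users 2 and 3 are friends who dug item 99; user 5 is not a friend
```

Any competent query engine turns this into two index probes; there is nothing exotic about it.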

Already we’re running at close to 500x the rate reported by Digg, without doing anything beyond using an arguably better database product, at least in that it can join properly. MySQL’s many weaknesses are well, well known, so the core point is not to question Digg (though their indexes were suboptimal), but to put their database product under a cloud, as they themselves often do when posting about their move (usually openly declaring their restricted option set, given that they limited themselves to open-source products, obviously eliminating from consideration many of the clusterable, very high performance RDBMS options, even if that were a better choice, which is completely uncertain).

Their dataset distribution may be entirely different; however, even if I doubled or quadrupled or octupled the count in every table it would only marginally impact performance.

At this point I implemented the indexing changes I described in my prior entry – removing the surrogate keys and cluster-indexing on the unique columns – and the lookup rate jumped to 71 result sets per second, or around 1000x the speed reported by Digg. If I massively increased the data quantity and return counts, the difference between their poor indexing and proper indexing would dramatically widen, with the properly indexed solution showing little difference with significantly increased data counts.
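A sketch of the change, for the curious. SQLite’s WITHOUT ROWID tables store rows in primary-key order, which roughly approximates SQL Server’s clustered index; the table names here are my own:

```python
import sqlite3

# Sketch of the indexing change: drop the surrogate key and cluster on the
# natural composite key, so a lookup lands on physically adjacent rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Before: a surrogate key determines storage order, so rows for one
    -- (user, item) pair are scattered and need a secondary index to find.
    CREATE TABLE diggs_surrogate (
        digg_id INTEGER PRIMARY KEY,
        user_id INTEGER, item_id INTEGER, dug_at TEXT
    );

    -- After: the natural key IS the storage order (a clustered-index
    -- approximation via WITHOUT ROWID), so a (user_id, item_id) lookup
    -- reads one contiguous range.
    CREATE TABLE diggs_clustered (
        user_id INTEGER, item_id INTEGER, dug_at TEXT,
        PRIMARY KEY (user_id, item_id, dug_at)
    ) WITHOUT ROWID;
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT dug_at FROM diggs_clustered WHERE user_id = 1 AND item_id = 2
""").fetchall()
print(plan)  # the plan searches directly on the primary key, no extra index
```

Fewer structures to traverse, and the target rows sit next to each other on disk: that is the entire trick.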

If the database was cached in memory those index changes would have had a much more profound impact.

What If Localized Data Isn’t Your Primary Optimization Strategy?

I had been meaning to get an SSD for Eclipse Android development, so when my new 100GB SSD arrived (it’s an MLC unit that did well on an Anandtech review, though I won’t mention specifics as it isn’t pertinent – any decent SSD will perform at a similar level. Of course for real-world production use you would want an SLC drive) I detached and moved the database files over.

A quick reattach later and the 30GB was very amply hosted in the 100GB MLC SSD.

I fired up the benchmark and was pleasantly surprised to find it returning results at a rate of 4100 result sets per second. The write performance, while not a focus of this test, also hit extraordinarily high levels (which would conveniently lubricate the use of copious indexes).

Correcting the indexes and moving the database to a single inexpensive consumer-grade SSD, running on a dated desktop, had results coming back at a rate 60,000x what Digg reported.

None of this is intended to be a serious benchmark of SQL Server (I don’t wish to fall on the wrong side of a DeWitt clause), or even Digg’s use of MySQL: This is not a disciplined benchmark, and during parts of it I hopped into some windowed online matches of Battlefield: Bad Company 2 while tests ran, after seeing that it had a limited impact on the results. I knew that the primary weakness was simply the movement of the hard drive head, and different technology choices (NoSQL versus RDBMS, normalized versus denormalized, clustered versus heap, etc) primarily impact how often and how far that head has to move.

And of course I don’t have Digg’s data, so it is completely speculative on my part based upon some rough descriptions given in the Digg posting. Maybe he hugely underestimated their data counts, or their data entropy is vastly different.

This is a macro-benchmark: Digg’s claimed results were so poor that I went in knowing that the difference would be very large. Their described data quantities are small in the world of large databases. Most decent relational database products don’t even start to sweat with tens or hundreds of millions, or billions, of rows.

The key, of course, is proper indexing, trading write performance for read performance targeting your specific needs. Indexes could be viewed as ways of creating “virtual tables” that are maintained in lock-step with your base table. Decent database products like SQL Server even allow you to include unordered but included columns in your index to ensure that you have a covering index (the best kind) for all scenarios. And that’s before you even get to the magical world of materialized views.
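A covering index is one that holds every column a query touches, so the base table is never visited at all. SQL Server spells this with an INCLUDE clause; in the SQLite sketch below the same effect comes from simply appending the extra column to the index key (names are again my own):

```python
import sqlite3

# Sketch of a covering index: the index alone answers the query,
# so no base-table lookup is ever needed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE diggs (user_id INTEGER, item_id INTEGER, dug_at TEXT);

    -- The SQL Server flavour would be roughly:
    --   CREATE INDEX ix ON diggs (user_id, item_id) INCLUDE (dug_at);
    -- SQLite has no INCLUDE, so the column joins the key instead.
    CREATE INDEX ix_covering ON diggs (user_id, item_id, dug_at);
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT dug_at FROM diggs WHERE user_id = 1 AND item_id = 2
""").fetchall()
print(plan)  # the plan uses a COVERING INDEX: every needed column is in ix_covering
```

That "virtual table maintained in lock-step" is exactly what such an index is: a second, pre-sorted copy of the columns you care about.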

So either MySQL is an atrociously bad product at the larger limits, which ample evidence seems to indicate, or the Digg staffers simply weren’t getting the most out of their systems. Either way, it’s hard to take their statements about the RDBMS field seriously, and their argument more correctly invalidates MySQL than it invalidates the RDBMS.

The fact that Digg is a large site says nothing about their technical leadership or mastery. Their site has sped up by leaps and bounds over the past year, so I suspect that they know what they are doing, but I’m wary of any cargo-cult-like “they did {x} and they’re a big site, therefore it must be a good option” appeals to example.

On The Role of the DBA

What is most disturbing about this whole database debate is the number of commentators who excuse horrific database usage (not in relation to Digg’s issue, but as a general conversation point whenever people make comments about proper database use in virtually any online discussion), ascribing rudimentary database performance design and knowledge as something that is limited to the elusive “DBA”.

This is ignorant and frightful.

You don’t know what a b-tree is? Don’t know how indexes work? Don’t know what a red-black tree is? Please get away from the compiler and save the world from your monstrosities until you have some knowledge of these basic concepts.

This is not esoteric knowledge, and instead is rudimentary comp. sci. knowledge.

DBAs are the guys who set up user accounts and monitor security, schedule backups, and determine macro-optimizations like how to allocate file groups on the SAN arrays. They might probe lowest-hanging-fruit performance issues and flag offenders or offer up suggestions.

Rudimentary database design and proper usage is the basic responsibility of developers, and if you don’t know it then it is your responsibility to learn it. Alternately you can just clutch onto NoSQL and bleat about how it changes all of the rules anyways, which is the route quite a few have decided to pursue (I fully expect to get the standard angry responses from those who take this like a religion).

Is NoSQL a Solution for Yesterday’s Problems?

Database servers really like having a lot of RAM. Ideally you should have more RAM than you have data, allowing it to cache the entirety of your DB (or at least the working-set quantity of DB on that partition) making incredible read performance achievable.

Joining rows is not a hard activity for database servers. It can do it at unfathomable rates if the data can be fed to it at the appropriate pace and in the right form. Even heavily normalized databases can be high performance.

What normally makes joins a performance issue is data locality: if you have to load two rows from different places on the disk, that’s two seeks instead of just one (or three, four, five or more instead of one). When seeks are as costly as they are on a magnetic disk, you avoid them (either by striving for a database that fits in memory, which paradoxically often calls for heavy normalization, or by de-normalizing).
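The back-of-envelope arithmetic makes the point. The seek times below are my own illustrative assumptions (roughly typical of a 7200 RPM consumer drive and a decent consumer SSD), not measurements from this test:

```python
# Why locality dominates join cost on magnetic disks but barely matters on
# SSDs. Seek times are illustrative assumptions, not measured values.
MAGNETIC_SEEK_S = 0.008   # ~8 ms average seek on a 7200 RPM consumer drive
SSD_SEEK_S      = 0.0001  # ~0.1 ms random read on a consumer SSD

def lookups_per_second(seeks_per_lookup, seek_s):
    """Result sets per second when each lookup costs N random seeks."""
    return 1.0 / (seeks_per_lookup * seek_s)

# One perfectly localized read versus a join touching five scattered rows:
print(lookups_per_second(1, MAGNETIC_SEEK_S))  # ~125/s on magnetic disk
print(lookups_per_second(5, MAGNETIC_SEEK_S))  # ~25/s: a brutal 5x penalty
print(lookups_per_second(5, SSD_SEEK_S))       # ~2000/s: the penalty shrugged off
```

On magnetic storage, every extra seek per lookup is a fivefold or tenfold tragedy; on an SSD it is a rounding error.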

Writes are obviously important too, yet on a site like Digg I would guess that reads outweigh writes — from a user interaction perspective — by a factor of 10000:1 or more outside of logging (which usually goes to a log-specific technology anyways).

In contrast to all of the “everyone is a publisher and the internet changes everything” bluster that is used to herald the wave of change that NoSQL brings, the reality is that it’s a very small percentage of users that post submissions and add comments, or even that do the simplest possible action of clicking an arrow.

Users overwhelmingly simply consume data, whether it’s the latest tech news, Ashton Kutcher’s tweets, or just browsing through the comments on a Slashdot article to see if they add any additional insight.

Despite Digg’s recent claim that they are “write intensive” (maybe because they’ve decided to dramatically explode the number of writes a simple action causes?), at its root their platform is primarily read focused, which is why they pursued Cassandra in the first place. Take note that their NoSQL solution for friend Digg lookups is to take every Digg and massively explode the number of writes it causes to happen (in the case of a Digg by Kevin Rose, a single write becomes 40,000+ large writes).
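The trade is easy to state in code. Under the fan-out model described above, one Digg becomes one precomputed row per follower; the follower counts here are the figures from the text, and the model itself is my simplification:

```python
# Write amplification under the precomputed fan-out model: a simplified
# sketch, using the follower counts cited in the text.
def writes_per_digg(follower_count, fanout=True):
    """Rows written when one user Diggs one item.

    fanout=True  -> denormalized model: one precomputed row per follower.
    fanout=False -> normalized model: one row; readers' views are resolved
                    at read time via a join.
    """
    return follower_count if fanout else 1

print(writes_per_digg(40_000))         # a Kevin Rose-scale Digg: 40,000+ writes
print(writes_per_digg(40_000, False))  # the normalized alternative: 1 write
```

A system built this way is "write intensive" only because it chose to be.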

Hardware Is Cheap. Manpower Is Expensive

If I had 48GB of RAM in the test machine (which is fairly pedestrian outside of gerbil-sized cloud instances. Note that you can now add 128GB of RAM to servers for around $4000 in some cases), outside of the initial caching period the select rates would be stratospheric regardless of storage medium, though SSDs would still come in a very, very strong lead when it came to write performance.

For the same $4000 you could chain five Intel X25-E drives for 320GB of intensely high performance – and persistent – storage. Just keep going up until you have more throughput, I/O and storage than you could dream of.

Some high-end enterprise solutions now tier storage and automatically place data as appropriate, choosing between magnetic, SSD, and memory caching systems. The pages of the table that are never touched end up on the magnetic storage while the hot area – say Diggs within the last 6 months – are moved to SSDs or to huge banks of memory caching.

There are bountiful options to achieve incredible performance, even on a budget, balancing memory and high performance storage systems.

Throwing Storage at the Problem

I didn’t waste the disk space, but as mentioned before I could do a simple join between the tables, materialize the view, and the performance would be very high even on magnetic disk, although it would add a serious cost to writes: When a user with a large number of people who befriended them dug something, their record creation would branch out into the write of potentially thousands of records.

That was the route that Digg took: They are pre-computing the sets of data that a user might possibly want, even apparently for reams and reams of long abandoned accounts.

They do this because looking up data that can’t be cached in memory is an expensive operation. Yet as has been shown, SSDs, which are getting faster and cheaper regularly, completely flip the I/O equation.

SSDs change everything.

Turning a small amount of data into a massive amount of data to improve performance paradoxically makes SSDs much less attainable (because the cost per GB is so much higher), and humorously may thwart the end goal. It also reduces the ability to memory-cache the relevant data.

By pursuing this solution, Digg has limited their ability to choose other solutions that are clearly hitting the mainstream.

Coming Next – PostgreSQL versus Cassandra

There is a complete absence of objective measures of the performance of Cassandra. In place of real performance comparisons and load metrics are a lot of hand-waving and comparisons against completely broken database products (never, ever hold MySQL as the vanguard of the RDBMS world. It is comical to do that) running horrendously malignant queries.

Not anymore. I’m on the case.

My goal is not to belittle the product (which I think is elegant, beautiful and concise, and serves a very important role), but simply to bring some rationality to the argument, as it is currently missing.

[EDIT: The following statement has been proven to be a wrong interpretation, but I leave it here out of humbled shame] Digg claims that Cassandra brings them “linear scalability”, yet every one of their Cassandra nodes is a 100% replica of the others, meaning that a write (or 40,000 writes) on one is communicated and then replicated on every single other instance.

Response to Criticisms – 2010-03-25

This entry got picked up on a couple of excellent tech-oriented sites: Hacker News and Reddit r/programming. Included in the comments of a lot of very smart people are a couple of common criticisms that I thought worthy of specific response.

“Your benchmark stinks. How about you…”

My benchmark, if you can even call it that, was focused on O(n) complexity and the difficulty of joins among very large tables with a half-decent database product, with the core take-away being “it’s a solved problem. With proper indexing and a decent database system most datasets are ‘small’.”

On the topic of concurrency, I mentioned that in the entry, noting that executing many parallel runs of the test yielded the same net output on the magnetic disk, while it actually significantly improved performance on the SSD and then leveled off. Database servers are fairly smart about concurrency and task queueing.

The top result of 4100 resultsets per second, which was achieved using many simultaneous runs, still wasn’t fully exploiting the I/O capabilities of the SSD, owing to the tuned-for-magnetic-disks nature of the database server that I didn’t bother resolving.

However the Digg case study lacked significant details beyond a couple of sparse size figures there to indicate, I believe, that “we think our data is large and the RDBMS can’t service our needs”. What I based my run on was quantity of data (which is not large in the land of databases) and key phrases like “from a cold-cache” (which can be reasonably interpreted as “on a test instance”). There is a lack of detail in the Digg benchmark, given that I don’t think they intended it to be an industry-standard metric, so it isn’t reasonable to expect so much more regimented discipline from mine. However let me say that I did take the meager stats that were given and, where possible, erred to the high end — where “hundreds of millions” appeared, I went with 500 million (if I went with a billion it would have barely impacted the results, but I was impatient and didn’t want to wait for the data setup script to run that long), and where they said “millions” I went with 10 million. Some of the responses are demonstrative of how fact-free the debate has become, so it’s not particularly surprising that NoSQL blogs group-hug around it.

This is not a replication of Digg’s runtime environment, and any illusion that it pretends to be is an intentional misinterpretation. If it were a serious apples-to-apples comparison I would have run it on a serious server with serious load simulations, and the results would still have been orders of magnitude beyond what I achieved on a dated desktop.

In fact, 36 or even 71 result sets per second is still far too slow for Digg’s use (especially given that they are stuck with a web technology that forces synchronous database calls), and I’m not purporting that to be a viable option for them as they add a lot of data-intensive personalization options. It’s simply to contrast against their abhorrent performance numbers, which I think are grossly misleading.

“But Google and Facebook and…”

Sure. That has nothing to do with this.

“So you’re an RDBMS guy who hopes SSD prevents change…”

I’m not an “RDBMS guy”. In a prior outing I was declared a DBA because I didn’t just roll over for the NoSQL propaganda, and now I’m cast as a guy who holds himself as a database expert. Actually neither are my primary competency, and I think that’s the point: I don’t purport to be Joe Celko, or even a remote approximation, yet even I can see some massive issues in Digg’s case study.

I’m just a solutions guy that looks at technologies and tries to digg (har har) through the cruft and get to the truth, which can be tough amidst tech religions: Warp back to 2001 and try to have a rational discussion about XML. In the case of Cassandra (and many NoSQL solutions) there is stunning ease with which many make absolute statements about RDBMS, such as the many “relational databases can’t handle large amounts of data, just look at Digg” claims that litter the web, while cheering on vagueness and unsubstantiated hype about NoSQL solutions.

Show me real performance numbers for NoSQL solutions: They are disturbingly rare. Instead the argument is dominated by noise comments and hand-waving about how grand NoSQL is because it just simply solves everything and makes everything great.

Digg’s NoSQL performance advantage is achieved by localizing all of the data necessary for a given request — in this case “tell me all of my friends who Dugg this item/parent item” which they had precomputed and cast out — ignoring the problem of MySQL not competently doing joins (apparently it has troubles sorting as well).

That is overwhelmingly a storage seek issue, and Digg’s solution was to turn many seek actions into one or two by massively exploding their core dataset so the data for every need is repeated and persisted for every possible use. I can say right now that there is no question that if I performed the same benchmark on Cassandra, drawing randomly distributed user-item buckets from the same magnetic disk, my performance would max out at the number of seeks per second of the disk, which in the case of a normal desktop drive is somewhere in the range of 100-200 seeks per second.

Of course NoSQL yields the same massive seek gain from SSDs, but that’s where you encounter the competing optimizations: by massively exploding data to optimize seek patterns, SSD solutions become that much more expensive. Digg mentioned that with the denormalizing they did, they turned their friend data, which I would estimate to be about 30GB (a single 64GB X25-E with room to spare per “shard”), into 1.5TB, which in the same scheme blows up to 24 X25-Es per shard.
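Running the numbers from that paragraph (the 30GB figure being my estimate, the rest from Digg’s own statements):

```python
import math

# Storage-explosion arithmetic: the estimates from the text.
NORMALIZED_GB   = 30    # my estimate of the normalized friend data
DENORMALIZED_GB = 1500  # Digg's stated size after denormalizing
DRIVE_GB        = 64    # capacity of one Intel X25-E

def drives_needed(gb):
    """Whole drives required to hold the dataset on one shard."""
    return math.ceil(gb / DRIVE_GB)

print(drives_needed(NORMALIZED_GB))     # 1 drive per shard, with room to spare
print(drives_needed(DENORMALIZED_GB))   # 24 drives per shard
print(DENORMALIZED_GB / NORMALIZED_GB)  # 50x the storage for the same information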

This is interesting, is it not? Maybe it rains a little on the NoSQL parade, but to me it’s a pretty fascinating development.

“Why do you have to insult the Digg crew?”

I don’t intend to insult them, but at the same time I don’t fall in the camp that gives them credibility simply because they’re behind a large site. Many of the largest sites on the internet made technology mistake after mistake, yet succeeded regardless because they have a good product: These are some serious examples where ideas beat out execution. PHP somehow formed the base of a good number of the internet’s largest sites, yet are there many that will seriously argue for its technical superiority?

“These are two different tools. I’m sick of this argument! Let’s get back to the NoSQL Propaganda Parade.”

In many cases they are used to solve the same problem. In the specific entry I refer to for this post the whole point was “we were using an RDBMS, and now we’re using a NoSQL and it’s so much better”, so is it really rational to claim that they’re two completely different worlds?

“NoSQL solves different problems like scaling out, data centers, etc.”

Orthogonal. Cassandra solves problems that you can’t solve as easily with MySQL. MySQL != the RDBMS industry.