Digg And Cassandra, sitting in a B-Tree
Digg recently started transitioning parts of their platform to the Cassandra open-source, Facebook-originated NoSQL solution.
They’re the perfect customer for NoSQL: The value per user and transaction is very low, demanding solutions that allow them to scale at minimal cost; some data loss or inconsistency can be accepted; and a lot of the data can be effectively siloed into islands.
Nonetheless, the article they posted about the move is filled with the sort of thinking that has littered the web with misinformation about the relational database.
"The fundamental problem is endemic to the relational database mindset, which places the burden of computation on reads rather than writes."
The relational database “mindset” imposes no such burden.
It’s All About Finding Balance
Indexes, for instance, are a rudimentary tool of every competent database user. Each additional index adds an expense to every write to the table, forcing row changes to update every index in addition to the base table, in return easing certain read scenarios.
You apply them as appropriate, striving for the right balance between read and write performance. I posted parts I and II of a very simple introduction-to-databases article back in 2005 (never getting around to finishing part III), and I strongly recommend it for anyone who doesn't understand how indexes work, or how important concepts like covering indexes are (which I'll touch upon later in regard to the Digg scenario).
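To make the trade concrete, here is a minimal sketch (SQLite for portability, with invented table and column names rather than Digg's actual schema) showing the planner switching from a full scan to a B-tree seek once an index exists — and every subsequent write now pays to maintain that index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friends (user_id INTEGER, friend_id INTEGER)")
conn.executemany("INSERT INTO friends VALUES (?, ?)",
                 [(u, f) for u in range(100) for f in range(50)])

# Without an index the lookup is a full scan of all 5,000 rows.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT friend_id FROM friends WHERE user_id = 42"
).fetchone()
print(plan[-1])  # e.g. "SCAN friends"

# With an index the same query becomes a B-tree seek; every
# INSERT/UPDATE/DELETE from here on also updates this structure.
conn.execute("CREATE INDEX ix_friends_user ON friends (user_id)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT friend_id FROM friends WHERE user_id = 42"
).fetchone()
print(plan[-1])  # e.g. "SEARCH friends USING INDEX ix_friends_user (user_id=?)"
```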
Many relational database users make heavy use of triggers and cascade activities that slow writes while lubricating reads. While many are wary of triggers in general (especially where business logic gets embedded in the data layer), this is common in the relational database world and makes an appearance in most solutions.
For Digg’s particular scenario, however, the RDBMS analogy to their NoSQL approach is a basic materialized view (aka indexed view), which is a feature of most RDBMS products, from big to small.
Implementing materialized views adds a sometimes substantial cost to writes in return for supercharged reads. If I have a particular set of joins and functions that are queried often, I can materialize the view with the appropriate indexes and every change to any of the source tables automatically, as an added cost of the DML, updates the materialized view as well.
Some RDBMSs support deferred materialized view updates, automatically queuing up the view changes without adding cost to writes against the origination tables.
This is very old hat for virtually anyone competent with relational databases, though real-time materialized views need to be used judiciously because they fall under the auspices of ACID and can front-load write operations significantly.
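For platforms like MySQL that lack native materialized views, the same effect is commonly approximated with a trigger-maintained summary table. A hedged sketch (SQLite syntax; the `diggs`/`digg_counts` names are invented for illustration) — each write pays a little extra so the aggregate read collapses to a single key lookup:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE diggs (item_id INTEGER, user_id INTEGER);
CREATE TABLE digg_counts (item_id INTEGER PRIMARY KEY, n INTEGER);

-- The "materialization": insert cost grows, read cost collapses.
CREATE TRIGGER diggs_ai AFTER INSERT ON diggs BEGIN
  INSERT OR IGNORE INTO digg_counts (item_id, n) VALUES (NEW.item_id, 0);
  UPDATE digg_counts SET n = n + 1 WHERE item_id = NEW.item_id;
END;
""")

conn.executemany("INSERT INTO diggs VALUES (?, ?)",
                 [(1, u) for u in range(400)] + [(2, u) for u in range(7)])

# Reading the aggregate is now a single primary-key seek, not a scan.
print(conn.execute("SELECT n FROM digg_counts WHERE item_id = 1").fetchone()[0])  # 400
```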
Digg Don’t Do Indexing (properly)
Ignoring that obvious solution of materialized views (which, to be fair, aren't natively supported by MySQL despite being a basic feature of most other database products), their post reveals that they aren't using the database appropriately (or that MySQL is simply a broken platform, turning people against the RDBMS when really they should be against MySQL) when they note that they are manually performing the joins in PHP, claiming that the join takes too long to run as a single query.
A likely contributor to their poor performance is that while they've made the artificial key "id" their clustered index, their userid/friendid index is only a unique index, and I suspect, from the operation of their site, that they are also making use of the denormalized friendname column in their consuming code, forcing a full row lookup for every match.
If they retrieve columns outside of a non-clustered index (the most common mistake is doing a “SELECT *” when you don’t actually care about all of the columns), on every lookup match the database server pulls the row id (in this case the primary key) and then has to do another lookup for the actual row data. In their case — given that the relationship is unique — they should have made the compound key of user_id/friend_id the clustered index and eliminated the id column altogether.
This oversight means that instead of doing a simple partial index scan by user_id and pulling the limited set, the query engine is forced to pull the list of rows, and then lookup each and every row individually. So someone with 400 friends yields 400 IO cycles, versus 1 with a proper index. The same problem exists in the Diggs table, but is made worse.
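In SQLite terms the fix can be sketched as follows (a `WITHOUT ROWID` table stores the whole row in the primary-key B-tree, which is what InnoDB's clustered primary key does as well; names again are illustrative). A lookup by user_id then reads one contiguous index range instead of doing one bookmark lookup per friend:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE friends (
  user_id   INTEGER,
  friend_id INTEGER,
  PRIMARY KEY (user_id, friend_id)
) WITHOUT ROWID
""")
conn.executemany("INSERT INTO friends VALUES (?, ?)",
                 [(u, f) for u in range(50) for f in range(400)])

# The lookup is satisfied entirely from the primary-key B-tree:
# one partial range scan, no per-row secondary lookups.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT friend_id FROM friends WHERE user_id = 7"
).fetchone()
print(plan[-1])  # e.g. "SEARCH friends USING PRIMARY KEY (user_id=?)"
print(conn.execute(
    "SELECT COUNT(*) FROM friends WHERE user_id = 7").fetchone()[0])  # 400
```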
The userid index is of limited value given that, again, it only helps them look up the surrogate record key (again, why not a primary key on itemid/userid with a secondary index of userid/itemid? Surrogate keys are usually a mistake if there's another unique key on the table, though of course it depends upon the scenario: foreign keys or numerous secondary indexes might make such a simple key the best choice). The query engine is forced to look up the records by either the itemid or the userid (via the friendid), then look up the root record, and then compare the corresponding value.
So many developers are blissfully ignorant of how databases work, and quick to ascribe their own shortcomings to the platform. Most will wave their hands and talk about how hard a "good DBA" is to come by, which is akin to pushing brutal bubble-sort implementations and just distributing them across a MapReduce deployment, claiming that a good "sort algorithm guy" is hard to find and that "scaling out" is what the big boys do.
So they could see a major performance improvement by indexing properly (I’m allowing that maybe they just gave a bad example, though their atrocious query performance seems to validate its accuracy), but even then looking up hundreds of seemingly randomly distributed records can be a costly exercise.
Change Is In The Air
Let’s step back for a minute, ignore materialized views and appropriately created and used indexes, and look at the core performance issue that Digg faced: looking up several hundred rows in the Friends table, and interrogating the Diggs table by userid/itemid for the same. Presume that the dataset is very large and can’t be cached in memory, which should be a normal design assumption.
Why is looking up several hundred randomly distributed records such a big deal?
Magnetic disks. Most hard drives can only manage to seek to different locations on the platter about a hundred times per second. If you’re relying on Amazon’s EBS you have it even worse, with an estimated 72 IOPS.
Imagine that the query engine has a hundred row locations in hand; it would take a full second of jumping around the disk to gather up the contents of those rows. That’s a best-case scenario, because in the real world it usually has to walk the index b-trees, find the matching data, and, in Digg’s inappropriately indexed case, do yet another lookup to find the actual row itself.
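The arithmetic is worth spelling out (the 100- and 72-IOPS figures are the ones quoted above; Python here is just a calculator):

```python
# Best-case wall time to fetch `rows` randomly scattered rows from a
# device that can service `iops` random I/Os per second.
def fetch_seconds(rows: int, iops: float) -> float:
    return rows / iops

print(fetch_seconds(100, 100))           # 1.0 second on a ~100-IOPS disk
print(round(fetch_seconds(100, 72), 2))  # 1.39 seconds on ~72-IOPS EBS
print(fetch_seconds(400, 100))           # 4.0 seconds for the 400-friend case
```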
This is why database systems often ignore indexes entirely if the estimated match count exceeds a relatively small percentage of the data: anemic storage forces them into nominally expensive operations like full scans because, in the end, the scan is the cheaper choice. It’s why the engine often just reads and filters a burst of megabytes of data rather than selecting a few sparse records from an index.
It’s why it’s desirable to have the data in RAM, and why database servers should be loaded with copious memory. [Sidenote: it’s also why denormalizing can paradoxically slow down a database in many scenarios, because it grows the database beyond RAM unnecessarily. In the Digg case, note the username and friendname fields in the Friends table.]
The IOPS weak-point is why most enterprise databases add SANs with ranks and ranks of hard drives, ganging them together in such a way that many seeks occur simultaneously, vastly increasing the I/O rate.
A more attainable and far more disruptive advance is becoming reality, however: SSDs.
Take a look at some recent SSD benchmarks. In particular, look at the 4KB random read MB/s results. Near the bottom are the magnetic disks, including the esteemed VelociRaptor, which are absolutely decimated by the SSDs.
That is the test that is most applicable to the Digg scenario, and it is clearly evident how big of an impact it would have on their situation.
Instead of 100 IOPS, they would be looking at 15,000 IOPS. Put 6 of these in a RAID-10 array and you’d have a yield of 45,000 IOPS, with reliability. Even without learning how to properly index they could see an easy 450x performance improvement in that class of RDBMS queries. Add a materialized view and…the speed would be so obscene it would get banned from the App Store.
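Spelled out, using the per-device figures above (in RAID-10 every write lands on both members of a mirror pair, so the sustainable array rate is half the summed device rate; that is the source of the conservative 45,000 figure):

```python
disk_iops = 100        # a single magnetic disk
ssd_iops = 15_000      # one of the benchmarked SSDs
raid10_drives = 6

# Each write hits a mirror pair, halving the summed device rate.
raid10_iops = ssd_iops * raid10_drives // 2
print(raid10_iops)               # 45000
print(raid10_iops // disk_iops)  # 450 -- the improvement factor
print(400 / raid10_iops)         # well under a hundredth of a second
                                 # for the 400-row random fetch
```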
Those units are just $400 apiece, and the technology keeps getting bigger, faster, and cheaper. SSDs are a deeply, deeply disruptive change, especially to the large-scale database world.
Soon we’ll have even faster, larger drives that are cheaper, and so on. The nature of flash technology is that they can keep making it more and more parallel, so the IOPS are going to keep going up and up and up.
Optimizing around slow seek times is quickly becoming a negative-return activity. Many who embrace NoSQL are solving yesterday’s problem. Digg, for instance, derives their entire NoSQL benefit from data locality: all of the data for a given need is nicely bunched together, which of course is what materialized views do as well.
There Are Incredible MPP Options in the RDBMS World…But They’ll Cost You
The people who really demand high levels of database performance usually have a lot of money, which is why many of the products that deliver options like column-oriented storage (an implementation detail of an RDBMS that is primarily suited to very large-scale column aggregations; it isn’t suitable for an OLTP DB) or MPP (Massively Parallel Processing) cost absurd amounts.
Greenplum, Vertica, Teradata, ParAccel, Oracle RAC, Sybase ASE, DB2 MPP…these things are often priced out of all but the largest enterprises’ reach.
Look at the pricing of the upcoming SQL Server 2008 R2 release, in particular the Parallel Data Warehouse product that brings MPP to that platform: $58K per processor, which obviously excludes it from contention for the vast majority of applications.
If there is one thing that I would like to see come out of the NoSQL advocacy movement, it would be that mainstream databases feel the pressure to push down the functionality that they currently limit to the people with the biggest bank accounts (which they sell using the “how much do you have?” pricing model).