A reader wrote me regarding a performance issue they were having with PostgreSQL, and I thought the case study would make an interesting follow-up note on the whole SQL/NoSQL debate.
The scenario was that they needed to look up batches of geo-locations by postal code, passing in sets of 100 postal codes and retrieving the corresponding set(s) of latitudes and longitudes.
It is a real-world scenario, whether for a mail processing system, census analysis, sales tracking, or many other common data processing needs.
You could simulate such a scenario with data like this; I did just that with the Canada Postal Code data.
In the reader’s situation they were finding that batches of 100 postal code lookups took over a second.
That’s suboptimal for any system that needs to perform a large number of rapid lookups.
It is not the sort of task that I would normally do in a database. Hard to believe, perhaps, after the prior entries, but this is a process that I consider highly specialized – a unique snowflake, if you will.
The data is extremely static, and the usage is very specialized. Performance, rather than generalization, is paramount.
To validate the performance assumption I built a simple .NET test app that populated a Dictionary&lt;string, List&lt;Location&gt;&gt; with all 765,344 Canadian postal codes (some postal codes have multiple entries, so each dictionary element holds a list of 1..n results), and then looked up random sets of 4,000 postal codes. (Hint: you can create randomly sorted recordsets in SQL Server by ordering the results by NewId().)
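The shape of that test app is trivial. A minimal sketch, here in Java rather than .NET (the `Location` record, sample postal codes, and coordinates are illustrative placeholders, not the real data set):

```java
import java.util.*;

public class PostalLookup {
    // Hypothetical value type; each row of the real data carries a latitude/longitude pair.
    public record Location(double latitude, double longitude) {}

    private final Map<String, List<Location>> index = new HashMap<>();

    // A postal code can have multiple entries, so each map value is a list of 1..n results.
    public void add(String postalCode, Location loc) {
        index.computeIfAbsent(postalCode, k -> new ArrayList<>()).add(loc);
    }

    // O(1) average-case hash lookup; an unknown code returns an empty list.
    public List<Location> lookup(String postalCode) {
        return index.getOrDefault(postalCode, List.of());
    }

    public static void main(String[] args) {
        PostalLookup lookup = new PostalLookup();
        // Sample rows standing in for the full Canada Postal Code data.
        lookup.add("K1A 0B1", new Location(45.4215, -75.6972));
        lookup.add("K1A 0B1", new Location(45.4216, -75.6970));

        // Batch lookup over a set of codes, as in the scenario described above.
        for (String code : List.of("K1A 0B1", "H0H 0H0")) {
            System.out.println(code + " -> " + lookup.lookup(code).size() + " result(s)");
        }
    }
}
```

Build the map once at startup, then every batch is just n hash probes against memory.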
It could look up results at a rate of some 3,000,000+ lookups per second, running single-threaded on one mid-range core. Adding parallelization (extremely easy in .NET 4.0 using Parallel.ForEach) was of limited benefit, as the reduce stage and thread-safety efforts ate up any savings for all but the most unrealistically large test sets, though the code could easily be modified for full parallelization.
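For illustration, the parallel variant can be sketched with Java's parallel streams, a rough analog of Parallel.ForEach (the index contents and postal codes here are made up):

```java
import java.util.*;
import java.util.stream.Collectors;

public class ParallelLookup {
    // Resolve a batch of postal codes against an in-memory index in parallel.
    // Reads against an already-built map are safe without locking as long as
    // nothing writes concurrently; the collect (reduce) stage is where the
    // coordination cost appears, which is why small batches see little benefit.
    public static Map<String, double[]> resolve(Map<String, double[]> index,
                                                List<String> batch) {
        return batch.parallelStream()
                    .filter(index::containsKey)
                    .collect(Collectors.toMap(code -> code, index::get));
    }

    public static void main(String[] args) {
        // Toy index standing in for the full 765,344-entry data set.
        Map<String, double[]> index = Map.of(
                "K1A 0B1", new double[] {45.4215, -75.6972},
                "V6B 1A1", new double[] {49.2827, -123.1207});

        Map<String, double[]> results =
                resolve(index, List.of("K1A 0B1", "V6B 1A1", "H0H 0H0"));
        System.out.println(results.size() + " of 3 codes resolved");
    }
}
```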
That was an ultra-simple solution with very few lines of code, specialized for the purpose. It did consume some 140MB of memory, but memory is bountiful and cheap.
Doing the same lookups in SQL Server, with optimized fill factors and a perfectly covering index (even after priming the cache by doing a full select of the covering index), yielded a return rate of approximately 5,000 lookups per second per core on the same hardware. The generalized execution engine simply isn’t optimized for such a trivial lookup workload, and imposes significant overhead that isn’t beneficial in this case.
3,000,000 versus 5,000 lookups per second per core is a compelling reason to use a specialized solution, especially given that key/value lookup is a long-solved problem.
The reader, after some correspondence, mentioned that they had settled on Redis, a solution that sits midway between the custom in-application hashing solution and a generalized SQL solution (leaning much closer to the former than the latter). The performance of such a solution will almost certainly be very high, albeit bound by the overhead of IPC. Redis is highly optimized for that task, and is quickly proving itself to be a viable part of many solutions.
It is the right solution for that problem. In no way is it a “new world order of social media and intraconnected graphs realigning the stars to herald the new way of using data”; it is simply a very appropriate use of the right tool. Redis, like Memcache, has a lot of performance metrics on its side, unlike many of the other NoSQL solutions.
Using the right tool is what we should all strive to do.