Micro-benchmarks as the Canary in the Coal Mine

I frequent a number of programming-focused social news sites as a morning ritual. You don’t have to chase every trend, but staying aware of what is happening in the industry, and learning from other people’s discoveries and adventures, is a useful exercise.

A recurring source of content is the micro-benchmark of some easily understood sliver of our problem space, the canonical example being a trivial web implementation in one’s platform of choice.

A Hello World for HTTP.

package main

import (
   "fmt"
   "net/http"
)

// handler answers every request with the same fixed string.
func handler(w http.ResponseWriter, r *http.Request) {
   fmt.Fprintf(w, "Hello world!")
}

func main() {
   http.HandleFunc("/", handler)
   http.ListenAndServe(":8080", nil)
}

Incontestable proof of the universal superiority of whatever language is being pushed. Massive numbers of meaningless requests served by a single virtual server.
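For a sense of how such numbers get produced, here is a rough sketch of a measurement loop. It is not how any particular published benchmark was run: the helper names (measure, runLocal), the request count, and the single sequential client are all assumptions made for illustration, and real load generators use many concurrent connections.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// helloHandler mirrors the trivial handler from the listing above.
func helloHandler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprint(w, "Hello world!")
}

// measure issues n sequential GET requests against url and returns how
// many came back with status 200 and how long the whole run took.
func measure(url string, n int) (int, time.Duration, error) {
	ok := 0
	start := time.Now()
	for i := 0; i < n; i++ {
		resp, err := http.Get(url)
		if err != nil {
			return ok, time.Since(start), err
		}
		// Drain and close the body so the connection can be reused.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			ok++
		}
	}
	return ok, time.Since(start), nil
}

// runLocal starts the handler on a loopback port chosen by the OS,
// sends n requests to it, and reports successes and elapsed time.
func runLocal(n int) (int, time.Duration, error) {
	srv := httptest.NewServer(http.HandlerFunc(helloHandler))
	defer srv.Close()
	return measure(srv.URL, n)
}

func main() {
	const n = 1000 // hypothetical request count
	ok, elapsed, err := runLocal(n)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Printf("%d/%d requests in %v (%.0f req/s, single client)\n",
		ok, n, elapsed, float64(ok)/elapsed.Seconds())
}
```

Client and server share one process and one machine here, so the figure it prints says more about the loopback interface than about anything a user would experience, which is rather the point.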

As an aside that I should probably add as a footnote, I still strongly recommend that static and cached content be served from a dedicated platform like nginx (use lightweight unix sockets to the back end if on the same machine), itself very likely fronted by a CDN. This sort of trivial stuff should never be in your own code, nor should it be a primary focus of optimizations.
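The unix socket arrangement can be sketched on the Go side as follows. This is a minimal illustration under stated assumptions, not a production setup: the socket name app.sock, the helper names, and the placeholder host "backend" are all made up, and getOverSocket merely stands in for the front-end proxy (in nginx the equivalent would be a proxy_pass directive pointing at the socket path).

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
	"path/filepath"
)

// serveOnSocket starts an HTTP server on a Unix domain socket at path
// and returns it so the caller can shut it down.
func serveOnSocket(path string, h http.Handler) (*http.Server, error) {
	ln, err := net.Listen("unix", path)
	if err != nil {
		return nil, err
	}
	srv := &http.Server{Handler: h}
	go srv.Serve(ln) // Serve returns once the server is closed.
	return srv, nil
}

// getOverSocket performs a GET through the socket, standing in for the
// front-end proxy. The host in the URL is a placeholder; only the
// socket path decides where the connection goes.
func getOverSocket(path, urlPath string) (string, error) {
	client := &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, "unix", path)
		},
	}}
	resp, err := client.Get("http://backend" + urlPath)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// demo wires the two together inside a temporary directory.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "backend")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	sock := filepath.Join(dir, "app.sock") // hypothetical socket name

	srv, err := serveOnSocket(sock, http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprint(w, "dynamic response")
		}))
	if err != nil {
		return "", err
	}
	defer srv.Close()
	return getOverSocket(sock, "/")
}

func main() {
	body, err := demo()
	fmt.Println(body, err)
}
```

The application only ever sees requests that need application logic; everything static is answered by the layer in front of it.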

Occasionally the discussion will move to a slightly higher level and there’ll be impassioned debates about HTTP routers (differentiating URLs, pulling parameters, etc, then calling the relevant service logic), everyone optimizing the edges. There are thousands of HTTP routers on virtually every platform, most differentiated by tiny performance differences.
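For readers who have not written one, the core of such a router is small. The sketch below is a deliberately naive version, assuming a made-up ":name" parameter syntax and returning a handler name rather than calling real service logic; it is not modelled on any particular library, most of which use tries or similar structures to avoid the linear scan.

```go
package main

import (
	"fmt"
	"strings"
)

// route pairs a pattern such as "/users/:id" with a handler name.
// Returning a name instead of calling a handler keeps the sketch small.
type route struct {
	pattern string
	name    string
}

// match compares path against pattern segment by segment. Segments that
// start with ':' are parameters and capture whatever appears in the path.
func match(pattern, path string) (map[string]string, bool) {
	ps := strings.Split(strings.Trim(pattern, "/"), "/")
	xs := strings.Split(strings.Trim(path, "/"), "/")
	if len(ps) != len(xs) {
		return nil, false
	}
	params := map[string]string{}
	for i, p := range ps {
		switch {
		case strings.HasPrefix(p, ":"):
			params[p[1:]] = xs[i]
		case p != xs[i]:
			return nil, false
		}
	}
	return params, true
}

// resolve returns the first route whose pattern matches path.
func resolve(routes []route, path string) (string, map[string]string, bool) {
	for _, r := range routes {
		if params, ok := match(r.pattern, path); ok {
			return r.name, params, true
		}
	}
	return "", nil, false
}

func main() {
	// Hypothetical routes, for illustration only.
	routes := []route{
		{"/users/:id", "showUser"},
		{"/users/:id/orders/:order", "showOrder"},
	}
	name, params, ok := resolve(routes, "/users/42/orders/7")
	fmt.Println(name, params, ok)
}
```

The work per request is a handful of string comparisons, which is why the differences between routers tend to be so small.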

People once cut their teeth by making their own compiler or OS, but now everyone seems to start by making an HTTP router. Focus moves elsewhere.

In a recent discussion of a micro-benchmark (one used to promote a pre-alpha platform), a user said of Go, one of the lesser alternatives it was compared against:

“it’s just that the std lib is coded with total disregard for performance concerns, the http server is slow, regex implementation is a joke”

Total disregard. A joke. Slow.

On a decently capable server, that critiqued Go implementation, if you’re testing it in isolation and don’t care about doing anything actually useful, could serve more requests than seen by the vast majority of sites on these fair tubes of ours. With an order of magnitude or two to spare.

Hundreds of thousands of requests per second is simply enormous. It wasn’t that long ago that we were amazed at 100 requests per second for completely static content cached in memory. Just a few short years ago most frameworks tapped out at barely double-digit requests per second (’twas the era of synchronous IO and blocking a thread for every request).

As a fun fact, a recent implementation I spearheaded attained four million fully robust web service financial transactions per second. This was on a seriously high-end server, and used a wide range of optimizations such as a zero-copy network interface and secure memory sharing between service layers, and ultimately was just grossly overbuilt unless conquering new worlds, but it helped a sales pitch.

Things improve. Standards and expectations improve. That earlier era really was a poor state of affairs: not only were users given a slow, poor experience, but even modest traffic often required farms of servers.

Choosing a high performance foundation is good. The common notion that you can just fix the poor performance parts after the fact seldom holds true.

Nonetheless, the whole venture made me curious what sort of correlation trivial micro-benchmarks hold to actual real-world needs. Clearly printing a string to a TCP connection is an absolutely minuscule part of any real-world solution, and once you’ve layered in authentication and authorization and models and abstractions and back-end microservices and ORMs and databases, it becomes a rounding error.

But does it indicate choices behind the scenes, or a fanatical pursuit of performance, that pays off elsewhere?

It’s tough to gauge because there is no universal web platform benchmark. There is no TPC for web applications.

The best we have, really, are the TechEmpower benchmarks. These are a set of relatively simple benchmarks that vary from absurdly trivial to mostly trivial:

  • Return a simple string (plaintext)
  • Serialize an object (containing a single string) into a JSON string and return it (json)
  • Query a value from a database, and serialize it (an id and a string) into a JSON string and return it (single query)
  • Query multiple values from a database and serialize them (multiple queries)
  • Query values from a database, add an additional value, and serialize them (fortunes)
  • Load rows into objects, update the objects, save the changes back to the database, serialize to json (data updates)
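To make the first two test types concrete, here is a minimal sketch of what a plaintext and a json handler might look like in Go. The "Hello, World!" payload, the message type, and the call helper are assumptions made for illustration; this is not the code of any actual benchmark entry, and the database-backed tests are left out because they depend on a schema and driver not described here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// message is the single-string object the json test serializes.
type message struct {
	Message string `json:"message"`
}

// plaintextHandler returns a fixed string (the plaintext test).
func plaintextHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain")
	fmt.Fprint(w, "Hello, World!")
}

// jsonHandler builds an object per request and serializes it (the json test).
func jsonHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(message{Message: "Hello, World!"})
}

// call runs a handler in-process and returns the response body.
func call(h http.HandlerFunc) string {
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Body.String()
}

func main() {
	fmt.Printf("%q\n", call(plaintextHandler))
	fmt.Printf("%q\n", call(jsonHandler))
}
```

The only difference between the two is one small allocation and one pass through the encoder, which is why the json test is best read as a slightly heavier plaintext test rather than a different kind of workload.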

It is hardly a real-world implementation of the stacks of dependencies and efficiency barriers in an application, but some of the tests are worlds better than the trivial micro-benchmarks that dot the land. It also gives developers a visible performance reward, just as SunSpider led to enormous JavaScript performance improvements.

So here’s the performance profile of a variety of frameworks/platforms against the PostgreSQL database on their physical test platform, each clustered in a sequence of plaintext (blue), JSON (red), Fortune (yellow), Single Query (green), and Multiple Query (brown) results. The vertical axis has been capped at 1,000,000 requests per second to preserve detail, and only frameworks having results for all of the categories are included.

When I originally decided that I’d author this piece, my intention was to show that you shouldn’t trust micro-benchmarks because they seldom have a correlation with the more significant tasks that you’ll face in real life. While I’ve long argued that such optimizations often indicate a team that cares about performance holistically, in the web world products that shine at very specific things have frequently been very weak in more realistic use.

But in this case my core assumption was only partly right. The correlation between trivial micro-benchmark speed, simply returning a string, and the more significant tasks is much higher than I expected. I was sure it would be drowned out by the underlying processing: when you’re doing queries at a rate of 1000 per second, an overhead of 0.000001s is hardly relevant.
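The arithmetic behind that intuition is quick to check. The sketch below uses the two figures from the text; the function name overheadShare is made up, and it assumes requests are handled one after another, so that 1000 per second leaves one millisecond each.

```go
package main

import "fmt"

// overheadShare returns the fraction of each request's time budget that
// a fixed per-request overhead consumes, assuming requests are handled
// one at a time at the given rate.
func overheadShare(ratePerSecond, overheadSeconds float64) float64 {
	secondsPerRequest := 1 / ratePerSecond
	return overheadSeconds / secondsPerRequest
}

func main() {
	share := overheadShare(1000, 0.000001)
	fmt.Printf("%.2f%% of each request\n", share*100) // prints "0.10% of each request"
}
```

With concurrent handling each request can take longer than a millisecond at the same throughput, which makes the framework’s share smaller still.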

  • 0.75 – Correlation between JSON and plaintext performance
  • 0.58 – Correlation between Fortune and plaintext performance
  • 0.646 – Correlation between Single query and plaintext performance
  • 0.21371 – Correlation between Multiple query and plaintext performance
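For anyone wanting to run the same comparison on their own numbers, a coefficient like these can be computed in a few lines. The sketch below assumes the ordinary Pearson correlation over per-framework requests-per-second figures, which the text above does not specify, and the sample data in main is invented purely to exercise the function.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// pearson returns the Pearson correlation coefficient of xs and ys.
func pearson(xs, ys []float64) (float64, error) {
	if len(xs) != len(ys) || len(xs) < 2 {
		return 0, errors.New("need two equal-length series with at least two points")
	}
	n := float64(len(xs))
	var sumX, sumY float64
	for i := range xs {
		sumX += xs[i]
		sumY += ys[i]
	}
	meanX, meanY := sumX/n, sumY/n

	// Accumulate the co-deviation and each series' squared deviation.
	var cov, varX, varY float64
	for i := range xs {
		dx, dy := xs[i]-meanX, ys[i]-meanY
		cov += dx * dy
		varX += dx * dx
		varY += dy * dy
	}
	if varX == 0 || varY == 0 {
		return 0, errors.New("correlation is undefined for a constant series")
	}
	return cov / math.Sqrt(varX*varY), nil
}

func main() {
	// Made-up requests-per-second figures for five imaginary frameworks.
	plaintext := []float64{900000, 650000, 400000, 120000, 60000}
	singleQuery := []float64{110000, 95000, 100000, 30000, 25000}
	r, err := pearson(plaintext, singleQuery)
	fmt.Println(r, err)
}
```

A value near 1 means frameworks that rank well on one test tend to rank well on the other; it says nothing about the size of the gap between them.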

As more happens in the background, outside of the control of the framework, invariably the raw performance advantage is lost, but my core assumption was that there would be a much smaller correlation.

So in the end this is simply a “well, that’s interesting” post. It certainly isn’t a recommendation for one framework over another, since developer aptitude and suitability for the task reign supreme, but I found it interesting.