Benchmark Driven Development – Lies, Damn Lies, and Benchmarks

“What gets measured gets done.”

I decided to take the new SunSpider benchmarks for a spin, generating the pretty graphs found down below. Benchmarks are always entertaining, and it was enjoyable comparing the numbers yielded under various conditions: turning SpeedStep on and off (none of the benchmarks loaded the CPU long enough for it to bother stepping up from its lowered-power, 66% performance state), setting CPU affinity, running it on different PCs, trying different build options on my Firefox build, and so on.

“Why benchmark at all?” one might ask. Simple: If you find the right measures, the common wisdom goes, the inputs to the measure will improve as the various players work to improve the metrics.

Whether you’re measuring bugs per developer, lines of code, widgets per hour, whatever: start measuring it and it will invariably start moving in the desired direction, whether or not this actually serves your end goal. Often such initiatives come at the cost of whatever goes unmeasured, but over time the measure adapts and starts serving as beneficial feedback.

The Assembly Line Benchmark – Widgets per Hour

During the summer in my late teens I worked on an assembly line building car parts (pieces that played some sort of role in the air conditioning system – basically widgets): Put a little bracket in a
metal cylinder, add a circle of fiberglass, inject some desiccant beads using a machine, add another fiberglass circle and another metal bracket to hold it all in place and then put it in another machine that squashed another cylinder onto the top. Then I sent it down the line to the welder.

Atop my machine sat a little counter that monitored my progress, carefully recording every piece assembled. While this was a less advanced era — being the prehistoric early 90s and all — and I had to manually transfer the final count to my timecard for submission, every worker was kept somewhat honest by the metrics submitted by the other workers on the line.

Clearly I couldn’t have done 2000 parts in a day if the people before me and behind me in line only reported 1000, for instance, and vice versa.

Coupled with continuous, careful QA tests and random inspections (performed by people who had their own metrics to work towards), this struck me as an excellent system because it was difficult or impossible to game, and the onerous checks ensured that it didn’t come at the expense of quality.

It certainly worked wonders on me, as I whiled away the endless summer days performing the most awesomely brainless of tasks by competing with my own personal productivity “records”, trying to push out more quality parts per hour day after day.

I was there and had nothing better to do, and that little counter sat looking down on me, mocking me. It dared me to do just a couple more pieces per hour, and I willingly complied.

Somewhere a paper pusher and cackling middle manager would sum up the part counts and rub their hands together in giddy glee, eager for my zombie-state quest for worth to pad their bonus cheques.

It’s good I was a summer employee, because my pace didn’t make me friends on the line.

Test Driven Development tries to create a similar spirit of metrics, giving you a goal to strive for as you build out your product. It’s a comforting bit of feedback when all of the TDD tests come back with green checkmarks. The more tests you create, the higher the absolute count of passes you can brag about when the
product sails through with flying colors, easily passing 497 of the 497 tests.

Performance benchmarks serve the same purpose for the performance and efficiency domain.

Consider the initial hardware-accelerating video cards for Windows. Early on they seemed to have little or no purpose, and were almost abstract to users. Then benchmarks started appearing, giving the manufacturers something to strive towards while also providing end users with an easy way to compare and choose amongst
the options. “Card {A} can only do 10,000,000 accelerated rectangles per second, while card {B} can do 12,000,000. Clearly we need to get card {B} for our rectangle displaying needs.”

Gaming the Metrics

Of course some vendors started gaming the metrics in various creative ways (see Joel’s excellent piece on poorly thought out metrics). Several created products that actually recognized running benchmarks in hardware and “optimized” for them by any means possible, including simply discarding many of the benchmark commands, knowing that the end user would never notice if every second rectangle or line of rendered text, out of the millions per second, wasn’t being rendered. Worse, the benchmarks were so atrociously artificial, bearing so little similarity to actual everyday use, that the direction of progress was to optimize the performance of benchmarks, often to the detriment of everyday use.

Eventually the benchmarks matured, getting better and more realistic, the gaming was prevented or embarrassingly exposed, and benchmarks became a hugely beneficial tool in the field’s march forward. Various games have served the benchmarking role, the Doom and then Quake series being the most influential.

In the browser market, the growing interactivity of the web and the renewed competition amongst the big players have produced a flurry of benchmarks that are widely discussed and debated, stereotyping each of the browsers into performance ghettoes. “Firefox is sooooo slow….” “IE is garbage. Opera is super speedy!”

Having some real tests is of obvious benefit to “set the record straight”, not to mention that it provides a carrot for the competitors to chase. Exactly that happened with me a while back when I came across a string concatenation benchmark, so I went in and streamlined the piece of Firefox code specifically impacted by that benchmark. My change in place, Firefox indeed did much better on that specific benchmark, though the real-world benefit was negligible.

In many ways the various web benchmarks available remind me of the early accelerated video card benchmarks: crude, having little or no correlation with the pain points of real-world use, and ripe for gaming and false evangelism.

WebKit’s SunSpider

Which brings me to the recently released SunSpider benchmark (which is a credible contender for the widely coveted “most poorly chosen project name” award: It’s bad enough that an Apple project uses “Sun” in their product name, but it’s thrice as bad when it’s a project related to JavaScript – JavaScript being another nominee).

SunSpider is very easy to run and gives quick feedback, so quite a few charts and graphs have been sprouting on blogs across the land.

JavaScript/DOM performance is a huge concern right now, as web applications are growing in richness by leaps and bounds, so there is definitely a need to be filled.

Will SunSpider be what we’ve all been looking for?

Here’s just such a graph, charting the stacked benchmark runtime for the current tier-1 browsers for Windows.

SunSpider Benchmark Results

Benchmarks were performed on a 4GB, Q6600 quad-core Core 2 processor machine running Vista x64. Firefox 3 was built from the current CVS (as of this morning). The Y-axis represents milliseconds.

Such a benchmark provides immediate feedback regarding the biggest bang-for-the-buck optimization, at least with regard to improving the runtime of this particular benchmark. For IE 7 it is pretty clear that the benchmark killer is the bizarre and repeated use of string concatenation throughout the benchmark tests, particularly evident in the string-base64 and string-validate benchmarks.

Naive String Concatenations

I spent approximately 20 seconds (okay, maybe 22 seconds) “optimizing” the base64 and validate tests to use the extremely common Array push/join idiom, which is used on pretty much any page that does more than the most trivial of string operations. My changes were rash and very simplistic (if I were motivated, if this were production code, I could do a much better job with it), yet the performance changed rather dramatically, as seen in the following graph (scroll up and down for dramatic flair).

SunSpider Benchmark Results
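For readers who haven’t seen the two styles side by side, here is a minimal sketch of the kind of rewrite described above. It is not the actual SunSpider code or my actual patch; the function names are made up for illustration, and the array version is written with `join("")` as the final step. Whether the array form is actually faster depends entirely on the engine: it helped enormously on IE 7, while engines that optimize `+=` internally may show little difference, or the reverse.

```javascript
// Naive approach: grow the result with += on every iteration.
// On engines that copy the whole string each time, this gets slow quickly.
function buildWithConcat(parts) {
  let result = "";
  for (const part of parts) {
    result += part;
  }
  return result;
}

// Array idiom: collect the pieces, then join them once at the end.
function buildWithJoin(parts) {
  const buffer = [];
  for (const part of parts) {
    buffer.push(part);
  }
  return buffer.join("");
}

// Both functions produce the same string; only the construction differs.
const sample = ["a", "b", "c"];
console.log(buildWithConcat(sample) === buildWithJoin(sample)); // true
```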

It’s late and I’m tired. I’d guarantee that I could dramatically cut the runtime of the largest remaining test (string tagcloud), but I think the point is proven.

Some will naturally draw from this the presumption that I’m just an Internet Explorer 7 fan, desperately manipulating the benchmark to best fit the strengths and avoid the weaknesses of my favourite browser.

They’d be wrong.

My browser of choice is Firefox. Not only do I find the featureset of Internet Explorer 7 uncompelling and anemic compared to a naked copy of Firefox (not even considering the enormous functionality offered by add-ins, such as the extraordinary Firebug), I find the performance of Microsoft’s offering to be atrocious on real-world websites.

I don’t like Internet Explorer on technical grounds, and I like it even less given the concerning conflict of interest it represents.

Perhaps I’m just bearing a grudge.

We’re currently implementing a very rich, advanced web application, and one thing that we’ve found, in case after case, is that in real-world situations with extensive DOM manipulation and production JavaScript, Internet Explorer stumbles and groans under the load, while competing browsers complete the task with gusto. Just rendering a dynamically loaded complex table takes 20x or more time on IE 7 than it does in Firefox 2, rendering the former almost unusable, and the disparity grows greater with Firefox 3. It’s to the point that I can’t help but wonder if Microsoft is trying to undermine the whole web thing intentionally, hoping to encourage the middle-grounders to flock to the boards proclaiming the deficiencies of web apps, manipulated into begging for some XAML goodness.

So if I wasn’t looking to defend IE7, what was my point?

Lies, Damned Lies, and Benchmarks

Maybe the motivations of the team behind this benchmark were noble, and they weren’t blinded into naturally biasing the benchmark towards their own project, but I can’t help but see this as an entirely artificial, naive, unrealistic benchmark that adds little to the benchmarking landscape. A cursory glance through the benchmark reveals bizarre oddities that would never appear in real-world code, and a variety of implementation choices that are questionable for a benchmark. For instance, test/sample data is often constructed within the timed scope of the SunSpider tests, as if a production website needs to create 4000 random email addresses and ZIP codes. Normally such data is constructed outside of the timed loop, for obvious reasons.
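To make the point about timed scope concrete, here is a rough sketch of keeping data construction out of the measurement. None of this is taken from SunSpider: `makeAddresses` and `countValid` are hypothetical helpers, the regular expression is a deliberately loose stand-in for real validation, and `Date.now()` is used only because it is the simplest available clock.

```javascript
// Hypothetical setup helper: builds n pseudo-random e-mail-like strings.
function makeAddresses(n) {
  const out = [];
  for (let i = 0; i < n; i++) {
    out.push("user" + Math.floor(Math.random() * 1e6) + "@example.com");
  }
  return out;
}

// The work we actually want to measure: a simple validation pass.
function countValid(addresses) {
  const pattern = /^[^@\s]+@[^@\s]+\.[^@\s]+$/; // loose illustrative check
  let valid = 0;
  for (const address of addresses) {
    if (pattern.test(address)) {
      valid++;
    }
  }
  return valid;
}

// Setup happens before the clock starts, so its cost is not measured.
const addresses = makeAddresses(4000);

// Only the validation pass sits between the two clock reads.
const start = Date.now();
const valid = countValid(addresses);
const elapsedMs = Date.now() - start;

console.log(valid + " addresses validated in " + elapsedMs + " ms");
```

Moving the `makeAddresses` call between the two clock reads would fold random-number generation and string building into the reported time, which is the pattern criticized above.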

The lack of weighting, the lack of realistic test scenarios… I’m just not convinced that it holds much utility for cross-browser comparison. I do like the “driver”, and the elegant and clean client-side way they aggregate the test values and do the same for comparison; the framework is a great foundation. I can also see use in analyzing performance differences for a single browser (the results of turning Firebug on and off, for instance, were very surprising), just not as a valid comparison between different browsers.
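As a sketch of what weighting could look like (an assumption on my part, not anything SunSpider does), one option is a weighted geometric mean of the per-test runtimes, so that a single pathological test cannot dominate the total the way it does in a plain sum. The test names, runtimes, and weights below are all invented for illustration.

```javascript
// Weighted geometric mean of per-test runtimes in milliseconds.
// times:   { testName: runtimeMs }
// weights: { testName: weight }; tests without an entry get weight 1.
function weightedGeometricMean(times, weights) {
  let logSum = 0;
  let weightSum = 0;
  for (const name of Object.keys(times)) {
    const w = name in weights ? weights[name] : 1;
    logSum += w * Math.log(times[name]);
    weightSum += w;
  }
  return Math.exp(logSum / weightSum);
}

// Invented numbers: one slow outlier among otherwise similar tests.
const times = { testA: 100, testB: 120, testC: 5000 };

// Hypothetical weights reflecting how much each test resembles real pages.
const weights = { testA: 3, testB: 3, testC: 1 };

const plainSum = times.testA + times.testB + times.testC;
const score = weightedGeometricMean(times, weights);

console.log("plain sum: " + plainSum + " ms");
console.log("weighted geometric mean: " + score.toFixed(1) + " ms");
```

Choosing the weights is the hard part, of course; that is exactly where real-world usage data would have to come in.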

Just as I dramatically changed the IE results in less than a minute of code changing, I’d guarantee that I could do the same with the other outliers (in particular the longer Firefox tests).

I’m still waiting for a good, real-world benchmark. Something that simulates sites like Digg, Slashdot, Facebook, interacting with them in the way that a real-world user really would.