Using Go for High Performance Code

An interesting series of posts regarding Go has percolated up recently, starting innocently enough with a fascinating story about a business card raytracer.

If you haven’t seen it, take a moment to read about that shockingly concise and elegant raytracer, of such brevity that it can be reasonably typeset on the back of a business card.

In any case, this led to someone translating the code to Go to demonstrate that Go has become a reasonably high-performance platform: first porting it as closely as possible, then doing some rudimentary optimizations, documenting their progress along the way.

Don’t go looking at their Go code for any best practices as there are some pretty wild veers from the idioms and best practices of the language (with a raw palpable scent of hack permeating the work), but it is an interesting exercise nonetheless. It’s also impressive to see the performance improvements between each version of Go, with even 1.1 – 1.2RC1 showing significant gains, and that isn’t even looking at gccgo, which is the supposed go-to if you want to really squeeze every bit of performance out.

Now let me take a moment and be very clear about something: I do not recommend Go for bleeding edge performance code. In a few prior entries where I’ve mentioned Go, it has been as a remarkably productive, robust, intuitive platform for building high-concurrency systems, and as the tape that binds systems and solutions together. In those roles Go is a gem.

In my case I’ve leveraged it in concert with very high performance C(++) code built using the Intel Compiler and all of the fantastic processor vectorization technology it brings (I’ve gotten pretty good at generating solutions that it will auto-vectorize with ease; don’t believe the disbelievers who doubt the efficacy of such optimizations, as compilers such as Intel’s will hold your hand as they illustrate how to achieve coding Zen).

To step back for a moment: vectorization takes a series of instructions and compresses them into a single instruction operating on multiple data, aka SIMD, which exists on all modern platforms now (as NEON on ARM, for instance). If you have two arrays of floats and need the product at each offset, you can iterate and multiply one element at a time, or you can load eight at a time into each of two AVX registers and issue a single instruction that performs all eight multiplications. This can significantly improve performance in some cases, and it lets processors do more work without an increase in clock speed.
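In Go, the scalar version of that float-product example looks like the sketch below (the function name is mine, for illustration). This is exactly the loop shape an auto-vectorizing compiler can compress into packed SIMD multiplies, eight float32 lanes per AVX instruction:

```go
package main

import "fmt"

// mulArrays computes the element-wise product of a and b into out.
// Written scalar-style: one multiply per iteration. A vectorizing
// compiler can turn this loop into packed SIMD multiplications.
func mulArrays(a, b, out []float32) {
	for i := range out {
		out[i] = a[i] * b[i]
	}
}

func main() {
	a := []float32{1, 2, 3, 4, 5, 6, 7, 8}
	b := []float32{2, 2, 2, 2, 2, 2, 2, 2}
	out := make([]float32, len(a))
	mulArrays(a, b, out)
	fmt.Println(out) // [2 4 6 8 10 12 14 16]
}
```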

But that doesn’t really reduce my advocacy of Go, because the overwhelming majority of code you and I write does not need to squeeze every iota of performance out of the CPU. It is vastly more likely that optimal algorithms and basic practices like memoization or zero-copy, and simply avoiding doing stupid things, will have orders of magnitude more impact on the performance of your solutions than leveraging the latest SIMD instructions. That holds for the overwhelming majority of the system platform itself as well, where again gaps in algorithms are the cause of so much pain.

In any case, the story about Go raised the ire of quite a few. This cannot be, they rallied. And really, it was surprising that the results were so close, given the years of refinement and progress that have gone into C(++) compilers: having Go neck and neck with C++ on code that is the bread and butter of the latter was unexpected.

But here’s where things get ugly: in the search for superiority, the exercise turned into a lie. I’m not even talking about the egregious errors, like the C++ version’s convenient shift to single-precision floating point while the Go variant used double precision.

The first step down the road of lies was the inline use of SSE, and then AVX.

Make an argument about the superiority of C compilers at auto-vectorization (they’ve been evolving this technology for years), and about the suitability of the language for it, and you’ve got a good argument. Stick a bunch of inline SIMD in your code and you’re foisting a lie. That particular piece was nonsensical through and through, using leading language to imply things that aren’t supported: for instance, when noting the increased performance of C++ (given nothing-to-do-with-C++ AVX steroids), it mentions “on my not-very-fast-at-all Core i3-2100T”, as if that somehow makes the C++ variant even better than demonstrated, despite the fact that Go was benchmarked on exactly the same not-very-fast-at-all processor.

It no longer has anything to do with C versus Go, but instead is trying a cheap trick to advance some advocacy.

Because, of course, you can use inline SIMD in Go as well. I’ve talked about import “C” before, and with it you have the full ability to utilize the power of SSE through AVX. There are some caveats (structure alignment in particular), but in my case I threw together a demo that used the most superficial of C surrogates, with all of the advantages of either SSE or AVX, as you wish (the latter doubling the pleasure and the fun; AVX-512, coming with Knights Landing, will double it yet again). Launch all zigs, performance is turned up to 11.

You can even leverage assembly in your Go solutions if you really want to.

Now it might seem odd to promote the performance of Go through its inclusion of C, but that’s exactly what I’ve done before and will continue to do — it is the first-class support of C and its rudiments that makes Go so viable so quickly. It is that ability to tune the edges that makes me so confident in leveraging that rich, robust platform.