Rust Faster Than C? Not So Fast…

A submission made the rounds on Hacker News a few days ago: Rust outperforming C on a benchmark.

That submission links to the Computer Language Benchmarks Game leaderboard for k-nucleotide, where the top performer, as of this post, is a Rust entrant.

Which is impressive. Rust is a relatively young language (although it heavily leverages LLVM on the back end, standing on the shoulders of giants), and it is making waves while bringing much-needed concurrency and memory-safety benefits.

For some classes of problems, Rust is proving itself a compelling option even in performance-critical code.

This is a very simple test of rudimentary functionality, however (it is largely an exercise in hash table throughput), so if the C entry is lagging, some part of its implementation must be sandbagging it or optimizing for something other than speed.
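
For context, the hot path of k-nucleotide boils down to sliding a k-length window over a DNA sequence and bumping a per-key counter in a hash table. A rough sketch (hash_increment is a hypothetical stand-in for the table update; the actual benchmark entries differ in details):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the hash table update in a real entry. */
extern void hash_increment(uint64_t key);

/* Count k-mers: pack each window of k bases (2 bits per base) into an
   integer key, then increment that key's counter. Assumes k < 32. */
void count_kmers(const char *seq, size_t len, int k)
{
    uint64_t key = 0;
    const uint64_t mask = ((uint64_t)1 << (2 * k)) - 1;
    for (size_t i = 0; i < len; i++) {
        uint64_t code = ((uint64_t)seq[i] >> 1) & 3U; /* maps A/C/G/T to 0..3 */
        key = ((key << 2) | code) & mask;
        if (i + 1 >= (size_t)k)
            hash_increment(key);
    }
}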

Firing Up VTune and Profiling the Laggards

A quick profile identified the overwhelming outlier: the set of instructions performing the bit mangling that tests and sets the occupied and deleted flags. In an effort to optimize memory usage, khash stores flags (one marking a bucket empty or used, another indicating deletion) for each hash bucket in a bit array, 16 buckets sharing a single 32-bit integer courtesy of bit packing.
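
For illustration, here is a sketch in the spirit of the khash flag machinery (approximate, not the verbatim khash.h macros): for bucket i, flags[i >> 4] is the host 32-bit word, and (i & 0xf) * 2 is the bit offset of that bucket's two flag bits within it.

#include <stdint.h>

/* Two flag bits per bucket, 16 buckets packed into each 32-bit word. */
static inline int is_empty(const uint32_t *flags, uint32_t i)
{
    return (flags[i >> 4] >> ((i & 0xfU) << 1)) & 2U;
}

static inline int is_deleted(const uint32_t *flags, uint32_t i)
{
    return (flags[i >> 4] >> ((i & 0xfU) << 1)) & 1U;
}

static inline void set_used(uint32_t *flags, uint32_t i)
{
    /* Clear the "empty" bit to mark the bucket occupied. */
    flags[i >> 4] &= ~(2U << ((i & 0xfU) << 1));
}

Every test or set pays for an index computation, a variable shift, and a mask, and that is the work the profiler flagged.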

While memory efficient, determining the host integer for a bucket, and then the bit offsets of its flags within it, adds significant overhead: a tiny cost by itself, but substantial when iterated hundreds of millions of times. This was made worse by the code style (shoehorning generic-type functionality into a C header file via macros), which makes it difficult for the compiler to optimize down to bit test instructions.

This sort of optimization is an interesting one because the overhead of bit mangling could theoretically be balanced out by a better cache hit rate (more data in less memory), though this is unlikely to hold for an open-addressing, double-hashing implementation, as such tables are generally cache unfriendly.

Curious about the impact, I switched the flag type to an unsigned char and eliminated the variable bit shifting (there is still fixed bit testing of the first and second bits). The trivial modification to khash.h can be found here (edit: I updated that version slightly to add another small optimization and to improve the readability of the flags).
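
A minimal sketch of what the change looks like, assuming the same flag semantics as above (the actual modified khash.h may differ in details):

#include <stdint.h>

/* One byte of flags per bucket: the two flag bits sit at fixed positions,
   so there is no host-word lookup and no variable shift. */
static inline int is_empty(const unsigned char *flags, uint32_t i)
{
    return flags[i] & 2; /* fixed test of the second bit */
}

static inline int is_deleted(const unsigned char *flags, uint32_t i)
{
    return flags[i] & 1; /* fixed test of the first bit */
}

static inline void set_used(unsigned char *flags, uint32_t i)
{
    flags[i] &= (unsigned char)~2; /* clear the "empty" bit */
}

Each operation is now a simple byte load and a fixed-bit test, exactly the pattern a compiler can reduce to a bit test instruction.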

C Takes The Lead Again

This yielded identical calculations and results, and the same max memory usage (130,592 as reported by time -v), but completed some 32% faster than the original khash implementation; the hashing became significantly faster, with the lagging points moving to input parsing and the other overhead of the test. My test device is obviously not the same as the one the Benchmarks Game runs on (though the results are very similar), but assuming the same speedup it would push the C implementation to 4.32 seconds, a sizable lead, courtesy of a trivial and arguably optimal change. Memory overhead is slightly worse, with 6 wasted bits per bucket, but the change dramatically reduces the cost of bucket operations.

GCC Yielded Faster Code Than Clang

As an aside, several of the comments on that discussion questioned why Clang wasn't used, the hypothesis being that it would yield better performance courtesy of stronger optimization. In my quick tests, including generating profile outputs and using profile-guided optimization, GCC generated better-optimized code, winning the performance battle for this test every time. This was between GCC 6.2.0 and Clang 3.9.1. As another interesting aside, the GCC- and Clang-built binaries were significantly faster running in a VirtualBox Linux VM [1] than the binary built with either Visual Studio 2015 or 2017 running in the host environment, even after factoring out the slow stdin handling of Windows.

The C/C++ compiler in Visual Studio has gotten very good (its standards compliance, once its biggest weakness, now surprises me), but it falls behind in this test.

Conclusion

There is no conclusion: Rust is great, C is great, and profilers are an awesome tool to lean on. This was a fun exercise in leveraging a performance profiler (in this case VTune). It took seconds to identify the "culprit": a memory optimization that comes at the cost of a significant number of CPU cycles. Shift the focus from minimizing the memory profile to raw speed, and performance improves significantly.

These game/toy benchmarks don't mean a tremendous amount, and no one should lose sleep over them, but they are a fun exercise and an entertaining distraction.

[1] – Totally irrelevant for this test (it is not a candidate for vectorization), but it's notable that you can enable AVX2 availability in VirtualBox VMs via:

VBoxManage setextradata "$vm_name" VBoxInternal/CPUM/IsaExts/AVX2 1