Everything is static through the right time window

When I first envisioned the limited-time side project Name Navigator a couple of weeks back, I had just downloaded the Social Security state-level data containing millions of sparse records of name occurrences per year per state.

I wasn’t entirely sure what I would or could do with it, so I didn’t yet know what the cacheability of the data was. I hadn’t yet made decisions on the richness that the interface would provide.

Could you search intra-name through regular expressions, for instance? Could you pull whole sets of high-correlation (or anti-correlation) names in one go? What would provide clearly obvious usage and value, while minimizing server load and maximizing performance?

So I built a simple API in Go (ridiculously simple while featuring very high performance) that pulled the cleaned data in, scrubbed and organized it as appropriate, and then served it up by request. That sat behind an nginx instance featuring proxy caching, meaning that if those Go responses set expiry headers allowing for caching, the nginx proxy cache would save the response to disk and then reuse it for functionally identical requests until it expired or was forcefully invalidated. The proxy cache even defends against the thundering herd problem by grouping requests for the same data and only forwarding one request to the back-end server.

Even the process of serializing to JSON is “expensive” to some degree, so saving the pre-serialized data is a win, especially given that I didn’t have to load up the Go code with caching overlays everywhere.

Adding explicit caching to code is one of the quickest ways to yield unmanageable, ugly code. The external caching of the nginx proxy is extremely clean and configurable.

Given that the SS data only changes once a year, this means that at some point the vast majority of responses would be served from the nginx cache.

However, after pushing the product out there and watching it being used for a while(*1), I decided to take it one step further. Using the same Go code, which had already been abstracted such that the results were pushed to an abstract stream, I simply iterated through every possible query given the intentional restrictions put on the client, pushing the results out to the filesystem.

The filesystem of course being a very robust, hierarchical key/value database.

Those interface and implementation restrictions were intentional choices that greatly simplified the server side. When you search for names, the client only pulls a single JSON file for the first letter and gender, then reuses it for more specific searches. For instance, if you were searching for female names starting with L, it is served via http://names.yafla.com/d/j/f/n/L.json.

If you pull up three names simultaneously, it makes three simultaneous requests, rather than increasing the entropy of requests by bundling three names together (the API could of course handle this, but early on I saw the downside). For instance if one of them were Mary it would pull http://names.yafla.com/d/j/f/c/Mary.json, Mary.json being the “key” to that database look-up.

This allows me to shut down the Go process entirely until I decide to change data formats (unlikely, as I am extremely satisfied with the efficient transfer and use). As a second benefit, it allows me to ultra-gzip-compress all of the files on the server and then, using the HttpGzipStaticModule, stream-serve those files with no on-demand compression by the server process: it just pushes bytes from disk (or, more likely, the in-memory disk cache) to the wire.

All of this comes together to provide a service that, while admittedly simple, could hardly be more efficient or faster. It could be hosted on the most meager of servers and would still serve as fast as the network pipe would allow. Requests are served in timeframes measured in microseconds, dwarfed by simple network latency (though the high level of compression keeps responses small, coupled with server-side network settings that allow many responses to fit in a single packet).

Of course, all of this seems obvious. Yet history has shown that most developers and teams, faced with the same task, would start by building a database system (sometimes no database is better), then their ORM, then the business logic that would feed the serialization, which would then go to the web layer. It’s how you end up with an intolerably slow web, where the simplest requests take hundreds of milliseconds, and the most trivial of projects needs to scale to dozens of machines to serve a basic load.

The web should be fast. When simple blogs can’t serve several hundred requests a day without falling over, something has gone seriously awry.

Anyway, I thought that was pretty cool. I strive for this sort of least-possible-overhead design wherever I can (in many cases it isn’t possible, such as with highly secure data that is deeply unique by user, though high-performance strategies still exist there), and the result is systems that are extremely high performance by design.

*1 – One interesting thing I’ve noticed as I’ve watched it posted on several social media sites: the vast majority of users don’t vote, leading to a situation that is perilously ripe for vote manipulation. I have watched certain sites send thousands of users to the Name Navigator, and most of them spend a surprisingly long time on the site, with many if not a majority going through multiple names over many minutes, some trying dozens of variations. Clearly they found it interesting and engaging. Yet all of those people yield but a couple of votes/likes/up arrows/etc. on the source site. I have long known this: having engaging content has a negative impact on your social media response rate, because people actually get sidetracked and engaged. Your best bet is a polarizing headline that people can agree with, which gets countless votes of confidence from people who never actually followed the link. This isn’t griping or sour grapes, and the fun side project has already gotten significantly more attention than I expected, but I’ve been tempted to engage one of the many “buy votes” organizations just to see how different things are.