The Instagram Conundrum
Many of the people I’ve spoken to about Gallus (as I awkwardly inject it into completely unrelated conversations, because surely video stabilization has a place in a discussion about soft cheeses?) have naturally questioned why such a product is so difficult on Android: why Instagram never brought their offering to the platform, and why, in their absence, not a single alternative appeared. At least, none appeared until I finally took up the challenge and, heroically and through great sacrifice, made it happen… or rather, saw an opportunity to do something I thought might yield some hopefully easy and wide press that I could selfishly exploit.
While the press side has been pretty sedate thus far — I perhaps naively thought that if I committed the time and effort and created a great solution, attention would naturally follow — I’m still holding out hope that it catches some wind, even if my enthusiasm has been somewhat deflated now that we’re two weeks past release and the app still sits unnoticed.
In any case, since this question keeps coming up, I’ll go into a bit of detail, as it lets me unload a bit of a rant. As much as I might adore Android (I’ve been a fan since gearing up with an HTC Dream G1, and a G2 after it, when the platform was first introduced), I pull absolutely no punches when criticism is deserved, and for all of the good parts of the system, there are a lot of parts that weren’t well considered, and the entire platform and userbase pay the debt.
There are a lot of fundamental technical difficulties in such a solution: the complexities of OES textures, camera previews and their device-to-device differences, the computed field of view and focal depth and their impact on the corrective movements applied, calculating fitting splines, and managing high volumes of data collection, all on a managed runtime while minimizing garbage (and with it, deadly garbage-collection pauses; as an aside, the Samsung Galaxy S3 runs this application with gusto, but hyper-aggressively collects garbage despite having literally hundreds of megabytes free and unused) and distributing the work across a number of cores in a layered series of synchronized queues.
Even the ability to play a compressed video file at high speed is much more difficult than it might first appear.
One of the reasons I set Android 4.4 as the baseline is that it was the first version to offer all of the encoding and “muxing” functionality (muxing, short for multiplexing, is basically interleaving the encoded video, audio, and metadata into a file; demuxing reverses the process when decoding) that Gallus needed, at a granular enough level that I could avoid third-party H.264 libraries such as libVLC or FFmpeg, most of which come with complex licensing requirements. By using the system implementations I could also be more confident that the platform’s hardware assists would be used optimally: whether on a Snapdragon, Exynos, Tegra, or x86 device, whatever hardware encoding or decoding functionality it offered would be fully leveraged.
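To make “interleaving” concrete, here is a minimal, purely illustrative sketch of what a muxer conceptually does: merge encoded video and audio packets into one stream ordered by presentation timestamp. (On Android 4.4+ the real work is done by the platform’s media classes; this hypothetical helper is not that API.)

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch only: interleave video and audio packets by
// presentation timestamp (milliseconds), the core idea behind muxing.
public class InterleaveSketch {
    static List<String> interleave(long[] videoPts, long[] audioPts) {
        List<String> out = new ArrayList<>();
        int v = 0, a = 0;
        while (v < videoPts.length || a < audioPts.length) {
            // take the packet with the earliest timestamp next
            boolean takeVideo = a >= audioPts.length
                    || (v < videoPts.length && videoPts[v] <= audioPts[a]);
            if (takeVideo) out.add("V@" + videoPts[v++]);
            else out.add("A@" + audioPts[a++]);
        }
        return out;
    }
}
```

A real muxer also carries codec configuration and rotation metadata, but the ordering problem above is the heart of it.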
It wasn’t easy. But I generally don’t look for easy projects (whether in databases, financial computation systems, or a mobile app); predictable complexity is a fun challenge, and one of the greatest motivators. It is fantastically rewarding to take on a difficult challenge, analyze all of its parts and problems, and overcome it.
The part that just drove me a little bit batty, on the other hand, is Android’s utter insanity when it comes to anything involving timing: the part that should be the easiest in the application, but on Android becomes the most difficult and sloppiest, full of estimations and approximations.
For something that so easily could have been rationalized and standardized at the system layer, it instead is left as an exercise for app developers, and this confusion continues even with brand new APIs (as I’ll explain later).
That Weighs 457.926 Units
Take the example of motion events — tracking the gyroscope and accelerometer.
Consider how iOS does it. For almost all such events, your event handler is given some data in an instance derived from CMLogItem. That instance provides, across all motion events, a simple double-precision timestamp representing the seconds since the device booted.
That’s it. I know that the “shutter” was open between, say, 11.02 and 11.03 seconds, so I find the closest logged motion events to 11.025, or interpolate between the surrounding events. Straightforward, and 100% consistent across devices, because they all honor this precise contract.
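Sketched as code (a hypothetical helper, not anything from Gallus or either platform’s SDK), matching a frame against a consistently timestamped motion log is nothing more than a linear interpolation:

```java
// Hypothetical sketch: with one consistent clock across all events (the
// contract iOS provides), mapping a frame's mid-exposure time onto the
// motion log is a trivial interpolation.
public class MotionInterpolator {
    // times: ascending sample timestamps, seconds since boot
    // values: e.g. angular velocity about one axis at those times
    public static double sampleAt(double[] times, double[] values, double t) {
        if (t <= times[0]) return values[0];
        for (int i = 1; i < times.length; i++) {
            if (times[i] >= t) {
                // linearly interpolate between the two surrounding samples
                double f = (t - times[i - 1]) / (times[i] - times[i - 1]);
                return values[i - 1] + f * (values[i] - values[i - 1]);
            }
        }
        return values[values.length - 1]; // t is newer than every sample
    }
}
```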
Android timestamps motion events as well; there’s a timestamp value right there. This matters because there can be a considerable delay between when the platform logs an event and when it actually percolates up to your application, sometimes in batches. It matters even more on Android than on iOS, given that Android adds GC pauses and generally has more underlying system services and applications competing for clock cycles, and you’re unlikely to be holding the time slice when the event occurs. Every event is going to be delayed to some degree.
You are dealing with the world in batches.
Regardless, “Time in nanoseconds” looks promising. So what time in nanoseconds do you think that is?
Could it be System.nanoTime() (a timer that only counts while the device is awake)? SystemClock.elapsedRealtimeNanos(), which also counts while the device is asleep? Or, if someone was really inebriated when they implemented it, some random epoch of unknown lineage, which you are neither told nor can interrogate through any mechanism?
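For reference, here are the candidates side by side in a sketch. Only System.nanoTime() exists in plain Java; the Android-only clock and the mystery vendor epoch are shown as comments, because nothing off-device (and, for the third, no API on-device either) can produce them:

```java
// The candidate time bases a sensor timestamp might be using.
public class ClockCandidates {
    public static boolean nanoTimeIsMonotonic() {
        long a = System.nanoTime();                      // counts only while awake
        // long b = SystemClock.elapsedRealtimeNanos();  // counts through sleep (Android-only)
        // long c = ???;                                 // vendor epoch: no API reveals its basis
        long b = System.nanoTime();
        return b - a >= 0; // nanoTime is guaranteed never to run backwards
    }
}
```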
It can actually be any of the above, and thus far I’ve come across all three on different devices. Worse, it may vary across sensors on the same device. I’ve never come across that worst case, but given the guarantees, or lack thereof, of the platform, you have to program defensively as if it might happen, because any device has free rein to define the timestamp (nanoseconds from, or to, something) however it wants, and to refuse to share what that definition is.
So you’re grabbing frames of video, and you’ve figured out that a given frame came from what elapsedRealtimeNanos says is 2929292838373, which was probably about 30ms ago. You have some gyroscope data, the most recent tagged with the random-epoch number 8573363283838, and various values going back from there.
Now match them up. And do it within a millisecond or two of variance, because anything greater is unacceptable. The system provides zero mechanism to determine either what the sensor’s time basis is, or what the offsets are, if any. It’s obviously easy enough to figure out that going back 30ms is -30,000,000 nanoseconds; but going back from what?
You can assume that the moment you received the most recent event equals “now,” and compute based upon that, but you have absolutely no hope of adequate accuracy: you cannot know whether the system and/or sensor delayed events for 10ms, or 200ms, or anywhere in between.
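That best-effort fallback looks something like the following sketch (hypothetical names throughout): treat the freshest event’s arrival time as its log time, derive an offset from the unknown epoch to a clock you control, and accept that every mapped timestamp is wrong by exactly the delivery latency of that one event, which you cannot measure.

```java
// Hypothetical sketch of the fallback: assume the newest sensor event
// was logged "now" on a clock we can read, and rebase everything else
// through the resulting offset. The systematic error equals the
// (unknowable) delivery latency of that anchoring event.
public class EpochEstimator {
    private final long offset; // knownClockNow - latestSensorTimestamp

    public EpochEstimator(long latestSensorTs, long knownClockNow) {
        this.offset = knownClockNow - latestSensorTs;
    }

    // map any sensor timestamp onto the known clock's basis
    public long toKnownClock(long sensorTs) {
        return sensorTs + offset;
    }
}
```

Using the numbers above: anchoring 8573363283838 to 2929292838373 maps a gyro sample from 30ms earlier to 30ms before the anchor, correct only if that anchoring event arrived with zero delay.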
It’s fine for a simple game. It presents a grossly unnecessary complexity for more precise uses, which is exactly the issue that Gallus, and most certainly Instagram, faced.
I think I achieved a pretty good solution via a lot of code and conditionals, but it’s all so entirely unnecessary. If the system simply defined a time basis for these events, and took responsibility for that time basis, apps would be more consistent and higher quality (if you come across a developer who claims it’s concretely one of the above, they’re probably putting out bad code because the single device they have conforms to that rule, while the actual platform provides no guarantees). For the system to guarantee that events arrive on some consistent, tightly defined time basis is absolutely fundamental, and is trivial at the source, where all of the information is available, but becomes much more of a nuisance at the consumer end.
What’s Done Is Done
That issue with the sensors has been there from the beginning, even though it remains curiously vaguely defined, and many developers continue to operate on wrong assumptions. But what about brand new interfaces? Gallus can optionally use the brand-new camera2 interface, a huge advantage of which is that it offers a “starting capture” event at the beginning of every frame, rather than having to essentially guess. That event is timestamped.
And, in a great improvement, it might be elapsedRealtimeNanos. Or it might not.
On the Nexus 5, for instance, the timestamp source is inexplicably declared “UNKNOWN”. So you can take that unknown-basis timestamp from the camera and compare it with the unknown-basis values from the sensors.
Only, on that device it is actually nanoTime (I would certainly not wager on this long term, and I calculate their correlation on each wake). So why didn’t they just figure out the nanoTime -> elapsedRealtimeNanos offset on wake, and include that with every event, making life better for every developer? Simply make the spec “this always contains elapsedRealtimeNanos” instead of adding this additional bit of farce?
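That per-wake correlation is simple arithmetic. A sketch, with the Android-only clock passed in as a parameter since SystemClock.elapsedRealtimeNanos() does not exist off-device (System.nanoTime() is plain Java):

```java
// Hypothetical sketch: sample both clocks at (roughly) the same instant
// on wake, keep the offset, and use it to rebase camera timestamps that
// arrive on the nanoTime basis onto the elapsedRealtimeNanos basis.
public class ClockBridge {
    private final long offset; // elapsedRealtimeNanos - nanoTime at calibration

    // on Android you would pass SystemClock.elapsedRealtimeNanos()
    // and System.nanoTime(), sampled back to back
    public ClockBridge(long elapsedRealtimeNanosNow, long nanoTimeNow) {
        this.offset = elapsedRealtimeNanosNow - nanoTimeNow;
    }

    // map a nanoTime-based timestamp onto the elapsed-realtime basis
    public long rebase(long nanoTimeStamp) {
        return nanoTimeStamp + offset;
    }
}
```

This is exactly the work that is trivial for the platform, which knows when it wakes, and a recurring nuisance for every app, which must re-derive it defensively.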
And there it is again, that classic issue. I honestly cannot comprehend why this pattern recurs throughout Android: really basic things that are simple problems in the system software are instead turned into actual serious issues for every app, because the platform won’t concretely define things with a single unit of measure and commit to sticking with it. It is an abdication of responsibility, turning a tiny issue (normalizing times) into a big one.