Facebook Instant Articles

Both Google and Facebook introduced their own lightweight HTML subsets: AMP and Instant Articles, respectively. I mentioned AMP on here previously, and via an official WordPress plugin it’s accessible by simply appending /amp to any page’s URL. Both impose a restrictive environment that limits the scope of web technologies you can use on your page, allowing for fewer/smaller downloads and less CPU churn.

The elevator pitch for Facebook’s Instant Articles is an absolute monster, bringing an i5-4460 to its knees by the time the page has been scrolled to the bottom. There’s a bit of irony in the pitch for a lightweight, fast subset of HTML being a monstrous, overwrought, beastly page (the irrelevant background video thing is an overused, battery-sucking pig that was never useful and is just boorish, lazy bedazzling).

I’m a user of Facebook, with my primary use being news aggregation. As content providers herded onto the platform, I stopped visiting their sites and now simply do a cursory browse of Facebook periodically: BBC, CBC, The Star, Anandtech, The Verge, Polygon, Cracked, various NFL related things, and on and on. On the whole I would wager that these sites engaged in a bit of a Tragedy of the Commons in racing to the Facebook fold, though at some point the critical mass was hit and it became necessary to continue getting eyeballs.

The web is ghetto-ized and controlled.

More recently Facebook started corralling mobile users into their own embedded browser (for the obvious surveillance benefits). And now they’re pushing publishers to Instant Articles.

But the transition isn’t clean. Many sites are losing most of their embedded content (Twitter embeds, social media). Lots of pages are, curiously, throwing malformed-XML errors. And it is accidentally acting as an ad blocker on many web properties, the sites unintentionally squashing their own ads, filtered out as non-Instant Article compliant.

It’s interesting how quietly this is all happening. This would once have made for pretty big tech news (especially Facebook embedding the browser). Now it’s just a quiet transition.

The Reports of HTML’s Death Have Been Greatly Exaggerated…?

Feedback

Yesterday’s post titled “Android Instant Apps / The Slow, Inexorable Death of HTML” surprisingly accumulated some 35,000 uniques in a few hours. It has yielded feedback containing recurring sentiments that are worth addressing.

it is weird the article trying to sell the idea that apps are better posted XKCD images stating otherwise

While there are situations where a native app can certainly do things that a web app can’t, and there are some things it can simply do better, the prior entry wasn’t trying to “sell” the idea that apps are inherently better (and I have advocated the opposite on here and professionally for years where the situation merits). It was simply an observation of Google’s recent initiative, and what the likely outcome will be.

Which segues to another sentiment-

The reverse is happening. Hybrid apps are growing in number. CSS/JS is becoming slicker than ever.

The web is already a universal platform, so why the ████ would you code a little bit of Java for Android instead of writing it once for everything?

In the prior entry I mentioned that some mobile websites are growing worse. The cause of this decline isn’t that HTML5/JS/CSS or the related stack is somehow rusting. Instead it’s that many of these sites are so committed to getting you into their native app that they’ll sabotage their web property for the cause.

No, I don’t want to install your app. Seriously.

Add that the mobile web has seen a huge upsurge in advertising dark patterns: the sort of nonsense that has mostly disappeared from the desktop web, courtesy of the nuclear threat of ad blockers. Given that many on the mobile web don’t utilize these tools, the domain is rife with endless redirects, popovers, intentionally delayed page re-flows to encourage errant clicks (a strategy that is purely self-destructive in the longer term, as every user will simply hit back, undermining the CPC), overridden swipe behaviors, background space turned into ad clicks, and so on.

The technology of the mobile web is top notch, but the implementation is an absolute garbage dump across many web properties.

So you have an endless list of web properties that desperately want you to install their app (which they already developed, often in duplicate, triplicate…this isn’t a new thing), and who are fully willing to make your web experience miserable. Now offer them the ability to essentially force parts of that app on the user.

The uptake rate is going to be incredibly high. It is going to become prevalent. And with it, the treatment of the remaining mobile webfugees is going to grow worse.

On Stickiness

I think it’s pretty cool to see a post get moderate success, and enjoy the exposure. One of the trends that has changed in the world of the web, though, is in the reduced stickiness of visitors.

A decade or so ago, getting a front page on Slashdot — I managed it a few times in its heyday — would yield visitors who would browse around the site, often for hours on end, subscribe to the RSS feed, etc. It was a very sticky success, and the benefits echoed long after the initial exposure died down. A part of the reason is that there simply wasn’t a lot of content, so you couldn’t just refresh Slashdot and browse to the next 10 stories while avoiding work.

Having had a few HN and Reddit success stories over the past while, I’ve noticed a very different pattern. People pop on and read a piece, their time on site equaling the time it takes to read to the end, and then they leave. I would say less than 0.5% look at any other page.

There is no stickiness. When the exposure dies down, it’s as if it didn’t happen at all.

Observing my own habits, this is exactly how I use the web now: I jump from various programming forums to the linked papers and entries and posts, and then I click back. I never really notice the author, I don’t bookmark their site, and I don’t subscribe to their feed. The rationale is that when they have another interesting post, maybe it’ll appear on the sites I visit.


This is just the new norm. It’s not good or bad, but it’s the way we utilize a constant flow of information. The group will select and filter for us.

While that’s not a very interesting observation, I should justify those paragraphs: I believe this is the cause of both the growing utilization of dark patterns on the web (essentially you’re to be exploited as much as possible during the brief moment they have your attention, and the truth is you probably won’t even remember the site that tricked you into clicking six ads and sent you on a vicious loop of redirects), and of the desperation to get their app installed, where they think they’ll gain a more permanent space in your digital world.

Eat Your Brotli / Revisiting Why You Should Use Nginx In Your Solutions

Google recently deployed brotli lossless transport compression in the Canary and Dev channels of Chrome. This is the compression algorithm that they introduced late last year, hyping it up against competitors.

If your Chrome variant is equipped, you can enable it via (in the address bar)-

chrome://flags/#enable-brotli

It is currently limited to HTTPS, presumably to avoid causing issues with poorly built proxies.

Brotli is already included in the stable releases of Chrome and Firefox, albeit only to support the new, more compressible WOFF 2.0 web font standard. The dev channel updates just extend its use a bit, allowing the browser to declare a new Accept-Encoding option, br (it was originally “bro”, but this was changed for obvious reasons), and Google has authored server-side support in the form of an nginx module (itself a very lightweight wrapper around the brotli library. Nginx really is a study in elegant design).
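
If your nginx build includes that module, enabling it takes only a few directives. A minimal sketch, assuming nginx has been compiled with the ngx_brotli module; directive names are per that module, and text/html is compressed by default so it isn’t listed:

    # inside the http or server block, with the ngx_brotli module compiled in
    brotli            on;   # compress responses on the fly for clients sending Accept-Encoding: br
    brotli_comp_level 5;    # 0-11; higher levels shrink output further but burn more CPU
    brotli_types      text/css application/javascript application/json image/svg+xml;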

One of the great things about these on-demand extensible web standards is that they enable incremental progress without disruption — you aren’t cutting anyone out by supporting them (browsers that don’t support this can remain oblivious, with no ill effect), but you can enhance the experience for users on capable devices. This is true for both HTTP/2 and brotli.

Overhyped Incremental Improvements

Most of the articles about the new compression option are over the top-

“Google’s new algorithm will make Chrome run much faster,” exclaims The Verge. “Google Chrome Is Getting a Big Speed Boost,” declares Time.

Brotli will not reduce the size of the images. It will not reduce the size of the auto-play video. It can reduce the size of the various text-type resources (HTML, JavaScript, CSS), however the improvement over the already widely entrenched deflate/gzip is maybe 20-30%. Unless your connection is incredibly slow, in most cases the difference will likely be imperceptible. It will help with data caps, but once again it’s unlikely that the text-based content is really what’s ballooning usage, and instead it’s the meaty videos and images and animated GIFs that eat up the bulk of your transfer allocation.

Other articles have opined that it’ll save precious mobile device space, but again, brotli is for transport compression. Every single browser that I’m aware of caches files locally in their native form (e.g. a PNG at rest stays compressed with deflate because that is the format’s own internal compression, just as most PDFs are internally compressed, but an HTML page or JavaScript file transport-compressed with brotli, gzip, or deflate is decompressed on the client and cached in its decompressed form).

In the real world, it’s unlikely to make much difference at all to most users on reasonably fast connections, beyond those edge-case tests where an unrealistically tiny sample is made to fit in a single packet. But it is a small incremental improvement, and why not.

One “why not” might be if compression time is too onerous, and many results have found that brotli’s compression stage is much slower than existing options. I’ll touch on working around that later regarding nginx.

But still it’s kind of neat. New compression formats don’t come along that often, so brotli deserves a look.

Repetitions == Compressibility

Brotli starts with LZ77, a “find and reference repetitions” approach seen in virtually every mainstream compression algorithm.

LZ77 implementations work by looking some window (usually 32KB) back in the file to see if any runs of data have repeated and, if they have, replacing the repetitions with much smaller references to the earlier data. Brotli is a bit different in that every implementation lugs along a 119KB static dictionary of phrases that Google presumably found were most common across the world of text-based compressible documents. So when it scans a document for compression, it not only looks for duplicates in the past 32KB window, it also uses the static dictionary as a source of matches. They enhanced this a bit by adding 121 “transforms” on each of those dictionary entries (which in the code looks incredibly hackish: things like checking for matches with dictionary words plus the suffix “ and”, for instance, or for capitalization variations of the dictionary words).

As a quick detour, Google has for several years heavily used another compression algorithm: Shared Dictionary Compression for HTTP. SDCH is actually very similar to Brotli, however instead of a 119KB universal static dictionary, SDCH allows every site to define its own domain-specific dictionary (or dictionaries), which is then used as the reference dictionary. For instance a financial site might have a reference dictionary loaded with financial terminology, disclaimers, clauses, etc.

However SDCH requires some engineering work and saw extremely little uptake outside of Google. The only other major user is LinkedIn.

So Brotli is like SDCH without the confusion (or flexibility) of server-side dictionary generation.

The Brotli dictionary makes for a fascinating read. Remember that this is a dictionary that is the basis for potentially trillions of data exchanges, and that sits at rest on billions of devices.

Here are a couple of examples of phrases that Brotli can handle exceptionally well-

the Netherlands
the most common
background:url(
argued that the
scrolling="no"
included in the
North American
the name of the
interpretations
the traditional
development of
frequently used
a collection of
Holy Roman Emperor
almost exclusively
" border="0" alt="
Secretary of State
culminating in the
CIA World Factbook
the most important
anniversary of the
style="background-
<li><em><a href="/
the Atlantic Ocean
strictly speaking,
shortly before the
different types of
the Ottoman Empire
under the influence
contribution to the
Official website of
headquarters of the
centered around the
implications of the
have been developed
Federal Republic of

Thousands of basic words across a variety of languages, and then collections of words and phrases such as those above, comprise the Brotli standard dictionary. With the transforms previously mentioned, it supports any of these in variations such as pluralization, varied capitalization, suffixes like “ and” or “ for”, and a variety of punctuation variants.

So if you’re talking about the Federal Republic of the Holy Roman Emperor against the Ottoman Empire, Brotli has your back.

For really curious readers, I’ve made the dictionary available in 7z-compressed (fun fact – 7z uses LZMA) text file format if you don’t want to extract it from the source directly.

Should You Use It? And Why I Love Nginx

One of the most visited prior entries on here is Ten Reasons You Should Still Use Nginx from two-plus years ago. In it I explain how I love having nginx sitting in front of solutions because it offers a tremendous amount of flexibility at very little cost or risk: it is incredibly unlikely that the nginx layer, even if acting as a reverse proxy across a heterogeneous solution that might be built in a mix of technologies (old and new), will be a speed, reliability or deployment weakness, and generally it will be the most robust, efficient part of your solution.

The nginx source code is a joy to work with as well, and the Google nginx module (a tiny wrapper around Mozilla’s brotli library, itself a wrapper around the Google brotli project) is a great example of the elegance of extending nginx.

In any case, another great benefit of nginx is that it often gains support for newer technologies very rapidly, in a manner that can be deployed on almost anything with ease (e.g. IIS from Microsoft is a superb web server, but if you aren’t ready to upgrade to Windows Server 2016 across your stack, you aren’t getting HTTP/2. The coupling of web servers with OS versions isn’t reasonable).

Right now this server that you’re hitting is running HTTP/2 for users who support it (which happens to be most), improving speeds while actually reducing server load. This server also supports brotli because…well, it’s my plaything, so why not. It supports a plethora of fun and occasionally experimental things.
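
For the curious, turning on HTTP/2 in a recent nginx (1.9.5 or newer, built with the http_v2 module) is essentially a one-word change on the listen directive, assuming TLS is already configured; the host name and certificate paths here are placeholders:

    server {
        # browsers only negotiate HTTP/2 over TLS, so ssl comes along for the ride
        listen 443 ssl http2;
        server_name example.com;

        ssl_certificate     /etc/nginx/certs/example.com.crt;  # placeholder paths
        ssl_certificate_key /etc/nginx/certs/example.com.key;

        # the rest of the server block is unchanged
    }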

Dynamic brotli compression probably isn’t a win, though. As Cloudflare found, the extra compression time required for brotli nullifies the reduced transfer time in many situations: if the server churns for an extra 30ms to save 30ms of transfer, it’s a wash. Not to mention that under significant load it can seriously impair operations.

Where brotli makes a tonne of sense, however, and this holds for deflate/gzip as well, is when static resources are precompressed in advance on the server, often with the most aggressive compression possible. At rest the javascript file might sit in native, gzip, and brotli forms, the server streaming whichever one depending upon the client’s capabilities. Nginx of course supports this for gzip, and the Google brotli module fully supports this static-variation option as well. No additional computations on the server at all, the bytes start being delivered instantly, and if anything it reduces server IO. Just about every browser supports gzip at a minimum, so this static-at-rest-compressed strategy is a no-brainer, the limited downside being the redundant storage of a file in multiple forms, and the administration of ensuring that when you update these files you update the variations as well.
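
A sketch of that static-at-rest approach, assuming the .gz and .br variants are generated at deploy time (gzip_static is a stock nginx module enabled at build time; brotli_static ships with the ngx_brotli module):

    # serve foo.js.br or foo.js.gz when present and the client advertises support,
    # falling back to the plain file otherwise; nothing is compressed at request time
    location ~* \.(html|css|js|svg|json)$ {
        gzip_static   on;   # requires nginx built --with-http_gzip_static_module
        brotli_static on;   # requires the ngx_brotli module
    }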

Win win win. Whether brotli or gzip, a better experience for everyone.

Ten Reasons You Should Still Use Nginx

A mostly accurate retelling of a discussion I had with a peer yesterday-

Them – “Any new projects going on?”

Me – “Well aside from {redacted} and {also redacted…good try!}, still that big project I talked about before.”

Them – “Awesome! What technology stack for this one?”

Me – “{database} as the database, nginx in front of {platform} for the app layer, the presentation built on {whatever is cool}.”

Them – “Nice…but why nginx? Doesn’t {framework} give you a web server? Aren’t they redundant?”

I’ve had discussions just like this many times before for a variety of projects. Nginx in front of node.js. Nginx in front of a hydra of Windows hosted IIS and Linux hosted PHP, all magically meshed into one coherent front-end.

The assumption often being that nginx is rendered unnecessary if you have another technology that serves up some fresh HTTP.

Of course that isn’t true, and Nginx will remain the doorman for my projects for the foreseeable future, even where it sits in front of other HTTP servers that may themselves sit in front of other HTTP servers.

Theoretically you could sub in Apache and yield many of the same benefits, though Nginx has some architectural advantages that make it my choice.


The Top 10 Reasons You Should Still Use Nginx

(even in front of other HTTP servers)
  1. SSL – Restrict access to your SSL private key to the greatest extent possible; if you don’t have a hardware SSL offload appliance, let the gatekeeper nginx instance handle TLS termination securely and efficiently.
  2. HTTP/2 – It’s unlikely that your in-app HTTP library supports HTTP/2 or will anytime soon, much less supporting changes in the spec. Given that there are some HTTP/2 detractors out there, note that it primarily adds value for high latency users, and is of particular value to small web services that can’t economically deploy GeoDNS targeted servers around the world.
  3. Static Content Serving — Don’t clutter your app layer with code or artifacts for static content — limit that world to dynamic logic. In the long term you’ll probably move to a CDN service, but projects seldom start there. Make use of precompression to instantly serve up compressed resources to clients that support it (which is effectively all clients) with minimal overhead.
  4. Making Dynamic Content Static — nginx can be configured to cache dynamic proxied content per the expiration headers, allowing you to cache efficiently at the gatekeeper without adding error-prone caching code to your application layer. Again, keep your application layer as simple as absolutely possible, leaving the long-solved problems to well proven solutions.
  5. Abstract your implementation — A single exposed host can sit in front of a number of different underlying technology platforms (on one machine, or on many), the published structure being nothing more than some simple configuration points. /service/users may point to a PHP implementation, while /service/feed utilizes a Go host, and /service/api/ calls out to node.js (see the config sketch after this list). With Nginx in front, everything becomes flexible and abstract, your implementation amorphous and unconstrained. It also lubricates updates: you can spin up new back-end servers on different ports, update and reload the config, and take down the old servers, the upgrade entirely transparent to end users, all done through a simple deployment script.
  6. Load balancing — Nginx has built-in support for load balancing. Outside of the ability to actually load-balance (at any layer), this also facilitates removing failed backend services from any node in the structure. The flexibility and power are enormous.
  7. Rate Limiting — This is, again, one of those oft-ignored functions that often catches out web apps when an avalanche occurs and there is no way to stop it short of writing a lot of emergency code. The culprit is often malfunctioning clients rather than malicious actors, and it’s liberating to have a bouncer that can rate limit detrimental callers with ease, changeable at a moment’s notice.
  8. Geo Restrictions — You’ll likely have high-privilege management calls that you know will only be legitimately called from specific geographical regions. While it provides negligible actual security, adding gatekeeper geo-IP restrictions allows you to eliminate the enormous number of brute-force attacks you’ll inevitably see from Ukraine, China, etc. By filtering out that noise, not only do you eliminate a lot of attack processing overhead, you’re left with logs that make it more likely you can actually detect and extract targeted attacks.
  9. Authentication Restrictions — Covering the same ground as the prior points, having this gatekeeper allows you to, on a moment’s notice, implement authentication on any resource(s). Whether this is due to the realization of an exploit in a particular part of your code, or during an internal beta period, it’s simply flexibility that you may not want to build into your application.
  10. Battle Link Rot — To give a personal experience on this, some time back I maintained a fairly popular blog on a completely different domain, with a different directory structure and technology platform (it was on a custom blog platform built with .NET/Windows, while this is WordPress on Linux). This became less important to me as I focused more on hidden-from-the-world proprietary things, so I let it sit unloved, yet somehow it retained thousands of RSS subscribers (most of whom probably forgot I was in their subscription list) and frequent search engine forwards. Later I felt it (the technology) was a bit of baggage, and I wanted to use the domain for something else, so I cast it off and simply moved to something entirely different, demolishing what existed in the process.

    I became a major contributor to link rot. Many existing links throughout the web simply stopped working, countless users navigating to 404s. I did that deconstruction over a year ago, yet still the logs showed an endless procession of users trying to access no-longer-available content.

    Something had to be done. Think of the poor users!

    Okay, add that I regained a professional interest in having the internet’s ear, so to speak: in having a venue where I can get enough exposure that good ideas, executed well, might get the initial kick that allows them to succeed. I wanted to take advantage of all of that link love that I had gained. And all that was needed was a couple of basic nginx rewrite rules, turning once-derelict URLs into their new counterparts via permanent redirects. Upper-case to lower-case, a new directory structure, hyphens instead of underscores, and so on.

    And the traffic returned almost immediately: Google transferred that old link rank to the new domain, and I’m seeing search queries come through again. Because nginx provided the flexibility.

    The same happens with real apps as teams evolve the API structure. Again, and this is a recurring theme, you don’t want to bake this into your application logic: offload everything extraneous to a system like nginx that already provides it.
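
To make a few of these points concrete, here is a minimal, illustrative nginx sketch covering the reverse-proxy abstraction, load balancing, rate limiting, and link-rot rewrites described above. The host name, ports, upstream name, and URLs are all hypothetical; the limit_req_zone and upstream directives belong in the enclosing http block.

    # hypothetical backends: two node.js instances behind one public host
    upstream api_backend {
        server 127.0.0.1:3000;
        server 127.0.0.1:3001;
    }

    # simple throttle: 10 requests/second per client IP (http-block scope)
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

    server {
        listen 80;
        server_name example.com;

        # item 5: different paths can hit entirely different platforms
        location /service/users/ {
            proxy_pass http://127.0.0.1:8080;   # whatever HTTP-speaking backend sits there
        }

        location /service/api/ {
            limit_req zone=api_limit burst=20;  # item 7: rein in misbehaving callers
            proxy_pass http://api_backend;      # item 6: load balanced across the pool
        }

        # item 10: permanent redirects mapping the old structure onto the new
        rewrite ^/Blog/Old_Post_Name\.aspx$ /blog/old-post-name permanent;
    }

Swap backends, add or remove upstream servers, or reroute a path, and a config reload is all the deployment ceremony required.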

Such is why I have and will continue to layer nginx in front of other solutions. It adds enormous flexibility and deployment opportunity, and it would be a serious mistake to eschew it, or to litter your code with reinventions of the wheel when such a good wheel already exists.

 

Pet Store 2011 – Metrics Instead of Evangelism

A Blueprint-Driven Implementation

Sun’s “Pet Store” eCommerce application, released back in the early 2000s, was intended to be a blueprint of a best-practice-conforming J2EE application. Microsoft took that blueprint and created a .NET simile, using it to highlight purported performance advantages of .NET.

Much ink was wasted arguing the merits of the comparison, with various players optimizing and re-engineering the entrants. Eventually a more equal comparison was created by a J2EE consulting company (The Middleware Company, which has since been folded into other operations), who published an optimized Java implementation while Microsoft did the same with the .NET variant, the two facing off in the cage. Microsoft’s platform still came up the winner, though many argued that The Middleware Company acted as a stooge for Microsoft.

There were still a lot of complaints — no one is ever satisfied with a benchmark unless their favorite wins — but it was a somewhat fairer fight.

We need to revisit that model with the currently available stack of technologies: a Pet Store in Ruby on Rails atop MongoDB, versus .NET over SQL Server, versus PHP over MySQL, and so on, all featuring the same public RESTful interface and APIs, accommodating the data needs however is seen fit.

Empirically measuring actual efficiency and performance rather than simply taking a lot of hot air and evangelism as a surrogate, which sadly is where we are today.

PHP Is Fast Since When?

A CEO recently exposed their ignorance to the world in an essay explaining why they don’t hire .NET programmers [EDIT 2016 – I posted a recent essay that seems to support what this CEO said, though the nuances differ significantly, and it exists in a world where the options and varieties available to all developers have exploded since this original discussion]. To say that it was universally condemned is a fair assessment. Even amongst anti-Microsoft camps there is a lot of appreciation that .NET is actually a pretty good platform, definitely holding its own among the top contenders, and that the author was simply a misled bigot (who, I suspect, thought that such bigotry would play to the crowd. Maybe in 2005, when Microsoft was the boogeyman and everything they touched was deemed evil, but not in 2011 when they’re the underdogs).

I was pleased with the response he got, but some off-the-cuff comments surprised me. Some offhandedly commented on .NET’s supposed poor performance relative to PHP, more than a few calling PHP fast.

PHP is fast? Since when? I’ve been dealing with PHP and .NET code (among others) in the stack for years, and the one adjective that I would never use for PHP is fast. With various accelerators and hacks it can be made workable with a big enough scale-out, but it is not an efficient platform. Choosing PHP as a general rule means slower page generation times, and with growth a larger and larger scale-out need.

It still powers many of the top sites today (and enabled their growth), so clearly it has a lot going for it, but speed is not one of those things, unless compared to the worst possible alternatives.

The problem with PHP is that it suffers from a Shlemiel the Painter inefficiency, with a “start the world” processing model that, in a reasonably sized application, does an astounding amount of work for even trivial requests as your application base grows (see SugarCRM, for instance). It’s for this reason that a “Hello World” PHP demonstration is so terribly misleading and has little applicability to real-world use.

PHP offers a platform that is difficult to optimize because of fundamental implementation decisions early on, where any include of an include can, at any point in the execution flow, change the state of the world, making it difficult to devise any strategy that shares the pre-executed state among requests.

PHP is slow.

Of course there is Facebook’s “HipHop” initiative, which is a code generation utility that uses a subset of PHP as an input, the output being C++ (Wasabi!), which is then compiled into native code. To do this Facebook had to follow a number of practices that limited the ability of PHP to sabotage its own performance, making it lifecycle compatible with such a transformation. The end result is not PHP, however, and it does not practically carry over to any other site.

It’s worth noting that several very large sites used C++ authored ISAPI modules for the bulk of their processing, so it can and has been done directly many times before.

We Need Some Model Benchmark

When I engaged in the whole NoSQL debate previously ([1][2] [3]), one of the primary complaints I had about the NoSQL movement was the shocking lack of actual empirical metrics. Lots of broad claims were being made, but remarkably little was actually demonstrated. Just lots of unsupported claims about performance that didn’t pass basic skepticism.

Of course it isn’t all about per request performance or runtime efficiency. Development efficiency is critically important, and a product like MongoDB can be a hugely efficient development target. But isn’t it better to work with the real numbers, making decisions on fact instead of emotion?

It would be ideal to agree to some common interface, functionality and API patterns of some model applications (eCommerce Pet Store 5.0, social news Reddit simile, Twitter simile), and then let the evangelists loose to create test-passing implementations in their platform of choice.

The benchmark platforms can then be evaluated for performance, efficiency (Reddit runs a monthly cloud server charge of something like $35,000, with the result that they have a terribly unreliable, often unresponsive site; efficiency is *hugely* important even if it seems cheap at the outset to throw hardware at the problem), and operational resiliency. Most importantly they could be evaluated from a security perspective, which is a grossly ignored aspect that has reared its head time and time again, in some cases destroying organizations because they eschewed basic security practices.

Demonstrate that your advocated solution serves up pages quickly and efficiently, on a resilient, secure platform, and win the argument.