This entry is a follow-up to Interesting Facts About Domain Names.
It’s time for some bubble-chart fun with the .COM domain space (analyzing the 47.7 million+ .COM domains, distilled from the 150 million+ row zone file). My original entry was written purely to satisfy personal curiosity, and as a test-bed for a publicly obtainable mid-sized dataset, but the surprising interest has me revisiting the topic (while finalizing another, much more interesting comparison against logical sequences and dictionary values). This outing is far more obscure than the first entry, and the charts are nowhere near as instantly informative, but I found the results fascinating nonetheless. The next entry on this subject will be much more immediately consumable.
This chart needs a bit of explanation. That is usually a bad sign, as charts should normally be self-explanatory, but in this case it’s graphing something a bit more complex, so some clarifications are in order.
Length is the number of characters in the domain name, excluding the TLD (top-level domain, which in this case is .com). For instance, yafla.com is a 5-character domain. While 0 is plotted on the axis, only domains 3 or more characters long are charted.
Diversity is a measure of how repetitious a domain name is. The vertical scale runs from domains composed of a single repeating character (e.g. aaaaaaaaaaaa.com) at the bottom to domains where every character is unique (e.g. abcdefghijkl.com) at the top. In this case the diversity calculation was implemented as a C# .NET scalar function, used directly from the SQL set operations. The bubble sizes vary with the number of domains that match a particular diversity and domain length, and bubbles that are too small are not displayed.
For instance reddit.com has a calculated diversity of 80%, while the shorter yafla.com has a calculated diversity of 75%.
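The C# function itself isn't shown, so the exact calculation is unknown. One formula that reproduces both figures above, and the 0% and 100% endpoints, is (unique characters - 1) / (length - 1). The Python sketch below uses it; the function name `diversity` and the handling of one-character names are assumptions for illustration only.

```python
def diversity(domain: str) -> float:
    """Hypothetical diversity score: 0.0 for one repeated character,
    1.0 when every character is unique. Not the original C# function."""
    # Drop the TLD, since the article measures length without it.
    name = domain.lower().rsplit(".", 1)[0]
    if len(name) < 2:
        return 0.0  # assumption: a single character has no diversity
    return (len(set(name)) - 1) / (len(name) - 1)


print(diversity("reddit.com"))  # 0.8  (5 unique characters out of 6)
print(diversity("yafla.com"))   # 0.75 (4 unique characters out of 5)
```

Subtracting 1 from both counts is what pins a single repeated character to exactly 0% rather than 1/length.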
The bubbles have been normalized so that each is sized relative to the total count at its length, which means less popular lengths are intentionally disproportionately large (otherwise they would be drowned out). At the shorter lengths the logical, “legitimate” domains vastly outnumber the repetitious or random domains, whereas at the longer lengths a larger percentage of the domains are repeating characters or random sequences, and this is evident on the chart.
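As a rough illustration of that per-length normalization, the sketch below turns raw (length, diversity bucket) samples into each bubble's share of its own length column. The function name and the input shape are assumptions; the original work was done in SQL.

```python
from collections import Counter, defaultdict


def normalized_bubbles(samples):
    """samples: iterable of (length, diversity_bucket) pairs, one per domain.
    Returns {(length, bucket): share of all domains at that length}."""
    counts = Counter(samples)
    totals = defaultdict(int)
    for (length, _bucket), n in counts.items():
        totals[length] += n
    # Divide by the column total, not the grand total, so rare lengths stay visible.
    return {key: n / totals[key[0]] for key, n in counts.items()}


demo = [(3, 100)] * 3 + [(3, 50)] + [(20, 100)]
print(normalized_bubbles(demo))
```

With this scaling, a single 20-character domain gets a full-size bubble in its column even though it is a tiny fraction of the whole dataset.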
As the length of the domain increases, the probability of character collisions mathematically increases, explaining why the diversity declines at a fairly predictable rate. The highly diverse
domains at longer lengths are usually seemingly nonsensical domains (such as 9876543210ZYXWVUTSRQPONMLKJIHGFEDCBA.com), as are the low diversity domains (e.g.
401K-401K-401K-401K-401K-401K-401K-401K-401K-401K-401K-401K.com, A————————————————————-A.com, FREE-FREE-FREE-FREE-FREE-FREE-FREE-FREE-FREE-FREE-FREE-FREE.com).
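To put a number on the collision effect, the sketch below computes the chance that a name of a given length repeats at least one character. It assumes characters are drawn uniformly at random from a 37-symbol alphabet (a to z, 0 to 9, and the hyphen). Real domains are nothing like uniform, so treat this as a floor rather than a model of the data.

```python
def collision_probability(length: int, alphabet_size: int = 37) -> float:
    """Probability that a random string of `length` characters contains
    at least one repeated character (uniform-draw assumption)."""
    p_all_unique = 1.0
    for used in range(length):
        # The next character must avoid the `used` characters already seen.
        p_all_unique *= max(alphabet_size - used, 0) / alphabet_size
    return 1.0 - p_all_unique


for n in (3, 5, 10, 20, 36):
    print(n, round(collision_probability(n), 3))
```

Past 37 characters a repeat is guaranteed, which is why a fully unique name such as the 36-character example above sits so close to the limit.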
This second bubble chart shows the clusters of domains starting with each numeric character at differing lengths. The more domains of a given starting character and length, the larger the bubble.
What intrigued me about this chart was that some numbers have odd distribution patterns. For instance, 8 as a starting character sees a generally declining prevalence as the length increases, with 3,352 domains starting with 8 at a length of 9 characters, but then suddenly there are 8,940 domains starting with 8 at a length of 10. Looking at the actual matching data made the cause instantly clear: 1-800 numbers. With the leading 1 and the dashes dropped, 1-800 numbers are 10 characters in length (e.g. 8004INJURY.com).
Similarly, 1 holds steady on a gradual decline as the length increases, but then suddenly spikes at 11 characters (from 18,328 instances at 10 characters long to 24,993 instances at 11 characters). The cause is the same as for 8, but in this case the 1 prefix is kept, which makes the number 11 characters long.
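A tally like the one behind this chart can be sketched in a few lines of Python. The function names are hypothetical, and "spike" is defined here simply as a count that rises over the previous length, which is an assumption rather than the method used for the chart. The demo counts are the four figures quoted above.

```python
from collections import Counter


def count_by_first_char_and_length(domains):
    """Tally .com domains by (first character, length without the TLD)."""
    counts = Counter()
    for domain in domains:
        name = domain.lower()
        if name.endswith(".com"):
            name = name[:-4]
        if name:
            counts[(name[0], len(name))] += 1
    return counts


def rising_lengths(counts, first_char):
    """Lengths at which the count is higher than at the previous length."""
    by_length = {length: n for (char, length), n in counts.items() if char == first_char}
    return [
        length
        for length in sorted(by_length)
        if length - 1 in by_length and by_length[length] > by_length[length - 1]
    ]


article_counts = Counter({("8", 9): 3352, ("8", 10): 8940, ("1", 10): 18328, ("1", 11): 24993})
print(rising_lengths(article_counts, "8"))  # [10]
print(rising_lengths(article_counts, "1"))  # [11]
```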
The spike for 6 at 8 characters long is an oddity, but I discovered that Netflix registered a huge array of largely sequential 8-character values starting with 6 (e.g. 60142240.com, 60155520.com, etc.), letting them sit as parked pages. I’m not sure what the speculation is on these (SKUs perhaps?). Maybe they’re going to give every customer their own domain by customer
On the same theme, this bubble chart shows the population distribution of domains starting with alphabetic characters at various lengths, with A to Z running from left to right. (The charting tool completely disallowed characters on the X-axis, and I haven’t had time to image them in.) This is pretty much as expected. S is the fattest teardrop, eighth in from the right.
Internationalized Domain Names have been filtered out of both results.
The Letter S
Speaking of S, many speculated that S was the dominant starting character for domains because of sex-related domains. While that’s a reasonable guess, domains starting with sex comprise only 80,277 of the 4,330,172 domains starting with S, or under 2% (of course there are more that mask it in variations like S-E-X, but they’re relatively few in comparison). Instead, S is just a popular starting character, particularly among domains starting with STA, SAN, SOU, SHO, STE, SHA, SUP, STO, and STR (which together comprise 1 million of the domains).
Prevalent Starting 3-Letter Sequences
That chart naturally raises the question of which 3-letter sequence is most prevalent.
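For anyone wanting to reproduce this kind of ranking on their own list of domains, a minimal Python sketch follows. The function name, its parameters and the sample list are made up for illustration, and it is not the SQL used for the chart.

```python
from collections import Counter


def top_prefixes(domains, size=3, limit=10):
    """Return the most common all-letter starting sequences of `size` characters."""
    counts = Counter()
    for domain in domains:
        name = domain.lower()
        if name.endswith(".com"):
            name = name[:-4]
        prefix = name[:size]
        # Skip names that are too short or that start with digits or hyphens.
        if len(prefix) == size and prefix.isascii() and prefix.isalpha():
            counts[prefix] += 1
    return counts.most_common(limit)


sample = ["start.com", "STAR.com", "sandy.com", "123abc.com", "st.com"]
print(top_prefixes(sample, limit=2))  # [('sta', 2), ('san', 1)]
```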
Just some cute charts while I find time to complete the more interesting, human-interest domain name analysis.