The Tag Overpopulation — It’s Time for a Cull

One of the continuing trends of the Web 2.0 revolution is tag-mania: sticking tags on everything and anything, hoping that it somehow improves the flow, digestion, and utility of information. From adding tag clouds to your blog, to Slashdot, to photos, to bookmarks, tags have continued to spread across the web landscape.

Burlington Skyway

As with every tech “revolution”, in corporations across the globe eager employees are embracing the trend, advocating adding tags to documents and directories and files, and embracing the concept of metadata.

As a bit of an explanation for those who haven’t been following TechCrunch in morbid curiosity (wondering what dubious business came out of super-secret stealth alpha invite-only mode today) and thus aren’t up on their Web 2.0 lingo: tags are, in essence, a set of words that one or more users apply to something to categorize it. They are what we historically called keywords, albeit sometimes (though not always) with a “democratic” process determining the rendered tag set.

For instance, the tags of this post might be “Web 2.0, tags”. Ten visitors might add “tripe”, making it the dominant tag in the tag cloud.
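To make the “democratic” part concrete, here is a minimal sketch of how a tag cloud might pick its dominant tag by simple counting. The function names and the lowercase normalization are assumptions for illustration, not how any particular site does it.

```python
from collections import Counter


def build_tag_cloud(applied_tags):
    """Count how often each tag was applied, ignoring case and stray spaces."""
    return Counter(tag.strip().lower() for tag in applied_tags)


def dominant_tag(cloud):
    """Return the most frequently applied tag, or None for an empty cloud."""
    if not cloud:
        return None
    return cloud.most_common(1)[0][0]


# The author tags the post once; ten visitors each add "tripe".
author_tags = ["Web 2.0", "tags"]
visitor_tags = ["tripe"] * 10

cloud = build_tag_cloud(author_tags + visitor_tags)
print(cloud)
print(dominant_tag(cloud))
```

A real tag cloud would typically scale font size by these counts, but the ranking underneath is nothing more than a tally.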

Getting a variety of people adding tags to the same content, or building a common directory of information loosely categorized by tags, is what’s commonly called a folksonomy. Consider, for comparison, a formal taxonomy of a system like Yahoo’s classic categorization, where a submitter would choose exactly where in the hierarchy a link went, and the Yahoo overlords would validate it, and insert it if appropriate. In a folksonomy, the loose addition of tags instead lets content pick up multiple categorizations over time.

[Web 2.0 aware readers will probably shudder seeing an explanation of something so “basic”, yet discussions in the field have led me to believe that much of this great revolution has gone unnoticed by the bulk of society, including even the majority of technology workers. I regularly converse with people who’ve never seen del.icio.us, don’t know who 37signals are, and haven’t been to Reddit or Digg or Flickr or Furl. Much like bloggers have grossly overestimated the impact of blogs on the general population, there seems to be a presumption that the Web 2.0 lingo and dogma are more universal than they actually are.]

While many of the Web 2.0 aficionados declare there to be a fundamental religious difference between the venerable keyword and tags, the difference is superficial at best (democratically selected keywords are still just keywords). The same keywords that have always existed as a data block in the JPEG file format, and exist in virtually every document format (Word, for instance), form the foundation of tags. Metadata has been around since we first started storing data, and tags are a continuation of that trend.

Many of the foundations of modern tagging, the evolution of the keyword, were first demonstrated widely by the superlative web photo organizing and sharing application Flickr.

Given the primitive state of image recognition, this was a perfect fit: without tagging your photo with keywords such as “bridge, burlington skyway, qew”, there was no way searches could find that photo if asked, for instance, for pictures of the Burlington Skyway bridge. We aren’t yet at a stage where software can reliably figure out what the subjects of a picture are, and mechanical metadata is still incomplete (although it’s getting there), so keywords/tags/folksonomies fill a critical gap in the photography data process.

Outside of photos the use of tags is often much more dubious.

To go back in history a bit, when search engines first appeared they largely relied upon meta keywords. This was a compromise due to limits in the “comprehension” of content: search engines got confused easily, and even when they could parse the content properly they couldn’t truly figure out what the content was about.

Keywords came along, offering a simple, condensed, human-created subset of the data, categorizing the important attributes of the content. Search engines embraced and utilized keywords as an important element of fulfilling search requests.

The honeymoon didn’t last long. It turned out that keywords were a prime stomping ground for search engine spammers, not to mention that they were a horribly limited method of searching through data: not only were the choices of keywords entirely subjective (often grossly incomplete and inconsistent), but by design they were limited to a very, very small subset of the content. If you really wanted content about metal railings, you might have missed my extensive discussion on that topic in my Burlington Skyway Bridge article because I didn’t feel that metal railings made the cut for the keywords.
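The metal railings problem is easy to reproduce. The following is a toy sketch, assuming a hypothetical `Page` record and two deliberately naive search functions; it is only meant to show why matching against author-chosen keywords misses content that matching against the text itself finds.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    title: str
    body: str
    keywords: list = field(default_factory=list)


def keyword_search(pages, query):
    """Match only against the keywords the author chose to list."""
    q = query.lower()
    return [p for p in pages if q in [k.lower() for k in p.keywords]]


def full_text_search(pages, query):
    """Match against the actual content of the page."""
    q = query.lower()
    return [p for p in pages if q in p.body.lower()]


# Hypothetical article: the body discusses metal railings at length,
# but the author did not think they deserved a keyword.
skyway = Page(
    title="Burlington Skyway Bridge",
    body="The bridge carries the QEW. Its metal railings deserve a long look.",
    keywords=["bridge", "burlington skyway", "qew"],
)
pages = [skyway]

print(keyword_search(pages, "metal railings"))
print([p.title for p in full_text_search(pages, "metal railings")])
```

The keyword search comes back empty while the full-text search finds the article, which is the whole complaint in miniature.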

Metatags are largely dead now.

Lake Ontario

In their place, search engines have become much better at determining what a given page is about (or at least simulating a reasonable proximity thereof). By analyzing content, having a directory of similar and derivative words, and by deriving information from context (such as links and related pages, and how they word links) and layout (noting that heading text, title, and early text hold more importance in classifying the page, though they are still used in concert with the rest of the content), search engines have come a long way in understanding content, and in correlating searches with appropriate results.
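As a rough illustration of the layout idea, here is a sketch of scoring a page by where the query terms appear. The field names and the weights are invented for the example; real search engines use far more elaborate signals, and nothing here reflects any particular engine’s formula.

```python
import re

# Hypothetical weights: a hit in the title counts for more than a hit in the body.
FIELD_WEIGHTS = {"title": 3.0, "headings": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(page, query):
    """Sum weighted counts of query terms across the page's fields."""
    terms = set(tokenize(query))
    total = 0.0
    for field_name, weight in FIELD_WEIGHTS.items():
        tokens = tokenize(page.get(field_name, ""))
        total += weight * sum(1 for token in tokens if token in terms)
    return total


page_a = {
    "title": "Burlington Skyway Bridge",
    "headings": "",
    "body": "A bridge over the harbour.",
}
page_b = {
    "title": "Weekend photos",
    "headings": "",
    "body": "We drove over a bridge.",
}

print(score(page_a, "bridge"))
print(score(page_b, "bridge"))
```

Both pages mention the word, but the one with it in the title ranks higher, and no hand-entered keyword was needed to get there.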

The loss of the keyword has proven to be very beneficial for search. Now it’s the actual data that classifies the content, rather than artificial metadata.

With improvements in language processors and context associative correlations (e.g. where the content parser understands that the paragraph on boxers is talking about the boxer breed of dog, determined by its correlation with other documents coupled with other details of the language, using language trees to classify probable meaning), things will only get better.

Content search has a very bright present, and a brighter future.

Yet tags continue to spread in woefully inappropriate domains, even where they serve as nothing more than the modern-day equivalent of the venerable META keyword. Instead of building reliable, feature-rich search tools into their products, appropriately determining relationships and context to understand content, product vendors are just tossing in a hack-job tag infrastructure and calling their job complete.

Worse still, users are accepting it and calling it a feature.