String Pooling in SQL Server

Several times over the past couple of years, in my role as a database consultant, I’ve come across very, very large databases where a large percentage of the data is redundant. For instance, consider the following two abbreviated tables:

Forms
ID (PK) nvarchar(255)

Hits
ID (PK) int identity(1,1)
FormsID (FK) nvarchar(255)
Time datetime

Imagine that there are only a dozen forms values, each of them averaging about 30 characters in length (so 60 bytes or so, given that it’s Unicode). If you have a million records in Hits, that’s 60MB just for the form value itself. If you have one hundred million records, and a dozen large FKs like this, well, you get the picture. It vastly increases the amount of I/O needed to do searches in the Hits table, and even if Forms is indexed it’s still much slower than it could be if Forms had an integer primary key.
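That estimate is easy to check with quick arithmetic – a back-of-the-envelope sketch in Python, using the figures assumed above and ignoring row overhead and index structures:

```python
# Back-of-the-envelope storage cost for the FormsID foreign key alone.
BYTES_PER_CHAR = 2       # nvarchar stores Unicode at 2 bytes per character
AVG_FORM_LENGTH = 30     # assumed average characters per form value
ROWS = 1_000_000

nvarchar_fk_bytes = ROWS * AVG_FORM_LENGTH * BYTES_PER_CHAR
int_fk_bytes = ROWS * 4  # a plain int surrogate key is 4 bytes

print(nvarchar_fk_bytes // 1_000_000, "MB for the nvarchar FK")  # 60 MB
print(int_fk_bytes // 1_000_000, "MB for an int surrogate FK")   # 4 MB
```

At one hundred million rows the same key costs 6GB, per FK column, before any indexes are considered.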

While I personally wouldn’t lay out tables this way, it is an entirely credible and justifiable design – the designer simply decided to use a natural key rather than a surrogate key. Simplicity of design, and clarity of relationships when looking at the data, outweighed I/O concerns for this person/group. Such a design is not a question of normalization.

When you have a million+ records it suddenly becomes a concern, though. There are ways to refactor this design, including “normalizing” the original table a bit and hiding it behind a view, and then adding INSTEAD OF triggers on the view; however, that is a leaky abstraction. A view in SQL Server does not completely mimic a real table, and operations like INSERT FROM fail, not to mention oddities with @@IDENTITY and SCOPE_IDENTITY().

Given all of this, I would love it if SQL Server had a behind-the-scenes method of collapsing redundant large field values into a hidden lookup table, similar to what Visual Studio does with string pooling. In this case, for example, it could replace FormsID with an internal value to look up against a tiny relational table. Obviously this should be manually configured, but it would be a relatively easy change that could tremendously improve a lot of existing database designs where a redesign isn’t a priority, but I/O costs are onerous.
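A toy model of what such a feature might do internally – sketched in Python purely for illustration; the class and method names are my own invention, not anything SQL Server offers:

```python
class StringPool:
    """Interns repeated string values, handing out small integer ids –
    analogous to a hidden lookup table behind a wide FK column."""

    def __init__(self):
        self._ids = {}      # value -> pool id
        self._values = []   # pool id -> value

    def intern(self, value):
        """Return the id for value, adding it to the pool if it is new."""
        if value not in self._ids:
            self._ids[value] = len(self._values)
            self._values.append(value)
        return self._ids[value]

    def lookup(self, pool_id):
        """Recover the original string from its compact id."""
        return self._values[pool_id]

pool = StringPool()
hits = [pool.intern(form) for form in
        ["Contact Us", "Order Form", "Contact Us", "Feedback", "Order Form"]]
print(hits)                  # -> [0, 1, 0, 2, 1]
print(pool.lookup(hits[2]))  # -> Contact Us
```

Each row now carries a small integer instead of a 60-byte string, and the full values live once in the pool – exactly the trade the engine could make invisibly.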

Apparently MySQL has something similar by way of enums; however, it is a fixed set (what I’d like is for new values inserted into the table to be automatically added to the behind-the-scenes set), and again there is some leakiness with the abstraction.

Improved Indexing in SQL Server

[EDIT: 2005-09-08 – There seems to be a bit of confusion regarding this post, so I should clarify – these are neither unique nor unsolvable problems. One could, for instance, achieve both of these tasks via hackery, as I mentioned below. I could of course use triggers and procedural logic and duplicate columns to store reversed sets and decomposed strings. However, that is the wrong solution, and pollutes my database with hacks to get around fundamental limits in Full-Text indexing]

Recently I’ve come across the need to do partial string searches within very large sets of data (e.g. 10 large columns of both ntext and nvarchar types, in a table containing hundreds of thousands, or even millions, of records).

For example, I’d like to look for 0505 within a table, returning rows containing that value in any of the searched columns – for instance, a row where one of the values contains REC995850505293. I could do this the brute force way by doing a LIKE ‘%0505%’ against all of the columns; however, that’s terribly inefficient, and will bring the largest of servers to their knees with the volumes of data that I’m talking about.

Of course, the immediate solution one might imagine would be SQL Server’s Full-Text Indexing, or even a third-party tool like Apache’s Lucene full-text search. The problem is that both of these search engines can only match from the beginning of a word onwards (or, with a thesaurus, word variants). For instance, they can search for REC9958*, returning REC995850505293, but they cannot search for *50505293. Because the index is ordered based upon the beginning of the word, it can only match non-beginning partial words through a full scan, which is of no help at all.
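The limitation is easy to see with a toy model – Python for illustration, with a sorted list standing in for the real index structure:

```python
import bisect

# A toy word index: a sorted list, like the tree behind a full-text
# index, can only exploit its ordering for prefix queries.
index = sorted(["ABC000050505000", "REC123400000001", "REC995850505293"])

def prefix_matches(words, prefix):
    """A prefix query needs only two binary searches on the sorted list."""
    lo = bisect.bisect_left(words, prefix)
    hi = bisect.bisect_left(words, prefix + "\uffff")
    return words[lo:hi]

print(prefix_matches(index, "REC9958"))   # fast: ['REC995850505293']

# An infix query like '*0505*' gets no help from the ordering; the only
# option is a full scan of every indexed entry:
print([w for w in index if "0505" in w])
```

The prefix search touches O(log n) entries; the infix search touches all of them, which is exactly the full scan described above.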

From a technology perspective this is understandable; however, there are a couple of pretty simple improvements to full-text indexing that would greatly improve its usability (albeit at the cost of storage and additional index-maintenance processing, but that should be a choice the user can make).

  • Allow for the configuration of columns to be stored both forwards and backwards. For instance, on the back-end the column value REC995850505293 would be stored as REC995850505293, but also as 392505058599CER. The value, of course, is that now if I search for “*05293”, it can reverse the search phrase to 39250* and compare it against the reversed set, quickly finding a match. Double the storage space, but a vastly more usable index. As it is you can hack this sort of thing yourself, storing mirrors of all of your fields and then running your queries through a processor to know what to invert, but it’s a lot of hackery for something that should be a back-end solution.
  • Allow for the configuration such that all sub-parts of a word are also indexed, e.g. REC995850505293 is also stored as EC995850505293, and C995850505293, and 995850505293, and so on. Clearly the consumption of space would be tremendous, but there are scenarios where this would be extremely valuable. Now a search for *0505* can find the partial-word matches, linking back to the primary word from there.
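Both bullets can be prototyped in a few lines – a sketch of the client-side hackery described above, in my own toy Python, not a SQL Server capability:

```python
import bisect

values = ["REC995850505293", "REC123400000001"]

# Idea 1: keep a mirrored copy of every value, so a '*suffix' query
# becomes a prefix search against the reversed set.
reversed_index = sorted(v[::-1] for v in values)

def suffix_matches(pattern):
    """Answer a '*pattern' query via a prefix search on the reversed index."""
    rev = pattern[::-1]
    lo = bisect.bisect_left(reversed_index, rev)
    hi = bisect.bisect_left(reversed_index, rev + "\uffff")
    return [w[::-1] for w in reversed_index[lo:hi]]

print(suffix_matches("05293"))   # -> ['REC995850505293']

# Idea 2: index every sub-part (suffix) of each value, so a '*0505*'
# query also reduces to a prefix search, at a large cost in space.
suffixes = sorted(s[i:] for s in values for i in range(len(s)))
lo = bisect.bisect_left(suffixes, "0505")
hi = bisect.bisect_left(suffixes, "0505\uffff")
print(suffixes[lo:hi])           # -> ['0505293']
```

Each matching sub-part would then need a link back to the row holding the primary word; that bookkeeping, and keeping the mirrored sets in sync on every write, is precisely what belongs in the engine rather than in triggers.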

If anyone has any ideas of places where I might look for solutions to this sort of problem, please drop me a line.

The Inflationary Pressures of Links

Over the years people have asked me why I maintain this site. (During a Slashdotting a few years ago, a reader wrote to ask if the name stood for “Yet Another F’n Lame…”. I got quite a kick out of that, and I considered replacing “Yet Another Five-Letter Acronym” in my mind with this more cynical variant.) While it is a legitimate company that I do work under, basically I’m by design a one-man crew and have no lack of work, so I don’t actively solicit for business. Nonetheless I’ve always wanted to maintain a credible internet presence just in case I think up something that would be .COM brilliant.

To serve this desire, one of my goals with the site was to maintain it at a middling ranking, publishing enough interesting information that people would link to it and visit, so that when I do post something interesting about a non-mainstream topic, it at least has a chance in heck of appearing somewhere near the front of the links returned by search engines (given that people who actually care are most likely to get here via a search engine; people coming from blog-of-the-day or discussion links are much more likely to be fly-bys who pad the hit-count but don’t actually get value from the content. I get no pleasure from empty hits).

Of course there’s also the personal credibility angle: along with published print articles, I also post informational tools and papers on the site to maintain some karma in the industry, and also as a goal – a destination – that drives me to investigate topics that otherwise I might not so thoroughly consider. I’ve placed Google AdSense ads on a couple of papers as a test, but they yield a pittance: I could “make” far more than the Google ads yield by getting a regular coffee instead of a large in the morning.

One thing I have noticed, however, is that the number of hits coming from search engines like Google has been rapidly declining over the past couple of years, basically charting as an inverse of the number of blogs filling the medium. It seems that as more and more blogs come online, all of them promiscuously cross-linking and trackbacking, the value of getting a couple hundred links for a neat domain tool, or a dozen links from highly specialized sites concerning a specific topic, has declined to the point of being irrelevant.

With the new inflationary pressures, it seems that nothing less than thousands of swarming blog links will really get you search engine credibility. This is doubled by the fact that most (or all) of the major search engines are terribly dumb, in that a million generalist blog links to a guy for his Xbox game tips will yield him top results for SQL queries the day he posts his first Hello World SELECT statement. To my knowledge there is no search engine that separates links out into areas of expertise (Google pseudo-does this by analyzing the context of each link, but it is terribly deficient), eliminating this useless global ranking for all searches. Many blogs are earning credibility by association (due to demonstrable weaknesses in algorithms like PageRank), such as the huge rise of blog entries for virtually every search term, even where the individual blog itself has few or no direct links from the outside world, but has credibility by being linked within the whole. This is a weakness I wrote about several years ago on this very site, though in that iteration it was GeoCities accounts that were disproportionately being ranked.

The question I am pondering, then, is whether the only way one can remain internet credible (in search engine terms) is to integrate heavily within the blogging community, quid-pro-quoing endless links and trackbacks, ingratiating oneself with other bloggers, posting meaningless comments about every posting every other blogger makes (which they will of course do in turn). It’s a sort of super-pyramid scheme, but with no bottom level.


Hibernation

Several years of critical power shortages here in Ontario, along with a sense that excessive resource consumption is morally wrong, have led me to power down my PCs when they’re not in use. While I’d prefer a partial sleep solution, even standby mode consumes a considerable amount of power (measured not with a watt meter, but rather by just feeling the heat of the air coming out of the still-running power supply). While Windows XP and Windows 2003 have vastly improved start-up times, once you factor in a large number of services such as SQL Server and various desktop search utilities, along with tools like Visual Studio, getting back to where you were before the shutdown can be very time consuming.

As such, over the past years I’ve been relying upon the excellent feature called Hibernation. Enabled in the Power Options (as shown below), this gives you a new “Shut Down” option (available by configuring a key in the advanced section of power management, or by holding Shift in the XP-theme shutdown menu) that basically freezes the state of your PC and then spools the entire memory contents out to a file. On restart it spools the state back in exactly where it was, resets the state on the CPU, and then you’re off and running again. Getting back to exactly where I was takes just a few seconds.

Back in about 1986/87 a revolutionary product for the Atari ST, I believe called SnapBack, came out that did exactly this, spooling out the state, compressed, to a disk file. Of course, in that case it was generally spooling out 512KB or 1MB, rather than 1GB+, but the idea was the same. At that time people often used it to add “Save Game” functionality to games that intentionally or unintentionally didn’t offer the same. Other people used it for piracy, spooling out a running game (after the copy protection checks had occurred), and then giving the file to others.

Just had to mention this as it’s remarkable how many people don’t know about, and thus don’t use, this excellent feature. It isn’t perfect, however, and several times it has failed to recover to where it was, so you probably shouldn’t hibernate with that document you’ve worked on for the past three weeks sitting unsaved.

Opportunities from Hurricane Katrina

The title of this entry will likely get some people up in arms. I assure those people that I do not mean to draw attention from the tragedy, or to diminish it in any way. Nor do I think this blog entry will stop a single rescuer from going about their business.

However, what has happened has happened, and many of us have pledged our monetary donations and are really left twiddling our thumbs at what else we can do.

As software developers and technology experts, I think there is plenty we can do. For instance, there were obviously technological gaps in information management (knowing who and what was where and when, and sharing that information with everyone; the lack of this sort of knowledge led to some of the chaos that horribly delayed the response). After the disaster, technology was necessary for communications, with many of the emergency personnel and victims having no means of communicating. There were gaps in battery storage, with basic infrastructure dying quickly. There were gaps after the pieces began to be cleaned up, in coordinating communications amongst the victims in various municipalities.

Technology can’t stop a category 5 hurricane (yet), but it can help ameliorate the damage and help society get back on track as quickly as possible.

Given this, this tragedy will invariably be followed by billions of taxpayer dollars going into various strategies to prevent occurrences like this from happening again, or at least to have a better grasp on responding to them. Many of those dollars will be going towards IT projects – something to keep your mind open to if you have ideas for solutions that would prevent this sort of nightmare scenario from happening again.

Something to think about.