String Pooling in SQL Server

Several times over the past couple of years, in my role as adatabase consultant, I’ve come across very, very large databases,where a large percentage of the data is redundant. For instanceconsider the following two abbreviated tables

Forms
FormsID (PK) nvarchar(255)

Hits
HitsID (PK) int identity(1,1)
FormsID (FK) nvarchar(255)
Time datetime

Imagine that there are only a dozen forms values, each of themaveraging about 30 characters in length (so 60 bytes or so, givingthat it’s unicode). If you have a million records in hits, that’s60MB just for the form value itself. If you have one hundredmillion records, and a dozen large FKs like this, well you get thepicture. It vastly increases the amount of I/O to do searches inthe Hits table, and even if Forms is indexed it’s still much slowerthan it could be if Forms had an integer primary key.

While I personally wouldn’t layout tables this way, it is anentirely credible and justifiable design – the designer simplydecided to use a natural key rather than a surrogate key.Simplicity of design, and clarity of relationships when looking atthe data, outweighed I/O concerns for this person/group. Such adesign is not a question of normalization.

When you have a million+ records it suddenly becomes a concern,though. There are ways to refactor this design, including”normalizing” the original table a bit and hiding it behind a view,and then adding INSTEAD OF triggers on the view, however that is aleaky abstraction. SQL Server does not completely mimic a realtable, and operations like INSERT FROM fail, not to mentionoddities with @@IDENTITY and SCOPE_IDENTITY().

Given all of this, I would love if SQL Server had a behind thescenes method of collapsing redundant large field values into ahidden behind the scenes lookup table, similar to what VisualStudio does with string pooling. e.g. In this case it could replace FormID witha internal value to lookup against a tiny relational table.Obviously this should be manually configured, but it would be arelatively easy change that could tremendously improve a lot ofexisting database designs where a redesign isn’t a priority, butI/O costs are onerous.

Apparently mySQL has something similar by way of enums, howeverit is a fixed set (what I’d like is that new values inserted intothe table are automatically added into the behind the scenes set),and again there is some leakiness with the abstraction.