Unintuitive Number Sequences

char *string_value = new char[32];

In computer science we’re quite accustomed to using powers of 2 whenever a numerical limit is required: e.g. the string can be 32 characters long, the filename can be 64 characters, and the number of entries in the listbox can be 1024.

These uses seldom require powers of 2 (e.g. while it makes sense for an ASCII string to be a multiple of 4 bytes if it’s long-aligned and you care about that, it could just as efficiently be 28 or 36 characters long), but nonetheless it’s ingrained into most developers’ minds.

I chuckled seeing the commercial for some overpriced timed-interval air freshener. It allows you to select 9-, 18-, or 36-minute intervals between sprays. While not exactly power-of-2 compliant (I’ll bet that it was originally 8, 16, or 32 minutes, but they added some lag to the minute counter to avoid it seeming computeresque), and in this case I can understand why the microcontroller developer chose powers of 2, the spirit of the power of 2 lives on.

Jamie’s School Dinners

Finally got around to watching this show while it was doing a marathon showing on the Food TV channel here in Canada. Quite apart from the food (which is actually almost an overlooked element of the show – it is not a cooking show), this show is an excellent lesson in management. The lessons learned in dealing with peers, “employees” (the dinner ladies), and the kids are absolutely brilliant human nature stuff that everyone should watch and absorb. Very highly recommended.


String Pooling in SQL Server

Several times over the past couple of years, in my role as a database consultant, I’ve come across very, very large databases where a large percentage of the data is redundant. For instance, consider the following two abbreviated tables:

Forms
FormsID (PK) nvarchar(255)

Hits
HitsID (PK) int identity(1,1)
FormsID (FK) nvarchar(255)
Time datetime

Imagine that there are only a dozen forms values, each of them averaging about 30 characters in length (so 60 bytes or so, given that it’s Unicode). If you have a million records in Hits, that’s 60MB just for the form value itself. If you have one hundred million records, and a dozen large FKs like this, well, you get the picture. It vastly increases the amount of I/O needed to do searches in the Hits table, and even if Forms is indexed, it’s still much slower than it could be if Forms had an integer primary key.
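To put rough numbers to the comparison (my arithmetic, assuming the dozen-value scenario above, with a plain 4-byte int surrogate as the alternative):

FormsID as nvarchar(255), ~30 chars: ~60 bytes x 1,000,000 rows ≈ 60 MB for the FK column alone
FormsID as int: 4 bytes x 1,000,000 rows ≈ 4 MB for the FK column alone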

While I personally wouldn’t lay out tables this way, it is an entirely credible and justifiable design – the designer simply decided to use a natural key rather than a surrogate key. Simplicity of design, and clarity of relationships when looking at the data, outweighed I/O concerns for this person/group. Such a design is not a question of normalization.

When you have a million+ records it suddenly becomes a concern, though. There are ways to refactor this design, including “normalizing” the original table a bit and hiding it behind a view, and then adding INSTEAD OF triggers on the view; however, that is a leaky abstraction. The view does not completely mimic a real table, and operations like INSERT FROM fail, not to mention oddities with @@IDENTITY and SCOPE_IDENTITY().
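For concreteness, here is a minimal sketch of that refactoring – table, column, and trigger names are mine, not from any real schema, and it glosses over the identity oddities just mentioned:

CREATE TABLE FormNames (
    FormNameID int identity(1,1) PRIMARY KEY,
    FormsID nvarchar(255) NOT NULL UNIQUE
)

CREATE TABLE HitsData (
    HitsID int identity(1,1) PRIMARY KEY,
    FormNameID int NOT NULL REFERENCES FormNames (FormNameID),
    Time datetime NOT NULL
)
GO

-- The view presents the original wide shape to existing code
CREATE VIEW Hits AS
    SELECT h.HitsID, f.FormsID, h.Time
    FROM HitsData h
    JOIN FormNames f ON f.FormNameID = h.FormNameID
GO

CREATE TRIGGER Hits_InsteadOfInsert ON Hits INSTEAD OF INSERT AS
BEGIN
    -- Add any form values we haven't seen before to the lookup table
    INSERT INTO FormNames (FormsID)
    SELECT DISTINCT i.FormsID
    FROM inserted i
    WHERE NOT EXISTS (SELECT * FROM FormNames f WHERE f.FormsID = i.FormsID)

    -- Insert the narrow rows, mapping each string to its surrogate key
    INSERT INTO HitsData (FormNameID, Time)
    SELECT f.FormNameID, i.Time
    FROM inserted i
    JOIN FormNames f ON f.FormsID = i.FormsID
END
GO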

Given all of this, I would love it if SQL Server had a behind-the-scenes method of collapsing redundant large field values into a hidden lookup table, similar to what Visual Studio does with string pooling. e.g. in this case it could replace FormsID with an internal value to look up against a tiny relational table. Obviously this should be manually configured, but it would be a relatively easy change that could tremendously improve a lot of existing database designs where a redesign isn’t a priority, but I/O costs are onerous.

Apparently MySQL has something similar by way of enums; however, it is a fixed set (what I’d like is for new values inserted into the table to be automatically added to the behind-the-scenes set), and again there is some leakiness in the abstraction.

Improved Indexing in SQL Server

[EDIT: 2005-09-08 – There seems to be a bit of confusion regarding this post, so I should clarify – these are neither unique nor unsolvable problems. One could, for instance, achieve both of these tasks via hackery, as I mention below. I could of course use triggers and procedural logic and duplicate columns to store reversed sets and decomposed strings. However, that is the wrong solution, and pollutes my database with hacks to get around fundamental limits in Full-Text Indexing.]

Recently I’ve come across the need to do partial string searches within very large sets of data (e.g. 10 large columns of both ntext and nvarchar types, in a table containing hundreds of thousands, or even millions, of records).

For example, I’d like to look for 0505 within a table, returning rows containing that value in any of the searched columns – for instance, a row where one of the values contains REC995850505293. I could do this the brute-force way by doing a LIKE ‘%0505%’ against all of the columns; however, that’s terribly inefficient, and will bring the largest of servers to their knees with the volumes of data that I’m talking about.
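In T-SQL terms the brute-force approach looks something like this (table and column names are placeholders for the real schema):

SELECT *
FROM Submissions
WHERE Column1 LIKE N'%0505%'
   OR Column2 LIKE N'%0505%'
   OR Column3 LIKE N'%0505%'
-- ...and so on for all ten columns. The leading wildcard means no
-- ordinary index can be used, so every row must be scanned.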

Of course, the immediate solution one might imagine would be SQL Server’s Full-Text Indexing, or even a third-party tool like Apache’s Lucene full-text search. The problem is that both of these search engines can only match from the beginning of a word onwards (or, with a thesaurus, word variants). For instance, they can search for REC9958*, returning REC995850505293, but they cannot search for *50505293. Because the index is ordered based upon the beginning of the word, it can only match non-beginning partial words through a full scan, which is of no help at all.
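To illustrate with SQL Server’s CONTAINS syntax (the column name is again a placeholder):

-- Prefix search: supported, and served by the full-text index
SELECT * FROM Submissions WHERE CONTAINS(Column1, N'"REC9958*"')

-- Suffix search: there is no supported equivalent of '"*50505293"';
-- the index is keyed on word beginnings, so a leading wildcard
-- cannot be satisfied without scanning the entire index.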

From a technology perspective this is understandable; however, there are a couple of pretty simple improvements to full-text indexing that would greatly improve its usability (albeit at the cost of storage and additional maintenance processing, but that should be a choice the user can make).

  • Allow for the configuration of columns to be stored both forwards and backwards. For instance, on the back end the column REC995850505293 would be stored as REC995850505293, but also as 392505058599CER. The value, of course, is that now if I search for “*05293”, it can reverse the search phrase to 39250* and compare it against the reversed set, quickly finding a match. Double the storage space, but a vastly more usable index. As it is you can hack this sort of thing yourself (as sketched after this list), storing mirrors of all of your fields and then running your queries through a processor to know what to invert, but it’s a lot of hackery for something that should be a back-end solution.
  • Allow for the configuration such that all sub-parts of a word are also indexed. e.g. REC995850505293 is also stored as EC995850505293, and C995850505293, and 995850505293, and so on. Clearly the consumption in space would be tremendous, but there are scenarios where this would be extremely valuable. Now a search for *0505* can find the partial-word matches, linking back to the primary word from there.
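As a sketch of the do-it-yourself mirroring mentioned in the first point (hypothetical names; this works for the nvarchar columns, though not for ntext):

-- Maintain a reversed copy of the column and index it
ALTER TABLE Submissions ADD Column1Rev AS REVERSE(Column1)
CREATE INDEX IX_Submissions_Column1Rev ON Submissions (Column1Rev)

-- A suffix search becomes an index-seekable prefix search on the
-- mirror: searching for *05293 turns into 39250%
SELECT *
FROM Submissions
WHERE Column1Rev LIKE REVERSE(N'05293') + N'%'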

If anyone has any ideas of places where I might look for solutions to this sort of problem, please drop me a line.

The Inflationary Pressures of Links

Over the years people have asked me why I maintain yafla.com (after a Slashdotting a few years ago, a reader wrote to ask if yafla.com stood for “Yet Another F’n Lame Ass.com?”. I got quite a kick out of that, and I considered replacing Yet Another Five Letter Acronym in my mind with this more cynical variant). While it is a legitimate company that I do work under, basically I’m by design a one-man crew and have no lack of work, so I don’t actively solicit business. Nonetheless, I’ve always wanted to maintain a credible internet presence just in case I think up something that would be .COM brilliant.

To serve this desire, one of my goals with yafla.com was to maintain it at a middling ranking, publishing enough interesting information that people would link to it and visit, so that when I do post something interesting about a non-mainstream topic, it at least has a chance in heck of appearing somewhere near the front of the links returned by search engines (given that people who actually care are most likely to get here via a search engine; people coming from blog-of-the-day or discussion links are much more likely to be fly-bys who pad the hit count but don’t actually get value from the content. I get no pleasure from empty hits).

Of course there’s also the personal credibility angle: along with published print articles, I also post informational tools or papers on yafla.com to maintain some karma in the industry, and also as a goal – a destination – that drives me to investigate topics that otherwise I might not so thoroughly consider. I’ve placed Google AdSense ads on a couple of papers as a test, but they yield a pittance: I could “make” far more than the Google ads yield by getting a regular coffee instead of a large in the morning.

One thing I have noticed, however, is that the number of hits coming from search engines like Google has been rapidly declining over the past couple of years, basically charting as an inverse of the number of blogs filling the medium. It seems that as more and more blogs come online, all of them promiscuously cross-linking and trackbacking, the value of getting a couple hundred links for a neat domain tool, or a dozen links from highly specialized sites concerning a specific topic, has declined to the point of being irrelevant.

With the new inflationary pressures, it seems that nothing less than thousands of swarming blog links will really get you search engine credibility. This is compounded by the fact that most (or all) of the major search engines are terribly dumb, in that a million generalist blog links to a guy for his Xbox game tips will yield him top results for SQL queries the day he posts his first Hello World SELECT statement. To my knowledge there is no search engine that separates links out into areas of expertise (Google pseudo-does this by analyzing the context of each link, but it is terribly deficient), eliminating this useless global ranking for all searches. Many blogs are earning credibility by association (due to demonstrable weaknesses in algorithms like PageRank), such as the huge rise of blogs.msdn.com entries for virtually every search term, even where the individual blog itself has few or no direct links from the outside world, but has credibility by being linked within the whole of blogs.msdn.com. This is a weakness I wrote about several years ago on this very site, though in that iteration it was GeoCities accounts that were disproportionately being ranked.

The question I am pondering, then, is whether the only way one can remain internet credible (in search engine terms) is to integrate heavily within the blogging community, quid-pro-quoing endless links and trackbacks, ingratiating oneself with other bloggers, posting meaningless comments about every posting every other blogger makes (which they will of course do in turn). It’s a sort of super-pyramid scheme, but with no bottom level.