Success Through Communities

Over the past couple of days I’ve noticed hundreds upon hundredsof hits in my logs coming from www.skyscrapercity.com. Aftersome analysis I determined that a user there rather rudely embeddedan image on this site – a rather large picture of the Scotia Bankoffice tower in Toronto – in a discussion thread. Quite apart fromthe fact that the picture is being used unattributed (if it’s goodenough to use, then it’s good enough to attribute), it’s basicallysilently stealing my bandwidth quota. Very rude.

When people have done this in the past I’ve surprized them withdelightful and entertaining image alterations, but in this case I’mjust going to ignore it and let the thread die down. After lookingat the source of the traffic, however, I’ve been reminded of themost common, and most successful, pure-.com internet play – put upa site about some sort of fly-by information (for instanceskyscraper diagrams), and then add discussion links. Soon enoughyou’ll have a robust community ofusers who share that interest, spending hours a day debatingwhether Chicago isa better looking city than Dubai. It seems like a prettytenuous foundation for a community, but there it is.

Unintuitive number sequences

char *string_value = new char[32];

In computer science we’re quite accustomed to using powers of 2whenever a numerical limit is required. e.g. The string can be 32characters long, the filename can be 64 characters, while thenumber of entries in the listbox can number 1024.

These uses seldom require powers of 2 (e.g. while it makes sensefor an ASCII string to be multiples of 4 bytes if it’s long alignedand you care about that, it could just as efficiently be 28 or 36characters long), but nonetheless it’s ingrained into mostdevelopers’ minds.

I chuckled seeing the commercial for some overpricedtimed-interval airfreshener. It allows you to select 9, 18, or 36 minuteintervals between sprays. While not exactly compliant (I’ll betthat it was originally 8, 16 or 32 minutes, but they added some lagto the minute counter to avoid it seeming computeresque), and inthis case I can understand why the microcontroller developer chosepowers, the spirit of the power of 2 lives on.

Jamie’s School Dinners

Finally got around to watching this show, while it was doing a marathon showing on the FoodTVchannel here in Canada. Quite apart from the food (which isactually almost an overlooked element of the show – it is not acooking show), this show is an excellentlesson in management. The lessons learned in dealing with peers,”employees” (the dinner ladies), and the kids is absolutelybrilliant human nature stuff that everyone should watch and absorb.Very highly recommended.

 

String Pooling in SQL Server

Several times over the past couple of years, in my role as adatabase consultant, I’ve come across very, very large databases,where a large percentage of the data is redundant. For instanceconsider the following two abbreviated tables

Forms
FormsID (PK) nvarchar(255)

Hits
HitsID (PK) int identity(1,1)
FormsID (FK) nvarchar(255)
Time datetime

Imagine that there are only a dozen forms values, each of themaveraging about 30 characters in length (so 60 bytes or so, givingthat it’s unicode). If you have a million records in hits, that’s60MB just for the form value itself. If you have one hundredmillion records, and a dozen large FKs like this, well you get thepicture. It vastly increases the amount of I/O to do searches inthe Hits table, and even if Forms is indexed it’s still much slowerthan it could be if Forms had an integer primary key.

While I personally wouldn’t layout tables this way, it is anentirely credible and justifiable design – the designer simplydecided to use a natural key rather than a surrogate key.Simplicity of design, and clarity of relationships when looking atthe data, outweighed I/O concerns for this person/group. Such adesign is not a question of normalization.

When you have a million+ records it suddenly becomes a concern,though. There are ways to refactor this design, including”normalizing” the original table a bit and hiding it behind a view,and then adding INSTEAD OF triggers on the view, however that is aleaky abstraction. SQL Server does not completely mimic a realtable, and operations like INSERT FROM fail, not to mentionoddities with @@IDENTITY and SCOPE_IDENTITY().

Given all of this, I would love if SQL Server had a behind thescenes method of collapsing redundant large field values into ahidden behind the scenes lookup table, similar to what VisualStudio does with string pooling. e.g. In this case it could replace FormID witha internal value to lookup against a tiny relational table.Obviously this should be manually configured, but it would be arelatively easy change that could tremendously improve a lot ofexisting database designs where a redesign isn’t a priority, butI/O costs are onerous.

Apparently mySQL has something similar by way of enums, howeverit is a fixed set (what I’d like is that new values inserted intothe table are automatically added into the behind the scenes set),and again there is some leakiness with the abstraction.

Improved Indexing in SQL Server

[EDIT: 2005-09-08 – There seems to be a bit of confusionregarding this post, so I should clarify – These are neither uniqueor unsolvable problems. One could, for instance, achieve both ofthese tasks via hackery, as I mentioned below. I could of courseuse triggers and procedural logic and duplicitous columns to storereversed sets and decomposed strings. However that is thewrong solution, and pollutes my database withhacks to get around fundamental limits inFull-Text indexing]

Recently I’ve come across the need to do partial string searcheswithin very large sets of data (e.g. 10 large columns of both ntextand nvarchar types, in a table contains hundreds of thousands,or even millions, of records).

For example I’d like to look for 0505 within a table,returning rows containing that value in any of thesearched columns. For instance a row where one of the valuescontains REC995850505293. I could do thisthe bulk force way by doing a LIKE ‘%0505%’ against all ofthe columns, however that’s terribly inefficient, and will bringthe largest of servers to their knees with the volumes of data thatI’m talking about.

Of course the immediate solution one might imagine would be SQLServer’s Full-Text Indexing, or even a third-party tool likeApache’s Lucene full-text search. The problem is that both of thesesearch engines can only match from the beginning of a word onwards(or, with a thesaurus, word variants). For instance they can searchform REC9958*, returning REC995850505293,but they cannot search for *50505293. Because the index isordered based upon the beginning of the word, it can only matchnon-beginning partial words through a full-scan, which is of nohelp at all.

From a technology perspective this is understandable, howeverthere are a couple of pretty simple improvements to full-textindexing that would greatly improve their usability (albeit at thecost of storage and additional search maintenance processing, butthat should be a choice that a user can decide).

  • Allow for the configuration of columns to be stored bothforwards and backwards. For instance on the back-end the columnREC995850505293 would be stored as REC995850505293,but also as 392505058599CER. The value, of course, is that nowif I search for “*05293“, it can reverse thesearch phrase to 39250* and compare it against the reversed set,quickly finding a match. Double the storage space, but a vastlymore usable index. As it is you can hack this sort of thingyourself, storing mirrors of all of your fields and then runningyour queries through a processor to know what to invert, but it’s alot of hackery for something that should be a back-endsolution.
  • Allow for the configuration such that all sub-parts of a wordare also indexed. e.g. REC995850505293 is also stored asEC995850505293, and C995850505293, and995850505293, and so on. Clearly the consumption in spacewould be tremendous, but there are scenarios where this would beextremely valuable. Now a search for *0505* and it can find thepartial word matches, linking back to the primary word fromthere.

If anyone has any ideas of places where I might look forsolutions to this sort of problem, please drop me a line.