.NET, Garbage Collection, and Reference Counting

Garbage collection in .NET has always rubbed me the wrong way. As a quick recap, garbage collection in .NET (as in Java) works by basically halting the application and scanning all references from the root references on. It then looks at its heap to determine whether every object has someone pointing to it, and if not the object is freed (through a long process). The heap memory is then compacted and any references are rebased. The program then resumes until garbage collection happens again at some point in the future.

This means, for instance, that if you create a System.IO.FileStream object in a short method, opening a file in exclusive mode, and you fail to use the Dispose pattern or explicitly call Dispose (Dispose being a sort of “garbage collection has some gaps, so here’s something that you can do to expedite at least part of the process”), the file will stay locked until some unpredictable point in the future when garbage collection runs. You can, of course, force garbage collection, but that throws off the entire lifecycle management and can cause other resource management issues.
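
To make the file example concrete, here’s a rough C++/CLI sketch of the Dispose pattern (the function name and path parameter are invented for illustration; a C# using block compiles down to much the same try/finally):

// C++/CLI sketch (compile with /clr); ReadLog and its path argument are invented.
using namespace System;
using namespace System::IO;

void ReadLog(String^ path)
{
    // FileShare::None takes an exclusive lock on the file.
    FileStream^ fs = gcnew FileStream(path, FileMode::Open,
                                      FileAccess::Read, FileShare::None);
    try
    {
        // ... read from the file ...
    }
    finally
    {
        // Without this, the lock survives until the GC happens to finalize
        // fs at some unpredictable later point. In C++/CLI, delete on a
        // handle calls Dispose().
        delete fs;
    }
}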

Ultimately it seems like the sort of solution that works for relatively small or isolated applications (where it works admirably: for web apps and web services, isolated services, and relatively small Windows Forms apps, .NET is a fantastic technology, primarily insofar as it reduces development time), but not as a technology that scales up to large, highly responsive systems (where you want the loading, resource usage and response times to be predictable and consistent). This seemed to be somewhat borne out by many of the delays and issues with Longhorn (Windows Vista), and by the backtracking and reduced reliance on .NET as a system-pervasive technology (The Register isn’t the most credible source, but it’s just a reference to the sort of thing I’ve heard throughout the industry). Entirely predictable.

One change that I would like to see added to .NET: optional reference-counted references, with a second heap specifically for classic, fragmented allocation. Reference counting, the oft-maligned technology behind COM (maligned mostly because people didn’t know how to use it properly), is a completely reliable, extremely predictable, and useful technology in a completely managed environment for most uses (there are cases where reference counting breaks down, cyclic references being the classic one, but you don’t have to throw out the baby to clean the bathwater). Python, for instance, works largely on the basis of reference counting.
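
For flavour, here is a bare-bones sketch of the COM-style counting I have in mind. It’s a hypothetical class, not production code; a real version would use InterlockedIncrement/InterlockedDecrement (or simply a smart pointer) rather than a bare long:

// Bare-bones reference counting in the spirit of COM's AddRef/Release.
// Hypothetical sketch: not thread safe, and deliberately minimal.
class RefCounted
{
public:
    RefCounted() : m_refs(1) {}

    void AddRef()  { ++m_refs; }

    void Release()
    {
        // The instant the last reference is released the object is freed:
        // no pause, no heap scan, completely predictable timing.
        if (--m_refs == 0)
            delete this;
    }

protected:
    virtual ~RefCounted() {}   // force clients to go through Release()

private:
    long m_refs;
};

The point is the determinism: the moment the last Release() happens, the destructor runs on that exact line, rather than whenever the collector next gets around to it.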

Some might note that Visual C++ 2005 has added “stack” reference types. This really is a bit of syntactic sugar: basically it’s just a variant of the Dispose pattern that, when compiled, adds an automatic call to Dispose when the object leaves the scope. Not the same thing at all.
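
For illustration, this is roughly what that sugar looks like (the function and file name are invented):

// Visual C++ 2005 "stack semantics" sketch; ReadConfig and the file name are
// invented. fs still lives on the managed heap -- the compiler just emits a
// call to Dispose() when fs leaves scope, i.e. the Dispose pattern in disguise.
void ReadConfig()
{
    System::IO::FileStream fs("config.txt", System::IO::FileMode::Open);
    // ... use fs ...
}   // compiler-inserted fs.Dispose() here; the memory itself is still
    // reclaimed by the garbage collector at some later point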


Success Through Communities

Over the past couple of days I’ve noticed hundreds upon hundreds of hits in my logs coming from www.skyscrapercity.com. After some analysis I determined that a user there rather rudely embedded an image from this site (a rather large picture of the Scotia Bank office tower in Toronto) in a discussion thread. Quite apart from the fact that the picture is being used unattributed (if it’s good enough to use, then it’s good enough to attribute), it’s basically silently stealing my bandwidth quota. Very rude.

When people have done this in the past I’ve surprised them with delightful and entertaining image alterations, but in this case I’m just going to ignore it and let the thread die down. After looking at the source of the traffic, however, I’ve been reminded of the most common, and most successful, pure-.com internet play: put up a site about some sort of fly-by information (for instance skyscraper diagrams), and then add discussion links. Soon enough you’ll have a robust community of users who share that interest, spending hours a day debating whether Chicago is a better looking city than Dubai. It seems like a pretty tenuous foundation for a community, but there it is.

Unintuitive Number Sequences

char *string_value = new char[32];

In computer science we’re quite accustomed to using powers of 2 whenever a numerical limit is required, e.g. the string can be 32 characters long, the filename can be 64 characters, while the number of entries in the listbox can number 1024.

These uses seldom require powers of 2 (e.g. while it makes sense for an ASCII string to be a multiple of 4 bytes if it’s long-aligned and you care about that, it could just as efficiently be 28 or 36 characters long), but nonetheless it’s ingrained into most developers’ minds.
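
In other words, there is nothing special about the char[32] above; something like this is every bit as legitimate, just less satisfying to type:

// Nothing magical about a power of 2 here; the allocator rounds the request
// up to its own granularity anyway, so 28 costs roughly the same as 32.
char *name_value = new char[28];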

I chuckled seeing the commercial for some overpriced timed-interval air freshener. It allows you to select 9, 18, or 36 minute intervals between sprays. While not exactly compliant (I’ll bet that it was originally 8, 16 or 32 minutes, but they added some lag to the minute counter to avoid it seeming computeresque), and in this case I can understand why the microcontroller developer chose powers of 2, the spirit of the power of 2 lives on.

Jamie’s School Dinners

Finally got around to watching this show while it was doing a marathon showing on the FoodTV channel here in Canada. Quite apart from the food (which is actually almost an overlooked element of the show; it is not a cooking show), this show is an excellent lesson in management. The lessons learned in dealing with peers, “employees” (the dinner ladies), and the kids are absolutely brilliant human-nature stuff that everyone should watch and absorb. Very highly recommended.


String Pooling in SQL Server

Several times over the past couple of years, in my role as a database consultant, I’ve come across very, very large databases where a large percentage of the data is redundant. For instance, consider the following two abbreviated tables:

Forms
FormsID (PK) nvarchar(255)

Hits
HitsID (PK) int identity(1,1)
FormsID (FK) nvarchar(255)
Time datetime

Imagine that there are only a dozen Forms values, each of them averaging about 30 characters in length (so 60 bytes or so, given that it’s Unicode). If you have a million records in Hits, that’s 60MB just for the form value itself. If you have one hundred million records, and a dozen large FKs like this, well, you get the picture. It vastly increases the amount of I/O required to do searches in the Hits table, and even if Forms is indexed it’s still much slower than it could be if Forms had an integer primary key.

While I personally wouldn’t lay out tables this way, it is an entirely credible and justifiable design: the designer simply decided to use a natural key rather than a surrogate key. Simplicity of design, and clarity of relationships when looking at the data, outweighed I/O concerns for this person/group. Such a design is not a question of normalization.

When you have a million-plus records it suddenly becomes a concern, though. There are ways to refactor this design, including “normalizing” the original table a bit and hiding it behind a view, and then adding INSTEAD OF triggers on the view; however, that is a leaky abstraction. The view does not completely mimic a real table, and operations like INSERT FROM fail, not to mention oddities with @@IDENTITY and SCOPE_IDENTITY().

Given all of this, I would love it if SQL Server had a behind-the-scenes method of collapsing redundant large field values into a hidden lookup table, similar to what Visual Studio does with string pooling. e.g. In this case it could replace FormsID with an internal value to look up against a tiny relational table. Obviously this should be manually configured, but it would be a relatively easy change that could tremendously improve a lot of existing database designs where a redesign isn’t a priority, but I/O costs are onerous.
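
To illustrate the sort of pooling I mean, here is a toy C++ sketch of string interning: each distinct value is stored once and everything else carries a small integer id, much like the hidden lookup table I’m wishing for. The class and names are invented and have nothing to do with any actual SQL Server feature:

// Toy string pool: each distinct value is stored once and callers keep a
// small integer id. Invented for illustration only.
#include <map>
#include <string>
#include <vector>

class StringPool
{
public:
    int Intern(const std::string &value)
    {
        std::map<std::string, int>::iterator it = m_ids.find(value);
        if (it != m_ids.end())
            return it->second;            // already pooled: reuse the id

        int id = static_cast<int>(m_values.size());
        m_values.push_back(value);        // store the long string once
        m_ids[value] = id;
        return id;                        // everyone else stores a small int
    }

    const std::string &Lookup(int id) const { return m_values[id]; }

private:
    std::map<std::string, int> m_ids;
    std::vector<std::string> m_values;
};

Applied to the Hits example, each row would carry a 4-byte id instead of a 60-byte nvarchar value, with the application never seeing anything but the original string.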

Apparently MySQL has something similar by way of enums; however, it is a fixed set (what I’d like is for new values inserted into the table to be automatically added to the behind-the-scenes set), and again there is some leakiness in the abstraction.