Tidy Your HTML with The Pragmatic Impurity of .NET

When Java first hit the development ecosystem, to many it wasn’t just a method of doing efficient, high-level development, but rather it became a new religion: You couldn’t use Java merely as the glue between existing code, or even as the overwhelming bulk of your solution. A partial-Java solution simply wasn’t good enough.

Instead your product had to be 100% Pure Java. The still sought-after eventual goal was a complete Java solution, from applications right down to the operating system, with only the smallest possible binary kernel, if even that. All of this would be running on a Java-aware processor, engineered specifically for Java.

Sun created a “100% Pure Java” campaign to push this philosophy, including banners and designations for appropriately certified software, and promoted it as a highly desirable moniker. Users were led to feel that mixed solutions were impure and somehow dirty: Are you some sort of nut running an impure solution, dirtied with some pointer-munging, buffer-overflow-vulnerable C code? While there were (and remain) methods to call native code, they were discouraged.

Of course there is a lot of validity to this agenda, the primary point being that pure Java solutions are theoretically cross-platform, with no ties to external technologies. Compare this to a solution leveraging C libraries, which would require a rebuild, or an available binary, for every distinct target platform. Additionally, Java could only impose its sandbox and extensive security constraints if you stayed in the world of Java, and thus callouts to native code represented a risk.

In the real world, though, it often meant that developers were constantly solving long-conquered problems, redundantly reinventing solutions in Java that had long existed elsewhere, or waiting until adequate libraries eventually appeared: Developers were pressured to use Java alone even when it was a hammer and the solution really needed a chisel.

Thankfully .NET hasn’t been pushed in such a single-minded way (even if some of its champions have foolishly taken up such a misled cause, including some at Microsoft. Instead of a justified part of the solution, it becomes a religion. .NET! .NET! .NET! .NET!), and indeed Microsoft itself has always facilitated, and even advocated, “impure” solutions. The majority of the .NET Framework, for example, is actually a very thin veneer over the existing Win32 facilities and libraries. It was either that, or version 1.0 would have come with a much smaller, much less efficient library.

The “orchestration layer over native code” implementation is the reason .NET hasn’t suffered the performance difficulties that Java has.

Microsoft chose to leverage what they’d already done, to maximize both performance and the breadth of the library.

This advantage isn’t limited to Microsoft, though, and the developer can use this functionality as well. .NET offers very simple COM and P/Invoke functionality to leverage “legacy” code (or even new code developed in a best-solution, non-.NET technology), allowing you to easily use your existing DLLs and/or COM libraries as first-class partners in your .NET solutions. Even if they’re created in “dirty” languages.

I take advantage of this functionality regularly, using existing best-solution libraries and functions regardless of whether they’re pure .NET or not. For instance, in creating the static version of the “best of” blog entries, I quickly (in maybe 2 hours) wrote a transformation tool that imported the “best of” RSS feed (it isn’t included in the normal category lists), ran some XSL transformations over it to produce XHTML, and created an index page. The transformations used extension objects in the XSL, given that XSLT alone wasn’t adequate for some special purposes, for instance HTML-decoding the description block of the RSS.

One goal when creating this solution was that the resulting pages be fully XHTML compliant and pass the W3C validity checks. While I could easily see how the pages rendered in Mozilla/Firefox/IE/Opera, and of course they all rendered fine, technically there were a couple of deviations from the spec. Some of these errors and warnings were caused by unavoidable transformation issues, while others were caused by minor mark-up errors in the original blog entries (partly my own errors when writing mark-up by hand, and partly Radio Userland’s “helpful” auto-“cleanup” of HTML. It is remarkable how often auto-formatting is detrimental).

HTML Tidy to the rescue.

I had several options with HTML Tidy, the easiest of which would be to ShellExecute out to the EXE, telling it to process an existing file. I could have taken more time and tried to make a managed C++ version of Tidy, but I really didn’t want to spend that much time.
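For comparison, the "shell out to the EXE" option could look roughly like the sketch below. It uses Process.Start rather than a raw ShellExecute call, and it assumes a tidy.exe on the PATH; the TidyShell class and its method names are hypothetical and not part of the utility described here.

```csharp
using System;
using System.Diagnostics;

static class TidyShell
{
    // Builds the command line for the Tidy executable.
    // Assumption: "tidy.exe" can be found on the PATH.
    public static ProcessStartInfo BuildStartInfo(string inputPath, string outputPath)
    {
        ProcessStartInfo info = new ProcessStartInfo();
        info.FileName = "tidy.exe";
        // -q: quiet, -asxhtml: emit XHTML, -clean: replace presentational mark-up,
        // -utf8: UTF-8 input/output, -o: output file
        info.Arguments = "-q -asxhtml -clean -utf8 -o \"" + outputPath + "\" \"" + inputPath + "\"";
        info.UseShellExecute = false;
        info.CreateNoWindow = true;
        return info;
    }

    // Runs Tidy over an existing file and waits for it to finish.
    public static bool Run(string inputPath, string outputPath)
    {
        using (Process process = Process.Start(BuildStartInfo(inputPath, outputPath)))
        {
            process.WaitForExit();
            // Tidy exits with 0 (no issues), 1 (warnings) or 2 (errors).
            return process.ExitCode <= 1;
        }
    }
}
```

This is simple, but it pays for a process launch per document and requires the input to exist as a file on disk first.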

I decided to have a bit more fun, not to mention build a more integrated, higher performance solution, and use the Tidy DLL from the micro-.NET utility. I grabbed the Tidy source code (Tortoise CVS is a great solution for this, in this case using :pserver:anonymous@cvs.sourceforge.net:/cvsroot/tidy), updated the included MSVC projects to Visual Studio 2005, and added them to the transformation utility solution. I set the Tidy DLL project output to the build directory of my .NET utility (in this case $(SolutionDir)\blogStatic\bin\$(ConfigurationName)). The MSVC build worked perfectly right away, which is amazing given that Win32 isn’t an officially supported build.

To reference the Tidy DLL methods, of course I had to add the DLL import signatures, in this case adding only the ones I had a need for.

  using System;
  using System.Runtime.InteropServices;

  // Managed mirror of the native TidyBuffer struct in the Tidy headers used here
  [StructLayout(LayoutKind.Sequential)]
  struct TidyBuffer
  {
    public IntPtr bp;         /**< Pointer to bytes */
    public uint size;         /**< # bytes currently in use */
    public uint allocated;    /**< # bytes allocated */
    public uint next;         /**< Offset of current input position */
  }

  class FileClean
  {
    [DllImport("libtidy.dll")]
    public static extern IntPtr tidyCreate();

    [DllImport("libtidy.dll")]
    public static extern int tidyParseFile(IntPtr tidyPointer, [MarshalAs(UnmanagedType.LPStr)] string fileName);

    [DllImport("libtidy.dll")]
    public static extern int tidyParseBuffer(IntPtr tidyPointer, ref TidyBuffer tidyBuffer);

    [DllImport("libtidy.dll")]
    public static extern int tidyCleanAndRepair(IntPtr tidyPointer);

    [DllImport("libtidy.dll")]
    public static extern int tidySaveFile(IntPtr tidyPointer, [MarshalAs(UnmanagedType.LPStr)] string outFileName);

    // tidyRelease returns void in tidy.h
    [DllImport("libtidy.dll")]
    public static extern void tidyRelease(IntPtr tidyPointer);

    [DllImport("libtidy.dll")]
    public static extern int tidySetCharEncoding(IntPtr tidyPointer, [MarshalAs(UnmanagedType.LPStr)] string encoding);

    // optionId is a TidyOptionId value; value is 0 (no) or 1 (yes)
    [DllImport("libtidy.dll")]
    public static extern int tidyOptSetBool(IntPtr tidyPointer, int optionId, int value);

    public static bool CleanFile(System.String outputfileName, System.IO.MemoryStream docDataStream)
    {
      int result = -1;

      IntPtr tidyPointer = tidyCreate();
      try
      {
        // We want the resulting file to be UTF8 encoded
        tidySetCharEncoding(tidyPointer, "utf8");

        byte[] docDataArray = docDataStream.ToArray();

        // Describe the managed byte array to Tidy as an input buffer
        TidyBuffer tidyBuffer;
        tidyBuffer.size = (uint)docDataArray.Length;
        tidyBuffer.allocated = (uint)docDataArray.Length;
        tidyBuffer.next = 0;

        // Pin the array so the garbage collector cannot move it while
        // native code holds a pointer to it
        GCHandle pinHandle = GCHandle.Alloc(docDataArray, GCHandleType.Pinned);
        try
        {
          tidyBuffer.bp = Marshal.UnsafeAddrOfPinnedArrayElement(docDataArray, 0);

          // Negative return values indicate a severe error
          if (tidyParseBuffer(tidyPointer, ref tidyBuffer) >= 0)
          {
            tidyOptSetBool(tidyPointer, 29, 1);   // TidyMakeClean
            tidyOptSetBool(tidyPointer, 23, 1);   // TidyXhtmlOut
            if (tidyCleanAndRepair(tidyPointer) >= 0)
            {
              result = tidySaveFile(tidyPointer, outputfileName);
            }
          }
        }
        finally
        {
          pinHandle.Free();
        }
      }
      finally
      {
        tidyRelease(tidyPointer);
      }

      // 0 means Tidy reported no problems; values above 0 mean it
      // reported warnings or errors, and are treated as failure here
      return (result == 0);
    }
  }

Most of this should be self-evident, though the two tidyOptSetBool calls may be a little cryptic. For the sake of brevity I haven’t used the constants, but 29 is the TidyMakeClean value of the TidyOptionId enum (see tidyenum.h), and 23 is the TidyXhtmlOut value. Together these indicate that I want to clean the document, converting it to XHTML. Note that I’ve also set the encoding to UTF8.
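If the magic numbers bother you, a small managed enum makes the calls self-describing. The sketch below is one hedged way to do it: it declares only the two values quoted above, and those numbers come from the Tidy build used here, so check them against your own tidyenum.h because the enum has shifted between Tidy versions. The TidyOptions class and its Enable helper are hypothetical names.

```csharp
using System;
using System.Runtime.InteropServices;

// Only the two options used in this article; the numeric values are the ones
// quoted in the text and must be verified against the tidyenum.h you build with.
enum TidyOptionId
{
    TidyXhtmlOut = 23,
    TidyMakeClean = 29
}

static class TidyOptions
{
    // An enum parameter marshals as its underlying int, so the native
    // signature is unchanged from the int-based declaration.
    [DllImport("libtidy.dll")]
    static extern int tidyOptSetBool(IntPtr tidyPointer, TidyOptionId optionId, int value);

    // Hypothetical helper: switches a boolean option on and reports
    // whether Tidy accepted it (non-zero return).
    public static bool Enable(IntPtr tidyPointer, TidyOptionId optionId)
    {
        return tidyOptSetBool(tidyPointer, optionId, 1) != 0;
    }
}
```

With that in place, the two calls would read TidyOptions.Enable(tidyPointer, TidyOptionId.TidyMakeClean) and TidyOptions.Enable(tidyPointer, TidyOptionId.TidyXhtmlOut).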

Voila, after transforming the RSS to the memory stream as quasi-conformant HTML, I passed the stream to this function, along with the desired output filename, and out went a cleaned-up, valid XHTML document. Pedants everywhere were thwarted from pointing out minor deviations from the standard. I could have processed to another buffer, and then done follow-up processing in .NET as well, but this was sufficient.

This is a trivial example, but it really exemplifies the great value of the easy interoperation of .NET. With it I could instantly leverage existing code, without having to search out bastardized ported versions, and instead could go right to the source.