[EDIT: 2005-09-08 – There seems to be a bit of confusionregarding this post, so I should clarify – These are neither uniqueor unsolvable problems. One could, for instance, achieve both ofthese tasks via hackery, as I mentioned below. I could of courseuse triggers and procedural logic and duplicitous columns to storereversed sets and decomposed strings. However that is thewrong solution, and pollutes my database withhacks to get around fundamental limits inFull-Text indexing]
Recently I’ve come across the need to do partial string searcheswithin very large sets of data (e.g. 10 large columns of both ntextand nvarchar types, in a table contains hundreds of thousands,or even millions, of records).
For example I’d like to look for 0505 within a table,returning rows containing that value in any of thesearched columns. For instance a row where one of the valuescontains REC995850505293. I could do thisthe bulk force way by doing a LIKE ‘%0505%’ against all ofthe columns, however that’s terribly inefficient, and will bringthe largest of servers to their knees with the volumes of data thatI’m talking about.
Of course the immediate solution one might imagine would be SQLServer’s Full-Text Indexing, or even a third-party tool likeApache’s Lucene full-text search. The problem is that both of thesesearch engines can only match from the beginning of a word onwards(or, with a thesaurus, word variants). For instance they can searchform REC9958*, returning REC995850505293,but they cannot search for *50505293. Because the index isordered based upon the beginning of the word, it can only matchnon-beginning partial words through a full-scan, which is of nohelp at all.
From a technology perspective this is understandable, howeverthere are a couple of pretty simple improvements to full-textindexing that would greatly improve their usability (albeit at thecost of storage and additional search maintenance processing, butthat should be a choice that a user can decide).
- Allow for the configuration of columns to be stored bothforwards and backwards. For instance on the back-end the columnREC995850505293 would be stored as REC995850505293,but also as 392505058599CER. The value, of course, is that nowif I search for “*05293“, it can reverse thesearch phrase to 39250* and compare it against the reversed set,quickly finding a match. Double the storage space, but a vastlymore usable index. As it is you can hack this sort of thingyourself, storing mirrors of all of your fields and then runningyour queries through a processor to know what to invert, but it’s alot of hackery for something that should be a back-endsolution.
- Allow for the configuration such that all sub-parts of a wordare also indexed. e.g. REC995850505293 is also stored asEC995850505293, and C995850505293, and995850505293, and so on. Clearly the consumption in spacewould be tremendous, but there are scenarios where this would beextremely valuable. Now a search for *0505* and it can find thepartial word matches, linking back to the primary word fromthere.
If anyone has any ideas of places where I might look forsolutions to this sort of problem, please drop me a line.