Examples given in this series will reference the sample database “Northwind”. It can be installed via the script found on the Microsoft website
The Basics We’re All Professionals Here!
Much of the material covered in this outing will be old hat for a lot of developers, but is nonetheless worth a recap — Even among the pros there remains misunderstandings and conflicting information about the fundamentals of databases, and the true magnitude of impact they have on systems. I’ve intentionally authored this series conversationally, as opposed to a “high impact hit list!”, however if you’d just like a brief summary list you can find one at the end.
This needs to be addressed as there is a growing camp of “data wasters” that erroneously believe that the larger the amount of waste, the more Enterprise Ready a solution is (capitalization used derisively). I step into this quagmire knowing full well that this section will yield me some “you’re a dummy!” responses from some
newly earned adversaries, however that’s a price I’m willing to pay if I can save but one byte tree.
Minimize the size of your data. Don’t use GUIDs where they aren’t necessary (e.g. where you don’t really need global uniqueness/replication), and don’t use a bigint where an int or a smallint will suffice. Don’t use a smallint where a tinyint will suffice or an nvarchar where a varchar would be fine. Use the smallest type that is reasonable for the field. Don’t invent vague packing technicalities or native type size issues in an attempt to justify oversized data.
Of course you should plan for realistic growth, and I’m not advocating that you use a tinyint to store your CustomerID field, but keep the rational, real world in mind when designing your applications – are you really going to exceed 2 billion users Is there going to be more than 32767 languages in your
application Is it likely that we’re going to a new calendaring system that might have 2 billion months?
Evaluate if it might be an acceptable compromise to simply use these large types in your façade while actually storing smaller types in the actual database. This would give you improved performance, and would allow you to easily upsize your data types in the future in the unlikely event that it becomes necessary.
Clearly there are cases where large data types are legimately warranted, however too many database architects abscond themselves of any responsibility for efficiency by making everything a GUID or a bigint “just in case” (GUIDs have a substantial creation cost as well, in addition to the obvious storage and I/O costs. While GUIDs once used the available network card MAC address as the foundation and generally sequentially increased in value on each new GUID, in current Windows variants GUIDs are basically random numbers — used as a clustered primary key they can lead to endless data reordering. NOTE: See Sequential GUIDs in SQL Server for solutions for this problem).
Why does it matter In real enterprise apps before you know it there are tens or hundreds of millions of rows throughout your database, and these rows need to be read from, and written to, the glacially slow storage subsystem constantly — given this finite resource, doesn’t it seem logical that 1MB of I/O carrying
20,000 records is better than 1MB of I/O carrying only 5,000 records Of course it is. Isn’t it better that 10% of your database can fit in the memory cache rather than just 5% Of course.
Don’t be lulled into a false justification of large data types by running ridiculously small benchmarks, where all of the data exists in the memory cache and the I/O is dwarfed by the computational element of the query, yielding “only” a performance hit of 10% or so with larger types — when your database gets to real enterprise size, size really does matter. That cluster indexed GUID primary key not only makes the row bigger, it makes every non-clustered index bigger (and thus slower) as well, and when the weakest link SAN is running at 100%, you’ll regret every wasteful byte.
Many SQL Server performance problems are rooted in missing or inappropriate indexes, or alternately unused indexes. This is often true for databases thrown together by front-end experts unhappily tasked with supplying the back-end database, just as it’s often the case for those carefully crafted by highly-focused database professionals.
An understanding of indexes, and a focus on their application, is paramount for high performance databases. Not only is it critical to create the right indexes, it’s important to craft your access to properly utilize the indexes that are there.
A Bookshelf Full of Examples
Indexes in reference books serve the same purpose (and share many of the same traits) as those in the database world — by referencing the index you can follow a shortcut to a particular piece of information, seeking directly to that specific page, versus going from page to page scanning the contents. In the SQL
Server world these sorts of indexes are called non-clustered indexes (or secondary indexes) — they are a subset of the data ordered for a specific purpose, containing a pointer to where the real data row can be found.
In the Northwind database an example of a non-clustered index is ShipPostalCode on the Orders table. This index sorts by the ShipPostalCode, and may be used for a query such as the following.
SELECT * FROM Orders WHERE ShipPostalCode = '05022'
If you take a look at the execution plan (Ctrl-K or choose the option “Show Execution Plan” under the Query drop-down in Query Analyzer. When you execute a query with this option enabled a new tab, Execution Plan, will appear beside the results tab), you can see that an index seek took place, and then a bookmark lookup to
find the actual data row. If we run SET STATISTICS IO ON on the connection before running the above query, we’ll get some statistics on the IO used to satisfy the query, which will be as follows.
Table 'Orders'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0.
Compare this to the following query, which simulates having no index.
SELECT * FROM Orders WITH(INDEX(0)) WHERE ShipPostalCode = '05022'
In this case the execution shows a full table scan, and our IO statistics reports the following.
Table 'Orders'. Scan count 1, logical reads 21, physical reads 0, read-ahead reads 0.
Without the index there was significantly more I/O, and the differential would be vastly worse if this were a large enterprise-sized table instead of a small sample table. To make matters much, much worse such a table scan will block on every single exclusive page or row lock on the table, waiting for the data to be unlocked just to be able to verify if the wanted information is contained within, while the index seek knows that the locked data isn’t the data that it’s looking for and is unaffected. Try running the above two queries in two separate query analyzer windows while the following script is running in yet another (increase the WAITFOR delay if you can’t jump between them all within 30 seconds).
UPDATE Orders SET OrderDate = '1997-08-27'
WHERE OrderID = 10647
WAITFOR DELAY '00:00:30'
The first query, using the index, instantly returns the result regardless of the row lock, which is logical given that the row being locked is not what the query is looking for, while the second query, not using the index (which can happen because no index exists, or the index isn’t deemed the best choice) blocks until the
other connection’s lock is released. Not using an index is not only vastly less efficient, it can significantly worsen blocking problems as a database scales (or rather tries to scale).
Returning to our analogy, many books take it a step further and order the content, making the data itself an index of sorts. A cookbook might sort by main category and then dish name, while a phone book famously sorts by [city, last name, first name] (in that order). Thus if you want to search based upon the sorted data, it’s extremely efficient — in the example of the phone book you can very quickly seek to the desired city and last name, scanning a small number of records for the desired person. In the SQL Server world this sort of index is called a clustered index (the sorting of the data itself), and for obvious reasons you can only have one clustered index in a book, or on a table. The primary benefit of a clustered index is that all of the table data is immediately available for every index match — no dereferencing is necessary.
In the Northwind database consider the following query.
SELECT * FROM Orders JOIN [Order Details] ON Orders.OrderID =[Order Details].OrderID WHERE Orders.ShipPostalCode ='05022'
If you look at the execution plan, the Order Details data is grabbed via a very efficient clustered index seek.
Clustered indexes aren’t all milk and honey, though. For instance imagine that you’re the hard working typesetter maintaining the layout of the phone book, and you’ve carefully arranged all of the entries on the respective pages. Every time a new entry comes in and doesn’t coincidentally fit right at the end of the sort order, or someone changes the information on an existing row in a way that alters the sort order (“Smith” changes his name to “Jones”), you need to reorganize some pages to make space. This same data-churn problem occurs with both non-clustered and clustered indexes, but clustered indexes exacerbate the problem given that it contains the entirety of the row data.
Of course you could plan for this by keeping a bit of blank space on each of your pages to facilitate at least a couple of changes, which is what the fillfactor is used for in SQL Server (a lower fillfactor leaves more empty space but reduces the true data density – insert performance is improved, but read
performance is diminished. A high fill factor increases the real data density and thus read performance, but increases the likelihood of inserts requiring page splits. Note that fillfactor only applies on index creation, and whenever you defragment/rebuild your indexes), however this can be a serious performance issue for
out-of-order inserted data, or frequently changing cluster indexed fields. It’s for this reason that many developers use the monotonically increasing identity field as their cluster index. Historically there was a worry that having multiple inserts all going to the same “place” in the index, at the end, would lead to scalability killing contention at this hot spot. SQL Server has logic to deal with identity fields and effectively eliminates this hot spot issue.
Another problem with clustered indexes is their girth (they contain the entire row data). This is largely irrelevant if you’re seeking to specific records, or where you actually plan on using all of the data after a lookup, however if you are querying a range of data (for instance the first name of all of the people with the last name “Forbes” in the city of “Oakville”) the query engine will read in the entire row contents for each matching record using a range scan, extracting only the requested data. In our example phone book there is so little extra data that it’s a minor overhead, however in many large real-world tables this can have a
serious performance impact.
Consider if instead we had a secondary index that was sorted by City, Last Name, First Name, the query engine could very efficiently scan past only the small index entries.
This brings up a very important point — Some indexes contain enough information that you don’t even need to go to the content, your query being satisfied by the index itself. Consider an index in a tour book that sorts famous attractions by their name, and the country and city that they were located in, pointing
to the page where further information regarding it could be found. If you just want to know what city the Accademia dell’ Arte del Disegno in Italy is found in, a quick seek through the index will tell you that it’s in Florence. In this instance the index was a “covering index”, in that it fully covered our request and we
didn’t need to dereference to the complete topical information. This is often the most efficient query mechanism of all.
Consider the following variation of a query we ran earlier.
SELECT OrderID FROM Orders WHERE ShipPostalCode = '05022'
This will efficiently use the ShipPostalCode index to find the specific record, and because the query is fully satisfied by the index itself, the costly bookmark lookup is avoided, and IO is minimal.
Table 'Orders'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0.
The observant will note that OrderID isn’t actually in the index ShipPostalCode, or at least it doesn’t appear to be. The trick is that all of the fields that are the sort fields for the clustered index, if one is defined, are automatically added as data fields to every other index on the table. This can be blessing in cases where you want one of the clustered fields and suddenly the non-clustered index is a covering index, but it also needs to be weighed against the fact that it makes every other index larger, and thus less “data dense”.
Small covering index seeks are the most efficient method of pulling data, and it’s a good reason to ensure that you are only pulling the specific fields that you actually need from any given table, preferably with a covering index. Range scans are also highly efficient in many situations, and are usually used when the
wanted entries in an index are consecutive, such as when you search BETWEEN two dates against a date index, though because of the previously mentioned bookmark lookup costs range scans are generally only seen if the index is fully covering, or against the clustered index.
Even in the case where a full scan is necessary, indexes might still be fully covering. Consider if the following index were added to the Customer table.
CREATE INDEX CountryCity ON Customers(Country,City,Address)
This index of course contains Country, City, Address, but as mentioned above it also contains CustomerID because it’s in the clustered index. Of course if we query on Country, or Country and City, or Country, City, and Address, an efficient seek or range scan might be used to pull the matching records.
SELECT CustomerID FROM Customers WHERE Country = 'Canada'
What if instead we wanted to search only on the address In that case the index can’t be searched in order because it sorts by country and then city and address, and thus a particular address could exist anywhere in the index.
SELECT CustomerID FROM Customers WHERE Address = '43 rue St. Laurent'
You might be surprised to see that it still used the index, albeit this time it’s an inefficient scan rather than a seek. The index was used because it covered the query (all predicates and returned columns), and because the index is only a subset of the data it requires less I/O to scan the entire index than it does to
scan the entire table data.
Statistics and Bookmarks
In a prior example we ran a query that required a bookmark lookup to satisfy the query (thus it did not have a covering index). The query was as follows.
SELECT * FROM Orders WHERE ShipPostalCode = '05022'
If you look at the execution plan for this query you can see that it seeks the “lookups” in the index, and then does a bookmark lookup against the clustered index (which is the actual table data). In this case there is only a single row to return, but even still the bookmark lookup cost is estimated to account for 50% of the cost of the query.
The cost of bookmark lookups, where an item is found in the index but it isn’t a covering index, is the reason why many people are surprised to find that SQL Server has ignored what they believe are perfect indexes and instead table scanned (“Why isn’t it using my index! ARGHHH!!!”). Consider the following query.
SELECT * FROM Orders WHERE ShipPostalCode = '24100'
Looking at the query plan you can see that it actually did a cluster index scan (which is a table scan on a table with a clustered index) instead of using our index, and the subtree cost is 0.0530.
This might seem perplexing because we seem to have a perfectly satisfactory index, however let’s do the same query again, this time using a query hint to force it to use our index.
SELECT * FROM Orders WITH(INDEX(ShipPostalCode)) WHERE ShipPostalCode = '24100'
If you look at the query plan you can see that it used our index, as demanded, but this time the bookmark lookups account for 80% of the query time. Our subtree cost comes in at 0.0564 — more than it was doing a table scan!
In this case this was only 10 records of a total of 830 (1.2%) yet still it opted to do a full table scan rather than using our index. Many developers have been perplexed in this situation, wondering why SQL Server was avoiding their beautiful index, but it’s doing it for a very valuable reason – it was cheaper than
indirectly looking up each piece of data through bookmarks.
Of course we could have avoided bookmark lookups by using the index as a covering index if possible by using a query like the following, presuming this was all the data we needed to extract from the table.
SELECT OrderID FROM Orders WHERE ShipPostalCode = '24100'
Now it uses our index, is super efficient, and has a subtree cost of only 0.0064. In larger databases the difference can be the tremendous by avoiding both the bookmark lookup and the table scan.
So how did the query engine guess how many rows would match a criteria to choose which method (whether by index and bookmark lookup, or table scan) to use to most efficiently satisfy the query That’s where something called distribution statistics comes into play. Statistics are a representative set of the data that are
used by the query engine to make a best-guess plan for how to most efficiently serve the data. You can view the statistics for a given index, in the following case for the index ShipPostalCode, via the DBCC SHOW_STATISTICS command.
Due to the limited number of discrete ShipPostalCode values in the table, the statistics are entirely accurate in this case. In a more realistic database, with thousands or millions of rows, statistics start to become much more of an estimation (with an ever increasing margin of error). These estimations can lead to entirely wrong assumptions by the query engine in some edge cases, such as where it thinks a given set of predicates will yield thousands of rows when really it might yield only a couple.
Statistics can also fail for multi-field indexes. In this case the selectivity of the first field is used, so in the case of the index we created earlier (Country, City, Address), due to the fact that the country has a low selectivity (there are lots of entries for each country), the index will often be ignored, even though the city and address combination is highly unique. For this reason it is generally recommended that the most selective field comes first in your index, so in the case of that index the fields would be Address, City, and then Country. This is debateable because it also makes the index more single purpose — it no longer serves an efficient purpose for less-granular searches like just Country, or Country/City. This needs to be evaluated on a case by case basis, and truly wouldn’t be an issue for fully-normalized tables.
It should also be noted that ensuring that your statistics are as accurate as possible is critical. SQL Server includes automatic statistic updates, on by default, where it will attempt to do data sampling and update statistics when it feels they are out of date. Nonetheless it is a best practice to schedule full statistic updating at regular intervals (at a minimum weekly), preferably using the WITH FULLSCAN option so it is as accurate as possible. The standard database maintenance plan includes a step for statistic updating, and allows you to choose the amount of data to sample.
Regardless of all of the above, there will be cases where you may find that your statistics are up to date, your indexes are optimal, yet SQL Server is still incorrectly choosing not to use your index. In this case it may be an unfortunate reality that an index hint needs to be added to the query to politely (or rather
sternly) request that it reconsider. Obviously this should be a last resort.
Actually Using Indexes
So you’ve created beautiful indexes, and you’ve ensured that your query only pulls the necessary data from each table, using covering indexes where possible to avoid costly bookmark lookups. You pull up the execution plan to find
that the query engine is entirely ignoring your index. There are several reasons why this
Consider the following query.
SELECT OrderID FROM Orders WHERE LEFT(ShipPostalCode,4) = '0502'
Fairly simple query, and from the looks of it one might think it’d be an efficient covered index seek. Upon execution you’ll discover that actually it was an inefficient scan. Consider the following instead.
SELECT OrderID FROM Orders WHERE ShipPostalCode LIKE '0502%'
In this case the query is executed as a highly efficient index seek. I have had cases where this tiny difference reduced an enterprise report from running for literally hours to a matter of seconds.
The reason is that in the former the indexed field was hidden within a function. The query engine can’t predict what the result of the function will be, so it’s forced to evaluate it for every row to see what pops out. LIKE is a first class comparison, it knows how it behaves, so the query engine can actually optimize
against it. There are countless cases where people hide criteria fields in functions unnecessarily, and the result is massive, unexpected inefficiency.
The most common example of this mistake is using DATEADD/DATEDIFF to pull rows within a certain period of time — instead of pre-calculated a fixed demarcation (i.e. precalculating GetDate() – 3 years) and then doing a direct comparison with the row data, developers are forcing whole table scans with wasteful date computations on every single row. For instance consider a query to report news items that have occurred within the past 12 hours from a hypothetical news table .
SET @CurrentTime = GetDate()
SELECT * FROM NewsStories WHERE
Guaranteed to be terribly inefficient, yet it’s overwhelmingly common. The query engine can much more effectively optimize the following variant.
SET @StartTime = DATEADD(hh,-12,GetDate())
SELECT * FROM NewsStories WHERE NewsDate > @StartTime
Database Cheat Sheet
Indexed Computed Columns
At the outset I advocated that you minimize space usage (increasing real data density). The goal wasn’t to try to fit that database on a floppy disk, but rather to minimize the amount of I/O necessary to satisfy a given query, as I/O is the weakest link of most enterprise systems. There are design choices, such as adding
additional indexes, that actually increase the size of your database on disk yet reduce the I/O necessary for certain queries, and these are usually very worthwhile trade-offs.
Another powerful technique you can use to trade disk space for improved database performance is indexed calculated columns. There are countless variations, but I’ll cover one scenario that is fairly commonly used — report counts by month for a given year. In the case of the Orders table this could be achieved via
the following query.
SELECT YEAR(OrderDate) AS [Year], MONTH(OrderDate) AS
[Month], COUNT(*) AS [Monthly Orders]
GROUP BY YEAR(OrderDate), MONTH(OrderDate)
Instead of adhoc decomposing the date into month and year constituents, consider adding them as computed columns.
ALTER TABLE dbo.Orders ADD
OrderDateYear AS CONVERT(smallint,YEAR(OrderDate)),
OrderDateMonth AS CONVERT(tinyint,MONTH(OrderDate))
Now we can change our query to the following.
SELECT OrderDateYear AS [Year], OrderDateMonth AS [Month],
COUNT(*) AS [Monthly Orders]
GROUP BY OrderDateYear, OrderDateMonth
By itself we’ve done nothing for the query efficiency (in fact it is actually less efficient as it’s applying the where predicate after building the set), though we’ve achieved a bit of “code re-use”. However we now have the foundations for some powerful indexed computed columns.
CREATE NONCLUSTERED INDEX IX_OrderDateDecomposed ON dbo.Orders
) ON [PRIMARY]
Now the query referencing these computed columns is dramatically more efficient. Even better, these indexed computed columns haven’t decreased the real data density of the table because they’re only materialized in the index. NOTE: Ensure that queries that don’t need these computed fields don’t pull them explicitly or implicitly via the wasteful * column selector, as it’ll unnecessarily calculate each of the computed fields for each row.
[This functionality only exists in the Enterprise and Developer edition of SQL Server 2000]
An indexed view, sometimes referred to as a materialized view, is a sort of indexed computed columns on steroids, taking the idea of storing computed results to the next level. In a previous example we improved the performance of some data aggregation logic by grouping on some indexed computed columns. We can take a variant
of that and create a view out of it.
CREATE VIEW dbo.OrdersByMonth
SELECT OrderDateYear AS [Year], OrderDateMonth AS [Month],
COUNT_BIG(*) AS [Monthly Orders]
GROUP BY OrderDateYear, OrderDateMonth
As it is the view is acting as nothing more than a template for queries against the underlying table, and is only of benefit for code reuse (which itself is a very worthwhile goal). We can take that a step further by materializing this view so the actual results are stored, and changes to the underlying table is
automatically reflected in the aggregates. The following command creates the indexed view.
CREATE UNIQUE CLUSTERED INDEX IX_OrdersByMonth ON OrdersByMonth(Year,Month)
Now the following query will be satisfied by the indexed view, using the pre-computed values, rather than recalculating for every query.
SELECT [Year],[Month],[Monthly Orders] FROM OrdersByMonth WHERE Year=1997
In the end the resource usage for our monthly order count query has dropped by about 90% over the original query, and this is for a tiny sample database. In the real-world the differential can be extraordinary.
Indexed views do have some downsides, such as the automatic maintenance that occurs whenever the underlying data changes, however they can be an extraordinarily powerful tool in your arsenal and deserve further research if your platform supports it.
- Keep your rows as small as possible to maximize real data density.
- All tables should have a clustered index except in rare exception situations. Normally this will be a single field primary key.
- Small clustered indexes keep non-clustered indexes small, increasing real data density.
- Clustered indexes help make other indexes covering indexes.
- Queries serviced by covering indexes are extremely efficient.
- Avoid hiding criteria fields in functions — indexes will not be used for them.
- Consider indexed computed columns where appropriate
- If you shelled out the cash for Enterprise Edition, seriously evaluate how indexed views fit in your solutions
- Index, index, index! Only in extremely rare cases are the additional update and insertion costs associated with maintaining indexes heavier than the benefits.
- Understand execution plans. Evaluate them regularly.