Your Query Planner and You

An interesting post, “Use Subqueries to Count Distinct 50X Faster”, caught my eye on Hacker News this morning, the submission detailing ways to coerce your query planner into more efficient execution strategies.

In their case they achieved a remarkable 68x performance increase. They are using PostgreSQL, normally an excellent database platform with good execution plans, but one that apparently has a known weakness with DISTINCT clauses.

Several people have asked how the strategies port to other database platforms, so given that I’ve been evaluating SQL Server 2014 CTP2 for a deployment upgrade, I thought I’d detail how it dealt with this particular scenario (which will also let me reuse the scenario in a piece about In-Memory OLTP, so goodness there), and how broadly one can apply the “divide and conquer” approach to query optimization (the article presents it as a universal approach).

To get things started, as with the Digg case, I can’t replicate their scenario exactly, so based on their stated details I put together a simple script to create the schema and then dump in n users, n dashboards, and n visits between the two. As written it sets up 100,000 users, 100 dashboards, and 100,000,000 dashboard visits, though if you play with it you can modify those proc calls near the tail end.

The dashboard visit data is populated along a power curve to more closely represent likely data populations (e.g. a long tail). In my case this yielded some 7,986,868 distinct user_id/dashboard_id pairs in time_on_site_logs. Obviously in the real world time_on_site_logs would have more columns (e.g. time, id, etc.), however given that the query will naturally use the dashboard_id/user_id index and leave the heap table alone, such a difference is irrelevant.

schema
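The script has the full details; for orientation, a minimal sketch of the relevant tables and the index the queries lean on might look something like the following (the column types, identity keys, and index name are assumptions on my part, and only the columns the queries actually touch are shown).

CREATE TABLE dbo.users
(
    id INT IDENTITY(1,1) NOT NULL PRIMARY KEY
    -- remaining user columns omitted; nothing below touches them
);

CREATE TABLE dbo.dashboards
(
    id   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    name NVARCHAR(100) NOT NULL
);

-- Deliberately left as a heap; a real table would carry more columns (time, id, etc.)
CREATE TABLE dbo.time_on_site_logs
(
    dashboard_id INT NOT NULL,
    user_id      INT NOT NULL
);

-- The compound index the queries naturally use, leaving the heap alone
CREATE NONCLUSTERED INDEX IX_time_on_site_logs_dashboard_user
    ON dbo.time_on_site_logs (dashboard_id, user_id);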

This is all running on a pretty meager quad-core desktop PC (humorously the same one I used for that Digg scenario years ago; a Core i7-950 holds up remarkably well), with the database served from magnetic disk and SQL Server restricted to 2GB of memory.
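If you want to reproduce the memory restriction, one way to cap SQL Server at roughly that figure is via sp_configure (2048 MB here to match the setup above):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 2048;
RECONFIGURE;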

Running their initial query (modified slightly only to alias the count column so it can be referenced in the order by), after priming the cache, yielded results in some 31 seconds, with a subtree cost of 848.096 (a sort of fuzzy “overall cost” metric provided by SQL Server). As they experienced, it indeed joins the log and dashboard tables early, then works with the combined data, doing significant and seemingly unnecessary work on hundreds of MB of data.


select
    dashboards.name,
    count(distinct time_on_site_logs.user_id) as c
from time_on_site_logs
join dashboards on time_on_site_logs.dashboard_id = dashboards.id
group by name
order by c desc

The query plan is a bit large to graphically include here, however those with Management Studio can view it (after decompressing). It is very similar to the EXPLAIN plans they posted on HN.
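For anyone following along, the timings and plan details quoted here are easy to capture yourself; a minimal approach in Management Studio is to turn on the runtime statistics and actual-plan output around whichever query you are measuring:

SET STATISTICS TIME ON;  -- elapsed and CPU time per statement
SET STATISTICS IO ON;    -- logical/physical reads per table
SET STATISTICS XML ON;   -- returns the actual execution plan alongside the results

-- ... run the query under test here ...

SET STATISTICS XML OFF;
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;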

It isn’t a terrible execution plan, but it is working with a lot of data it doesn’t need to.

Now jumping right to their most optimized version.

select
    dashboards.name,
    log_counts.ct
from dashboards
join (
    select distinct_logs.dashboard_id,
        count(1) as ct
    from (
        select distinct dashboard_id, user_id
        from time_on_site_logs
    ) as distinct_logs
    group by distinct_logs.dashboard_id
) as log_counts
on log_counts.dashboard_id = dashboards.id
order by log_counts.ct desc

With this variant, results are returned in about 19 seconds, with a subtree cost of 370.101. That is slower than their best case (though the worst case ran much faster than their worst case), and about 60% faster than the original on SQL Server. I suspect the cardinality of sets is much higher in my case, coupled with the fact that they are likely running on a beefy server (the author notes the best case taking 21 seconds on their laptop, which is 20x+ slower than the results given in their original benchmarks).

So it remains a fairly expensive query given the entropy of the data and the number of rows, at least when run on a desktop, but the new query plan is much more logical and simplified, and the query planner does not bring in the dashboards data until after the aggregates have been calculated and the working set reduced.

So in this case it seems that hand optimizing was effective for SQL Server.

Yet that is something I generally advise people to avoid unless they know specifically what gap in the query planner they are working around: in many cases they end up writing nasty SQL that actively works against performance, not only immediately, but also against the ability to analyze and improve via structural changes to types, indexes, fill factors, and so on.

This becomes more pronounced over time, as data entropy changes, indexes are added, removed, or modified, and the underlying technology and its trade-offs change (e.g. optimizing to minimize IO and maximize CPU usage can fall apart when you get the new server with a monster PCI-E flash storage device). When you optimize for a specific static situation, you rob the query planner of the ability to adapt to changing situations. You also make it much more difficult for future developers (and that includes future you) to understand what exactly you were trying to do.

I’ve seen my share of terrible, terrible SQL over the years, much of it created under the flawed notion that the author is smarter than the query planner: that they know more about how to generate the results, and must actively treat the planner as an antagonist to work against. Worse, it lets the author avoid adding the database artifacts that would allow the query planner to build optimal plans, the surrogate strategies instead codified in every call.

Simple, singular set operations that could be fulfilled efficiently end up being spun into dozens of inserts into temporary tables, this reductionist “divide and conquer” approach depriving the query planner of its ability to coalesce and optimize. These databases end up becoming junkyards of derelict, byzantine “optimized” SQL that no one wants to touch.

I do not in any way mean to disparage the author of the linked post, as they are working around a very real issue that some encounter in PostgreSQL; however, in general you should not attempt to out-think the query planner. Doing so is almost always self-destructive, and it usually hides some fundamental flaw in your usage of the database.

Speaking of which, go back to that original query for a moment.

select
    dashboards.name,
    count(distinct time_on_site_logs.user_id) as c
from time_on_site_logs
join dashboards on time_on_site_logs.dashboard_id = dashboards.id
group by name
order by c desc

Note the group by: the query is told to group by name. What if multiple dashboards share the same name? That is quite clearly not the intended behavior (the hand-optimized version keeps each row unique by dashboard_id).
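If duplicate names really were possible and per-dashboard counts were the intent, the straightforward fix would be to group on the key as well, something like this sketch:

select
    dashboards.name,
    count(distinct time_on_site_logs.user_id) as c
from time_on_site_logs
join dashboards on time_on_site_logs.dashboard_id = dashboards.id
group by dashboards.id, dashboards.name
order by c desc

Here, though, the names are in fact unique, so rather than rewriting the query we can simply tell the database so.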

Let’s add one single index to the dashboards table.


CREATE UNIQUE NONCLUSTERED INDEX [IX_dashboards_name] ON [dbo].[dashboards]
(
[name] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

This is a unique index on the dashboard names. It guarantees that a given dashboards.name maps to exactly one dashboards.id, so under no circumstance does the query planner have to worry about coalescing multiple dashboard_ids because the other side of the join contains names that compare as equal under the current collation.
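As a quick sanity check of that guarantee (using the hypothetical identity-keyed dashboards table from the sketch earlier), a second insert of an existing name is now rejected outright, which is precisely the fact the planner gets to lean on:

INSERT INTO dbo.dashboards (name) VALUES (N'Revenue Overview');  -- succeeds, assuming the name is new
INSERT INTO dbo.dashboards (name) VALUES (N'Revenue Overview');  -- fails with a duplicate key error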

If we run that query again, you might be surprised by the results.

A seemingly irrelevant unique index added on dashboards.name, not even used directly by the query, and suddenly the query planner is running a very different plan. Indeed, if you compare this plan with the one produced by the hand-optimized version, they are literally identical: the same runtime, the same subtree cost, and exactly the same steps.

The knowledge of the data that was baked into the hand-optimized query is now part of the data model itself, benefiting any and all queries that touch the same elements. This is important because it means every place the same pattern occurs (which is generally what happens with a given schema) yields the same advantages without any extra coercion.

Tell the database everything you know about the data to help it work with you instead of against you.

If a column is unique, flag it as unique.

If a table has a foreign key, implement that foreign key (the query planner can entirely eliminate the work of some joins if there are foreign keys, that relationship guaranteeing that there is a fulfilling row on the other side).

If a column isn’t nullable (and most generally shouldn’t be; if you find yourself with a nullable column, revisit your design), set it as not nullable. Nulls add significant complexity to data usage.
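In T-SQL terms, declaring those facts is cheap; a sketch against the tables used here (the constraint names are mine, and the unique constraint is simply an alternative spelling of the index added earlier):

-- Unique column
ALTER TABLE dbo.dashboards
    ADD CONSTRAINT UQ_dashboards_name UNIQUE (name);

-- Foreign key, created WITH CHECK so the planner can trust it
ALTER TABLE dbo.time_on_site_logs WITH CHECK
    ADD CONSTRAINT FK_time_on_site_logs_dashboards
    FOREIGN KEY (dashboard_id) REFERENCES dbo.dashboards (id);

-- Not nullable
ALTER TABLE dbo.time_on_site_logs
    ALTER COLUMN user_id INT NOT NULL;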

These seem like incidental things, but they are critically important to the query planner, and when you have queries that don’t run as you expect, first visit the metadata to ensure that all of these hints are in place.
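A few quick checks against the catalog views will surface most of that metadata (the table names here match the scenario above):

-- Nullability of the log columns
SELECT name, is_nullable
FROM sys.columns
WHERE object_id = OBJECT_ID('dbo.time_on_site_logs');

-- Which indexes exist, and which are unique
SELECT name, type_desc, is_unique
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.dashboards');

-- Foreign keys, and whether the planner can actually trust them
SELECT name, is_disabled, is_not_trusted
FROM sys.foreign_keys;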

And of course you must ensure that you have appropriate covering indexes, with data density suited to the workload, for optimal efficiency, and that your statistics are up to date and cover the queried columns.
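For this workload that boils down to something like the compound index already in place, optionally widened to cover additional columns, plus a statistics refresh; the visit_time column below is hypothetical (see the “more columns” note earlier):

-- Covering index: key columns the query joins/aggregates on, plus INCLUDEd payload
-- (visit_time is a hypothetical column used only for illustration)
CREATE NONCLUSTERED INDEX IX_time_on_site_logs_dashboard_user_time
    ON dbo.time_on_site_logs (dashboard_id, user_id)
    INCLUDE (visit_time);

-- Refresh statistics (sampled by default; FULLSCAN trades time for accuracy)
UPDATE STATISTICS dbo.time_on_site_logs WITH FULLSCAN;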

All of these benefit every usage, now and in the future, and they will pay great dividends.

Again, I do not intend to disparage the authors of that post, who dealt with a very real problem in PostgreSQL and may indeed have had all the metadata I detailed configured already. However, the discussion yielded ignorant nuggets such as this comment, an unfortunate and entirely wrong claim that I’ve seen too often in real life. That attitude yields hack code.