Database Performance, LVM Snapshots, golang and sqlite3

Snapshots for Beginners

Snapshots are a very useful mechanism found in database products, disk management layers, and virtualization tools. Instead of making a complete copy of the source, snapshot functionality allows you to create the copy virtually instantly.

Snapshots can be used to perform backups without locks or transaction contention, or to save volume contents at a particular point in time, for instance because you’re about to make some risky changes that you want the ability to quickly revert, or because you need to access the data as it was at a particular point in time.

The speed of snapshots is courtesy of trickery: Snapshot technology generally doesn’t copy anything at the outset, but instead imposes overhead atop all I/O operations going forward.

VirtualBox snapshot volumes, for instance, contain the changed disk contents from the point of the snapshot forward (contrary to what one would intuitively think), leaving the base image unchanged. When you delete a snapshot it merges all of those changes down, whereas if you “roll back” to the snapshot it simply deletes the change data and reverts to the root. It adds a layer of indirection to disk access, as all activities need to determine first whether they apply to the changeset or the root set.

LVM (Logical Volume Manager, a Linux volume layer that adds considerable flexibility to storage on the platform) snapshots — the topic of this entry — use a very different but more common approach. The snapshot is effectively a sparse copy of the original data, initially acting simply as an indirection to the root volume. As you change data in the root volume, the volume manager performs a copy-on-write, copying the original contents of the changed disk sectors to the snapshot volume before overwriting them.
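The copy-on-write mechanics can be sketched in Go (the language of the benchmark tool described below). This is a toy model of the idea, not LVM's actual chunk-based implementation: a write first preserves the original block in every snapshot that hasn't yet saved it, which is exactly the overhead that grows with the snapshot count.

```go
package main

import "fmt"

// block is a fixed unit of storage; LVM actually works in
// multi-sector "chunks", but the principle is the same.
type block string

// volume models an origin volume with copy-on-write snapshots.
type volume struct {
	blocks    []block
	snapshots []map[int]block // per-snapshot preserved original blocks
}

// snapshot registers a new, initially empty snapshot. Nothing is
// copied at this point, which is why creation is nearly instant.
func (v *volume) snapshot() int {
	v.snapshots = append(v.snapshots, map[int]block{})
	return len(v.snapshots) - 1
}

// write stores data at index i, first copying the original block
// into every snapshot that has not yet preserved it. This is the
// per-write overhead that scales with the number of snapshots.
// Note that LVM sits below the filesystem, so even writes to
// blocks the filesystem considered free space get preserved.
func (v *volume) write(i int, data block) {
	for _, snap := range v.snapshots {
		if _, saved := snap[i]; !saved {
			snap[i] = v.blocks[i]
		}
	}
	v.blocks[i] = data
}

// read returns block i as snapshot s saw it: the preserved copy
// if the block changed since the snapshot, else the origin block.
func (v *volume) read(s, i int) block {
	if b, ok := v.snapshots[s][i]; ok {
		return b
	}
	return v.blocks[i]
}

func main() {
	v := &volume{blocks: make([]block, 4)}
	v.write(0, "a1")
	s := v.snapshot()
	v.write(0, "a2") // "a1" is copied into snapshot s first
	fmt.Println(v.read(s, 0), v.blocks[0]) // prints: a1 a2
}
```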

If you had a 40GB root volume and then created a snapshot, at the outset your root volume would of course take 40GB of space, while the snapshot would take 0GB. If you then changed all 40GB of data, the root volume would still take 40GB, while the snapshot would also take 40GB, every block of the original data having been copied over before each write operation completed.

Snapshot Performance Concerns

There is an obvious performance concern when using snapshots. Aside from read activities needing the additional indirection, all write operations against the root now require an analysis of which snapshots they impact, and then a copy of the original data to those affected snapshots.

The overhead can be onerous, and there are many dire warnings out there, driven by artificial benchmarks. While to-the-metal, reductionist benchmarks are important, it’s often useful to assess how this impacts the whole of operations at a higher level: The impact of seemingly enormous performance changes is often very different from expectations.

Benchmarking Snapshots

So with a couple of Amazon EC2 instance types (one a c3.2xlarge running Ubuntu 14.04 under HVM, with dual local SSDs, and the other an m1.xlarge running Ubuntu 14.04 under PV, with local magnetic drives), and a simple benchmarking tool I built with Go 1.3 and sqlite3 (using the excellent sqlite3 driver, itself a great demonstration of Go’s most powerful feature being its least sexy), I set about determining the impact of snapshots.

800 LVM Snapshots? Why Would Anyone Ever…

Why would this matter? As a bit of background, much of my professional life is spent in the financial industry, building very high performance, broad functionality solutions for leading financial firms.

One common need in the industry is as-it-was-at-a-particular-point-in-time data browsing: To be able to re-generate reports with varying parameters and investigate the source data that yielded those outputs. In effect, a version control system for data.

That data is constantly changing as new data comes in; building an entire data history in the database can yield enormous volumes of data, and creates a database schema that is incredibly difficult to query or utilize with any level of performance.

When it’s handled at the system level, such as with Oracle’s Flashback Query, it can yield a much more tenable solution, but then much of the implementation is a black box, and you don’t have the ability to flag exactly when such an archive should be built: all data churn becomes a part of the history, as if you were committing your source after every keystroke.

I needed the ability to maintain historical versions in a manner that was immediately usable, in a high performance fashion, but didn’t redundantly copy 100% of the data for each archive, which would quickly become unmanageable.

To explain, in one solution I engineered, the generated reporting set was exported to sqlite3, for a wide variety of interesting and compelling reasons. That data store is periodically incrementally refreshed with new data as the various participants certify it and gold plate it for dissemination.

While I seriously considered LVM as one component of a solution (a part of a more complex historical platform), I ended up adding block level versioning directly into sqlite3 (it is a fantastic foundation for derived solutions), but nonetheless found LVM to be a fascinating solution, and wanted to dig deeper into the performance considerations of the technology.

To address the natural question one would ask about the utility of this benchmark: if you were really starting a small database from scratch, the easiest solution would be to simply make a 100% copy of the database (presuming you have the reasonable ability to transactionally pause and fsync), quickly and easily, for every historical copy. Imagine, however, that you’ve instead already amassed a very large volume of data and are dealing with the daily change of that data: where the daily delta is very small on a large volume of data.

The Benchmark

The benchmark creates a table with the following definition and indexes-

create table valuation (id integer not null primary key, 
    investment_id integer not null, 
    period_id integer not null, 
    value real not null);
create index valuation_investment_idx ON 
    valuation(investment_id, period_id);
create index valuation_period_idx ON 
    valuation(period_id, investment_id);

It then iterates 95 times, adding a new “investment”, and then individually adding 10,000 period records for each investment (26 years of daily valuations), with the value varying randomly between records. On the second and subsequent iterations it then updates the entire period history of the prior investment, as if updated details had come in. All of these changes occur within a single transaction per investment iteration (meaning 95 transactions in all).
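A minimal Go sketch of that iteration logic, generating the statements rather than executing them (the table and columns come from the schema above, while `iterationSQL`, the random values, and the exact statement batching are illustrative assumptions, not the actual benchmark tool):

```go
package main

import (
	"fmt"
	"math/rand"
)

// iterationSQL returns the statements for investment iteration n
// (1-based): one transaction that inserts a new investment's
// 10,000 period valuations and, from the second iteration on,
// updates the entire history of the previous investment.
func iterationSQL(n, periods int) []string {
	stmts := []string{"BEGIN"}
	for p := 1; p <= periods; p++ {
		stmts = append(stmts, fmt.Sprintf(
			"INSERT INTO valuation (investment_id, period_id, value) VALUES (%d, %d, %f)",
			n, p, rand.Float64()*100))
	}
	if n > 1 {
		// Re-state the prior investment's whole history, as if
		// updated details had come in.
		for p := 1; p <= periods; p++ {
			stmts = append(stmts, fmt.Sprintf(
				"UPDATE valuation SET value = %f WHERE investment_id = %d AND period_id = %d",
				rand.Float64()*100, n-1, p))
		}
	}
	stmts = append(stmts, "COMMIT")
	return stmts
}

func main() {
	fmt.Println(len(iterationSQL(1, 10000))) // 10002: BEGIN + 10,000 inserts + COMMIT
	fmt.Println(len(iterationSQL(2, 10000))) // 20002: adds the 10,000-row update pass
}
```

This is also why the indexes matter so much: without them, each statement in the update pass forces a full table scan.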

Running this benchmark without the indexes, for those curious, became impossibly slow almost immediately due to the update stage needing to endlessly table scan.

For the snapshot-enabled tests it creates a snapshot of the root LVM volume between investment iterations, meaning that at the end of the run there are 95 snapshots, each of which contains the data as it was at the moment of creation (there is a snapshot where only the first investment exists; then one with the first and second, the details of the first changed; then one with the third, second, and first; and so on), each of those databases immediately usable, providing a glimpse of historic data.

The first test was run on an Amazon EC2 c3.2xlarge, which is an 8vCPU/15GB RAM instance chosen as it features two local 80GB SSDs. One SSD was mounted as an LVM volume, hosting the database and snapshots, while the other hosted the sqlite3 journal (which required changes to the sqlite3 source code, but yielded much better performance and dramatically reduced what would have been churn across snapshots; churn is the enemy on all systems).


A few things are evident from this.

  • The more data you insert, the slower the process is. This isn’t terribly surprising, and is primarily a facet of index maintenance.
  • LVM itself imposes a small overhead. The gray represents writes straight to the SSD, with no snapshots, against a non-LVM volume (LVM2 not even installed). Red is with LVM installed, still with no snapshots. Blue represents snapshots on every investment iteration.
  • Snapshots impose far less of an overhead than I expected, at least against SSDs. At iteration 95, each insert had the overhead of maintaining 94 snapshots as well, yet it only halved the transaction speed.

Flash memory really is the solution here, as the primary overhead of maintaining snapshots is the distinct I/Os, moving to the various locations on the disk. I captured the iostat output at every iteration, and it really is fascinating seeing the IOPS ramp up.

As one aside, on first consideration of the methodology you might logically think that the high degree of performance is explained by the limited overlap of the snapshots — e.g. on iteration 50, only investments 50 and 49 are altered, and thus shouldn’t impact prior snapshots that didn’t even include this accumulated data. Unfortunately this isn’t true: when new and novel data is added to the database, those earlier snapshots still receive a copy of the “original” data, which happened to be empty space on the hard drive.

After 95 iterations, for instance, the first snapshot had 52GB of changes written to it, even though the original data captured on that snapshot was in the sub-MB range.

Before looking at how snapshots impact performance on magnetic drives, here’s a quick comparison of the benchmark run on the SSD-equipped machine, compared to running it on a magnetic-drive hosting m1.xlarge, again hosting the benchmark on one of the local drives, the journal on another.


On pure IOPS and reductionist benchmarks, the flash drives hold an advantage of many magnitudes, but in this practical implementation they managed around a 5x speed advantage in most cases.

So how do snapshots impact the magnetic drive machine?


The situation is dramatically worse. Instead of a 2x slowdown, there is more in the range of a 5x slowdown, yielding a whole that is some 8.5x slower than the SSD equipped machine in the snapshot scenario.

Snapshot Performance – Not Too Bad

Under a somewhat realistic test scenario, on a modern machine (which now simply must mean flash-equipped), the performance overhead of LVM snapshots was actually significantly lower than I expected. It would be much better still if it took into account file system usage, and didn’t replicate unused file blocks to snapshots that simply don’t care about them.

Further, I’ve tested up to 850 LVM2 snapshots, hitting a configuration ring-buffer limit that I could circumvent if I wanted to really push limits.

Having said that, LVM snapshots have a deadly, critical fault: when you create a snapshot, you have to allocate the non-shared space that will hold its changeset. As mentioned above, one test saw 52GB of changes pushed to a single snapshot. Not only is this allocation fixed in size; if you underestimate the capacity and it fills, LVM2 degrades in a very unpleasant manner.

This kills the machine.

Quite seriously, though, several such cases required hard resets and rebuilding the LVM2 volume set after such a condition.

In any case, under controlled conditions snapshots can be a tool in the belt. How much it impacts your specific workload depends greatly upon the sorts of IOPS you generate, and the workload overhead you incur. But it’s worth looking into.

Name Navigator Updated with 2013 Data

The US Social Security Administration released the 2013 state-level name files this morning, so I’ve incorporated the new and revised data into the Name Navigator.

Behold the mighty staying power of Jacob, now in fisticuffs with the infant Liams, Noahs, and Masons of the world. The competition for top female names is just as contested.


Daenerys and Tyrion made a Game of Thrones-inspired appearance in 2013.

When a bit of time presents itself I plan on doing some fun correlation and trend analysis (particularly correlations between name groups, and lag factors between states), and adding some mortality calculations to the site to estimate age / birth region for each name. I hope to find the opportunity in the very near future.

Data is simply fascinating, and it’s wonderful when information like this is broadly shared.

As one disclaimer, the SSA does not provide 100% coverage with the data, excluding any names with fewer than 5 occurrences per state. A very uncommon name like Daenerys only blips into existence in the 2013 state data, with 14 individuals having the name across three states. In actuality there were 67 females given the name across the United States in 2013, however because they were distributed across the states in counts below 5, they are effectively hidden at the sub-national level.

There is no way to simply shake the names out across the states without distorting the data and analysis, and this gap only impacts very uncommon names.

For the probability percentages, the counts are compared against normalized and corrected values.

The Brutal Winter of 2013-2014. Data and discovery.

Spring is a welcome respite after a brutal winter here in Southern Ontario.


The frigid cold came early, hitting us with -23C (-10F) temperatures before the season had officially begun. It blasted us with a destructive ice storm that wreaked havoc on power grids, causing multi-day power outages and ruined holiday seasons.

It caused massive tree damage that is still being cleared: I’ve become a bit of an expert with the chainsaw, and have barely dented the workload before me.

The snow kept coming, the base layer of ice making every trek dangerous. A normal Southern Ontario winter sees intermittent periods of warmth that clear the slate for the next snowfall, but that never happened this year.

It just kept accumulating.

Our cars were stuck multiple times, each yielding frustrating multi-hour recovery efforts. Do you know where the recovery hooks are on your vehicle, or if it even has them? I didn’t know the answer to these questions before this winter, but I do now.

Of course complaining about the weather is something that people are prone to do, but as someone who historically loves the winter season, this outing soured me on the whole concept of a cold season. Even bitter cold is wonderful on occasion, but it is intolerable when it lasts for weeks.

It is the first time I’ve questioned the geography of where I live.

In any case, in putting together a solution for a client recently I needed to make some simple demonstrations with easily available data (a similar motivation yielded the oft emulated Interesting Facts about Domain Names entry years ago), and investigating the season’s weather was an interesting target.

I was curious if it really was that abnormal: In most prior years I’ve held -20C as a magical mark of “extraordinarily cold”, but this year that mark seemed to be hit with regularity.

So I went searching on Environment Canada. They do indeed have data available, even if they make it a little harder to use in bulk than necessary, but what is disheartening is the lack of quality of the data.

Stations go in and out of service frequently, and it’s less than ideal to compare different stations given natural variations. Even sticking with singular stations, however, there are significant gaps of data from all of them. Weeks of data simply vanished. In some cases only some of the data is available.

Surely this is a funding issue, and I suppose with the aggregate of hundreds of stations such gaps can be smoothed out, but for someone trying to use the data it makes it more of a pain than it should be.

In any case, I wanted data that was representative of Southern Ontario, and after evaluating many stations across the region, I settled on Roseville (4816). To fill in missing data, I used a synthesized set of data from two nearby stations in the affected periods, Elora (41983) and Waterloo Wellington A (4832).

Vertica and Pentaho Kettle made quick work of getting the data into a workable form, doing any data clean-up and discovery, and extracting data.

The data still is far from perfect — in many cases the high/low range falls significantly outside of the preceding day’s high/low range, which seems impossible — but it demonstrates trends and comparative results well.

D3 and jQuery made it a trivial task to visualize the data.

In any case, for those who want to visualize exactly what the daily temperature ranges looked like in Southern Ontario this winter (where winter is broadly specified as December 1st to March 31st of the following year), and those for decades before, the following tool makes the task simple. Note that for it to function your browser needs to support SVG, as all modern browsers do.

Note: The interactive temperature chart will likely be non-functional if you are reading this with an RSS reader.

And on the magically cold value of -20C, 2013 had 25 days (between December 1st, 2013 and March 31st, 2014) below -20C. 1993 had 21 such days, while the closest recent year was 2004, with 12 days hitting that magically cold mark.

2012 had 5 days lower than -20C. 2011 had 0. The relative mildness of recent winters — versus the similar brutality of winters such as 1993-1994 — made it feel that much worse.