Using SQLite for Financial Calculations

SQLite Is Everywhere

Even if you know nothing about the SQLite project, you’re probably using it right now.

It’s used in Apple’s Safari browser, and in the iOS and OS X operating systems. It’s used in Mozilla Firefox. It’s used in Google’s Chrome browser and Chrome OS, and in the Android OS.

SQLite is everywhere.

The wide usage is a result of a very liberal license that allows carte blanche use of a thoroughly tested embedded database solution built on a highly-portable, very efficient codebase.

It offers a surprisingly robust set of functionality. It supports fully ACID-compliant transactions. It supports massive databases. And it is easily extensible with custom functionality, such as very high-performance calculations.

Don’t Burn Your Oracle License Just Yet

It’s not going to replace your big enterprise database solution: It’s an embedded project with limited writer concurrency, more often used to manage browser bookmarks and cookies than to run large OLTP platforms.

But it can certainly help with some of the edges. The edges that in practice end up being among the biggest headaches of most financial systems, yet are often where the primary value of the platform lies: reporting and analysis.

Most financial systems are built around collecting and aggregating data from disparate sources, using both manual input and automated processes to collect the data in the OLTP information store.

Transaction details. Security market close numbers. Fund compositions and holdings. Etc.

A very high degree of write concurrency is a necessity, though individual transactions tend to impact or draw upon a relatively small set of data (assuming you don’t attempt to front-load derived-value computations). This is a domain that is often backed by a clustered copy of one of the big database products (Oracle, DB2, SQL Server, etc.), and where I/O is the primary potential limitation, though that is seldom an issue anymore given the enormous IOPS potential of flash storage. Horizontal scaling is rarely a concern beyond ensuring that redundancy needs are met via a mechanism such as simple mirroring.

On the output side of the equation — the purpose for doing all of that data work — is reporting and analysis, which is a process that generally has marginal write concurrency needs, but often draws upon enormous sets of data to build those results. It’s a process that is ill-suited for most relational databases, especially when attempts are made to shoehorn it onto the same platform hosting the OLTP, causing destructive contention for limited resources, and a lock system that is overloaded trying to rationalize the competing demands.

Which is, of course, why most database vendors offer data warehousing products. Products more suitable for read-heavy jobs, where data is organized optimally for reporting purposes (for instance column-oriented). The data warehousing model is generally that you periodically or incrementally publish data from your OLTP system to your data warehouse, and then via technologies like OLAP allow users and agents to construct and generate results and what-if scenarios.

Data warehousing solutions work brilliantly for many markets, for instance populating the scorecards and backing the drilldowns of a chain of retail stores. They are less suitable for financial systems for a wide variety of implementation reasons, and attempting to make the solution fit often ends up causing more grief than satisfaction. The cube model, for instance, is unsuitable for many complex hierarchical holding scenarios.

HP’s Vertica offers a compelling option among column-oriented databases, particularly for its R integration, though it can be overkill for some purposes.

In between are industry-specific solutions such as Advent’s Geneva. Geneva offers novel technical solutions to service the financial space (with a unique and compelling in-memory, knowledge-date-capable database system), and is a market leader, but like any product in the real world it isn’t a fit for all needs.

So many financial organizations end up with something unique and custom-made. Often a solution is pursued by trying to bolt something onto the OLTP system, throwing all of the eggs into one basket, creating an inefficient, unreliable platform that no longer serves any need well.

On such a hydra there is seldom the ability to scale economically. Security is usually a fingers-crossed affair. Data integrity is often compromised. Licensing costs often end up being a significant hindrance to optimal deployments. Maintainability? Forget about it. Performance is usually abysmal.

On the other end of the custom-built spectrum are non-database solutions hosting maps of maps of dictionaries of hashes: data pulled on demand into in-memory constructs by various .NET / Java / Python instances and loaded into purpose-built structures and solutions. While they allow for trivial, low-cost scale-out and can offer optimal performance (many orders of magnitude beyond what is generally achieved bolting onto the RDBMS), the problem with these solutions is that they tend towards being overspecialized, every alteration turning into a significant coding project. They are also very difficult to bring to a successful state, given the number of assumptions and analysis steps that need to be absolutely correct for the end result to yield the desired effect.

But assume you haven’t chosen either of those paths. Instead you have a clean and well-designed OLTP platform that only stores canonical data (nothing derived, keeping your data tight and hyper-efficient), every transaction as short and efficient as possible. You fully understand your data, and the universe of beneficial uses and calculations you can do on that data. What then?

You don’t want to be tied into the RDBMS, with the costs, limitations, and serious performance restrictions that entails. Nor do you want to build something entirely from scratch, essentially creating a specialized database product in the process and reinventing a problem solved many times over.

SQLite Enters The Scene

This is where SQLite comes into play as an element of a broader solution.

  • It has a robust, well-proven, high-performance, low-overhead set of data storage and SQL-enabled data retrieval functionality that you can embed into your solutions, including in-memory databases. This gives you optimized, specialized computing, but with the versatility to flex the data processed with simple SQL alterations.
  • The SQLite library is very easily extensible. The example I give here is a simple IRR (internal rate of return) aggregate function extension, but the opportunities are endless, with our own library hosting a number of financial computations, all extremely high performance. With the foundation of SQLite you can add artificial table sources, unique native date functions, even custom data storage, etc.
  • The file format is well known and very widely supported. Assuming you do a reporting/analysis dump into a SQLite database file (or many files — scale out is unbounded, so you can build data islands to whatever extent makes sense), you can use that data in your specialized, SQL-enabled analysis solution hosting your customized SQLite library, or in any other tool that supports such files.
  • With those islands of SQLite databases you now have the capacity to use very efficient page-level file versioning, giving you the ability to do state-in-time analysis, in the vein of “give me the performance metrics for this hierarchy of holdings as it would have been reported given what was known on January 5th”. This allows you to report numbers exactly as they were known at the time. A minimal sketch of working with such an island follows this list.
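
As a rough sketch of working with such an island, an analysis process might open an in-memory working database and attach a published reporting file alongside it. This is a hypothetical illustration using only the standard SQLite C API; the file name, attachment alias, and function name are invented for the example.

#include <sqlite3.h>

/* Hypothetical sketch: open an in-memory working database and attach a
** published reporting "island" file so both can be queried together.
** The file name and alias are illustrative only. */
static int openReportingIsland(sqlite3 **ppDb){
  int rc = sqlite3_open(":memory:", ppDb);
  if( rc!=SQLITE_OK ) return rc;
  return sqlite3_exec(*ppDb,
      "ATTACH DATABASE 'holdings_island.db' AS island;",
      0, 0, 0);
}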

Adding a Custom Function to SQLite

As an example of a simple implementation of a custom, native aggregate function within the SQLite library, an IRR variant (note that in Excel parlance the following implementation is actually closer to the more versatile XIRR function) might be added to SQLite as follows:

/*
** The following structure helps keep track of the transaction information
** collected by the irr aggregate function.
*/

#define TRANSACTION_SET_GROUP_COUNT 64
typedef struct TransactionSet TransactionSet;
struct TransactionSet {
  i64 period[TRANSACTION_SET_GROUP_COUNT];
  double amount[TRANSACTION_SET_GROUP_COUNT];
  TransactionSet *nextTransactionSet;
} __attribute__ ((aligned (16)));

typedef struct NpvCtx NpvCtx;
struct NpvCtx {
  i64 cnt;
  i64 minPeriod;
  i64 maxPeriod;
  TransactionSet *baseTransactions;
  TransactionSet *headTransactions;
};

/*
** Compute the net present value (NPV) of the collected transactions at the
** given rate, or its first derivative with respect to the rate (used by the
** Newton-Raphson iteration in irrFinalize).
*/
static double calcNpv(NpvCtx *p, double rate, int derivative)
{
  rate += 1.0;
  double result = 0.0;
  i64 i = 0;
  TransactionSet *t = p->baseTransactions;
  /* Walk the chain of transaction blocks, up to TRANSACTION_SET_GROUP_COUNT
  ** entries per block. */
  while (t && i < p->cnt) {
    int j;
    int n = (int)(p->cnt - i);
    if (n > TRANSACTION_SET_GROUP_COUNT) n = TRANSACTION_SET_GROUP_COUNT;
    if (derivative) {
      for (j = 0; j < n; j++) {
        double period = (double)(t->period[j] - p->minPeriod);
        result += -period * t->amount[j] / pow(rate, period + 1.0);
      }
    } else {
      for (j = 0; j < n; j++) {
        result += t->amount[j] / pow(rate, (double)(t->period[j] - p->minPeriod));
      }
    }
    i += n;
    t = t->nextTransactionSet;
  }
  return result;
}

/*
** Routines to implement daily irr.
*/
static void irrStep(sqlite3_context *context, int argc, sqlite3_value **argv){
  int atype, ptype;
  NpvCtx *p;
  p =  sqlite3_aggregate_context(context, sizeof(NpvCtx));

  atype = sqlite3_value_numeric_type(argv[0]);
  ptype = sqlite3_value_numeric_type(argv[1]);

  if( p && atype!=SQLITE_NULL && ptype!=SQLITE_NULL){
    /* Accept any numeric amount (INTEGER or FLOAT); the period must be an
    ** integer. */
    if ((atype==SQLITE_FLOAT || atype==SQLITE_INTEGER) && ptype==SQLITE_INTEGER) {
      int period = sqlite3_value_int(argv[1]);
      int offset = (int)(p->cnt % TRANSACTION_SET_GROUP_COUNT);

      if (offset == 0) {
        /* Start a new block every TRANSACTION_SET_GROUP_COUNT rows. */
        TransactionSet *newSet;
        newSet = malloc(sizeof(TransactionSet));
        if (newSet == 0) {
          sqlite3_result_error_nomem(context);
          return;
        }
        newSet->nextTransactionSet = (void *)0;

        if (p->headTransactions) {
          p->headTransactions->nextTransactionSet = newSet;
          p->headTransactions = newSet;
        } else {
          p->baseTransactions = newSet;
          p->headTransactions = newSet;
        }
      }

      if (p->cnt == 0) {
        p->maxPeriod = period;
        p->minPeriod = period;
      } else {
        if (period > p->maxPeriod) {
          p->maxPeriod = period;
        } else if (period < p->minPeriod) {
          p->minPeriod = period;
        }
      }
      p->headTransactions->period[offset] = period;
      p->headTransactions->amount[offset] = sqlite3_value_double(argv[0]);
      p->cnt++;
    }
  }
}

#define IRR_CLOSE_ENOUGH 0.000001
#define IRR_MAX_ITERATIONS 80

static void irrFinalize(sqlite3_context *context){
  NpvCtx *p;
  p =  sqlite3_aggregate_context(context, 0);
  if (p) {
    int iterations;
    double guess, lguess, result, differential, diff;
    guess = 0.1;
    lguess = 0.0;
    diff = 1.0;
    iterations = 0;

    /* Newton-Raphson: repeatedly refine the guess by guess -= NPV/NPV' until
    ** the change between successive guesses falls below IRR_CLOSE_ENOUGH. */
    do {
      result = calcNpv(p, guess, 0);
      differential = calcNpv(p, guess, 1);
      if (differential == 0) {
        iterations = IRR_MAX_ITERATIONS;
        break;
      }
      lguess = guess;
      guess = guess - result / differential;

      diff = guess - lguess;
      if (diff < 0) diff *= -1;
      if (diff < IRR_CLOSE_ENOUGH) break;
      if (iterations++ > IRR_MAX_ITERATIONS) break;
    } while (1);

    if (iterations < IRR_MAX_ITERATIONS) {
      sqlite3_result_double(context, (double)guess);
    } else {
      /* no solution found */
      sqlite3_result_null(context);
    }
    /* Manually free the temporary chain of transaction data, starting from the
    ** base of the chain (starting at headTransactions, the tail block, would
    ** leak every block but the last). */
    TransactionSet *current, *next;
    current = p->baseTransactions;
    while (current) {
      next = current->nextTransactionSet;
      free(current);
      current = next;
    }
  } else {
    sqlite3_result_double(context, (double)0);
  }
}


...
    AGGREGATE(irr, 2, 0, 0, irrStep, irrFinalize),
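
That single entry registers irr alongside SQLite’s built-in aggregate functions; in a patched build it goes in the function definition table in func.c, next to entries such as sum and avg.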

The aggregate function takes two parameters per row: value and period. The period is an integer of your choosing, whether a year, a week from some origin, days from some origin, etc. If it were called with the data (-300000.0, 0), (150000.0, 1), (150000.0, 2), (150000.0, 3), (10000.0, 4), say via a query such as SELECT irr(amount, period) from cashtransactions where investment_id = 768;, it would return 24.3118538739616% (0.243118538739616 as the actual return value), which is the rate at which the discounted cash flows sum to approximately zero.
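
As a quick sanity check on that figure: at a discount rate of roughly 24.312%, 150000 / 1.24312 ≈ 120,664, 150000 / 1.24312² ≈ 97,066, 150000 / 1.24312³ ≈ 78,083, and 10000 / 1.24312⁴ ≈ 4,187. Those discounted inflows sum to roughly 300,000, offsetting the initial 300,000 outflow, so the net present value at that rate is approximately zero.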

I’ve removed some error-checking/memory-management and hand-coded vectorization for simplification, though note that this takes full advantage of SSE and even AVX through the auto-vectorization of any decent compiler, and can process enormous numbers of rows extremely quickly. With this very simple addition to SQLite, you have IRR and NPV functionality natively in the library, and thus in any of your apps that use that library (through the normal SQLite API).
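
As a hypothetical usage sketch (the database file, table, and column names are illustrative), a host application built against the customized library could invoke the aggregate through the ordinary prepare/step/column calls:

#include <stdio.h>
#include <sqlite3.h>

/* Hypothetical sketch: query the custom irr() aggregate through the
** standard SQLite C API. File, table, and column names are illustrative. */
int main(void){
  sqlite3 *db;
  sqlite3_stmt *stmt;
  if (sqlite3_open("reporting_island.db", &db) != SQLITE_OK) return 1;
  if (sqlite3_prepare_v2(db,
        "SELECT irr(amount, period) FROM cashtransactions WHERE investment_id = ?;",
        -1, &stmt, 0) != SQLITE_OK) { sqlite3_close(db); return 1; }
  sqlite3_bind_int(stmt, 1, 768);
  if (sqlite3_step(stmt) == SQLITE_ROW &&
      sqlite3_column_type(stmt, 0) != SQLITE_NULL) {
    printf("IRR: %.10f\n", sqlite3_column_double(stmt, 0));
  } else {
    printf("No IRR solution found.\n");
  }
  sqlite3_finalize(stmt);
  sqlite3_close(db);
  return 0;
}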

As some implementation comments, note that this does need to copy the subject data into temporary in-memory storage, which can mean wastefully copying from, in some cases, an in-memory table into an in-memory array. It does this both to allow for vectorization through alignment and placement, and because the nature of IRR calculations requires iterating over the same data repeatedly, performing successively refined approximations of the rate at which the NPV reaches zero.

You can also add functionality like this via loadable extensions, but in our case the changes are so substantial that we add our customizations directly in the SQLite library.
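
For lighter-weight needs, the same aggregate could be exposed as a loadable extension rather than a patched build. A minimal sketch might look like the following, assuming irrStep and irrFinalize from above are compiled into the same file, and using the entry-point name SQLite derives for an extension built as irrext:

#include "sqlite3ext.h"
SQLITE_EXTENSION_INIT1

/* Hypothetical sketch: register irr() from a loadable extension instead of
** patching the library. Assumes irrStep and irrFinalize are defined in this
** translation unit. */
int sqlite3_irrext_init(
  sqlite3 *db,
  char **pzErrMsg,
  const sqlite3_api_routines *pApi
){
  SQLITE_EXTENSION_INIT2(pApi);
  (void)pzErrMsg;
  return sqlite3_create_function(db, "irr", 2, SQLITE_UTF8,
                                 0, 0, irrStep, irrFinalize);
}

Loaded via the shell’s .load command or sqlite3_load_extension(), the aggregate then behaves just as it does in the patched build.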

There are other open-source database solutions, of course, and another approach would be building a custom, financial-industry-focused version of PostgreSQL or some other data product (and such products exist). Needs vary, however, and in our case the benefit of using SQLite was specifically that it is a self-contained, very efficient, lightweight, pluggable shared library that allowed us to very easily incorporate the non-relational parts of the application in the host application, without trying to shoehorn them into the relational design. SQLite is used as the data manager for an analysis and reporting engine, with some rudiments of financial functionality right in that library, but it is just one part of the solution.

In the financial calculation solutions that my organization provides to clients, we leverage this heavily customized, extremely high-performance SQLite library via Go (golang) as the orchestrator, using mechanisms I’ve mentioned on here before.

The end result is a powerful, flexible, efficient, easily modified and expanded, very easily scaled and migrated solution that can be pushed on thin cloud instances or hyper-threaded on mega servers (or even smartphones or embedded devices). One that offers performance that is simply unmatched by any competitive product, and is many orders of magnitude faster than the in-house built solutions often created. One that allows you to deploy to armies of computers with ease and at minimal cost or legal complexity.

Like The Sound of This

These are the sorts of solutions we at yafla build for our clients. They enable an efficiency, scale, and flexibility of operations that turn small and medium organizations into computing Goliaths.

These are solutions that endure.

Contact me today and let’s get started making your organization’s technology into a formidable force, using and building upon the awesome, incredible technology around us.