
Programming Guide - Opening Compressed Databases

The Genomics Extension integrates with your programming language's existing SQLite3 bindings to provide a familiar experience wherever possible.

First complete the installation instructions.

Loading the extension

Python:

import sqlite3
import genomicsqlite

Java:

import java.sql.*;
import org.sqlite.SQLiteConfig;
import net.mlin.genomicsqlite.GenomicSQLite;

Rust:

use genomicsqlite::ConnectionMethods;
use rusqlite::{Connection, OpenFlags, params, NO_PARAMS};

The genomicsqlite::ConnectionMethods trait makes available GenomicSQLite-specific methods for rusqlite::Connection (and rusqlite::Transaction). See rustdoc for some extra details.

C++:

#include <sqlite3.h>
#include "SQLiteCpp/SQLiteCpp.h" // optional
#include "genomicsqlite.h"

int main() {
  try {
    GENOMICSQLITE_CXX_INIT();
  } catch (std::runtime_error& exn) {
    // report exn.what()
  }
  ...
}

Link the program to sqlite3 and genomicsqlite libraries. Optionally, include SQLiteCpp headers before genomicsqlite.h to use its more-convenient API; but don't link it, as the genomicsqlite library has it built-in.

GNU/Linux: to link the prebuilt libgenomicsqlite.so distributed from our GitHub Releases, you may have to compile your source with CXXFLAGS=-D_GLIBCXX_USE_CXX11_ABI=0. This is because the library is built against an old libstdc++ version to improve runtime compatibility. The function of this flag is explained in the libstdc++ docs on Dual ABI. If you build libgenomicsqlite.so from source, then the flag will not be needed.
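For example, a build command along these lines (source filename hypothetical; adjust paths for your system) links against the prebuilt library with the ABI flag:

$ g++ -std=c++11 -D_GLIBCXX_USE_CXX11_ABI=0 my_program.cc -lgenomicsqlite -lsqlite3 -o my_program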

General note: GenomicSQLite C++ routines are liable to throw exceptions.

C:

#include <sqlite3.h>
#include "genomicsqlite.h"

int main() {
  char *zErrMsg = 0;
  int rc = GENOMICSQLITE_C_INIT(&zErrMsg);
  if (rc != SQLITE_OK) {
    /* report zErrMsg */
    sqlite3_free(zErrMsg);
  }
  ...
}

Link the program to sqlite3 and genomicsqlite libraries.

All GenomicSQLite C routines returning a char* string use the following convention. If the operation succeeds, then it's a nonempty, null-terminated string. Otherwise, it points to a null byte followed immediately by a nonempty, null-terminated error message. In either case, the caller must free the string with sqlite3_free(). NULL is returned only if out of memory.

Opening a compressed database

↪ GenomicSQLite Open: create or open a compressed database, returning a connection object with various settings pre-tuned for large datasets.

Python:

dbconn = genomicsqlite.connect(
  db_filename,
  read_only=False,
  **kwargs  #  genomicsqlite + sqlite3.connect() arguments
)
assert isinstance(dbconn, sqlite3.Connection)

Java:

SQLiteConfig config = new SQLiteConfig();
// config.setReadOnly(true)
// (or other sqlite-jdbc options as needed)
java.util.Properties props = config.toProperties();
props.setProperty("genomicsqlite.config_json", "{}");
// GenomicSQLite tuning options as JSON text   ^^^^

Connection dbconn = DriverManager.getConnection(
  "jdbc:genomicsqlite:" + dbfileName, props
);

Rust:

let dbconn: Connection = genomicsqlite::open(
  db_filename,
  OpenFlags::SQLITE_OPEN_CREATE | OpenFlags::SQLITE_OPEN_READ_WRITE,
  &json::object::Object::new()  // tuning options
)?;

C++ (SQLiteCpp):

std::unique_ptr<SQLite::Database> GenomicSQLiteOpen(
  const std::string &db_filename,
  int flags = 0,
  const std::string &config_json = "{}"  // tuning options as JSON text
);

C++:

int GenomicSQLiteOpen(
  const std::string &db_filename,
  sqlite3 **ppDb,
  std::string &errmsg_out,
  int flags = 0,  // as sqlite3_open_v2() e.g. SQLITE_OPEN_READONLY
  const std::string &config_json = "{}"  // tuning options as JSON text
) noexcept; // returns sqlite3_open_v2() code

C:

int genomicsqlite_open(
  const char *db_filename,
  sqlite3 **ppDb,
  char **pzErrMsg, /* if nonnull and an error occurs, set to error message
                    * which caller should sqlite3_free() */
  int flags,              /* as sqlite3_open_v2() e.g. SQLITE_OPEN_READONLY */
  const char *config_json /* tuning options as JSON text (may be null) */
); /* returns sqlite3_open_v2() code */

Afterwards, all the usual SQLite3 API operations are available through the returned connection object, which should finally be closed in the usual way. The storage compression layer operates transparently underneath.
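For instance, in Python the returned object is a standard sqlite3.Connection, so ordinary usage looks like the following sketch (filename, table, and data are hypothetical):

import genomicsqlite

dbconn = genomicsqlite.connect("example.gsql")  # hypothetical filename
dbconn.execute("CREATE TABLE IF NOT EXISTS variants(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)")
dbconn.execute(
  "INSERT INTO variants(chrom, pos, ref, alt) VALUES(?,?,?,?)",
  ("chr1", 12345, "A", "G")
)
dbconn.commit()
for (n,) in dbconn.execute("SELECT count(*) FROM variants"):
  print(n)
dbconn.close()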

❗ GenomicSQLite databases should only be opened using this routine. If a program opens an existing GenomicSQLite database using a generic SQLite3 API, it will find a valid database whose schema is that of the compression layer instead of the intended application's. Writing into that schema might effectively corrupt the database!

Tuning options

The aforementioned tuned settings can be further adjusted. Some bindings (e.g. C/C++) receive these options as the text of a JSON object with keys and values, while others admit individual arguments to the Open routine.

  • threads = -1: thread budget for compression, sort, and prefetching/decompression operations; -1 to match up to 8 host processors. Set 1 to disable all background processing.
  • inner_page_KiB = 16: SQLite page size for new databases, any of {1, 2, 4, 8, 16, 32, 64}. Larger pages are more compressible, but increase random I/O cost.
  • outer_page_KiB = 32: compression layer page size for new databases, any of {1, 2, 4, 8, 16, 32, 64}.
    The default configuration (inner_page_KiB, outer_page_KiB) = (16,32) balances random access speed and compression. Try setting them to (8,16) to prioritize random access, or (64,2) to prioritize compression (if compressed database will be <4TB).
  • zstd_level = 6: Zstandard compression level for newly written data (-7 to 22)
  • unsafe_load = false: set true to disable write transaction safety (see advice on bulk-loading below).
    ❗ A database written to unsafely is liable to be corrupted if the application crashes, or if there's a concurrent attempt to modify it.
  • page_cache_MiB = 1024: database cache size. Use a large cache to avoid repeated decompression in successive and complex queries.
  • immutable = false: set true to slightly reduce overhead reading from a database file that won't be modified by this or any concurrent program, guaranteed.
  • force_prefetch = false: set true to enable background prefetching/decompression even if inner_page_KiB < 16 (enabled by default only at or above that size, as prefetching can be counterproductive with smaller pages; YMMV)

The connection's potential memory usage can usually be budgeted as roughly the page cache size, plus the size of any uncommitted write transaction (unless unsafe_load), plus some safety factor. ❗ However, this can multiply by (threads+1) during queries whose results are at least that large and must be re-sorted. That includes index creation, when the indexed columns total such size.
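As an illustration in Python, where these options can be passed as keyword arguments to the Open routine, the values below are arbitrary examples rather than recommendations:

import genomicsqlite

dbconn = genomicsqlite.connect(
  "example.gsql",       # hypothetical filename
  threads=4,            # explicit thread budget
  inner_page_KiB=8,     # prioritize random access, per the (8,16) suggestion above
  outer_page_KiB=16,
  zstd_level=9,         # spend more compression effort on newly written data
  page_cache_MiB=2048   # larger cache to avoid repeated decompression
)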

genomicsqlite interactive shell

The Python package includes a genomicsqlite script that enters the sqlite3 interactive shell on an existing compressed database. This is a convenient way to inspect and explore the data with ad hoc SQL queries, as one might use grep or awk on text files. With the Python package installed (pip3 install genomicsqlite or conda install -c mlin genomicsqlite):

$ genomicsqlite DB_FILENAME [--readonly]

to enter the SQL prompt with the database open. Or, append an SQL statement (in quotes) to execute it and exit. If you've installed the Python package but the script isn't found, set your PATH to include the bin directory with Python console scripts.
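For example, to run one ad hoc query and exit (table name hypothetical):

$ genomicsqlite DB_FILENAME "SELECT count(*) FROM variants"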

Database compaction. The utility has a subcommand to compress and defragment an existing database file (compressed or uncompressed), which can increase its compression level and optimize access to it.

$ genomicsqlite DB_FILENAME --compact

generates DB_FILENAME.compact; see its --help for additional options. In particular, --level, --inner-page-KiB, and --outer-page-KiB affect the output file size as discussed above.

Due to decompression overhead, the compaction procedure may be impractically slow if the database has big tables that weren't initially written in their primary key order. To prevent this, see Optimizing storage layout below.

Reading databases over the web

The GenomicSQLite Open routine and the genomicsqlite shell also accept http: and https: URLs instead of local filenames, creating a connection to read the compressed file over the web directly. The database connection must be opened read-only in the appropriate manner for your language bindings (such as the flag SQLITE_OPEN_READONLY). The URL server must support HTTP GET range requests, and the content must not change for the lifetime of the connection.

Under the hood, the extension uses libcurl to send web requests for necessary portions of the database file as queries proceed, with adaptive batching & prefetching to balance the number and size of these requests. This works well for point lookups and queries that scan largely-contiguous slices of tables and indexes (and a modest number thereof). It's less suitable for big multi-way joins and other aggressively random access patterns; in such cases, it'd be better to download the database file upfront to open locally.
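A minimal Python sketch, assuming a hypothetical URL (the connection must be read-only, here via the binding's read_only argument):

import genomicsqlite

dbconn = genomicsqlite.connect(
  "https://example.org/datasets/example.gsql",  # hypothetical URL; server must support GET range requests
  read_only=True
)
for (n,) in dbconn.execute("SELECT count(*) FROM sqlite_master"):
  print(n)  # count of schema objects, fetched over the web
dbconn.close()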

  • When reading large databases over the web, budget an additional ~600MiB of memory for HTTP prefetch buffers.
  • The HTTP driver writes log messages to standard error when requests fail or had to be retried, which can be disabled by setting configuration web_log = 0 or environment SQLITE_WEB_LOG=0; or increased up to 5 to log every request and other details.
  • To disable TLS certificate and hostname verification, set web_insecure = true in the GenomicSQLite configuration, or SQLITE_WEB_INSECURE=1 in the environment.
  • The above-described genomicsqlite DB_FILENAME --compact optimizes a database for web access by making the request pattern more contiguous.

Web access optimization with .dbi helper files

Experimental feature

Optionally, web access can be further optimized by a small .dbi helper file served alongside the main database file. The client automatically probes for this by appending .dbi to the database URL (unless there's a query string). If the .dbi isn't usable for any reason, the client falls back to direct access of the main database file. Increase web_log to 3 or higher to see which mode is used.

Use genomicsqlite DB_FILENAME --dbi to generate the .dbi helper for an immutable database file, then publish them alongside each other. The .dbi must be regenerated if the database subsequently changes.

To override the automatic probe, set configuration web_dbi_url to a different URL for the .dbi file, or to a local file:/path/to.dbi downloaded beforehand. Use the latter feature to save multiple connections from each having to fetch the .dbi separately. Lastly, set web_nodbi to true or environment SQLITE_WEB_NODBI=1 to disable dbi mode entirely.
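For example in Python, the option might be supplied as a keyword argument (URL and local path hypothetical):

import genomicsqlite

dbconn = genomicsqlite.connect(
  "https://example.org/datasets/example.gsql",   # hypothetical URL
  read_only=True,
  web_dbi_url="file:/tmp/example.gsql.dbi"       # .dbi helper downloaded beforehand
)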

The .dbi helper is optional, but often beneficial for big databases accessed with high-latency requests. It collects bits of the main file that are key for navigating it, but typically scattered throughout (even after compaction). Prefetching them in the compact .dbi saves the reader from having to pluck them from all over the main file.

Advice for big data

Writing large databases quickly

  1. sqlite3_config(SQLITE_CONFIG_MEMSTATUS, 0) if available, to reduce overhead in SQLite3's allocation routines.
  2. Open database with unsafe_load = true to reduce transaction processing overhead (at aforementioned risk) for the connection's lifetime.
  3. Also open with the flag SQLITE_OPEN_NOMUTEX, if your application naturally serializes operations on the connection.
  4. Perform all of the following steps within one big SQLite transaction, committed at the end.
  5. Insert data rows reusing prepared, parameterized SQL statements.
    1. Process the rows in primary key order, if feasible (otherwise, see Optimizing storage layout below).
    2. Consider preparing data in producer thread(s), with a consumer thread executing insertion statements in a tight loop.
    3. Bind text/blob parameters using SQLITE_STATIC if suitable.
  6. Create secondary indexes, including genomic range indexes, only after loading all row data. Use partial indexes when they suffice.
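A minimal Python sketch combining several of these steps (schema, filename, and the row producer are hypothetical):

import genomicsqlite

# unsafe_load=True reduces transaction overhead for this connection (step 2);
# isolation_level=None lets us control the one big transaction explicitly (step 4)
dbconn = genomicsqlite.connect("big.gsql", unsafe_load=True, isolation_level=None)
dbconn.execute("BEGIN")
dbconn.execute("CREATE TABLE features(chrom TEXT, beg INTEGER, end INTEGER, name TEXT)")

# one prepared, parameterized INSERT reused for all rows (step 5),
# fed rows already sorted by the intended key order if feasible
dbconn.executemany(
  "INSERT INTO features(chrom, beg, end, name) VALUES(?,?,?,?)",
  rows_in_key_order()  # hypothetical producer/generator
)

# secondary indexes only after all row data is loaded (step 6)
dbconn.execute("CREATE INDEX features_name ON features(name)")

dbconn.execute("COMMIT")
dbconn.close()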

Optimizing storage layout

For multiple reasons mentioned so far, large tables should have their rows initially inserted in primary key order (or whatever order will promote access locality), ensuring they'll be stored as such in the database file; and tables should be written one-at-a-time. If it's inconvenient to process the input data in this way, the following procedure can help:

  1. Create temporary table(s) with the same schema as the destination table(s), but omitting any PRIMARY KEY specifiers, UNIQUE constraints, or other indexes.
  2. Stream all the data into these temporary tables, which are fast to write and read, in whatever order is convenient.
  3. INSERT INTO permanent_table SELECT * FROM temp_table ORDER BY colA, colB, ... using the primary key (or other desired sort order) for each table.

The Genomics Extension automatically enables SQLite's parallel, external merge-sorter to execute the last step efficiently. Ensure it's configured to use a suitable storage subsystem for big temporary files.
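A hedged Python sketch of this procedure (schema and names hypothetical):

import genomicsqlite

dbconn = genomicsqlite.connect("big.gsql", unsafe_load=True)

# 1. destination table with its primary key, plus a staging table with the
#    same columns but no PRIMARY KEY, UNIQUE constraints, or other indexes
dbconn.executescript("""
  CREATE TABLE features(chrom TEXT, beg INTEGER, end INTEGER, name TEXT,
                        PRIMARY KEY(chrom, beg, end));
  CREATE TEMP TABLE staging_features(chrom TEXT, beg INTEGER, end INTEGER, name TEXT);
""")

# 2. stream the rows into the staging table in whatever order is convenient
dbconn.executemany(
  "INSERT INTO temp.staging_features VALUES(?,?,?,?)",
  rows_in_arbitrary_order()  # hypothetical producer
)
dbconn.commit()

# 3. copy into the destination table in primary key order; the extension's
#    parallel external merge sort performs the ORDER BY
dbconn.executescript("""
  INSERT INTO features SELECT * FROM temp.staging_features ORDER BY chrom, beg, end;
  DROP TABLE temp.staging_features;
""")
dbconn.close()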

Compression guidelines

The Zstandard-based compression layer is effective at capturing the high compressibility of bioinformatics data. But, one should expect a general-purpose database to use extra space to keep everything organized, compared to a file format dedicated to one read-only schema. To set a rough expectation, the maintainers feel fairly satisfied if the database file size isn't more than double that of a bespoke compression format — especially if it includes useful indexes (which if well-designed, should be relatively incompressible).

The aforementioned zstd_level, threads, and page size options (inner_page_KiB, outer_page_KiB) all affect the compression time-space tradeoff, while enlarging the page cache can reduce decompression overhead (workload-dependent).

With SQLite's row-major table storage format, the first read of a lone cell usually entails decompressing at least its whole row, and there aren't any special column encodings for deltas, run lengths, etc. The "last mile" of optimization may therefore involve certain schema compromises, such as storing infrequently-accessed columns in a separate table to join when needed, or using application-layer encodings with BLOB I/O.
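For instance, a hedged sketch of the column-splitting idea (schema hypothetical):

import genomicsqlite

dbconn = genomicsqlite.connect("example.gsql")  # hypothetical filename
dbconn.executescript("""
  -- frequently-accessed columns
  CREATE TABLE reads(read_id INTEGER PRIMARY KEY, chrom TEXT, pos INTEGER);
  -- bulky, infrequently-accessed columns kept in a side table,
  -- joined on demand via the shared key
  CREATE TABLE reads_extra(read_id INTEGER PRIMARY KEY, sequence TEXT, quality BLOB);
""")

# join only when the extra columns are actually needed
for row in dbconn.execute("""
  SELECT r.chrom, r.pos, x.sequence
    FROM reads r JOIN reads_extra x USING(read_id)
   WHERE r.chrom = ? AND r.pos BETWEEN ? AND ?
""", ("chr1", 100000, 200000)):
  print(row)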