Designing a High-Performance OLTP Database from First Principles

Jan 6, 2026

A deep dive inspired by TigerBeetle and Viewstamped Replication

While reading about Viewstamped Replication (VSR), I came across TigerBeetle, which uses this replication protocol to achieve extremely high reliability. Little did I know how fascinating the design of such a database would turn out to be.

If you had to design a highly efficient OLTP database from first principles, how would you do it?

One strong answer is: do what TigerBeetle did.

TigerBeetle builds its system by tightly integrating the four primary dimensions of computer science (network, storage, memory, and compute) to arrive at a configuration that is radically more efficient than conventional databases.


Design Goals

The system starts with a set of aggressive but clear goals:

                    Design Goals
                         |
        +----------------+----------------+
        |                |                |
    1000× Perf      10× Safety    10× Dev/Ops Exp
        |                |                |
  Orders of mag    Data Integrity    Simplicity
   more ops/sec     & Durability     & Reliability
                         |
              Single Node Optimization
                         |
            No Horizontal Scaling Complexity

Rethinking OLTP Workloads

The nature of OLTP workloads has changed drastically over time.

Originally, OLTP systems handled simple debit/credit operations, primarily for financial data. These operations are essentially database transactions derived from business transactions.

If we conceptually split OLTP into business transactions and the database transactions that implement them, we can target scale far more effectively.

Inverting Query Amplification

Traditional financial databases often require ~10 database queries per business transaction.

What if we invert this relationship?

Because OLTP workloads are fundamentally accounting workloads, debits and credits follow a fixed schema and can be batched by the thousands:

1 database query = 1000 business transactions

This inversion fundamentally changes scalability.

Traditional Approach:
    1 Business Transaction  →  ~10 Database Queries
    ❌ High overhead, many round trips

Inverted Approach (TigerBeetle):
    1 Database Query  →  ~1000 Business Transactions
    ✅ Massive batching, optimal throughput
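
To make the inversion concrete, here is a minimal sketch of the batching idea, assuming a simplified Transfer layout and the ~8000-per-batch figure used below (TigerBeetle's real client API and record format differ):

const std = @import("std");

// Simplified stand-in for a double-entry transfer record.
// The real Transfer struct has more fields and a fixed on-disk layout.
const Transfer = extern struct {
    id: u128,
    debit_account_id: u128,
    credit_account_id: u128,
    amount: u64,
};

// Illustrative batch size, in the spirit of "~8000 transactions per batch".
const batch_size = 8000;

pub fn main() void {
    // One contiguous buffer is the entire "query": no per-transaction round trips.
    var batch: [batch_size]Transfer = undefined;
    for (&batch, 0..) |*transfer, i| {
        transfer.* = .{
            .id = i + 1,
            .debit_account_id = 1,
            .credit_account_id = 2,
            .amount = 100,
        };
    }
    // Submitting `batch` as a single request amortizes network, syscall,
    // and fsync costs over all 8000 business transactions.
    std.debug.print("one request carries {d} transfers ({d} bytes)\n", .{
        batch.len,
        batch.len * @sizeOf(Transfer),
    });
}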

Exploiting the Four Axes

           TigerBeetle Architecture
                    |
    +-------+-------+-------+
    |       |       |       |
 Network Storage Memory  Compute
    |       |       |       |
    |       |       |       +-- Viewstamped Replication
    |       |       |       +-- Deterministic Leader Election
    |       |       |
    |       |       +-- Zero-copy
    |       |       +-- Zero-allocation
    |       |       +-- No runtime malloc
    |       |
    |       +-- 1 MiB WAL writes
    |       +-- Direct I/O
    |
    +-- Batching: 8000 txns/batch

Network

With batching, a single request and a single network round trip carry thousands of operations, so per-message overhead (syscalls, framing, latency) is amortized across the whole batch.

Storage

Note: Direct I/O is now often faster than buffered I/O on modern storage hardware, and large sequential WAL writes (1 MiB at a time) keep the device busy. The real bottleneck has shifted to memory, which TigerBeetle addresses with zero-copy message handling and zero-allocation design: no malloc at runtime.

Compute

Distributed systems must handle leader election when the primary node fails—which is inevitable.

Common consensus algorithms include Paxos and Raft. Here, we focus on Viewstamped Replication (VSR), which handles primary failure while preserving operation order and data correctness.


Viewstamped Replication (VSR)

Key Notes

TigerBeetle uses VSR instead of Paxos or Raft primarily because leader election is deterministic.
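
The rule itself is trivial to state: the primary for a view is the view number modulo the replica count, so every replica computes the same answer locally, with no randomized timeouts and no competing candidates. A minimal sketch (the function name is mine, not TigerBeetle's):

const std = @import("std");

/// In VSR, the primary for a given view is derived deterministically
/// from the view number: no elections decided by chance.
fn primary_index(view: u64, replica_count: u8) u8 {
    return @intCast(view % replica_count);
}

test "primary rotates round-robin as the view advances" {
    // With 3 replicas, views 0,1,2,3,... map to primaries 0,1,2,0,...
    try std.testing.expectEqual(@as(u8, 0), primary_index(0, 3));
    try std.testing.expectEqual(@as(u8, 1), primary_index(1, 3));
    try std.testing.expectEqual(@as(u8, 2), primary_index(2, 3));
    try std.testing.expectEqual(@as(u8, 0), primary_index(3, 3));
}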

VSR Protocol Flow:

Client ────Request───> Primary
                         |
                      Prepare
                         |
          +--------------+---------------+
          |                              |
    Prepare Msg                    Prepare Msg
          |                              |
          v                              v
      Replica 1                      Replica 2
          |                              |
     Prepare OK                     Prepare OK
          |                              |
          +--------------+---------------+
                         |
                         v
                      Primary
                         |
                      Commit
                         |
          +--------------+---------------+
          |              |               |
      Response      Commit Msg      Commit Msg
          |              |               |
          v              v               v
       Client        Replica 1       Replica 2
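
The commit decision at the heart of this flow can be sketched as a simple quorum count: the primary commits once a majority of the cluster, counting its own local write, has durably prepared the entry. This is an illustrative model rather than TigerBeetle's actual quorum code:

const std = @import("std");

/// Majority quorum for a cluster of `replica_count` replicas (including the primary).
fn quorum(replica_count: u8) u8 {
    return replica_count / 2 + 1;
}

/// The primary may commit an operation once enough replicas
/// (counting itself) have durably written the prepare to their WAL.
fn can_commit(prepare_ok_count: u8, replica_count: u8) bool {
    return prepare_ok_count >= quorum(replica_count);
}

test "a 3-replica cluster commits with 2 acknowledgments" {
    try std.testing.expect(!can_commit(1, 3)); // only the primary's own write
    try std.testing.expect(can_commit(2, 3)); // primary + one backup
    try std.testing.expect(can_commit(3, 3));
}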

The Performance Result

4 syscalls, 4 memory copies, 3 network requests → ~8000 TPS

The biggest win is actually the elimination of row locks: a conventional design would take on the order of 16,000 row locks for an 8,000-transfer batch (each transfer touches two accounts), while TigerBeetle takes effectively zero.

Request Batch → recv syscall → write WAL syscall → send ack syscall → read state syscall
                                                                              |
                                                                         8000 TPS ✅

Why Only Four Syscalls?

In a typical processing cycle, the system needs to:

  1. Receive batched requests from the network
  2. Write them to the WAL
  3. Send acknowledgments
  4. Read state for processing

Instead of issuing a syscall per packet or block, TigerBeetle aggregates all operations.

Using io_uring, the application submits a batch of reads and writes and enters the kernel once. This allows a single execution thread to saturate modern NVMe drives and 100 GbE network links.
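
A conceptual sketch of that batching, with a stand-in Op type and submit_all function rather than the real io_uring API: the point is that queuing work in user space and submitting it once replaces a syscall per operation with a syscall per batch.

const std = @import("std");

/// Stand-in for a queued I/O operation; with io_uring these would be
/// submission queue entries (SQEs) describing reads and writes.
const Op = union(enum) {
    recv: struct { socket: i32, len: usize },
    write_wal: struct { offset: u64, len: usize },
    send_ack: struct { socket: i32, len: usize },
    read_state: struct { offset: u64, len: usize },
};

/// Hypothetical submit: queuing N operations and entering the kernel once
/// means one kernel transition per batch, not per operation.
fn submit_all(ops: []const Op) usize {
    // In a real implementation, this is where io_uring_enter(2) happens once.
    return ops.len;
}

pub fn main() void {
    const cycle = [_]Op{
        .{ .recv = .{ .socket = 3, .len = 1 << 20 } }, // receive a 1 MiB request batch
        .{ .write_wal = .{ .offset = 0, .len = 1 << 20 } }, // append it to the journal
        .{ .send_ack = .{ .socket = 3, .len = 128 } }, // acknowledge
        .{ .read_state = .{ .offset = 4096, .len = 1 << 16 } }, // fetch state for execution
    };
    const submitted = submit_all(&cycle);
    std.debug.print("{d} operations, one kernel round trip\n", .{submitted});
}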


Durability Considerations

Scale
  └─> More Data
       └─> Higher Risk of Corruption
            └─> Exponential Impact
                 └─> Need Cryptographic Guarantees

Replication Concerns with Raft

Raft assumes its persisted log is intact. In the scenario below, where one replica is missing data, another is corrupted, and the primary then fails, the cluster can no longer elect a new leader and stalls. TigerBeetle avoids this by using cryptographic hash chains for replication integrity.

Raft Limitation:
    Replica 1: Missing Data
    Replica 2: Corrupted
    Primary Fails
      └─> Cannot Elect New Leader
           └─> System Stalls ❌

TigerBeetle Solution:
    Cryptographic Hash Chains
      └─> Verify Data Integrity
           └─> Safe Leader Election ✅

Codebase Deep Dive

Every data structure—from protocol headers to on-disk state—is fortified with cryptographic checksums.

Core Components

One interesting detail in the codebase is how the quorum_headers function tallies cluster responses.

Time is measured not in wall-clock seconds but in logical ticks, advanced as messages are processed via replica.tick().

Because state is a pure function of the log, state transfer is as simple as sending snapshots or data files.
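
A toy illustration of that property, with invented account balances and a simplified transfer type: replaying the same log from the same starting point always produces the same state, which is why state transfer can be as simple as shipping a snapshot or data file.

const std = @import("std");

const Transfer = struct { debit: u8, credit: u8, amount: u64 };

/// State is a pure function of the log: same log, same order, same result,
/// on every replica, every time.
fn replay(log: []const Transfer) [4]u64 {
    var balances = [_]u64{ 1000, 1000, 1000, 1000 };
    for (log) |t| {
        balances[t.debit] -= t.amount;
        balances[t.credit] += t.amount;
    }
    return balances;
}

test "replaying the same log always yields the same state" {
    const log = [_]Transfer{
        .{ .debit = 0, .credit = 1, .amount = 250 },
        .{ .debit = 2, .credit = 0, .amount = 100 },
    };
    const a = replay(&log);
    const b = replay(&log);
    try std.testing.expectEqualSlices(u64, &a, &b);
    try std.testing.expectEqual(@as(u64, 850), a[0]);
}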

Replica State Machine:

    [Start]
       |
       v
   ┌────────┐
   │ Normal │<─────────────┐
   └────────┘              |
       |                   |
  Primary Failure          |
       |                   |
       v                   |
   ┌───────────┐           |
   │ ViewChange │          |
   └───────────┘           |
       |                   |
  log_view < view?         |
       |                   |
    +──+──+                |
    |     |                |
   Yes   No                |
    |     |                |
    v     v                |
DoViewChange  StartView    |
    |          |           |
    |          |           |
    +────┬─────+           |
         |                 |
    New Primary Elected    |
         |                 |
         └─────────────────┘

Handling Leader Failure

To initiate a leader change, a replica transitions from normal mode to view-change mode.

This is handled in src/vsr/superblock.zig:

pub fn view_headers(superblock: *const SuperBlockHeader) vsr.Headers.ViewChangeSlice {
    return vsr.Headers.ViewChangeSlice.init(
        if (superblock.vsr_state.log_view < superblock.vsr_state.view)
            .do_view_change
        else
            .start_view,
        superblock.view_headers_all[0..superblock.view_headers_count],
    );
}

If log_view < view, the replica infers that a view change was interrupted and resumes it with do_view_change headers; otherwise it emits start_view headers.

The critical insight: view state is persisted in the superblock, not volatile memory.


Storage as the Anchor of Trust

TigerBeetle does not rely on filesystem guarantees—because they effectively do not exist.

Instead, it implements its own transactional guarantees using a cryptographic hash chain, capable of reconstructing a consistent, canonical state even from partially corrupted replicas.

SuperBlockHeader

Defined in src/vsr/superblock.zig

Uses Zig’s extern struct to guarantee precise on-disk layout.

Key fields include:

Field        Type        Description
checksum     u128        Checksum of remaining fields
copy         u16         Superblock copy index (0–3)
sequence     u64         Monotonic version counter
cluster      u128        Cluster UUID
parent       u128        Hash of previous superblock
vsr_state    VSRState    Embedded consensus state

Embedding VSRState directly into storage solves split-brain at the storage layer. On startup, the node immediately knows its latest consensus state—no log replay needed.
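
A trimmed-down sketch of the idea, using only the fields from the table above (the real SuperBlockHeader has many more fields, explicit padding, and size assertions):

const std = @import("std");

/// extern struct pins the field order and ABI layout, so the bytes on disk
/// are exactly these fields in exactly this order; nothing is implicit.
const SuperBlockHeaderSketch = extern struct {
    checksum: u128, // checksum of the remaining fields
    copy: u16, // which physical copy this is (0-3)
    sequence: u64, // monotonic version counter
    cluster: u128, // cluster UUID
    parent: u128, // checksum of the previous superblock
    vsr_state: VSRStateSketch, // consensus state persisted in storage

    /// Also a sketch: just enough to show that view numbers survive restarts.
    const VSRStateSketch = extern struct {
        view: u64,
        log_view: u64,
        commit_max: u64,
    };
};

test "layout is fixed at compile time" {
    // The real code asserts exact sizes, so any accidental layout change fails the build.
    try std.testing.expect(@sizeOf(SuperBlockHeaderSketch) % 8 == 0);
}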

The superblock is physically duplicated four times on disk.

        SuperBlock
            |
    +-------+-------+-------+
    |       |       |       |
  Copy 0  Copy 1  Copy 2  Copy 3
    |       |       |       |
    +-------+-------+-------+
            |
        Redundancy
            |
   Survives Partial Corruption
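
A sketch of why the copies matter at startup, using an invented helper rather than TigerBeetle's actual superblock quorum logic: a copy only counts if its checksum verifies, and among the valid copies a newest-wins policy is enough to survive a torn write to any single copy.

const std = @import("std");

const Copy = struct {
    sequence: u64,
    checksum_valid: bool, // in reality: recompute the checksum and compare
};

/// Pick the newest copy whose checksum verifies; corrupted or torn copies
/// are simply ignored. Returns null only if every copy is damaged.
fn select_superblock(copies: []const Copy) ?usize {
    var best: ?usize = null;
    for (copies, 0..) |copy, i| {
        if (!copy.checksum_valid) continue;
        if (best == null or copies[best.?].sequence < copy.sequence) best = i;
    }
    return best;
}

test "a corrupted copy does not win even with a higher sequence" {
    const copies = [_]Copy{
        .{ .sequence = 7, .checksum_valid = true },
        .{ .sequence = 9, .checksum_valid = false }, // torn write: ignored
        .{ .sequence = 8, .checksum_valid = true },
        .{ .sequence = 8, .checksum_valid = true },
    };
    try std.testing.expectEqual(@as(?usize, 2), select_superblock(&copies));
}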

Journal (WAL) as a Hash Chain

TigerBeetle’s WAL—called the journal—is not a simple append log.

Each entry contains a sequence number, the checksum of the previous entry, and the transaction data. This forms a hash chain similar to a blockchain.

The function valid_hash_chain_between verifies integrity across log ranges. If corruption is detected (e.g., torn writes), the system repairs itself by fetching clean blocks from peers.

Journal Hash Chain:

[Entry 1] --checksum--> [Entry 2] --checksum--> [Entry 3] --checksum--> [Entry 4] --checksum--> [Entry 5]

Each entry contains:
  - Sequence number
  - Checksum of previous entry
  - Transaction data

Similar to blockchain structure for integrity verification
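
A small sketch of hash-chain verification in the spirit of valid_hash_chain_between; the entry layout and the use of SHA-256 here are illustrative stand-ins for TigerBeetle's own 128-bit checksums.

const std = @import("std");
const Sha256 = std.crypto.hash.sha2.Sha256;

const Entry = struct {
    sequence: u64,
    checksum_of_previous: [Sha256.digest_length]u8,
    data: []const u8,
};

fn checksum_of(entry: Entry) [Sha256.digest_length]u8 {
    var hasher = Sha256.init(.{});
    hasher.update(std.mem.asBytes(&entry.sequence));
    hasher.update(&entry.checksum_of_previous);
    hasher.update(entry.data);
    var out: [Sha256.digest_length]u8 = undefined;
    hasher.final(&out);
    return out;
}

/// Walk the chain: each entry must reference the checksum of its predecessor.
/// A torn or corrupted entry breaks the chain at a detectable point.
fn valid_hash_chain(entries: []const Entry) bool {
    var i: usize = 1;
    while (i < entries.len) : (i += 1) {
        const expected = checksum_of(entries[i - 1]);
        if (!std.mem.eql(u8, &entries[i].checksum_of_previous, &expected)) return false;
    }
    return true;
}

test "a mutated entry is detected" {
    var genesis = Entry{
        .sequence = 1,
        .checksum_of_previous = [_]u8{0} ** Sha256.digest_length,
        .data = "transfer A",
    };
    const second = Entry{
        .sequence = 2,
        .checksum_of_previous = checksum_of(genesis),
        .data = "transfer B",
    };
    try std.testing.expect(valid_hash_chain(&[_]Entry{ genesis, second }));
    genesis.data = "transfer X"; // simulate corruption of the first entry
    try std.testing.expect(!valid_hash_chain(&[_]Entry{ genesis, second }));
}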

Storage Data Structures

Instead of a single LSM tree, TigerBeetle uses an LSM forest: separate object trees (for example, transfers stored by timestamp) and index trees (primary and secondary indexes). This improves locality and performance.

        LSM Forest
            |
    +-------+-------+
    |               |
Object Trees   Index Trees
    |               |
Transfers by    +---+---+
 Timestamp      |       |
                |       |
            Primary  Secondary
            Indexes  Indexes
                |
                v
         Better Locality
                |
                v
        Higher Performance
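
A conceptual sketch of the forest idea, with hash maps standing in for on-disk LSM trees: objects live in one tree keyed by timestamp, and each index is its own small tree that maps a key back to a timestamp.

const std = @import("std");

const Transfer = struct { timestamp: u64, debit_account: u128, amount: u64 };

/// One tree per concern: the object tree stores whole transfers by timestamp,
/// while a separate index tree maps an account to the timestamp of its latest
/// transfer. Keeping them separate keeps each tree's keys small and accesses local.
const Forest = struct {
    objects: std.AutoHashMap(u64, Transfer), // stand-in for the object LSM tree
    by_debit_account: std.AutoHashMap(u128, u64), // stand-in for a secondary index tree

    fn init(allocator: std.mem.Allocator) Forest {
        return .{
            .objects = std.AutoHashMap(u64, Transfer).init(allocator),
            .by_debit_account = std.AutoHashMap(u128, u64).init(allocator),
        };
    }

    fn put(self: *Forest, transfer: Transfer) !void {
        try self.objects.put(transfer.timestamp, transfer);
        try self.by_debit_account.put(transfer.debit_account, transfer.timestamp);
    }
};

test "lookup goes index tree -> object tree" {
    var forest = Forest.init(std.testing.allocator);
    defer forest.objects.deinit();
    defer forest.by_debit_account.deinit();

    try forest.put(.{ .timestamp = 42, .debit_account = 7, .amount = 100 });
    const ts = forest.by_debit_account.get(7).?; // index tree: account -> timestamp
    const transfer = forest.objects.get(ts).?; // object tree: timestamp -> transfer
    try std.testing.expectEqual(@as(u64, 100), transfer.amount);
}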

I/O: Direct I/O and Determinism

TigerBeetle bypasses OS abstractions like page cache to eliminate unpredictability.

From src/io/linux.zig:

fn fs_supports_direct_io(dir_fd: fd_t) !bool {
    if (!@hasField(posix.O, "DIRECT")) return false;
    // ...
}

Kernel page cache is dangerous because:

  1. fsync() may lie about durability
  2. Memory pressure causes unpredictable eviction
  3. Cached corruption propagates silently

Traditional I/O ❌                   Direct I/O ✅
        |                                 |
 Kernel Page Cache                Bypass Page Cache
        |                                 |
  - fsync() may lie               - Deterministic behavior
  - Unpredictable eviction        - True durability
  - Silent corruption spreads

Static Memory Allocation

Many production crashes happen under load due to allocation failures.

TigerBeetle avoids this entirely: every buffer it will ever need is allocated once at startup, so allocation cannot fail under load.

See src/message_pool.zig and this stack trace

Static Memory Allocation Flow:

Startup
  └─> Calculate Max Memory Requirements
       └─> Allocate One Large Contiguous Block
            └─> Create Buffer Pool
                 └─> Runtime ✅
                      ├─> Pop Buffer
                      ├─> Use Buffer
                      └─> Return Buffer
                           └─> (cycle continues)

Benefits:
  • No allocation failures under load
  • Predictable memory usage
  • No fragmentation
  • Zero runtime malloc overhead
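
A minimal sketch of the pattern, with illustrative sizes and names rather than those in src/message_pool.zig: all buffers are created up front, and "allocation" at runtime is just popping from a fixed free list.

const std = @import("std");

const message_size = 4096; // illustrative; real messages are larger
const message_count = 32; // upper bound decided once, at startup

/// All buffers exist up front; acquiring one is popping an index off a free list,
/// and it can only "fail" in the bounded, explicit sense of "the pool is busy".
const MessagePool = struct {
    buffers: [message_count][message_size]u8 = undefined,
    free: [message_count]u16 = undefined,
    free_count: u16 = 0,

    fn init(pool: *MessagePool) void {
        for (&pool.free, 0..) |*slot, i| slot.* = @intCast(i);
        pool.free_count = message_count;
    }

    fn acquire(pool: *MessagePool) ?*[message_size]u8 {
        if (pool.free_count == 0) return null; // bounded, never an OOM crash
        pool.free_count -= 1;
        return &pool.buffers[pool.free[pool.free_count]];
    }

    fn release(pool: *MessagePool, buffer: *[message_size]u8) void {
        const index = (@intFromPtr(buffer) - @intFromPtr(&pool.buffers[0])) / message_size;
        pool.free[pool.free_count] = @intCast(index);
        pool.free_count += 1;
    }
};

test "buffers cycle through the pool without any runtime allocation" {
    var pool: MessagePool = .{};
    pool.init();
    const a = pool.acquire().?;
    a[0] = 0xAB; // use the buffer
    pool.release(a);
    try std.testing.expect(pool.free_count == message_count);
}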

A Note on Programming Style

This style of programming is refreshing.

Even without deep familiarity with Zig, the code is easy to follow: explicit on-disk layouts, static limits, and the absence of hidden allocation make both data flow and control flow obvious.

It is a style that makes correctness obvious—and bugs uncomfortable.


End result: a database designed from first principles, where performance, safety, and determinism are not trade-offs, but consequences of the architecture.

