Designing a High-Performance OLTP Database from First Principles
Jan 6, 2026
A deep dive inspired by TigerBeetle and Viewstamped Replication
While reading about Viewstamped Replication (VSR), I came across TigerBeetle, which uses this replication protocol to achieve extremely high reliability. Little did I know how fascinating the design of such a database would turn out to be.
If you had to design a highly efficient OLTP database from first principles, how would you do it?
One strong answer is: do what TigerBeetle did.
TigerBeetle builds its system by tightly integrating the four primary resources of the machine—network, storage, memory, and compute—to arrive at a design that is radically more efficient than conventional databases.
Design Goals
The system starts with a set of aggressive but clear goals:
- No horizontal scaling—instead, target orders of magnitude more ops/sec on a single node
- 1000× performance
- 10× safety
- 10× developer and operator experience
Design Goals
|
+----------------+----------------+
| | |
1000× Perf 10× Safety 10× Dev/Ops Exp
| | |
Orders of mag Data Integrity Simplicity
more ops/sec & Durability & Reliability
|
Single Node Optimization
|
No Horizontal Scaling Complexity
Rethinking OLTP Workloads
The nature of OLTP workloads has changed drastically over time.
Originally, OLTP systems handled simple debit/credit operations, primarily for financial data. These operations are essentially database transactions derived from business transactions.
If we conceptually split OLTP into:
- OLGP – general-purpose workloads
- OLTP – pure transaction processing
we can target scale far more effectively.
Inverting Query Amplification
Traditional financial databases often require ~10 database queries per business transaction.
What if we invert this relationship?
Because OLTP workloads are fundamentally accounting workloads (simple debits and credits against a fixed schema), many business transactions can be packed into a single query:
1 database query = 1000 business transactions
This inversion fundamentally changes scalability.
Traditional Approach:
1 Business Transaction → ~10 Database Queries
❌ High overhead, many round trips
Inverted Approach (TigerBeetle):
1 Database Query → ~1000 Business Transactions
✅ Massive batching, optimal throughput
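To make the inversion concrete, here is a minimal sketch in Zig of what the batched shape looks like. The Transfer layout and submit_batch function are hypothetical illustrations, not TigerBeetle's actual client API; the only property borrowed from TigerBeetle is that a debit/credit event is a small fixed-size (128-byte) record, so one request can carry thousands of them.

const std = @import("std");

// Hypothetical 128-byte debit/credit event (the real Transfer struct is also
// 128 bytes, but with a different field layout).
const Transfer = extern struct {
    id: u128,
    debit_account_id: u128,
    credit_account_id: u128,
    amount: u128,
    ledger: u32,
    code: u16,
    flags: u16,
    timestamp: u64,
    reserved: [48]u8,
};

comptime {
    std.debug.assert(@sizeOf(Transfer) == 128);
}

// One "database query" is a whole slice of business transactions: the client
// serializes thousands of transfers into a single request message.
fn submit_batch(transfers: []const Transfer) void {
    std.debug.print("submitting {d} transfers in one request\n", .{transfers.len});
}

pub fn main() void {
    const batch = [_]Transfer{std.mem.zeroes(Transfer)} ** 3;
    submit_batch(&batch);
}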
Exploiting the Four Axes
TigerBeetle Architecture
|
+-------+-------+-------+-------+
| | | | |
Network Storage Memory Compute
| | | |
| | | +-- Viewstamped Replication
| | | +-- Deterministic Leader Election
| | |
| | +-- Zero-copy
| | +-- Zero-allocation
| | +-- No runtime malloc
| |
| +-- 1 MiB WAL writes
| +-- Direct I/O
|
+-- Batching: 8000 txns/batch
Network
With batching:
- One recv() call can ingest ~1 MiB of data
- This corresponds to roughly 8000 transactions per batch
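The ~8000 figure falls out of simple arithmetic. Assuming a 128-byte transfer record and a roughly 1 MiB message with a small fixed header (illustrative numbers, not exact TigerBeetle constants):

// Back-of-the-envelope batch sizing as comptime constants.
const message_size_max = 1024 * 1024; // one ~1 MiB recv() per batch
const header_size = 128;              // assumed fixed-size message header
const transfer_size = 128;            // one debit/credit event

// (1 MiB - 128 B) / 128 B = 8191 transfers per network round trip.
const transfers_per_batch = (message_size_max - header_size) / transfer_size;

comptime {
    @import("std").debug.assert(transfers_per_batch == 8191);
}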
Storage
- One write() call writes a 1 MiB WAL entry
- Low latency via fsync() or direct I/O
Note: Direct I/O is now often faster than buffered I/O due to modern storage hardware. The real bottleneck has shifted to memory, which must be addressed via:
- Zero-copy
- Zero-allocation
- No malloc during runtime
Compute
Distributed systems must handle leader election when the primary node fails—which is inevitable.
Common consensus algorithms include Paxos and Raft. Here, we focus on Viewstamped Replication (VSR), which handles primary failure while preserving operation order and data correctness.
Viewstamped Replication (VSR)
Key Notes
- Layering VSR on top of the single-node design turns it into a highly available distributed system
- Fault tolerance is achieved through redundancy in space and time
- Based on state machine replication: initial state + replicated operation log → identical final state
TigerBeetle uses VSR instead of Paxos or Raft primarily because leader election is deterministic.
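The determinism is easy to state in code: the primary for a given view is a pure function of the view number, so every replica computes the same answer without exchanging votes. A minimal sketch of that round-robin rule follows; the real protocol layers more on top of this, but the core rule is just rotation by view.

const std = @import("std");

// The primary for a view is derived from the view number: round-robin across
// the replicas. No election messages are needed to agree on who leads.
fn primary_index(view: u32, replica_count: u8) u8 {
    return @intCast(view % replica_count);
}

test "primary rotates deterministically with the view number" {
    try std.testing.expectEqual(@as(u8, 0), primary_index(0, 3));
    try std.testing.expectEqual(@as(u8, 1), primary_index(1, 3));
    try std.testing.expectEqual(@as(u8, 0), primary_index(3, 3));
}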
VSR Protocol Flow:
Client ────Request───> Primary
|
Prepare
|
+--------------+---------------+
| |
Prepare Msg Prepare Msg
| |
v v
Replica 1 Replica 2
| |
Prepare OK Prepare OK
| |
+--------------+---------------+
|
v
Primary
|
Commit
|
+--------------+---------------+
| | |
Response Commit Msg Commit Msg
| | |
v v v
Client Replica 1 Replica 2
The Performance Result
4 syscalls, 4 memory copies, 3 network requests → ~8000 transactions processed
The biggest win is actually the reduction of row locks: from roughly 16,000 per batch (two account rows for each of the ~8000 transfers) down to effectively zero.
Request Batch → recv syscall → write WAL syscall → send ack syscall → read state syscall
        |
        ~8000 transactions per batch ✅
Why Only Four Syscalls?
In a typical processing cycle, the system needs to:
- Receive batched requests from the network
- Write them to the WAL
- Send acknowledgments
- Read state for processing
Instead of issuing a syscall per packet or block, TigerBeetle aggregates all operations.
Using io_uring, the application submits a batch of reads and writes and enters the kernel once. This allows a single execution thread to saturate modern NVMe drives and 100 GbE network links.
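As a sketch of that pattern, the snippet below queues a batch of WAL writes and submits them all with a single io_uring enter. It uses std.os.linux.IoUring from recent Zig standard libraries; the exact API surface shifts between Zig versions, and TigerBeetle's own I/O layer in src/io/ is considerably more elaborate, so treat this as an illustration of "many operations, one kernel crossing" rather than drop-in code.

const std = @import("std");
const linux = std.os.linux;

// Queue one write per buffer, then enter the kernel exactly once for the
// whole batch (assumes at most 256 buffers per flush).
fn flush_wal_batch(fd: std.posix.fd_t, buffers: []const []const u8) !void {
    var ring = try linux.IoUring.init(256, 0);
    defer ring.deinit();

    var offset: u64 = 0;
    for (buffers, 0..) |buffer, i| {
        // Only appends a submission-queue entry in user space; no syscall yet.
        _ = try ring.write(@intCast(i), fd, buffer, offset);
        offset += buffer.len;
    }

    // A single io_uring_enter(2) submits every queued write.
    _ = try ring.submit();

    // Reap one completion per queued write.
    for (buffers) |_| {
        const cqe = try ring.copy_cqe();
        if (cqe.res < 0) return error.WalWriteFailed;
    }
}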
Durability Considerations
- More scale demands more durability
- Even 0.5% disk corruption over two years becomes catastrophic at large scale
- The likelihood of hitting latent corruption compounds with the amount of data under management
Scale
└─> More Data
└─> Higher Risk of Corruption
└─> Exponential Impact
└─> Need Cryptographic Guarantees
Replication Concerns with Raft
- Raft does not exploit the cluster's global redundancy
- In a 1-primary, 2-replica setup, you can simultaneously have:
  - One replica missing data
  - One replica with corrupted data
- If the primary then fails, Raft cannot determine a new leader and stalls
TigerBeetle avoids this by using cryptographic hash chains for replication integrity.
Raft Limitation:
Replica 1: Missing Data
Replica 2: Corrupted
Primary Fails
└─> Cannot Elect New Leader
└─> System Stalls ❌
TigerBeetle Solution:
Cryptographic Hash Chains
└─> Verify Data Integrity
└─> Safe Leader Election ✅
Codebase Deep Dive
- Written in Zig
- Explicit control flow
- Strict, static resource allocation
- Extensive use of “checksums to check the checksums”
Every data structure—from protocol headers to on-disk state—is fortified with cryptographic checksums.
Core Components
- VSR implementation: src/vsr.zig
- Replica state machine: src/vsr/replica.zig
  - Acts as the CPU of a TigerBeetle node
  - Transitions state based on VSR messages (Prepare, Commit, ViewChange, etc.)
One interesting finding from this crash report is how the quorum_headers function tallies responses from the rest of the cluster.
Time is measured not in wall-clock seconds but in logical ticks, advanced via replica.tick() as messages are processed.
Because state is a pure function of the log, state transfer is as simple as sending snapshots or data files.
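That property is just state machine replication in miniature. A toy sketch follows (hypothetical Account and Op types, nothing from the actual codebase): replaying the same log over the same initial state always produces the same final state, which is why shipping the log, or a snapshot of the resulting data file, is enough to bring a replica up to date.

const std = @import("std");

const Account = struct { balance: i64 = 0 };
const Op = struct { account: u8, amount: i64 };

// Applying an operation is deterministic: no clocks, no randomness, no I/O.
fn apply(state: *[4]Account, op: Op) void {
    state[op.account].balance += op.amount;
}

// state = fold(apply, initial_state, log): a pure function of the log.
fn replay(log: []const Op) [4]Account {
    var state = [_]Account{.{}} ** 4;
    for (log) |op| apply(&state, op);
    return state;
}

test "identical logs produce identical state" {
    const log = [_]Op{ .{ .account = 0, .amount = -100 }, .{ .account = 1, .amount = 100 } };
    try std.testing.expectEqual(replay(&log), replay(&log));
}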
Replica State Machine:
[Start]
|
v
┌────────┐
│ Normal │<─────────────┐
└────────┘ |
| |
Primary Failure |
| |
v |
┌───────────┐ |
│ ViewChange │ |
└───────────┘ |
| |
log_view < view? |
| |
+──+──+ |
| | |
Yes No |
| | |
v v |
DoViewChange StartView |
| | |
| | |
+────┬─────+ |
| |
New Primary Elected |
| |
└─────────────────┘
Handling Leader Failure
To initiate a view change, a replica transitions from normal status to view-change status.
This is handled in src/vsr/superblock.zig:
pub fn view_headers(superblock: *const SuperBlockHeader) vsr.Headers.ViewChangeSlice {
return vsr.Headers.ViewChangeSlice.init(
if (superblock.vsr_state.log_view < superblock.vsr_state.view)
.do_view_change
else
.start_view,
superblock.view_headers_all[0..superblock.view_headers_count],
);
}
If log_view < view, the replica infers that a view change was in progress and sends do_view_change headers; otherwise it sends start_view headers.
The critical insight: view state is persisted in the superblock, not volatile memory.
Storage as the Anchor of Trust
TigerBeetle does not rely on filesystem guarantees—because they effectively do not exist.
Instead, it implements its own transactional guarantees using a cryptographic hash chain, capable of reconstructing a consistent global state even from partially corrupted replicas.
SuperBlockHeader
Defined in src/vsr/superblock.zig
Uses Zig’s extern struct to guarantee precise on-disk layout.
Key fields include:
| Field | Type | Description |
|---|---|---|
| checksum | u128 | Checksum of the remaining fields |
| copy | u16 | Superblock copy index (0–3) |
| sequence | u64 | Monotonic version counter |
| cluster | u128 | Cluster UUID |
| parent | u128 | Hash of the previous superblock |
| vsr_state | VSRState | Embedded consensus state |
Embedding VSRState directly into storage solves split-brain at the storage layer. On startup, the node immediately knows its latest consensus state—no log replay needed.
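A stripped-down sketch of that idea is below. The field set mirrors the table above, but the real SuperBlockHeader has many more fields and a carefully specified layout, so take the exact types and ordering here as illustrative only.

// extern struct gives a well-defined, C-compatible in-memory layout, so the
// struct can be written to and read from disk byte-for-byte.
const VSRState = extern struct {
    view: u32,
    log_view: u32,
    commit_max: u64,
};

const SuperBlockHeader = extern struct {
    checksum: u128,      // covers every field after it
    copy: u16,           // which of the four on-disk copies this is (0-3)
    reserved: [14]u8,    // explicit padding keeps the layout unambiguous
    sequence: u64,       // monotonic version counter
    cluster: u128,       // cluster UUID
    parent: u128,        // checksum of the previous superblock version
    vsr_state: VSRState, // consensus state persisted alongside the data
};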
The superblock is physically duplicated four times on disk.
SuperBlock
|
+-------+-------+-------+
| | | |
Copy 0 Copy 1 Copy 2 Copy 3
| | | |
+-------+-------+-------+
|
Redundancy
|
Survives Partial Corruption
Journal (WAL) as a Hash Chain
TigerBeetle’s WAL—called the journal—is not a simple append log.
Each entry contains:
- A sequence number
- A checksum pointing to the previous entry
This forms a hash chain similar to a blockchain.
The function valid_hash_chain_between verifies integrity across log ranges. If corruption is detected (e.g., torn writes), the system repairs itself by fetching clean blocks from peers.
Journal Hash Chain:
[Entry 1] --checksum--> [Entry 2] --checksum--> [Entry 3] --checksum--> [Entry 4] --checksum--> [Entry 5]
Each entry contains:
- Sequence number
- Checksum of previous entry
- Transaction data
Similar to blockchain structure for integrity verification
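A minimal version of the chain check looks like the following (hypothetical Entry type and field names; the real valid_hash_chain_between has more to handle, such as gaps in the log):

const std = @import("std");

// Hypothetical journal entry: each entry stores the checksum of its parent,
// so any replica can re-verify a whole range of the log on its own.
const Entry = struct {
    sequence: u64,
    parent: [32]u8,   // checksum of the previous entry
    checksum: [32]u8, // checksum over this entry's header and body
};

// True if every entry links back to its predecessor with no breaks.
fn valid_hash_chain(entries: []const Entry) bool {
    if (entries.len < 2) return true;
    for (entries[1..], entries[0 .. entries.len - 1]) |next, prev| {
        if (next.sequence != prev.sequence + 1) return false;
        if (!std.mem.eql(u8, &next.parent, &prev.checksum)) return false;
    }
    return true;
}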
Storage Data Structures
Instead of a single LSM tree, TigerBeetle uses an LSM forest:
- Object trees – transfers sorted by timestamp
- Index trees – primary and secondary indexes
This improves locality and performance.
LSM Forest
|
+-------+-------+
| |
Object Trees Index Trees
| |
Transfers by +---+---+
Timestamp | |
| |
Primary Secondary
Indexes Indexes
|
v
Better Locality
|
v
Higher Performance
I/O: Direct I/O and Determinism
TigerBeetle bypasses OS abstractions like the page cache to eliminate unpredictability.
From src/io/linux.zig:
fn fs_supports_direct_io(dir_fd: fd_t) !bool {
    if (!@hasField(posix.O, "DIRECT")) return false;
    // ...
}
Kernel page cache is dangerous because:
- fsync() may lie about durability
- Memory pressure causes unpredictable eviction
- Cached corruption propagates silently
Traditional I/O ❌ Direct I/O ✅
| |
Kernel Page Cache Bypass Page Cache
| |
+---+---+ +----+----+
| | | | |
fsync Unpredictable Deterministic True
may lie eviction Behavior Durability
|
Silent corruption
Static Memory Allocation
Many production crashes happen under load due to allocation failures.
TigerBeetle avoids this entirely:
- Startup: compute maximum memory requirements
- Allocate one large contiguous block
- Runtime: pop and return buffers from a pool
See src/message_pool.zig and this stack trace.
Static Memory Allocation Flow:
Startup
└─> Calculate Max Memory Requirements
└─> Allocate One Large Contiguous Block
└─> Create Buffer Pool
└─> Runtime ✅
├─> Pop Buffer
├─> Use Buffer
└─> Return Buffer
└─> (cycle continues)
Benefits:
• No allocation failures under load
• Predictable memory usage
• No fragmentation
• Zero runtime malloc overhead
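A minimal sketch of the pattern: one allocation up front, then a pool that only hands buffers out and takes them back. The names and sizes here are made up; the real src/message_pool.zig is richer, but the "never allocate after startup" rule is the same.

const std = @import("std");

// All message memory is reserved in one allocation at startup; at runtime the
// pool only hands out and takes back fixed-size buffers, never calling malloc.
const MessagePool = struct {
    const message_size = 256 * 1024; // hypothetical sizes for the sketch
    const messages_max = 64;

    buffers: [][message_size]u8,
    free_indexes: [messages_max]usize,
    free_count: usize,

    fn init(allocator: std.mem.Allocator) !MessagePool {
        var pool = MessagePool{
            .buffers = try allocator.alloc([message_size]u8, messages_max),
            .free_indexes = undefined,
            .free_count = messages_max,
        };
        for (&pool.free_indexes, 0..) |*slot, i| slot.* = i;
        return pool;
    }

    // Returns null when the pool is exhausted: the caller must apply
    // backpressure rather than allocate more memory under load.
    fn acquire(pool: *MessagePool) ?*[message_size]u8 {
        if (pool.free_count == 0) return null;
        pool.free_count -= 1;
        return &pool.buffers[pool.free_indexes[pool.free_count]];
    }

    fn release(pool: *MessagePool, buffer: *[message_size]u8) void {
        const index = (@intFromPtr(buffer) - @intFromPtr(pool.buffers.ptr)) / message_size;
        pool.free_indexes[pool.free_count] = index;
        pool.free_count += 1;
    }
};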
A Note on Programming Style
This style of programming is refreshing.
Even without deep familiarity with Zig, the code is easy to follow due to:
- Explicit callbacks
- Clear control flow
- Descriptive struct layouts
- Noun-based naming (replica.pipeline, replica.preparing)
It is a style that makes correctness obvious—and bugs uncomfortable.
End result: a database designed from first principles, where performance, safety, and determinism are not trade-offs, but consequences of the architecture.