
Tick Archive: High-Throughput Market Data Storage

The Problem

Real-time market data is valuable but ephemeral. For backtesting and ML training, you need historical tick data. Commercial solutions are expensive. I built my own archival system.

Architecture

Kafka (market-data) -> Go Ingestor -> WAL (local HDD) -> Flusher -> Parquet (Ceph)
                           |                              |
                    symbols.dict                   date/hour/bucket/
                    (symbol->id)                   part-NNNNNN.parquet

Infrastructure:

  • Deployed on tick-archive-lxc (VMID 206, 10.31.11.16)
  • 4 vCPU, 4GB RAM
  • 20GB local-lvm for WAL (fast sequential writes)
  • 250GB Ceph RBD for Parquet (distributed, fault-tolerant)

WAL Implementation

The WAL is designed specifically for HDD-friendly sequential writes:

Buffering Strategy:

  • 4MB write buffer accumulates before syscall
  • fdatasync after every 128MB written or every 15 seconds, whichever comes first (safety cap)
  • 256MB segment size before rotation
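
A minimal sketch of this buffering strategy in Go. The thresholds come from the list above; the struct fields, segment naming, and the use of golang.org/x/sys/unix for fdatasync are illustrative assumptions, and a real implementation would also fire the 15-second sync from a timer rather than only when a write happens to arrive.

package wal

import (
	"bufio"
	"fmt"
	"os"
	"time"

	"golang.org/x/sys/unix" // Fdatasync (Linux)
)

const (
	bufSize     = 4 << 20          // 4MB write buffer
	syncBytes   = 128 << 20        // fdatasync after 128MB...
	syncEvery   = 15 * time.Second // ...or 15 seconds, whichever comes first
	segmentSize = 256 << 20        // rotate the segment file at 256MB
)

type Writer struct {
	dir      string
	seq      int
	f        *os.File
	buf      *bufio.Writer
	unsynced int64 // bytes written since the last fdatasync
	segBytes int64 // bytes written to the current segment
	lastSync time.Time
}

// Append buffers one fixed-size record, syncing and rotating as needed.
func (w *Writer) Append(rec []byte) error {
	if _, err := w.buf.Write(rec); err != nil {
		return err
	}
	w.unsynced += int64(len(rec))
	w.segBytes += int64(len(rec))

	if w.unsynced >= syncBytes || time.Since(w.lastSync) >= syncEvery {
		if err := w.buf.Flush(); err != nil {
			return err
		}
		if err := unix.Fdatasync(int(w.f.Fd())); err != nil {
			return err
		}
		w.unsynced = 0
		w.lastSync = time.Now()
	}

	if w.segBytes >= segmentSize {
		return w.rotate()
	}
	return nil
}

// rotate closes the current segment and starts the next one.
func (w *Writer) rotate() error {
	if err := w.buf.Flush(); err != nil {
		return err
	}
	if err := w.f.Sync(); err != nil {
		return err
	}
	if err := w.f.Close(); err != nil {
		return err
	}
	w.seq++
	f, err := os.OpenFile(fmt.Sprintf("%s/%06d.wal", w.dir, w.seq),
		os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	w.f, w.buf = f, bufio.NewWriterSize(f, bufSize)
	w.segBytes, w.unsynced, w.lastSync = 0, 0, time.Now()
	return nil
}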

Fixed-Size Binary Records (48 bytes):

ts_nanos    int64   (8)  - Event timestamp UTC
symbol_id   uint32  (4)  - From symbols.dict
price       int64   (8)  - Scaled: price * 1e6
size        uint32  (4)  - Trade size
exchange    uint16  (2)  - Exchange enum
cond_len    uint8   (1)  - Condition count (0-7)
cond_data   [7]u8   (7)  - Packed condition codes
trade_id    int64   (8)  - Trade ID from feed
flags       uint16  (2)  - Replay/correction flags
crc32       uint32  (4)  - CRC of above 44 bytes
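
A sketch of the record encoding. Field order and widths follow the table above; little-endian byte order and the IEEE CRC32 polynomial are assumptions, since the post does not pin either down.

package wal

import (
	"encoding/binary"
	"hash/crc32"
)

const RecordSize = 48

type Record struct {
	TsNanos  int64 // event timestamp, UTC nanoseconds
	SymbolID uint32
	Price    int64 // price * 1e6
	Size     uint32
	Exchange uint16
	CondLen  uint8
	CondData [7]byte
	TradeID  int64
	Flags    uint16
}

// Marshal packs the record and appends a CRC32 of the first 44 bytes.
func (r *Record) Marshal() [RecordSize]byte {
	var b [RecordSize]byte
	le := binary.LittleEndian
	le.PutUint64(b[0:8], uint64(r.TsNanos))
	le.PutUint32(b[8:12], r.SymbolID)
	le.PutUint64(b[12:20], uint64(r.Price))
	le.PutUint32(b[20:24], r.Size)
	le.PutUint16(b[24:26], r.Exchange)
	b[26] = r.CondLen
	copy(b[27:34], r.CondData[:])
	le.PutUint64(b[34:42], uint64(r.TradeID))
	le.PutUint16(b[42:44], r.Flags)
	le.PutUint32(b[44:48], crc32.ChecksumIEEE(b[:44])) // CRC over the preceding 44 bytes
	return b
}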

Backpressure Mechanism:

  • Checks total WAL size before each write
  • Returns ErrWALFull if exceeds 10GB
  • Ingestor pauses consumption, Kafka backs up naturally
  • Never drops data silently
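
Roughly what the check might look like. ErrWALFull and the 10GB cap come from the design above; the directory walk is just one illustrative way to measure total WAL size (the real system presumably tracks a running total instead of re-scanning the directory on every write).

package wal

import (
	"errors"
	"io/fs"
	"path/filepath"
)

// ErrWALFull tells the ingest loop to pause Kafka consumption.
var ErrWALFull = errors.New("wal: 10GB size cap reached")

const maxWALBytes = 10 << 30 // 10GB

// checkBackpressure is called before each write. When the cap is hit, the
// ingestor stops consuming; messages back up in Kafka instead of being
// dropped, and the event is counted for alerting.
func checkBackpressure(walDir string) error {
	var total int64
	err := filepath.WalkDir(walDir, func(_ string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		total += info.Size()
		return nil
	})
	if err != nil {
		return err
	}
	if total >= maxWALBytes {
		return ErrWALFull
	}
	return nil
}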

Symbol Dictionary:

  • Maps symbol strings to uint32 IDs (4 bytes vs variable string)
  • Append-only, persisted to symbols.dict
  • fsync before updating in-memory map
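
A sketch of the dictionary's write path. The on-disk format (one id/symbol line per entry) is an assumption; the important part is the ordering: persist and fsync before the in-memory map hands out the new ID.

package wal

import (
	"fmt"
	"os"
	"sync"
)

type SymbolDict struct {
	mu   sync.Mutex
	f    *os.File          // symbols.dict, opened with O_APPEND
	ids  map[string]uint32 // symbol -> id
	next uint32
}

// ID returns the existing id for a symbol, or durably allocates a new one.
func (d *SymbolDict) ID(symbol string) (uint32, error) {
	d.mu.Lock()
	defer d.mu.Unlock()

	if id, ok := d.ids[symbol]; ok {
		return id, nil
	}

	id := d.next
	// Append the new entry and fsync *before* updating the in-memory map,
	// so an id is never used in the WAL that could be lost on crash.
	if _, err := fmt.Fprintf(d.f, "%d\t%s\n", id, symbol); err != nil {
		return 0, err
	}
	if err := d.f.Sync(); err != nil {
		return 0, err
	}
	d.ids[symbol] = id
	d.next++
	return id, nil
}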

Parquet Conversion

Partitioning Strategy:

date=YYYY-MM-DD/hour=HH/bucket=NNN/part-NNNNNN.parquet
  • date: Natural retention boundary
  • hour: Limits files per directory, enables parallel replay
  • bucket: xxhash32(symbol_id) & 255 distributes symbols evenly
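
The path construction looks roughly like this. The stdlib FNV-1a hash is a stand-in for xxhash32 so the sketch has no external dependency; the layout string itself matches the scheme above.

package flush

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"time"
)

// bucket spreads symbols across 256 directories per hour.
func bucket(symbolID uint32) uint32 {
	var b [4]byte
	binary.LittleEndian.PutUint32(b[:], symbolID)
	h := fnv.New32a() // stand-in for xxhash32
	h.Write(b[:])
	return h.Sum32() & 255
}

// partPath builds date=YYYY-MM-DD/hour=HH/bucket=NNN/part-NNNNNN.parquet.
func partPath(ts time.Time, symbolID uint32, partNo int) string {
	ts = ts.UTC()
	return fmt.Sprintf("date=%s/hour=%02d/bucket=%03d/part-%06d.parquet",
		ts.Format("2006-01-02"), ts.Hour(), bucket(symbolID), partNo)
}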

Arrow/Parquet Schema:

  • Uses Apache Arrow Go library
  • Zstd compression (level 3) for 3-5x compression
  • Dictionary encoding enabled
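
A sketch of the schema and writer configuration. The v14 import path is an assumption (use whichever Arrow Go version the project pins), and the column set simply mirrors the WAL record, with condition codes carried as a binary blob.

package flush

import (
	"io"

	"github.com/apache/arrow/go/v14/arrow"
	"github.com/apache/arrow/go/v14/parquet"
	"github.com/apache/arrow/go/v14/parquet/compress"
	"github.com/apache/arrow/go/v14/parquet/pqarrow"
)

// tickSchema mirrors the WAL record fields.
var tickSchema = arrow.NewSchema([]arrow.Field{
	{Name: "ts_nanos", Type: arrow.PrimitiveTypes.Int64},
	{Name: "symbol_id", Type: arrow.PrimitiveTypes.Uint32},
	{Name: "price", Type: arrow.PrimitiveTypes.Int64},
	{Name: "size", Type: arrow.PrimitiveTypes.Uint32},
	{Name: "exchange", Type: arrow.PrimitiveTypes.Uint16},
	{Name: "conditions", Type: arrow.BinaryTypes.Binary},
	{Name: "trade_id", Type: arrow.PrimitiveTypes.Int64},
	{Name: "flags", Type: arrow.PrimitiveTypes.Uint16},
}, nil)

// newParquetWriter applies Zstd level 3 and dictionary encoding.
func newParquetWriter(w io.Writer) (*pqarrow.FileWriter, error) {
	props := parquet.NewWriterProperties(
		parquet.WithCompression(compress.Codecs.Zstd),
		parquet.WithCompressionLevel(3),
		parquet.WithDictionaryDefault(true),
	)
	return pqarrow.NewFileWriter(tickSchema, w, props, pqarrow.DefaultWriterProps())
}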

Atomic Commit Pattern:

  1. Write to tmp/ directory first
  2. fsync the file
  3. Atomic rename to final path
  4. fsync parent directory
  5. Write _SUCCESS marker
  6. Only then delete WAL segment
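
A sketch of steps 1-5 using the Go standard library. Deleting the WAL segment (step 6) is left to the caller, once every Parquet part flushed from that segment has been committed.

package flush

import (
	"os"
	"path/filepath"
)

func commitParquet(tmpPath, finalPath string, data []byte) error {
	// 1. Write the Parquet bytes to the tmp/ location.
	f, err := os.OpenFile(tmpPath, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	// 2. fsync the file contents before it becomes visible at the final path.
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}

	// 3. Atomic rename into the final partition path.
	if err := os.Rename(tmpPath, finalPath); err != nil {
		return err
	}

	// 4. fsync the parent directory so the rename itself is durable.
	dir, err := os.Open(filepath.Dir(finalPath))
	if err != nil {
		return err
	}
	if err := dir.Sync(); err != nil {
		dir.Close()
		return err
	}
	dir.Close()

	// 5. Drop the _SUCCESS marker; its presence means the part is complete.
	return os.WriteFile(filepath.Join(filepath.Dir(finalPath), "_SUCCESS"), nil, 0o644)
}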

Storage Tiers

Tier          Storage           Purpose
WAL           20GB local HDD    Durability, crash recovery
Hot Parquet   250GB Ceph RBD    Recent data, fast access
Cold Archive  Object storage    Historical (future)

Performance

Current Throughput:

  • Messages: 452/sec (trades + quotes + bars)
  • Trades: 134/sec written to WAL
  • Symbols: 847 unique in first minute

Storage Projections:

  • WAL growth: 134 trades/sec x 48 B = ~6.4 KB/sec, or ~23 MB/hour
  • Trading day (6.5 hrs): ~150 MB
  • Parquet with Zstd: 3-5x compression = ~30-50 MB/day
  • 14-day retention: ~500-700 MB

Design Decisions

Why Local HDD for WAL (not Ceph, not tmpfs):

  • tmpfs: Loses data on restart (rejected)
  • Ceph: High latency for fdatasync (rejected)
  • Local HDD: Durable, predictable latency, handles sequential writes fine

Why Fixed-Size Binary Records (not JSON, not Protobuf):

  • JSON: 5-10x larger, parsing overhead
  • Protobuf: Overkill for fixed schema
  • Binary: Compact, fast, simple seeking

Why Backpressure over Data Loss:

  • System slows down rather than dropping data
  • Visible in metrics for alerting
  • Kafka naturally buffers during pause

Crash Recovery

  1. Orphaned tmp files: Cleaned up on startup
  2. Missing _SUCCESS markers: Re-flush from WAL
  3. WAL segment survives: Until all Parquet confirmed
  4. CRC validation: Per-record corruption detection
  5. Recovery: Scan until last good CRC, truncate
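
A sketch of that scan-and-truncate step. The IEEE CRC32 polynomial and the little-endian CRC placement are assumptions consistent with the record layout above.

package wal

import (
	"encoding/binary"
	"hash/crc32"
	"io"
	"os"
)

const recordSize = 48

// recoverSegment truncates a WAL segment at the last record whose CRC
// (over the first 44 bytes) verifies; everything after it is discarded.
func recoverSegment(path string) error {
	f, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer f.Close()

	var good int64 // byte offset just past the last valid record
	buf := make([]byte, recordSize)
	for {
		if _, err := io.ReadFull(f, buf); err != nil {
			break // EOF or a short, torn record: stop scanning
		}
		stored := binary.LittleEndian.Uint32(buf[44:48])
		if crc32.ChecksumIEEE(buf[:44]) != stored {
			break // corruption detected: cut here
		}
		good += recordSize
	}
	return f.Truncate(good)
}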

Deployment

Two systemd services:

  • tick-archive-ingest: Kafka to WAL (port 8080 metrics)
  • tick-archive-flush: WAL to Parquet (port 8081 metrics)

Prometheus metrics: tick_archive_ingest_messages_total, tick_archive_ingest_trades_total, tick_archive_ingest_backpressure_total
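
For reference, roughly how those counters and the :8080 metrics endpoint are wired up with client_golang; only the metric names and port come from the post, the rest is a sketch.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	messagesTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "tick_archive_ingest_messages_total",
		Help: "Kafka messages consumed (trades + quotes + bars).",
	})
	tradesTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "tick_archive_ingest_trades_total",
		Help: "Trade records written to the WAL.",
	})
	backpressureTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "tick_archive_ingest_backpressure_total",
		Help: "Times ingestion paused because the WAL hit its size cap.",
	})
)

func main() {
	// The ingest loop increments the counters above (e.g. messagesTotal.Inc()
	// per Kafka message); here we only expose them for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}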