The Problem
Real-time market data is valuable but ephemeral. For backtesting and ML training, you need historical tick data. Commercial solutions are expensive. I built my own archival system.
Architecture
```
Kafka (market-data) -> Go Ingestor -> WAL (local HDD) -> Flusher -> Parquet (Ceph)
                            |                                            |
                       symbols.dict                              date/hour/bucket/
                       (symbol->id)                             part-NNNNNN.parquet
```
Infrastructure:
- Deployed on tick-archive-lxc (VMID 206, 10.31.11.16)
- 4 vCPU, 4GB RAM
- 20GB local-lvm for WAL (fast sequential writes)
- 250GB Ceph RBD for Parquet (distributed, fault-tolerant)
WAL Implementation
The WAL is designed specifically for HDD-friendly sequential writes:
Buffering Strategy:
- 4MB write buffer accumulates records before each write syscall
- fdatasync every 128MB or 15 seconds (safety cap)
- 256MB segment size before rotation
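Here's a rough sketch of how those thresholds fit together in the append path. Names like `walSegment` and `syncBytes` are illustrative, not the actual code, and a real implementation would also enforce the 15-second cap with a timer rather than only on writes:

```go
package wal

import (
	"bufio"
	"os"
	"time"

	"golang.org/x/sys/unix" // for fdatasync (Linux)
)

const (
	syncBytes    = 128 << 20 // fdatasync after 128MB written since the last sync
	syncInterval = 15 * time.Second
	segmentSize  = 256 << 20 // rotate the segment at 256MB
)

type walSegment struct {
	f            *os.File
	w            *bufio.Writer // 4MB buffer in front of the file
	writtenBytes int64         // total bytes written to this segment
	unsynced     int64         // bytes written since the last fdatasync
	lastSync     time.Time
}

func openSegment(path string) (*walSegment, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &walSegment{f: f, w: bufio.NewWriterSize(f, 4<<20), lastSync: time.Now()}, nil
}

// append writes one fixed-size record and syncs when either threshold is hit.
func (s *walSegment) append(rec []byte) error {
	n, err := s.w.Write(rec)
	if err != nil {
		return err
	}
	s.writtenBytes += int64(n)
	s.unsynced += int64(n)

	if s.unsynced >= syncBytes || time.Since(s.lastSync) >= syncInterval {
		if err := s.w.Flush(); err != nil {
			return err
		}
		if err := unix.Fdatasync(int(s.f.Fd())); err != nil {
			return err
		}
		s.unsynced = 0
		s.lastSync = time.Now()
	}
	return nil
}

// full reports whether this segment should be rotated.
func (s *walSegment) full() bool { return s.writtenBytes >= segmentSize }
```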
Fixed-Size Binary Records (48 bytes):
```
ts_nanos   int64   (8)  - Event timestamp, UTC
symbol_id  uint32  (4)  - From symbols.dict
price      int64   (8)  - Scaled: price * 1e6
size       uint32  (4)  - Trade size
exchange   uint16  (2)  - Exchange enum
cond_len   uint8   (1)  - Condition count (0-15)
cond_data  [15]u8  (15) - Packed condition codes
trade_id   int64   (8)  - Trade ID from feed
flags      uint16  (2)  - Replay/correction flags
crc32      uint32  (4)  - CRC of above 44 bytes
```
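As a sketch, an encoder for this layout might look like the following. Field names follow the table; the exact struct, byte order, and CRC polynomial in the real code may differ:

```go
package wal

import (
	"encoding/binary"
	"hash/crc32"
)

type TradeRecord struct {
	TsNanos  int64
	SymbolID uint32
	Price    int64 // price * 1e6
	Size     uint32
	Exchange uint16
	CondLen  uint8
	CondData [15]byte
	TradeID  int64
	Flags    uint16
}

// Encode packs the record little-endian and appends a CRC32 of this record's bytes.
func (r *TradeRecord) Encode(buf []byte) []byte {
	start := len(buf)
	buf = binary.LittleEndian.AppendUint64(buf, uint64(r.TsNanos))
	buf = binary.LittleEndian.AppendUint32(buf, r.SymbolID)
	buf = binary.LittleEndian.AppendUint64(buf, uint64(r.Price))
	buf = binary.LittleEndian.AppendUint32(buf, r.Size)
	buf = binary.LittleEndian.AppendUint16(buf, r.Exchange)
	buf = append(buf, r.CondLen)
	buf = append(buf, r.CondData[:]...)
	buf = binary.LittleEndian.AppendUint64(buf, uint64(r.TradeID))
	buf = binary.LittleEndian.AppendUint16(buf, r.Flags)
	crc := crc32.ChecksumIEEE(buf[start:]) // CRC over everything before the crc32 field
	return binary.LittleEndian.AppendUint32(buf, crc)
}
```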
Backpressure Mechanism:
- Checks total WAL size before each write
- Returns ErrWALFull if exceeds 10GB
- Ingestor pauses consumption, Kafka backs up naturally
- Never drops data silently
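Roughly, the check looks like this; `ErrWALFull` and the 10GB cap come from the design above, while the type and method names are illustrative:

```go
package wal

import "errors"

const maxWALBytes = 10 << 30 // 10GB cap across all segments

var ErrWALFull = errors.New("wal: size cap reached, apply backpressure")

type WAL struct {
	totalBytes int64 // sum of all segment sizes, updated on write and on segment deletion
}

func (w *WAL) Append(rec []byte) error {
	if w.totalBytes+int64(len(rec)) > maxWALBytes {
		return ErrWALFull // caller pauses Kafka consumption; the backlog stays in Kafka
	}
	// ... buffered write as in the earlier sketch ...
	w.totalBytes += int64(len(rec))
	return nil
}
```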
Symbol Dictionary:
- Maps symbol strings to uint32 IDs (4 bytes vs variable string)
- Append-only, persisted to symbols.dict
- fsync before updating in-memory map
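A sketch of the lookup path, assuming a simple line-oriented on-disk format for symbols.dict (the real format may differ); the key point is that the entry is persisted and fsynced before the in-memory map hands out the ID:

```go
package wal

import (
	"fmt"
	"os"
	"sync"
)

type SymbolDict struct {
	mu   sync.Mutex
	f    *os.File          // symbols.dict, opened with O_APPEND
	ids  map[string]uint32 // symbol -> id
	next uint32
}

// ID returns the uint32 ID for a symbol, assigning and persisting a new one if needed.
func (d *SymbolDict) ID(symbol string) (uint32, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if id, ok := d.ids[symbol]; ok {
		return id, nil
	}
	id := d.next
	// Persist first: one "id symbol" line per entry (illustrative format).
	if _, err := fmt.Fprintf(d.f, "%d %s\n", id, symbol); err != nil {
		return 0, err
	}
	if err := d.f.Sync(); err != nil { // fsync before exposing the mapping
		return 0, err
	}
	d.ids[symbol] = id
	d.next++
	return id, nil
}
```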
Parquet Conversion
Partitioning Strategy:
date=YYYY-MM-DD/hour=HH/bucket=NNN/part-NNNNNN.parquet
- date: Natural retention boundary
- hour: Limits files per directory, enables parallel replay
- bucket: xxhash32(symbol_id) & 255 distributes symbols evenly
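A sketch of the routing logic. The real code hashes with xxhash32; this version substitutes the 64-bit github.com/cespare/xxhash/v2 hash truncated to 32 bits, so bucket values won't match the real layout exactly:

```go
package flush

import (
	"encoding/binary"
	"fmt"
	"time"

	"github.com/cespare/xxhash/v2"
)

// bucket maps a symbol ID to one of 256 buckets.
func bucket(symbolID uint32) uint32 {
	var b [4]byte
	binary.LittleEndian.PutUint32(b[:], symbolID)
	return uint32(xxhash.Sum64(b[:])) & 255
}

// partitionPath builds date=YYYY-MM-DD/hour=HH/bucket=NNN/part-NNNNNN.parquet.
func partitionPath(ts time.Time, symbolID uint32, part int) string {
	ts = ts.UTC()
	return fmt.Sprintf("date=%s/hour=%02d/bucket=%03d/part-%06d.parquet",
		ts.Format("2006-01-02"), ts.Hour(), bucket(symbolID), part)
}
```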
Arrow/Parquet Schema:
- Uses Apache Arrow Go library
- Zstd compression (level 3) for 3-5x compression
- Dictionary encoding enabled
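A sketch of the writer setup, assuming the Arrow Go v15 module layout and its writer-property options; the schema here is inferred from the WAL record fields and the real one likely carries more columns (e.g. conditions):

```go
package flush

import (
	"io"

	"github.com/apache/arrow/go/v15/arrow"
	"github.com/apache/arrow/go/v15/parquet"
	"github.com/apache/arrow/go/v15/parquet/compress"
	"github.com/apache/arrow/go/v15/parquet/pqarrow"
)

var tradeSchema = arrow.NewSchema([]arrow.Field{
	{Name: "ts_nanos", Type: arrow.PrimitiveTypes.Int64},
	{Name: "symbol_id", Type: arrow.PrimitiveTypes.Uint32},
	{Name: "price", Type: arrow.PrimitiveTypes.Int64},
	{Name: "size", Type: arrow.PrimitiveTypes.Uint32},
	{Name: "exchange", Type: arrow.PrimitiveTypes.Uint16},
	{Name: "flags", Type: arrow.PrimitiveTypes.Uint16},
}, nil)

// newParquetWriter returns an Arrow-to-Parquet writer with Zstd level 3 and
// dictionary encoding enabled by default.
func newParquetWriter(w io.Writer) (*pqarrow.FileWriter, error) {
	props := parquet.NewWriterProperties(
		parquet.WithCompression(compress.Codecs.Zstd),
		parquet.WithCompressionLevel(3),
		parquet.WithDictionaryDefault(true),
	)
	return pqarrow.NewFileWriter(tradeSchema, w, props, pqarrow.DefaultWriterProps())
}
```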
Atomic Commit Pattern:
- Write to tmp/ directory first
- fsync the file
- Atomic rename to final path
- fsync parent directory
- Write _SUCCESS marker
- Only then delete WAL segment
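As a sketch, with illustrative paths and helper names (in practice the flusher would sync the writer's own fd before close; re-opening here just keeps the example self-contained):

```go
package flush

import (
	"os"
	"path/filepath"
)

// commit moves tmpPath into finalPath so readers only ever see complete files.
// The WAL segment is deleted only after this returns successfully.
func commit(tmpPath, finalPath string) error {
	// 1. fsync the fully written temp file.
	f, err := os.Open(tmpPath)
	if err != nil {
		return err
	}
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	f.Close()

	// 2. Atomic rename into the partition directory.
	if err := os.Rename(tmpPath, finalPath); err != nil {
		return err
	}

	// 3. fsync the parent directory so the rename itself is durable.
	dir, err := os.Open(filepath.Dir(finalPath))
	if err != nil {
		return err
	}
	if err := dir.Sync(); err != nil {
		dir.Close()
		return err
	}
	dir.Close()

	// 4. Drop the _SUCCESS marker.
	return os.WriteFile(filepath.Join(filepath.Dir(finalPath), "_SUCCESS"), nil, 0o644)
}
```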
Storage Tiers
| Tier | Storage | Purpose |
|---|---|---|
| WAL | 20GB local disk (local-lvm) | Durability, crash recovery |
| Hot Parquet | 250GB Ceph RBD | Recent data, fast access |
| Cold Archive | Object storage | Historical (future) |
Performance
Current Throughput:
- Messages: 452/sec (trades + quotes + bars)
- Trades: 134/sec written to WAL
- Symbols: 847 unique in first minute
Storage Projections:
- WAL growth: 6.4 KB/sec = 23 MB/hour
- Trading day (6.5 hrs): ~150 MB
- Parquet with Zstd: 3-5x compression = ~30-50 MB/day
- 14-day retention: ~500-700 MB
Design Decisions
Why Local HDD for WAL (not Ceph, not tmpfs):
- tmpfs: Loses data on restart (rejected)
- Ceph: High latency for fdatasync (rejected)
- Local HDD: Durable, predictable latency, handles sequential writes fine
Why Fixed-Size Binary Records (not JSON, not Protobuf):
- JSON: 5-10x larger, parsing overhead
- Protobuf: Overkill for fixed schema
- Binary: Compact, fast, simple seeking
Why Backpressure over Data Loss:
- System slows down rather than dropping data
- Visible in metrics for alerting
- Kafka naturally buffers during pause
Crash Recovery
- Orphaned tmp files: Cleaned up on startup
- Missing _SUCCESS markers: Re-flush from WAL
- WAL segment survives: Until all Parquet confirmed
- CRC validation: Per-record corruption detection
- Recovery: Scan until last good CRC, truncate
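A sketch of the scan-and-truncate recovery, using the stated 48-byte record size; record layout and CRC placement follow the table earlier, and the function name is illustrative:

```go
package wal

import (
	"encoding/binary"
	"hash/crc32"
	"io"
	"os"
)

const recordSize = 48 // full record including the trailing crc32

// recoverSegment truncates the segment after the last record with a valid CRC.
func recoverSegment(path string) error {
	f, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer f.Close()

	buf := make([]byte, recordSize)
	var good int64
	for {
		if _, err := io.ReadFull(f, buf); err != nil {
			break // EOF or a torn tail record
		}
		stored := binary.LittleEndian.Uint32(buf[recordSize-4:])
		if crc32.ChecksumIEEE(buf[:recordSize-4]) != stored {
			break // corruption: stop at the last good record
		}
		good += recordSize
	}
	return f.Truncate(good)
}
```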
Deployment
Two systemd services:
- tick-archive-ingest: Kafka to WAL (port 8080 metrics)
- tick-archive-flush: WAL to Parquet (port 8081 metrics)
Prometheus metrics: tick_archive_ingest_messages_total, tick_archive_ingest_trades_total, tick_archive_ingest_backpressure_total
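For reference, this is roughly how those counters would be registered with the standard Prometheus Go client (github.com/prometheus/client_golang); the actual registration code isn't shown here:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	messagesTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "tick_archive_ingest_messages_total",
		Help: "Kafka messages consumed (trades + quotes + bars).",
	})
	tradesTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "tick_archive_ingest_trades_total",
		Help: "Trade records written to the WAL.",
	})
	backpressureTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "tick_archive_ingest_backpressure_total",
		Help: "Times ingestion paused because the WAL hit its size cap.",
	})
)

// serveMetrics exposes /metrics; the ingest service uses :8080, the flusher :8081.
func serveMetrics(addr string) {
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(addr, nil))
}
```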