SEC EDGAR Pipeline: Trading Signals from Regulatory Filings

The Opportunity

SEC filings are public, yet retail traders underuse them while institutional investors monitor them obsessively. I built a pipeline that polls SEC EDGAR for new filings in near real time, parses them for signal-worthy events, and delivers tradeable alerts before the news cycle catches up.

Filing Types That Matter

Form | What It Means | Signal Priority
Form 4 | Insider buys/sells | High for CEO/CFO buys, cluster buys
8-K | Material events (earnings, M&A, leadership) | High for earnings beat/miss
13D/G | Activist positions (5%+ ownership) | High for activist language
13F | Institutional holdings (quarterly) | High for notable filers
S-3, 424B, S-1 | Shelf registrations, prospectus, IPO | Float/dilution tracking

Architecture

Ingestion Layer:
- Airflow DAGs for historical backfill (2023-present)
- RSS poller (60-second cycle) for real-time new filings

Processing Layer:
- Kafka topic: sec-filings (raw filing notifications)
- Parser workers with form-specific parsers (XML/HTML extraction)
- Signal generator with configurable thresholds

Storage Layer:
- SQL Server (TradingDB) with 9 normalized tables
- sec_companies, sec_filings_raw, sec_insider_tx, sec_8k_events
- sec_13f_holdings, sec_beneficial_owners, sec_signals
- sec_cusip_mapping, sec_float_data, sec_lockup_calendar

Signal Layer:
- Kafka topic: sec-signals (actionable alerts)
- Redis bridge for dashboard consumption
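
The polling step of the ingestion layer can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `parse_feed`, `new_filings`, and the sample feed are hypothetical, and the sample only mimics a generic Atom document rather than reproducing EDGAR's exact feed layout. The point is the shape of one cycle: parse the feed, keep only entries not seen before, and hand those to the `sec-filings` producer (not shown).

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Illustrative feed shaped like a generic Atom document; not real filing data.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>4 - Example Corp (Issuer)</title>
    <id>urn:example:filing-0001</id>
    <updated>2024-01-02T16:05:00-05:00</updated>
    <category term="4"/>
    <link href="https://example.invalid/filings/0001"/>
  </entry>
  <entry>
    <title>8-K - Sample Inc (Filer)</title>
    <id>urn:example:filing-0002</id>
    <updated>2024-01-02T16:06:00-05:00</updated>
    <category term="8-K"/>
    <link href="https://example.invalid/filings/0002"/>
  </entry>
</feed>"""


def parse_feed(xml_text):
    """Turn an Atom feed into a list of filing-notification dicts."""
    root = ET.fromstring(xml_text)
    filings = []
    for entry in root.findall("atom:entry", ATOM_NS):
        filings.append({
            "id": entry.findtext("atom:id", namespaces=ATOM_NS),
            "form": entry.find("atom:category", ATOM_NS).get("term"),
            "title": entry.findtext("atom:title", namespaces=ATOM_NS),
            "url": entry.find("atom:link", ATOM_NS).get("href"),
            "updated": entry.findtext("atom:updated", namespaces=ATOM_NS),
        })
    return filings


def new_filings(filings, seen_ids):
    """Return filings not seen in earlier cycles and remember their ids."""
    fresh = [f for f in filings if f["id"] not in seen_ids]
    seen_ids.update(f["id"] for f in fresh)
    return fresh
```

Because consecutive 60-second polls mostly return the same entries, the `seen_ids` set is what keeps duplicates off the topic. In a long-running service it would need to be bounded or persisted.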

Signal Generation Logic

Form 4 Signals

  • INSIDER_BUY – Any purchase (medium priority)
  • CEO_CFO_BUY – C-suite purchase (high priority)
  • INSIDER_CLUSTER_BUY – 2+ insiders buy within 7 days (high)
  • INSIDER_LARGE_BUY – Purchase greater than $100k (high)
  • DIRECTOR_BUY – Purchase by a non-officer director (medium)
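
The Form 4 rules above can be sketched as a single function. The `InsiderBuy` record, its field names, and the role strings are assumptions made for illustration. The thresholds (2+ insiders within 7 days, more than $100k) come from the list.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class InsiderBuy:
    # Hypothetical record; the real sec_insider_tx columns are not shown here.
    ticker: str
    insider: str
    role: str          # assumed values: "CEO", "CFO", "Director", "Officer"
    value_usd: float
    tx_date: date


def form4_signals(buy, recent_buys, window_days=7, large_usd=100_000):
    """Return the signal names triggered by one insider purchase."""
    signals = ["INSIDER_BUY"]
    if buy.role in {"CEO", "CFO"}:
        signals.append("CEO_CFO_BUY")
    if buy.role == "Director":
        signals.append("DIRECTOR_BUY")
    if buy.value_usd > large_usd:
        signals.append("INSIDER_LARGE_BUY")

    # Cluster: count distinct insiders buying the same ticker in the window.
    window_start = buy.tx_date - timedelta(days=window_days)
    insiders = {
        b.insider
        for b in recent_buys
        if b.ticker == buy.ticker and window_start <= b.tx_date <= buy.tx_date
    }
    insiders.add(buy.insider)
    if len(insiders) >= 2:
        signals.append("INSIDER_CLUSTER_BUY")
    return signals
```

Counting distinct insiders rather than transactions means one person splitting a purchase across several days does not register as a cluster.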

Form 8-K Signals

  • 8K_EARNINGS – Item 2.02 filed (high)
  • 8K_EARNINGS_BEAT – Beat keywords detected (high)
  • 8K_EARNINGS_MISS – Miss keywords detected (medium)
  • The parser also extracts EPS, revenue, and quarter from the filing text via regex patterns
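
A rough sketch of the 8-K handling, assuming the filing's item numbers and plain text are already available. The regex patterns and keyword lists here are illustrative guesses, not the ones the pipeline uses. Real press releases vary far more in wording than these patterns cover.

```python
import re

# Hypothetical patterns; production text needs many more variants.
EPS_RE = re.compile(r"(?:EPS|earnings per share)\s+(?:of|was)\s+\$(-?\d+\.\d+)", re.I)
REVENUE_RE = re.compile(r"revenue\s+(?:of|was)\s+\$(\d+(?:\.\d+)?)\s+(million|billion)", re.I)
QUARTER_RE = re.compile(r"\b(first|second|third|fourth)\s+quarter\s+(?:of\s+)?(\d{4})", re.I)

QUARTER_NUM = {"first": 1, "second": 2, "third": 3, "fourth": 4}
SCALE = {"million": 1e6, "billion": 1e9}

# Hypothetical keyword lists for beat/miss detection.
BEAT_WORDS = ("exceeded", "above guidance", "surpassed")
MISS_WORDS = ("fell short", "below guidance", "shortfall")


def parse_8k(items, text):
    """Return signals and extracted figures, or None if Item 2.02 is absent."""
    if "2.02" not in items:
        return None
    lower = text.lower()
    signals = ["8K_EARNINGS"]
    if any(word in lower for word in BEAT_WORDS):
        signals.append("8K_EARNINGS_BEAT")
    if any(word in lower for word in MISS_WORDS):
        signals.append("8K_EARNINGS_MISS")

    eps = EPS_RE.search(text)
    rev = REVENUE_RE.search(text)
    qtr = QUARTER_RE.search(text)
    return {
        "signals": signals,
        "eps": float(eps.group(1)) if eps else None,
        "revenue_usd": float(rev.group(1)) * SCALE[rev.group(2).lower()] if rev else None,
        "quarter": f"Q{QUARTER_NUM[qtr.group(1).lower()]} {qtr.group(2)}" if qtr else None,
    }
```

Fields that fail to match come back as `None` instead of raising, so a filing with unusual wording still produces the base `8K_EARNINGS` signal.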

Form 13D/G Signals

  • 13D_NEW_POSITION – New 5%+ holder (high)
  • 13D_ACTIVIST – Activist language in purpose (high)
  • 13G_LARGE_PASSIVE – Greater than 10% passive position (low)
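
As a sketch of how these three rules might combine, assuming the parser has already pulled out the form type, the ownership percentage, and the purpose text: the function name, its parameters, and the activist phrase list are all hypothetical.

```python
# Hypothetical phrases that suggest activist intent in the purpose section.
ACTIVIST_PHRASES = (
    "board representation",
    "strategic alternatives",
    "nominate directors",
)


def signals_13dg(form_type, percent_owned, purpose_text="", is_new_holder=False):
    """Return signal names for one 13D or 13G filing."""
    signals = []
    if "13D" in form_type:
        if is_new_holder and percent_owned >= 5.0:
            signals.append("13D_NEW_POSITION")
        purpose = purpose_text.lower()
        if any(phrase in purpose for phrase in ACTIVIST_PHRASES):
            signals.append("13D_ACTIVIST")
    elif "13G" in form_type and percent_owned > 10.0:
        signals.append("13G_LARGE_PASSIVE")
    return signals
```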

Form 13F Signals

  • 13F_NEW_POSITION – Institution initiates new position
  • 13F_ACCUMULATION – Greater than 20% share increase from prior quarter
  • 13F_REDUCTION – Greater than 20% decrease (not exit)
  • 13F_EXIT – Complete position exit

Notable filers (Berkshire Hathaway, Pershing Square, Renaissance Technologies, Bridgewater, Point72) get elevated priority and lower thresholds ($1M minimum versus $10M for other filers).
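
The quarter-over-quarter comparison can be sketched like this. It makes two assumptions: the $1M/$10M minimum is read as a minimum position value, and that value is taken as the larger of the prior and current quarter so that exits are not filtered out. The function and parameter names are hypothetical.

```python
NOTABLE_FILERS = {
    "Berkshire Hathaway",
    "Pershing Square",
    "Renaissance Technologies",
    "Bridgewater",
    "Point72",
}


def classify_13f(filer, prev_shares, curr_shares, prev_value_usd, curr_value_usd,
                 change_threshold=0.20):
    """Compare two quarters of one holding; return a signal name or None."""
    # Assumption: the minimum applies to position value in either quarter.
    min_value = 1_000_000 if filer in NOTABLE_FILERS else 10_000_000
    if max(prev_value_usd, curr_value_usd) < min_value:
        return None

    if prev_shares == 0:
        return "13F_NEW_POSITION" if curr_shares > 0 else None
    if curr_shares == 0:
        return "13F_EXIT"

    change = (curr_shares - prev_shares) / prev_shares
    if change > change_threshold:
        return "13F_ACCUMULATION"
    if change < -change_threshold:
        return "13F_REDUCTION"
    return None
```

For example, a new $2M position would produce `13F_NEW_POSITION` for a notable filer but nothing for any other institution.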

Why Insider Buys Matter

Insiders sell for many reasons (diversification, taxes, life events), but they buy for only one: they think the stock is going up. Cluster buying, where multiple insiders buy within a short period, is especially bullish.

SEC Rate Limiting

The SEC publishes clear guidelines: at most 10 requests per second, with a User-Agent header that identifies the requester. The pipeline respects both and applies exponential backoff on 503 errors, since getting blocked would defeat the purpose.
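
A minimal sketch of the throttle-plus-backoff idea, written against an injected `fetch` callable so it runs without network access. `Throttle`, `fetch_with_backoff`, the retry count, the base delay, and the User-Agent string are illustrative assumptions. In practice `fetch` would wrap an HTTP client call that sends the header.

```python
import time

# Placeholder value; a real header should name the operator and a contact address.
HEADERS = {"User-Agent": "ExamplePipeline admin@example.com"}


class Throttle:
    """Space calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second=10, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        now = self.clock()
        delay = self._next_allowed - now
        if delay > 0:
            self.sleep(delay)
        self._next_allowed = max(now, self._next_allowed) + self.min_interval


def fetch_with_backoff(fetch, url, throttle, max_retries=5, base_delay=1.0,
                       sleep=time.sleep):
    """Call fetch(url) -> (status, body), retrying 503s with doubling delays."""
    for attempt in range(max_retries + 1):
        throttle.wait()
        status, body = fetch(url)
        if status != 503:
            return status, body
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)   # 1s, 2s, 4s, ...
    raise RuntimeError(f"still receiving 503 after {max_retries} retries: {url}")
```

Keeping the throttle separate from the retry loop means every request, including retries, counts against the 10 per second budget.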

Float Tracking

  • sec_float_data – Base float from Alpaca + adjusted float
  • sec_float_events – Float-changing events (offerings, lockup expirations)
  • sec_lockup_calendar – IPO lockup expiration tracking
  • Signals: FLOAT_OFFERING_PRICED, FLOAT_LOCKUP_1D, FLOAT_SHELF_FILED
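
Two small helpers illustrate how these tables might fit together. Both rest on assumptions: that the adjusted float is the base float plus the net share change from float events, and that `FLOAT_LOCKUP_1D` fires one calendar day before a lockup expires. Neither detail is spelled out above, so treat this as a sketch.

```python
from datetime import date


def adjusted_float(base_float, share_deltas):
    """Base float plus the signed share changes from float events (assumed model)."""
    return base_float + sum(share_deltas)


def lockup_signal(expiry, today):
    """Emit FLOAT_LOCKUP_1D when the lockup expires tomorrow (assumed meaning)."""
    days_left = (expiry - today).days
    return "FLOAT_LOCKUP_1D" if days_left == 1 else None
```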

Infrastructure

Service | Location | Purpose
Airflow DAGs | af01:/opt/airflow_env/dags/sec/ | Historical backfill
sec-realtime | pve04 LXC | RSS polling, Kafka producer
sec-parser | pve04 LXC | Kafka consumer, signal generation
SQL Server | sql03.ad.techsnet.net | TradingDB storage
Kafka | 10.31.11.10 | Message broker
Redis | 10.31.13.10 | Dashboard buffer

Code Statistics

  • 6 Airflow DAGs: Form 4, 8-K, 13D/G, 13F, 13F reverse backfill, float refresh
  • 6 Parsers: form4.py, form8k.py, form13dg.py, form13f.py, forms1.py, forms3_424b.py
  • 9 SQL Migrations: Normalized schema for all filing types
  • Signal Generator: ~400 lines covering all form types

Key Design Decision: Reverse Backfill

The 13F backfill runs newest-to-oldest to prioritize recent data. Because 13F filings are quarterly, older data is less actionable, so the current quarter gets processed first.
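
The ordering itself is simple to express. The generator below is a hypothetical sketch of how a backfill task could enumerate quarters from the most recent back to the start of the backfill window. The function name is illustrative and not taken from the DAG code.

```python
from datetime import date


def quarters_newest_first(start, end):
    """Yield (year, quarter) pairs from end's quarter back to start's quarter."""
    year, quarter = end.year, (end.month - 1) // 3 + 1
    oldest = (start.year, (start.month - 1) // 3 + 1)
    while (year, quarter) >= oldest:
        yield year, quarter
        quarter -= 1
        if quarter == 0:
            year, quarter = year - 1, 4
```

With a window starting in January 2023, the first quarter yielded is the one containing the end date, so an interrupted backfill still leaves the most useful data loaded.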