Streaming Large Datasets

Added in v0.7. DeepDiff DB uses keyset-paginated batch hashing to keep memory usage flat regardless of table size, and a bounded goroutine pool for parallel table processing.

How Keyset Pagination Works

Before v0.7, DeepDiff DB loaded an entire table into memory before computing hashes. For tables with millions of rows this could use 700–900 MB of heap per table.

Starting from v0.7, each table is read in pages using a keyset (cursor) query:

-- First page
SELECT id, col1, col2 FROM orders
ORDER BY id
LIMIT 10000;

-- Subsequent pages
SELECT id, col1, col2 FROM orders
WHERE id > :lastSeenId
ORDER BY id
LIMIT 10000;

Each page is hashed and discarded before the next page is fetched. Peak heap usage is proportional to the batch size — approximately 1–2 MB per 10,000 rows for typical schemas — so a 50 million row table uses the same ~15 MB peak as a 50,000 row table.

The cursor field is the table's primary key (or the first column of a composite primary key).
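The page-hash-discard loop described above can be sketched in Go. This is a self-contained toy, not DeepDiff DB's internals: the `row` type, the `fetchPage` helper (which simulates the keyset query against an in-memory slice), and the choice of FNV-1a as the hash are all illustrative assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// row is a simplified record standing in for a real table row.
type row struct {
	ID   int64
	Col1 string
}

// fetchPage simulates the keyset query
// `SELECT id, col1 FROM t WHERE id > lastID ORDER BY id LIMIT limit`
// against an in-memory table already sorted by ID.
func fetchPage(table []row, lastID int64, limit int) []row {
	var page []row
	for _, r := range table {
		if r.ID > lastID {
			page = append(page, r)
			if len(page) == limit {
				break
			}
		}
	}
	return page
}

// hashTable walks the table one page at a time, folding each page into
// a running FNV-1a digest. Only one page is resident at once, so peak
// memory tracks the batch size rather than the table size.
func hashTable(table []row, batchSize int) uint64 {
	h := fnv.New64a()
	lastID := int64(0)
	for {
		page := fetchPage(table, lastID, batchSize)
		if len(page) == 0 {
			return h.Sum64()
		}
		for _, r := range page {
			fmt.Fprintf(h, "%d|%s\n", r.ID, r.Col1) // hash, then discard
			lastID = r.ID                           // advance the cursor
		}
	}
}

func main() {
	table := []row{{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}, {5, "e"}}
	// The digest is independent of batch size: paging changes memory
	// usage, not the result.
	fmt.Println(hashTable(table, 2) == hashTable(table, 5)) // prints true
}
```

Because rows are visited in the same cursor order regardless of page size, the resulting digest does not depend on the batch size — only the memory profile does.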

Configuring Batch Size

Via configuration file

performance:
  hash_batch_size: 10000    # rows per page; 0 = full-table scan (pre-v0.7 behaviour)
  max_parallel_tables: 2    # tables hashed concurrently

Via CLI flags (overrides config)

deepdiffdb diff --config deepdiffdb.config.yaml --batch-size 5000 --parallel 4
deepdiffdb gen-pack --config deepdiffdb.config.yaml --batch-size 5000 --parallel 4

Setting --batch-size 0 disables pagination entirely and reverts to pre-v0.7 full-table-scan behaviour. Use this only if you have confirmed that available memory is sufficient for the largest table.

Parallel Table Hashing

The --parallel flag (or performance.max_parallel_tables in config) controls how many tables are hashed concurrently. DeepDiff DB uses a bounded goroutine pool backed by errgroup and semaphore.NewWeighted.

Example — 4 tables in parallel:

deepdiffdb gen-pack --config deepdiffdb.config.yaml --batch-size 10000 --parallel 4

Each parallel table holds its own connections to both databases, so the effective connection count is parallel * 2 (one connection per database per table). Make sure your database's max_connections limit accommodates this.
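The bounded-pool pattern can be sketched with standard-library primitives alone — here a buffered channel stands in for `semaphore.NewWeighted`, and `hashOneTable` is a hypothetical stand-in for the per-table hashing work; neither is DeepDiff DB's actual code.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// hashOneTable stands in for the real per-table hashing work.
func hashOneTable(name string) string {
	return "hashed:" + name
}

// hashTables runs at most `parallel` table hashes concurrently, using a
// buffered channel as a counting semaphore to bound the goroutine pool.
func hashTables(tables []string, parallel int) []string {
	sem := make(chan struct{}, parallel)
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results []string
	)
	for _, t := range tables {
		wg.Add(1)
		sem <- struct{}{} // blocks once `parallel` tables are in flight
		go func(t string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			r := hashOneTable(t)
			mu.Lock()
			results = append(results, r)
			mu.Unlock()
		}(t)
	}
	wg.Wait()
	sort.Strings(results) // completion order is nondeterministic
	return results
}

func main() {
	fmt.Println(hashTables([]string{"orders", "users", "items"}, 2))
	// prints [hashed:items hashed:orders hashed:users]
}
```

The errgroup/semaphore combination mentioned above adds error propagation and cancellation on top of this same bounding idea.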

Memory Characteristics

| Scenario | Peak heap |
| --- | --- |
| Pre-v0.7 (no batching) | ~700–900 MB for a 5M row table |
| v0.7+ default (batch=10000, parallel=1) | ~15–20 MB |
| batch=5000, parallel=4 | ~30–40 MB total across 4 tables |

Memory figures are approximate and depend on row width.

When to Tune These Settings

| Situation | Recommendation |
| --- | --- |
| Tables with > 500k rows | Use --batch-size 10000 (default) |
| Very wide rows (many large text columns) | Reduce --batch-size to 2000–5000 |
| Fast database with many CPU cores | Increase --parallel to 2–4 |
| Slow network or rate-limited database | Reduce --parallel to 1 |
| Available heap is under 512 MB | Keep --batch-size at 5000 or less |
| Debugging or reproducing pre-v0.7 behaviour | Set --batch-size 0 |

Debug Telemetry

At --log-level debug, DeepDiff DB emits per-batch memory telemetry:

{"level":"debug","table":"orders","batch":42,"total_rows_hashed":420000,"alloc_mb":14.2}

This is useful for observing memory behaviour in production-like environments before committing to a batch size configuration.