Performance Tuning
Build Phase
The build phase is the most resource-intensive part of create-db and update-db --add-samples. It aggregates per-sample genotype data into Roaring Bitmap Parquet files.
Key Options
| Option | Default | Effect |
|---|---|---|
| --build-threads | all CPUs | Number of parallel DuckDB workers |
| --build-memory | 2GB | DuckDB memory limit per worker |
Memory Sizing
Each worker processes one 1-Mbp bucket independently. Memory usage depends on cohort size and variant density:
| Cohort size | Typical peak per worker |
|---|---|
| 1K samples | ~200 MB |
| 10K samples | ~500 MB |
| 50K samples | ~1–2 GB |
Set --build-memory to the expected peak plus 20% headroom. For WGS with dense variant regions, use 4GB or higher.
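The sizing rule above is plain arithmetic. As a minimal sketch, assuming peaks are expressed in whole megabytes (the helper name `recommended_build_memory_mb` is hypothetical and not part of AFQuery):

```python
def recommended_build_memory_mb(expected_peak_mb: int, headroom_percent: int = 20) -> int:
    """Return the expected peak plus headroom, rounded up to a whole megabyte."""
    if expected_peak_mb <= 0:
        raise ValueError("expected_peak_mb must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-expected_peak_mb * (100 + headroom_percent) // 100)


# Typical per-worker peaks from the table above, in MB
# (the upper bound of the 1-2 GB range is used for 50K samples).
for cohort, peak_mb in [("1K", 200), ("10K", 500), ("50K", 2000)]:
    print(f"{cohort} samples: --build-memory of about {recommended_build_memory_mb(peak_mb)} MB")
```

For the 50K-sample row this gives 2400 MB, which rounds naturally to the 4GB setting suggested for dense WGS regions once extra margin is included.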
Thread Scaling
All 1-Mbp buckets across all chromosomes are discovered upfront and distributed to workers. With 52 cores and 2,500 buckets, all cores can run concurrently:
afquery create-db \
--manifest manifest.tsv \
--output-dir ./db/ \
--genome-build GRCh38 \
--build-threads 52 \
--build-memory 4GB
Worker capping
The number of workers is capped at min(--build-threads, n_buckets): setting more threads than there are buckets gives no extra parallelism, because the surplus workers have nothing to process. Use afquery info --db ./db/ to check the number of buckets after the build.
Expected scaling (50K samples, 2,500 buckets, GRCh38):
| Cores | Build time | Speedup vs. 1 core |
|---|---|---|
| 1 | ~8 hours | 1× |
| 4 | ~2 hours | ~4× |
| 8 | ~1 hour | ~7× |
| 16 | ~30 min | ~14× |
| 32 | ~18 min | ~24× |
| 52 | ~13 min | ~38× |
Scaling is near-linear up to ~32 cores; beyond that, I/O and SQLite contention limit further gains.
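To make the drop-off concrete, the speedups in the table can be converted to parallel efficiency (speedup divided by core count). This is a small illustrative calculation on the approximate figures above, not a measurement:

```python
# (cores, speedup) pairs copied from the scaling table above.
SCALING = [(1, 1), (4, 4), (8, 7), (16, 14), (32, 24), (52, 38)]


def parallel_efficiency(cores: int, speedup: float) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores


for cores, speedup in SCALING:
    print(f"{cores:>2} cores: {parallel_efficiency(cores, speedup):.0%} efficiency")
```

On these numbers, efficiency stays near 88% through 16 cores, then falls to 75% at 32 cores and about 73% at 52 cores.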
Total RAM required: build_threads × build_memory
# 16 threads × 2GB = 32 GB peak
afquery create-db ... --build-threads 16 --build-memory 2GB
# 8 threads × 4GB = 32 GB peak (better for dense WGS)
afquery create-db ... --build-threads 8 --build-memory 4GB
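The same trade-off can be worked backwards from a RAM budget. The sketch below is a hypothetical helper (not an AFQuery function) that combines the threads × memory rule with the worker cap of min(--build-threads, n_buckets), using whole gigabytes for simplicity:

```python
def peak_ram_gb(build_threads: int, build_memory_gb: int) -> int:
    """Peak RAM for the build phase: build_threads x build_memory."""
    return build_threads * build_memory_gb


def max_build_threads(ram_budget_gb: int, build_memory_gb: int,
                      n_cpus: int, n_buckets: int) -> int:
    """Largest thread count that fits the RAM budget, the CPU count, and the bucket count."""
    if build_memory_gb <= 0:
        raise ValueError("build_memory_gb must be positive")
    by_ram = ram_budget_gb // build_memory_gb
    if by_ram < 1:
        raise ValueError("RAM budget is smaller than one worker's memory limit")
    return min(by_ram, n_cpus, n_buckets)


print(peak_ram_gb(16, 2))                    # 32
print(max_build_threads(32, 4, 52, 2500))    # 8
```

With a 32 GB budget, this reproduces the two configurations shown above: 16 threads at 2GB or 8 threads at 4GB.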
Ingest Phase
The --threads option (distinct from --build-threads) controls VCF parsing parallelism. It uses ProcessPoolExecutor with one process per sample. Set it to the number of cores you can dedicate to I/O-bound parsing.
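The per-sample process pattern can be sketched generically as follows. This is not AFQuery's ingest code: `count_records` and `ingest` are hypothetical names, and the per-sample task here only counts records in an uncompressed VCF as a stand-in for real parsing.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def count_records(vcf_path: str) -> tuple[str, int]:
    """Hypothetical per-sample task: count non-header lines in an uncompressed VCF."""
    with open(vcf_path, "rt") as handle:
        n_records = sum(
            1 for line in handle if line.strip() and not line.startswith("#")
        )
    return Path(vcf_path).name, n_records


def ingest(vcf_paths: list[str], threads: int) -> dict[str, int]:
    """Submit one task per sample file, using at most `threads` worker processes."""
    with ProcessPoolExecutor(max_workers=threads) as pool:
        return dict(pool.map(count_records, vcf_paths))
```

When calling `ingest` from a script, do so under an `if __name__ == "__main__":` guard so that worker processes can import the module safely on platforms that spawn rather than fork.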
Query Phase
Query Execution Path
graph TD
A["Query Request<br/>chr1:925952"]
B["Open DuckDB<br/>connection"]
C["Locate Parquet<br/>bucket_0/data.parquet"]
D["Read rows<br/>matching pos"]
E["Deserialize<br/>bitmaps"]
F["Bitmap AND<br/>with eligible<br/>sample set"]
G["Compute<br/>AC/AN/AF"]
H["Result<br/>~10-100 ms"]
A --> B
B --> C
C --> D
D --> E
E --> F
F --> G
G --> H
style A fill:#e3f2fd
style B fill:#fff3e0
style C fill:#e8f5e9
style D fill:#e8f5e9
style E fill:#f3e5f5
style F fill:#f3e5f5
style G fill:#fff3e0
style H fill:#c8e6c9
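The last two steps of the path (bitmap AND, then AC/AN/AF) can be illustrated with plain Python sets standing in for Roaring Bitmaps. This is a simplified sketch under loud assumptions: diploid genotypes, no missing calls, and separate heterozygous and homozygous carrier sets. AFQuery's actual storage layout and AN calculation may differ.

```python
def allele_frequency(het: set[int], hom: set[int],
                     eligible: set[int]) -> tuple[int, int, float]:
    """Return (AC, AN, AF) for one variant, restricted to the eligible samples.

    Assumption: every eligible sample is diploid and fully called, so it
    contributes two alleles to AN; het carriers add 1 to AC, hom carriers add 2.
    """
    het_eligible = het & eligible   # "bitmap AND" with the eligible sample set
    hom_eligible = hom & eligible
    ac = len(het_eligible) + 2 * len(hom_eligible)
    an = 2 * len(eligible)
    af = ac / an if an else 0.0
    return ac, an, af


eligible = set(range(10))   # sample IDs 0..9 pass the filter
het = {1, 4, 12}            # sample 12 is a carrier but not eligible
hom = {7}
print(allele_frequency(het, hom, eligible))   # (4, 20, 0.2)
```

Restricting carriers to the eligible set before counting is what lets the same stored bitmaps answer queries for arbitrary sample subsets.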
Sub-100 ms Point Queries
Query performance for a typical 50K-sample cohort (see Benchmarking to measure these on your own database):
| Query type | Cold (first call) | Warm (cached) |
|---|---|---|
| Point query | < 100 ms | ~10 ms |
| Region (1 Mbp) | ~300 ms | ~50 ms |
| Batch (100 variants) | ~200 ms | ~20 ms |
DuckDB Connection
AFQuery opens a fresh DuckDB connection for each query call and closes it before returning. This is intentional for thread safety: connections are never reused. If you call db.query() in a tight loop, the per-connection overhead (~5 ms) can become significant.
For high-throughput batch workloads, use db.query_batch() or db.query_region() to amortize connection overhead over many variants.
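A rough estimate shows why batching matters. The sketch below assumes one connection per call (per-variant or per-batch) at the ~5 ms figure quoted above; `chunked` and `connection_overhead_ms` are hypothetical helpers, and the variant keys are placeholders.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

PER_CONNECTION_OVERHEAD_MS = 5  # approximate figure quoted above


def connection_overhead_ms(n_calls: int) -> int:
    """Total connection overhead if each call opens its own connection."""
    return n_calls * PER_CONNECTION_OVERHEAD_MS


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


variants = [("chr1", 925952 + i) for i in range(1000)]  # placeholder variant keys
batches = list(chunked(variants, 100))

print(connection_overhead_ms(len(variants)))  # one call per variant: 5000 ms
print(connection_overhead_ms(len(batches)))   # one call per batch of 100: 50 ms
```

Each batch produced by `chunked` would then be handed to a single batch call instead of 100 individual point queries.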
Cold vs Warm Queries
The first query on a chromosome reads Parquet data from disk into the OS page cache, and subsequent queries on the same chromosome are served from that cache. On systems with sufficient RAM, warm queries are roughly an order of magnitude faster.
To "warm up" a chromosome:
# One region query to load the chromosome into OS cache
db.query_region("chr1", start=1, end=250_000_000)
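To check whether a warm-up actually helps on your system, time the first call against later ones. The helper below is a generic sketch (`time_calls` is a hypothetical name); the lambda is a stand-in workload that you would replace with a real query call.

```python
import time
from typing import Callable


def time_calls(fn: Callable[[], object], repeats: int = 3) -> list[float]:
    """Call fn `repeats` times and return each wall-clock duration in milliseconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


# Stand-in workload; replace with a real call such as
# lambda: db.query_region("chr1", start=1, end=250_000_000)
timings = time_calls(lambda: sum(range(100_000)))
print(f"first call: {timings[0]:.2f} ms")
print(f"later calls: {[round(t, 2) for t in timings[1:]]} ms")
```

A large gap between the first and later timings suggests the data is being served from the page cache on repeat calls.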
Annotation
Thread Scaling
afquery annotate parallelizes variant annotation across --threads workers. Scaling is near-linear up to the number of available cores:
# 4-core machine
afquery annotate --db ./db/ --input variants.vcf --output annotated.vcf --threads 4
# 32-core machine
afquery annotate --db ./db/ --input variants.vcf --output annotated.vcf --threads 32
Disk I/O is the bottleneck for very large VCFs on spinning disks. SSDs or NVMe storage are recommended for annotation workloads.
Disk Usage Estimates
Parquet with Roaring Bitmap encoding is very compact:
| Scenario | Estimate |
|---|---|
| Storage per variant per sample | ~2 bytes |
| 50K samples, 100M variants | ~10 TB |
| 1K samples, 10M variants | ~20 GB |
Actual disk usage depends on variant density and carrier rates. Rare variants (low AC) compress better than common variants.
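The table rows follow directly from the ~2 bytes per variant per sample figure. A minimal sketch of that arithmetic, using decimal (SI) units and hypothetical helper names:

```python
BYTES_PER_VARIANT_PER_SAMPLE = 2  # approximate figure from the table above


def estimated_disk_bytes(n_samples: int, n_variants: int) -> int:
    """Rough on-disk size: samples x variants x bytes per cell."""
    return n_samples * n_variants * BYTES_PER_VARIANT_PER_SAMPLE


def human_readable(n_bytes: int) -> str:
    """Format a byte count using decimal (SI) units, up to TB."""
    value = float(n_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if value < 1000 or unit == "TB":
            return f"{value:g} {unit}"
        value /= 1000
    return f"{value:g} TB"  # not reached; keeps type checkers satisfied


print(human_readable(estimated_disk_bytes(50_000, 100_000_000)))  # 10 TB
print(human_readable(estimated_disk_bytes(1_000, 10_000_000)))    # 20 GB
```

Treat the result as an upper-level planning number only, since real usage varies with variant density and carrier rates as noted above.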
Memory at Query Time
Query memory is very low:
- Bitmap operations: only the relevant bitmaps are loaded from Parquet (~64 KB per variant at 50K samples)
- No full chromosome load: DuckDB reads only the specific rows matching the query position
- Capture index: one small interval tree per WES technology, loaded at Database.__init__
Typical query-time RAM: < 500 MB regardless of cohort size.
Profiling
Enable verbose output to see per-step timings (available on annotate, dump, create-db, and update-db):
Next Steps
- Benchmarking: measure and track query performance on your database
- Create a Database: build options including --build-threads and --build-memory
- Pipeline Integration: thread configuration in Nextflow and Snakemake workflows