Skip to content

Data Model

This page documents the on-disk layout of an AFQuery database, including file formats, schemas, and schema version differences.


Directory Layout

<db_dir>/
β”œβ”€β”€ manifest.json          # Build configuration
β”œβ”€β”€ metadata.sqlite        # Sample/phenotype/technology/changelog metadata
β”œβ”€β”€ variants/              # Hive-partitioned Parquet variant data
β”‚   β”œβ”€β”€ chr1/
β”‚   β”‚   β”œβ”€β”€ bucket_0/      # Positions 0–999,999
β”‚   β”‚   β”‚   └── data.parquet
β”‚   β”‚   β”œβ”€β”€ bucket_1/      # Positions 1,000,000–1,999,999
β”‚   β”‚   β”‚   └── data.parquet
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ chr2/
β”‚   └── ...
└── capture/               # WES coverage interval trees
    β”œβ”€β”€ wes_v1.pkl
    └── wes_v2.pkl

manifest.json

Stores the build configuration used during create-db.

Field Type Description
genome_build string "GRCh37" or "GRCh38"
schema_version string "2.0"
pass_only_filter bool Whether FILTER=PASS was enforced during ingest
db_version string User-specified version label
created_at string ISO 8601 timestamp
manifest_path string Path to original manifest TSV

metadata.sqlite

SQLite database containing all mutable metadata. Tables:

samples

Column Type Description
sample_id INTEGER PRIMARY KEY 0-indexed, monotonically increasing
sample_name TEXT UNIQUE Unique sample identifier
sex TEXT male or female
tech_id INTEGER Foreign key β†’ technologies.tech_id
active INTEGER 1 = active, 0 = removed

technologies

Column Type Description
tech_id INTEGER PRIMARY KEY Auto-increment
tech_name TEXT UNIQUE Technology name
bed_path TEXT | NULL Path to BED file; NULL = WGS

sample_phenotype

Column Type Description
sample_id INTEGER Foreign key β†’ samples.sample_id
phenotype_code TEXT arbitrary string (e.g., ICD-10 code, HPO term, project tag)

Many-to-many: one sample can have multiple phenotype codes.

precomputed_bitmaps

Column Type Description
key TEXT PRIMARY KEY Bitmap identifier (e.g., "all", "male", "female", "tech:wgs")
bitmap BLOB Serialized Roaring Bitmap

Cached bitmaps for common filter combinations, rebuilt on update-db.

changelog

Column Type Description
event_id INTEGER PRIMARY KEY Auto-increment
event_type TEXT preprocess, add_samples, remove_samples, compact
event_time TEXT ISO 8601 timestamp
sample_names TEXT | NULL JSON array of affected sample names, or NULL
notes TEXT | NULL Human-readable summary

Parquet Schema

Each bucket Parquet file has this schema:

Column Arrow type Description
pos uint32 1-based genomic position
ref large_utf8 Reference allele
alt large_utf8 Alternate allele
het_bitmap large_binary Serialized Roaring Bitmap of heterozygous sample IDs
hom_bitmap large_binary Serialized Roaring Bitmap of homozygous alt sample IDs
fail_bitmap large_binary Serialized Roaring Bitmap of FILTER≠PASS sample IDs

Rows are sorted by (pos, alt) within each bucket.

large_utf8 / large_binary

AFQuery uses large_utf8 and large_binary (64-bit offsets) rather than utf8 / binary (32-bit). This is required for compatibility with DuckDB's Parquet reader on large chromosomes.


Bitmap Format

Bitmaps use the Roaring Bitmap format, serialized by pyroaring's portable serialization.

  • Bit position = sample ID (0-indexed integer)
  • het_bitmap: bit set iff sample is heterozygous at this variant
  • hom_bitmap: bit set iff sample is homozygous alt at this variant
  • fail_bitmap: bit set iff sample has genotype AC>0 AND FILTERβ‰ PASS

To deserialize in Python:

from pyroaring import BitMap

with open("bucket.parquet", "rb") as f:
    ...  # use pyarrow or duckdb to read the column

bm = BitMap.deserialize(bitmap_bytes)
sample_ids = list(bm)


Capture Index (WES)

One pickle file per WES technology in capture/<tech>.pkl. Contains a pyranges interval tree (or dict of interval trees keyed by chromosome) for fast coverage lookup.

A position is covered by a WES technology if it falls within any interval in that technology's BED file (0-based, half-open).


Partitioning

Variants are partitioned into 1-Mbp buckets:

bucket_id = pos // 1_000_000

DuckDB integer arithmetic

When computing bucket IDs in DuckDB SQL, always use:

CAST(pos AS BIGINT) / 1000000
Not CAST(pos / 1000000 AS BIGINT) β€” DuckDB performs float division first and rounds, producing wrong bucket IDs.


See FILTER=PASS Tracking.