Multi-Cohort Strategies

When your organization manages samples from multiple cohorts — different institutions, studies, or disease programs — you need a strategy for how to organize AFQuery databases. This page covers three common patterns with trade-offs.

Pattern 1: One Database per Cohort

Each cohort gets its own database, queried independently.

/databases/
  cardiology_cohort/       ← 5000 samples, cardiology
  neurology_cohort/        ← 3000 samples, neurology
  rare_disease_registry/   ← 8000 samples, mixed rare diseases

When to Use

Cohorts come from different institutions with separate data governance
Sample sets have no overlap
Each cohort has its own VCF pipeline and genome build
You need to annotate VCFs against a specific cohort only

Trade-offs

Advantage	Disadvantage
Simple data governance — each database is self-contained	Cannot compute cross-cohort AF in a single query
Independent updates — rebuild one without touching others	Duplicate storage if samples overlap
Clear provenance — each database tracks its own manifest	More databases to maintain

Cross-Cohort Comparison (Python)

from afquery import Database

dbs = {
    "cardiology": Database("/databases/cardiology_cohort/"),
    "neurology": Database("/databases/neurology_cohort/"),
    "rare_disease": Database("/databases/rare_disease_registry/"),
}

chrom, pos, alt = "chr1", 12345678, "T"

for name, db in dbs.items():
    results = db.query(chrom, pos=pos, alt=alt)
    if results:
        r = results[0]
        print(f"{name:20s}  AC={r.AC:4d}  AN={r.AN:5d}  AF={r.AF:.4f}")
    else:
        print(f"{name:20s}  not observed")

Pattern 2: Merged Database with Phenotype Codes

All samples in one database, with phenotype codes distinguishing cohorts.

Manifest Design

sample_id   vcf_path    sex technology  phenotype_codes
CARD_001    /data/card/001.vcf.gz   female  wgs cardiology,EUR,control
CARD_002    /data/card/002.vcf.gz   male    wgs cardiology,EUR,case_HCM
NEURO_001   /data/neuro/001.vcf.gz  female  wes neurology,AFR,case_epilepsy
RD_001  /data/rd/001.vcf.gz male    wgs rare_disease,SAS,case_LQTS

Key: use cohort names (cardiology, neurology, rare_disease) as phenotype codes alongside disease-specific and ancestry labels.

Querying by Cohort

# AF in cardiology cohort only
afquery query --db ./merged/ --locus chr1:12345678 --ref C --alt T \
  --phenotype cardiology

# AF in everyone except rare disease
afquery query --db ./merged/ --locus chr1:12345678 --ref C --alt T \
  --phenotype ^rare_disease

# AF in European subset across all cohorts
afquery query --db ./merged/ --locus chr1:12345678 --ref C --alt T \
  --phenotype EUR

When to Use

All cohorts share the same genome build and VCF pipeline
You want cross-cohort AF queries without scripting
Data governance allows combining samples
Phenotype code design can capture all relevant groupings

Trade-offs

Advantage	Disadvantage
Single database to maintain	Rebuilding requires all VCFs accessible
Cross-cohort queries via phenotype filters	Phenotype code design must be planned upfront
One annotation pass covers all cohorts	Adding a new cohort requires `afquery update`
Flexible ad-hoc stratification	Larger database, longer rebuild time

Pattern 3: Tiered Approach

Maintain both per-cohort and merged databases.

/databases/
  institutional/
    cardiology_cohort/     ← institutional, restricted access
    neurology_cohort/      ← institutional, restricted access
  shared/
    combined_controls/     ← merged control samples, broader access

When to Use

Some cohorts have access restrictions that prevent full merging
You need a shared "reference panel" of controls from multiple sources
Institutional databases are updated independently on different schedules

Implementation

Each institution maintains its own database
Control samples (or a consented subset) are merged into a shared database
Clinical queries annotate against both institutional and shared databases

# Annotate against institutional cohort
afquery annotate --db /databases/institutional/cardiology/ \
  --input patient.vcf.gz --output step1.vcf.gz

# Annotate against shared controls (use a different INFO prefix via Python API)

Decision Matrix

Factor	Pattern 1 (Separate)	Pattern 2 (Merged)	Pattern 3 (Tiered)
Data governance	Easiest	Requires agreement	Flexible
Cross-cohort queries	Scripting required	Built-in	Partial
Rebuild cost	Per-cohort only	All samples	Both
Storage	Proportional	Slightly less	Higher (duplication)
Maintenance complexity	Low per-database	Low (one database)	Higher
Best for	Multi-institution	Single organization	Mixed governance

Next Steps

Create a Database — database creation options
Update a Database — adding samples to an existing database
Sample Filtering — phenotype and technology filters
Pipeline Integration — using databases in automated pipelines