
Reproduce — score and benchmark a methylation cohort on a laptop

Reproduces the GSE322563 native EPIC v2 primary-endpoint row of the methods paper end-to-end, starting from a fresh git clone. The same pipeline produces every BenchmarkResult row in PAPER §5. Total runtime is about 15 minutes once the raw inputs are downloaded.

Manuscript tag: memo-2026-04-22-bw
Tested on: macOS Apple Silicon · Linux x86_64
No R, no LaTeX, no Conda required.

Prerequisites

git, curl, and uv. Everything else (Python 3.11 and all locked dependencies) is installed by uv sync in Step 1; the raw inputs downloaded in Step 2 total about 3 GB compressed.

Step 1

Clone the repository at the immutable tag

git clone https://github.com/AllisonH12/thermocas9.git
cd thermocas9
git checkout memo-2026-04-22-bw        # MANUSCRIPT-shaped tag

# One-time environment setup — installs Python 3.11 + all locked deps
uv sync

# Sanity check — should print "245 passed"
uv run pytest -q

If the test count is anything other than 245, stop and open an issue — that means the lock file or the source tree is out of step with the tag.
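
If you want this gate in scripted form, something along the following lines works; it assumes pytest's usual "N passed" summary line and that the manuscript tag is checked out exactly.

# Optional scripted sanity check: verify the checkout and the test count.
[ "$(git describe --tags --exact-match 2>/dev/null)" = "memo-2026-04-22-bw" ] \
  || echo "WARNING: HEAD is not at the manuscript tag"
uv run pytest -q 2>&1 | tail -n 1 | grep -q "245 passed" \
  && echo "test suite matches the tag" \
  || echo "unexpected test count, open an issue"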

Step 2

Pull the raw GEO and UCSC inputs

These files total ~3 GB compressed. They live under data/raw/ and are gitignored. An optional integrity check follows the download commands below.

# GEO supplementary — beta matrix
mkdir -p data/raw/gse322563 data/raw/epic_v2 data/raw/ucsc data/raw/hg19

curl -L -o data/raw/gse322563/GSE322563_beta_matrix_EPIC_v2.txt.gz \
  "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE322nnn/GSE322563/suppl/GSE322563_beta_matrix_EPIC_v2.txt.gz"

# EPIC v2 platform annotation (Illumina GPL33022)
curl -L -o data/raw/epic_v2/GPL33022_family.soft.gz \
  "https://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL33nnn/GPL33022/soft/GPL33022_family.soft.gz"

# UCSC hg19 reference annotations (gene model + CpG islands + repeats + DNase-HS)
for f in refGene.txt.gz cpgIslandExt.txt.gz rmsk.txt.gz wgEncodeRegDnaseClusteredV3.txt.gz; do
  curl -L -o "data/raw/ucsc/$f" "https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/$f"
done

# UCSC hg19 reference FASTA — only chr5/6/10 (the chromosomes hosting all
# three Roth Fig. 5d positives: ESR1 on chr6, EGFLAM on chr5, GATA3 on chr10).
for chrom in chr5 chr6 chr10; do
  curl -L -o "data/raw/hg19/${chrom}.fa.gz" \
    "https://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/${chrom}.fa.gz"
  gunzip -f "data/raw/hg19/${chrom}.fa.gz"
done
cat data/raw/hg19/chr5.fa data/raw/hg19/chr6.fa data/raw/hg19/chr10.fa \
  > data/raw/hg19/hg19_chr5_6_10.fa
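
Before moving on, an optional integrity check: the loop below only asks gzip whether each archive is readable, and the concatenated FASTA should come out at roughly half a gigabyte of plain sequence.

# Optional: verify the compressed downloads and the concatenated FASTA.
for f in data/raw/gse322563/*.gz data/raw/epic_v2/*.gz data/raw/ucsc/*.gz; do
  gzip -t "$f" && echo "OK       $f" || echo "CORRUPT  $f"
done
ls -lh data/raw/hg19/hg19_chr5_6_10.fa
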
Step 3

Build the EPIC v2 probe annotation

Lifts the EPIC v2 SOFT annotation onto hg19 and restricts to chr5/6/10. Takes ~2 minutes; needs pyliftover (already locked into the uv environment).

uv run python scripts/build_epic_v2_probes.py \
  --soft data/raw/epic_v2/GPL33022_family.soft.gz \
  --output data/raw/probes_hg19_epic_v2.tsv
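
A quick peek confirms the probe table is non-empty; the exact column layout is whatever the script writes, so these commands only look at the file rather than parse it.

head -n 3 data/raw/probes_hg19_epic_v2.tsv
wc -l data/raw/probes_hg19_epic_v2.tsv
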
Step 4

Build the cohort summary

Collapses the per-sample β matrix into per-probe (β_mean, β_iqr, n) summaries for the tumor and normal arms. ~30 seconds.

uv run python scripts/build_gse322563_native_epic_v2_cohort.py \
  --beta-matrix data/raw/gse322563/GSE322563_beta_matrix_EPIC_v2.txt.gz \
  --epic-v2-probes data/raw/probes_hg19_epic_v2.tsv \
  --output-dir data/derived/gse322563_native_epic_v2_cohort/
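
The output directory should now hold one summary TSV per arm; the file names below are the ones Step 6 consumes.

ls data/derived/gse322563_native_epic_v2_cohort/
head -n 3 data/derived/gse322563_native_epic_v2_cohort/tumor_summary.tsv
head -n 3 data/derived/gse322563_native_epic_v2_cohort/normal_summary.tsv
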
Step 5

Build the candidate catalog

The candidate catalog (~3 M PAM-bearing loci across chr5/6/10, restricted to within 500 bp of an EPIC v2 probe) is not distributed via git — it is gitignored under data/derived/catalog_*.jsonl and rebuilt deterministically from the reference FASTA + the shipped PAM model + the EPIC v2 probe annotation. ~3–5 minutes on a modern laptop.

uv run thermocas build-catalog \
  --reference data/raw/hg19/hg19_chr5_6_10.fa \
  --pam-model config/pam_model.yaml \
  --probe-annotation data/raw/probes_hg19_epic_v2.tsv \
  --output data/derived/catalog_hg19_chr5_6_10_epic_v2.jsonl

The same --pam-model config/pam_model.yaml value must be passed to score-cohort in the next step — the catalog header records the model SHA and the scorer refuses to run against a catalog produced by a different PAM model.
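
Since the header records the model SHA, you can eyeball it before scoring. Assuming the header is the first record of the JSONL (the usual convention), printing it is enough:

head -n 1 data/derived/catalog_hg19_chr5_6_10_epic_v2.jsonl | jq .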

Step 6

Score the cohort

Walks the catalog, joins each candidate to its nearest probe, and applies V2.5-diff under the cohort YAML. ~90 seconds on a modern laptop.

uv run thermocas score-cohort \
  --catalog data/derived/catalog_hg19_chr5_6_10_epic_v2.jsonl \
  --cohort data/derived/gse322563_native_differential.yaml \
  --pam-model config/pam_model.yaml \
  --backend summary \
  --probe-annotation data/raw/probes_hg19_epic_v2.tsv \
  --tumor-summary data/derived/gse322563_native_epic_v2_cohort/tumor_summary.tsv \
  --normal-summary data/derived/gse322563_native_epic_v2_cohort/normal_summary.tsv \
  --probabilistic \
  --output data/derived/scored_gse322563_native_differential.jsonl

Two flags are easy to miss and both matter:

  • --backend summary selects the per-probe-summary backend (the format build_gse322563_native_epic_v2_cohort.py produced in Step 4). The default local backend expects raw per-sample β matrices and will reject the summary TSVs.
  • --probabilistic is what populates the p_therapeutic_selectivity field that Step 7's --score-field reads. Without it, the scored JSONL only carries the deterministic V1 final_score and the V2.5 benchmark row will not reproduce (a quick check follows this list).
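
A quick way to confirm both flags took effect is to look at one scored record and check that the probabilistic field is present and non-null. The field names come from the bullets above; if the scored file starts with a header record, or the fields live in a nested object, adjust accordingly.

head -n 1 data/derived/scored_gse322563_native_differential.jsonl \
  | jq '{final_score, p_therapeutic_selectivity}'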

To score under V2.5-sigmoid instead of V2.5-diff, edit the cohort YAML in place — there is no separately-shipped sigmoid YAML for this cohort path. Change two fields in data/derived/gse322563_native_differential.yaml:

probabilistic_mode: tumor_plus_gap_sigmoid   # was: tumor_plus_differential_protection
sigma_fixed: 0.0707                          # add this line; ≈ √2 · σ_floor

The cohort loader's iff-semantics validators enforce that sigma_fixed is present iff probabilistic_mode == tumor_plus_gap_sigmoid. For the multi-bandwidth WG sweep referenced in PAPER §5.2.2, see scripts/sigmoid_bandwidth_sweep.py and scripts/sigmoid_delta_sigma_wg_sweep.py, which re-score programmatically rather than via materialized YAMLs.
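
If you prefer a scripted edit, a sed sketch along these lines works under the assumption that both keys are flat, top-level YAML entries appearing at most once (check the file first; this is not a shipped helper). Keep the backup so you can restore the V2.5-diff configuration afterwards.

YAML=data/derived/gse322563_native_differential.yaml
cp "$YAML" "$YAML.bak"      # keep the original V2.5-diff config around
sed -i.tmp 's/^probabilistic_mode:.*/probabilistic_mode: tumor_plus_gap_sigmoid/' "$YAML" \
  && rm -f "$YAML.tmp"
grep -q '^sigma_fixed:' "$YAML" || echo 'sigma_fixed: 0.0707' >> "$YAML"

Restore the original with mv "$YAML.bak" "$YAML" before re-running the V2.5-diff rows.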

Step 7

Benchmark against the Roth validated positives

uv run thermocas benchmark \
  --scored data/derived/scored_gse322563_native_differential.jsonl \
  --positives data/derived/epic_v2_positives/positives_roth_validated.txt \
  --cohort-name GSE322563-native-validated-V2.5 \
  --score-field p_therapeutic_selectivity --top-k 20 \
  --no-enforce-holdout \
  --output examples/gse322563_native_roth_labels/bench_validated_differential.jsonl

Inspect the result with cat or jq. The roc_auc, precision_at_k, precision_at_k_{min, max}, recall_at_k, recall_at_k_{min, max}, tie_band_size_at_k, and tie_break_policy fields are exactly what PAPER §5 reports. The roc_auc value should match the paper to four decimal places.
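
For example, a jq projection of exactly those fields; the {min, max} variants are assumed here to be spelled precision_at_k_min, precision_at_k_max, and so on, so adjust the keys if the record names them differently.

jq '{roc_auc, precision_at_k, precision_at_k_min, precision_at_k_max,
     recall_at_k, recall_at_k_min, recall_at_k_max,
     tie_band_size_at_k, tie_break_policy}' \
  examples/gse322563_native_roth_labels/bench_validated_differential.jsonl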

Step 8

Annotate the top 20 with biological context

Joins each top-K hit to its nearest gene (refGene), CpG-island context, RepeatMasker overlap, and ENCODE DNase-HS cluster breadth. Emits both a TSV (machine-readable) and a Markdown shortlist (collaborator-readable).

uv run python scripts/annotate_top_hits.py \
  --scored data/derived/scored_gse322563_native_differential.jsonl \
  --top-k 20 \
  --rmsk data/raw/ucsc/rmsk.txt.gz \
  --dnase data/raw/ucsc/wgEncodeRegDnaseClusteredV3.txt.gz \
  --positives data/derived/epic_v2_positives/positives_roth_wide.txt \
  --output examples/gse322563_native_roth_labels/top20_annotated_v25.tsv \
  --markdown examples/gse322563_native_roth_labels/top20_annotated_v25.md

The committed Markdown file under examples/gse322563_native_roth_labels/top20_annotated_v25.md should be byte-identical to the one you just produced.
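
Because Step 8 writes to the committed path, the byte-identity claim reduces to a clean git diff:

# Exits 0 and prints the confirmation only if the regenerated file is unchanged.
git diff --exit-code -- examples/gse322563_native_roth_labels/top20_annotated_v25.md \
  && echo "top-20 shortlist reproduced byte-for-byte"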

What you have just reproduced

Going further

When something does not match

If a BenchmarkResult value diverges from the paper:

  1. Confirm git rev-parse memo-2026-04-22-bw resolves to the same SHA as the cloned tip (a snippet follows this list).
  2. Run uv run python scripts/verify_manuscript_claims.py. A green run means the bench artifacts and constants the paper cites are intact.
  3. If the verifier is green but your numbers differ, the divergence is almost always either (a) a different --score-field flag, (b) a different positives-list tier, or (c) a stale data/raw/ download. Re-pull from Step 2 and re-run.
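
A one-liner for the first check; the tag is dereferenced to its commit in case it is annotated.

# Both commands should print the same commit SHA.
git rev-parse "memo-2026-04-22-bw^{commit}"
git rev-parse HEAD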