Compositional probability-scale scoring and tie-band-aware benchmarking for methylome-guided ThermoCas9 target-site ranking
- Author
- Allison Huang, Columbia University · allisonhmercer@gmail.com
- Date
- 2026-04-22
- PAPER.pdf
- long-form technical memo (audit-trail revision; cite
paper-5-10j) - MANUSCRIPT.pdf
- Bioinformatics-shaped short version (cite
memo-2026-04-22-bw) - Repository
- github.com/AllisonH12/thermocas9 · BSD-3 · Python 3.11+
- bioRxiv DOI
- to be added on posting
- Zenodo DOI
- to be minted on first GitHub release (citable archive of the tagged source)
p_observation_trustworthy saturates by EvidenceClass,
not continuously. Read the score as a probability-scale ranking axis, not as a calibrated
hypothesis test. The framework is open educational research — not a clinical decision-support
system.
The biology, in one paragraph
Roth et al. (Nature 2026) showed that ThermoCas9 — a thermophilic Cas9 from Geobacillus thermodenitrificans T12 — refuses to cleave when the fifth cytosine of its PAM is methylated. An unmethylated PAM cleaves as usual; a 5-methylcytosine at that position abolishes binding. The therapeutic implication is that loci which are unmethylated in tumor cells but methylated in matched normal cells become selectively editable — the editor only cuts the disease side. Roth demonstrated this with three breast-cancer targets in MCF-7 vs MCF-10A.
Turning that mechanism into a genome-scale shortlist is a ranking problem. Which loci, across millions of candidates, look like the best bets?
Why a generic guide-scorer does not work
Off-the-shelf CRISPR guide-scoring tools (Azimuth, DeepCRISPR, CRISPRitz) score sequence properties and off-target risk. They have no notion of methylation state at the PAM cytosine, because they were not designed for a methylation-sensitive Cas9 variant. Methylation-array differential-analysis tools (methylKit, minfi, limma) produce per-probe statistics but not ranked target lists; their output feeds a scorer, it does not replace one.
Three problems, specifically, made it worth building a new scorer:
- Replicate counts are tiny. Methylation cohorts often ship with
n = 2–3samples per side. Any scorer must carry its own uncertainty rather than emit a scalar that hides it. - The "normal" arm is inconsistent. Adjacent-normal tissue in a bulk methylation study is not the same object as a matched cell-line pair. A scorer that bakes in a fixed β_normal threshold will fail across cohort types.
- Top-K can sit inside a tied band. On low-replicate cohorts the top of any composite can collapse into hundreds or thousands of equally-scored candidates. Reporting a single Precision@K without that interval misleads the reader.
The compositional skeleton
thermocas answers all three with the same multiplicative form:
p_therapeutic_selectivity = p_targ × (gap factor) × p_trust
Each factor maps to one independently-replaceable question.
p_targ — is the tumor side actually targetable?
This is the probability that the tumor-arm methylation at the PAM cytosine is low enough
to permit cleavage. It is computed by integrating a method-of-moments Beta posterior on the
per-probe summary β, with a piecewise-linear fallback when the moment match degenerates. A site
cannot be a candidate if the editor cannot bind to it.
Gap factor — is the tumor-vs-normal contrast meaningful? Two instances ship in the current tag:
- V2.5-diff (
tumor_plus_differential_protection) —p_diff = P(β_normal − β_tumor > δ)under an independent-normal approximation on the per-probe β summaries withδ = 0.20andσ_floor = 0.05. The original V2.5 composite, retained for backward-compatibility and AUC parity on cell-line cohorts. - V2.5-sigmoid (
tumor_plus_gap_sigmoid) —sigmoid((β_n − β_t − δ) / σ_fixed)withδ = 0.20andσ_fixed ≈ 0.0707 ≈ √2 · σ_floor. A fixed-bandwidth logistic on the same gap signal. The worked / default whole-genome prioritization axis selected in the §5.2.2 frozen WG panel.
The point is not which gap factor "wins" — the point is that the slot is replaceable. A future axis (an exact difference-of-Betas, an SE-on-mean variant, a mixture-aware version) can be plugged in without rewriting the scorer.
p_trust — is the per-probe observation trustworthy?
Discrete EvidenceClass (EXACT, PROXIMAL_CLOSE, PROXIMAL, REGIONAL, UNOBSERVED) times
a sample-count ramp min(1, min(n_t, n_n) / 30). A site whose methylation is nominated
by a regional proxy on a 2-sample cohort cannot dominate the top-K just because its
p_targ × gap happens to peak.
The multiplicative form is a gating-style ranking heuristic: a candidate is
penalized if any required component is weak. An additive score over the same signals would let a
strong component compensate for a failed gate, which is less aligned with a selectivity screen.
The product is not a calibrated joint probability — p_targ, the gap factor,
and p_trust are correlated on real catalogs (EXACT-class records cluster at extreme
β values that also drive p_targ). Read the score as a probability-scale ranking
axis throughout, not as "0.9 means 90% of 0.9-scored sites edit successfully."
Tie-band-aware benchmarking — the actual durable contribution
This is the part of the paper most likely to outlive any specific gap-factor choice.
On a n = 2/2 matched cell-line cohort with a probabilistic composite, the score
distribution at K = 100 routinely sits inside a tied band. V2.5-diff at the whole-genome scale
produces tied bands of 421 to 1,493 records at K = 100 across the four evaluated
cohort paths — a single Precision@100 number for that score is meaningless without an interval.
Every BenchmarkResult row that thermocas produces carries:
precision_at_k(the natural-order observed value);precision_at_k_{min, max}(the adversarial interval over the tied band);recall_at_kandrecall_at_k_{min, max};tie_band_size_at_k(so the reader sees the denominator);- a mid-rank Mann–Whitney
roc_auc(no tie-break ambiguity); - the
tie_break_policyused.
Switching from V2.5-diff to V2.5-sigmoid does not change matched-cell-line AUC by more than 0.002,
but it collapses every WG tie_band@100 to 1. That is a strict
improvement in top-K usability on the same biological signal — and it is only visible because the
benchmark contract reports tie bands explicitly.
What the benchmark says
Four public methylation cohorts, scored on three positives tiers (validated from the
Roth Fig. 5d targets; narrow ±50 bp; wide ±500 bp).
- GSE322563 (Roth MCF-7/MCF-10A) is the primary endpoint. On both the HM450-intersect and native EPIC v2 ingest paths, V2.5-diff and V2.5-sigmoid place all three Roth Fig. 5d positives in the upper 0.06–4.6% of millions of WG candidates.
- A 1,000,000-draw random-triple null on the n = 3 primary endpoint places V2.5-diff at one-sided
p ≈ 4 × 10⁻⁶ / 1.5 × 10⁻⁵ / 2.8 × 10⁻⁵across the three matched-cell-line cohort paths — strictly tighter than raw Δβ-only's8.6 × 10⁻⁵ / 3.1 × 10⁻⁴ / 1.1 × 10⁻⁴on the same paths. This is an against-random check, not a paired superiority test (the paired sign-flip null floors atp = 0.125withn_pos = 3). - The headline tissue stress-test is on GSE69914 (n = 305 primary tumor + 50 healthy-donor breast tissue, unpaired) under transported Roth labels. A probe-level limma-style moderated-
tbaseline is competitive on cell lines (AUC 0.959–0.991) but collapses to AUC 0.573 on bulk tissue. V2.5-sigmoid recovers tissue to AUC 0.862 on the same probe-level inputs — a +0.29 swing that confirms thep_targ × p_trustwrapping is the load-bearing contribution, not the gap factor's specific shape. The cohort-level tissue gain is per-positive heterogeneous: GATA3 carries the matched-pool signal, ESR1 is matched-near-random under V2.5-sigmoid.
The atlas page has a per-positive whole-genome rank dot-plot that visualizes this heterogeneity directly — including the GSE69914 ESR1 reversal under V2.5-sigmoid.
Honest scope
A few things the paper is explicit about:
- The validated positives set is n = 3. AUC at this scale is a rank-lift summary, not an inferential discovery-performance estimate.
- V2.5-sigmoid was selected by post-repair sensitivity and whole-genome stress testing on the same benchmark family. Its prospective ranking utility remains to be validated on newly generated labels or wet-lab follow-up. The detailed external-validation design lives at docs/notes/external_validation_instruction.md in the repo.
- On matched cell lines, raw Δβ-only is already strong (AUC 0.96–0.97). V2.5 variants add small but consistent margins (+0.010 to +0.080 across the nine tier × path rows). The transferable contribution is the tie-band-aware benchmark contract and whole-genome tied-band diagnosis, not a claim that any one gap factor is universally best.
How to cite
Until the bioRxiv DOI is live, cite the immutable git tag:
Huang, A. (2026). Compositional probability-scale scoring and tie-band-aware benchmarking for
methylome-guided ThermoCas9 target-site ranking. Technical memo, version paper-5-10j.
https://github.com/AllisonH12/thermocas9/tree/paper-5-10j
For the Bioinformatics-shaped short version, cite memo-2026-04-22-bw instead.
Where to go next
- See where the candidates land — per-cohort top-100 with a per-positive WG-rank dot-plot.
- Reproduce a benchmark row on a laptop — 8-step
uv runwalkthrough. - Read the long version — full audit trail including V2 → V2.5 development history.