Computational discovery of methylation-selective ThermoCas9 target sites in cancer, with one pilot in vitro validation

A complete 10-week undergraduate research project. The student builds a TCGA/GDC pipeline that scans for ThermoCas9-compatible PAM sites whose critical fifth-position cytosine is hypomethylated in tumor and methylated in normal tissue, then validates the top candidate with one in vitro cleavage assay. The project yields a complete computational story plus one experimental reality check.

By Allison Huang · Thermocas9 Inc · Audience: undergraduate + mentor · Duration: 10 weeks · Source: Nature 2026

Project concept

Title: Computational discovery of methylation-selective ThermoCas9 target sites in cancer, with one pilot in vitro validation assay.

Core question: Can an undergraduate identify genomic sites where ThermoCas9 is likely to cut tumor DNA but not matched normal DNA because the PAM-site cytosine is hypomethylated in tumor and relatively methylated in normal tissue? This is the correct mechanistic framing because the 2026 ThermoCas9 paper shows that methylation sensitivity is concentrated at the fifth PAM cytosine, while methylation within the protospacer has little effect. (Nature 2026)

Why this is a good undergraduate project

It is narrow enough to finish and still teaches real translational genomics.

It also fits the current evidence boundary. ThermoCas9 has a methylation-sensitive PAM logic and the translational opportunity is real, but the field is still preclinical. That makes candidate prioritization a better summer endpoint than therapeutic development. (Nature 2026)

The scientific logic the student should learn

ThermoCas9 recognizes PAMs in the 5'-NNNNCNR-3' family, with the paper specifically testing CpG- and CpC-containing PAMs such as 5'-NNNNCGA-3' and 5'-NNGGCCA-3'. Methylation of the fifth PAM cytosine strongly reduces cleavage, especially when present on the non-target strand or both strands. Protospacer methylation had little effect in the reported assays. (Nature 2026)
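
The PAM rule above can be written down as a small check. This is a minimal sketch, not the paper's method: the function names are hypothetical, and the code encodes only the 5'-NNNNCNR-3' pattern and the CpG case at positions 5 and 6. It says nothing about strand-specific methylation effects or guide design.

```python
PAM_LENGTH = 7  # 5'-NNNNCNR-3' is seven nucleotides long


def is_thermocas9_pam(pam: str) -> bool:
    """Return True if a 7-mer fits 5'-NNNNCNR-3' (C at position 5, A or G at position 7)."""
    pam = pam.upper()
    return (
        len(pam) == PAM_LENGTH
        and set(pam) <= set("ACGT")
        and pam[4] == "C"      # fifth position (index 4) must be cytosine
        and pam[6] in "AG"     # seventh position is a purine (R)
    )


def has_cpg_at_c5(pam: str) -> bool:
    """Return True if the fifth-position C is followed by G, i.e. the PAM contains a CpG."""
    return is_thermocas9_pam(pam) and pam.upper()[5] == "G"
```

With this sketch, a sequence matching 5'-NNNNCGA-3' passes both checks, while a 5'-NNGGCCA-3' sequence is a valid PAM but a CpC rather than a CpG context.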

the whole project in one sentence
The computational project should not ask "which genes are globally hypomethylated?" It should ask: which exact genomic loci contain a ThermoCas9-compatible PAM whose key cytosine is hypomethylated in tumor and relatively methylated in normal tissue?

What the student should produce by the end of summer

  1. A reproducible pipeline that scans candidate loci for ThermoCas9 PAM compatibility
  2. A ranked list of roughly 10 to 30 candidate sites in one cancer type
  3. Figures showing tumor-normal methylation separation at the candidate PAM-site CpG
  4. A short rationale for the top 3 to 5 loci
  5. One pilot assay showing that a selected methylated substrate is cut less efficiently than the matched unmethylated substrate

That is enough for a final talk, a poster, or a methods-focused internal report.

Recommended project design

The cleanest version is a single-cancer, single-data-source, single-assay design.

Recommended cancer type

Choose a tumor type in TCGA with substantial methylation data and at least some normal comparators. The GDC provides harmonized methylation beta values for TCGA samples, processed with SeSAMe; beta values range from 0 to 1. (GDC Data Portal)

Breast cancer is attractive because the ThermoCas9 paper already used MCF-7 and MCF-10A and demonstrated selective editing at breast-cancer-relevant loci. That gives the student a biologically coherent starting point, even if the final top candidates are not ESR1 or GATA3. (Nature 2026)

Recommended public data source

Use TCGA methylation data from the NCI Genomic Data Commons. The GDC methylation workflow generates harmonized beta-value files from Illumina methylation arrays using SeSAMe. Each beta value represents methylation level at a probe-associated CpG site. (GDC docs)
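
As a starting point, a beta-value file can be read into a probe-to-beta mapping. The sketch below assumes a tab-delimited file with the probe ID in the first column and the beta value in the second; the student should check that layout against the files actually downloaded. The function name and the toy rows are illustrative, and the beta values shown are made up.

```python
import csv
import io
import math


def read_beta_file(handle):
    """Read probe ID -> beta value from a tab-delimited handle (assumed two-column layout)."""
    betas = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue
        probe_id, raw_value = row[0], row[1]
        try:
            beta = float(raw_value)
        except ValueError:
            continue  # header line or a missing value such as "NA"
        if math.isnan(beta):
            continue
        if not 0.0 <= beta <= 1.0:
            raise ValueError(f"beta out of range for {probe_id}: {beta}")
        betas[probe_id] = beta
    return betas


# Toy input with made-up values, standing in for one sample's file.
example = read_beta_file(io.StringIO("cg00000029\t0.45\ncg00000108\tNA\ncg00000109\t0.91\n"))
```

Rows with missing values are dropped rather than imputed, which keeps later tumor-normal comparisons limited to probes that were actually measured in each sample.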

Recommended computational endpoint

Build a candidate ranking score rather than trying to claim definitive target discovery. That matters because TCGA methylation arrays assay known CpG sites associated with specific probes, not every cytosine in the genome. The HM450 platform covers more than 450,000 CpG sites, but it is still a predefined subset. (Illumina HM450)

honest framing
"We are prioritizing ThermoCas9 candidate loci using public methylation array data, then experimentally checking the exact PAM-site methylation effect in one pilot assay."

Precise summer-project scope

Phase 1

Define the search space

The student chooses one cancer type and a small set of biologically plausible genes or regions. There are two sensible options.

Option A · gene-centered (recommended)

Start from 20 to 50 genes that are relevant to the cancer type, then scan nearby sequence for ThermoCas9-compatible PAMs whose critical cytosine overlaps a measured CpG or lies very close to one. Easier for an undergraduate and keeps the result interpretable.

Option B · genome-first

Start from differentially methylated CpG probes and search nearby sequence for ThermoCas9-compatible PAMs. More discovery-oriented but harder to interpret and debug.

For an undergraduate, Option A is better.

Phase 2

Obtain methylation values

The student downloads TCGA tumor and normal methylation beta values from GDC. GDC provides harmonized methylation outputs and improved probe annotation resources, including hg38 coordinates and masking information for low-quality probes. (GDC Portal)

At this stage, the student should learn three things:

  • beta values range from 0 to 1, where higher means more methylated (GDC docs)
  • probes have genomic coordinates and annotation metadata (GDC docs)
  • low-quality or poorly mapping probes should be filtered using available annotation masks (GDC mask)
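
The third point can be sketched as a filter over the probe-to-beta mapping. This is an illustration under assumptions: the mask column name `MASK_general` and the dictionary layout of the annotation are placeholders that must be matched to the annotation file the student actually uses.

```python
def is_masked(annotation_row, mask_column="MASK_general"):
    """Return True if the annotation flags this probe as masked (column name is an assumption)."""
    value = str(annotation_row.get(mask_column, "")).strip().upper()
    return value in {"TRUE", "T", "1"}


def filter_betas(betas, annotation, mask_column="MASK_general"):
    """Keep only probes that have an annotation row and are not masked."""
    kept = {}
    for probe_id, beta in betas.items():
        row = annotation.get(probe_id)
        if row is None:
            continue  # no coordinates available, so the probe cannot be placed near a PAM
        if is_masked(row, mask_column):
            continue
        kept[probe_id] = beta
    return kept
```

Probes without any annotation row are dropped as well, since a probe that cannot be placed on the genome cannot be related to a PAM cytosine.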

Phase 3

Identify candidate ThermoCas9 sites

Scan reference sequence around each gene or region for PAMs in the ThermoCas9 family. The Nature paper states ThermoCas9 has a broad PAM specificity, with a strict requirement for the fifth position to be a C-G pair, summarized as 5'-NNNNCNR-3', and gives example functional PAMs including 5'-NNNNCGA-3'. (Nature 2026)

For a summer project the search should be conservative. Prioritize:

  • CpG-containing PAMs, especially 5'-NNNNCGA-3'-like motifs, because CpG methylation is the dominant mammalian methylation context and the ThermoCas9 paper directly emphasizes this point
  • PAMs where the key cytosine overlaps or lies immediately adjacent to a measured CpG probe
  • loci with clear tumor-normal methylation separation

This makes the project closer to the actual translational use case.

Phase 4

Rank candidates

A simple scoring model is enough. For each site, score:

  • methylation differential: normal beta minus tumor beta, so that larger values mean the PAM cytosine is less methylated in tumor than in normal tissue
  • separation robustness: variance and overlap between tumor and normal distributions
  • probe confidence: whether the nearest methylation probe is high quality and close to the PAM cytosine
  • biological interest: whether the locus sits in a promoter, enhancer, or disease-relevant gene neighborhood using available annotation
  • practical assayability: whether the student can synthesize a short substrate around this site for the pilot assay

The student is not trying to build a perfect predictor. The aim is a defensible shortlist.
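
One way to turn the first three criteria into a number is sketched below. The formula, the 50 bp distance cutoff, and the function names are illustrative assumptions rather than a validated model. Biological interest and assayability are left as manual judgments.

```python
from statistics import mean


def pairwise_separation(tumor, normal):
    """Fraction of tumor/normal sample pairs in which the tumor beta is lower (ties count half)."""
    wins = sum(
        1.0 if t < n else 0.5 if t == n else 0.0
        for t in tumor
        for n in normal
    )
    return wins / (len(tumor) * len(normal))


def score_site(tumor, normal, probe_distance_bp, max_distance_bp=50):
    """Combine methylation differential, separation, and probe proximity into one score."""
    delta = mean(normal) - mean(tumor)            # positive when tumor is hypomethylated
    separation = pairwise_separation(tumor, normal)
    proximity = max(0.0, 1.0 - probe_distance_bp / max_distance_bp)
    return max(0.0, delta) * separation * proximity
```

For example, tumor betas of 0.1, 0.2, 0.15 against normal betas of 0.8, 0.9, 0.85 with the probe directly on the PAM cytosine give a score of about 0.7, and the same data with the groups swapped score 0. Both lists must be non-empty.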

Phase 5

Pilot validation assay

The best single validation assay is a cell-free cleavage assay with matched methylated and unmethylated DNA substrates.

Why this assay:

  • directly tests the paper's mechanism
  • much simpler than cell editing
  • avoids conflating methylation effects with chromatin, delivery, or cell-state issues
  • fits undergraduate time

The Nature paper itself used in vitro DNA cutting assays and synthetic oligo duplexes to distinguish methylated and unmethylated PAM contexts. (Nature 2026)

Minimal assay design

Take one top-ranked candidate sequence and synthesize two duplexes:

  • one unmethylated
  • one with 5mC at the critical PAM cytosine

Then assemble ThermoCas9 RNP with the matching guide and compare cleavage.

A strong pilot result would be:

  • clear cleavage of the unmethylated substrate
  • reduced cleavage of the methylated substrate

That reproduces the core mechanistic behavior in a project-specific sequence context.
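
If the readout is a gel, the comparison can be quantified from band intensities. The sketch below uses a common convention, fraction cleaved = product signal / (product signal + uncut signal). The function names are hypothetical, and any intensity values passed in are whatever the student's own densitometry produces.

```python
def fraction_cleaved(uncut_intensity, product_intensity):
    """Fraction of substrate cleaved, from background-subtracted band intensities."""
    total = uncut_intensity + product_intensity
    if total <= 0:
        raise ValueError("total band intensity must be positive")
    return product_intensity / total


def methylation_effect(unmethylated, methylated):
    """Difference in cleaved fraction between unmethylated and methylated substrates.

    Each argument is an (uncut_intensity, product_intensity) pair from the same gel.
    """
    return fraction_cleaved(*unmethylated) - fraction_cleaved(*methylated)
```

A positive value means the methylated substrate was cut less efficiently. Running both substrates on the same gel with the same RNP preparation keeps the comparison fair.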

Exact undergraduate deliverables (week 10)

Computational

  • One notebook or script pipeline
  • One candidate table with ranking score
  • Heatmap or boxplot panel of methylation at top loci
  • Schematic figure showing the selected PAM and methylated cytosine

Experimental

  • One gel image or cleavage quantification panel
  • Methylated vs unmethylated substrate comparison

Written

  • 4–6 page report with methods, results, limitations, next steps
  • Honest discussion of array-coverage limits

A realistic 10-week timeline

Weeks | Focus | Output
1–2 | Background and setup | Read ThermoCas9 paper sections on PAM specificity and methylation sensitivity. Learn GDC methylation data structure. Get a small example dataset running. (Nature 2026)
3–4 | First-pass candidate pipeline | Selected tumor type · selected gene list · first scan of ThermoCas9-compatible PAMs · merged methylation annotation
5–6 | Rank sites and stress-test the shortlist | Top 20–30 candidates · plots showing tumor-normal methylation separation · one chosen site for pilot assay
7–8 | Pilot assay execution | Synthesized substrate design · guide design · initial cleavage readout
9–10 | Cleanup and final presentation | Final ranked table · one clear validation figure · honest discussion of why array-based screening is only a prioritization step

What the mentor needs to provide

For this to work, the mentor or lab must provide four things:

  1. A pre-existing way to obtain or access ThermoCas9 protein or construct
  2. Minimal support for guide preparation and RNP assembly
  3. Access to basic cleavage readout, usually gel electrophoresis
  4. A defined choice of cancer type or gene set to prevent the project from sprawling

Without those, the student should do the computational project only.

do not let the project expand into
  • full genome-wide discovery across many cancers
  • cell-based editing as the first validation
  • off-target mapping
  • animal studies
  • claims of therapeutic readiness

All of these are too large for a summer scope and would blur the main result.

The most important technical limitation

key caveat the student must understand
TCGA methylation arrays measure probe-associated CpG sites, not all cytosines genome-wide. The HM450 platform covers hundreds of thousands of CpGs, but coverage is still selective. GDC documentation explicitly describes methylation beta values as associated with array probes and known CpG sites. (Illumina HM450; GDC docs)

So the pipeline cannot prove the exact ThermoCas9 PAM cytosine is differentially methylated in patient samples unless that precise CpG is directly measured or confirmed with follow-up assays. That is why the project's experimental step should validate the methylation-sensitive cleavage behavior on an exact synthetic substrate.
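
The pipeline can make this caveat explicit by labeling each candidate according to how directly its PAM cytosine is measured. The sketch below is one possible labeling, assuming CpG-containing PAMs, 0-based plus-strand coordinates for both the PAM cytosine and the probe CpG cytosines, and an arbitrary 50 bp window for "proxy" evidence. All names and thresholds are hypothetical.

```python
from bisect import bisect_left


def nearest_probe(c5_pos, strand, probe_positions):
    """Find the closest probe CpG to the PAM cytosine.

    probe_positions: sorted plus-strand coordinates of the probe CpG cytosines.
    For a minus-strand PAM, the CpG's plus-strand cytosine sits one base upstream.
    """
    cpg_c = c5_pos if strand == "+" else c5_pos - 1
    i = bisect_left(probe_positions, cpg_c)
    candidates = probe_positions[max(0, i - 1): i + 1]
    if not candidates:
        return None, None
    best = min(candidates, key=lambda p: abs(p - cpg_c))
    return best, abs(best - cpg_c)


def evidence_level(distance_bp, proxy_window_bp=50):
    """Label a candidate as 'direct', 'proxy', or 'unmeasured' from its probe distance."""
    if distance_bp is None or distance_bp > proxy_window_bp:
        return "unmeasured"
    return "direct" if distance_bp == 0 else "proxy"
```

Only "direct" candidates have array data at the PAM CpG itself. "Proxy" candidates rest on the assumption that nearby CpGs are methylated similarly, which the report should state as an assumption.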

A strong student will say this clearly in the discussion section.

The hypothesis statement (for the student report)

Hypothesis
Some cancer-associated loci contain ThermoCas9-compatible PAMs whose critical PAM cytosine is predicted to be less methylated in tumor than in normal tissue, enabling prioritized discovery of candidate methylation-selective editing sites using public methylation-array data and preliminary validation by in vitro cleavage assay.

Specific, realistic, and defensible.

Suggested analysis workflow

  1. Choose cancer type
  2. Collect tumor and normal methylation beta values from GDC (GDC Portal)
  3. Filter probes using annotation and mask information (GDC mask)
  4. Map probe coordinates to candidate genes or regions
  5. Scan sequence for ThermoCas9-compatible PAMs based on the fifth-position C requirement and CpG-containing motifs emphasized in the Nature paper (Nature 2026)
  6. Score loci by tumor-normal methylation separation and assay feasibility
  7. Select one site
  8. Test methylated versus unmethylated substrate cleavage in vitro

What a final figure set could look like

Figure 1

ThermoCas9 methylation-sensitivity mechanism schematic

The PAM cytosine (C5*) shown as the critical variable. Highlight Asp1017 contact and the steric block from a 5-methyl group. Cite the paper's mechanistic finding. (Nature 2026)

Figure 2

Pipeline diagram

Flow from GDC methylation data → probe filtering → PAM scan → tumor-normal scoring → candidate shortlist. Cite GDC methylation data structure and annotation workflow. (GDC docs)

Figure 3

Boxplots / violin plots for top 5 candidate loci

Tumor versus normal beta values per locus, with sample sizes. Highlight the chosen site. (GDC beta values)

Figure 4

Pilot cleavage assay

Methylated vs unmethylated substrate cleavage for the top locus. Mechanistically aligned with the ThermoCas9 paper's in vitro design. (Nature 2026)

What would count as success

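Success is not finding the perfect clinical target. Success is a reproducible pipeline, a defensible shortlist of candidate sites, and one pilot assay in which the methylated substrate is cut less efficiently than the matched unmethylated substrate.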

That is already a very good undergraduate summer result.

Recommended most-practical version

Cancer type
Breast cancer
Search space
20 to 30 breast-cancer-relevant genes or regulatory regions
Data
TCGA / GDC methylation beta values
Computational endpoint
Ranked list of CpG-containing ThermoCas9 PAM candidates
Experimental endpoint
One methylated versus unmethylated cleavage assay on the top candidate

Stays aligned with the only strong current translational proof of concept and avoids unnecessary complexity.

next outputs available on request
This project page can be turned into a week-by-week student project sheet, a methods checklist, or a starter analysis notebook outline. Contact contact@thermocas9.com.

Sources

  1. Roth M.O., Shu Y., Zhao Y., Trasanidou D., Hoffman R.D., et al. Molecular basis for methylation-sensitive editing by Cas9. Nature (2026). DOI 10.1038/s41586-026-10384-z. Open access (CC BY-NC-ND 4.0).
  2. NCI Genomic Data Commons (GDC) Data Portal — TCGA methylation beta values.
  3. GDC Methylation Analysis Pipeline documentation — beta values, SeSAMe processing, probe annotation.
  4. Illumina HumanMethylation450 (HM450) BeadChip datasheet — over 450,000 probe-associated CpG sites.
  5. GDC: Improved DNA Methylation Array Probe Annotation — hg38 coordinates and probe-mask resources.