Project concept
Title: Computational discovery of methylation-selective ThermoCas9 target sites in cancer, with one pilot in vitro validation assay.
Core question: Can an undergraduate identify genomic sites where ThermoCas9 is likely to cut tumor DNA but not matched normal DNA because the PAM-site cytosine is hypomethylated in tumor and relatively methylated in normal tissue? This is the correct mechanistic framing because the 2026 ThermoCas9 paper shows that methylation sensitivity is concentrated at the fifth PAM cytosine, while methylation within the protospacer has little effect. (Nature 2026)
Why this is a good undergraduate project
It is narrow enough to finish and still teaches real translational genomics:
- sequence motif scanning
- methylation-data handling
- candidate ranking
- tumor-versus-normal biomarker logic
- one mechanistically targeted validation assay
It also fits the current evidence boundary. ThermoCas9 has a methylation-sensitive PAM logic and the translational opportunity is real, but the field is still preclinical. That makes candidate prioritization a better summer endpoint than therapeutic development. (Nature 2026)
The scientific logic the student should learn
ThermoCas9 recognizes PAMs in the 5'-NNNNCNR-3' family, with the paper specifically testing CpG- and CpC-containing PAMs such as 5'-NNNNCGA-3' and 5'-NNGGCCA-3'. Methylation of the fifth PAM cytosine strongly reduces cleavage, especially when present on the non-target strand or both strands. Protospacer methylation had little effect in the reported assays. (Nature 2026)
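The PAM rule above can be turned into a small motif matcher. The following is a minimal sketch, assuming the 5'-NNNNCNR-3' consensus is applied literally to a single strand; the function name find_pams and the returned tuple layout are illustrative choices, not taken from the paper.

```python
import re

# 5'-NNNNCNR-3': any four bases, C at position 5, any base, then a purine (A or G).
# The lookahead lets overlapping PAM windows all be reported.
PAM_PATTERN = re.compile(r"(?=([ACGT]{4}C[ACGT][AG]))")


def find_pams(seq):
    """Return (start, pam, c5_index, is_cpg) for each PAM window on the given strand.

    start and c5_index are 0-based offsets into seq. is_cpg is True when the
    fifth-position C is followed by G, i.e. the critical cytosine sits in a CpG.
    """
    seq = seq.upper()
    hits = []
    for match in PAM_PATTERN.finditer(seq):
        pam = match.group(1)
        start = match.start()
        c5_index = start + 4       # fifth PAM base, 0-based
        is_cpg = pam[5] == "G"     # position 6 is G, so C5 is part of a CpG
        hits.append((start, pam, c5_index, is_cpg))
    return hits
```

For example, find_pams("TTTTCGA") reports one CpG-containing PAM with the critical cytosine at offset 4, while "TTTTCGT" yields nothing because the seventh base is not a purine.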
What the student should produce by the end of summer
- A reproducible pipeline that scans candidate loci for ThermoCas9 PAM compatibility
- A ranked list of roughly 10 to 30 candidate sites in one cancer type
- Figures showing tumor-normal methylation separation at the candidate PAM-site CpG
- A short rationale for the top 3 to 5 loci
- One pilot assay showing that a selected methylated substrate is cut less efficiently than the matched unmethylated substrate
That is enough for a final talk, a poster, or a methods-focused internal report.
Recommended project design
The cleanest version is a single-cancer, single-data-source, single-assay design.
Recommended cancer type
Choose a tumor type in TCGA with substantial methylation data and at least some normal comparators. The GDC provides harmonized methylation beta values for TCGA samples, processed with SeSAMe; beta values range from 0 to 1. (GDC Data Portal)
Breast cancer is attractive because the ThermoCas9 paper already used MCF-7 and MCF-10A and demonstrated selective editing at breast-cancer-relevant loci. That gives the student a biologically coherent starting point, even if the final top candidates are not ESR1 or GATA3. (Nature 2026)
Recommended public data source
Use TCGA methylation data from the NCI Genomic Data Commons. The GDC methylation workflow generates harmonized beta-value files from Illumina methylation arrays using SeSAMe. Each beta value represents methylation level at a probe-associated CpG site. (GDC docs)
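As a starting exercise, the student can parse one beta-value file by hand before reaching for a larger toolkit. This is a sketch under stated assumptions: it assumes each file is tab-delimited text with a probe ID in the first column and a beta value in the second, no header row, and missing values written as NA. The actual file layout should be checked against the GDC documentation. The names read_betas and mean_beta are hypothetical.

```python
# Strings treated as missing beta values (an assumption about the file encoding).
MISSING = {"", "NA", "NaN", "nan"}


def read_betas(lines):
    """Parse 'probe_id<TAB>beta' lines into a dict; missing values become None."""
    betas = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        probe_id, raw = fields[0], fields[1]
        if raw in MISSING:
            betas[probe_id] = None
            continue
        beta = float(raw)
        if not 0.0 <= beta <= 1.0:
            # Beta values are defined on [0, 1]; anything else signals a parsing problem.
            raise ValueError(f"beta out of range for {probe_id}: {beta}")
        betas[probe_id] = beta
    return betas


def mean_beta(samples, probe_id):
    """Mean beta for one probe across per-sample dicts, skipping missing values."""
    values = [s[probe_id] for s in samples if s.get(probe_id) is not None]
    return sum(values) / len(values) if values else None
```

In practice read_betas would be called on an open file handle, once per sample, and the resulting dicts grouped into tumor and normal lists.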
Recommended computational endpoint
Build a candidate ranking score rather than trying to claim definitive target discovery. That matters because TCGA methylation arrays assay known CpG sites associated with specific probes, not every cytosine in the genome. The HM450 platform covers more than 450,000 CpG sites, but it is still a predefined subset. (Illumina HM450)
Precise summer-project scope
Define the search space
The student chooses one cancer type and a small set of biologically plausible genes or regions. There are two sensible options.
Option A · gene-centered (recommended)
Start from 20 to 50 genes that are relevant to the cancer type, then scan nearby sequence for ThermoCas9-compatible PAMs whose critical cytosine overlaps a measured CpG or lies very close to one. This is easier for an undergraduate and keeps the result interpretable.
Option B · genome-first
Start from differentially methylated CpG probes and search nearby sequence for ThermoCas9-compatible PAMs. This is more discovery-oriented but harder to interpret and debug.
For an undergraduate, Option A is better.
Obtain methylation values
The student downloads TCGA tumor and normal methylation beta values from GDC. GDC provides harmonized methylation outputs and improved probe annotation resources, including hg38 coordinates and masking information for low-quality probes. (GDC Portal)
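At this stage, the student should learn three things: what a beta value represents, how probes map to hg38 coordinates, and why low-quality probes are masked.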
Identify candidate ThermoCas9 sites
Scan reference sequence around each gene or region for PAMs in the ThermoCas9 family. The Nature paper states ThermoCas9 has a broad PAM specificity, with a strict requirement for the fifth position to be a C-G pair, summarized as 5'-NNNNCNR-3', and gives example functional PAMs including 5'-NNNNCGA-3'. (Nature 2026)
For a summer project the search should be conservative. Prioritize:
- CpG-containing PAMs, especially NNNNCGA-like motifs, because CpG methylation is the dominant mammalian methylation context and the ThermoCas9 paper directly emphasizes this point
- PAMs where the key cytosine overlaps or lies immediately adjacent to a measured CpG probe
- loci with clear tumor-normal methylation separation
This makes the project closer to the actual translational use case.
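The conservative scan described above could look like the following sketch. It restricts the search to the CpG-containing subset 5'-NNNNCGR-3', checks both strands, and reports the forward-strand coordinate of the critical cytosine so it can be compared with probe positions. The function names, the region_start argument, and the probe_positions mapping are all hypothetical; how probe coordinates are defined should be confirmed against the annotation file actually used.

```python
import re

# CpG-containing subset of 5'-NNNNCNR-3': C at position 5, G at 6, purine at 7.
CPG_PAM = re.compile(r"(?=([ACGT]{4}CG[AG]))")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def cpg_pam_sites(seq, region_start):
    """Return (strand, pam, c5_pos) for CpG PAMs on both strands.

    region_start is the genomic coordinate of seq[0]; c5_pos is the
    forward-strand coordinate of the base pair holding the critical cytosine.
    """
    seq = seq.upper()
    n = len(seq)
    sites = []
    for m in CPG_PAM.finditer(seq):
        sites.append(("+", m.group(1), region_start + m.start() + 4))
    for m in CPG_PAM.finditer(revcomp(seq)):
        rc_index = m.start() + 4        # C5 offset in the reverse-complement string
        fwd_index = n - 1 - rc_index    # same base pair, forward-strand offset
        sites.append(("-", m.group(1), region_start + fwd_index))
    return sites


def nearest_probe(pos, probe_positions):
    """Return (probe_id, distance) for the probe closest to pos."""
    probe_id = min(probe_positions, key=lambda p: abs(probe_positions[p] - pos))
    return probe_id, abs(probe_positions[probe_id] - pos)
```

Note that for a minus-strand PAM the critical cytosine pairs with the G of the forward-strand CpG, so its coordinate is one base downstream of the forward-strand C. A distance of 0 or 1 to a probe can therefore both indicate the same CpG, depending on strand.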
Rank candidates
A simple scoring model is enough. For each site, score:
- methylation differential: normal beta minus tumor beta, because the desired state is an unmethylated PAM cytosine in tumor and a methylated one in normal tissue
- separation robustness: variance and overlap between the tumor and normal distributions
- probe confidence: whether the nearest methylation probe is high quality and close to the PAM cytosine
- biological interest: whether the locus sits in a promoter, enhancer, or disease-relevant gene neighborhood, using available annotation
- practical assayability: whether the student can synthesize a short substrate around this site for the pilot assay
The student is not trying to build a perfect predictor. The aim is a defensible shortlist.
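One way to combine the first three criteria into a single number is sketched below. The weights (0.5, 0.3, 0.2), the 50 bp proximity window, and the function names are arbitrary placeholders chosen for illustration; the student and mentor should pick and justify their own. Biological interest and assayability are left as manual annotations.

```python
from statistics import mean


def separation_auc(tumor, normal):
    """Fraction of tumor-normal pairs where the normal beta exceeds the tumor beta.

    Ties count as half. A value of 1.0 means the two groups do not overlap
    in the desired direction (tumor hypomethylated).
    """
    wins = sum((n > t) + 0.5 * (n == t) for t in tumor for n in normal)
    return wins / (len(tumor) * len(normal))


def score_site(tumor, normal, probe_distance, max_distance=50):
    """Weighted score from differential, separation, and probe proximity.

    All weights and max_distance are illustrative assumptions.
    """
    delta = mean(normal) - mean(tumor)   # positive when tumor is hypomethylated
    auc = separation_auc(tumor, normal)
    proximity = max(0.0, 1.0 - probe_distance / max_distance)
    return 0.5 * delta + 0.3 * auc + 0.2 * proximity


def rank_sites(sites):
    """Sort (site_id, tumor, normal, probe_distance) tuples by descending score."""
    scored = [(sid, score_site(t, n, d)) for sid, t, n, d in sites]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A site with tumor betas near 0.1 to 0.2, normal betas near 0.8 to 0.9, and a probe directly on the PAM cytosine scores close to the maximum, while a site with the opposite methylation pattern drops to the bottom of the list.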
Pilot validation assay
The best single validation assay is a cell-free cleavage assay with matched methylated and unmethylated DNA substrates.
Why this assay:
- directly tests the paper's mechanism
- much simpler than cell editing
- avoids conflating methylation effects with chromatin, delivery, or cell-state issues
- fits undergraduate time
The Nature paper itself used in vitro DNA cutting assays and synthetic oligo duplexes to distinguish methylated and unmethylated PAM contexts. (Nature 2026)
Minimal assay design
Take one top-ranked candidate sequence and synthesize two duplexes:
- one unmethylated
- one with 5mC at the critical PAM cytosine
Then assemble ThermoCas9 RNP with the matching guide and compare cleavage.
A strong pilot result would be:
- clear cleavage of the unmethylated substrate
- reduced cleavage of the methylated substrate
That reproduces the core mechanistic behavior in a project-specific sequence context.
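If the readout is a gel, the comparison can be quantified from band intensities. The sketch below assumes the student has densitometry values for the uncut band and the cleavage-product bands in each lane; the function names and the "protection" summary are illustrative and not taken from the paper.

```python
def fraction_cleaved(uncut, products):
    """Fraction of substrate cleaved, from band intensities in one lane.

    uncut is the intensity of the intact substrate band; products is a list
    of intensities for the cleavage-product bands.
    """
    total = uncut + sum(products)
    if total <= 0:
        raise ValueError("band intensities must sum to a positive value")
    return sum(products) / total


def methylation_protection(unmeth_fraction, meth_fraction):
    """Relative reduction in cleavage caused by PAM methylation.

    Returns 1.0 when the methylated substrate is fully protected and 0.0
    when methylation makes no difference.
    """
    if unmeth_fraction <= 0:
        raise ValueError("unmethylated substrate shows no cleavage; assay failed")
    return 1.0 - meth_fraction / unmeth_fraction
```

Replicate lanes should be quantified separately and summarized with a mean and spread rather than pooled into one number.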
Exact undergraduate deliverables (week 10)
Computational
- One notebook or script pipeline
- One candidate table with ranking score
- Heatmap or boxplot panel of methylation at top loci
- Schematic figure showing the selected PAM and methylated cytosine
Experimental
- One gel image or cleavage quantification panel
- Methylated vs unmethylated substrate comparison
Written
- 4–6 page report with methods, results, limitations, next steps
- Honest discussion of array-coverage limits
A realistic 10-week timeline
| Weeks | Focus | Output |
|---|---|---|
| 1–2 | Background and setup | Read ThermoCas9 paper sections on PAM specificity and methylation sensitivity. Learn GDC methylation data structure. Get a small example dataset running. (Nature 2026) |
| 3–4 | First-pass candidate pipeline | Selected tumor type · selected gene list · first scan of ThermoCas9-compatible PAMs · merged methylation annotation |
| 5–6 | Rank sites and stress-test the shortlist | Top 20–30 candidates · plots showing tumor-normal methylation separation · one chosen site for pilot assay |
| 7–8 | Pilot assay execution | Synthesized substrate design · guide design · initial cleavage readout |
| 9–10 | Cleanup and final presentation | Final ranked table · one clear validation figure · honest discussion of why array-based screening is only a prioritization step |
What the mentor needs to provide
For this to work, the mentor or lab must provide four things:
- A pre-existing way to obtain or access ThermoCas9 protein or construct
- Minimal support for guide preparation and RNP assembly
- Access to basic cleavage readout, usually gel electrophoresis
- A defined choice of cancer type or gene set to prevent the project from sprawling
Without those, the student should do the computational project only.
What the project should not attempt
- full genome-wide discovery across many cancers
- cell-based editing as the first validation
- off-target mapping
- animal studies
- claims of therapeutic readiness
These are all too large for a summer project and would blur the main result.
The most important technical limitation
Methylation arrays assay a predefined set of probe-associated CpG sites, not every cytosine in the genome. So the pipeline cannot prove the exact ThermoCas9 PAM cytosine is differentially methylated in patient samples unless that precise CpG is directly measured or confirmed with follow-up assays. That is why the project's experimental step should validate the methylation-sensitive cleavage behavior on an exact synthetic substrate.
A strong student will say this clearly in the discussion section.
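To keep this limitation visible in the candidate table, each site can carry an explicit evidence label. The following sketch assumes probe coordinates refer to the forward-strand C of each probed CpG and that one probe is taken to cover both cytosines of the CpG dyad; both are assumptions to verify against the annotation. The labels, the 50 bp proxy window, and the function name are hypothetical.

```python
def evidence_class(c5_pos, strand, probe_cpg_starts, proxy_window=50):
    """Label how directly the array measures the PAM cytosine.

    c5_pos: forward-strand coordinate of the base pair holding the PAM cytosine.
    strand: "+" or "-" for the strand carrying the PAM.
    probe_cpg_starts: set of forward-strand C coordinates of probed CpGs.
    """
    # On the minus strand the PAM cytosine pairs with the G of the CpG,
    # which sits one base downstream of the forward-strand C.
    cpg_start = c5_pos if strand == "+" else c5_pos - 1
    if cpg_start in probe_cpg_starts:
        return "direct"       # the exact CpG is on the array
    if any(abs(p - cpg_start) <= proxy_window for p in probe_cpg_starts):
        return "proxy"        # only a nearby CpG is measured
    return "unmeasured"       # no array evidence near this site
```

Only "direct" sites support a statement about the PAM cytosine itself; "proxy" sites rest on the assumption that neighboring CpGs share methylation state.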
The hypothesis statement (for the student report)
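The hypothesis should be specific, realistic, and defensible: a one-sentence restatement of the core question for the chosen cancer type and gene set.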
Suggested analysis workflow
- Choose cancer type
- Collect tumor and normal methylation beta values from GDC (GDC Portal)
- Filter probes using annotation and mask information (GDC mask)
- Map probe coordinates to candidate genes or regions
- Scan sequence for ThermoCas9-compatible PAMs based on the fifth-position C requirement and CpG-containing motifs emphasized in the Nature paper (Nature 2026)
- Score loci by tumor-normal methylation separation and assay feasibility
- Select one site
- Test methylated versus unmethylated substrate cleavage in vitro
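The probe-filtering and mapping steps in this workflow can be sketched as a single pass over the annotation. This is a sketch only: the record keys (probe_id, chrom, pos, masked), the regions mapping, and the 1,500 bp flank are hypothetical names and values, and the real GDC annotation columns should be substituted once the student has inspected the file.

```python
def usable_probes(annotation, regions, flank=1500):
    """Group unmasked probes by the candidate region they fall in.

    annotation: iterable of dicts with keys probe_id, chrom, pos, masked.
    regions: dict mapping a region name to (chrom, start, end).
    flank: bases added on each side of a region (illustrative default).
    """
    kept = {}
    for probe in annotation:
        if probe["masked"]:
            continue  # drop probes flagged as low quality
        for name, (chrom, start, end) in regions.items():
            in_window = start - flank <= probe["pos"] <= end + flank
            if probe["chrom"] == chrom and in_window:
                kept.setdefault(name, []).append(probe["probe_id"])
    return kept
```

The output feeds directly into the PAM scan: only regions that retain at least one usable probe are worth scanning, since sites without nearby methylation evidence cannot be scored.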
What a final figure set could look like
ThermoCas9 methylation-sensitivity mechanism schematic
The PAM cytosine (C5*) shown as the critical variable. Highlight Asp1017 contact and the steric block from a 5-methyl group. Cite the paper's mechanistic finding. (Nature 2026)
Pipeline diagram
Flow from GDC methylation data → probe filtering → PAM scan → tumor-normal scoring → candidate shortlist. Cite GDC methylation data structure and annotation workflow. (GDC docs)
Boxplots / violin plots for top 5 candidate loci
Tumor versus normal beta values per locus, with sample sizes. Highlight the chosen site. (GDC beta values)
Pilot cleavage assay
Methylated vs unmethylated substrate cleavage for the top locus. Mechanistically aligned with the ThermoCas9 paper's in vitro design. (Nature 2026)
What would count as success
Success is not finding the perfect clinical target. Success is:
- a reproducible ranking pipeline
- one logically selected site
- one pilot assay showing the right directional behavior
- a clear explanation of limitations
That is already a very good undergraduate summer result.
Recommended most-practical version
- Cancer type: Breast cancer
- Search space: 20 to 30 breast-cancer-relevant genes or regulatory regions
- Data: TCGA / GDC methylation beta values
- Computational endpoint: Ranked list of CpG-containing ThermoCas9 PAM candidates
- Experimental endpoint: One methylated versus unmethylated cleavage assay on the top candidate
This version stays aligned with the only strong current translational proof of concept and avoids unnecessary complexity.
Sources
- Roth M.O., Shu Y., Zhao Y., Trasanidou D., Hoffman R.D., et al. Molecular basis for methylation-sensitive editing by Cas9. Nature (2026). DOI 10.1038/s41586-026-10384-z. Open access (CC BY-NC-ND 4.0).
- NCI Genomic Data Commons (GDC) Data Portal — TCGA methylation beta values.
- GDC Methylation Analysis Pipeline documentation — beta values, SeSAMe processing, probe annotation.
- Illumina HumanMethylation450 (HM450) BeadChip datasheet — over 450,000 probe-associated CpG sites.
- GDC: Improved DNA Methylation Array Probe Annotation — hg38 coordinates and probe-mask resources.