Glioma Subtype Classification & Survival Modeling from DNA Methylation
End-to-end ML pipeline classifying brain-tumor subtypes and predicting survival risk from CpG methylation profiles
Problem
Diffuse gliomas are highly heterogeneous and notoriously difficult to classify and prognosticate. Traditional histopathology suffers from documented interobserver variability, especially for ambiguous tumors. The current reference-standard methylation classifier (DKFZ/Heidelberg) is a black box built on proprietary infrastructure, which limits clinical interpretability and adoption. Existing prognostic signatures often rely on single cohorts and narrow probe sets, raising overfitting concerns. The challenge was to build a single pipeline that is transparent, reproducible, externally validated, and biologically interpretable, and that covers both diagnosis and survival.
Approach
Two parallel branches run over harmonized 450k and EPIC methylation data with strict train/test isolation.
Classification: a multi-class XGBoost model was trained on a 462-sample 450k cohort using variance thresholding, Bayesian hyperparameter optimization, and two-stage feature selection (an importance filter followed by CV-guided top-k selection), which reduced the input from ~404,000 probes to a parsimonious 70-feature model. It was validated externally on independent 450k and EPIC cohorts, plus two large public 450k GBM cohorts (GSE36278, GSE200647) never seen during training.
Survival: all 711 samples (256 events) were pooled, and raw CpGs were aggregated to gene-region features (promoter and gene body, by UCSC RefGene mapping) for interpretability. Two complementary penalized Cox approaches were fit: a pooled stratified Elastic-Net Cox model that shares coefficients across cancers but has per-histology baseline hazards, and per-cancer Elastic-Net Cox and CoxBoost models. Evaluation used 5×5 nested cross-validation with 5 repeats, with Sure Independence Screening (SIS) re-fit inside every training fold to prevent feature-selection leakage.
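The gene-region aggregation step can be illustrated with a minimal sketch. It is not the project's code: it assumes the Illumina manifest convention of semicolon-separated UCSC_RefGene_Name and UCSC_RefGene_Group fields, treats TSS1500 and TSS200 as "promoter" and Body as "body" (one common choice, assumed here), and summarizes each region by the mean beta value (also an assumption). Function names, probe IDs, gene names, and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Assumed mapping from manifest region labels to two coarse regions.
REGION_OF_GROUP = {"TSS1500": "promoter", "TSS200": "promoter", "Body": "body"}


def parse_annotation(refgene_name, refgene_group):
    """Turn semicolon-separated manifest fields into a set of (gene, region) pairs."""
    pairs = set()
    if not refgene_name or not refgene_group:
        return pairs  # probe has no gene annotation
    for gene, group in zip(refgene_name.split(";"), refgene_group.split(";")):
        region = REGION_OF_GROUP.get(group)
        if region is not None:
            pairs.add((gene, region))  # the set removes repeated pairs
    return pairs


def aggregate_to_gene_regions(betas, annotation):
    """Average probe-level beta values into one feature per (gene, region).

    betas:      {probe_id: [beta per sample]}
    annotation: {probe_id: (refgene_name, refgene_group)}
    returns:    {(gene, region): [mean beta per sample]}
    """
    members = defaultdict(list)
    for probe, (name, group) in annotation.items():
        if probe in betas:
            for key in parse_annotation(name, group):
                members[key].append(betas[probe])
    # zip(*rows) walks sample by sample across all probes of the region.
    return {key: [mean(col) for col in zip(*rows)] for key, rows in members.items()}


# Toy input for two samples; all identifiers and values are made up.
annotation = {
    "cg0001": ("GENEA;GENEA", "TSS200;TSS1500"),
    "cg0002": ("GENEA", "TSS1500"),
    "cg0003": ("GENEA;GENEB", "Body;TSS200"),
    "cg0004": ("", ""),
}
betas = {
    "cg0001": [0.10, 0.80],
    "cg0002": [0.30, 0.60],
    "cg0003": [0.50, 0.40],
    "cg0004": [0.90, 0.90],
}
features = aggregate_to_gene_regions(betas, annotation)
for key in sorted(features):
    print(key, [round(v, 3) for v in features[key]])
```

A probe annotated to several genes contributes to each of them, and unannotated probes drop out, so the feature count falls from one per probe to at most two per gene.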
Outcome
Classifier: 97.4–98.3% accuracy on the held-out 450k partition; 100% on an external 450k cohort and 93.3% on an external EPIC cohort for Astrocytoma/Oligodendroglioma; 97.4% and 99.1% on two unseen public 450k Glioblastoma cohorts. Performance held up across both array technologies.
Survival: the pooled stratified Elastic-Net Cox model achieved C-index = 0.796 ± 0.034 and time-dependent AUC = 0.844 ± 0.036 in cross-validation, the strongest discrimination of any model evaluated on this cohort and ahead of all per-cancer models. Per-cancer Kaplan–Meier analyses showed marked risk separation, with hazard ratios of 6.32 for Astrocytoma, 19.63 for Oligodendroglioma, and 3.76 for Glioblastoma. The pipeline recovered biologically plausible candidate prognostic loci including HMGA2, RHOBTB3, POMT1, HSPG2, and HOXC9, a finding consistent across two distinct penalization schemes (Elastic-Net Cox and CoxBoost).
Manuscript co-authored with Dr. Yasin Mamatjan.
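For readers unfamiliar with the C-index reported above, here is a minimal sketch of Harrell's concordance index on right-censored data. It is an illustration only, not the evaluation code behind the reported numbers. It assumes the simplest convention: pairs are compared only when the earlier time is an observed event, tied survival times are skipped, and tied risk scores count as half. The toy values are made up.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs in which the sample that
    fails first also has the higher predicted risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored sample cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:  # j was still at risk when i failed
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: follow-up times, event indicators (1 = death observed), risk scores.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risks = [2.0, 0.5, 1.0, 0.1]
c_index = concordance_index(times, events, risks)
print(round(c_index, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the cross-validated 0.796 should be read.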
Lessons
Strict leakage control is the difference between optimistic-looking results and honest ones. SIS feature screening, scaling, and hyperparameter tuning must all be re-fit inside every training fold; otherwise the held-out folds are not really held out. The largest practical win came from gene-region aggregation: collapsing hundreds of thousands of CpGs to per-gene promoter and body summaries improved interpretability without sacrificing performance, and produced feature lists that biologists could actually engage with. Pooled-stratified modelling captured a single dominant cross-cancer risk axis with very few features, while per-cancer models surfaced richer histology-specific biology. Different questions call for different tools.
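The leakage point can be demonstrated on pure noise. The sketch below is not the project's pipeline: a simple class-mean-difference screen stands in for SIS, a nearest-centroid classifier stands in for the real models, and all sizes and seeds are arbitrary assumptions. Because the features carry no signal, any cross-validated accuracy well above chance comes from screening on labels that later reappear in the held-out folds.

```python
import numpy as np


def screen_top_k(X, y, k):
    """Rank features by absolute difference in class means; return top-k column indices."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(diff)[-k:]


def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test row to the class whose training centroid is closer."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)


def cv_accuracy(X, y, k, n_folds, screen_inside_fold, rng):
    """K-fold accuracy with screening done per training fold or once on all data."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    global_cols = screen_top_k(X, y, k)  # sees every label, including held-out ones
    correct = 0
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        if screen_inside_fold:
            cols = screen_top_k(X[train_idx], y[train_idx], k)  # training fold only
        else:
            cols = global_cols
        pred = nearest_centroid_predict(
            X[train_idx][:, cols], y[train_idx], X[test_idx][:, cols]
        )
        correct += int((pred == y[test_idx]).sum())
    return correct / len(y)


rng = np.random.default_rng(0)
n_samples, n_features, k = 60, 5000, 20
X = rng.standard_normal((n_samples, n_features))  # pure noise, no real signal
y = np.tile([0, 1], n_samples // 2)  # balanced labels unrelated to X

# Same fold split for both runs, so only the screening placement differs.
leaky_acc = cv_accuracy(X, y, k, 5, screen_inside_fold=False, rng=np.random.default_rng(1))
honest_acc = cv_accuracy(X, y, k, 5, screen_inside_fold=True, rng=np.random.default_rng(1))
print(f"screen on all data: {leaky_acc:.2f}   screen inside folds: {honest_acc:.2f}")
```

The same reasoning applies to scaling and hyperparameter tuning: any step that reads the data has to be fit on the training portion of each fold only.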