Featured First author · ML lead · 2024 — present · Model

Glioma Subtype Classification & Survival Modeling from DNA Methylation

End-to-end ML pipeline classifying brain-tumor subtypes and predicting survival risk from CpG methylation profiles

97–99% external validation accuracy across 450k and EPIC cohorts
0.796 cross-validated C-index for pooled survival model
70 CpG features in the final classifier (down from ~404K probes)

Problem

Diffuse gliomas are highly heterogeneous and notoriously difficult to classify and prognosticate. Traditional histopathology suffers from documented interobserver variability, especially for ambiguous tumors. The current reference-standard methylation classifier (DKFZ/Heidelberg) is a black box built on proprietary infrastructure, which limits clinical interpretability and adoption. Existing prognostic signatures often rely on single cohorts and narrow probe sets, raising overfitting concerns. The challenge: build something transparent, reproducible, externally validated, and biologically interpretable — covering both diagnosis and survival in one pipeline.

Approach

Two parallel branches run over harmonized 450k + EPIC methylation data with strict train/test isolation.

Classification: a multi-class XGBoost model trained on a 462-sample 450k cohort, using variance thresholding, Bayesian hyperparameter optimization, and two-stage feature selection (an importance filter followed by CV-guided top-k selection) that reduced the input from ~404,000 probes to a parsimonious 70-feature model. It was validated externally on independent 450k and EPIC cohorts, plus two large public 450k GBM cohorts (GSE36278, GSE200647) never seen during training.

Survival: all 711 samples (256 events) were pooled, and raw CpGs were aggregated to gene-region features (promoter and gene body, by UCSC RefGene mapping) for interpretability. Two complementary penalized Cox approaches were fit: a pooled stratified Elastic-Net Cox model that shares coefficients across cancers but keeps per-histology baseline hazards, and per-cancer Elastic-Net Cox and CoxBoost models. All models were evaluated under 5×5 nested cross-validation with 5 repeats, with Sure Independence Screening (SIS) re-fit inside every training fold to prevent feature-selection leakage.
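The gene-region aggregation step can be sketched roughly as follows. This is a minimal illustration only: it assumes each feature is the plain mean of the beta values mapped to a (gene, region) pair, which the write-up does not specify. The probe IDs, gene names, annotation table, and the `aggregate_to_gene_regions` helper are all hypothetical stand-ins for the real UCSC RefGene mapping.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical probe annotation: probe ID -> (gene, region).
# In practice this would be derived from the array's UCSC RefGene columns.
PROBE_ANNOTATION = {
    "cg0001": ("GENE_A", "promoter"),
    "cg0002": ("GENE_A", "promoter"),
    "cg0003": ("GENE_A", "body"),
    "cg0004": ("GENE_B", "body"),
}


def aggregate_to_gene_regions(sample_betas, annotation):
    """Collapse one sample's CpG beta values into gene-region features.

    Assumption: a feature is the mean beta of all probes mapped to the
    same (gene, region) pair; probes without an annotation are dropped.
    """
    grouped = defaultdict(list)
    for probe, beta in sample_betas.items():
        if probe in annotation:
            gene, region = annotation[probe]
            grouped[f"{gene}_{region}"].append(beta)
    return {feature: mean(values) for feature, values in grouped.items()}


# Toy sample; cg9999 has no annotation and is ignored.
sample = {"cg0001": 0.25, "cg0002": 0.75, "cg0003": 0.875,
          "cg0004": 0.125, "cg9999": 0.5}
features = aggregate_to_gene_regions(sample, PROBE_ANNOTATION)
print(features)
```

Five probes collapse to three named features here, which is the property that makes the downstream coefficient lists readable at the gene level.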

Outcome

Classifier: 97.4–98.3% accuracy on the held-out 450k partition; 100% on an external 450k cohort and 93.3% on an external EPIC cohort for Astrocytoma/Oligodendroglioma; and 97.4% and 99.1% on two unseen public 450k Glioblastoma cohorts. The classifier was robust across both array technologies.

Survival: the pooled stratified Elastic-Net Cox model achieved a C-index of 0.796 ± 0.034 and a time-dependent AUC of 0.844 ± 0.036 in cross-validation, the strongest discrimination observed in this cohort and better than every per-cancer model. Per-cancer Kaplan–Meier analysis showed clear risk separation, with hazard ratios of 6.32 for Astrocytoma, 19.63 for Oligodendroglioma, and 3.76 for Glioblastoma. The pipeline recovered biologically plausible candidate prognostic loci including HMGA2, RHOBTB3, POMT1, HSPG2, and HOXC9, a finding consistent across two distinct penalization schemes (Elastic-Net Cox and CoxBoost).

Manuscript co-authored with Dr. Yasin Mamatjan.
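For readers less familiar with the survival metric, the C-index can be computed along these lines. This is a minimal pure-Python sketch of Harrell's concordance index on made-up toy data. `concordance_index` is an illustrative helper, not the pipeline's evaluation code, and real analyses would normally use an established survival library.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when sample i has the shorter follow-up
    time and its event was observed. The pair is concordant when sample i
    also has the higher predicted risk; tied risk scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # i must have an observed event strictly before j's time
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: follow-up time, event indicator (1 = death, 0 = censored),
# and a model's predicted risk score for four patients.
times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
risk_scores = [0.9, 0.4, 0.1, 0.6]

c_index = concordance_index(times, events, risk_scores)
print(c_index)  # 0.75: three of the four comparable pairs are ordered correctly
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which puts the reported 0.796 in context.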

Lessons

Strict leakage control is the difference between optimistic-looking results and honest ones. SIS feature screening, scaling, and hyperparameter tuning must all be re-fit inside every training fold; otherwise the held-out folds aren’t really held out. The largest practical win came from gene-region aggregation: collapsing hundreds of thousands of CpGs to per-gene promoter and body summaries improved interpretability without sacrificing performance, and produced feature lists that biologists could actually engage with. Pooled stratified modelling captured a single dominant cross-cancer risk axis with very few features, while per-cancer models surfaced richer histology-specific biology. The two answer different questions, so each is the right tool for its own.
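The fold-internal re-fitting described here might look like the sketch below. It is a simplified stand-in rather than the project's code: screening uses absolute marginal Pearson correlation with a continuous toy outcome instead of a marginal Cox criterion, the data are synthetic, and `screen_top_k` and `fold_safe_transform` are hypothetical helper names.

```python
import numpy as np


def screen_top_k(X_train, y_train, k):
    """SIS-style screening: keep the k features with the largest
    absolute marginal correlation with the outcome (training rows only)."""
    Xc = X_train - X_train.mean(axis=0)
    yc = y_train - y_train.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant features get correlation 0 instead of a division by zero
    corr = np.abs(Xc.T @ yc) / np.where(denom == 0, np.inf, denom)
    return np.argsort(corr)[::-1][:k]


def fold_safe_transform(X, y, train_idx, test_idx, k):
    """Fit screening and scaling on the training fold, then apply the
    same selection and scaling parameters to the held-out fold."""
    keep = screen_top_k(X[train_idx], y[train_idx], k)
    mu = X[train_idx][:, keep].mean(axis=0)
    sd = X[train_idx][:, keep].std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant features
    X_train = (X[train_idx][:, keep] - mu) / sd
    X_test = (X[test_idx][:, keep] - mu) / sd
    return X_train, X_test, keep


# Synthetic data: 60 samples, 500 features, only feature 3 is informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))
y = 2.0 * X[:, 3] + rng.normal(scale=0.1, size=60)

folds = np.array_split(rng.permutation(60), 5)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    X_tr, X_te, keep = fold_safe_transform(X, y, train_idx, test_idx, k=10)
    print(f"fold {i}: kept {len(keep)} features, feature 3 selected: {3 in keep}")
```

The key design choice is that `fold_safe_transform` never sees the held-out rows when choosing features or computing scaling statistics. Running the screening once on the full matrix before splitting would be the leaky version.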

Stack

Python · R · XGBoost · glmnet (Elastic-Net Cox) · CoxBoost · scikit-learn · pandas · NumPy · minfi · Bayesian optimization