Predicting PHQ-9 Depression Scores from Multi-Clinic EHR Data
Benchmarking and predicting patient-reported depression severity at scale
Problem
Mental-health clinicians administer the PHQ-9 questionnaire to track patient depression severity over time, but the scores arrive irregularly and don't always map cleanly to other clinical signals already captured in the EHR. The company wanted to know two things: how well existing baseline approaches can actually predict PHQ-9 from the rest of the clinical record, and whether there is room for a temporal model to do better than per-visit snapshots.
Approach
Pulled multi-clinic patient data from the company’s Snowflake warehouse across roughly fourteen related tables — demographics, visit history, diagnostic codes, prior questionnaire responses, and prescription patterns. Built the feature pipeline in Python with pandas, with feature engineering specifically designed to preserve temporal ordering. Trained a Random Forest as the benchmark model, then a Temporal Neural Network that explicitly modeled the sequence of visits per patient. Evaluated both with held-out patients (not held-out visits) to avoid leakage from same-patient correlation.
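The temporal feature engineering can be sketched roughly as below. Column names (`patient_id`, `visit_date`, `phq9_score`) and the specific derived features are illustrative assumptions, not the project's actual schema; the key idea is that every feature is shifted so a visit's own score never leaks into its own inputs.

```python
import pandas as pd

def build_temporal_features(visits: pd.DataFrame) -> pd.DataFrame:
    """Add per-patient temporal features while preserving visit order.

    Assumes (hypothetically) columns: patient_id, visit_date, phq9_score.
    """
    df = visits.sort_values(["patient_id", "visit_date"]).copy()
    grp = df.groupby("patient_id")
    # Last observed PHQ-9, shifted one visit back so the current
    # visit's target never appears in its own feature row.
    df["prev_phq9"] = grp["phq9_score"].shift(1)
    # Rolling mean of up to three *prior* scores (current visit excluded).
    df["phq9_roll3"] = grp["phq9_score"].transform(
        lambda s: s.shift(1).rolling(3, min_periods=1).mean()
    )
    # Gap since the previous visit captures the irregular sampling.
    df["days_since_prev"] = grp["visit_date"].diff().dt.days
    # Position of the visit within the patient's history.
    df["visit_number"] = grp.cumcount()
    return df
```

The shift-before-aggregate pattern is what "feature engineering designed to preserve temporal ordering" usually cashes out to in pandas.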
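The Random Forest benchmark step might look like the following minimal sketch. PHQ-9 is a 0-27 score, so it is treated here as regression with MAE as the metric; the hyperparameters and metric choice are assumptions for illustration, not the project's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def benchmark_random_forest(X_train, y_train, X_test, y_test, seed: int = 0):
    """Fit the Random Forest baseline and score it on held-out patients.

    Returns (test MAE, fitted model); hyperparameters are illustrative.
    """
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    return mae, model
```

A benchmark like this also yields `feature_importances_`, which is useful when the downstream question is which clinical signals are worth surfacing.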
Outcome
The Temporal Neural Network improved PHQ-9 prediction accuracy by approximately 30% over the Random Forest benchmark on the held-out patient set. Delivered an executive summary, a detailed technical report, and a presentation to my supervising professors and the company’s CEO; the findings fed into subsequent product-design decisions about which clinical signals are worth surfacing in the platform’s clinician-facing views.
Lessons
Validation strategy matters more than model architecture in clinical ML: the early Random Forest looked deceptively strong until the split switched from held-out visits to held-out patients. The biggest practical win was investing in the data pipeline before the model: most of the accuracy gain came from feature engineering on the warehouse side, not from the temporal architecture itself.
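The patient-level split behind the first lesson can be sketched with no ML dependencies; this is a minimal stand-in for something like scikit-learn's `GroupShuffleSplit` with the patient identifier as the group key, and `patient_id` is an assumed column name.

```python
import numpy as np
import pandas as pd

def split_by_patient(df: pd.DataFrame, test_frac: float = 0.2, seed: int = 0):
    """Split visit-level rows into train/test by patient, not by visit.

    A visit-level split scatters one patient's correlated visits across
    both sets and inflates validation scores; holding out whole patients
    avoids that leakage.
    """
    rng = np.random.default_rng(seed)
    patients = df["patient_id"].unique()
    rng.shuffle(patients)
    n_test = max(1, int(round(test_frac * len(patients))))
    test_ids = set(patients[:n_test])
    test_mask = df["patient_id"].isin(test_ids)
    return df[~test_mask], df[test_mask]
```

The invariant worth asserting in any pipeline like this is that the train and test patient-id sets are disjoint.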