CODA JHU Team
Yiqun T. Chen, Abhirup (Abhi) Datta, Li Liu
Last updated April 6, 2026

Cause-of-death modeling progress update

Headline: the strongest case-level model on CHAMPS is an ensemble of XGBoost on coded VA + clinical features and XGBoost on narrative embeddings, reaching 66.7% top-1 and 82.6% top-3. Across every comparable head-to-head, gradient boosting beats both logistic regression and the zero-shot Gemini LLM. The remaining ceiling is driven by rare causes and by the fact that fewer than half the cases carry usable free-text narrative.

  • Evaluation setting: Random 5-fold CV. Used throughout the headline results to keep comparisons fair.
  • Comparison cohort: 9,149 matched cases. Direct head-to-head across structured, LLM, combined, and embedding approaches.
  • Full cohort: 9,153 cases. Current processed dataset across 9 site codes.
  • Best top-1 (case-level): 66.7%. Ensemble of XGBoost (structured) + XGBoost (narrative embedding).
  • Narrative text available: 44.6%. Less than half of cases carry usable free-text narrative.
  • Best CSMF (population-level): 92.8%. Age-stratified logistic regression (6 age strata).

Key Takeaways

Four short points with the main numbers: best model family, best ensemble result, main challenge area, and the role of narrative text.

Takeaway 1 · Model family

XGBoost is the strongest model family in the current runs.

On coded features it reaches 66.2% top-1, versus 58.9% for the matched logistic-regression baseline. On Gemma narrative embeddings it reaches 57.8%, and it also stays ahead of the zero-shot Gemini baseline at 55.6%.

Takeaway 2 · Ensembling

The structured-plus-embedding ensemble gives the best case-level result.

The best current run reaches 66.7% top-1 and 82.6% top-3, compared with 66.2% top-1 for the standalone structured XGBoost. The gain is modest (0.5 percentage points of top-1), but it is still the strongest case-level result in the current pipeline.
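One way to picture the ensemble is as a blend of two per-case probability vectors. The sketch below is a minimal illustration, not the pipeline's actual stacking code: it uses a plain weighted average, and the function names, the 0.5 weight, the made-up probabilities, and the fallback to the structured model when a case has no narrative are all assumptions for the example.

```python
from typing import Dict, List, Optional

Probs = Dict[str, float]  # cause label -> predicted probability


def blend(structured: Probs, narrative: Optional[Probs], weight: float = 0.5) -> Probs:
    """Weighted average of two probability dicts (hypothetical helper).

    If the case has no usable narrative, return the structured
    probabilities unchanged.
    """
    if narrative is None:
        return dict(structured)
    causes = set(structured) | set(narrative)
    mixed = {
        c: (1 - weight) * structured.get(c, 0.0) + weight * narrative.get(c, 0.0)
        for c in causes
    }
    total = sum(mixed.values())
    if total == 0:
        return dict(structured)
    return {c: p / total for c, p in mixed.items()}


def top_k(probs: Probs, k: int) -> List[str]:
    """Return the k most probable causes, most probable first."""
    return sorted(probs, key=probs.get, reverse=True)[:k]


# Made-up probabilities for one case, for illustration only.
structured = {"perinatal asphyxia/hypoxia": 0.45,
              "neonatal preterm birth complications": 0.40,
              "neonatal sepsis": 0.15}
narrative = {"perinatal asphyxia/hypoxia": 0.20,
             "neonatal preterm birth complications": 0.70,
             "neonatal sepsis": 0.10}

print(top_k(blend(structured, narrative), 1))  # narrative shifts the top-1 call
print(top_k(blend(structured, None), 1))       # no narrative: structured only
```

Because only 44.6% of cases carry usable narrative, any blend of this kind needs an explicit rule for the remaining cases; the fallback shown here is one simple choice.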

Takeaway 3 · Rare causes

Rare causes remain the main bottleneck.

The top two causes still account for 54.8% of the cohort. Common perinatal causes drive much of the accuracy, while rarer infectious and mixed categories remain substantially harder.

Takeaway 4 · Limits of text

Narrative helps, but it does not replace coded VA.

Text-only embedding models remain below the structured XGBoost baseline: 57.8% for Gemma and 55.6% for Qwen. Coverage is the second limit: only 44.6% of cases have usable narrative.

Results At A Glance

One row per model family. All numbers are 5-fold random CV on the same processed CHAMPS cohort, so the comparisons are apples-to-apples. The detailed view exposes the additional supplementary runs.
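The fairness of the comparison depends on every model seeing the same folds. The following sketch shows one way to generate fixed random 5-fold splits; it assumes plain (unstratified) random assignment with a fixed seed, and the function name and seed are hypothetical rather than taken from the project code.

```python
import random
from typing import Iterator, List, Tuple


def random_k_fold(n_cases: int, k: int = 5, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k random folds.

    The same seed always gives the same folds, so different models
    can be scored on identical held-out cases.
    """
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # near-equal fold sizes
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


splits = list(random_k_fold(9153))  # full-cohort size from this page
print([len(test) for _, test in splits])
```

Each case lands in exactly one held-out fold, so out-of-fold predictions cover the whole cohort once.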

Model glossary (used consistently throughout this page).
  • Structured · XGBoost — gradient-boosted trees on coded VA + clinical / lab features.
  • Structured · Logistic regression — regularized linear baseline on the same coded features.
  • Narrative embedding · Gemma / Qwen — classifier (XGBoost or logistic regression) on top of large-LM embeddings of the free-text VA narrative.
  • Zero-shot LLM (Gemini) — Gemini reading the serialized patient record and emitting a cause label, no training.
  • Age-stratified — same structured features, separate model per age stratum (3 or 6 strata).
  • Ensemble (Structured + Narrative) — the headline model: stacks structured XGBoost with narrative-embedding XGBoost.
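To make the age-stratified entry in the glossary concrete, the sketch below routes each case to a model fitted only on its own age stratum. It is a structural illustration under stated assumptions: the stratum names, the toy rows, and the `majority_fit` stand-in learner are hypothetical, and a real run would plug in XGBoost or logistic regression as the per-stratum learner.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, dict, str]    # (age stratum, features, cause label)
Model = Callable[[dict], str]  # maps features to a predicted cause


def majority_fit(data: List[Tuple[dict, str]]) -> Model:
    """Stand-in learner: always predicts the stratum's most common cause."""
    most_common = Counter(cause for _, cause in data).most_common(1)[0][0]
    return lambda features: most_common


def fit_stratified(rows: List[Row], fit: Callable = majority_fit) -> Dict[str, Model]:
    """Fit one model per age stratum."""
    by_stratum = defaultdict(list)
    for stratum, features, cause in rows:
        by_stratum[stratum].append((features, cause))
    return {stratum: fit(data) for stratum, data in by_stratum.items()}


def predict_stratified(models: Dict[str, Model], stratum: str, features: dict) -> str:
    """Route a case to the model for its stratum."""
    return models[stratum](features)


# Toy rows for illustration only; features are left empty.
rows = [
    ("stillbirth", {}, "perinatal asphyxia/hypoxia"),
    ("stillbirth", {}, "perinatal asphyxia/hypoxia"),
    ("stillbirth", {}, "congenital birth defects"),
    ("child", {}, "malaria"),
    ("child", {}, "malaria"),
    ("child", {}, "malnutrition"),
]
models = fit_stratified(rows)
print(predict_stratified(models, "child", {}))
```

The design trade-off is that each stratum model sees fewer training cases in exchange for a cause distribution that is closer to the cases it will score.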

Cross-Model Metric Snapshot

Default view shows the best representative model per family (one card per family). Use the selector to drill into the structured, narrative-embedding, age-stratified, or ensemble families, or to see every compared run on one screen.

| Model family | Tag | Top-1 | Top-3 | CSMF | Macro-F1 |
| --- | --- | --- | --- | --- | --- |
| Structured · XGBoost (+ MITS / lab features) | standalone | 63.7% | 83.1% | 85.2% | not shown |
| Narrative embedding (Gemma) · XGBoost | narrative | 57.8% | 78.5% | 76.2% | not shown |
| Zero-shot LLM (Gemini) | no training | 55.6% | not shown | not shown | 0.074 |
| Age-stratified · XGBoost (3 strata) | routing | 61.9% | 80.1% | 84.4% | not shown |
| Age-stratified · Logistic regression (6 strata) | population | 56.3% | 79.3% | 92.8% | not shown |
| Ensemble (Structured + Narrative, XGBoost) | headline | 66.7% | 82.6% | 82.2% | not shown |

| Model family | Headline metric | Why it matters |
| --- | --- | --- |
| Ensemble (Structured + Narrative, XGBoost) | 66.7% top-1 / 82.6% top-3 | Best case-level result. Stacks structured XGBoost with XGBoost on Gemma narrative embeddings. |
| Structured · XGBoost | 66.2% top-1 | Best single-family model. Coded VA + clinical/lab features, no narrative. |
| Structured · Logistic regression | 58.9% top-1 | Same features, linear baseline. Always lower than XGBoost. |
| Narrative embedding (Gemma) · XGBoost | 57.8% top-1 | Best text-only model. Plateaus well below structured XGBoost. |
| Zero-shot LLM (Gemini) | 55.6% top-1 | Readable and flexible, but the weakest serious approach in the comparison. |
| Age-stratified · Logistic regression (6 strata) | 92.8% CSMF | Best population-level CSMF. Note its case-level top-1 (56.3%) is much lower than the ensemble. |

The case-level winner is Ensemble (Structured + Narrative, XGBoost) at 66.7% top-1 / 82.6% top-3. The population-level winner is a different model family (age-stratified logistic regression) optimized for cause-specific mortality fractions, not individual case prediction.
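The split between a case-level winner and a population-level winner is easier to see with the two metrics side by side. The sketch below assumes that "CSMF" on this page refers to the commonly used CSMF accuracy, 1 - Σ|true fraction - predicted fraction| / (2 · (1 - smallest true fraction)); that definition, the function names, and the four-case toy data are assumptions for illustration, not the project's evaluation code.

```python
from collections import Counter
from typing import Dict, Sequence


def top_k_accuracy(y_true: Sequence[str], ranked: Sequence[Sequence[str]], k: int) -> float:
    """Share of cases whose true cause is among the model's top k guesses."""
    hits = sum(1 for truth, ranking in zip(y_true, ranked) if truth in ranking[:k])
    return hits / len(y_true)


def csmf(labels: Sequence[str]) -> Dict[str, float]:
    """Cause-specific mortality fractions: share of cases per cause."""
    counts = Counter(labels)
    return {cause: n / len(labels) for cause, n in counts.items()}


def csmf_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """CSMF accuracy (assumed definition); needs at least two true causes."""
    true_f, pred_f = csmf(y_true), csmf(y_pred)
    causes = set(true_f) | set(pred_f)
    abs_err = sum(abs(true_f.get(c, 0.0) - pred_f.get(c, 0.0)) for c in causes)
    return 1 - abs_err / (2 * (1 - min(true_f.values())))


# Toy data: model X gets the population mix right but half the cases wrong;
# model Y gets more cases right but distorts the population mix.
y_true = ["A", "A", "B", "B"]
pred_x = ["A", "B", "A", "B"]
pred_y = ["A", "A", "A", "B"]

for name, pred in (("X", pred_x), ("Y", pred_y)):
    top1 = top_k_accuracy(y_true, [[p] for p in pred], k=1)
    print(name, "top-1:", top1, "CSMF accuracy:", csmf_accuracy(y_true, pred))
```

In this toy case model X has the lower top-1 but the higher CSMF accuracy, because individual errors that cancel out leave the population-level fractions intact.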

Cohort Overview

The dataset is broad, but the label distribution is still dominated by a small number of perinatal causes.

Top causes dominate the task

The top two causes already make up 54.8% of the cohort.

| Cause | Share | Cases |
| --- | --- | --- |
| Perinatal asphyxia/hypoxia | 40.8% | 3,733 |
| Neonatal preterm birth complications | 14.0% | 1,284 |
| Congenital birth defects | 7.9% | 721 |
| Malnutrition | 5.2% | 479 |
| Neonatal sepsis | 4.8% | 437 |
| Undetermined | 4.1% | 375 |
| Malaria | 3.6% | 332 |
| Lower respiratory infections | 3.2% | 294 |
| Congenital infection | 2.2% | 203 |
| Diarrheal Diseases | 1.8% | 165 |
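The 54.8% figure follows directly from the listed counts and the 9,153-case full cohort. The short check below reproduces it and also derives a reference point that is not stated on the page: the top-1 accuracy of a trivial rule that always predicts the most common cause. The variable and function names are hypothetical.

```python
# Case counts for the ten most common causes, as listed on this page.
cause_counts = {
    "Perinatal asphyxia/hypoxia": 3733,
    "Neonatal preterm birth complications": 1284,
    "Congenital birth defects": 721,
    "Malnutrition": 479,
    "Neonatal sepsis": 437,
    "Undetermined": 375,
    "Malaria": 332,
    "Lower respiratory infections": 294,
    "Congenital infection": 203,
    "Diarrheal Diseases": 165,
}
COHORT_SIZE = 9153  # full cohort


def share(count: int, total: int = COHORT_SIZE) -> float:
    """Percentage of the cohort represented by `count` cases."""
    return 100 * count / total


ranked = sorted(cause_counts.values(), reverse=True)
top_two_share = share(sum(ranked[:2]))
majority_baseline = share(ranked[0])  # always guess the most common cause
listed_share = share(sum(ranked))     # coverage of the ten listed causes

print(f"top two causes: {top_two_share:.1f}%")
print(f"majority-class top-1 baseline: {majority_baseline:.1f}%")
print(f"ten listed causes: {listed_share:.1f}%")
```

Reading model accuracies against the majority-class baseline gives a sense of how much of each top-1 figure comes from the dominant cause alone.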

Age groups are highly imbalanced

Stillbirth and neonatal deaths make up most of the current processed dataset.

| Age group | Share | Cases |
| --- | --- | --- |
| Stillbirth | 38.0% | 3,481 |
| Death in the first 24 hours | 14.6% | 1,338 |
| Early Neonate (1 to 6 days) | 16.5% | 1,507 |
| Late Neonate (7 to 27 days) | 6.9% | 634 |
| Infant (28 days to less than 12 months) | 12.7% | 1,158 |
| Child (12 months to less than 60 Months) | 11.3% | 1,035 |

Geography is broad but not uniform

The core footprint spans six African site codes (MZ, ET, ZA, KE, SL, ML) and Bangladesh (BD, 12.9%), with small Pakistan (PK, 1.2%) and Nigeria (NG, 0.1%) tails.

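| Site code | Share | Cases |
| --- | --- | --- |
| MZ | 17.1% | 1,562 |
| ET | 16.4% | 1,498 |
| ZA | 15.7% | 1,440 |
| KE | 15.3% | 1,400 |
| SL | 13.4% | 1,224 |
| BD | 12.9% | 1,178 |
| ML | 8.0% | 733 |
| PK | 1.2% | 107 |
| NG | 0.1% | 11 |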

Matched-cohort baseline ranking

Top-1 on the direct head-to-head cohort (older comparison set, all linear/zero-shot baselines). The headline XGBoost ensemble in the leaderboard above sits well above all of these.

| Model | Note | Top-1 |
| --- | --- | --- |
| Ensemble (Logistic regression + Gemini) | Best linear-baseline ensemble in the matched analysis. | 61.3% |
| Structured · Logistic regression | Linear structured baseline. | 56.7% |
| Zero-shot LLM (Gemini) | Gemini reading the serialized patient record, no training. | 55.6% |
| Narrative embedding (Qwen) · Logistic regression | Narrative-only baseline. | 51.0% |

Where Results Are Stronger And Weaker

Per-cause accuracy is the binding constraint. Common perinatal categories already classify at 85%+, and the ensemble improves several rare-cause buckets, but absolute accuracy on long-tail infectious and mixed categories is still where the headroom lives.

Performance On Common Causes

Cells show per-cause accuracy for each model; higher is better within each row.

| Cause | N | Structured · Logistic regression | Zero-shot LLM (Gemini) | Narrative embedding (Qwen) · Logistic regression | Ensemble (Logistic regression + Gemini) |
| --- | --- | --- | --- | --- | --- |
| perinatal asphyxia/hypoxia | 3,731 | 88.9% | 84.7% | 90.6% | 88.8% |
| neonatal preterm birth complications | 1,284 | 65.2% | 92.1% | 65.7% | 87.9% |
| congenital birth defects | 721 | 35.6% | 24.5% | 0.3% | 30.7% |
| malnutrition | 478 | 48.1% | 23.4% | 73.4% | 60.0% |
| neonatal sepsis | 437 | 14.9% | 20.8% | 0.0% | 36.8% |
| undetermined | 375 | 4.8% | 5.3% | 0.0% | 5.3% |
| malaria | 331 | 68.0% | 46.2% | 26.6% | 62.5% |
| lower respiratory infections | 294 | 16.7% | 18.4% | 0.0% | 24.8% |
| congenital infection | 203 | 1.0% | 0.0% | 0.0% | 4.9% |
| diarrheal diseases | 165 | 31.5% | 27.3% | 0.0% | 35.2% |
| HIV | 162 | 25.3% | 19.1% | 0.0% | 19.8% |
| other neonatal disorders | 147 | 2.0% | 3.4% | 0.0% | 1.4% |

Persistent Challenge Areas

These are the causes that remain difficult across model families.

| Cause | N | Structured · Logistic regression | Zero-shot LLM (Gemini) | Narrative embedding (Qwen) · Logistic regression | Ensemble (Logistic regression + Gemini) |
| --- | --- | --- | --- | --- | --- |
| undetermined | 375 | 4.8% | 5.3% | 0.0% | 5.3% |
| lower respiratory infections | 294 | 16.7% | 18.4% | 0.0% | 24.8% |
| congenital infection | 203 | 1.0% | 0.0% | 0.0% | 4.9% |
| other neonatal disorders | 147 | 2.0% | 3.4% | 0.0% | 1.4% |
| sepsis | 134 | 3.0% | 4.5% | 0.0% | 9.0% |
| neonatal aspiration syndromes | 92 | 3.3% | 3.3% | 0.0% | 3.3% |
| other infections | 69 | 2.9% | 0.0% | 0.0% | 0.0% |

Illustrative Synthetic Interview Examples

Side artifact, not part of the headline result. We can rewrite a structured case into a clean INTERVIEWER / CAREGIVER transcript that covers symptoms, timing, care-seeking, and charted findings without naming the diagnosis. Useful as a communication aid; the head-to-head augmentation check has not yet produced a performance gain.

  • Narrative validation: 200/200. Short synthetic narratives passed validation.
  • Transcript validation: 200/200. Long interview transcripts also validated cleanly.
  • Median transcript length: 1,353 words. Transcript mode is long-form and presentation-ready.
  • Head-to-head agreement: 65.7%. Real vs synthetic predictions agree fairly often.
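Current conclusion: on the 99-case synthetic head-to-head subset, real narratives scored 47.5% top-1, synthetic narratives scored 48.5%, and the baseline condition scored 49.5%. On 99 cases these figures correspond to about 47, 48, and 49 correct predictions, so the conditions differ by roughly one case each and neither narrative condition beats the baseline. This is promising enough to keep exploring, but not yet strong enough to headline as a performance win.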
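A quick back-of-the-envelope check helps put the 99-case comparison in scale. The sketch below converts each reported percentage to an implied count of correct cases and computes a simple binomial standard error; it treats cases as independent and ignores that the three conditions were scored on the same cases, so it is a rough guide rather than a formal test. All names are hypothetical.

```python
import math

N_CASES = 99  # size of the synthetic head-to-head subset
reported_top1 = {"real": 47.5, "synthetic": 48.5, "baseline": 49.5}  # percent


def implied_correct(pct: float, n: int = N_CASES) -> int:
    """Number of correct cases implied by a rounded percentage."""
    return round(pct / 100 * n)


def binomial_se(pct: float, n: int = N_CASES) -> float:
    """Standard error of an accuracy estimate, in percentage points."""
    p = pct / 100
    return 100 * math.sqrt(p * (1 - p) / n)


for condition, pct in reported_top1.items():
    print(condition, implied_correct(pct), f"+/- {binomial_se(pct):.1f} points")
```

With a standard error of roughly five percentage points per condition, one-point gaps between conditions are well inside the noise for a subset of this size.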