Cause-of-death modeling progress update (CODA JHU Team)

Headline: the strongest case-level model on CHAMPS is an ensemble of two XGBoost models, one trained on coded VA and clinical features and one on narrative embeddings, reaching 66.7% top-1 and 82.6% top-3. In every comparable head-to-head, gradient boosting beats both logistic regression and the zero-shot Gemini LLM. Two factors cap further gains: rare causes, and narrative coverage, since only 44.6% of cases carry usable free-text narrative.

Takeaway 1 · Model family

XGBoost is the strongest model family in the current runs.

On coded features it reaches 66.2% top-1, versus 58.9% for the matched logistic-regression baseline. On Gemma narrative embeddings it reaches 57.8%, and it also stays ahead of the zero-shot Gemini baseline at 55.6%.

Takeaway 2 · Ensembling

The structured-plus-embedding ensemble gives the best case-level result.

The best current run reaches 66.7% top-1 and 82.6% top-3, compared with 66.2% top-1 for the standalone structured XGBoost. The gain is modest, but it is still the strongest case-level result in the current pipeline.
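One way to picture a structured-plus-narrative ensemble is as a weighted blend of the two models' per-cause probabilities, falling back to the structured model when a case has no usable narrative. The sketch below is illustrative only: the function name `blend_probabilities`, the weight `w_text`, and the fallback rule are assumptions, not the pipeline's actual stacking procedure.

```python
from typing import Optional, Sequence


def blend_probabilities(
    p_structured: Sequence[float],
    p_narrative: Optional[Sequence[float]],
    w_text: float = 0.3,  # hypothetical weight, not a tuned value
) -> list[float]:
    """Blend two per-cause probability vectors for a single case."""
    if p_narrative is None:
        # No usable narrative: rely on the structured model alone.
        return list(p_structured)
    if len(p_structured) != len(p_narrative):
        raise ValueError("probability vectors must cover the same causes")
    blended = [
        (1.0 - w_text) * ps + w_text * pn
        for ps, pn in zip(p_structured, p_narrative)
    ]
    total = sum(blended)
    # Renormalize so the blended scores still sum to 1.
    return [b / total for b in blended]


# Toy example with three causes; the numbers are made up.
structured = [0.50, 0.40, 0.10]
narrative = [0.10, 0.80, 0.10]

blended = blend_probabilities(structured, narrative)
no_text = blend_probabilities(structured, None)

best_with_text = blended.index(max(blended))     # narrative shifts the call
best_without_text = no_text.index(max(no_text))  # structured model decides
```

In this toy case the narrative model moves the top prediction from the first cause to the second. A blend of this kind can only help on cases that have narrative, which is consistent with a modest overall gain when narrative coverage is limited.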

Takeaway 3 · Rare causes

Rare causes remain the main bottleneck.

The top two causes still account for 54.8% of the cohort. Common perinatal causes drive much of the accuracy, while rarer infectious and mixed categories remain substantially harder.
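The 54.8% figure can be reproduced from the per-cause counts reported in the per-cause table, assuming the matched cohort (N = 9,149) is the denominator. The same arithmetic gives the accuracy of a trivial model that always predicts the most common cause, which is a useful floor when reading the top-1 numbers. Variable names here are illustrative.

```python
# Counts taken from the per-cause table in this report.
cause_counts = {
    "perinatal asphyxia/hypoxia": 3731,
    "neonatal preterm birth complications": 1284,
}
cohort_n = 9149  # assumption: the matched cohort is the denominator

# Share of the cohort covered by the two most common causes.
top_two_share = sum(cause_counts.values()) / cohort_n

# Top-1 accuracy of always predicting the single most common cause.
majority_share = max(cause_counts.values()) / cohort_n

print(f"top two causes: {top_two_share:.1%}")
print(f"majority-class baseline: {majority_share:.1%}")
```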

Takeaway 4 · Limits of text

Narrative helps, but it does not replace coded VA.

Text-only embedding models remain below the structured XGBoost baseline of 66.2%: 57.8% for Gemma and 55.6% for Qwen. Coverage is the second limit: only 44.6% of cases have usable narrative.

Model family	Headline metric	Why it matters
Ensemble (Structured + Narrative, XGBoost)	66.7% top-1 / 82.6% top-3	Best case-level result. Stacks structured XGBoost with XGBoost on Gemma narrative embeddings.
Structured · XGBoost	66.2% top-1	Best single-family model. Coded VA + clinical/lab features, no narrative.
Structured · Logistic regression	58.9% top-1	Same features, linear baseline. Lower than XGBoost in every comparison; 7.3 pp lower on top-1 here.
Narrative embedding (Gemma) · XGBoost	57.8% top-1	Best text-only model. Plateaus well below structured XGBoost.
Zero-shot LLM (Gemini)	55.6% top-1	Readable and flexible, but the weakest serious approach in the comparison.
Age-stratified · Logistic regression (6 strata)	92.8% CSMF	Best population-level CSMF. Note its case-level top-1 (56.3%) is much lower than the ensemble.
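A high CSMF score alongside a much lower top-1 is possible because CSMF compares the predicted and true cause distributions across the whole population, so case-level errors that cancel out do not count against it. The sketch below uses the commonly cited CSMF accuracy definition, 1 - sum|true - pred| / (2 * (1 - min true)). Whether the 92.8% figure in this report uses exactly that definition is an assumption.

```python
from collections import Counter
from typing import Sequence


def csmf_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Population-level agreement between true and predicted cause fractions."""
    n = len(y_true)
    true_frac = {c: k / n for c, k in Counter(y_true).items()}
    pred_frac = {c: k / n for c, k in Counter(y_pred).items()}
    if len(true_frac) < 2:
        raise ValueError("CSMF accuracy needs at least two true causes")
    causes = set(true_frac) | set(pred_frac)
    abs_err = sum(
        abs(true_frac.get(c, 0.0) - pred_frac.get(c, 0.0)) for c in causes
    )
    # Normalize by the worst possible total absolute error.
    return 1.0 - abs_err / (2.0 * (1.0 - min(true_frac.values())))


def top1_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Share of cases whose predicted cause matches the true cause."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: every case is misclassified, yet the cause fractions match.
y_true = ["A", "A", "B", "B"]
y_pred = ["B", "B", "A", "A"]

csmf = csmf_accuracy(y_true, y_pred)
top1 = top1_accuracy(y_true, y_pred)
```

The toy case is an extreme: the predicted fractions match the true fractions exactly while no individual case is right. Real models sit between the two, which is why the two metrics need to be read separately.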

Where Results Are Stronger And Weaker

Per-cause accuracy is the binding constraint. Common perinatal categories already classify at 85%+, and the ensemble improves several rare-cause buckets, but absolute accuracy on long-tail infectious and mixed categories remains low, and that is where most of the remaining headroom sits.

Matched-cohort baseline ranking

Model	N	Top-1	Macro-F1	Interpretation
Ensemble (Structured logistic regression + Gemini)	9,149	61.3%	0.118	Earlier matched-cohort ensemble. Superseded by the XGBoost-based ensemble in the executive view.
Structured · Logistic regression	9,149	56.7%	0.111	Linear structured baseline on the matched cohort.
Zero-shot LLM (Gemini)	9,149	55.6%	0.074	Gemini reading serialized patient records, no training.
Narrative embedding (Qwen) · Logistic regression	9,149	51.0%	0.032	Narrative-only embedding baseline.
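Macro-F1 can sit near 0.1 while top-1 is above 50% because it averages per-cause F1 with equal weight, so every rare cause the model almost never predicts contributes a score near zero. The sketch below is a minimal pure-Python version. Averaging over every cause that appears in either the true or the predicted labels is an assumption about how the reported numbers were computed.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-cause F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A cause that is never predicted correctly scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: one dominant cause and two rare ones the model never predicts.
y_true = ["A"] * 8 + ["B", "C"]
y_pred = ["A"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

Here accuracy is 0.8 but macro-F1 is below 0.3, because two of the three causes score zero.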

Additional Model Results

Variant	N	Top-1	Top-3	Top-5	Macro-F1	Note
Structured · Logistic regression (full features)	9,542	58.9%	79.4%	86.4%	0.088	Strongest flat logistic regression baseline.
Structured · Logistic regression (VA + demographics)	9,153	56.8%	77.1%	—	0.075	Simpler logistic regression baseline.
Structured · Logistic regression (+ MITS / lab)	9,153	56.4%	78.7%	—	0.097	Adds MITS / lab-style signals to the linear baseline.
Narrative embedding (Gemma) · Logistic regression	9,153	54.8%	75.8%	84.2%	0.040	Best linear narrative-only baseline.
Narrative embedding (Qwen) · Logistic regression	9,153	51.0%	73.2%	—	0.032	Smaller narrative-only model.
Ensemble (confidence-margin rule)	9,140	59.6%	—	—	—	Earlier rule-based ensemble; +2.9 pp vs the logistic regression baseline.
Age-stratified ensemble	9,542	62.2%	—	—	0.107	Best top-1 before the structured + narrative XGBoost ensemble was added.
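For reference, top-k accuracy counts a case as correct when the true cause is among the k highest-scoring causes, so top-1, top-3, and top-5 can only increase with k. The sketch below is a minimal illustration. The function name and the toy probabilities are made up, and causes are assumed to be encoded as integer indices into each probability vector.

```python
from typing import Sequence


def top_k_accuracy(
    y_true: Sequence[int],
    probs: Sequence[Sequence[float]],
    k: int,
) -> float:
    """Share of cases whose true cause is among the k top-ranked causes."""
    hits = 0
    for truth, p in zip(y_true, probs):
        # Rank cause indices from highest to lowest predicted probability.
        ranked = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(y_true)


# Toy example: four cases, three causes.
probs = [
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.10, 0.20, 0.70],
    [0.40, 0.35, 0.25],
]
y_true = [0, 2, 1, 2]

top1 = top_k_accuracy(y_true, probs, k=1)
top2 = top_k_accuracy(y_true, probs, k=2)
top3 = top_k_accuracy(y_true, probs, k=3)
```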

Performance On Common Causes

Cause	N	Structured · Logistic regression	Zero-shot LLM (Gemini)	Narrative embedding (Qwen) · Logistic regression	Ensemble (Logistic regression + Gemini)
perinatal asphyxia/hypoxia	3,731	88.9%	84.7%	90.6%	88.8%
neonatal preterm birth complications	1,284	65.2%	92.1%	65.7%	87.9%
congenital birth defects	721	35.6%	24.5%	0.3%	30.7%
malnutrition	478	48.1%	23.4%	73.4%	60.0%
neonatal sepsis	437	14.9%	20.8%	0.0%	36.8%
undetermined	375	4.8%	5.3%	0.0%	5.3%
malaria	331	68.0%	46.2%	26.6%	62.5%
lower respiratory infections	294	16.7%	18.4%	0.0%	24.8%
congenital infection	203	1.0%	0.0%	0.0%	4.9%
diarrheal diseases	165	31.5%	27.3%	0.0%	35.2%
HIV	162	25.3%	19.1%	0.0%	19.8%
other neonatal disorders	147	2.0%	3.4%	0.0%	1.4%

Persistent Challenge Areas

Cause	N	Structured · Logistic regression	Zero-shot LLM (Gemini)	Narrative embedding (Qwen) · Logistic regression	Ensemble (Logistic regression + Gemini)
undetermined	375	4.8%	5.3%	0.0%	5.3%
lower respiratory infections	294	16.7%	18.4%	0.0%	24.8%
congenital infection	203	1.0%	0.0%	0.0%	4.9%
other neonatal disorders	147	2.0%	3.4%	0.0%	1.4%
sepsis	134	3.0%	4.5%	0.0%	9.0%
neonatal aspiration syndromes	92	3.3%	3.3%	0.0%	3.3%
other infections	69	2.9%	0.0%	0.0%	0.0%

Illustrative Synthetic Interview Examples

Side artifact, not part of the headline result. A structured case can be rewritten into a clean INTERVIEWER / CAREGIVER transcript that covers symptoms, timing, care-seeking, and charted findings without naming the diagnosis. This is useful as a communication aid, but the head-to-head augmentation check has not yet produced a performance gain.
