CODA JHU Team
Yiqun T. Chen, Abhirup (Abhi) Datta, Li Liu
Last updated April 6, 2026

Cause-of-death modeling progress update

Across current random 5-fold runs, best observed performance reaches 62.2% top-1, 80.6% top-3, and 92.8% CSMF accuracy, with interpretation still shaped by strong age-group and cause-class imbalance.
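The three headline metrics can be computed as in the minimal sketch below. It assumes the report uses the standard verbal-autopsy definition of CSMF accuracy (one minus the summed absolute error in cause fractions, scaled by twice one minus the smallest true fraction), which is not spelled out here. The function names and toy labels are illustrative only.

```python
from collections import Counter


def top_k_accuracy(true_labels, ranked_predictions, k):
    """Fraction of cases whose true cause is among the first k ranked guesses."""
    hits = sum(
        1
        for truth, ranked in zip(true_labels, ranked_predictions)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


def csmf_accuracy(true_labels, predicted_labels):
    """1 - sum|true_frac - pred_frac| / (2 * (1 - min true_frac)).

    Undefined (division by zero) if every case has the same true cause.
    """
    n = len(true_labels)
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)
    causes = set(true_counts) | set(pred_counts)
    abs_error = sum(abs(true_counts[c] / n - pred_counts[c] / n) for c in causes)
    min_true_frac = min(true_counts.values()) / n
    return 1.0 - abs_error / (2.0 * (1.0 - min_true_frac))


# Toy data: four cases, each with a ranked list of three candidate causes.
truth = ["asphyxia", "asphyxia", "preterm", "sepsis"]
ranked = [
    ["asphyxia", "preterm", "sepsis"],
    ["preterm", "asphyxia", "sepsis"],
    ["preterm", "asphyxia", "sepsis"],
    ["asphyxia", "preterm", "sepsis"],
]
top1_preds = [r[0] for r in ranked]

print(top_k_accuracy(truth, ranked, k=1))
print(top_k_accuracy(truth, ranked, k=3))
print(csmf_accuracy(truth, top1_preds))
```

CSMF accuracy scores the predicted cause distribution rather than individual cases, which is why it can sit far above top-1 accuracy: individual errors that cancel at the population level are not penalized.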

Evaluation setting: Random 5-fold CV. Used throughout the headline results to keep comparisons fair.
Comparison cohort: 9,149 matched cases. Direct head-to-head across structured, LLM, combined, and embedding approaches.
Full cohort: 9,153 cases. Current processed dataset across 9 site codes.
Best current top-1: 62.2%. Age-aware ensemble on the random split.
Narrative text available: 44.6%. Only part of the cohort has usable free-text narrative data.
Facility deaths: 87.4%. Most records include facility-based structured clinical information.
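For readers unfamiliar with the evaluation setting, a random 5-fold split can be sketched as follows. This is a generic illustration, not the team's pipeline: the seed, the function name, and the absence of stratification are all assumptions.

```python
import random


def random_k_fold_indices(n_cases, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every case is held out exactly once."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)  # seed value is an arbitrary choice
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        ]
        yield train_idx, test_idx


# With the full cohort size from this report, folds hold 1,830 or 1,831 cases.
splits = list(random_k_fold_indices(9153))
print([len(test_idx) for _, test_idx in splits])
```

Because the split is random rather than grouped by site or time period, cases from the same site appear in both training and test folds.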

Key Takeaways

Most important points.

Takeaway 1

The clearest current result is the combined ensemble.

On the matched random 5-fold comparison cohort, the Combined model (structured log-reg + Gemini) reaches 61.3% top-1 accuracy, beating the Structured model (regularized log-reg) (56.7%) by +4.5 pp and the LLM model (Gemini) (55.6%) by +5.6 pp.
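One common way to build such a combined model is to blend the two models' per-cause probabilities. The sketch below assumes both models emit a probability for each cause and that a single convex weight is used. The report does not state the actual combination rule, so `combine_probabilities`, `llm_weight`, and the toy numbers are hypothetical.

```python
def combine_probabilities(structured_probs, llm_probs, llm_weight=0.5):
    """Blend two {cause: probability} dicts with a convex weight, then renormalize."""
    if not 0.0 <= llm_weight <= 1.0:
        raise ValueError("llm_weight must be between 0 and 1")
    causes = set(structured_probs) | set(llm_probs)
    blended = {
        c: (1.0 - llm_weight) * structured_probs.get(c, 0.0)
        + llm_weight * llm_probs.get(c, 0.0)
        for c in causes
    }
    total = sum(blended.values())
    if total == 0:
        raise ValueError("no probability mass to combine")
    return {c: p / total for c, p in blended.items()}


def top_1(probs):
    """Return the cause with the highest blended probability."""
    return max(probs, key=probs.get)


# Toy case where the two models disagree on the leading cause.
structured = {"asphyxia": 0.5, "preterm": 0.4, "sepsis": 0.1}
llm = {"preterm": 0.7, "asphyxia": 0.2, "sepsis": 0.1}
blended = combine_probabilities(structured, llm, llm_weight=0.5)
print(top_1(structured), top_1(llm), top_1(blended))
```

An ensemble of this kind can beat both components when their errors are only partly correlated, which is consistent with the gains reported over each single model.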

Takeaway 2

Age-aware routing is the strongest next-step improvement.

On the full-feature random split, the best age-aware ensemble reaches 62.2% top-1, versus 58.9% for the flat structured model (regularized log-reg) (+3.3 pp).
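Age-aware routing can be pictured as a dispatch step: each case goes to a model fitted for its age group, with the flat model as a fallback. The sketch below is an assumption about the general shape of such a router, not the repo's implementation. The `age_group` field name and the stand-in models are hypothetical placeholders.

```python
def predict_with_age_routing(case, group_models, flat_model):
    """Use the model registered for the case's age group, else the flat model."""
    predict = group_models.get(case["age_group"], flat_model)
    return predict(case)


# Toy stand-ins for fitted models (placeholders, not clinical rules).
def flat_model(case):
    return "cause_from_flat_model"


def stillbirth_model(case):
    return "cause_from_stillbirth_model"


group_models = {"Stillbirth": stillbirth_model}

routed = predict_with_age_routing({"age_group": "Stillbirth"}, group_models, flat_model)
fallback = predict_with_age_routing(
    {"age_group": "Late Neonate (7 to 27 days)"}, group_models, flat_model
)
print(routed, fallback)
```

Keeping a fallback matters for small groups, where a group-specific model may have too few training cases to fit reliably.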

Takeaway 3

Embedding models help, but they are not yet the lead result.

The Embedding model (Gemma + log-reg) reaches 54.8% top-1, and the Embedding model (Qwen + log-reg) reaches 51.0%. Both remain behind the Structured model (regularized log-reg), which reaches 56.7% on the matched cohort.

Takeaway 4

Rarer causes remain the main challenge.

The two biggest causes account for 54.8% of the cohort. Results are much stronger on common perinatal causes than on rarer infectious or mixed categories.

Takeaway 5

Synthetic interviews are useful for illustration, not yet for a headline performance gain.

Transcript generation validated cleanly on 200/200 examples, but the head-to-head augmentation check does not yet beat baseline: synthetic top-1 is 48.5% versus 49.5% for baseline (-1.0 pp).

Results At A Glance

The executive view below is meant for direct sharing. Here, the structured baseline is regularized logistic regression on coded features, the combined model ensembles that baseline with Gemini, and the embedding variants pair narrative embeddings with logistic regression.

Approach | Top-1 | Why It Matters
Combined model (structured log-reg + Gemini) | 61.3% | Best current directly matched result and the clearest external headline.
Structured model (regularized log-reg) | 56.7% | Strong baseline using coded VA, clinical, and lab-style features.
LLM model (Gemini) | 55.6% | Readable and flexible, but not as strong as the combined approach.
Age-aware ensemble | 62.2% | Best experimental extension in the current repo.
Embedding model (Gemma + log-reg) | 54.8% | Narrative-only model with stronger representations than the smaller embedding run.
Best directly matched result: 61.3% top-1 for the Combined model (structured log-reg + Gemini).
Recommended framing: lead externally with the matched Combined model (structured log-reg + Gemini) result (61.3% top-1) and keep the age-aware result (62.2%) as the strongest forward-looking extension.

Cohort Overview

The dataset is broad, but the label distribution is still dominated by a small number of perinatal causes.

Top causes dominate the task

The top two causes already make up 54.8% of the cohort.

Cause | Share | Cases
Perinatal asphyxia/hypoxia | 40.8% | 3,733
Neonatal preterm birth complications | 14.0% | 1,284
Congenital birth defects | 7.9% | 721
Malnutrition | 5.2% | 479
Neonatal sepsis | 4.8% | 437
Undetermined | 4.1% | 375
Malaria | 3.6% | 332
Lower respiratory infections | 3.2% | 294
Congenital infection | 2.2% | 203
Diarrheal Diseases | 1.8% | 165

Age groups are highly imbalanced

Stillbirths and neonatal deaths (through 27 days) make up 76.0% of the current processed dataset.

Age group | Share | Cases
Stillbirth | 38.0% | 3,481
Death in the first 24 hours | 14.6% | 1,338
Early Neonate (1 to 6 days) | 16.5% | 1,507
Late Neonate (7 to 27 days) | 6.9% | 634
Infant (28 days to less than 12 months) | 12.7% | 1,158
Child (12 months to less than 60 Months) | 11.3% | 1,035

Geography is broad but not uniform

The core footprint spans six African sites (MZ, ET, ZA, KE, SL, ML) and Bangladesh (BD, 12.9%), with small tails from Pakistan (PK, 1.2%) and Nigeria (NG, 0.1%).

Site | Share | Cases
MZ | 17.1% | 1,562
ET | 16.4% | 1,498
ZA | 15.7% | 1,440
KE | 15.3% | 1,400
SL | 13.4% | 1,224
BD | 12.9% | 1,178
ML | 8.0% | 733
PK | 1.2% | 107
NG | 0.1% | 11

Main model ranking

Top-1 accuracy on the directly matched comparison cohort.

Model | Top-1 | Note
Combined model (structured log-reg + Gemini) | 61.3% | Best matched result on the direct head-to-head cohort.
Structured model (regularized log-reg) | 56.7% | Best structured log-reg baseline in the matched analysis.
LLM model (Gemini) | 55.6% | LLM review of serialized patient records.
Embedding model (Qwen + log-reg) | 51.0% | Narrative embedding model with logistic regression.

Where Results Are Stronger And Weaker

The combined approach adds value, but the improvements are not uniform. Common neonatal and perinatal patterns are much easier than rarer infectious or mixed categories.

Combined ensemble accuracy by age group

Top-1 accuracy of the structured log-reg + Gemini ensemble in each age group, strongest first.

Age group | Top-1
Stillbirth | 78.0%
Death in the first 24 hours | 60.3%
Early Neonate (1 to 6 days) | 56.9%
Late Neonate (7 to 27 days) | 49.1%
Child (12 months to less than 60 Months) | 43.4%
Infant (28 days to less than 12 months) | 27.1%
Stillbirth is easiest (78.0%); the infant (27.1%) and child (43.4%) groups remain much harder.

How the leak-safe ensemble improved

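A stepwise view from the leak-safe ablation. Most of the gain comes from weight tuning (58.1% to 61.1% top-1); calibration adds a further 0.4 pp, balance tuning leaves top-1 unchanged, and the embedding add-on contributes only a marginal +0.1 pp while macro-F1 slips from 0.136 to 0.134.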

Step | Top-1 | Macro-F1
Structured model (regularized log-reg) | 58.1% | 0.129
Structured model (log-reg) + weight tuning | 61.1% | 0.132
Structured model (log-reg) + calibration | 61.5% | 0.136
Structured model (log-reg) + balance tuning | 61.5% | 0.136
Combined ensemble + embedding signal | 61.6% | 0.134
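The gap between top-1 accuracy and macro-F1 in this ablation is typical of imbalanced label sets. The sketch below is a plain-Python illustration with placeholder labels "A", "B", and "C" (not causes from this cohort): a model that only ever predicts the dominant class scores well on accuracy and poorly on macro-F1, because macro-F1 weights every class equally.

```python
def accuracy(y_true, y_pred):
    """Share of cases where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in truth or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Ten toy cases: one dominant class and two rare ones, all predicted as "A".
y_true = ["A"] * 8 + ["B", "C"]
y_pred = ["A"] * 10

print(accuracy(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 3))
```

Read this way, a step that raises top-1 while leaving macro-F1 flat is mostly improving the common causes rather than the rare ones.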

Recovered cases over structured log-reg, by age group

Counts of cases fixed by the combined ensemble compared with the structured log-reg baseline.

Age group | Recovered cases | Share of group
Early Neonate (1 to 6 days) | 256 | 17.0%
Death in the first 24 hours | 187 | 14.0%
Infant (28 days to less than 12 months) | 179 | 15.5%
Child (12 months to less than 60 Months) | 140 | 13.5%
Late Neonate (7 to 27 days) | 126 | 19.9%
Stillbirth | 111 | 3.2%
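A count of this kind can be produced as in the sketch below. It assumes "recovered" means the structured baseline's top-1 was wrong and the combined ensemble's top-1 was right, and that the share is taken over all cases in the group. The report does not spell out either point, so treat the definition and the toy data as assumptions.

```python
from collections import Counter


def recovered_by_group(groups, truth, baseline_pred, ensemble_pred):
    """Per group: (cases the baseline missed and the ensemble fixed, share of group)."""
    totals = Counter(groups)
    recovered = Counter()
    for g, t, b, e in zip(groups, truth, baseline_pred, ensemble_pred):
        if b != t and e == t:
            recovered[g] += 1
    return {g: (recovered[g], recovered[g] / totals[g]) for g in totals}


# Toy data: two groups, five cases, placeholder cause labels.
groups = ["neonate", "neonate", "neonate", "infant", "infant"]
truth = ["x", "y", "x", "z", "z"]
baseline_pred = ["x", "x", "y", "y", "z"]
ensemble_pred = ["x", "y", "y", "z", "z"]

result = recovered_by_group(groups, truth, baseline_pred, ensemble_pred)
print(result)

# Arithmetic check against the reported early-neonate figures.
print(round(100 * 256 / 1507, 1))
```

Under this definition the counts are gross rather than net: cases the ensemble newly gets wrong are not subtracted.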

Recovered cases over structured log-reg, by site

The gain is geographically broad rather than concentrated in a single site.

Site | Recovered cases | Share of site
ZA | 186 | 12.9%
KE | 177 | 12.6%
MZ | 171 | 10.9%
ET | 157 | 10.5%
SL | 133 | 10.9%
BD | 87 | 7.4%
ML | 67 | 9.1%
PK | 21 | 19.6%

Illustrative Synthetic Interview Examples

These examples are most useful as communication aids. Each transcript is generated by asking the model to rewrite the full structured case into an INTERVIEWER / CAREGIVER exchange that covers symptoms, timing, care-seeking, environmental context, and charted clinical findings without naming the diagnosis.
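A validation pass of the kind described above could look like the following sketch. The actual validation criteria are not given in this report, so the checks are assumptions: every non-empty line carries an `INTERVIEWER:` or `CAREGIVER:` tag, both speakers appear, and no diagnosis term occurs in the text. The function name and term list are hypothetical.

```python
import re

SPEAKER_LINE = re.compile(r"^(INTERVIEWER|CAREGIVER):\s*\S")


def validate_transcript(transcript, diagnosis_terms):
    """Return a list of problems; an empty list means the transcript passes."""
    lines = [ln.strip() for ln in transcript.splitlines() if ln.strip()]
    if not lines:
        return ["transcript is empty"]
    problems = []
    speakers = set()
    for ln in lines:
        match = SPEAKER_LINE.match(ln)
        if match:
            speakers.add(match.group(1))
        else:
            problems.append(f"line without speaker tag: {ln[:40]!r}")
    for missing in sorted({"INTERVIEWER", "CAREGIVER"} - speakers):
        problems.append(f"no turns from {missing}")
    lowered = transcript.lower()
    for term in diagnosis_terms:
        if term.lower() in lowered:
            problems.append(f"names the diagnosis: {term}")
    return problems


good = (
    "INTERVIEWER: When did the illness start?\n"
    "CAREGIVER: Three days before he died."
)
bad = good + "\nCAREGIVER: The doctor said it was malaria."

print(validate_transcript(good, ["malaria"]))
print(validate_transcript(bad, ["malaria"]))
```

Simple substring matching will miss paraphrased diagnoses, so a check like this is a floor on leakage detection rather than a guarantee.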

Narrative validation: 200/200. Short synthetic narratives passed validation.
Transcript validation: 200/200. Long interview transcripts also validated cleanly.
Median transcript length: 1,353 words. Transcript mode is long-form and presentation-ready.
Head-to-head agreement: 65.7%. Real vs synthetic predictions agree fairly often.
Current conclusion: on the 99-case synthetic head-to-head subset, real narratives scored 47.5% top-1, synthetic narratives scored 48.5%, and the baseline condition scored 49.5%. This is promising enough to keep exploring, but not yet strong enough to headline as a performance win.
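The agreement figure and the three top-1 scores can be related back to case counts with a short sketch. The agreement function assumes agreement means identical top-1 predictions. The counts in the arithmetic check (65, 47, 48, 49 out of 99) are inferred from the reported percentages, not stated in the report.

```python
def agreement_rate(preds_a, preds_b):
    """Share of cases where two prediction lists give the same top-1 cause."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must be the same length")
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Toy example with placeholder labels.
print(agreement_rate(["a", "b", "c"], ["a", "x", "c"]))

# Inferred counts on the 99-case subset that reproduce the reported percentages.
n = 99
inferred = {"agreement": 65, "real": 47, "synthetic": 48, "baseline": 49}
percentages = {name: round(100 * count / n, 1) for name, count in inferred.items()}
print(percentages)
```

If those inferred counts are right, the three conditions differ by a single case each, which supports treating the comparison as a tie at this sample size.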