Cause-of-death modeling progress update (CODA JHU Team)

Across current random 5-fold runs, the best observed performance reaches 62.2% top-1, 80.6% top-3, and 92.8% CSMF accuracy; strong age-group and cause-class imbalance still shapes how these figures should be read.
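For readers less familiar with the three headline metrics, the following is a minimal sketch of how they can be computed. It assumes CSMF accuracy follows the definition commonly used in the verbal autopsy literature (one minus the summed absolute error in cause fractions, scaled by its worst possible value); whether the pipeline behind this update uses exactly this formula is an assumption. The function names and toy labels are illustrative only.

```python
from collections import Counter

def top_k_accuracy(true_labels, ranked_predictions, k):
    """Share of cases whose true cause is among the k highest-ranked causes."""
    hits = sum(1 for truth, ranked in zip(true_labels, ranked_predictions)
               if truth in ranked[:k])
    return hits / len(true_labels)

def csmf_accuracy(true_labels, predicted_labels):
    """1 - sum_j |true_j - pred_j| / (2 * (1 - min_j true_j)).

    Assumes at least two causes appear, so the denominator is non-zero.
    """
    n = len(true_labels)
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)
    causes = set(true_counts) | set(pred_counts)
    abs_error = sum(abs(true_counts[c] - pred_counts[c]) / n for c in causes)
    min_true_fraction = min(true_counts[c] for c in causes) / n
    return 1 - abs_error / (2 * (1 - min_true_fraction))

# Toy data (made up): four cases, three causes, each case with a full ranking.
true_labels = ["a", "a", "b", "c"]
ranked = [["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"], ["c", "a", "b"]]
top1_labels = [r[0] for r in ranked]

print(top_k_accuracy(true_labels, ranked, k=1))               # 0.75
print(top_k_accuracy(true_labels, ranked, k=2))               # 1.0
print(round(csmf_accuracy(true_labels, top1_labels), 3))      # 0.667
```

Top-k is an individual-level measure, while CSMF accuracy compares population-level cause fractions, which is why the latter can be much higher than top-1 on the same predictions.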

Takeaway 1

The clearest current result is the combined ensemble.

On the matched random 5-fold comparison cohort, the Combined model (structured log-reg + Gemini) reaches 61.3% top-1 accuracy, beating the Structured model (regularized log-reg, 56.7%) by +4.5 pp and the LLM model (Gemini, 55.6%) by +5.6 pp.
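This update does not spell out how the structured and LLM outputs are combined. One combined variant in the results is described as a confidence-margin ensemble, so the sketch below shows one plausible rule of that kind: keep the structured model's prediction when the gap between its top two probabilities is large, and defer to the LLM label otherwise. The function names, the 0.2 threshold, and the example probabilities are all hypothetical, not taken from the repo.

```python
def confidence_margin(probabilities):
    """Gap between the highest and second-highest class probability."""
    ranked = sorted(probabilities.values(), reverse=True)
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return ranked[0] - runner_up

def combine_predictions(structured_probs, llm_label, margin_threshold=0.2):
    """Keep the structured prediction when it is confident, else use the LLM.

    margin_threshold is a hypothetical value; in practice it would need to be
    tuned on held-out folds to avoid leaking test information.
    """
    structured_label = max(structured_probs, key=structured_probs.get)
    if confidence_margin(structured_probs) >= margin_threshold:
        return structured_label
    return llm_label

# Made-up example: the structured model is torn between two causes.
uncertain = {"malaria": 0.45, "neonatal sepsis": 0.40, "HIV": 0.15}
confident = {"perinatal asphyxia/hypoxia": 0.80, "neonatal sepsis": 0.20}

print(combine_predictions(uncertain, llm_label="neonatal sepsis"))
# neonatal sepsis
print(combine_predictions(confident, llm_label="neonatal sepsis"))
# perinatal asphyxia/hypoxia
```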

Takeaway 2

Age-aware routing is the strongest next-step improvement.

On the full-feature random split, the best age-aware ensemble reaches 62.2% top-1, versus 58.9% for the flat structured model (regularized log-reg), a gain of +3.3 pp.
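As a rough illustration of what age-aware routing can mean, the sketch below dispatches each case to a model chosen by its age group and falls back to a flat model when the group is unknown. This is an assumption about the general pattern, not a description of the repo's implementation; the age-group names, the `age_group` field, and the stand-in models are hypothetical.

```python
def make_age_router(models_by_age_group, fallback_model):
    """Return a predict function that picks a model from the case's age group."""
    def predict(case):
        model = models_by_age_group.get(case.get("age_group"), fallback_model)
        return model(case)
    return predict

# Stand-in models: each is just a callable returning a cause label.
def neonate_model(case):
    return "perinatal asphyxia/hypoxia"

def child_model(case):
    return "malaria"

def flat_model(case):
    return "undetermined"

route = make_age_router(
    {"neonate": neonate_model, "child": child_model},  # hypothetical groups
    fallback_model=flat_model,
)

print(route({"age_group": "neonate"}))  # perinatal asphyxia/hypoxia
print(route({"age_group": "child"}))    # malaria
print(route({"age_group": None}))       # undetermined
```

Routing like this lets each age group have its own cause prior and feature weights, which is one way a split model can outperform a single flat model on an age-imbalanced cohort.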

Takeaway 3

Embedding models help, but they are not yet the lead result.

The Embedding model (Gemma + log-reg) reaches 54.8% top-1, and the Embedding model (Qwen + log-reg) reaches 51.0%. Both remain behind the strongest structured model (regularized log-reg) results.

Takeaway 4

Rarer causes remain the main challenge.

The two biggest causes account for 54.8% of the cohort. Results are much stronger on common perinatal causes than on rarer infectious or mixed categories.
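The imbalance also explains why the macro-F1 values in the results tables sit far below top-1 accuracy: macro-F1 averages per-cause F1 with equal weight, so causes that are never predicted correctly pull it down regardless of their size. The toy example below, using made-up counts, shows a predictor that only ever outputs the largest cause.

```python
def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of per-class F1, so every cause counts equally."""
    pairs = list(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    f1_scores = []
    for cls in classes:
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denominator = 2 * tp + fp + fn
        # A class with no true positives contributes an F1 of zero.
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy cohort (made-up counts): one dominant cause and three rarer ones.
true_labels = ["asphyxia"] * 6 + ["preterm"] * 2 + ["sepsis", "malaria"]
majority_only = ["asphyxia"] * 10

top1 = sum(t == p for t, p in zip(true_labels, majority_only)) / len(true_labels)
print(top1)                                  # 0.6
print(macro_f1(true_labels, majority_only))  # 0.1875
```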

Takeaway 5

Synthetic interviews are useful for illustration, not yet for a headline performance gain.

Transcript generation validated cleanly on 200/200 examples, but the head-to-head augmentation check is still near baseline: synthetic top-1 48.5% vs baseline 49.5% (-1.0 pp).
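To gauge whether a 1.0 pp gap is meaningful, it helps to look at the sampling error of an accuracy estimate. The sketch below uses the simple binomial standard error and assumes, hypothetically, that the head-to-head check was run on 200 cases; the update does not state the evaluation size, so `n_cases` should be replaced with the real figure.

```python
import math

def accuracy_standard_error(accuracy, n_cases):
    """Binomial standard error of an accuracy estimate from n_cases cases."""
    return math.sqrt(accuracy * (1 - accuracy) / n_cases)

n_cases = 200          # ASSUMPTION: evaluation size is not stated in the update
baseline_top1 = 0.495
synthetic_top1 = 0.485

se = accuracy_standard_error(baseline_top1, n_cases)
gap_pp = (synthetic_top1 - baseline_top1) * 100

print(round(se * 100, 1))   # 3.5  (standard error in percentage points)
print(round(gap_pp, 1))     # -1.0
```

Under that assumed sample size, the gap is well inside one standard error, which is consistent with reading the result as "near baseline" rather than as a real loss. A paired test on the same cases would be a more precise check.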

Results At A Glance

The executive view below is meant for direct sharing. Here, the structured baseline is regularized logistic regression on coded features, the combined model ensembles that baseline with Gemini, and the embedding variants pair narrative embeddings with logistic regression.

Approach	Top-1	Why It Matters
Combined model (structured log-reg + Gemini)	61.3%	Best current directly matched result and the clearest external headline.
Structured model (regularized log-reg)	56.7%	Strong baseline using coded VA, clinical, and lab-style features.
LLM model (Gemini)	55.6%	Readable and flexible, but not as strong as the combined approach.
Age-aware ensemble	62.2%	Best experimental extension in the current repo.
Embedding model (Gemma + log-reg)	54.8%	Narrative-only model with stronger representations than the smaller embedding run.

Additional Model Results

Model	N	Top-1	Macro-F1	Interpretation
Combined model (structured log-reg + Gemini)	9,149	61.3%	0.118	Best matched result on the direct head-to-head cohort.
Structured model (regularized log-reg)	9,149	56.7%	0.111	Best structured log-reg baseline in the matched analysis.
LLM model (Gemini)	9,149	55.6%	0.074	LLM review of serialized patient records.
Embedding model (Qwen + log-reg)	9,149	51.0%	0.032	Narrative embedding model with logistic regression.

Variant	N	Top-1	Top-3	Top-5	Macro-F1	Note
Structured model (full features, regularized log-reg)	9,542	58.9%	79.4%	86.4%	0.088	Best flat structured log-reg baseline.
Structured model (VA + demographics, regularized log-reg)	9,153	56.8%	77.1%	n/a	0.075	Simpler structured log-reg baseline.
Structured model (+ MITS/lab, regularized log-reg)	9,153	56.4%	78.7%	n/a	0.097	Adds MITS and lab-style signals.
Embedding model (Gemma + log-reg)	9,153	54.8%	75.8%	84.2%	0.040	Best standalone embedding artifact currently in the repo.
Embedding model (Qwen + log-reg)	9,153	51.0%	73.2%	n/a	0.032	Smaller narrative-only model; useful but not the lead result.
Combined model (confidence-margin ensemble)	9,140	59.6%	n/a	n/a	n/a	Combined decision rule; +2.9 pp vs the structured log-reg baseline.
Age-aware ensemble	9,542	62.2%	n/a	n/a	0.107	Best overall top-1 in the current random-split artifacts.

Performance On Common Causes

Cause	N	Structured model (regularized log-reg)	LLM model (Gemini)	Combined model (structured log-reg + Gemini)	Embedding model (Qwen + log-reg)
perinatal asphyxia/hypoxia	3,731	88.9%	84.7%	88.8%	90.6%
neonatal preterm birth complications	1,284	65.2%	92.1%	87.9%	65.7%
congenital birth defects	721	35.6%	24.5%	30.7%	0.3%
malnutrition	478	48.1%	23.4%	60.0%	73.4%
neonatal sepsis	437	14.9%	20.8%	36.8%	0.0%
undetermined	375	4.8%	5.3%	5.3%	0.0%
malaria	331	68.0%	46.2%	62.5%	26.6%
lower respiratory infections	294	16.7%	18.4%	24.8%	0.0%
congenital infection	203	1.0%	0.0%	4.9%	0.0%
diarrheal diseases	165	31.5%	27.3%	35.2%	0.0%
HIV	162	25.3%	19.1%	19.8%	0.0%
other neonatal disorders	147	2.0%	3.4%	1.4%	0.0%

Persistent Challenge Areas

Cause	N	Structured model (regularized log-reg)	LLM model (Gemini)	Combined model (structured log-reg + Gemini)	Embedding model (Qwen + log-reg)
undetermined	375	4.8%	5.3%	5.3%	0.0%
lower respiratory infections	294	16.7%	18.4%	24.8%	0.0%
congenital infection	203	1.0%	0.0%	4.9%	0.0%
other neonatal disorders	147	2.0%	3.4%	1.4%	0.0%
sepsis	134	3.0%	4.5%	9.0%	0.0%
neonatal aspiration syndromes	92	3.3%	3.3%	3.3%	0.0%
other infections	69	2.9%	0.0%	0.0%	0.0%

Illustrative Synthetic Interview Examples

These examples are most useful as communication aids. Each transcript is generated by asking the model to rewrite the full structured case into an INTERVIEWER / CAREGIVER exchange that covers symptoms, timing, care-seeking, environmental context, and charted clinical findings without naming the diagnosis.
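A minimal sketch of the transcript-generation step described above is shown here: one helper serializes a structured case into a rewrite instruction, and another checks that a generated transcript does not name the diagnosis. The prompt wording, field names, and helper names are hypothetical, and the update does not say what its own validation step checks, so the leak check is only one plausible form of it. No model call is included.

```python
import re

def build_transcript_prompt(case_fields):
    """Turn a structured case into a rewrite instruction (wording is hypothetical).

    case_fields should exclude the cause-of-death label itself.
    """
    lines = [f"- {name}: {value}" for name, value in case_fields.items()]
    return (
        "Rewrite the case below as an INTERVIEWER / CAREGIVER exchange. "
        "Cover symptoms, timing, care-seeking, environmental context, and "
        "charted clinical findings. Do not name the diagnosis.\n"
        + "\n".join(lines)
    )

def names_diagnosis(transcript, cause_label):
    """True if the cause label appears verbatim (case-insensitive) in the text."""
    pattern = r"\b" + re.escape(cause_label) + r"\b"
    return re.search(pattern, transcript, flags=re.IGNORECASE) is not None

# Made-up case fields for illustration only.
prompt = build_transcript_prompt({"fever_days": 3, "sought_care": "clinic"})
print(prompt.splitlines()[1])  # - fever_days: 3

leaky = "CAREGIVER: The doctor told us it was malaria."
clean = "CAREGIVER: He had a fever for three days before we went to the clinic."
print(names_diagnosis(leaky, "malaria"))  # True
print(names_diagnosis(clean, "malaria"))  # False
```

An exact-match check like this misses paraphrases and synonyms of the diagnosis, so it is better treated as a first filter than as proof that a transcript is leak-free.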

Sections and panels from the original page not reproduced in this text version:

Cohort Overview: Top causes dominate the task; Age groups are highly imbalanced; Geography is broad but not uniform
Main model ranking
Where Results Are Stronger And Weaker: Combined ensemble accuracy by age group; How the leak-safe ensemble improved; Recovered cases over structured log-reg, by age group; Recovered cases over structured log-reg, by site
Review One Challenge Area
Illustrative VA Narrative Prediction Examples
Selected interview example