Headline: the strongest case-level model on CHAMPS is an
ensemble of XGBoost on coded verbal autopsy (VA) + clinical features and XGBoost on narrative embeddings,
reaching 66.7% top-1 and 82.6% top-3. Across every comparable head-to-head,
gradient boosting beats both logistic regression and the zero-shot Gemini LLM. The remaining ceiling is
driven by rare causes and by narrative coverage: only 44.6% of cases carry usable free-text narrative.
Comparison cohort
9,149 matched cases
Direct head-to-head across structured, LLM, combined, and embedding approaches. Used throughout the headline results to keep comparisons fair.
Full cohort
9,153 cases
Current processed dataset across 9 site codes.
Best top-1 (case-level)
66.7%
Ensemble of XGBoost (structured) + XGBoost (narrative embedding).
Narrative text available
44.6%
Less than half of cases carry usable free-text narrative.
Best CSMF (population-level)
92.8%
Age-stratified logistic regression (6 age strata).
Key Takeaways
Four short points with the main numbers: best model family, best ensemble result, main challenge area, and the role of narrative text.
Takeaway 1 · Model family
XGBoost is the strongest model family in the current runs.
On coded features it reaches 66.2% top-1, versus 58.9% for the
matched logistic-regression baseline. On Gemma narrative embeddings it reaches 57.8%,
and it also stays ahead of the zero-shot Gemini baseline at 55.6%.
Takeaway 2 · Ensembling
The structured-plus-embedding ensemble gives the best case-level result.
The best current run reaches 66.7% top-1 and 82.6% top-3,
compared with 66.2% top-1 for the standalone structured XGBoost. The gain is modest (+0.5 pp),
but it is still the strongest case-level result in the current pipeline.
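A minimal sketch of how top-1 and top-3 figures like these are typically computed: a case counts as a top-k hit when its true cause is among the k highest-probability predictions. The function name and the toy probabilities are illustrative assumptions, not the pipeline's code or CHAMPS data.

```python
def top_k_accuracy(probabilities, true_labels, k):
    """Fraction of cases whose true cause is among the k most probable causes.

    probabilities: list of dicts mapping cause name -> predicted probability.
    true_labels:   list of true cause names, aligned with probabilities.
    """
    if len(probabilities) != len(true_labels):
        raise ValueError("probabilities and true_labels must have the same length")
    hits = 0
    for probs, truth in zip(probabilities, true_labels):
        # Rank cause names from most to least probable.
        ranked = sorted(probs, key=probs.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy, made-up predictions for three cases (hypothetical, not CHAMPS data).
toy_probs = [
    {"asphyxia": 0.6, "preterm": 0.3, "sepsis": 0.1},
    {"asphyxia": 0.5, "preterm": 0.4, "sepsis": 0.1},
    {"asphyxia": 0.2, "preterm": 0.3, "sepsis": 0.5},
]
toy_truth = ["asphyxia", "preterm", "asphyxia"]

top1 = top_k_accuracy(toy_probs, toy_truth, k=1)  # only the first case is a hit
top2 = top_k_accuracy(toy_probs, toy_truth, k=2)  # the second case becomes a hit
```

Top-k can only rise as k grows, which is why the top-3 figure sits well above top-1 for the same model.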
Takeaway 3 · Rare causes
Rare causes remain the main bottleneck.
The top two causes still account for 54.8% of the cohort. Common perinatal causes drive
much of the accuracy, while rarer infectious and mixed categories remain substantially harder.
Takeaway 4 · Limits of text
Narrative helps, but it does not replace coded VA.
Text-only embedding models remain below the structured XGBoost baseline: 57.8% for Gemma
and 55.6% for Qwen. Coverage is the second limit: only 44.6% of cases
have usable narrative.
Results At A Glance
One row per model family. All numbers are 5-fold random CV on the same processed CHAMPS cohort, so the
comparisons are apples-to-apples. The detailed view exposes the additional supplementary runs.
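For readers unfamiliar with the protocol, the sketch below shows one way a 5-fold random split can be generated. The page says only "5-fold random CV"; the seed, the helper name `k_fold_indices`, and the absence of stratification are assumptions made for illustration.

```python
import random


def k_fold_indices(n_cases, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs for random k-fold cross-validation."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    # Deal shuffled indices into n_folds nearly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx


# 9,153 is the full-cohort size reported on this page.
splits = list(k_fold_indices(9153, n_folds=5, seed=0))
```

Every case lands in exactly one test fold, so per-fold predictions can be pooled into a single cohort-wide accuracy.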
Model glossary (used consistently throughout this page).
Structured · XGBoost — gradient-boosted trees on coded VA + clinical / lab features.
Structured · Logistic regression — regularized linear baseline on the same coded features.
Narrative embedding · Gemma / Qwen — classifier (XGBoost or logistic regression) on top of large-LM embeddings of the free-text VA narrative.
Zero-shot LLM (Gemini) — Gemini reading the serialized patient record and emitting a cause label, no training.
Age-stratified — same structured features, separate model per age stratum (3 or 6 strata).
Ensemble (Structured + Narrative) — the headline model: stacks structured XGBoost with narrative-embedding XGBoost.
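The page does not describe how the two XGBoost models are stacked. As a simpler stand-in, the sketch below blends the two per-cause probability vectors with a fixed weight and falls back to the structured model when a case has no usable narrative. The weight, the fallback behavior, and the function name are all assumptions; a true stacked ensemble would instead train a meta-learner on out-of-fold predictions.

```python
def blend_probabilities(structured, narrative=None, weight_structured=0.7):
    """Weighted average of two per-cause probability dicts (illustrative only).

    structured: cause -> probability from the structured-feature model.
    narrative:  cause -> probability from the narrative-embedding model,
                or None when the case has no usable narrative (assumed fallback).
    """
    if narrative is None:
        return dict(structured)
    causes = set(structured) | set(narrative)
    blended = {
        cause: weight_structured * structured.get(cause, 0.0)
        + (1.0 - weight_structured) * narrative.get(cause, 0.0)
        for cause in causes
    }
    total = sum(blended.values())
    if total == 0:
        raise ValueError("both probability vectors are all zero")
    # Renormalize so the blended probabilities sum to 1.
    return {cause: p / total for cause, p in blended.items()}


# Hypothetical probabilities for one case.
structured_probs = {"asphyxia": 0.5, "preterm": 0.4, "sepsis": 0.1}
narrative_probs = {"asphyxia": 0.2, "preterm": 0.7, "sepsis": 0.1}

blended = blend_probabilities(structured_probs, narrative_probs, weight_structured=0.5)
predicted = max(blended, key=blended.get)
```

In this toy case the narrative model tips the prediction from asphyxia to preterm, which is the kind of correction an ensemble can make only for cases that have narrative text.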
Cross-Model Metric Snapshot
Each card shows the best representative model for one family. Supplementary runs from the structured,
narrative-embedding, age-stratified, and ensemble families are listed after the cards.
Ensemble (Structured + Narrative)
66.7% top-1
Best case-level result. Stacks structured XGBoost with XGBoost on Gemma narrative embeddings.
Structured · XGBoost
66.2% top-1
Best single-family model. Coded VA + clinical/lab features, no narrative.
Structured · Logistic regression
58.9% top-1
Same features, linear baseline. Lower than XGBoost in every matched comparison.
Narrative embedding (Gemma) · XGBoost
57.8% top-1
Best text-only model. Plateaus well below structured XGBoost.
Zero-shot LLM (Gemini)
55.6% top-1
Readable and flexible, but the lowest top-1 among the family-level results shown here.
Age-stratified · Logistic regression (6 strata)
92.8% CSMF
Best population-level CSMF. Note that its case-level top-1 (56.3%) is much lower than the ensemble's 66.7%.
The case-level winner is Ensemble (Structured + Narrative, XGBoost) at
66.7% top-1 / 82.6% top-3. The population-level winner is a different model
family (age-stratified logistic regression) optimized for cause-specific mortality fractions, not individual
case prediction.
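The gap between case-level and population-level winners is easier to see with a worked example. The sketch below uses the CSMF accuracy definition commonly cited in the verbal autopsy literature, 1 - sum(|true fraction - predicted fraction|) / (2 * (1 - min true fraction)); the page does not state which definition produced the 92.8% figure, so treat the formula choice and the toy labels as assumptions.

```python
from collections import Counter


def cause_fractions(labels, causes):
    """Cause-specific mortality fractions for a list of cause labels."""
    counts = Counter(labels)
    return {cause: counts.get(cause, 0) / len(labels) for cause in causes}


def csmf_accuracy(true_labels, predicted_labels):
    """CSMF accuracy under the commonly cited definition (assumed here)."""
    causes = sorted(set(true_labels) | set(predicted_labels))
    true_frac = cause_fractions(true_labels, causes)
    pred_frac = cause_fractions(predicted_labels, causes)
    abs_error = sum(abs(true_frac[c] - pred_frac[c]) for c in causes)
    min_true = min(true_frac.values())
    if min_true == 1.0:
        return 1.0  # single-cause edge case: fractions match trivially
    return 1.0 - abs_error / (2.0 * (1.0 - min_true))


# Toy labels: many individual predictions are wrong, yet the fractions match.
toy_true = ["a"] * 6 + ["b"] * 3 + ["c"]
toy_pred = ["b"] * 3 + ["a"] * 6 + ["c"]

case_level_top1 = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
population_level = csmf_accuracy(toy_true, toy_pred)

# Predicting the majority cause for everyone distorts the fractions instead.
all_majority = csmf_accuracy(toy_true, ["a"] * 10)
```

Here only 4 of 10 individual predictions are right, yet the predicted cause fractions match the true ones exactly, so a model tuned for CSMF can lead at the population level while trailing on top-1.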
Adds MITS / lab-style signals to the linear baseline.
Narrative embedding (Gemma) · Logistic regression
9,153
54.8%
75.8%
84.2%
0.040
Best linear narrative-only baseline.
Narrative embedding (Qwen) · Logistic regression
9,153
51.0%
73.2%
—
0.032
Smaller narrative-only model.
Ensemble (confidence-margin rule)
9,140
59.6%
—
—
—
Earlier rule-based ensemble; +2.9 pp vs the logistic regression baseline.
Age-stratified ensemble
9,542
62.2%
—
—
0.107
Best top-1 before the structured + narrative XGBoost ensemble was added.
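The page names a confidence-margin rule for the earlier ensemble but does not spell it out. One plausible reading, sketched below purely as an assumption, keeps the primary model's prediction unless the gap between its top two probabilities is small, in which case it defers to a second model's label. The threshold `min_margin` and both function names are hypothetical.

```python
def confidence_margin(probs):
    """Gap between the highest and second-highest predicted probabilities."""
    ranked = sorted(probs.values(), reverse=True)
    if not ranked:
        return 0.0
    if len(ranked) == 1:
        return ranked[0]
    return ranked[0] - ranked[1]


def margin_rule_predict(primary_probs, fallback_label, min_margin=0.2):
    """Use the primary model unless it is unsure, then defer to the fallback.

    primary_probs:  cause -> probability from the primary model.
    fallback_label: label from a second model, or None if unavailable.
    min_margin:     hypothetical threshold; would need tuning on held-out data.
    """
    primary_label = max(primary_probs, key=primary_probs.get)
    if fallback_label is not None and confidence_margin(primary_probs) < min_margin:
        return fallback_label
    return primary_label


# Hypothetical cases: one confident, one ambiguous.
confident = margin_rule_predict({"asphyxia": 0.8, "preterm": 0.1, "sepsis": 0.1}, "preterm")
ambiguous = margin_rule_predict({"asphyxia": 0.45, "preterm": 0.40, "sepsis": 0.15}, "preterm")
no_fallback = margin_rule_predict({"asphyxia": 0.45, "preterm": 0.40, "sepsis": 0.15}, None)
```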
Cohort Overview
The dataset is broad, but the label distribution is still dominated by a small number of perinatal causes.
Top causes dominate the task
The top two causes already make up 54.8% of the cohort.
Perinatal asphyxia/hypoxia: 40.8% (3,733)
Neonatal preterm birth complications: 14.0% (1,284)
Congenital birth defects: 7.9% (721)
Malnutrition: 5.2% (479)
Neonatal sepsis: 4.8% (437)
Undetermined: 4.1% (375)
Malaria: 3.6% (332)
Lower respiratory infections: 3.2% (294)
Congenital infection: 2.2% (203)
Diarrheal diseases: 1.8% (165)
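The shares above follow directly from the case counts and the 9,153-case full cohort. The short sketch below recomputes them and also derives the majority-class floor: a model that always predicted the most common cause would already reach that top-1 score. Variable names are illustrative.

```python
# Case counts for the ten most common causes, as listed on this page.
cause_counts = {
    "perinatal asphyxia/hypoxia": 3733,
    "neonatal preterm birth complications": 1284,
    "congenital birth defects": 721,
    "malnutrition": 479,
    "neonatal sepsis": 437,
    "undetermined": 375,
    "malaria": 332,
    "lower respiratory infections": 294,
    "congenital infection": 203,
    "diarrheal diseases": 165,
}
TOTAL_CASES = 9153  # full cohort size reported on this page

shares = {cause: n / TOTAL_CASES for cause, n in cause_counts.items()}

# Share covered by the two most common causes.
top_two_share = sum(sorted(shares.values(), reverse=True)[:2])

# Top-1 accuracy of a trivial model that always predicts the most common cause.
majority_baseline_top1 = max(shares.values())
```

Any learned model has to be read against that floor: 66.7% top-1 is a gain over always guessing perinatal asphyxia/hypoxia, not over zero.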
Age groups are highly imbalanced
Stillbirths and neonatal deaths (first 27 days) make up 76.0% of the current processed dataset.
Stillbirth: 38.0% (3,481)
Death in the first 24 hours: 14.6% (1,338)
Early Neonate (1 to 6 days): 16.5% (1,507)
Late Neonate (7 to 27 days): 6.9% (634)
Infant (28 days to less than 12 months): 12.7% (1,158)
Child (12 months to less than 60 months): 11.3% (1,035)
Geography is broad but not uniform
The core footprint is spread across six African sites (MZ, ET, ZA, KE, SL, ML) and Bangladesh (BD), with small tails from Pakistan (PK) and Nigeria (NG).
MZ: 17.1% (1,562)
ET: 16.4% (1,498)
ZA: 15.7% (1,440)
KE: 15.3% (1,400)
SL: 13.4% (1,224)
BD: 12.9% (1,178)
ML: 8.0% (733)
PK: 1.2% (107)
NG: 0.1% (11)
Matched-cohort baseline ranking
Top-1 on the direct head-to-head cohort (older comparison set, all linear/zero-shot baselines).
The headline XGBoost ensemble in the leaderboard above (66.7% top-1) sits 5.4 pp above the best of these.
Ensemble (Logistic regression + Gemini)
Best linear-baseline ensemble in the matched analysis.
61.3%
Structured · Logistic regression
Linear structured baseline.
56.7%
Zero-shot LLM (Gemini)
Gemini reading the serialized patient record, no training.
55.6%
Narrative embedding (Qwen) · Logistic regression
Narrative-only baseline.
51.0%
Where Results Are Stronger And Weaker
Per-cause accuracy is the binding constraint. Common perinatal categories already classify at 85%+, and the
ensemble improves several rare-cause buckets, but absolute accuracy on long-tail infectious and mixed
categories is still where the headroom lives.
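Per-cause accuracy here means the share of cases with a given true cause that the model labels correctly (per-class recall). A minimal sketch, assuming aligned lists of true and predicted labels; the function name and toy labels are illustrative, not pipeline code.

```python
from collections import defaultdict


def per_cause_accuracy(true_labels, predicted_labels):
    """Share of each true cause's cases that were predicted correctly."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {cause: correct[cause] / totals[cause] for cause in totals}


# Toy labels (hypothetical): a common cause does well, a rare one does not.
toy_true = ["asphyxia"] * 8 + ["sepsis"] * 2
toy_pred = ["asphyxia"] * 7 + ["sepsis"] + ["asphyxia", "asphyxia"]

by_cause = per_cause_accuracy(toy_true, toy_pred)
overall = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
```

In the toy data overall accuracy is 70% while the rare cause scores 0%, the same pattern the tables in this section show for long-tail categories.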
Performance On Common Causes
Darker cells are better within each row.
| Cause | N | Structured · Logistic regression | Zero-shot LLM (Gemini) | Narrative embedding (Qwen) · Logistic regression | Ensemble (Logistic regression + Gemini) |
|---|---|---|---|---|---|
| perinatal asphyxia/hypoxia | 3,731 | 88.9% | 84.7% | 90.6% | 88.8% |
| neonatal preterm birth complications | 1,284 | 65.2% | 92.1% | 65.7% | 87.9% |
| congenital birth defects | 721 | 35.6% | 24.5% | 0.3% | 30.7% |
| malnutrition | 478 | 48.1% | 23.4% | 73.4% | 60.0% |
| neonatal sepsis | 437 | 14.9% | 20.8% | 0.0% | 36.8% |
| undetermined | 375 | 4.8% | 5.3% | 0.0% | 5.3% |
| malaria | 331 | 68.0% | 46.2% | 26.6% | 62.5% |
| lower respiratory infections | 294 | 16.7% | 18.4% | 0.0% | 24.8% |
| congenital infection | 203 | 1.0% | 0.0% | 0.0% | 4.9% |
| diarrheal diseases | 165 | 31.5% | 27.3% | 0.0% | 35.2% |
| HIV | 162 | 25.3% | 19.1% | 0.0% | 19.8% |
| other neonatal disorders | 147 | 2.0% | 3.4% | 0.0% | 1.4% |
Persistent Challenge Areas
These are the causes that remain difficult across model families.
| Cause | N | Structured · Logistic regression | Zero-shot LLM (Gemini) | Narrative embedding (Qwen) · Logistic regression | Ensemble (Logistic regression + Gemini) |
|---|---|---|---|---|---|
| undetermined | 375 | 4.8% | 5.3% | 0.0% | 5.3% |
| lower respiratory infections | 294 | 16.7% | 18.4% | 0.0% | 24.8% |
| congenital infection | 203 | 1.0% | 0.0% | 0.0% | 4.9% |
| other neonatal disorders | 147 | 2.0% | 3.4% | 0.0% | 1.4% |
| sepsis | 134 | 3.0% | 4.5% | 0.0% | 9.0% |
| neonatal aspiration syndromes | 92 | 3.3% | 3.3% | 0.0% | 3.3% |
| other infections | 69 | 2.9% | 0.0% | 0.0% | 0.0% |
Review One Challenge Area
Selected challenge area
lower respiratory infections
294 records
Common clinical overlap with sepsis and other infections still makes this category hard.
These de-identified VA narratives are shown in full because the ambiguity often sits in the complete sequence of symptoms,
pregnancy history, delivery course, and care-seeking details rather than in a short excerpt.
Misclassified · True: neonatal preterm birth complications · Predicted: perinatal asphyxia/hypoxia · Site: SL · Age: Early Neonate (1 to 6 days)
Misclassified · True: neonatal preterm birth complications · Predicted: perinatal asphyxia/hypoxia · Site: SL · Age: Death in the first 24 hours
Misclassified · True: neonatal preterm birth complications · Predicted: perinatal asphyxia/hypoxia · Site: SL · Age: Late Neonate (7 to 27 days)
Misclassified · True: neonatal preterm birth complications · Predicted: perinatal asphyxia/hypoxia · Site: SL · Age: Death in the first 24 hours
Preterm and perinatal asphyxia often share the same narrative signals around gestational age, delivery complications, and a very short neonatal course.
Misclassified · True: perinatal asphyxia/hypoxia · Predicted: neonatal preterm birth complications · Site: SL · Age: Early Neonate (1 to 6 days)
Misclassified · True: perinatal asphyxia/hypoxia · Predicted: neonatal preterm birth complications · Site: SL · Age: Early Neonate (1 to 6 days)
Misclassified · True: perinatal asphyxia/hypoxia · Predicted: neonatal preterm birth complications · Site: ET · Age: Early Neonate (1 to 6 days)
Misclassified · True: perinatal asphyxia/hypoxia · Predicted: neonatal preterm birth complications · Site: SL · Age: Late Neonate (7 to 27 days)
Even detailed pregnancy and delivery narratives can still pull the model toward prematurity when the neonatal decline is brief and unstable.
Correctly classified · True: lower respiratory infections · Predicted: lower respiratory infections · Site: SL · Age: Infant (28 days to less than 12 months)
Correctly classified · True: malaria · Predicted: malaria · Site: SL · Age: Child (12 months to less than 60 months)
Correctly classified · True: congenital birth defects · Predicted: congenital birth defects · Site: SL · Age: Death in the first 24 hours
Correctly classified · True: malnutrition · Predicted: malnutrition · Site: SL · Age: Child (12 months to less than 60 months)
Correctly classified · True: diarrheal diseases · Predicted: diarrheal diseases · Site: SL · Age: Infant (28 days to less than 12 months)
These longer narratives are useful counterexamples: the model sometimes does stay on the right cause when the symptom pattern is specific enough.
Illustrative Synthetic Interview Examples
Side artifact, not part of the headline result. We can rewrite a structured case into a clean
INTERVIEWER / CAREGIVER transcript that covers symptoms, timing, care-seeking, and
charted findings without naming the diagnosis. Useful as a communication aid; the head-to-head augmentation
check has not yet produced a performance gain.
Narrative validation
200/200
Short synthetic narratives passed validation.
Transcript validation
200/200
Long interview transcripts also validated cleanly.
Median transcript length
1,353 words
Transcript mode is long-form and presentation-ready.
Head-to-head agreement
65.7%
Predictions from real and synthetic narratives agree on roughly two-thirds of cases.
Current conclusion: on the 99-case synthetic head-to-head subset,
real narratives scored 47.5% top-1,
synthetic narratives scored 48.5%,
and the baseline condition scored 49.5%.
Adjacent conditions differ by a single case (47, 48, and 49 correct of 99), so the three are effectively tied.
This is promising enough to keep exploring, but not yet strong enough to headline as a performance win.
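To put the 99-case comparison in context, the sketch below backs out the implied number of correct cases from each reported percentage and computes a simple binomial standard error for each. The helper names are illustrative, and the binomial approximation is an assumption that ignores the fact that the three conditions share the same cases.

```python
import math


def implied_correct(top1_percent, n_cases):
    """Back out the number of correct cases from a rounded top-1 percentage."""
    return round(top1_percent / 100 * n_cases)


def binomial_standard_error(accuracy, n_cases):
    """Standard error of an accuracy estimated from n independent cases."""
    return math.sqrt(accuracy * (1 - accuracy) / n_cases)


N_SUBSET = 99  # size of the synthetic head-to-head subset reported on this page
reported_top1 = {"real": 47.5, "synthetic": 48.5, "baseline": 49.5}

correct_counts = {
    name: implied_correct(pct, N_SUBSET) for name, pct in reported_top1.items()
}
standard_errors_pp = {
    name: 100 * binomial_standard_error(count / N_SUBSET, N_SUBSET)
    for name, count in correct_counts.items()
}
```

Each standard error comes out near 5 percentage points, several times larger than the 1 pp steps between conditions, which is consistent with not treating the result as a performance win.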
Selected interview example
The full transcript is shown below in a scrollable box rather than excerpted, so the construction and level of detail are visible.