Across current random 5-fold runs, best observed performance reaches 62.2% top-1,
80.6% top-3, and 92.8% CSMF accuracy, with
interpretation still shaped by strong age-group and cause-class imbalance.
The random 5-fold split is used throughout the headline results to keep comparisons fair.
Comparison cohort
9,149 matched cases
Direct head-to-head across structured, LLM, combined, and embedding approaches.
Full cohort
9,153 cases
Current processed dataset across 9 site codes.
Best current top-1
62.2%
Age-aware ensemble on the random split.
Narrative text available
44.6%
Only part of the cohort has usable free-text narrative data.
Facility deaths
87.4%
Most records include facility-based structured clinical information.
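For readers who want to reproduce the summary metrics, here is a minimal sketch of top-k accuracy and CSMF accuracy. It assumes the commonly used CSMF accuracy definition, 1 - sum|true fraction - predicted fraction| / (2 * (1 - smallest true fraction)). The report does not state which variant was used, and the function names and toy labels are illustrative only.

```python
from collections import Counter


def top_k_accuracy(true_labels, ranked_predictions, k):
    """Share of cases whose true cause appears in the first k ranked predictions."""
    hits = sum(
        1 for truth, ranked in zip(true_labels, ranked_predictions)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


def csmf_accuracy(true_labels, predicted_labels):
    """Population-level agreement between true and predicted cause fractions.

    Assumed formula (not confirmed by the report):
    1 - sum|true_frac - pred_frac| / (2 * (1 - min true_frac)).
    Undefined when every case has the same true cause.
    """
    n = len(true_labels)
    true_frac = {c: v / n for c, v in Counter(true_labels).items()}
    pred_frac = {c: v / n for c, v in Counter(predicted_labels).items()}
    causes = set(true_frac) | set(pred_frac)
    abs_error = sum(
        abs(true_frac.get(c, 0.0) - pred_frac.get(c, 0.0)) for c in causes
    )
    smallest_true = min(true_frac.get(c, 0.0) for c in causes)
    return 1.0 - abs_error / (2.0 * (1.0 - smallest_true))


# Toy example with made-up labels (not study data).
truth = ["a", "a", "b", "c"]
ranked = [["a", "b"], ["b", "a"], ["c", "b"], ["c", "a"]]
top1 = top_k_accuracy(truth, ranked, k=1)
top2 = top_k_accuracy(truth, ranked, k=2)
csmf = csmf_accuracy(truth, [r[0] for r in ranked])
```

CSMF accuracy can sit far above top-1 because it scores only the predicted cause distribution, so individual errors that cancel each other out are not penalized.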
Key Takeaways
Most important points.
Takeaway 1
The clearest current result is the combined ensemble.
On the matched random 5-fold comparison cohort, the Combined model (structured log-reg + Gemini) reaches 61.3% top-1 accuracy, beating the Structured model (regularized log-reg) (56.7%) by +4.5 pp and the LLM model (Gemini) (55.6%) by +5.6 pp.
Takeaway 2
Age-aware routing is the strongest next-step improvement.
On the full-feature random split, the best age-aware ensemble reaches 62.2% top-1, versus 58.9% for the flat structured model (regularized log-reg) (+3.3 pp).
Takeaway 3
Embedding models help, but they are not yet the lead result.
The Embedding model (Gemma + log-reg) reaches 54.8% top-1, and the Embedding model (Qwen + log-reg) reaches 51.0%. Both remain behind the strongest structured model (regularized log-reg) results.
Takeaway 4
Rarer causes remain the main challenge.
The two biggest causes account for 54.8% of the cohort. Results are much stronger on common perinatal causes than on rarer infectious or mixed categories.
Takeaway 5
Synthetic interviews are useful for illustration, not yet for a headline performance gain.
Transcript generation validated cleanly on 200/200 examples, but the head-to-head augmentation check is still near baseline: synthetic top-1 48.5% vs baseline 49.5% (-1.0 pp).
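The age-aware routing in Takeaway 2 is not specified in detail here. One plausible reading is a router that sends each case to a model trained on its own age group and falls back to a flat model for small or unseen groups. The sketch below illustrates that idea only. `MajorityModel`, `AgeRoutedEnsemble`, the `age_group` key, and `min_group_size` are all hypothetical names, and the stand-in classifier simply predicts the most frequent label.

```python
from collections import Counter, defaultdict


class MajorityModel:
    """Stand-in classifier: always predicts the most frequent training label."""

    def fit(self, rows, labels):
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, row):
        return self.label_


class AgeRoutedEnsemble:
    """Route each case to a per-age-group model, else to a flat fallback."""

    def __init__(self, make_model, min_group_size=2):
        self.make_model = make_model
        self.min_group_size = min_group_size

    def fit(self, rows, labels):
        # The flat model is trained on everything and used as the fallback.
        self.fallback_ = self.make_model().fit(rows, labels)
        grouped = defaultdict(lambda: ([], []))
        for row, label in zip(rows, labels):
            grouped[row["age_group"]][0].append(row)
            grouped[row["age_group"]][1].append(label)
        # Only groups with enough cases get their own model.
        self.models_ = {
            group: self.make_model().fit(group_rows, group_labels)
            for group, (group_rows, group_labels) in grouped.items()
            if len(group_rows) >= self.min_group_size
        }
        return self

    def predict(self, row):
        model = self.models_.get(row["age_group"], self.fallback_)
        return model.predict(row)


# Toy data (made up): age group plus cause label.
rows = [{"age_group": g} for g in
        ["Stillbirth"] * 4 + ["Child"] * 2 + ["Infant"]]
labels = ["asphyxia", "asphyxia", "asphyxia", "preterm",
          "malaria", "malaria", "malnutrition"]
router = AgeRoutedEnsemble(MajorityModel).fit(rows, labels)
```

In practice the stand-in would be replaced by whichever structured classifier is in use; the routing logic stays the same.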
Results At A Glance
The executive view below is meant for direct sharing. Here, the structured baseline is regularized logistic regression on coded features,
the combined model ensembles that baseline with Gemini, and the embedding variants pair narrative embeddings with logistic regression.
Approach | Top-1 | Why It Matters
Combined model (structured log-reg + Gemini) | 61.3% | Best current directly matched result and the clearest external headline.
Structured model (regularized log-reg) | 56.7% | Strong baseline using coded VA, clinical, and lab-style features.
LLM model (Gemini) | 55.6% | Readable and flexible, but not as strong as the combined approach.
Age-aware ensemble | 62.2% | Best experimental extension in the current repo.
Embedding model (Gemma + log-reg) | 54.8% | Narrative-only model with stronger representations than the smaller embedding run.
Best directly matched result: 61.3% top-1 for the Combined model (structured log-reg + Gemini).
Model | N | Top-1 | Macro-F1 | Interpretation
Combined model (structured log-reg + Gemini) | 9,149 | 61.3% | 0.118 | Best matched result on the direct head-to-head cohort.
Structured model (regularized log-reg) | 9,149 | 56.7% | 0.111 | Best structured log-reg baseline in the matched analysis.
LLM model (Gemini) | 9,149 | 55.6% | 0.074 | LLM review of serialized patient records.
Embedding model (Qwen + log-reg) | 9,149 | 51.0% | 0.032 | Narrative embedding model with logistic regression.
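The gap between top-1 (above 50%) and macro-F1 (near 0.1) in the matched results follows from class imbalance: macro-F1 averages per-cause F1 with equal weight, so many rarely predicted causes pull it down. The sketch below shows the effect on made-up labels. It averages over the union of true and predicted classes, which is an assumption about how the reported macro-F1 was computed.

```python
def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    scores = []
    for cls in classes:
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: a predictor that always outputs the dominant cause.
truth = ["asphyxia"] * 8 + ["malaria", "sepsis"]
preds = ["asphyxia"] * 10
accuracy = sum(t == p for t, p in zip(truth, preds)) / len(truth)
score = macro_f1(truth, preds)
```

Here accuracy is 0.8 while macro-F1 is about 0.30, because two of the three causes contribute an F1 of zero.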
Additional Model Results
These rows stay within the same random-split regime but come from slightly different eligible cohorts.
Variant | N | Top-1 | Top-3 | Top-5 | Macro-F1 | Note
Structured model (full features, regularized log-reg) | 9,542 | 58.9% | 79.4% | 86.4% | 0.088 | Best flat structured log-reg baseline.
Structured model (VA + demographics, regularized log-reg) | 9,153 | 56.8% | 77.1% | n/a | 0.075 | Simpler structured log-reg baseline.
Structured model (+ MITS/lab, regularized log-reg) | 9,153 | 56.4% | 78.7% | n/a | 0.097 | Adds MITS and lab-style signals.
Embedding model (Gemma + log-reg) | 9,153 | 54.8% | 75.8% | 84.2% | 0.040 | Best standalone embedding artifact currently in the repo.
Embedding model (Qwen + log-reg) | 9,153 | 51.0% | 73.2% | n/a | 0.032 | Smaller narrative-only model; useful but not the lead result.
Combined model (confidence-margin ensemble) | 9,140 | 59.6% | n/a | n/a | n/a | Combined decision rule; +2.9 pp vs the structured log-reg baseline.
Age-aware ensemble | 9,542 | 62.2% | n/a | n/a | 0.107 | Best overall top-1 in the current random-split artifacts.
Recommended framing: lead externally with the matched Combined model (structured log-reg + Gemini) result
(61.3% top-1) and keep the age-aware result
(62.2%) as the strongest forward-looking extension.
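The confidence-margin ensemble is described only as a combined decision rule. One common form of such a rule keeps the structured prediction when the gap between its top two probabilities is large and defers to the LLM label otherwise. The sketch below shows that form as an illustration, not as the rule actually used; `combine`, `margin`, and the 0.2 threshold are hypothetical.

```python
def margin(probabilities):
    """Gap between the largest and second-largest class probability."""
    ordered = sorted(probabilities.values(), reverse=True)
    runner_up = ordered[1] if len(ordered) > 1 else 0.0
    return ordered[0] - runner_up


def combine(structured_probs, llm_label, threshold=0.2):
    """Keep the structured prediction when it is confident, else use the LLM.

    threshold is a hypothetical value; in practice it would be tuned on
    held-out folds so that test labels do not leak into the rule.
    """
    structured_label = max(structured_probs, key=structured_probs.get)
    if llm_label is None or margin(structured_probs) >= threshold:
        return structured_label
    return llm_label


# Toy probabilities (made up).
confident = {"asphyxia": 0.7, "preterm": 0.2, "sepsis": 0.1}
uncertain = {"asphyxia": 0.40, "preterm": 0.35, "sepsis": 0.25}
kept = combine(confident, "preterm")
deferred = combine(uncertain, "sepsis")
no_llm = combine(uncertain, None)
```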
Cohort Overview
The dataset is broad, but the label distribution is still dominated by a small number of perinatal causes.
Top causes dominate the task
The top two causes already make up 54.8% of the cohort.
Perinatal asphyxia/hypoxia | 40.8% | 3,733
Neonatal preterm birth complications | 14.0% | 1,284
Congenital birth defects | 7.9% | 721
Malnutrition | 5.2% | 479
Neonatal sepsis | 4.8% | 437
Undetermined | 4.1% | 375
Malaria | 3.6% | 332
Lower respiratory infections | 3.2% | 294
Congenital infection | 2.2% | 203
Diarrheal Diseases | 1.8% | 165
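As a quick check on the imbalance figures, the shares above can be recomputed from the listed counts and the 9,153-case full cohort. The same arithmetic gives a useful reference point: a predictor that always outputs the largest cause would score that cause's share as top-1 on the full cohort. Variable names are illustrative.

```python
COHORT_SIZE = 9153  # full cohort size reported above

cause_counts = {
    "Perinatal asphyxia/hypoxia": 3733,
    "Neonatal preterm birth complications": 1284,
    "Congenital birth defects": 721,
    "Malnutrition": 479,
    "Neonatal sepsis": 437,
    "Undetermined": 375,
    "Malaria": 332,
    "Lower respiratory infections": 294,
    "Congenital infection": 203,
    "Diarrheal Diseases": 165,
}


def share(count, total=COHORT_SIZE):
    """Percentage of the cohort, rounded to one decimal place."""
    return round(100 * count / total, 1)


largest_two = sorted(cause_counts.values(), reverse=True)[:2]
top_two_share = share(sum(largest_two))               # two biggest causes
majority_baseline = share(max(cause_counts.values())) # always predict the largest
listed_share = share(sum(cause_counts.values()))      # all ten listed causes
```

The ten listed causes cover roughly 88% of the cohort, so the remaining causes each have very few examples to learn from.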
Age groups are highly imbalanced
Stillbirth and neonatal deaths make up most of the current processed dataset.
Stillbirth | 38.0% | 3,481
Death in the first 24 hours | 14.6% | 1,338
Early Neonate (1 to 6 days) | 16.5% | 1,507
Late Neonate (7 to 27 days) | 6.9% | 634
Infant (28 days to less than 12 months) | 12.7% | 1,158
Child (12 months to less than 60 Months) | 11.3% | 1,035
Geography is broad but not uniform
The core footprint spans several African sites plus Bangladesh (BD), with small Pakistan (PK) and Nigeria (NG) tails.
MZ | 17.1% | 1,562
ET | 16.4% | 1,498
ZA | 15.7% | 1,440
KE | 15.3% | 1,400
SL | 13.4% | 1,224
BD | 12.9% | 1,178
ML | 8.0% | 733
PK | 1.2% | 107
NG | 0.1% | 11
Main model ranking
Top-1 accuracy on the directly matched comparison cohort.
Combined model (structured log-reg + Gemini) | 61.3% | Best matched result on the direct head-to-head cohort.
Structured model (regularized log-reg) | 56.7% | Best structured log-reg baseline in the matched analysis.
LLM model (Gemini) | 55.6% | LLM review of serialized patient records.
Embedding model (Qwen + log-reg) | 51.0% | Narrative embedding model with logistic regression.
Where Results Are Stronger And Weaker
The combined approach adds value, but the improvements are not uniform. Common neonatal and perinatal patterns
are much easier than rarer infectious or mixed categories.
Combined ensemble accuracy by age group
Age groups where the structured log-reg + Gemini ensemble is strongest.
Stillbirth | 78.0%
Death in the first 24 hours | 60.3%
Early Neonate (1 to 6 days) | 56.9%
Late Neonate (7 to 27 days) | 49.1%
Child (12 months to less than 60 Months) | 43.4%
Infant (28 days to less than 12 months) | 27.1%
Stillbirth is easiest; infant and later-child categories remain much harder.
How the leak-safe ensemble improved
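A stepwise view from the leak-safe ablation. Weight tuning delivers most of the gain (58.1% to 61.1% accuracy); calibration adds 0.4 pp, balance tuning adds nothing further,
and the embedding add-on contributes only a marginal final lift (+0.1 pp) while macro-F1 dips from 0.136 to 0.134.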
Structured model (regularized log-reg) | 58.1% | macro-F1 0.129
Structured model (log-reg) + weight tuning | 61.1% | macro-F1 0.132
Structured model (log-reg) + calibration | 61.5% | macro-F1 0.136
Structured model (log-reg) + balance tuning | 61.5% | macro-F1 0.136
Combined ensemble + embedding signal | 61.6% | macro-F1 0.134
Recovered cases over structured log-reg, by age group
Counts of cases fixed by the combined ensemble compared with the structured log-reg baseline.
Early Neonate (1 to 6 days) | 256 | 17.0% of group
Death in the first 24 hours | 187 | 14.0% of group
Infant (28 days to less than 12 months) | 179 | 15.5% of group
Child (12 months to less than 60 Months) | 140 | 13.5% of group
Late Neonate (7 to 27 days) | 126 | 19.9% of group
Stillbirth | 111 | 3.2% of group
Recovered cases over structured log-reg, by site
The gain is geographically broad rather than concentrated in a single site.
ZA | 186 | 12.9% of site
KE | 177 | 12.6% of site
MZ | 171 | 10.9% of site
ET | 157 | 10.5% of site
SL | 133 | 10.9% of site
BD | 87 | 7.4% of site
ML | 67 | 9.1% of site
PK | 21 | 19.6% of site
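The recovered-case counts above can be reproduced with a simple per-group tally. The sketch assumes "recovered" means the structured baseline was wrong and the combined ensemble was right; the report does not define the term explicitly. The percentage is taken over the whole group, which matches the listed figures (for example, 21 of 107 PK cases is 19.6%). Field names and the toy cases are hypothetical.

```python
from collections import Counter


def recovered_by_group(cases):
    """Per group: (recovered count, recovered share of the whole group).

    A case counts as recovered when the baseline prediction is wrong and
    the ensemble prediction is right (assumed definition).
    """
    totals = Counter()
    recovered = Counter()
    for case in cases:
        totals[case["group"]] += 1
        baseline_wrong = case["baseline"] != case["truth"]
        ensemble_right = case["ensemble"] == case["truth"]
        if baseline_wrong and ensemble_right:
            recovered[case["group"]] += 1
    return {g: (recovered[g], recovered[g] / totals[g]) for g in totals}


# Toy cases (made up), grouped by site code.
cases = [
    {"group": "PK", "truth": "sepsis", "baseline": "asphyxia", "ensemble": "sepsis"},
    {"group": "PK", "truth": "asphyxia", "baseline": "asphyxia", "ensemble": "asphyxia"},
    {"group": "KE", "truth": "malaria", "baseline": "malaria", "ensemble": "sepsis"},
    {"group": "KE", "truth": "malaria", "baseline": "sepsis", "ensemble": "sepsis"},
]
summary = recovered_by_group(cases)
```

This tally counts gains only. Cases the ensemble breaks (baseline right, ensemble wrong, like the first KE case) need a separate count to obtain the net change.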
Performance On Common Causes
Darker cells are better within each row.
Cause | N | Structured model (regularized log-reg) | LLM model (Gemini) | Combined model (structured log-reg + Gemini) | Embedding model (Qwen + log-reg)
perinatal asphyxia/hypoxia | 3,731 | 88.9% | 84.7% | 88.8% | 90.6%
neonatal preterm birth complications | 1,284 | 65.2% | 92.1% | 87.9% | 65.7%
congenital birth defects | 721 | 35.6% | 24.5% | 30.7% | 0.3%
malnutrition | 478 | 48.1% | 23.4% | 60.0% | 73.4%
neonatal sepsis | 437 | 14.9% | 20.8% | 36.8% | 0.0%
undetermined | 375 | 4.8% | 5.3% | 5.3% | 0.0%
malaria | 331 | 68.0% | 46.2% | 62.5% | 26.6%
lower respiratory infections | 294 | 16.7% | 18.4% | 24.8% | 0.0%
congenital infection | 203 | 1.0% | 0.0% | 4.9% | 0.0%
diarrheal diseases | 165 | 31.5% | 27.3% | 35.2% | 0.0%
HIV | 162 | 25.3% | 19.1% | 19.8% | 0.0%
other neonatal disorders | 147 | 2.0% | 3.4% | 1.4% | 0.0%
Persistent Challenge Areas
These are the causes that remain difficult across model families.
Cause | N | Structured model (regularized log-reg) | LLM model (Gemini) | Combined model (structured log-reg + Gemini) | Embedding model (Qwen + log-reg)
undetermined | 375 | 4.8% | 5.3% | 5.3% | 0.0%
lower respiratory infections | 294 | 16.7% | 18.4% | 24.8% | 0.0%
congenital infection | 203 | 1.0% | 0.0% | 4.9% | 0.0%
other neonatal disorders | 147 | 2.0% | 3.4% | 1.4% | 0.0%
sepsis | 134 | 3.0% | 4.5% | 9.0% | 0.0%
neonatal aspiration syndromes | 92 | 3.3% | 3.3% | 3.3% | 0.0%
other infections | 69 | 2.9% | 0.0% | 0.0% | 0.0%
Review One Challenge Area
Each category below has a quick summary of where the main approaches still struggle.
lower respiratory infections (294 records)
Common clinical overlap with sepsis and other infections still makes this category hard.
Structured model (regularized log-reg): 16.7%
LLM model (Gemini): 18.4%
Combined model (structured log-reg + Gemini): 24.8%
Embedding model (Qwen + log-reg): 0.0%
congenital infection (203 records)
Rare and clinically mixed neonatal infection patterns are still difficult to separate cleanly.
Structured model (regularized log-reg): 1.0%
LLM model (Gemini): 0.0%
Combined model (structured log-reg + Gemini): 4.9%
Embedding model (Qwen + log-reg): 0.0%
undetermined (375 records)
Ambiguous records remain challenging by definition and are unlikely to have a single clean signal source.
Structured model (regularized log-reg): 4.8%
LLM model (Gemini): 5.3%
Combined model (structured log-reg + Gemini): 5.3%
Embedding model (Qwen + log-reg): 0.0%
other neonatal disorders (147 records)
Catch-all buckets remain hard for every model family.
Structured model (regularized log-reg): 2.0%
LLM model (Gemini): 3.4%
Combined model (structured log-reg + Gemini): 1.4%
Embedding model (Qwen + log-reg): 0.0%
sepsis (134 records)
Broader sepsis patterns remain difficult outside the highest-volume categories.
Structured model (regularized log-reg): 3.0%
LLM model (Gemini): 4.5%
Combined model (structured log-reg + Gemini): 9.0%
Embedding model (Qwen + log-reg): 0.0%
Illustrative VA Narrative Prediction Examples
These de-identified VA narratives are shown in full because the ambiguity often sits in the complete sequence of symptoms,
pregnancy history, delivery course, and care-seeking details rather than in a short excerpt.
Misclassified | True: neonatal preterm birth complications | Predicted: perinatal asphyxia/hypoxia | Site: SL | Age: Early Neonate (1 to 6 days)
Preterm and perinatal asphyxia often share the same narrative signals around gestational age, delivery complications, and a very short neonatal course.
Misclassified | True: neonatal preterm birth complications | Predicted: perinatal asphyxia/hypoxia | Site: SL | Age: Death in the first 24 hours
Preterm and perinatal asphyxia often share the same narrative signals around gestational age, delivery complications, and a very short neonatal course.
Misclassified | True: neonatal preterm birth complications | Predicted: perinatal asphyxia/hypoxia | Site: SL | Age: Late Neonate (7 to 27 days)
Preterm and perinatal asphyxia often share the same narrative signals around gestational age, delivery complications, and a very short neonatal course.
Misclassified | True: neonatal preterm birth complications | Predicted: perinatal asphyxia/hypoxia | Site: SL | Age: Death in the first 24 hours
Preterm and perinatal asphyxia often share the same narrative signals around gestational age, delivery complications, and a very short neonatal course.
Misclassified | True: perinatal asphyxia/hypoxia | Predicted: neonatal preterm birth complications | Site: SL | Age: Early Neonate (1 to 6 days)
Even detailed pregnancy and delivery narratives can still pull the model toward prematurity when the neonatal decline is brief and unstable.
Misclassified | True: perinatal asphyxia/hypoxia | Predicted: neonatal preterm birth complications | Site: SL | Age: Early Neonate (1 to 6 days)
Even detailed pregnancy and delivery narratives can still pull the model toward prematurity when the neonatal decline is brief and unstable.
Misclassified | True: perinatal asphyxia/hypoxia | Predicted: neonatal preterm birth complications | Site: ET | Age: Early Neonate (1 to 6 days)
Even detailed pregnancy and delivery narratives can still pull the model toward prematurity when the neonatal decline is brief and unstable.
Misclassified | True: perinatal asphyxia/hypoxia | Predicted: neonatal preterm birth complications | Site: SL | Age: Late Neonate (7 to 27 days)
Even detailed pregnancy and delivery narratives can still pull the model toward prematurity when the neonatal decline is brief and unstable.
Correctly classified | True: lower respiratory infections | Predicted: lower respiratory infections | Site: SL | Age: Infant (28 days to less than 12 months)
This longer narrative is a useful counterexample: the model sometimes does stay on the right cause when the symptom pattern is specific enough.
Correctly classified | True: malaria | Predicted: malaria | Site: SL | Age: Child (12 months to less than 60 Months)
This longer narrative is a useful counterexample: the model sometimes does stay on the right cause when the symptom pattern is specific enough.
Correctly classified | True: congenital birth defects | Predicted: congenital birth defects | Site: SL | Age: Death in the first 24 hours
This longer narrative is a useful counterexample: the model sometimes does stay on the right cause when the symptom pattern is specific enough.
Correctly classified | True: malnutrition | Predicted: malnutrition | Site: SL | Age: Child (12 months to less than 60 Months)
This longer narrative is a useful counterexample: the model sometimes does stay on the right cause when the symptom pattern is specific enough.
Correctly classified | True: diarrheal diseases | Predicted: diarrheal diseases | Site: SL | Age: Infant (28 days to less than 12 months)
This longer narrative is a useful counterexample: the model sometimes does stay on the right cause when the symptom pattern is specific enough.
Illustrative Synthetic Interview Examples
These examples are most useful as communication aids. Each transcript is generated by asking the model to rewrite the full
structured case into an INTERVIEWER / CAREGIVER exchange that covers symptoms, timing, care-seeking,
environmental context, and charted clinical findings without naming the diagnosis.
Narrative validation
200/200
Short synthetic narratives passed validation.
Transcript validation
200/200
Long interview transcripts also validated cleanly.
Median transcript length
1,353 words
Transcript mode is long-form and presentation-ready.
Head-to-head agreement
65.7%
Predictions from real and synthetic narratives agree on about two-thirds of cases.
Current conclusion: on the 99-case synthetic head-to-head subset,
real narratives scored 47.5% top-1,
synthetic narratives scored 48.5%,
and the baseline condition scored 49.5%.
This is promising enough to keep exploring, but not yet strong enough to headline as a performance win.
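A short sketch helps put the head-to-head numbers in scale. On a 99-case subset one case is worth about 1 pp, and the reported percentages are consistent with 47, 48, and 49 correct cases and 65 agreeing cases. Those counts are inferred from the percentages, not reported in the source. `agreement_rate` shows one straightforward way to compute head-to-head agreement; the function names and toy labels are illustrative.

```python
def agreement_rate(predictions_a, predictions_b):
    """Share of cases where two prediction lists give the same label."""
    if len(predictions_a) != len(predictions_b):
        raise ValueError("prediction lists must cover the same cases")
    matches = sum(a == b for a, b in zip(predictions_a, predictions_b))
    return matches / len(predictions_a)


def as_percent(count, total):
    """Percentage rounded to one decimal place, as in the report."""
    return round(100 * count / total, 1)


SUBSET_SIZE = 99  # synthetic head-to-head subset size reported above

# Inferred counts (assumption): which whole-number counts reproduce the
# reported 47.5%, 48.5%, 49.5% top-1 and 65.7% agreement.
implied = {count: as_percent(count, SUBSET_SIZE) for count in (47, 48, 49, 65)}

# Toy usage of agreement_rate with made-up labels.
toy_agreement = agreement_rate(
    ["malaria", "sepsis", "asphyxia"],
    ["malaria", "asphyxia", "asphyxia"],
)
```

Under that reading, the gaps between the three conditions amount to a single case each, which supports treating the augmentation result as inconclusive.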