Validating Large Language Models for Title-Abstract Screening in Low-Prevalence Systematic Reviews: An Environmental Science Case Study

Scientific article

Publication year: 2026
Journal: Information
External links: DOI; Nasjonalt vitenarkiv
NIVA contributors: Isabel Seifert-Dähnn; Andrea Merlina; Mahla Rashidian; Samuel A. S. Welch; Jemmima Knight
Authors: Maximilian Nawrath, Andrea Merlina, Jemmima Knight, Sam A. Welch, Mahla Rashidian, Isabel Seifert-Dähnn

Abstract

Literature screening is a major bottleneck in systematic reviews, and Large Language Models (LLMs) can substantially reduce the workload. However, performance varies across models and is sensitive to the choice of evaluation metric, particularly in low-prevalence screening contexts. We validated five LLMs (GPT-4.1, Claude 3.5 Sonnet, Gemini 2.0 Flash, DeepSeek V3, and Mistral Large) against a 500-record gold-standard dataset (8 inclusions; 1.6% prevalence) using a conservative zero-shot prompt aligned with standard systematic review workflows. Performance was assessed through classification metrics (sensitivity, specificity, precision), logistic regression (GLM; Firth-penalised where separation occurred), and agreement indices (Cohen’s κ, Matthews correlation coefficient (MCC), prevalence-adjusted bias-adjusted kappa (PABAK), and Gwet’s AC1). Gemini 2.0 Flash and Mistral Large produced no false negatives (sensitivity 1.00) but differed in specificity (0.858 vs. 0.697) and accuracy (0.860 vs. 0.702). GPT-4.1 and Claude 3.5 Sonnet performed identically (sensitivity 0.875; specificity 0.876; accuracy 0.876). In contrast, DeepSeek V3 maximised specificity (0.980) and accuracy (0.970) but showed markedly lower sensitivity (0.375). Regression analyses confirmed strong positive associations between model and human decisions (odds ratios 28.9–49.5). Agreement indices revealed the expected low-prevalence artefact: Cohen’s κ was low despite high concordance, whereas MCC, PABAK, and AC1 indicated substantially stronger agreement. Our results highlight a fundamental sensitivity-specificity trade-off, with conclusions dependent on the evaluation framework chosen. LLMs may meaningfully support title-abstract screening as decision-support tools, provided that human oversight is maintained and validation is transparent and reproducible.
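
As an illustration of how the reported classification metrics and agreement indices relate to one another, the following minimal Python sketch computes them from a 2x2 confusion matrix. The cell counts are not given in the abstract: they are an assumption, back-calculated from the reported rates with 8 inclusions and 492 exclusions, and may differ from the authors' actual tables. The function name `screening_metrics` and the `COUNTS` dictionary are hypothetical, and the formulas are the standard two-rater, two-category definitions rather than the authors' own code.

```python
from math import sqrt

# ASSUMPTION: counts reconstructed from the rounded rates in the abstract
# (8 inclusions, 492 exclusions); they are not reported by the authors.
COUNTS = {
    "GPT-4.1 / Claude 3.5 Sonnet": dict(tp=7, fn=1, tn=431, fp=61),
    "Gemini 2.0 Flash": dict(tp=8, fn=0, tn=422, fp=70),
    "Mistral Large": dict(tp=8, fn=0, tn=343, fp=149),
    "DeepSeek V3": dict(tp=3, fn=5, tn=482, fp=10),
}


def screening_metrics(tp, fp, fn, tn):
    """Classification metrics and agreement indices for one 2x2 table."""
    n = tp + fp + fn + tn
    po = (tp + tn) / n  # observed agreement (equals accuracy)

    # Cohen's kappa: chance agreement from the product of the marginals.
    pe_kappa = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2

    # Gwet's AC1: chance agreement from the mean "include" proportion.
    pi = ((tp + fp) + (tp + fn)) / (2 * n)
    pe_ac1 = 2 * pi * (1 - pi)

    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": po,
        "kappa": (po - pe_kappa) / (1 - pe_kappa),
        "pabak": 2 * po - 1,
        "ac1": (po - pe_ac1) / (1 - pe_ac1),
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }


for model, cells in COUNTS.items():
    m = screening_metrics(**cells)
    print(model, {name: round(value, 3) for name, value in m.items()})
```

With so few inclusions, the chance-agreement term in Cohen's κ is already large, so κ stays small even when accuracy is high; PABAK and AC1 use different chance corrections and are less affected by the skewed marginals.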