
Validating Large Language Models for Title-Abstract Screening in Low-Prevalence Systematic Reviews: An Environmental Science Case Study

Academic article
Year of publication
2026
Journal
Information
External websites
DOI
National research archive
Authors
Maximilian Nawrath, Andrea Merlina, Jemmima Knight, Sam A. Welch, Mahla Rashidian, Isabel Seifert-Dähnn

Abstract

Literature screening is a major bottleneck in systematic reviews, and Large Language Models (LLMs) can substantially reduce the workload. However, performance varies across models and is sensitive to the choice of evaluation metric, particularly in low-prevalence screening contexts. We validated five LLMs (GPT-4.1, Claude 3.5 Sonnet, Gemini 2.0 Flash, DeepSeek V3, and Mistral Large) against a 500-record gold-standard dataset (8 inclusions; 1.6% prevalence) using a conservative zero-shot prompt aligned with standard systematic review workflows. Performance was assessed through classification metrics (sensitivity, specificity, precision), logistic regression (GLM; Firth-penalised where separation occurred), and agreement indices (Cohen’s κ, MCC, PABAK, Gwet’s AC1). Gemini 2.0 Flash and Mistral Large produced no false negatives (sensitivity 1.00) but differed in specificity (0.858 vs. 0.697) and accuracy (0.860 vs. 0.702). GPT-4.1 and Claude 3.5 Sonnet performed identically (sensitivity 0.875; specificity 0.876; accuracy 0.876). In contrast, DeepSeek V3 achieved the highest specificity (0.980) and accuracy (0.970) but the lowest sensitivity (0.375). Regression analyses confirmed strong positive associations between model and human decisions (OR 28.9–49.5). Agreement indices revealed the expected low-prevalence artefact: Cohen’s κ was low despite high concordance, while MCC, PABAK, and AC1 indicated substantially stronger agreement. Our results highlight a fundamental sensitivity-specificity trade-off, with conclusions dependent on the evaluation framework chosen. LLMs may meaningfully support title-abstract screening as decision-support tools, provided that human oversight is maintained and validation is transparent and reproducible.
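
The metrics named in the abstract can all be derived from a single 2x2 table of model decisions against human decisions. The following is a minimal Python sketch using the textbook formulas for each index, not the authors' analysis code. The `Confusion` class and both function names are hypothetical. The example counts (7 true positives, 1 false negative, 61 false positives, 431 true negatives) are not reported in the abstract; they are back-calculated here from the rates given for GPT-4.1 (sensitivity 0.875, specificity 0.876, 8 inclusions among 500 records) and should be read as an illustrative assumption.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Confusion:
    """2x2 table of model decisions against the human gold standard."""
    tp: int  # model includes, human includes
    fn: int  # model excludes, human includes
    fp: int  # model includes, human excludes
    tn: int  # model excludes, human excludes

    @property
    def n(self) -> int:
        return self.tp + self.fn + self.fp + self.tn


def screening_metrics(c: Confusion) -> dict[str, float]:
    """Classification metrics; no guard against empty rows or columns."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "precision": c.tp / (c.tp + c.fp),
        "accuracy": (c.tp + c.tn) / c.n,
    }


def agreement_indices(c: Confusion) -> dict[str, float]:
    """Cohen's kappa, MCC, PABAK and Gwet's AC1 for two raters, two categories."""
    p_obs = (c.tp + c.tn) / c.n          # observed agreement
    human_pos = (c.tp + c.fn) / c.n      # share of records the human included
    model_pos = (c.tp + c.fp) / c.n      # share of records the model included

    # Cohen's kappa: chance agreement from each rater's own marginals.
    pe_kappa = human_pos * model_pos + (1 - human_pos) * (1 - model_pos)
    kappa = (p_obs - pe_kappa) / (1 - pe_kappa)

    # Matthews correlation coefficient.
    mcc = (c.tp * c.tn - c.fp * c.fn) / sqrt(
        (c.tp + c.fp) * (c.tp + c.fn) * (c.tn + c.fp) * (c.tn + c.fn)
    )

    # PABAK: kappa with chance agreement fixed at 0.5.
    pabak = 2 * p_obs - 1

    # Gwet's AC1: chance agreement from the mean inclusion rate of both raters.
    pi = (human_pos + model_pos) / 2
    pe_ac1 = 2 * pi * (1 - pi)
    ac1 = (p_obs - pe_ac1) / (1 - pe_ac1)

    return {"kappa": kappa, "mcc": mcc, "pabak": pabak, "ac1": ac1}


if __name__ == "__main__":
    # ASSUMPTION: counts inferred from the reported GPT-4.1 rates, not from the paper.
    example = Confusion(tp=7, fn=1, fp=61, tn=431)
    for name, value in {**screening_metrics(example), **agreement_indices(example)}.items():
        print(f"{name:12s} {value:.3f}")
```

For these illustrative counts, κ comes out far below PABAK and AC1 even though observed agreement is 0.876. The cause is the chance-agreement term: with so few inclusions, κ expects two raters to agree on most exclusions by chance alone, which leaves little room for agreement beyond chance.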