Benchmarking Multiple Large Language Models for Automated Clinical Trial Data Extraction in Aging Research
Abstract
Large language models (LLMs) show promise for automating evidence synthesis, yet head-to-head evaluations remain scarce. We benchmarked five state-of-the-art LLMs (openai/o1-mini, x-ai/grok-2-1212, meta-llama/Llama-3.3-70B-Instruct, google/Gemini-Flash-1.5-8B, and deepseek/DeepSeek-R1-70B-Distill) on extracting protocol details from transcranial direct-current stimulation (tDCS) trials enrolling older adults. A multi-LLM ensemble pipeline ingested ClinicalTrials.gov records, applied a structured JSON schema, and generated comparable outputs from unstructured text. The pipeline retrieved 83 aging-related tDCS trials, roughly double the yield of a conventional keyword search. Across models, agreement was almost perfect for the binary field "brain stimulation used" (Fleiss κ ≈ 0.92) and substantial for the categorical field "primary target" (κ ≈ 0.71). Numeric parameters such as stimulation intensity and session duration showed excellent consistency when explicitly reported (ICC 0.95–0.96); secondary targets and free-text duration phrases remained challenging (κ ≈ 0.61; ICC ≈ 0.35). An ensemble consensus (majority vote for categorical fields, averaging for numeric fields) resolved most disagreements and delivered near-perfect reliability on core stimulation attributes (κ = 0.94). These results demonstrate that multi-LLM ensembles can markedly expand trial coverage and reach expert-level accuracy on well-defined fields, while nuanced or sparsely reported details still require human oversight. The benchmark and open-source workflow set a solid baseline for future advances in prompt engineering, model specialization, and ensemble strategies aimed at fully automated evidence synthesis in neurostimulation research involving aging populations.
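To make the extraction step concrete, the following is a minimal Python sketch of the ingest-and-extract stage, assuming the public ClinicalTrials.gov v2 REST API and an OpenRouter-style chat-completions endpoint (the model identifiers above follow OpenRouter conventions). The schema fields and helper names (fetch_trials, extract) are illustrative, not the paper's exact implementation.

```python
# Sketch of the extraction pipeline: fetch trial records, then ask each LLM
# to fill a shared JSON schema. Endpoint choices and field names are
# assumptions for illustration, not the authors' exact code.
import json
import requests

CTGOV = "https://clinicaltrials.gov/api/v2/studies"
SCHEMA = {  # illustrative stand-in for the paper's structured JSON schema
    "brain_stimulation_used": "yes | no",
    "primary_target": "string, e.g. 'left DLPFC'",
    "secondary_target": "string or null",
    "intensity_mA": "number or null",
    "session_duration_min": "number or null",
}

def fetch_trials(term: str, page_size: int = 100) -> list[dict]:
    """Pull study records matching a free-text term (e.g., 'tDCS aging')."""
    resp = requests.get(CTGOV, params={"query.term": term, "pageSize": page_size})
    resp.raise_for_status()
    return resp.json().get("studies", [])

def extract(model: str, trial_text: str, api_key: str) -> dict:
    """Ask one LLM to map unstructured trial text onto the shared schema."""
    prompt = (
        "Extract the following fields as a JSON object matching this schema. "
        "Use null for fields the record does not report.\n"
        f"{json.dumps(SCHEMA, indent=2)}\n\nTrial record:\n{trial_text}"
    )
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
    )
    resp.raise_for_status()
    # A production pipeline would validate and repair malformed JSON here.
    return json.loads(resp.json()["choices"][0]["message"]["content"])
```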
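The consensus step can be sketched similarly: majority vote over categorical fields and a simple average over numeric ones, applied to the five per-model extractions for each trial. The field groupings and tie handling below are illustrative assumptions.

```python
# Sketch of the ensemble consensus: majority vote for categorical fields,
# averaging for numeric fields. Missing values are skipped so one model's
# null does not block agreement among the others.
from collections import Counter
from statistics import mean

CATEGORICAL = {"brain_stimulation_used", "primary_target", "secondary_target"}
NUMERIC = {"intensity_mA", "session_duration_min"}

def consensus(outputs: list[dict]) -> dict:
    """Merge per-model extractions for one trial into a single record."""
    merged = {}
    for field in CATEGORICAL:
        votes = [o[field] for o in outputs if o.get(field) is not None]
        if votes:
            merged[field] = Counter(votes).most_common(1)[0][0]  # majority vote
    for field in NUMERIC:
        values = [o[field] for o in outputs
                  if isinstance(o.get(field), (int, float))]
        if values:
            merged[field] = mean(values)  # average across reporting models
    return merged
```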