Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis.

Can, Elif; Uller, Wibke; Vogt, Katharina; Doppler, Michael C; Busch, Felix; Bayerl, Nadine; Ellmann, Stephan; Kader, Avan; Elkilany, Aboelyazid; Makowski, Marcus R; Bressem, Keno K; Adams, Lisa C

doi:10.1016/j.acra.2024.09.041

2024

Title: Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis.
Document type: Journal Article
Author(s): Can, Elif; Uller, Wibke; Vogt, Katharina; Doppler, Michael C; Busch, Felix; Bayerl, Nadine; Ellmann, Stephan; Kader, Avan; Elkilany, Aboelyazid; Makowski, Marcus R; Bressem, Keno K; Adams, Lisa C
Abstract:
PURPOSE: To quantitatively and qualitatively evaluate and compare the performance of leading large language models (LLMs), including proprietary models (GPT-4, GPT-3.5 Turbo, Claude-3-Opus, and Gemini Ultra) and open-source models (Mistral-7b and Mistral-8×7b), in simplifying 109 interventional radiology reports.
METHODS: Qualitative performance was assessed using a five-point Likert scale for accuracy, completeness, clarity, clinical relevance, naturalness, and error rates, including trust-breaking and post-therapy misconduct errors. Quantitative readability was assessed using Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), SMOG Index, and Dale-Chall Readability Score (DCRS). Paired t-tests and Bonferroni-corrected p-values were used for statistical analysis.
RESULTS: Qualitative evaluation showed no significant differences between GPT-4 and Claude-3-Opus for any metrics evaluated (all Bonferroni-corrected p-values: p = 1), while they outperformed other assessed models across five qualitative metrics (p < 0.001). GPT-4 had the fewest content and trust-breaking errors, with Claude-3-Opus second. However, all models exhibited some level of trust-breaking and post-therapy misconduct errors, with GPT-4-Turbo and GPT-3.5-Turbo with few-shot prompting showing the lowest error rates, and Mistral-7B and Mistral-8×7B showing the highest. Quantitatively, GPT-4 surpassed Claude-3-Opus in all readability metrics (all p < 0.001), with a median FRE score of 69.01 (IQR: 64.88-73.14) versus 59.74 (IQR: 55.47-64.01) for Claude-3-Opus. GPT-4 also outperformed GPT-3.5-Turbo and Gemini Ultra (both p < 0.001). Inter-rater reliability was strong (κ = 0.77-0.84).
CONCLUSIONS: GPT-4 and Claude-3-Opus demonstrated superior performance in generating simplified IR reports, but the presence of errors across all models, including trust-breaking errors, highlights the need for further refinement and validation before clinical implementation.
CLINICAL RELEVANCE/APPLICATIONS: With the increasing complexity of interventional radiology (IR) procedures and the growing availability of electronic health records, simplifying IR reports is critical to improving patient understanding and clinical decision-making. This study provides insights into the performance of various LLMs in rewriting IR reports, which can help in selecting the most suitable model for clinical patient-centered applications.
Journal title: Acad Radiol
Year: 2024
Full text / DOI: doi:10.1016/j.acra.2024.09.041
PubMed: http://view.ncbi.nlm.nih.gov/pubmed/39353826
Print ISSN: 1076-6332
TUM institution: Institut für Diagnostische und Interventionelle Radiologie (Prof. Makowski)
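
Note on the metrics named in the abstract: the following is a minimal sketch, not the authors' analysis code. It shows the standard English-language formulas for Flesch Reading Ease and Flesch-Kincaid Grade Level, and a plain Bonferroni correction. The abstract does not say which tool, language variant, or number of comparisons the study used, so the function names, the precomputed word, sentence, and syllable counts, and the example values are all assumptions made for illustration.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard English FRE formula; higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard FKGL formula; approximates a US school grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def bonferroni(p_values: list[float]) -> list[float]:
    """Multiply each raw p-value by the number of tests and cap at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical counts for one simplified report (not taken from the study).
fre = flesch_reading_ease(words=100, sentences=5, syllables=150)
fkgl = flesch_kincaid_grade(words=100, sentences=5, syllables=150)

# Hypothetical raw p-values for three pairwise comparisons.
adjusted = bonferroni([0.2, 0.5, 0.0001])
```

Because the Bonferroni adjustment is capped at 1.0, any moderately large raw p-value becomes exactly 1 after correction, which is consistent with the abstract reporting "p = 1" for the GPT-4 versus Claude-3-Opus comparisons.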
