The Triage and Diagnostic Accuracy of Frontier Large Language Models: Updated Comparison to Physician Performance

Research output: Contribution to journal › Article › peer-review

Abstract

The medical capabilities of large language models (LLMs) are progressing rapidly [1-3]. Benchmarking LLMs against human performance on clinically relevant tasks makes it possible to track current capabilities and progress. The triage (level/urgency of care to seek) and diagnostic accuracy of the GPT-3 model were recently compared with those of 5000 lay individuals using the internet and 21 practicing primary care physicians [4]. The triage ability of GPT-3 was significantly inferior to that of physicians and similar in accuracy to that of lay individuals. Its diagnostic ability was close to but below that of physicians [4]. It is uncertain whether more recent frontier LLMs are still inferior to physicians on this benchmark.
Original language: English
Article number: e67409
Number of pages: 3
Journal: Journal of Medical Internet Research
Volume: 26
Publication status: Published - 6 Dec 2024

Keywords

  • accuracy
  • AI
  • ChatGPT
  • diagnosis
  • diagnostic
  • generative artificial intelligence
  • internet
  • large language models
  • LLMs
  • medical care
  • physician
  • physicians
  • prediction
  • primary care
  • triage
