Abstract
The medical capabilities of large language models (LLMs) are progressing rapidly [1-3]. Benchmarking LLMs against human performance on clinically relevant tasks enables tracking of current capabilities and progress. The triage accuracy (identifying the level and urgency of care to seek) and diagnostic accuracy of the GPT-3 model were recently compared with those of 5000 lay individuals using the internet and 21 practicing primary care physicians [4]. The triage accuracy of GPT-3 was significantly lower than that of physicians and similar to that of lay individuals. Its diagnostic accuracy was close to but below that of physicians [4]. It is uncertain whether more recent frontier LLMs remain inferior to physicians on this benchmark.
Original language | English |
---|---|
Article number | e67409 |
Number of pages | 3 |
Journal | Journal of Medical Internet Research |
Volume | 26 |
Publication status | Published - 6 Dec 2024 |
Keywords
- accuracy
- AI
- ChatGPT
- diagnosis
- diagnostic
- generative artificial intelligence
- internet
- large language models
- LLMs
- medical care
- physician
- physicians
- prediction
- primary care
- triage