TY - JOUR
T1 - Generative AI chatbots for reliable cancer information
T2 - Evaluating web-search, multilingual, and reference capabilities of emerging large language models
AU - Menz, Bradley D.
AU - Modi, Natansh D.
AU - Abuhelwa, Ahmad Y.
AU - Ruanglertboon, Warit
AU - Vitry, Agnes
AU - Gao, Yuan
AU - Li, Lee X.
AU - Chhetri, Rakchha
AU - Chu, Bianca
AU - Bacchi, Stephen
AU - Kichenadasse, Ganessan
AU - Shahnam, Adel
AU - Rowland, Andrew
AU - Sorich, Michael J.
AU - Hopkins, Ashley M.
PY - 2025/3/11
Y1 - 2025/3/11
N2 - Recent advancements in large language models (LLMs) enable real-time web search, improved referencing, and multilingual support, yet ensuring they provide safe health information remains crucial. This perspective evaluates seven publicly accessible LLMs (ChatGPT, Co-Pilot, Gemini, MetaAI, Claude, Grok, and Perplexity) on three simple cancer-related queries across eight languages (English, French, Chinese, Thai, Hindi, Nepali, Vietnamese, and Arabic; 336 responses in total). None of the 42 English responses contained clinically meaningful hallucinations, whereas 7 of 294 non-English responses did. 48% (162/336) of responses included valid references, but 39% of the English references were .com links, reflecting quality concerns. English responses frequently exceeded an eighth-grade reading level, and many non-English outputs were also complex. These findings reflect substantial progress over the past two years but reveal persistent gaps in multilingual accuracy, reliable reference inclusion, referral practices, and readability. Ongoing benchmarking is essential to ensure LLMs safely support global health information and meet online information standards.
AB - Recent advancements in large language models (LLMs) enable real-time web search, improved referencing, and multilingual support, yet ensuring they provide safe health information remains crucial. This perspective evaluates seven publicly accessible LLMs (ChatGPT, Co-Pilot, Gemini, MetaAI, Claude, Grok, and Perplexity) on three simple cancer-related queries across eight languages (English, French, Chinese, Thai, Hindi, Nepali, Vietnamese, and Arabic; 336 responses in total). None of the 42 English responses contained clinically meaningful hallucinations, whereas 7 of 294 non-English responses did. 48% (162/336) of responses included valid references, but 39% of the English references were .com links, reflecting quality concerns. English responses frequently exceeded an eighth-grade reading level, and many non-English outputs were also complex. These findings reflect substantial progress over the past two years but reveal persistent gaps in multilingual accuracy, reliable reference inclusion, referral practices, and readability. Ongoing benchmarking is essential to ensure LLMs safely support global health information and meet online information standards.
KW - Artificial intelligence
KW - Cancer enquiries
KW - English
KW - Health enquiries
KW - Language
KW - Large language model
UR - http://www.scopus.com/inward/record.url?scp=85217028980&partnerID=8YFLogxK
UR - http://purl.org/au-research/grants/NHMRC/2030913
UR - http://purl.org/au-research/grants/NHMRC/2008119
U2 - 10.1016/j.ejca.2025.115274
DO - 10.1016/j.ejca.2025.115274
M3 - Article
AN - SCOPUS:85217028980
SN - 0959-8049
VL - 218
JO - European Journal of Cancer
JF - European Journal of Cancer
M1 - 115274
ER -