Can ChatGPT be your next medical advisor?

In a recent study published in JAMA Network Open, a team of researchers from Vanderbilt University examined the potential role of the Chat Generative Pre-trained Transformer (ChatGPT) in providing medical information to patients and health professionals.

Study: Accuracy and Reliability of Chatbot Responses to Physician Questions. Image Credit: CkyBe / Shutterstock

Background

ChatGPT is widely used for various purposes nowadays. This large language model (LLM) has been trained on articles, books, and other sources from across the internet. ChatGPT understands requests from human users and provides answers in text and, now, image formats. Unlike the natural language processing (NLP) models that came before it, this chatbot can learn on its own through ‘self-supervised learning.’

ChatGPT synthesizes immense amounts of data rapidly, making it a useful reference tool. Medical professionals could use this application to draw inferences from medical data and inform complex clinical decisions. This could make healthcare more efficient, as physicians would not have to look up multiple references to obtain vital information. Similarly, patients would be able to access medical information without having to rely solely on their doctor.

However, the utility of ChatGPT in medicine, for both doctors and patients, depends on whether it can provide accurate and complete information. Many cases have been documented in which the chatbot ‘hallucinated’, producing convincing responses that were entirely incorrect. It is therefore crucial to evaluate its accuracy in responding to health-related queries.

“Our study provides insights into model performance in addressing medical questions developed by physicians from a diverse range of specialties; these questions are inherently subjective, open-ended, and reflect the challenges and ambiguities that physicians and, in turn, patients encounter clinically.”

About the study

Thirty-three physicians, faculty and recent graduates of Vanderbilt University Medical Center, devised a list of 180 questions spanning 17 pediatric, surgical, and medical specialties. Two additional question sets included queries on melanomas, immunotherapy, and common medical conditions. In total, 284 questions were chosen.

The questions were designed to have clear answers based on the medical guidelines of early 2021 (when the training data for chatbot version 3.5 ends). Questions could be binary (with yes/no answers) or descriptive. Based on difficulty, they were classified as easy, medium, or hard.

An investigator entered each question into the chatbot, and the response to each question was assessed by the physician who had designed it. Accuracy and completeness were scored using Likert scales. Each question was scored from 1 to 6 for accuracy, where 1 indicated ‘completely incorrect’ and 6 ‘completely correct.’ Similarly, completeness was graded from 1 to 3, where 3 was the most comprehensive and 1 the least. Completely incorrect answers were not assessed for completeness.

Rating results were reported as median [interquartile range (IQR)] and mean [standard deviation (SD)]. Differences between groups were assessed using Mann-Whitney U tests, Kruskal-Wallis tests, and Wilcoxon signed-rank tests. When more than one physician scored a given question, interrater agreement was also checked. A sketch of this kind of analysis appears below.
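
To make the statistical workflow concrete, here is a minimal sketch in Python using NumPy and SciPy. The scores are invented placeholders, not the study’s data; it simply shows how the median [IQR] and mean [SD] summaries and the three nonparametric tests named above can be computed.

```python
import numpy as np
from scipy import stats

# Hypothetical Likert accuracy scores (1-6) for illustration only;
# these are NOT the study's data.
easy   = np.array([6, 5, 6, 4, 5, 6, 3, 6])
medium = np.array([5, 4, 6, 2, 5, 6, 4, 5])
hard   = np.array([4, 6, 3, 5, 2, 6, 5, 4])

# Summaries as reported in the paper: median [IQR] and mean [SD].
q1, q3 = np.percentile(easy, [25, 75])
print(f"easy: median={np.median(easy):.1f} [IQR {q1:.1f}-{q3:.1f}], "
      f"mean={easy.mean():.1f} [SD {easy.std(ddof=1):.1f}]")

# Two independent groups (e.g., binary vs. descriptive questions):
# Mann-Whitney U test.
print(stats.mannwhitneyu(easy, medium))

# Three or more independent groups (easy/medium/hard): Kruskal-Wallis test.
print(stats.kruskal(easy, medium, hard))

# Paired scores (the same questions scored at two time points):
# Wilcoxon signed-rank test.
first  = np.array([2, 1, 2, 2, 1])
second = np.array([4, 5, 2, 6, 3])
print(stats.wilcoxon(first, second))
```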

Incorrectly answered questions were put to the chatbot a second time, one to three weeks later, to check whether the results were reproducible over time. All immunotherapy- and melanoma-based questions were also rescored to assess the performance of the most recent model, ChatGPT version 4.

Findings

In terms of accuracy, the chatbot had a median rating of 5 (IQR: 1-6) for the first set of 180 multispecialty questions, indicating that the median answer was “nearly all correct.” However, the mean rating was lower, at 4.4 [SD: 1.7]. While the median completeness rating was 3 (“comprehensive”), the mean rating was lower at 2.4 [SD: 0.7]. Thirty-six answers were classified as inaccurate, having scored 2 or less.

For the first set, completeness and accuracy were also slightly correlated, with a correlation coefficient of 0.4. There were no significant differences in the completeness and accuracy of ChatGPT’s answers across easy, medium, and hard questions, or between descriptive and binary questions.
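
The article does not specify which correlation statistic was used; for ordinal Likert data, a rank-based measure such as Spearman’s ρ would be the conventional choice. A minimal sketch under that assumption, again with invented scores:

```python
from scipy import stats

# Hypothetical paired ratings per question (not the study's data).
accuracy     = [6, 5, 2, 6, 4, 5, 3, 6]   # 1-6 Likert scale
completeness = [3, 3, 1, 3, 2, 3, 2, 3]   # 1-3 Likert scale

# Spearman's rank correlation between the two rating scales.
rho, p = stats.spearmanr(accuracy, completeness)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```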

For the reproducibility evaluation, 34 of the 36 inaccurate answers were rescored. The chatbot’s performance improved markedly, with 26 answers becoming more accurate, 7 remaining unchanged, and just one becoming less accurate than before. The median accuracy rating increased from 2 to 4.

The immunotherapy- and melanoma-related questions were assessed twice. In the first round, the median rating was 6 (IQR: 5-6), and the mean rating was 5.2 (SD: 1.3). The chatbot performed better in the second round, improving its mean rating to 5.7 (SD: 0.8). Completeness scores also increased, and the chatbot likewise scored highly on the questions about common conditions.

“This study indicates that 3 months into its existence, chatbot has promise for providing accurate and comprehensive medical information. However, it remains well short of being completely reliable.”

Conclusions

Overall, ChatGPT performed well in terms of completeness and accuracy. However, the mean ratings were noticeably lower than the medians, suggesting that a few highly inaccurate answers (“hallucinations”) pulled the averages down. Since these hallucinations are delivered in the same convincing and authoritative tone, they are difficult to distinguish from correct answers.

ChatGPT improved markedly over the short period between assessments. This underscores the importance of continually updating and refining algorithms and of using repeated user feedback to reinforce factual accuracy and verified sourcing. Expanding and diversifying training datasets (within medical sources) would allow ChatGPT to parse nuances in medical concepts and terms.

Moreover, the chatbot could not distinguish between ‘high-quality’ sources, such as PubMed-indexed journal articles and medical guidelines, and ‘low-quality’ sources such as social media posts; it weighs them equally. In time, ChatGPT could become a valuable tool for medical practitioners and patients, but it is not there yet.
