
ChatGPT-4 outperforms GPT-3.5 and Google Bard in neurosurgery oral board exam


In a recent study posted to the medRxiv* preprint server, researchers in the US assessed the performance of three general large language models (LLMs), ChatGPT (GPT-3.5), GPT-4, and Google Bard, on higher-order questions representative of the American Board of Neurological Surgery (ABNS) oral board examination. In addition, they interpreted differences in the models' performance and accuracy across various question characteristics.

Study: Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank. Image Credit: Login / Shutterstock

*Important notice: medRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or be treated as established information.

Background

All three LLMs assessed in this study have shown the capability to pass medical board exams with multiple-choice questions. However, no previous studies have tested or compared the performance of multiple LLMs on predominantly higher-order questions from a high-stakes medical subspecialty domain, such as neurosurgery.

A previous study showed that ChatGPT passed a 500-question module imitating the neurosurgery written board exams with a score of 73.4%. Its updated model, GPT-4, became available for public use on March 14, 2023, and similarly attained passing scores on >25 standardized exams. Studies documented that GPT-4 showed >20% performance improvements on the US Medical Licensing Exam (USMLE).

Another artificial intelligence (AI)-based chatbot, Google Bard, has real-time web-crawling capabilities and could therefore offer more contextually relevant information when generating responses for standardized exams in the fields of medicine, business, and law. The ABNS neurosurgery oral board examination, considered a more rigorous assessment than its written counterpart, is taken by doctors two to three years after residency graduation. It comprises three sessions of 45 minutes each, and its pass rate has not exceeded 90% since 2018.

About the study

In the current study, researchers assessed the performance of GPT-3.5, GPT-4, and Google Bard on a 149-question module imitating the neurosurgery oral board exam.

The Self-Assessment Neurosurgery Exam (SANS) indications exam covered questions on relatively difficult topics, such as neurosurgical indications and interventional decision-making. The team assessed questions in a single-best-answer multiple-choice format. Since none of the three LLMs currently accepts multimodal input, the team tracked responses containing 'hallucinations' for questions with medical imaging data, i.e., scenarios where an LLM asserts inaccurate facts it falsely believes to be correct. In all, 51 questions incorporated imaging into the question stem.

Moreover, the team used linear regression to assess correlations between performance on different question categories. They assessed variations in performance using chi-squared, Fisher's exact, and univariable logistic regression tests, with p<0.05 considered statistically significant.
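The preprint's analysis code is not reproduced in this article; as a minimal, hypothetical sketch, pairwise accuracy comparisons of this kind could be run in Python with SciPy. The helper function, variable names, and counts below are illustrative assumptions, with the counts only approximated from the reported percentages.

```python
# Hypothetical sketch of a pairwise accuracy comparison; not the authors' code.
# Counts are approximated from the reported accuracies on 149 questions.
from scipy.stats import chi2_contingency, fisher_exact

def compare_models(correct_a, total_a, correct_b, total_b):
    """Build a 2x2 contingency table of correct/incorrect answers and test it."""
    table = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    chi2, p_chi2, _, _ = chi2_contingency(table)
    _, p_fisher = fisher_exact(table)  # exact test, useful for small cell counts
    return p_chi2, p_fisher

# Example: GPT-4 (~123/149 correct) vs. ChatGPT (~93/149 correct)
p_chi2, p_fisher = compare_models(123, 149, 93, 149)
print(f"chi-squared p = {p_chi2:.4f}, Fisher's exact p = {p_fisher:.4f}")
```

Fisher's exact test is generally preferred when expected cell counts are small, which mirrors the mix of tests the authors report using.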

Study findings

On a 149-question bank of mainly higher-order diagnostic and management multiple-choice questions designed for neurosurgery oral board exams, GPT-4 attained a score of 82.6% and outperformed ChatGPT's score of 62.4%. Moreover, GPT-4 demonstrated markedly higher performance than ChatGPT in the Spine subspecialty (90.5% vs. 64.3%).

Google Bard generated correct responses for 44.2% (66/149) of questions. It generated incorrect responses to 45% (67/149) of questions and declined to answer 10.7% (16/149) of questions. GPT-3.5 and GPT-4 never declined to answer a text-based question, whereas Bard declined to answer 14 text-based questions. In fact, GPT-4 outperformed Google Bard in all categories and demonstrated improved performance in question categories for which ChatGPT showed lower accuracy. Interestingly, while GPT-4 performed better on imaging-related questions than ChatGPT (68.6% vs. 47.1%), its performance was comparable to that of Google Bard (68.6% vs. 66.7%).
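As a rough, illustrative back-of-the-envelope check (using only the reported percentages and the 149-question total, not data from the preprint), these accuracies translate into approximately the following correct-answer counts:

```python
# Back-of-the-envelope check: approximate number of correctly answered
# questions implied by each model's reported accuracy on the 149-question bank.
TOTAL_QUESTIONS = 149
reported_accuracy = {
    "GPT-4": 0.826,
    "ChatGPT (GPT-3.5)": 0.624,
    "Google Bard": 0.442,
}
for model, accuracy in reported_accuracy.items():
    print(f"{model}: ~{round(accuracy * TOTAL_QUESTIONS)} of {TOTAL_QUESTIONS} correct")
```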

Notably, GPT-4 showed reduced rates of hallucination and the ability to navigate difficult concepts, such as declaring medical futility. However, it struggled in other scenarios, such as factoring in patient-level characteristics, e.g., frailty.

Conclusions

There is an urgent need to develop more trust in LLM systems; thus, rigorous validation of their performance on increasingly higher-order and open-ended scenarios should continue. This would ensure the safe and effective integration of these LLMs into clinical decision-making processes.

Methods to quantify and understand hallucinations remain vital, and ultimately, only LLMs that can minimize and recognize hallucinations will be incorporated into clinical practice. Further, the study findings underscore the urgent need for neurosurgeons to stay informed about emerging LLMs and their varying performance levels for potential clinical applications.

Multiple-choice examination patterns might become obsolete in medical education, while verbal assessments will gain more importance. With advancements in the AI domain, neurosurgical trainees might use and rely on LLMs for board preparation. For example, LLM-generated responses might provide new clinical insights. They might also serve as a conversational aid to rehearse various clinical scenarios on difficult topics for the boards.


Journal reference:

  • Preliminary scientific report. Ali, R., Tang, O. Y., Connolly, I. D., Fridley, J. S., Shin, J. H., Zadnik Sullivan, P. L., Cielo, D., Oyelese, A. A., Doberstein, C. E., Telfeian, A. E., Gokaslan, Z. L., & Asaad, W. F. Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank. medRxiv preprint 2023.04.06.23288265; DOI: https://doi.org/10.1101/2023.04.06.23288265, https://www.medrxiv.org/content/10.1101/2023.04.06.23288265v1
