In March 2023, the latest version of ChatGPT debuted. Arguably the most recognized large language model (LLM) in the world, ChatGPT-4 continues the steady improvements made on earlier versions: its enhanced ability to train on more inputs and newer data enables it to respond to increasingly broad queries with greater accuracy.
Months later, OpenAI, the company behind ChatGPT, announced the availability of ChatGPT-4 Turbo, which, true to its name, is even faster, supports longer inputs, and is trained on more recent data (up to April 2023, compared to September 2021 for previous models). Multimodal applications integrating ChatGPT with the image-generation model DALL-E and, most recently, the video-generation model Sora expand the creative uses of this technology. Other LLMs, notably Microsoft’s Copilot and Google’s Gemini (previously named Bard), are also making headlines, further cementing the transformative potential of this technology, in much the same way that internet search engines did in the 1990s.
In the time between the writing and printing of this article, new versions of or updates to these models, along with new applications, will be in the news. Trying to keep up with the rapid evolution of this technology is challenging, only adding to the daunting task of implementation—particularly in a field like medicine, which relies so heavily on informed judgment and clinical expertise to fulfill its mission to first do no harm.
As with any powerful new technology, excitement over the real and potential benefits of LLMs within healthcare will need to be continually evaluated against real and potential risks. With the launch of ChatGPT for general usage, the time has arrived to weigh in on this balancing act as more people adopt the technology.
We definitely have crossed the threshold into a new era of AI, and we probably have already crossed the threshold of ubiquity of using AI on a day-to-day basis for a lot of individuals. — Alfred-Marc Iloreta, Jr., MD
“We definitely have crossed the threshold into a new era of AI, and we probably have already crossed the threshold of ubiquity of using AI on a day-to-day basis for a lot of individuals,” said Alfred-Marc Iloreta, Jr., MD, assistant professor of artificial intelligence and emerging technologies in the Graduate School of Biomedical Sciences at the Icahn School of Medicine at Mount Sinai in New York City, where he is also an assistant professor of otolaryngology–head and neck surgery and neurosurgery and co-directs the Endoscopic Skull Base Program.
Although uptake of AI in healthcare isn’t yet ubiquitous, as suggested by a November 2023 AMA survey of more than 1,000 physicians in which only 38% of respondents said they currently use AI (AMA Augmented Intelligence Research. Nov. 2023), the easy access and no to low cost of software like ChatGPT assure its quick growth. Early users report its real-time benefits for everyday workflow activities such as administrative tasks and documentation (generating discharge notes, care plans, progress notes, clinical notes, and preauthorization letters), as was discussed in the March 2024 ENTtoday Tech Talk article. Other areas under investigation are education, research, and the generation of patient materials. For higher-order clinical activities, such as diagnosis, triage, and treatment decisions, research is ongoing to understand the safe, beneficial uses and limitations of LLMs.
Below is a brief sampling of the research that’s underway on implementing LLMs, and ChatGPT in particular, in otolaryngology education and patient communications.
ChatGPT for Otolaryngology Education
Habib Zalzal, MD, assistant professor of otolaryngology and pediatrics at Children’s National Medical Center, The George Washington University, Washington, D.C., sees adoption happening at a rapid rate among medical students, residents, and attendings, who in recent months have incorporated LLMs into their daily lives for educational purposes like studying or journal club summaries. “With this mass adoption, it’s only a matter of time before ChatGPT or other browser-based LLMs become a daily habit in our work day,” he said.
He cautioned, however, that ChatGPT and other LLMs don’t and cannot replace traditional learning sources such as textbooks and journal articles but should be seen as a supplement to these. In particular, he’s concerned about an overreliance on ChatGPT for educating students who as yet don’t have the prerequisite medical knowledge base on which to build critical thinking and judgment. “Reliance on early ChatGPT versions, much like a bad habit, is harder to break if the proper knowledge base isn’t there,” he said.
Part of Dr. Zalzal’s caution comes from data showing the limitations of ChatGPT for educational purposes. Following reports showing the ability of ChatGPT to exceed the passing score of the medical licensing exam (PLOS Digit. Health. 2023. doi.org/10.1371/journal.pdig.0000198; Sci Rep. 2023. doi.org/10.1038/s41598-023-43436-9), he and his colleagues undertook a study to quantify how well ChatGPT 3.5 concurred with expert otolaryngologists when asked high-level questions requiring both rote memorization and critical thinking (OTO Open. 2023. doi:10.1002/oto2.94). The tool performed better on open-ended questions (56.7% accuracy) than on multiple-choice questions (43.3% accuracy) but, overall, wasn’t sufficient as a stand-alone educational tool. Its lower accuracy on multiple-choice questions was attributed to ChatGPT’s default of providing some form of answer even when it doesn’t know the answer, which can easily generate a false or made-up response, called a hallucination.
“LLMs can sometimes generate plausible yet incorrect answers that may mislead or harm users,” he said. “Even if the training data were created using validated sources, the risk of hallucination by the AI model could lead to the spread of misinformation or misuse that cannot be easily controlled.”
Improved accuracy was reported in more recent studies using an LLM trained on a comprehensive knowledge database of otolaryngology-specific information integrated into ChatGPT-4 (JMIR Med Educ. 2024. doi:10.2196/49970; Lancet. 2023. doi:10.2139/ssrn.4571725 (preprint)). Called ChatENT, the model was developed by researchers at the University of Alberta and Copula AI and, according to the authors, is the first specialty-specific LLM in the medical field.
When challenged with practice questions for board certifying exams in Canada and the United States, ChatENT scored 87.2% accuracy on open-ended, short-answer questions and 80% on multiple-choice questions, outperforming ChatGPT-4 with fewer identified hallucinations and errors.
Lead author of the study, Cai Long, MD, an otolaryngology surgical resident at the University of Alberta, said the model, still in its early beta stage, is being continuously updated and improved and that its current state may not fully represent its most refined or comprehensive version. “Future iterations and research are expected to address these aspects, further enhancing the model’s robustness and applicability,” he said. Users who want to test the model can access it at https://www.chatent.net.
Potential applications of ChatENT cited by Dr. Long include medical education, patient education, and clinical decision support, the last of which, he said, has yet to be studied for efficacy.
Eric Gantwerker, MD, MSc, MS, a pediatric otolaryngologist and associate professor at Northwell Health in New York City who regularly teaches students and faculty how to use ChatGPT, views apparent limitations such as hallucinations as part of educating students on the strengths and weaknesses of the technology. He also shows them how they can leverage it to test their own knowledge by judging the validity of outputs from the platform.
People don’t realize that with subsequent updated models, limitations like hallucinations are going to go away. — Eric Gantwerker, MD, MS
Dr. Gantwerker has his students use the free version of ChatGPT (3.5) that allows for easy access and helps train them to identify the limitations and potential drawbacks of the technology like hallucinations and bias. To illustrate programmed bias, for example, he uses the prompt, “Create a picture of doctors playing games” that generates a picture of White male doctors. “In education, I can use these limitations of ChatGPT to our advantage,” he said.
To illustrate how rapidly this technology is improving, Dr. Gantwerker shows students the difference between content generated by ChatGPT versions 3.5 and 4.0, with the latter showing demonstrably more depth, creativity, and robustness. “People don’t realize that with subsequent updated models, limitations like hallucinations are going to go away,” he said. He encouraged other physicians to try ChatGPT and cited the paper “Writing with ChatGPT: an illustration of its capacity, limitations & implications for academic writers” (Perspect. Med. Educ. 2023. doi:10.5334/pme.1072) as a good resource for getting started.
ChatGPT for and by Patients
Using ChatGPT to develop patient materials, including translating them into other languages, is another potential role for this technology. Researchers at the University of Kansas Medical Center in Kansas City showed the ability of ChatGPT to generate presurgical educational information for patients undergoing head and neck surgery (Laryngoscope. 2023. doi:10.1002/lary.31243). When compared with online resources, such as publicly available websites that patients access for such information, ChatGPT content had similar readability, knowledge content, accuracy, thoroughness, and number of medical errors.
Senior author of the study, Andres Bur, MD, an associate professor of otolaryngology–head and neck surgery and director of robotics and minimally invasive head and neck surgery at the University of Kansas Medical Center, called the results powerful but cautioned that they are also very new. “We need more experience with it to know that it’s providing the correct recommendations for our patients so we can start recommending it as a tool for them,” he said.
For patients who use ChatGPT in the same way that some use Google to ask about an otolaryngologic health concern, a study by Dr. Zalzal and his colleagues found that ChatGPT answered with a high degree of accuracy (98.3%), although patient confidence in its responses was lower (79.8%) (Laryngoscope Investig Otolaryngol. 2024. doi:10.1002/lio2.1193). “It’s important for us physicians to know that at least the public still values our expertise, so we have a duty to not rely on LLMs and still serve as a separate knowledgeable entity for educating and treating our patients on their otolaryngologic conditions,” said Dr. Zalzal.
Another study, conducted by Daniel J. Campbell, MD, and his colleagues, found that ChatGPT correctly answered nearly 70% of questions on thyroid nodules (Thyroid. 2023. doi:10.1089/thy.2023.0491). The responses, however, were written at a college reading level, higher than the level recommended for patient education materials, potentially making them more difficult for patients to understand.
When to Adopt ChatGPT in Practice
One question otolaryngologists and their practices need to ask themselves is when, and for what tasks, they should adopt ChatGPT. Relying on traditional studies may not be feasible, given the rapid evolution of the technology. “The hard part of doing a study where you’re comparing recommendations made by a human clinician and those made by AI, for example, is that because the algorithms so rapidly change and adapt and learn, a study we do today will be different from what we do in six months,” said Dr. Iloreta, adding that AI studies, therefore, won’t necessarily be reproducible.
But the time may be close when AI won’t be able to adapt much more, he added; as its rapid evolution slows, there will be opportunity for clearer assessment.
Another way to know when it’s time to adopt LLMs is to watch what others are doing. Stat News (www.statnews.com), for instance, offers an online tracker on the real-world use of generative AI and its impact on medicine.
AI can also be viewed through the lens of a concept called the Gartner Hype Cycle, a five-stage model of technology adoption that helps guide organizations. (See the sidebar “The Gartner Hype Cycle.”) Dr. Bur, whose research focuses on machine learning to personalize care in head and neck oncology, views AI through this lens and sees generative AI at its peak. In the hype cycle, “peak” deliberately describes a relatively new technology that has generated a lot of buzz but has more hype than proof that it can deliver what it claims.
Further help in discerning when and how to adopt AI can be found in a set of broad guiding principles developed and recently published by multidisciplinary experts from around the world (N Engl J Med. 2024. doi:10.1056/AIp2400036). Key recommendations cover policy issues to consider, clinical aspects including the patient-clinician relationship, incorporation of patient data into training AI models, patient education about medical advice from AI, and information on who pays for AI developments. Zak Kohane, MD, PhD, chair of the department of biomedical informatics in the Blavatnik Institute at Harvard Medical School in Boston and editor-in-chief of The New England Journal of Medicine AI, a longtime advocate of AI’s potential to change medicine, underscored the imperative for caution with the arrival of generative AI tools like ChatGPT that he called “mind-blowing.”
“Despite their promise,” said Dr. Kohane in a press release, “ChatGPT and tools like it are immature and evolving. We need to figure out how to trust their abilities but verify their output.”
The Gartner Hype Cycle
The Gartner Hype Cycle includes five phases of a technology’s life cycle that provide a way for potential users of a new technology to understand its opportunities and risks. Organizations can use the cycle to decide when and how to adopt a new technology. The five phases include:
1. Innovation Trigger: Technological breakthrough or product launch.
2. Peak of Inflated Expectations: Increase in product use, but more hype than proof that the innovation can deliver.
3. Trough of Disillusionment: The original excitement wears off; early adopters report performance issues and low return on investment.
4. Slope of Enlightenment: Early adopters see benefits, and others begin to understand how to adopt the innovation.
5. Plateau of Productivity: Users see real-world benefits, and the innovation goes mainstream.
Online AI Tutorials
These online tutorials can help you gain a better understanding of how AI is being used in healthcare today:
• Artificial Intelligence in Health Care. MIT. A more advanced course offering an in-depth look at AI in healthcare and future applications. https://executive.mit.edu/course/artificial-intelligence-in-health-care/a056g00000URaaTAAT.html
• ChatGPT Tutorial for Complete Beginners 2023. Udemy. https://www.udemy.com/course/chatgpt-tutorial-for-complete-beginners-2023/?couponCode=LETSLEARNNOWPP
• Overview of AI in Healthcare: Free for American Medical Association (AMA) members. https://edhub.ama-assn.org/change-med-ed/interactive/18827029
• Prompt Engineering for ChatGPT. Vanderbilt University. Coursera. https://www.coursera.org/learn/prompt-engineering?specialization=prompt-engineering
Mary Beth Nierengarten is a freelance medical writer based in Minnesota.