"ChatGPT proved best in making a final diagnosis, where the AI had 77% accuracy in the study, funded in part by the National Institute of General Medical Sciences.
It was lowest-performing in making differential diagnoses, where it was only 60% accurate, and in clinical management decisions, underperforming at 68% accuracy based on the clinical data the LLM was trained on."
Today in Health IT: Mass General Brigham puts ChatGPT to the test on clinical decision accuracy. Today we're going to take a look at that and see what the results were. My name is Bill Russell. I'm a former CIO for a 16-hospital system and creator of This Week Health, a set of channels and events dedicated to leveraging the power of community to propel healthcare forward. We want to thank our show sponsors who are investing in developing the next generation of health leaders: SureTest, Artisight, Parlance, CertifyHealth, Notable, and ServiceNow. Check them out at thisweekhealth.com/today. Hey, one of the things you can do to help us out: share this podcast with a friend or colleague. Use it as a foundation for daily or weekly discussions on the topics that are relevant to you and the industry, and let them know they can subscribe wherever you listen to podcasts.
Okay, before we get into it, don't forget: we have a great webinar this week, September 7th, 1:00 PM Eastern, 10:00 AM Pacific. It's part of our Leader Series, on our AI journey in healthcare thus far. We have three great guests: Michael Pfeffer from Stanford Health Care, Brett Lamb from UNC Health in North Carolina, and Chris Longhurst from UC San Diego Health. We're going to be talking about our AI journey thus far and what we're finding as we move forward through this path, and relevant to that is our topic for today. All right, so we have a story that appeared in Healthcare IT News, which I haven't hit in a while, so it's interesting to actually get over there and see what they're writing about. This is by Andrea Fox, August 29th, 2023, and it is "ChatGPT scores 72% in clinical decision accuracy, according to a Mass General Brigham study." Okay. The large language model's performance was steady across both primary and emergency care for all medical specialties, but it struggled with differential diagnosis, according to the new research by Mass General Brigham. Let me go into it a little bit. So they're putting ChatGPT to the test to see if AI can work through an entire clinical encounter with a patient: recommending a diagnostic workup, deciding a course of action, or making a final diagnosis. Mass General Brigham researchers found the large language model to have impressive accuracy despite limitations, including possible hallucinations, which as we know are possible. They put it through 36 published clinical vignettes and compared its accuracy on differential diagnosis, diagnostic testing, final diagnosis, and management based on patient age, gender, and case acuity.
Their findings: "No real benchmark exists, but we estimate this performance to be at the level of someone who just graduated from medical school, such as an intern or resident." And again, not to get too overly hyped on this, but this is a model that wasn't necessarily trained specifically on healthcare, and it's performing at the level of somebody who just graduated from medical school. All right, so that is from Dr. Marc Succi (S-U-C-C-I), associate chair of innovation and commercialization and strategic innovation leader at MGB, and executive director of its MESH Incubator innovation and operations research group. Wow, that's a big title. The researchers said that ChatGPT achieved an overall accuracy of 71% in clinical decision-making across all 36 clinical vignettes. ChatGPT came up with possible diagnoses and made final diagnoses and care management decisions. They measured the popular LLM's accuracy on differential diagnosis, diagnostic testing, final diagnosis, and management in a structured, blinded process, awarding points for correct answers to questions posed. Researchers then used linear regression to assess the relationship between ChatGPT's performance and the vignettes' demographic information. According to the study, published this past week in the Journal of Medical Internet Research, ChatGPT proved best in making a final diagnosis, where the AI had 77% accuracy in the study, funded in part by the National Institute of General Medical Sciences. It was lowest-performing in making differential diagnoses, where it was only 60% accurate, and in clinical management decisions, underperforming at 68% accuracy based on the clinical data the LLM was trained on. The good news for those who have questioned whether ChatGPT can really outshine doctors' expertise: "ChatGPT struggled with differential diagnosis, which is the meat and potatoes of medicine, when a physician has to figure out what to do," Succi said.
That is important because it tells us where physicians are truly experts and adding the most value: in the early stages of patient care with little presenting information, when a list of possible diagnoses is needed. Before tools like ChatGPT can be considered for integration into medical or clinical care, more benchmark research and regulatory guidance is needed, according to Succi. MGB next is looking at whether AI tools can improve patient care and outcomes in hospitals' resource-constrained areas. This is interesting. You know, I look at this and I love the fact that this research is going on. I love the fact that we're building this body of knowledge to see where the value is added, either by the technology or by the clinician, so we can then optimize the overall use of the technology in a safe way across the board. This is going to be part of our learning curve. That's going to be part of our journey, our AI journey if you will, as we try to figure out where this can be used in a safe and effective manner. And, you know, what's my "so what" on this? I am again surprised at the pace. I mean, it was really a little over a year ago that ChatGPT became part of the conversation. Do you remember sitting around the Thanksgiving table or over Christmas break? You ended up having conversations about ChatGPT: could you believe what this thing is doing? And here we are just, I don't know, nine months later, and we're talking about it actually diagnosing patients and those kinds of things. And it's that kind of thing that is really lending itself to the hype that is being created. Now, we have to stay measured as we look at these outcomes. We have to understand what is required in order to utilize this technology safely and effectively. And, you know, this is why we're talking a fair amount with decision makers about the policies they're putting in place
and the protections they're putting in place on the use of these technologies, and also pushing them on the other side in terms of experimentation: where are you experimenting with this technology? So my "so what" is this: put the guardrails up, and experiment as much as possible within your health system, given your budget and your resources.
All right, that's all for today. Don't forget to share this podcast with a friend or colleague. And we want to thank our channel sponsors who are investing in our mission to develop the next generation of health leaders: SureTest, Artisight, Parlance, CertifyHealth, Notable, and ServiceNow. Check them out at thisweekhealth.com/today. Thanks for listening. That's all for now.