Large Language Models Are Poor Medical Coders — Benchmarking of Medical Code Querying
NEJM AI
Summary
The study by Ali Soroush et al. benchmarks several large language models (LLMs), including GPT-3.5, GPT-4, Gemini Pro, and Llama2-70b Chat, on medical code querying: generating the correct billing code from its description. Across ICD-9-CM, ICD-10-CM, and CPT codes, the models were generally inaccurate, and even the best-performing model, GPT-4, failed to achieve high accuracy. The authors analyzed code frequency, brevity of code descriptions, and exactness of match to understand the performance disparities. The findings suggest that current LLMs are unreliable for medical coding: they often produce imprecise or entirely fabricated codes, which could undermine medical billing and record-keeping if the models were used in clinical settings without further dedicated research and refinement.
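To make the two failure modes in the summary concrete (imprecise codes versus fabricated ones), here is a minimal sketch of how a code-querying benchmark of this kind might be scored. It is not the authors' evaluation code. The names `score_predictions` and `CodeQueryResult`, the normalization rule, and the three-code toy data are all assumptions made for illustration, and "fabricated" here only means "not present in the supplied set of valid codes".

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass
class CodeQueryResult:
    """Hypothetical container for benchmark counts (not from the study)."""
    exact: int       # predictions identical to the reference code
    fabricated: int  # wrong predictions that are not in the valid code set
    total: int       # number of queries scored

    @property
    def exact_match_rate(self) -> float:
        return self.exact / self.total if self.total else 0.0


def normalize(code: str) -> str:
    # Assumed normalization: ignore surrounding whitespace and letter case.
    return code.strip().upper()


def score_predictions(
    predictions: Sequence[str],
    references: Sequence[str],
    valid_codes: Iterable[str],
) -> CodeQueryResult:
    """Count exact matches and wrong answers that fall outside valid_codes."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    valid = {normalize(c) for c in valid_codes}
    exact = fabricated = 0
    for pred, ref in zip(predictions, references):
        p = normalize(pred)
        if p == normalize(ref):
            exact += 1
        elif p not in valid:
            fabricated += 1
    return CodeQueryResult(exact, fabricated, len(references))


# Toy data for illustration only; these are not results from the study.
references = ["E11.9", "I10", "J45.909"]
predictions = ["e11.9 ", "I10", "J45.999"]  # last one is absent from the toy set
result = score_predictions(predictions, references, valid_codes=references)
```

In this toy run, two of the three predictions match after normalization, and the third counts as fabricated only because it is missing from the three-code set passed in. A real evaluation would need the full official code list for each coding system.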