How Not to Measure the Accuracy of AI Models: A Clinical Coding Example
This article highlights the complexity of the clinical coding process and the substantial disagreement among coders. It emphasizes the importance of improving agreement rates and outlines concrete measures for doing so.
I recently met an entrepreneur who is building a product that automates part of the clinical coding process. The task is to read free-text clinical notes that summarize a patient encounter and return the right structured codes that define what was done (e.g., removing the appendix) and the diagnosis that called for it (e.g., appendicitis). The topic of discussion was whether deep contextual embeddings could improve the system’s accuracy, which had been stuck at around 70% for a while despite the best efforts of a highly capable data science team.
However, a short discussion uncovered that there was little that the data science team could do to resolve the situation. The key to unblocking the project lay elsewhere.
The team had access to a reasonably large set of labeled data -- clinical notes and the codes that a team of professional coders assigned to them. The problem is that clinical coding is hard, and coders disagree substantially about how best to code the same cases:
• Accuracy and Completeness of Clinical Coding Using ICD-10 for Ambulatory Visits summarized that “Just over a half of entered codes were appropriate for a given scenario and about a quarter were omitted.”
• Assessment of the reproducibility of clinical coding in routinely collected hospital activity data found that "Internal and external coders agreed exactly in 43% of the admissions examined and agreed ‘approximately’ in 55%."
• Coding OSICS sports injury diagnoses in epidemiological studies found that two coders “assigned the same four-digit OSICS-10 code for only 46% of the 1,082 injuries.”
• Lack of Agreement in Pediatric Emergency Department Discharge Diagnoses found that "Overall, 67% of diagnoses from the administrative and abstracted sources were within the same diagnosis group."
• Causes of injuries resulting in hospitalization in Australia reports “68% agreement for complete codes and 74% agreement for 3-character codes.”
In fact, different coders code the same cases so differently that researchers have actively studied whether these codes are clinically useful at all. The common answer is yes, but only if you take the low accuracy bar into account. For example, Interrater Reliability of ICD-10 Guidelines for the Diagnosis of Personality Disorders concluded there was “satisfactory inter-rater reliability (r > 0.70) for all except 7 out of the 56 guidelines.” In plain speak, agreement around the 70% mark is considered satisfactory and is about the best you can usually expect.
You Can’t Measure Accuracy, Only Agreement
The implication? Seventy percent is, in this case, an upper bound on what a predictive model’s accuracy can be. Anything better is, by definition, overfitting to the random choices made by a particular set of labelers.
This is true in general. Consider a different problem, where you are asked to “predict” what the double of a number will be. If I say four, then you should say eight. If I say 50, then you should say 100, and so on. However, the number I’m looking for isn’t exactly a double -- I randomly make it up to 20% higher or lower. So the “correct” answer for 50 may be 100, 112, or 93, depending on my random choice. Under these conditions, no deep learning, black magic, or alien technology can help you. You can only get, on average, within about 20% of the right number. Even full knowledge of how to multiply numbers (“a perfect oracle” in AI-speak) can’t help.
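To make the ceiling concrete, here is a minimal simulation of that doubling game. It is only a sketch: the 5% grading tolerance and all names in it are illustrative assumptions, not anything from the article.

```python
import random

random.seed(0)

def noisy_label(x):
    """A labeler's answer: the true double, perturbed by up to +/-20%."""
    return 2 * x * random.uniform(0.8, 1.2)

def perfect_oracle(x):
    """A model with full knowledge of the task: it always returns the exact double."""
    return 2 * x

# Grade the oracle against noisy labels, counting a prediction as "correct"
# if it lands within 5% of whatever label this particular labeler produced.
inputs = [random.uniform(1, 100) for _ in range(100_000)]
hits = sum(
    abs(perfect_oracle(x) - noisy_label(x)) / noisy_label(x) < 0.05
    for x in inputs
)
print(f"Oracle 'accuracy' against noisy labels: {hits / len(inputs):.1%}")
# Prints roughly 25%: the remaining error comes from the labels, not the model.
```

No improvement to the oracle can raise that number; only reducing the label noise can.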
This is a very common problem in practice:
• If you’re building a fraud detection algorithm, can you get all of your tax experts to agree on which tax returns are audit-worthy?
• If you’re building a search engine, can you get all of your relevance judges to agree on whether each search result for Benedict Cumberbatch is “hardly,” “somewhat,” or “highly” relevant?
• If you’re building a customer service app, can you get all your agents to agree on which calls made the customer happy and which ones didn't?
The answer to all of these questions is no. It’s the same answer if you need your judges to decide whether an email is spam; whether a sentence has a positive or negative sentiment; whether a product violates your online marketplace’s guidelines; how many stars a free-text restaurant review translates to; whether a CT image requires follow-up with an oncologist; or whether an image is sexually explicit.
Taking Steps to Improve Agreement
The first thing to do is realize that your “golden sets” aren’t pure gold -- they’re more of an alloy. Now that you know, there is a lot that you can and should do:
• Measure the disagreement rate between your judges. This isn’t hard -- just give the same cases to different judges and count how often they return the same result (see the sketch after this list).
• Once your data science team builds a model that reaches that number, there’s little point in asking them to improve it. If you’re way above it, then you’re overfitting.
• Take active steps to reduce your disagreement rate: write more detailed guidelines, invest more in training judges, hold weekly reviews to discuss edge cases with judges, and use a two-tier process in which senior judges review junior (or crowd-sourced) ones.
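As a sketch of the first bullet, here is one way to compute raw agreement and chance-corrected agreement (Cohen’s kappa) from a double-labeled sample. The codes and data below are made up for illustration, and the sample is deliberately tiny.

```python
from collections import Counter

# Hypothetical double-labeled sample: each pair is (judge A's code, judge B's code)
# for the same clinical note. The codes here are made up for illustration.
double_labeled = [
    ("K35.80", "K35.80"), ("K35.80", "K35.2"), ("J18.9", "J18.9"),
    ("J18.9", "J15.9"), ("I10", "I10"), ("E11.9", "E11.9"),
    ("E11.9", "E11.65"), ("K35.80", "K35.80"), ("I10", "I10"),
    ("J18.9", "J18.9"),
]

def raw_agreement(pairs):
    """Fraction of cases where the two judges assigned the exact same code."""
    return sum(a == b for a, b in pairs) / len(pairs)

def cohens_kappa(pairs):
    """Agreement corrected for the agreement you'd expect by chance alone."""
    n = len(pairs)
    observed = raw_agreement(pairs)
    counts_a = Counter(a for a, _ in pairs)
    counts_b = Counter(b for _, b in pairs)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

print(f"Raw agreement: {raw_agreement(double_labeled):.0%}")   # the ceiling for model accuracy
print(f"Cohen's kappa: {cohens_kappa(double_labeled):.2f}")    # chance-corrected agreement
```

The raw agreement figure is the number to hold your model’s accuracy against; kappa is worth tracking alongside it, because raw agreement looks flattering when a few codes dominate the data.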
That last point is the key to solving the problem. Sometimes a strong operations or project manager who can raise your labeler agreement rate from 70% to 85% will add more accuracy to your model than an army of doctorate-level data scientists.