Real-World Limits to the Accuracy of Medical Data
Recent research found errors in widely used datasets, especially in healthcare. NLP technology helps, but limitations exist. Trust in AI requires reproducibility and recognition of its limitations.
Algorithms are only as good as the data they are fed. This is not a new concept, but as we rely more heavily on data-driven technologies, such as artificial intelligence (AI) and other automation tools and applications, it is becoming a more important one.
Recent research from MIT found a high number of errors in publicly available datasets that are widely used for training models. An average of 3.3% errors were found in the test sets of 10 of the most widely used computer vision, natural language processing (NLP), and audio datasets.
Given that accuracy baselines on these datasets are often at or above 90%, many reported improvements are smaller than the error rate of the test set itself. This means that a lot of research innovation may amount to chance, or to overfitting to errors. Data science practitioners should exercise caution when choosing which models to deploy based on small accuracy gains on such datasets.
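To make the arithmetic concrete, here is a minimal sketch that compares the number of test examples behind a small accuracy gain with the number of erroneous test examples. Only the 3.3% error rate comes from the research cited above; the test set size, the two accuracy figures, and the function name are hypothetical and chosen purely for illustration. This is a rough sanity check, not a statistical significance test.

```python
def compare_gain_to_test_errors(n_test, error_rate, acc_old, acc_new):
    """Compare an accuracy gain with the number of erroneous test examples.

    All arguments are illustrative inputs; this is a back-of-the-envelope
    check, not a statistical test.
    """
    # Test examples expected to carry an error at the given rate.
    erroneous = round(n_test * error_rate)
    # Test examples that account for the reported accuracy gain.
    behind_gain = round(n_test * (acc_new - acc_old))
    return {
        "erroneous_examples": erroneous,
        "examples_behind_gain": behind_gain,
        # True when the gain is no larger than the pool of bad examples,
        # so it could plausibly be explained by them.
        "gain_within_error_margin": behind_gain <= erroneous,
    }


# Hypothetical scenario: 10,000 test examples, 3.3% errors (from the article),
# and a made-up improvement from 91.0% to 91.5% accuracy.
result = compare_gain_to_test_errors(10_000, 0.033, 0.910, 0.915)
print(result)
```

In this made-up scenario, roughly 330 test examples are erroneous while only about 50 examples account for the half-point gain, which is why a gain of that size deserves scrutiny before it drives a deployment decision.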
These findings are particularly concerning for AI applications in high-stakes industries like healthcare. Work in this field can help prevent disease, accelerate the development of life-saving medicine, and help us understand the spread of disease and other critical health trends. Accuracy is vital to success in healthcare, but the field is rife with complexities that make accuracy extremely challenging to achieve.
One of the reasons for this is the data source. More than half of the clinically relevant data for applications like recommending a course of treatment, finding actionable genomic biomarkers, or matching patients to clinical trials is found only in free text and other unstructured sources. These include physicians’ notes, diagnostic imaging, pathology reports, lab reports, and other sources not available as structured data within electronic health records (EHRs). These information sources include nuances and data quality issues that make it hard to connect the dots and get a full picture of a patient.
Another barrier exists in the limitations of what's in the data itself. Because there are no shared standards for data collection across hospitals and healthcare systems, inconsistencies and inaccuracies are common. Between different organizations collecting different information and records not being updated on a consistent basis, it’s difficult to know how accurate the data is — especially if it’s being moved and updated among different providers.
Providers are not the only source of error, either: inaccuracies can also come directly from patients. A recent study in the Journal of General Internal Medicine shows just how prevalent this can be. When exploring the accuracy of race, ethnicity, and language preference in EHRs, the study found that 30% of whites self-reported identification with at least one other racial or ethnic group, as did 37% of Hispanics and 41% of African Americans. Patients were also less likely to complete the survey in Spanish than the language preference noted in the EHR would have suggested.
There’s clearly a need for better data collection practices in healthcare and beyond. Accurate information can help the medical community understand more about social determinants of health, patient risk prediction, clinical trial matching, and more. Standardizing how this data is collected and recorded can help ensure that clean data gets shared and analyzed correctly. This is both a medical and a social challenge. For example, what is the “correct” race to fill in? When exactly is someone considered a smoker? It is also partly a technology challenge, as we’re already way beyond the limit of what’s reasonable to ask providers and patients to input manually.
There are also data quality issues outside our direct control, such as fraud and abuse. The National Health Care Anti-Fraud Association estimates that "healthcare fraud costs the nation about $68 billion annually — about 3% of the nation's $2.26 trillion in healthcare spending. Other estimates range as high as 10% of annual healthcare expenditure, or $230 billion." While we can account for error rates within the data, it’s an imperfect science at the end of the day, and it’s important to understand its limitations.
That said, it’s not all doom and gloom when it comes to data quality or the algorithms we use. Technology that can automatically understand the nuances of unstructured text and images, as well as reconcile conflicting and missing data points, is gradually maturing. NLP, for example, can address many pitfalls of data quality, such as uncovering discrepancies between an EHR, a doctor’s transcript, and what a patient self-reports. In recent years, newer algorithms and models have become able to use the context, medium, and intent of each data source to infer useful semantic answers.
This is especially useful when you consider how specific clinical language is. Take how we indicate triple-negative breast cancer (TNBC), for instance. While the acronym TNBC isn’t hard to identify, the condition can also be denoted as “Er-/pr-/h2-”, “(er pr her2) negative”, “tested negative for the following: er, pr, h2”, and “triple-negative neoplasm of the upper left breast”, to name a few. NLP can identify variations of these terms when they are in context, and healthcare-specific deep learning models have gotten very good at this.
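To illustrate why this surface variation is hard to handle by hand, here is a minimal rule-based sketch that tries to recognize the TNBC mentions listed above. It is not how healthcare-specific deep learning models work; it is a hand-written set of regular expressions, and the function name and rules are assumptions made for this example. It ignores negation scope, misspellings, and context, which are exactly the cases where context-aware NLP models are needed.

```python
import re

# Surface patterns for the three receptors. Treating "h2" as shorthand for
# HER2 follows the examples in the text above.
RECEPTOR_PATTERNS = [r"\ber\b", r"\bpr\b", r"\b(?:her2|h2)\b"]

# Explicit mentions: the acronym, or "triple-negative" / "triple negative".
EXPLICIT = re.compile(r"\btnbc\b|\btriple[- ]negative\b")


def mentions_tnbc(text: str) -> bool:
    """Crude, illustrative check for a triple-negative breast cancer mention."""
    t = text.lower()
    if EXPLICIT.search(t):
        return True
    # Shorthand such as "er-/pr-/h2-": every receptor directly followed by "-".
    if all(re.search(p + r"-", t) for p in RECEPTOR_PATTERNS):
        return True
    # All three receptors are listed, the word "negative" appears, and
    # "positive" does not. This rule does not check which receptor each
    # word actually applies to.
    all_listed = all(re.search(p, t) for p in RECEPTOR_PATTERNS)
    return all_listed and "negative" in t and "positive" not in t


variants = [
    "TNBC",
    "Er-/pr-/h2-",
    "(er pr her2) negative",
    "tested negative for the following: er, pr, h2",
    "triple-negative neoplasm of the upper left breast",
]
for v in variants:
    print(v, "->", mentions_tnbc(v))
```

Each new phrasing requires another hand-written rule, and a sentence like "not triple negative" would still be flagged incorrectly, which is why models that read terms in context scale better than pattern lists.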
State-of-the-art, peer-reviewed, publicly reproducible accuracy results, on both competitive academic benchmarks and real-world production deployments, have been steadily improving over the last five years. Libraries like Spark NLP surpass 90% accuracy on a variety of clinical and biomedical text understanding tasks. Reproducibility of results, consistency of applying clinical guidelines at scale, and the ability to easily tune models to a specific clinical use case or setting are three keys to successful implementations and to building broader trust in AI technology.
Healthcare is a complex field, and so, too, is its data. The technology that helps us use data to make decisions in this field will keep improving. But it’s critical to remember the fundamental limitations in the quality and accuracy of the data that powers these algorithms. Simply put, it’s unsafe to assume that a piece of data is correct because someone typed it into a computer.