Operating AI Models Safely in Production

Deploying and maintaining natural language processing (NLP) models in production comes with its challenges, especially in ensuring model accuracy over time in real-world environments.

Nov 17, 2021

Getting natural language processing (NLP) models into production is a lot like buying a car. In both cases, you set your parameters for your desired outcome, test several approaches, likely retest them, and the minute you drive off the lot, value starts to plummet. Like having a car, having NLP or AI-enabled products has many benefits, but the maintenance never stops, or at least it shouldn’t if the product is to function properly over time.

While productionizing AI is hard enough, ensuring the accuracy of models down the line in a real-world environment can present even bigger governance challenges. A model’s accuracy starts to degrade the moment it hits the market, because real-world data behaves differently from the predictable research environment the model was trained in, just as the highway is a different scenario than the lot at the dealership.

This is called concept drift: when the variables a model depends on change, the concept it learned may no longer be accurate. While it’s nothing new in the field of AI and machine learning (ML), it continues to challenge users. It’s also a contributing factor as to why, despite huge investments in AI and NLP in recent years, only around 13% of data science projects actually make it into production (VentureBeat).
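One rough way to notice that variables have changed is to compare the distribution of a model input or output in production against what was seen in training. The sketch below uses the Population Stability Index (PSI), a common measure of distribution shift. It is an illustration rather than a prescribed method: the function name `psi`, the `eps` floor, and the sample labels are assumptions, and a shifted distribution is only a proxy for drift in the learned concept itself.

```python
import math
from collections import Counter


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two samples of categorical values.

    `expected` is the reference sample (e.g. training data) and `actual`
    is the production sample. Returns 0.0 when the distributions match
    and grows as they diverge.
    """
    exp_counts = Counter(expected)
    act_counts = Counter(actual)
    score = 0.0
    for category in set(exp_counts) | set(act_counts):
        # Floor each share at eps so unseen categories do not cause log(0).
        p = max(exp_counts[category] / len(expected), eps)
        q = max(act_counts[category] / len(actual), eps)
        score += (q - p) * math.log(q / p)
    return score


# Hypothetical predicted labels at training time versus in production.
training_labels = ["positive"] * 50 + ["negative"] * 50
production_labels = ["positive"] * 90 + ["negative"] * 10

drift_score = psi(training_labels, production_labels)
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the right threshold depends on the application and should be validated against labeled outcomes.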

So what does it take to move products safely from research to production? Arguably just as important, what does it take to keep them in production accurately with the changing tides? There are a few considerations that enterprises should keep in mind to make sure their AI investments actually see the light of day.

Getting AI models into production

Model governance is a key component in productionizing NLP initiatives and a common reason so many products remain projects. Model governance covers how a company tracks activity, access, and behavior of models in a given production environment. It’s important to monitor this to mitigate risk, troubleshoot, and maintain compliance. This concept is well understood among the global AI community, but it’s also a thorn in its side.

Data from the 2021 NLP Industry Survey showed that high-accuracy tools that are easy to tune and customize were a top priority among respondents. Tech leaders echoed this, noting that accuracy, followed by production readiness and scalability, was the most important factor when evaluating NLP solutions. Constant tuning is key to models performing accurately over time, but it’s also the biggest challenge practitioners face.

NLP projects involve pipelines, in which the results from a previous task and pre-trained model are used downstream. Often, models need to be tuned and customized for their specific domains and applications. For example, a healthcare model trained on academic papers or medical journals will not perform the same when used by a media company to identify fake news.

Better searchability and collaboration among the AI community will play a key role in standardizing model governance practices. This includes storing modeling assets in a searchable catalog, including notebooks, datasets, resulting measurements, hyper-parameters, and other metadata. Enabling reproducibility and sharing of experiments across data science team members is another area that will be advantageous to those trying to get their projects to production-grade.

More tactically, rigorous testing and retesting is the best way to ensure models behave the same in production as they do in research, two very different environments. All practitioners should be versioning models that have advanced beyond an experiment to a release candidate, testing those candidates for accuracy, bias, and stability, and validating models before launching in new geographies or populations.
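As a minimal sketch of what testing a release candidate for accuracy, bias, and stability could look like, the example below checks a versioned candidate against three thresholds. Everything here is hypothetical: the `CandidateReport` fields, the `release_blockers` function, and the threshold values are illustrative assumptions, and a real gate would use metrics chosen for the specific domain.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateReport:
    """Evaluation results for one versioned release candidate."""
    version: str
    accuracy: float                                      # held-out test accuracy
    group_accuracy: dict = field(default_factory=dict)   # accuracy per population segment
    rerun_accuracies: tuple = ()                         # accuracy over repeated runs


def release_blockers(report, min_accuracy=0.90, max_group_gap=0.05, max_spread=0.02):
    """Return the list of checks the candidate fails (empty means it passes)."""
    blockers = []
    if report.accuracy < min_accuracy:
        blockers.append("accuracy")
    if report.group_accuracy:
        # A large gap between best and worst segment is a simple bias signal.
        scores = report.group_accuracy.values()
        if max(scores) - min(scores) > max_group_gap:
            blockers.append("bias")
    if report.rerun_accuracies:
        # A wide spread across reruns suggests the model is not stable.
        if max(report.rerun_accuracies) - min(report.rerun_accuracies) > max_spread:
            blockers.append("stability")
    return blockers


candidate = CandidateReport(
    version="ner-2.1.0-rc1",
    accuracy=0.93,
    group_accuracy={"region_a": 0.94, "region_b": 0.80},
    rerun_accuracies=(0.93, 0.925, 0.935),
)
print(candidate.version, release_blockers(candidate))
```

Running the same gate with segment data from a new geography or population before launch is one way to apply the validation step described above.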

With any software launch, security and compliance should be baked into the strategy from the start, and AI projects are no different. Role-based access control, an approval workflow for model release, and storage of all the metadata needed for a full audit trail are some of the security measures necessary for a model to be considered production-ready.
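To make these measures concrete, here is a small sketch that combines role-based access control, an approval step, and an append-only audit log. The role names, the `ReleaseWorkflow` class, and the log fields are all hypothetical and stand in for whatever identity and registry systems an organization already runs.

```python
from datetime import datetime, timezone

# Hypothetical roles and the actions each one may perform.
ROLE_PERMISSIONS = {
    "data_scientist": {"submit"},
    "release_manager": {"submit", "approve"},
}


class ReleaseWorkflow:
    def __init__(self):
        self.status = {}      # model version -> "submitted" or "approved"
        self.audit_log = []   # append-only record of every attempt

    def _authorize(self, user, role, action, version):
        allowed = action in ROLE_PERMISSIONS.get(role, set())
        # Log the attempt whether or not it is allowed, for the audit trail.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "action": action,
            "version": version,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{role} may not {action} {version}")

    def submit(self, user, role, version):
        self._authorize(user, role, "submit", version)
        self.status[version] = "submitted"

    def approve(self, user, role, version):
        self._authorize(user, role, "approve", version)
        if self.status.get(version) != "submitted":
            raise ValueError(f"{version} has not been submitted for release")
        self.status[version] = "approved"


workflow = ReleaseWorkflow()
workflow.submit("ana", "data_scientist", "ner-2.1.0-rc1")
workflow.approve("lee", "release_manager", "ner-2.1.0-rc1")
```

Denied attempts are logged before the error is raised, so the trail shows who tried to release a model as well as who succeeded.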

These practices can significantly improve the chances of AI projects moving from ideation to production. More importantly, they help set the foundation for practices that should be applied once a product is customer-ready.

Keeping AI models in production

Back to the car analogy: There’s no definitive “check engine” light for AI in production, so data teams need to be constantly monitoring their models. Unlike in traditional software projects, data scientists and engineers need to stay on the project even after the model is deployed.

From an operational standpoint, this requires more resources, in both people and cost, which may be why so many organizations fail to do this. The pressure to keep up with the pace of business and move on to the ‘next thing’ also factors in, but perhaps the biggest oversight is that even IT leaders don’t expect model degradation to be a problem.

In healthcare, for example, a model can analyze electronic medical records (EMRs) to predict a patient’s likelihood of having an emergency C-section based on risk factors such as obesity, smoking or drug use, and other determinants of health. If the patient is deemed high-risk, their practitioner may ask them to come in earlier or more frequently to reduce pregnancy complications.

The expectation is that these risk factors remain constant over time, and while many of them do, the patient is less predictable. Did they quit smoking? Were they diagnosed with gestational diabetes? There are also nuances in the way the clinician asks a question and records the answer in the hospital record that could result in different outcomes.

This can become even more tricky when you consider the NLP tools most practitioners are using. A majority (83%) of respondents from the aforementioned survey stated that they used at least one of the following NLP cloud services: AWS Comprehend, Azure Text Analytics, Google Cloud Natural Language AI, or IBM Watson NLU. While the popularity and accessibility of cloud services are obvious, tech leaders cited difficulty in tuning models and cost as major challenges. Essentially, even experts are grappling with maintaining the accuracy of models in production.

Another problem is that it simply takes time to see when something’s amiss. How long that is can vary significantly. Amazon might update a fraud detection algorithm and mistakenly block customers in the process. Within hours, maybe even minutes, customer service emails will point to an issue. In healthcare, it can take months to get enough data on a certain condition to see that a model has degraded.

Essentially, to keep models accurate you need to apply the same rigor in testing, measurement, and automated retraining pipelines that was applied before the model was deployed. When dealing with AI and ML models in production, it’s more realistic to expect problems than to expect optimal performance several months out.
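One simple way to approximate the missing “check engine” light is to track accuracy over a rolling window of labeled production examples and flag the model for retraining when it falls too far below its pre-deployment baseline. The sketch below assumes ground-truth labels eventually arrive for production predictions; the `AccuracyMonitor` class, its parameters, and the numbers used are illustrative assumptions.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks rolling accuracy and flags when it drops below the baseline."""

    def __init__(self, baseline, tolerance=0.05, window=100):
        self.baseline = baseline      # accuracy measured before deployment
        self.tolerance = tolerance    # acceptable drop before retraining
        self.outcomes = deque(maxlen=window)  # True/False per labeled prediction

    def record(self, prediction, label):
        self.outcomes.append(prediction == label)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Wait for a full window so a few early errors do not trigger a retrain.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.rolling_accuracy() < self.baseline - self.tolerance


monitor = AccuracyMonitor(baseline=0.90, tolerance=0.05, window=4)
for prediction, label in [("high", "high"), ("low", "low"), ("high", "low"), ("low", "high")]:
    monitor.record(prediction, label)

if monitor.needs_retraining():
    print("Rolling accuracy", monitor.rolling_accuracy(), "is below baseline; schedule a retrain")
```

The window size sets how quickly a problem surfaces: where labeled outcomes arrive slowly, as in the healthcare example, the window may take months to fill.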

When you consider all the work it takes to get models into production and keep them there safely, it’s understandable why 87% of data projects never make it to market. Despite this, 93% of tech leaders indicated that their NLP budgets grew by 10-30% compared to last year (Gradient Flow). It’s encouraging to see growing investments in NLP technology, but it’s all for naught if businesses don’t account for the expertise, time, and continual updating required to deploy successful NLP projects.
