Bringing Natural Language Processing (NLP) models into production is a lot like buying a car. Either way, you're setting your parameters for the desired result, testing several approaches, potentially retesting them, and the minute you drive off the lot, the value starts dropping. Like owning a car, owning NLP or AI-enabled products has many benefits, but the maintenance never stops; at least it shouldn't, if you want them to function properly over time.
While producing AI is challenging enough, maintaining model accuracy in a real-world environment can present even greater governance challenges. A model's accuracy begins to decline the moment it reaches the market, because the predictable, controlled research environment it was trained in behaves differently from real life, just as the highway is a different environment than the dealership test drive.
This is known as concept drift: when the underlying variables change, what the model has learned may no longer hold. It is not new to AI and machine learning (ML), but it continues to challenge practitioners. It is also a contributing factor to why, despite huge investments in artificial intelligence and natural language processing in recent years, only about 13% of data science projects actually make it into production (VentureBeat).
So what does it take to safely move products from research to production? Arguably just as important, what does it take to keep them performing accurately as conditions change? There are a few considerations that organizations need to keep in mind to ensure that their AI investments actually see the light of day.
Introducing artificial intelligence models into production
Model management is a major component of producing NLP initiatives and a common reason why many products remain projects. Model management covers how a company tracks the activity, access, and behavior of models in a given production environment. It is important to monitor this to reduce risk, troubleshoot and maintain compliance. This concept is well understood among the global AI community, but it is also a thorn in their side.
Data from the 2021 NLP Industry Survey showed that high-precision tools that are easy to adjust and customize were a top priority among respondents. Technology leaders echoed this, noting that accuracy, followed by production readiness and scalability, was vital when evaluating NLP solutions. Continuous tuning is fundamental to keeping models performing accurately over time, but it is also the biggest challenge that practitioners face.
NLP projects typically involve pipelines, in which the output of an upstream task and a pre-trained model feed downstream tasks. Oftentimes, models need to be fine-tuned and customized to their specific domains and applications. For example, a healthcare model trained on academic papers or medical journals will not perform as well when used by a media company to identify fake news.
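To make the domain-adaptation idea concrete, here is a deliberately toy, stdlib-only sketch: a generic "pre-trained" cue lexicon is fine-tuned with domain-specific terms before being used downstream. The lexicon, weights, and function names are all hypothetical stand-ins for a real pre-trained model and fine-tuning step, not any particular library's API.

```python
# Toy illustration (hypothetical): a generic "pre-trained" cue lexicon
# fine-tuned with domain-specific terms before downstream use.

BASE_LEXICON = {"fraudulent": 1.0, "misleading": 0.8}  # generic cues

def fine_tune(lexicon, domain_terms):
    """Return a copy of the base lexicon extended with domain-specific cues."""
    tuned = dict(lexicon)
    tuned.update(domain_terms)
    return tuned

def score(text, lexicon):
    """Sum cue weights found in the text -- a stand-in for a real classifier."""
    words = text.lower().split()
    return sum(weight for cue, weight in lexicon.items() if cue in words)

# Adapt the generic model to a media company's fake-news domain.
media_model = fine_tune(BASE_LEXICON, {"clickbait": 0.9, "unverified": 0.7})
print(score("an unverified clickbait story", media_model))
```

The point is the shape of the workflow, not the scoring logic: a shared base asset, a cheap domain-specific adaptation step, and a downstream consumer of the adapted model.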
Better research and collaboration among the AI community will play a key role in standardizing exemplary governance practices. This involves storing modeling assets in a searchable catalog, including notebooks, data sets, resulting measurements, hyperparameters, and other metadata. Enabling experiments to be replicated and shared across data science team members is another area that will be useful for those trying to take their projects to production.
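A searchable catalog of modeling assets can be as simple as structured records keyed by a reproducible run ID. The sketch below (stdlib-only; the field names, model name, and dataset path are illustrative, not from any specific tool) shows how hashing the inputs that define a run makes identical experiments land on the same ID, which is what enables replication across a team.

```python
# Hypothetical experiment registry: store hyperparameters, metrics, and
# dataset references under a reproducible run ID so runs can be searched
# and replicated. All names here are illustrative.
import json
import hashlib

def log_run(registry, model_name, hyperparams, metrics, dataset_ref):
    """Append an experiment record keyed by a deterministic run ID."""
    payload = json.dumps(
        {"model": model_name, "hyperparams": hyperparams, "dataset": dataset_ref},
        sort_keys=True,
    )
    run_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
    registry[run_id] = {
        "model": model_name,
        "hyperparams": hyperparams,
        "metrics": metrics,
        "dataset": dataset_ref,
    }
    return run_id

registry = {}
run_id = log_run(
    registry,
    "ner-clinical",                    # hypothetical model name
    {"lr": 3e-5, "epochs": 4},
    {"f1": 0.91},
    "datasets/notes-v2",               # hypothetical dataset reference
)
print(run_id, registry[run_id]["metrics"])
```

In practice a dedicated experiment-tracking system would back this with durable storage and a search UI, but the contract is the same: every run is addressable, and its inputs and outputs travel together.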
Tactically, rigorous testing and retesting is the best way to ensure that models behave the same way in production as they do in research, two completely different environments. Promoting models to release candidates only after trials, testing those candidates for accuracy, bias, and stability, and validating models before releasing them into new geographies or populations are all practices that practitioners should adopt.
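Those release-candidate checks can be expressed as an automated gate. The following sketch is hypothetical (the thresholds, metric names, and subgroup definitions are illustrative, not a standard): a candidate is promoted only if it clears accuracy, bias, and stability checks against the current production baseline.

```python
# Hypothetical release gate: promote a candidate model only if it passes
# accuracy, bias, and stability checks. Thresholds are illustrative.

def ready_for_release(candidate, baseline,
                      min_accuracy=0.85, max_bias_gap=0.05, max_drop=0.02):
    checks = {
        # absolute accuracy floor
        "accuracy": candidate["accuracy"] >= min_accuracy,
        # bias: accuracy gap between best- and worst-served subgroup
        "bias": (max(candidate["subgroup_accuracy"].values())
                 - min(candidate["subgroup_accuracy"].values())) <= max_bias_gap,
        # stability: no large regression vs. the current production model
        "stability": baseline["accuracy"] - candidate["accuracy"] <= max_drop,
    }
    return all(checks.values()), checks

ok, checks = ready_for_release(
    {"accuracy": 0.88, "subgroup_accuracy": {"urban": 0.89, "rural": 0.86}},
    {"accuracy": 0.87},
)
print(ok, checks)
```

Validating before a rollout to a new geography or population then amounts to recomputing `subgroup_accuracy` on data from that population and running the same gate.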
With any software launch, security and compliance must be built into the strategy from the start, and AI projects are no different. Role-based access control, approval workflows for model versioning and storage, and provision of all the metadata needed for a complete audit trail are some of the security measures needed for a model to be production-ready.
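As a minimal sketch of how role-based approval and an audit trail fit together (the roles, field names, and workflow here are hypothetical, not any particular platform's API): every promotion request is recorded, whether or not it is approved, so the log alone can reconstruct who attempted what and when.

```python
# Hypothetical sketch: role-based approval for promoting a model version,
# with every request captured for an audit trail. Roles and fields are
# illustrative.
from datetime import datetime, timezone

PROMOTE_ROLES = {"ml-lead", "compliance"}  # roles allowed to promote

def request_promotion(audit_log, user, role, model, version):
    """Record the request and return whether it was approved."""
    approved = role in PROMOTE_ROLES
    audit_log.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "action": "promote",
        "model": model,
        "version": version,
        "approved": approved,
    })
    return approved

audit_log = []
request_promotion(audit_log, "dana", "ml-lead", "ner-clinical", "1.4.0")
request_promotion(audit_log, "sam", "analyst", "ner-clinical", "1.4.0")
print([entry["approved"] for entry in audit_log])
```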
These practices can greatly improve the chances of AI projects moving from concept to production. Just as importantly, they lay the foundation for the practices that must continue once the product is in customers' hands.
Keeping AI models in production
Going back to the car analogy: There is no AI-specific “check engine” light in production, so data teams need to constantly monitor their models. Unlike traditional software projects, it is important to keep data scientists and engineers on the project, even after the model has been deployed.
From an operational point of view, this requires more resources, both in human capital and in cost, and this may be why many organizations fail to do it. The pressure to keep up with the pace of business and move on to the "next thing" is a factor as well, but perhaps the biggest oversight is that many IT leaders simply don't expect model deterioration to be an issue.
In health care, for example, a model can analyze electronic medical records (EMRs) to predict a patient's likelihood of needing an emergency C-section based on risk factors such as obesity, smoking or drug use, and other determinants of health. If a patient is classified as high-risk, the practitioner may ask her to come in earlier or more frequently to reduce pregnancy complications.
The model expects these risk factors to remain constant over time, and while many do, patients are less predictable. Did they quit smoking? Were they diagnosed with gestational diabetes? There are also nuances in the way a doctor asks a question and records the answer in the hospital record that may lead to different results.
This can become more difficult when considering the NLP tools most practitioners use. The majority (83%) of respondents from the above survey stated that they have used at least one of the following cloud NLP services: AWS Comprehend, Azure Text Analytics, Google Cloud Natural Language AI, or IBM Watson NLU. While the popularity and accessibility of cloud services is clear, technology leaders cited the difficulty and cost of tuning these models as major challenges. Essentially, even experts struggle to maintain the accuracy of models in production.
Another problem is that it simply takes time to see if something is wrong, and that duration can vary greatly. Amazon may update its fraud detection algorithm and accidentally block customers in the process. Within hours, maybe even minutes, customer service emails will indicate a problem. In health care, it can take months to gather enough data on a specific condition to tell that the model has deteriorated.
Essentially, to maintain the accuracy of the models, you need to apply the same rigor of testing and automation of the retraining and scaling pipelines that was done before the model was deployed. When dealing with AI and machine learning models in production, it is more appropriate to anticipate problems than to expect optimal performance after several months.
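One common way to automate that ongoing monitoring is a drift check that compares the distribution of production inputs or prediction scores against the training baseline. The sketch below computes the Population Stability Index (PSI), a widely used drift measure; the bin values are made up, and the 0.2 threshold is a common rule of thumb rather than a universal standard.

```python
# Drift check using the Population Stability Index (PSI): compare a binned
# production distribution against the training-time baseline. Data and the
# 0.2 threshold are illustrative.
import math

def psi(expected, actual, eps=1e-6):
    """PSI between two binned distributions given as lists of proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]    # score bins at training time
production = [0.10, 0.20, 0.30, 0.40]  # score bins observed in production

drift = psi(baseline, production)
needs_retraining = drift > 0.2         # rule-of-thumb drift threshold
print(round(drift, 3), needs_retraining)
```

Wiring a check like this into the retraining pipeline is what turns "anticipate problems" from a slogan into an alert: when drift crosses the threshold, the same automated training and validation steps used before deployment can be triggered again.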
When you think about all the work it takes to bring models into production and keep them performing well, it's understandable why 87% of data projects never make it to market. Despite this, 93% of tech leaders indicated that their NLP budgets had grown 10-30% compared to last year (Gradient Flow). It is encouraging to see increased investment in NLP technology, but it is all for nothing if companies do not account for the expertise, time, and constant updating required to deploy successful NLP projects.