Setting a Higher Standard for State-Of-The-Art Applied AI
We'll explore why "state-of-the-art" in academia falls short of what industry needs, and why applied AI should also be reproducible, deployed in production, and backed by open code. Transparency is crucial for trustworthy AI.
Over the past four years, I’ve participated in the technical due diligence of nearly twenty companies claiming some proprietary artificial intelligence (AI) “secret sauce.” The results split evenly into three groups: companies showing smoke and mirrors, companies a year or two away from a sound application of machine learning, and companies that already had one.
Every company affirmed that its AI was next-generation, world-class, cutting-edge, breakthrough, enterprise-grade, market-leading, or some other gobbledygook. Given the uniform distribution of actual AI capabilities, it’s no wonder that the average technology buyer has little trust in such assertions. My dataset, although small, suggests that this skepticism is warranted.
The term “state-of-the-art,” on the other hand, has real, concrete meaning in academia: The best documented, peer-reviewed results obtained on a problem for a reproducible benchmark. It’s not a claim you can make without proof — or sustain over time without continuing to innovate. In AI, the website Papers with Code curates more than 3,600 state-of-the-art benchmarks and almost 40,000 papers that are ranked for the results they produce, covering computer vision, language, speech, music, games, robotics, and more.
However, it has become clear that the bar for state-of-the-art applied AI used in real industry systems must be higher. After all, most academic papers will never see the light of day in terms of actual industrial applications. Delivering real-world AI systems requires more than beating an academic benchmark in a controlled experimental setting.
To help organizations tell the difference, it’s important to understand what makes applied AI state-of-the-art and the criteria used to define it. Here, we’ll explore the three criteria technology leaders should consider before selecting a solution for their business needs or building one of their own that lives up to its state-of-the-art promises.
It’s peer-reviewed and reproducible
The first criterion requires your state-of-the-art software to deliver the best accuracy on benchmarks that are public, reproducible, and trainable. First, a public benchmark should be designed by a third party, not by the vendor or an affiliated team. It must also have a public baseline that keeps rising as multiple teams compete to improve it. For example, the NLP-progress website tracks such benchmarks in natural language processing.
Second, the solution should be reproducible, meaning anyone outside the provider’s team should be able to reproduce the same results from scratch. This should include the choice of accuracy metric, hyperparameters, train/test split, software version or hardware used, and so on.
Lastly, trainability is an important factor. It should be possible to reproduce both the model training and the inference stages, not just run a pre-trained model. This matters because, in practice, a top-ranked solution may not match your use case, and you may need to retrain it on your own data. In a healthcare setting, for example, you may care about identifying cardiology-specific terms, which no current benchmark specializes in. It’s also likely that new papers will outperform the current state-of-the-art results within a few months, so keep that in mind when evaluating a solution.
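To make the reproducibility checklist above concrete, here is a minimal Python sketch of an experiment manifest that records the accuracy metric, hyperparameters, train/test split, software version, and hardware in one place. Everything in it is illustrative: the function names, the manifest fields, and the example metric and hyperparameter values are assumptions, not part of any particular benchmark or tool.

```python
import hashlib
import json
import platform
import random
import sys


def split_indices(n_examples, test_fraction, seed):
    """Deterministically split example indices into train and test sets."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # seeded, so the split can be rerun
    n_test = int(n_examples * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def build_manifest(metric, hyperparameters, n_examples, test_fraction, seed):
    """Collect the details a third party would need to rerun the experiment."""
    train_idx, test_idx = split_indices(n_examples, test_fraction, seed)
    # Hash the held-out indices so others can confirm they test on the same examples.
    test_hash = hashlib.sha256(json.dumps(test_idx).encode("utf-8")).hexdigest()
    return {
        "metric": metric,
        "hyperparameters": hyperparameters,
        "split": {
            "seed": seed,
            "test_fraction": test_fraction,
            "n_train": len(train_idx),
            "n_test": len(test_idx),
            "test_indices_sha256": test_hash,
        },
        "software": {"python": sys.version.split()[0]},
        "hardware": {"machine": platform.machine(), "system": platform.system()},
    }


# Hypothetical values, for illustration only.
manifest = build_manifest(
    metric="macro_f1",
    hyperparameters={"learning_rate": 3e-5, "epochs": 4},
    n_examples=1000,
    test_fraction=0.2,
    seed=13,
)
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Publishing a file like this alongside the reported score is one way to let someone outside the provider’s team check that they are rerunning the same experiment. A real project would likely also pin library versions and record the dataset version.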
It’s in production at multiple companies
You cannot claim an AI system to be “applied state-of-the-art” if it isn’t “applied” in multiple, real production systems. Real-world data is different from academic data — it’s more diverse, noisy, dynamic, and biased. The model that performs best academically is not always the best performer in practice. This is why the industry needs data scientists, tools, and processes that train custom models. While academic benchmarks are useful, they have limitations.
Additionally, production readiness has its own set of requirements. When a solution runs in production at multiple companies, multiple independent teams will have evaluated its code quality, error handling, logging, monitoring, scalability, security, privacy, deployment, upgrade process, compute, and memory use, plus aspects of bias, explainability, and concept drift.
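As one small illustration of the monitoring side of that list, the sketch below computes a population stability index (PSI) for a single numeric feature, comparing a reference sample (for example, training data) with a live sample. This is only one common way to flag a shift in input data, and it is not a complete concept-drift detector. The function name, the bin count, and the toy samples are assumptions made for the example.

```python
import math


def population_stability_index(reference, live, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a live sample of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the reference is constant

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # Floor each fraction at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    live_frac = bin_fractions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_frac, live_frac))


# Toy samples, for illustration only: the live data is squeezed into the upper range.
reference = [i / 100 for i in range(100)]
live = [0.5 + i / 200 for i in range(100)]
psi = population_stability_index(reference, live)
print(f"PSI = {psi:.3f}")
```

A PSI of zero means the two binned distributions match, and larger values mean a larger shift. The alert threshold is a judgment call that each team would need to tune on its own data.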
Having multiple deployments in multiple organizations also validates that you haven’t built a one-off custom solution. There’s nothing wrong with that, but generalizing one custom solution to a reusable software package requires a different level of expertise. Having models that generalize is required for claiming state-of-the-art applied AI.
It’s open
The third criterion for state-of-the-art applied AI is that a material portion of the codebase should be open. It doesn’t have to be freely available under a permissive license, but others should be able to inspect it independently. This is important because it shows that you, or the vendor of the solution you’re evaluating, actually built it. Many have claimed deep AI expertise while their code merely calls an existing cloud API or pre-trained model. But it’s misleading to allege you’re a computer vision expert because you can search TensorFlow Hub. There is nothing wrong with providing an easy-to-use solution for end users. Just be forthcoming about it.
Providing an open-source or open-core solution also validates whether other people are independently choosing to use it. Claiming that your solution is useful or easy to use is one thing, but getting others to stake their projects on it is another. This requires you to provide the right documentation, integration, examples, and community support.
Another advantage of making source code open is enabling others to evaluate the code and model quality. Public source code encourages a higher standard of software engineering — from unit tests and minimal dependencies to machine learning aspects of trainable, robust, and explainable models.
This level of transparency and third-party evaluation will uncover that your software is far from perfect: It only plays nice as part of certain architectures, requires tradeoffs between accuracy and speed, reuses other software packages, only scales well to a certain point, isn’t cost-effective at all scale levels, and has a few experimental features. This is all fine and expected, because all software is like that. Real state-of-the-art solutions are upfront about this.
If the AI industry wants to shake off buyers’ perception of being sold snake oil, it should stop selling it
For users, it’s essential to know what makes a solution truly state-of-the-art and to recognize which solutions are just sprinkling AI on as an afterthought. Let’s set a high bar for what great applied AI means and take the long path to achieve it.