Data, Gen AI | Wed 13th November 2024
How accurate is accurate enough? A case study in holding AI to an unrealistic standard
How accurate is accurate enough? If AI can process thousands of requests in seconds, and cite the query and data used for human verification, isn’t that a momentous step forward?
90% accuracy.
It’s a number that seems to hold almost magical significance in the human psyche.
We humans are error-prone: our biases, emotions, and vulnerabilities inject error into our judgement, and that is part of what makes us human. But are we holding machines to a higher standard because they don’t have emotions, or because they can do calculations much faster?
You might argue that, if time and space weren’t an issue and we were in control of our emotions, we’d be as good as machines.
This isn’t just about numbers. It’s about our deep-seated biases and the seemingly arbitrary standards we set. Why do we fixate on 90%? Is it our decimal-based thinking, or our need for near-perfection before we trust machines over humans?
It’s time to confront this paradox and rethink what “good enough” really means in the age of AI.
Scenario
At the 2024 Enterprise Tech Leadership Summit, I witnessed a moment that exposed a double standard in tech adoption.
One company demonstrated an AI-powered audit tool designed to process thousands of due diligence documents, create a comprehensive audit trail, and answer deeper questions about the content. The tool’s performance was impressive, boasting a 90% accuracy rate. Yet despite this, the business stakeholders were reluctant to sign off on its implementation.
This prompted me to ask a crucial question: “What percent accuracy are your human auditors achieving today?”
The response was telling. The presenters glanced at each other and paused; they didn’t have a clear answer. That moment highlighted a significant disconnect in how we evaluate AI versus human performance.
The double standard
This scenario reveals a potential double standard in how we approach AI adoption, in particular in the KPIs behind go/no-go decisions:
- Quantified AI performance: We demand precise metrics from AI systems, expecting near-perfect accuracy before we’re willing to trust them.
- Unquantified human performance: Paradoxically, we often lack concrete data on human performance in the same tasks, yet we implicitly trust human judgement.
- Unrealistic expectations: We may be setting the bar unrealistically high for AI, expecting a level of perfection we don’t demand, or even measure, in human performance.
The implications: Unintended consequences of our AI skepticism
In a previous role, I ran an AI Lab where I was tasked with bringing scientific rigor to the development of an innovative AI-powered text-to-SQL tool. The tool converted natural language questions into SQL queries, ran those queries against databases, and then used retrieval-augmented generation (RAG) to formulate human-readable answers.
Leveraging the latest advancements in language models, our tool achieved a repeatable accuracy of 82%, but it was frowned upon because it didn’t hit the expected 90–95% accuracy.
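To make the moving parts concrete, here is a minimal sketch of that kind of text-to-SQL plus RAG flow. The original tool’s stack isn’t public, so the model, prompts, schema, and database path below are placeholders for illustration, not the actual implementation.

```python
# Minimal text-to-SQL + RAG sketch (illustrative only; the production tool's
# stack, prompts, model, and schema are not public and are assumed here).
import sqlite3

from openai import OpenAI  # assumes the OpenAI Python SDK and an API key in the environment

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model choice

SCHEMA = "CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, order_date TEXT);"


def ask_llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()


def question_to_sql(question: str) -> str:
    # Step 1: natural language -> SQL, constrained to the known schema.
    return ask_llm(
        f"Schema:\n{SCHEMA}\n"
        f"Write one SQLite query that answers: {question}\n"
        "Return only the SQL, with no explanation or code fences."
    )


def answer(question: str, db_path: str = "analytics.db") -> dict:
    sql = question_to_sql(question)
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(sql).fetchall()  # Step 2: run the generated query
    # Step 3: RAG-style answer generation grounded in the retrieved rows.
    summary = ask_llm(
        f"Question: {question}\nSQL used: {sql}\nRows returned: {rows}\n"
        "Answer the question in plain English using only these rows."
    )
    # Returning the SQL and rows with the answer is what makes human verification possible.
    return {"answer": summary, "sql": sql, "rows": rows}
```

Measuring “repeatable accuracy” then amounts to running a fixed question set through a function like answer() many times and scoring the results against known-good answers.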
When I presented the tool to potential clients, the response was often lukewarm. “What about the 20% of queries the tool gets wrong?” one CIO asked. “We can’t risk making business decisions based on potentially incorrect data,” another executive chimed in.
The irony? When we dug deeper into how these companies currently handled database queries, we found that their processes were far from perfect. Many relied on a small team of data analysts who manually converted requests into SQL queries. These human experts, while skilled, were not infallible: they made errors in query construction and sometimes returned incorrect results. Moreover, the process was slow, creating bottlenecks in data-driven decision making.
Despite the AI tool’s clear advantages in speed, consistency, and auditability, the company was hesitant to adopt it, unable to get past “only” 82% accuracy.
This scenario highlights several critical implications of our AI double standard:
- Delayed innovation: The reluctance to adopt this AI system means that a potentially transformative tool for data democratization is sitting unused. How many insights are being missed, and how many decisions are being delayed while waiting for a level of perfection that even human experts can’t achieve?
- Missed opportunities for synergy: Instead of viewing AI as a replacement for human expertise, imagine the possibilities if it were used as a complementary tool. Data analysts equipped with this AI assistant could potentially achieve accuracy rates higher than either human or machine could alone, while dramatically increasing their productivity.
- Erosion of trust in AI: Each time we reject an AI system that outperforms humans, we reinforce the narrative that AI isn’t trustworthy. This creates a cycle of skepticism that can be hard to break, even as the technology continues to improve.
- Overlooked human errors: Our laser focus on the AI’s 20% error rate obscures the fact that humans are making errors at least as often, if not more. This oversight could lead to complacency about improving human performance and processes.
- Ethical dilemmas: If we know that an AI system can provide faster, more consistent, and potentially more accurate results in data analysis, is there an ethical obligation to use it, especially when decisions based on this data could have significant impacts?
So the question splits into several:
How accurate is accurate enough? Why do we hold AI to a higher standard? If it can process thousands of requests in seconds, and cite the query and data used for human verification, isn’t that a momentous step forward?
Moving forward: A balanced approach
The path forward isn’t about lowering our standards for AI. Rather, it’s about applying the same rigorous, balanced evaluation to all solutions — human or artificial. Only then can we ensure that we’re making decisions that truly serve our goals, whether in data analysis, business intelligence, or any other field touched by the promise of AI.
To address this issue, I propose the following strategies:
- Establish human baselines: Before implementing AI systems, organizations should invest in measuring and understanding the accuracy of their current human-driven processes.
- Contextual evaluation: Evaluate AI performance not in isolation, but in comparison to current human performance on the same tasks (a minimal sketch of this kind of side-by-side evaluation follows this list).
- Incremental adoption: Consider implementing AI systems alongside human workers, allowing for a gradual transition and ongoing comparison of performance.
- Continuous improvement: Focus on the potential for improvement over time rather than expecting perfection from the outset.
- Holistic assessment: Look beyond just accuracy as the single go/no-go metric for success. Consider factors like consistency, speed, scalability, and the ability to handle large volumes of data.
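As a rough illustration of the first two strategies above (human baselines and contextual evaluation), the sketch below scores AI-generated and analyst-produced answers against the same reference set. The questions, answers, and the exact-match scoring rule are invented for illustration; a real evaluation would use a representative question set and a domain-appropriate notion of correctness.

```python
# Illustrative side-by-side evaluation: score the AI tool and the current human
# process against the same reference answers before making a go/no-go call.
# The answers below are placeholders; exact-match is the simplest possible scorer.

def accuracy(predictions: dict[str, str], reference: dict[str, str]) -> float:
    """Fraction of questions whose prediction matches the reference answer."""
    correct = sum(predictions.get(q) == a for q, a in reference.items())
    return correct / len(reference)

reference = {"q1": "42", "q2": "7", "q3": "1981"}      # ground-truth answers
ai_answers = {"q1": "42", "q2": "9", "q3": "1981"}     # from the AI tool
human_answers = {"q1": "42", "q2": "7", "q3": "1990"}  # from the current manual process

print(f"AI accuracy:    {accuracy(ai_answers, reference):.0%}")
print(f"Human baseline: {accuracy(human_answers, reference):.0%}")
# A holistic assessment would also track turnaround time, cost, and consistency,
# not accuracy alone.
```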
Final thoughts
As we stand at the crossroads of human expertise and AI, our journey reveals a crucial truth: the way we evaluate and adopt AI technologies is fundamentally shaping our future.
The stories we’ve explored are likely not isolated incidents. They are symptomatic of a broader challenge we face in the AI era: our struggle to reconcile the promise of AI with our deeply ingrained trust in human judgment.
The next time you’re evaluating an AI solution, I urge you to ask these questions:
- Are we holding this technology to a fair and realistic standard?
- Have we accurately assessed our current human-driven processes?
- Could this AI, despite its imperfections, significantly improve our outcomes?
- Are we missing opportunities for innovative human-AI collaboration?
In the end, the true measure of our success in the AI era will not be in how well we’ve preserved the status quo, but in how boldly and wisely we’ve embraced the potential of human and AI working in concert.
The future is not about choosing between humans and AI; it’s about reimagining what’s possible when we leverage the strengths of both.