Data, Gen AI | Thu 1st August, 2024
How businesses can move GenAI projects from proof-of-concept to production
We know many organisations are struggling to get GenAI projects beyond a proof of concept (PoC) and into production. Forbes confirms this, reporting that approximately 90% of GenAI projects will not move into production in the near future, and some never will.
Organisations are struggling with questions like:
- How do we harness the power of LLMs to create value?
- How do we ensure accuracy and repeatability in systems that are purposefully non-deterministic?
- How do we enable good engineering practice?
- How do we build cost-effective systems with GenAI models?
The power of foundation models is remarkable. GPT-4, Claude 3 and Gemini Ultra accept enough context that a model can learn a previously unseen language from material supplied in the prompt and perform accurate translation. Multimodal models can not only generate sound, images and video, but also understand them as input. There seems to be no end to their capabilities.
But moving beyond the playground of making requests and generating seemingly magical responses, delivering real business value is difficult. Enterprise adoption of AI requires transparency, consistency, security and cost management – things that aren’t easy once we move beyond initial proof of concept.
So, the question remains: how do we move from promising AI proofs of concept that demonstrate the art of the possible to robust, repeatable and enterprise-ready services that can realise the potential of improved automation and reduced operating costs?
The challenges of simplistic, monolithic AI
First off, we need a shared understanding of the challenges of moving from proof of concept to an operational, enterprise-ready AI system.
We see a lot of organisations trying to leverage LLMs through a single request-response cycle: in other words, developing one monolithic prompt to run against an all-knowing foundation model, and expecting the LLM to return a correct answer every time without any oversight. This approach is tempting because LLMs make building a PoC so simple. When the time comes to move that PoC to production, there is a natural tendency to keep developing it further rather than stepping back and asking how to build a repeatable, enterprise-ready version of the system. The behaviour is understandable given the accessibility of impressively creative models like ChatGPT, and a seductive chat interface that exudes confidence.
For example, one organisation we’re working with is attempting to use LLMs to summarise customer support calls. A transcript of the call is created and then sent to GPT-4 to summarise in a single prompt request. However, creating a summary that is accurate, repeatable and reliable, and whose output genuinely reduces operating costs, is a complex task. To achieve the goals of the business, we need to make decisions such as whether the summary should be abstractive (i.e. generate a more concise version of the call using different language) or extractive (i.e. take the most relevant content from the original source as-is). We might also want to generate a list of next actions as part of the summary, a recap of the specific issues from the call, or a categorisation of the nature of the call. For each of these use cases, we then need to verify the summary against the original transcript, so that the response can be linked back to the source material and we can be confident there are no hallucinations in the summary.
Once you start to understand the complexity of creating a genuinely useful summary within the context of the organisation, you can see how it may require multiple steps, and therefore multiple interactions with the model. All of that complexity could, in principle, be engineered into a single request-response interaction: each requirement specified in one prompt, followed by another set of instructions to verify the response. But as with any complex set of tasks, trying to manage this as one monolithic system creates unnecessary risk and complexity.
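To make the contrast concrete, here is a minimal sketch of that monolithic approach. The call_llm helper and the prompt wording are hypothetical stand-ins for whichever model API and instructions an organisation actually uses; nothing here is prescriptive.

```python
# Hypothetical stand-in for any chat-completion API call; wire this to your
# model provider before running.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("connect to a model provider here")


def summarise_call_monolithic(transcript: str) -> str:
    """One prompt that tries to satisfy every requirement at once."""
    prompt = (
        "You will be given a customer support call transcript.\n"
        "1. Produce a concise abstractive summary.\n"
        "2. List the specific issues raised on the call.\n"
        "3. Categorise the nature of the call.\n"
        "4. List the next actions for the support team.\n"
        "5. Check every statement above against the transcript and remove "
        "anything that cannot be traced back to it.\n\n"
        f"Transcript:\n{transcript}"
    )
    return call_llm(prompt)
```

Every requirement lives in one string, so changing any single instruction means retesting all of them end to end.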
Developing a single prompt to meet all of those requirements would pose challenges such as:
- Side effects in your prompts. Developing the prompt becomes a complex end-to-end process to test. Any change to the prompt could have far-reaching effects across all of the requirements and introduce side effects in previously satisfied ones.
- Big bang releases. A single prompt is an all-or-nothing approach. The prompt can’t be used in production until the whole of it is fully defined and every element is tested, with no side effects or hallucinations.
- Highly risky to add new features. Imagine that, in the customer support case above, the organisation wanted to determine whether the tone of the customer service agent was respectful and helpful. With a single-prompt approach, this new requirement would have to be folded into the existing prompt, and the whole prompt retested to ensure no side effects creep into the previous instructions.
- More end-to-end testing if the underlying model changes. If the underlying model changes, your prompt may start returning different results. Again, you would have to conduct more end-to-end testing and try to identify which part of the prompt is behaving differently with the new model.
- Scaling execution. What if one part of the prompt takes a particularly long time to run or is more complex than other parts? This instruction cannot be isolated and scaled independently.
So, what’s the solution? How do we wrest control back from the monolithic model to focus on architectures and engineering best practices for long-term sustainability?
The potential of Compound AI systems over simplistic, monolithic AI
We’re helping organisations develop AI pipelines that come together into dynamic AI systems, giving us the ability to call non-deterministic models in a repeatable way. One such approach is Compound AI: a system-based approach to building applications whose principles align closely with our experience of delivering software reliably. We believe some well-known software engineering principles can increase our chances of success when building compound AI systems:
- Loose coupling: narrowing the focus of a prompt or agent to improve the repeatability of its results. This makes issues easier to debug, allows independent scaling, lets security considerations be addressed in isolation, and makes verification and hallucination checking much simpler.
- Test automation: developing techniques to make testing of prompts and models repeatable and automated (for example, we’re working on approaches that use Behaviour Driven Development to test LLM agents). Both principles are illustrated in the sketch after this list.
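Here is a minimal sketch of the same summarisation task decomposed into narrow, loosely coupled steps, each of which can be tested on its own. As before, call_llm is a hypothetical stand-in for any model API, and the prompts and category labels are illustrative only.

```python
# Hypothetical stand-in for any chat-completion API call.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("connect to a model provider here")


def summarise(transcript: str) -> str:
    """Narrow responsibility: a concise abstractive summary, nothing else."""
    return call_llm(f"Write a concise abstractive summary of this call:\n{transcript}")


def extract_issues(transcript: str) -> str:
    return call_llm(f"List only the specific issues the customer raised:\n{transcript}")


def categorise(summary: str) -> str:
    return call_llm(
        "Categorise this call as exactly one of: billing, technical, complaint, other.\n"
        f"{summary}"
    )


def verify(summary: str, transcript: str) -> str:
    """Verification is its own step, so hallucination checking stays simple."""
    return call_llm(
        "Answer YES or NO: is every claim in the summary supported by the transcript?\n"
        f"Summary:\n{summary}\n\nTranscript:\n{transcript}"
    )


def process_call(transcript: str) -> dict:
    """Each step can be debugged, scaled, secured and re-validated in isolation."""
    summary = summarise(transcript)
    return {
        "summary": summary,
        "issues": extract_issues(transcript),
        "category": categorise(summary),
        "verified": verify(summary, transcript),
    }


# Because each step is isolated, it can be covered by a small automated test,
# e.g. with pytest:
def test_categorise_returns_a_known_label():
    label = categorise("Customer could not log in to the billing portal.")
    assert label.strip().lower() in {"billing", "technical", "complaint", "other"}
```

Each function owns exactly one requirement, so a change to the categorisation prompt, for instance, cannot introduce side effects into the summary or the verification step.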
The benefits of compound AI systems
Breaking prompts down into smaller, discrete units that follow good engineering principles creates a number of benefits.
Greater transparency
Compound AI systems give organisations greater visibility into the decision-making of the algorithms. The inputs and outputs of each step in the system can be captured, increasing traceability: we can understand why certain decisions were made and identify which parts of the system were responsible for them.
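As a simple illustration of how that capture might work, the sketch below wraps each pipeline step so its inputs and outputs are appended to an audit log. The log format and truncation limits are arbitrary choices, not a prescription.

```python
import json
import time
import uuid


def traced(step_fn):
    """Record each step's name, inputs and output so that every decision in the
    pipeline can be traced back to the step that produced it."""
    def wrapper(*args, **kwargs):
        record = {
            "trace_id": str(uuid.uuid4()),
            "step": step_fn.__name__,
            "inputs": [str(a)[:500] for a in args],  # truncated to keep logs small
            "started_at": time.time(),
        }
        output = step_fn(*args, **kwargs)
        record["output"] = str(output)[:500]
        with open("trace.jsonl", "a") as log:  # append-only audit log
            log.write(json.dumps(record) + "\n")
        return output
    return wrapper


@traced
def summarise(transcript: str) -> str:
    ...  # the summarisation step from the earlier sketch
```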
Greater reliability
When the underlying LLM changes, we can isolate any new behaviour to an individual step, meaning we know exactly which part of the process needs to be investigated, reviewed and re-validated.
Scalability
If certain steps of the process require more processing power, they can be scaled independently, provided each part of the system can be executed independently. We can also apply different scaling approaches, such as running some parts of the system at a slower pace where the results are less time-sensitive. A brief sketch of this follows.
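For instance, building on the hypothetical pipeline sketched earlier, the latency-sensitive summarisation step and the less urgent verification step could be given separately sized worker pools. The pool sizes here are placeholders, and summarise and verify are the steps from the earlier sketch.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sizing: summarisation is latency-sensitive, verification is not,
# so each step gets its own pool and can be scaled, or deferred, independently.
fast_pool = ThreadPoolExecutor(max_workers=8)
slow_pool = ThreadPoolExecutor(max_workers=2)


def handle_call(transcript: str) -> str:
    summary = fast_pool.submit(summarise, transcript).result()  # needed immediately
    slow_pool.submit(verify, summary, transcript)  # result can arrive later
    return summary
```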
Extensibility
If the output of each stage is persisted, additional features can be added later as new use cases emerge. Taking our example of summarising call transcripts: if we decide to add “next best action” suggestions, we simply design one new prompt with the required inputs. It can be tested in isolation and, when ready to deploy, run across any relevant previous transcripts as well as put live for future calls. With one monolithic prompt, we would need to retest all of our scenarios every time we made a small change to introduce a new feature. A minimal sketch of adding such a step follows.
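Continuing the hypothetical pipeline, the new feature is one isolated function plus a backfill over persisted outputs. call_llm is the stand-in from the earlier sketches, and load_summaries and save_record are hypothetical accessors for whatever store holds the pipeline’s persisted outputs.

```python
def suggest_next_best_action(summary: str) -> str:
    """A new, isolated step: nothing in the existing pipeline has to change."""
    return call_llm(f"Suggest the single next best action for this call:\n{summary}")


def backfill_next_best_actions() -> None:
    """Run the new step across persisted outputs from earlier calls, then enable
    it for future calls. load_summaries and save_record are hypothetical
    accessors for the store the pipeline persists its outputs in."""
    for record in load_summaries():
        record["next_best_action"] = suggest_next_best_action(record["summary"])
        save_record(record)
```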
Final thoughts
GenAI is incredibly intuitive to use, and demonstrating its potential is simple. Producing transparent, reliable and responsible production solutions is much harder. Repeatability in non-deterministic systems is always going to be a challenge, but we believe that by building on our software engineering heritage we can deliver high-quality AI systems that are suitable for deployment in enterprise environments.
Compound AI systems have discrete steps, but still benefit from the embedded knowledge of LLMs. Fine-tuning smaller models in combination with agent frameworks can build complex behaviours from smaller, individually scalable components.
But the major factor in success is that AI should not be a high-profile solution looking for a problem. Instead, look for the tedious and mundane, but necessary, activities that are better suited to automation with a sprinkling of human-equivalent intelligence.