Pedro Sousa, Data Engineer, Equal Experts Portugal

Data, Gen AI, Our Thinking, Tech Focus · Thu 11th April, 2024

How to evaluate conversational assistants powered by LLMs

Why is it hard to test conversational assistants, and how can we measure their quality?

Conversational assistants are increasingly popular, finding applications in sectors such as customer support and personal services. However, evaluating their performance presents a significant challenge, raising the question of how to ensure that they meet their objectives.

This blog post explores the challenges of testing conversational assistants and proposes a solution using behaviour-driven development (BDD) techniques. We will discuss the complexities of testing conversational assistants, the importance of simulating conversations, and how BDD can help address these challenges.

Why is it hard to test conversational assistants?

Conversational assistants, powered by LLMs, are complex systems that require a different approach to testing than conventional software. These systems engage in open-ended dialogues with users, making it challenging to assess their performance using conventional testing techniques.

Conversational assistants are inherently non-deterministic, producing varying outputs from the same input. This is due to the probabilistic nature of LLMs, which sample each response from a probability distribution rather than computing a single fixed answer. As a result, testing conversational assistants requires a nuanced approach that accounts for this non-determinism. In fact, even the tests themselves must incorporate a level of non-determinism to accurately assess the variability of conversational responses.

Writing traditional test cases for conversational assistants is challenging due to the dynamic nature of dialogue. The assistant’s responses depend on the context of the conversation, making it difficult to predict the exact output. Creating predetermined scripts for the user is not feasible, as the assistant’s responses may differ even if we set the LLM’s temperature to zero. We need a way to test the assistant’s performance without relying on fixed scripts.
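To make this concrete, here is a minimal sketch in Python of a test that tolerates non-determinism. The `ask_assistant` function is a hypothetical stand-in for your own assistant client; the idea is to sample several replies and assert on a property they should all share, rather than on exact wording.

```python
import re


def ask_assistant(message: str) -> str:
    """Hypothetical stand-in for a call to the real assistant.

    Replace this with your own client; it returns a canned reply here so
    the sketch runs end to end.
    """
    return "Sure, I can help you book a room. For which dates?"


def acknowledges_booking(reply: str) -> bool:
    # Property-style check: the reply should acknowledge the booking request,
    # whatever the exact phrasing turns out to be.
    return bool(re.search(r"\bbook", reply, re.IGNORECASE))


def test_booking_request_is_acknowledged():
    # Sample several replies to account for the assistant's non-determinism.
    replies = [ask_assistant("I'd like to book a room for two nights") for _ in range(5)]
    assert all(acknowledges_booking(r) for r in replies)
```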

The necessity of simulating conversations

To effectively test conversational assistants, it is necessary to simulate conversations between users and assistants. This dynamic simulation should account for variations in user input and assistant responses. By simulating conversations, we can accurately assess the assistant’s performance in a controlled environment, evaluating its ability to comprehend user requests and deliver precise responses.

One effective method for simulating conversations involves utilising LLMs (who would have thought?) to generate user messages in response to the assistant’s replies. With their ability to produce human-like text, LLMs are well-suited to playing the role of the user. By harnessing the power of LLMs, we can produce authentic conversations that thoroughly exercise the assistant’s behaviour in various scenarios.
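As an illustration, the sketch below drives a conversation by letting one LLM play the user. The `simulated_user` and `assistant` callables are assumptions standing in for whichever LLM provider and assistant under test you use, and the prompt wording is likewise only illustrative.

```python
from typing import Callable


def simulate_conversation(
    simulated_user: Callable[[str], str],   # LLM prompted to play the user
    assistant: Callable[[list[str]], str],  # the assistant under test
    goal: str,
    max_turns: int = 6,
) -> list[str]:
    """Drive a conversation by letting one LLM play the user.

    `simulated_user` receives a prompt describing the goal and the dialogue
    so far, and returns the next user message; `assistant` receives the
    dialogue history and returns its reply.
    """
    history: list[str] = []
    for _ in range(max_turns):
        prompt = (
            f"You are a user talking to a booking assistant. Your goal: {goal}\n"
            "Conversation so far:\n" + "\n".join(history) + "\n"
            "Write only the user's next message."
        )
        history.append("User: " + simulated_user(prompt))
        history.append("Assistant: " + assistant(history))
    return history
```

Passing the simulated user and the assistant in as callables keeps the simulation loop decoupled from any particular LLM provider or assistant implementation.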

Users have different personalities and communication styles. Some users may be more verbose, while others may be more concise. Defining user personas lets us simulate diverse user profiles and evaluate the assistant’s ability to engage with different types of users, ensuring that it can handle a wide range of interactions.
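One lightweight way to capture personas is as prompt fragments injected into the simulated user’s instructions; the persona descriptions below are purely illustrative.

```python
# Illustrative persona definitions. Each persona is a prompt fragment that
# shapes how the LLM playing the user writes its messages.
PERSONAS = {
    "concise": "You answer in as few words as possible and never volunteer extra detail.",
    "verbose": "You write long, chatty messages and often include irrelevant details.",
    "impatient": "You are in a hurry, use abrupt phrasing and push for quick answers.",
}


def user_prompt(persona: str, goal: str, history: list[str]) -> str:
    # Combine the persona with the user's goal and the dialogue so far.
    return (
        f"{PERSONAS[persona]}\n"
        f"Your goal: {goal}\n"
        "Conversation so far:\n" + "\n".join(history) + "\n"
        "Write only the user's next message."
    )
```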

How to measure assistant quality

When evaluating conversational assistants, we need to consider various aspects of their performance. These aspects include:

  • Tool interactions: Verifying that the assistant correctly interacts with tools to fulfil user requests, such as booking a room in a hotel or ordering food.
  • Conversational quality: Assessing the assistant’s ability to engage in dialogue effectively, maintain context, and provide relevant responses.

Assessing conversational quality requires an evaluation of how well the assistant maintains context throughout a conversation. This involves tracking the dialogue history, understanding user intent, and providing appropriate responses. Our best option is to use an LLM (again) to analyse the conversation and provide feedback on the assistant’s performance.
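A minimal sketch of such an LLM-as-judge step is shown below. The `complete` callable is assumed to wrap your LLM provider, and the rubric and the PASS/FAIL output format are illustrative choices rather than a fixed standard.

```python
from typing import Callable

JUDGE_PROMPT = """You are reviewing a conversation between a user and an assistant.

Conversation:
{conversation}

Did the assistant stay on topic, keep track of the context of the conversation,
and respond appropriately to the user's requests? Reply with PASS or FAIL,
followed by a short explanation.
"""


def judge_conversation(complete: Callable[[str], str], history: list[str]) -> tuple[bool, str]:
    """Ask an LLM to grade the conversation; returns (passed, explanation)."""
    verdict = complete(JUDGE_PROMPT.format(conversation="\n".join(history)))
    return verdict.strip().upper().startswith("PASS"), verdict
```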

For tool interactions, we need to verify that the assistant correctly triggers the tools to fulfil user requests. We can achieve this by creating mock tools that simulate the functionality of the real tools and verifying that the assistant interacts with them correctly.
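For example, the sketch below mocks a hypothetical `book_room` tool and asserts on the recorded calls rather than on the assistant’s wording; `run_conversation` is an assumed test fixture that wires the mock tool into the assistant and plays a simulated user against it.

```python
class MockBookingTool:
    """Stands in for the real booking system and records every call."""

    def __init__(self) -> None:
        self.calls: list[dict] = []

    def book_room(self, hotel: str, nights: int) -> str:
        self.calls.append({"hotel": hotel, "nights": nights})
        return f"Booked {nights} night(s) at {hotel}."  # canned confirmation


def test_assistant_triggers_the_booking_tool(run_conversation):
    # `run_conversation` is a hypothetical fixture that wires the mock tool
    # into the assistant and plays a simulated user against it.
    tool = MockBookingTool()
    run_conversation(tools=[tool], goal="Book two nights at the Grand Hotel")
    # The assertion is on the tool call, not on the assistant's wording.
    assert tool.calls == [{"hotel": "Grand Hotel", "nights": 2}]
```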

Exploring the use of behaviour-driven development style tests for conversational assistants

Behaviour-driven development (BDD) is a software development approach that focuses on defining the behaviour of a system from the user’s perspective. BDD encourages collaboration between technical and non-technical stakeholders, ensuring that everyone has a shared understanding of the system’s behaviour.

When applied to conversational assistants, BDD can help define the expected behaviour of the assistant in various scenarios. By writing BDD-style tests, we can specify the assistant’s interactions with users, the context of the conversation, and the expected responses. This approach provides a clear and structured way to evaluate the assistant’s performance and ensure that it meets its objectives.
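To illustrate, the sketch below expresses one such scenario as a Given/When/Then-structured Python test that reuses the ingredients above (a user persona, a mock tool and an LLM judge). The `run_conversation` and `judge` fixtures are hypothetical, and the same scenario could equally be written in Gherkin with a BDD framework.

```python
def test_concise_user_can_book_a_room(run_conversation, judge):
    # Given a concise user who wants to book two nights at the Grand Hotel
    tool = MockBookingTool()
    goal = "Book two nights at the Grand Hotel"

    # When the simulated user converses with the assistant
    history = run_conversation(persona="concise", goal=goal, tools=[tool])

    # Then the booking tool is called with the expected arguments
    assert tool.calls == [{"hotel": "Grand Hotel", "nights": 2}]

    # And an LLM judge considers the conversation coherent and helpful
    passed, explanation = judge(history)
    assert passed, explanation
```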