Testing LLM Conversations with AutoChain
Testing is an important part of software development, but often writing tests can be an afterthought. Project stakeholders may prioritize the more visible features of an application. Developers too may want to move on to the next interesting task rather than documenting what they already understand and have working.
Good tests are important to a project for several reasons. For planning, test-driven development can help developers anticipate the needs of particular functions. Tests can also ensure that the application continues to work as expected as development continues. Further, a well-written test set can help team members understand each other’s code. And of course… tests help you remember the code you wrote the next time you come back to it. 😉
But testing is often hard to do in chat applications. Many low-code platforms don't have automated testing, so developers need to manually message the bot to make sure flows haven't broken. More technical platforms may provide some code linting, but few offer adequate testing capabilities (Rasa, to its credit, is one platform that does).
Beyond these general benefits, testing matters for LLM projects because it helps ensure they provide accurate information. Consider this manual test of ChatGPT's output for Korean vocabulary.
# GPT Output
>>> What does ㅁㅊ mean?
>>> When combined, "ㅁㅊ" is often used as an abbreviation or slang term for a Korean phrase, "말초" (malcho). "말초" is a shortened form of "말레이시아 초콜릿" (Mallaysia chokollit), which means "Malaysian chocolate" in Korean.
# KakaoTalk Chat with Friend
...
Friend: ㅁㅊ
Me: 먹고 싶어요. 같이 먹자! (I want to eat it. Let's eat it together!)
Friend: ?
"ㅁㅊ" is an abbreviation for "미친", which expresses a sense of frustration or commiseration. It’s kind of like saying “This is driving me crazy.” While a more pleasant-sounding “word”, "말초" is not something anyone I know has heard or used before. It certainly confused all the people I tried to use it with. Manual testing is not ideal…
But testing generative chats just doesn’t get much attention in the writing about LLMs.
AutoChain can help developers write tests for their projects that use generative chat.
For this tutorial, we’ll continue to consider the case of an application to help Korean language learners with vocabulary acquisition. Let’s build out our test cases.
A student might ask about a Korean word. (e.g., What does 소소하다 mean?)
A student might not know a particular Korean word, so they might ask for the translation of an English word. (e.g., What is the Korean word for ephemeral?)
A student might ask for a word that we've already saved in the database because we've looked it up before. (e.g., What does 고양이 mean? // Database: 고양이 = Cat)
A student might ask about or say something irrelevant. (e.g., I hate studying verb conjugation.)
Can you think of any other cases that might be relevant? You can probably come up with a few more that would make the application more robust and educational; as you read through the article, think about how you might implement them. For this article, though, we'll work with these four.
Ideally, our application would see if a word is available in the database before spending time and money on an API call. While our test function could actually call a database (like Supabase), we’ll just simulate a database call here using some pre-formatted responses. If the word is not in the “database”, we’ll return an error to let the application know the word is missing.
def get_word_info_from_database(word):
""" Simulates a database call with predetermined information about vocabulary """
if word == "습기가":
return {
"definition": "습기가 (습기, 습하다) is a Korean term that refers to the condition of being humid or damp. It describes an atmosphere or environment with high moisture content.",
"synonym": "물기가 많은",
"antonym": "건조한",
"example": "오늘은 습기가 많아서 땀이 많이 나요. (Today, it's very humid, and I'm sweating a lot.)"
}
elif word == "열대야":
return {
"definition": "열대야 (열대 + 야) is a Korean term that refers to a tropical night, which is a meteorological phenomenon where the nighttime temperature remains consistently high due to the warm and humid climate of tropical regions.",
"synonym": "열대지방의 밤",
"antonym": "한랭",
"example": "오늘은 열대야로 인해 잠을 제대로 못 자요. (I couldn't sleep well tonight due to the tropical night.)"
}
else:
return {
"status_code": 400,
"message": "Word not found"
}
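As a quick sanity check, we can confirm the simulated database behaves as expected for both a hit and a miss (a throwaway snippet, not part of the application itself):

# Known word: returns the stored vocabulary entry
print(get_word_info_from_database("습기가")["synonym"])
# -> 물기가 많은

# Unknown word: returns the error object so the caller knows to fall back to the LLM
print(get_word_info_from_database("고양이"))
# -> {'status_code': 400, 'message': 'Word not found'}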
Our other cases involve an LLM, so we'll write a function that calls one. In this case, we'll use OpenAI's API and text-davinci-003. While AutoChain supports other LLMs, the code in this tutorial is geared specifically toward OpenAI models.
import json
import openai
def make_llm_call(prompt):
    try:
        response = openai.Completion.create(
            model='text-davinci-003',
            prompt=prompt,
            temperature=0,
            max_tokens=200
        )
        return json.loads(response['choices'][0]['text'])
    except (openai.error.OpenAIError, json.JSONDecodeError, KeyError):
        # Catch specific failures (API errors, malformed JSON) rather than using
        # a bare except, which would also swallow things like KeyboardInterrupt
        return {
            "status_code": 400,
            "message": "Word not found. Please try asking again later."
        }
Now let’s build a function to look up information about a Korean word.
def lookup_korean_word_info_from_llm(word):
prompt = f"""The assistant helps people learn Korean words they encountered in class or daily life. The assistant provides information about a single Korean word in JSON format. The JSON object should include the keys 'definition', 'synonym', 'antonym', and 'example'. The definition should always be in English, but the other examples should be in Korean (with English translation in parentheses). The assistant should provide a synonym, an antonym, and an example all written in Korean. All information about the word should be concise and simple for language learners.
User: {word}
AI: """
return make_llm_call(prompt)
And we’ll do the same for an English word.
def english_to_korean_word_info_from_llm(word):
prompt = f"""The assistant helps people learn Korean words that they do not know. A user will request a word in English. The assistant will translate the word and provide information about the corresponding Korean word to the user.
The assistant provides information about a single Korean word in JSON format. The JSON object should include the keys 'definition', 'synonym', 'antonym', and 'example'. The definition should always be in English, but the other examples should be in Korean (with English translation in parentheses). The assistant should provide a synonym, an antonym, and an example all written in Korean. All information about the word should be concise and simple for language learners.
User: {word}
AI: """
return make_llm_call(prompt)
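Here's a rough sketch of how these might be exercised; the actual contents of each response will vary from run to run, since they come from the model:

# Korean word -> information about that word
soso = lookup_korean_word_info_from_llm("소소하다")

# English word -> the Korean translation plus information about it
ephemeral = english_to_korean_word_info_from_llm("ephemeral")

# On success, each returns a dict with 'definition', 'synonym', 'antonym',
# and 'example'; on failure, the error object from make_llm_call
print(soso.get("definition", soso.get("message")))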
With this foundation set up, we're ready to work with AutoChain. First, we'll install it into our environment.
pip install autochain
You'll also need to load your OpenAI API key from your .env file.
from dotenv import load_dotenv
load_dotenv()
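If the key isn't picked up automatically (the openai library reads OPENAI_API_KEY from the environment when it is imported), you can set it explicitly. A minimal sketch, assuming your .env contains a line like OPENAI_API_KEY=...:

import os
import openai
from dotenv import load_dotenv

load_dotenv()
# Explicit assignment avoids depending on import order
openai.api_key = os.getenv("OPENAI_API_KEY")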
The AutoChain test class facilitates testing by instantiating two chat agents to hold a multi-turn conversation. The first agent is the one we want to test: the assistant. The second is a simulated user. The interactions between these agents let us test how a normal conversation might go, without the biases we bring to manual tests as developers with full context. The simulated user might ask off-topic questions, say things in an unexpected way, or do exactly what we expect. Either way, these interactions show us how well our bot responds and whether it responds the way we hope. The AutoChain test class includes four things: a prompt, tools (if any), test cases, and a chat agent.
from autochain.workflows_evaluation.base_test import BaseTest
class TestWordLookUp(BaseTest):
# Prompt
# Tools
# Test Cases
# Instantiate Agent
The prompt tells the chat agent what to do, much like the other prompts we've written. This prompt isn't specific to a "test agent". It should just be the standard prompt that tells the bot what to do, which in this case is helping students learn Korean vocabulary.
from autochain.workflows_evaluation.base_test import BaseTest
class TestWordLookUp(BaseTest):
prompt = """The assistant helps people learn Korean vocabulary. It provides students with the definition, synonym, antonym, and an example sentence in JSON format. The assistant can only provide information about vocabulary, not any other topic. The information it provides should be concise. If the user asks about some other topic, the assistant will ask the student for a Korean word they want to know about.
Examples
User: I think studying Korean is hard.
AI: It can be pretty hard, can't it? I can help you learn vocabulary. What word would you like to learn about?
User: Could you explain more about verb conjugation to me?
AI: Verb conjugation is so important. Unfortunately, I can only help you with Korean vocabulary. What kind of word would you like to know?
"""
Tools are support functions that help the agent do things like search the database or get information from an API. We wrote three of these earlier; two of them simply wrap prompts for more specific, limited tasks. We also write descriptions for these tools to help our agent select the most appropriate one to handle a user's request.
from autochain.tools.base import Tool
from autochain.workflows_evaluation.base_test import BaseTest
class TestWordLookUp(BaseTest):
# prompt is here...
tools = [
Tool(
func=get_word_info_from_database,
description="Retrieves information about a Korean word that has already been looked up previously. Input args: word: str"
),
Tool(
func=english_to_korean_word_info_from_llm,
description="Finds the Korean translation of a given English word and provides information about the Korean word. Input args: word: str"
),
Tool(
func=lookup_korean_word_info_from_llm,
description="Retrieves information about a Korean word that was not found in the dictionary. Input args: word: str"
        )
    ]
Test cases are descriptions of the different kinds of tests we want to run. As with the tools, we need to provide descriptive information about the test cases. This is because there is…a third agent! This agent evaluates how helpful the exchange between the assistant and the user has been, based on the "expected_outcome" and how close the conversation came to it. Each test may look like it represents only a single conversational turn, but the simulated user uses the "user_context" as a guide for subsequent exchanges. Each test case will run through several of these exchanges, which is great because you can see whether the assistant goes off the rails within the context of specific cases.
from autochain.tools.base import Tool
from autochain.workflows_evaluation.base_test import BaseTest, TestCase
class TestWordLookUp(BaseTest):
# prompt is here...
# tools are here...
test_cases = [
TestCase(
test_name="look up existing word",
user_context="What does 습기가 mean?",
expected_outcome="습기가 (습기, 습하다) is a Korean term that refers to the condition of being humid or damp. It describes an atmosphere or environment with high moisture content."
),
TestCase(
test_name="look up existing word 2",
user_context="What does 열대야 mean?",
expected_outcome="열대야 (열대 + 야) is a Korean term that refers to a tropical night, which is a meteorological phenomenon where the nighttime temperature remains consistently high due to the warm and humid climate of tropical regions."
),
TestCase(
test_name="ask about the Korean of an English word",
user_context="What is the Korean word for 'book'?",
expected_outcome="The Korean word for book is '책' (chaek). It refers to a collection of written or printed pages bound together, typically containing text and sometimes illustrations."
),
TestCase(
test_name="say something irrelevant",
user_context="Studying Korean is so hard.",
expected_outcome="I'm sorry, my focus is to help you study Korean. Could you please tell me a word you'd like to learn more about?"
),
TestCase(
test_name="look up new word",
user_context="What does 푹푹지다 mean?",
expected_outcome="푹푹지다 is a Korean term that describes the act of sweating profusely or sweating a lot. It refers to the state of perspiring heavily due to physical activity, hot weather, or other reasons."
)
]
Finally, we need to instantiate our Korean language learning bot.
from autochain.agent.openai_functions_agent.openai_functions_agent import OpenAIFunctionsAgent
from autochain.models.chat_openai import ChatOpenAI
from autochain.tools.base import Tool
from autochain.workflows_evaluation.base_test import BaseTest, TestCase
from autochain.workflows_evaluation.test_utils import create_chain_from_test
class TestWordLookUp(BaseTest):
# prompt is here...
# tools are here...
# test cases are here...
llm = ChatOpenAI(temperature=0)
chain = create_chain_from_test(
tools=tools,
agent_cls=OpenAIFunctionsAgent,
llm=llm,
prompt=prompt
)
With that, our test suite for this task is complete. Now, we’ll need to set up our script to run the test when we call it from the command line. Let’s walk through this a bit because it’s doing more than just running our test.
from autochain.utils import get_args
from autochain.workflows_evaluation.base_test import BaseTest, TestCase, WorkflowTester
if __name__ == "__main__":
tester = WorkflowTester(
tests=[TestWordLookUp()], output_dir="./test_results"
)
args = get_args()
if args.interact:
tester.run_interactive()
else:
tester.run_all_tests()
AutoChain will save the results of the tests to the directory set in “output_dir”. All tests will be saved in one JSONL file, in this case “TestWordLookUp.jsonl”. Additionally, we can run the tests automatically or interactively.
# Automatic
python vocab_test.py
# Interactive
python vocab_test.py -i
Let's start with an interactive chat. I'll just run one turn. After asking about the word "안영", AutoChain outputs what the assistant is trying to do. We can see the agent deciding which tool to use. In this case, it has correctly extracted the Korean word "안영" from the user query, but it is incorrectly calling the "english_to_korean_word_info_from_llm" tool. Somehow, though, it has come up with the right output. In our real application, we would probably want to save this JSON object to our database.
The agent does a good job of turning that JSON object into a human-readable message. This visibility into the agent's process is really helpful for understanding where problems might arise in our application. If we only looked at the input and output, we wouldn't realize that the bot seems to be getting the right answer the wrong way, and we might have a hard time seeing that the database check is being bypassed. We'd want to look into both issues to see what's going on.
>> User: What does 안영 mean?
Planning

message content: 'hum..'; function_call: {'name': 'english_to_korean_word_info_from_llm', 'arguments': '{\n  "input": "안영"\n}'}

Plan to take action 'english_to_korean_word_info_from_llm'
{
"id": "cmpl-8AFHa89fa8aAf89hfa8a99a",
"object": "text_completion",
"model": "text-davinci-003",
"choices": [
{
"text": "\n {\n \"word\": \"\uc548\uc601\",\n \"definition\": \"Greeting; hello\",\n \"synonym\": \"\uc548\ub155 (Hello)\",\n \"antonym\": \"\uc798\uac00 (Goodbye)\",\n \"example\": \"\uc548\uc601, \uce5c\uad6c\ub4e4! (Hello, friends!)\"\n }",
"index": 0,
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 159,
"completion_tokens": 93,
"total_tokens": 252
}
}
{'word': '안영', 'definition': 'Greeting; hello', 'synonym': '안녕 (Hello)', 'antonym': '잘가 (Goodbye)', 'example': '안영, 친구들! (Hello, friends!)'}
Took action 'english_to_korean_word_info_from_llm' with inputs '{'input': '안영'}', and the tool_output is {'word': '안영', 'definition': 'Greeting; hello', 'synonym': '안녕 (Hello)', 'antonym': '잘가 (Goodbye)', 'example': '안영, 친구들! (Hello, friends!)'}
Planning

Planning output:
message content: '안영 means "Greeting; hello". It is a synonym of 안녕 (Hello) and an antonym of 잘가 (Goodbye). An example sentence using 안영 is "안영, 친구들!" which means "Hello, friends!"'; function_call: {}

>> Assistant: 안영 means "Greeting; hello". It is a synonym of 안녕 (Hello) and an antonym of 잘가 (Goodbye). An example sentence using 안영 is "안영, 친구들!" which means "Hello, friends!"
>> User:
Now let's do the automated test. When you run it, you'll be able to watch the assistant and simulated user chat with each other. I'm not going to include the full result because it's so long. In the last test, the bot still used the wrong tool, and while it produced a confident, well-formed answer, we'll see below that the content was wrong. On follow-ups from the simulated user, the bot continued to give helpful-sounding information.
Now let's look at the evaluation results (TestWordLookUp.jsonl) to see how the evaluation agent rated the exchange. It rated the performance as low. Our expected outcome said that 푹푹지다 meant "sweating profusely or sweating a lot", but that is not what the agent returned. The assistant said that it meant "soft and comfortable". In other words, the bot gave misinformation.
"is_agent_helpful":{"rating":1,"reason":"No, the conversation does not reach the expected outcome for the user. The assistant provides examples of how to use \ud479\ud479\uc9c0\ub2e4 in the context of being soft and gentle, but it does not mention the meaning of sweating profusely or sweating a lot."}
So our test outcomes show two things that we should look into - 1) how to ensure our bot is using the right tool at the right time and 2) how to ensure the bot gives accurate information. There’s a lot to consider…
Are these problems related?
Could a more powerful model (e.g., gpt-3.5-turbo or gpt-4) get better results? (See the sketch after this list.)
Would we need to use a model more specifically geared towards Korean, and would that impact the level of English capability our bot requires?
Would more examples improve the bot’s accuracy?
Should we give directions about how to use the tools or the steps to take first?
Would improving the tool descriptions help the agent make better decisions?
Do we actually need separate functions for Korean to Korean or English to Korean word information?
What should we do in the case of words that have multiple meanings? How would we test that?
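Some of these are cheap to try. Swapping in a chat model, for instance, is a one-line change in the test class. A sketch; the model_name parameter is an assumption based on AutoChain's ChatOpenAI wrapper and may differ across versions:

# Inside TestWordLookUp, point the agent at a different model
llm = ChatOpenAI(model_name="gpt-4", temperature=0)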
Tests play an important role in making any application stronger and more resilient. In this article, we've seen how AutoChain can support testing in projects with a generative chat component. Indeed, as I was writing this blog, AutoChain pushed my prompts and functions in ways I had not considered and revealed how slight variations in my prompts, code logic, and user inputs can impact the quality of conversations. AutoChain is relatively new (its first commit was May 20th), but it already has some useful examples to help you get started. It draws inspiration from LangChain and AutoGPT, so people familiar with those frameworks will be able to pick it up easily. As the conversational AI landscape continues to develop, testing capabilities will hopefully continue to mature alongside it.