Can Two AIs Play the TDD Pairing Game?


At Mechanical Orchard, we are strong proponents of the methodology known as Extreme Programming (XP), and we embrace Pair Programming for the numerous benefits it provides.

With the rapid emergence of AI services and models, we constantly refine our development workflows and integrate AI into our daily operations. Large Language Models (LLMs) like the widely used ChatGPT fit seamlessly into our XP practices, as we engage with AI companions in brief development cycles, enhancing and accelerating our everyday development.

In recent months, I've devoted more time to understanding how to best pair program with an AI companion. This involves examining prompt engineering and determining the most effective ways to fully utilize generative AI. My experience has been gratifying, and the addition of a skilled companion has proven extremely valuable on more than one occasion.


While exploring ways to interact with AI, I came across a recent paper titled "Reflexion: Language Agents with Verbal Reinforcement Learning", which suggests that GPT-4 becomes 30% more accurate when directed to critique itself. The "Reflexion" described in this paper aligns closely with Test Driven Development cycles and pair programming, where developers challenge each other and exchange ideas in quick iterations.

This led me to consider an experiment: could two AI companions interact with each other using ping-pong programming to create a small yet complete project? For those unfamiliar with this technique, it combines pair programming with TDD: one developer writes the tests, and the other writes the implementation satisfying those tests (hence the ping-pong reference).

I understood that this would be purely an engaging experiment, as the project the AIs would be working on was relatively small and not comparable to comprehensive, realistic scenarios. In my experience, it's crucial for developers to refine AI outputs. Nevertheless, it would be intriguing to observe how much two AI companions employing pair programming practices could produce autonomously.

Max and Oscar

I prepared prompt sequences for two distinct personas: one named Max, responsible for writing tests and guiding development; the other named Oscar, in charge of writing the implementation. Both were instructed to challenge each other's solutions and to properly adhere to TDD techniques; for example, implementing one method at a time, and ensuring that the implementation satisfies only the existing tests, and nothing more.

One limitation of ChatGPT (using the gpt-4-32k-0613 model) is its inability to run produced tests and code. To address this, I implemented a quick test runner called Lysa, which ran the files created by Max and Oscar and output the results. If the tests passed, the pair would be asked to proceed with the rest of the methods until completion; if not, they would need to fix the encountered issues. Lysa also coordinated the messages exchanged between Max and Oscar, making sure both personas stayed aware of the latest version of the code being tested, avoiding discrepancies.
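In outline, Lysa's cycle can be sketched like this (a hypothetical reconstruction for illustration; the actual runner is not reproduced in this post):

```elixir
# Hypothetical sketch of Lysa's coordination loop. The module name, message
# shapes, and use of `mix test` are assumptions, not the real implementation.
defmodule Lysa do
  def cycle do
    # Run the test suite Max wrote against the code Oscar wrote.
    {output, status} = System.cmd("mix", ["test"], stderr_to_stdout: true)

    case status do
      # Tests pass: prompt the pair to proceed with the next method.
      0 -> {:passed, output}
      # Tests fail: relay the failure output so the pair can fix the issues.
      _ -> {:failed, output}
    end
  end
end
```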

The following video captures the event, demonstrating how Max and Oscar, with Lysa's assistance in running tests, autonomously completed the bank exercise. It was quite the spectacle.

Link to the generated code.

In this experiment I wanted to provide the pair with some specifications and then allow them to work autonomously without human interaction. The AI pair would be tasked with building a basic bank capable of holding multiple account holders and allowing them to deposit, withdraw, and check their balances.

Our current language of choice at Mechanical Orchard is Elixir; if you're unfamiliar with it, don't worry, as the concepts apply to other languages as well.

Elixir enables the definition of Typespecs that tools like Dialyzer can use to identify software inconsistencies and provide documentation for API users. For example, if I wanted to create a say_hello_to method that accepts a string parameter name and returns a greeting string, I could write the following spec:
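A sketch of that spec:

```elixir
# The @spec line documents the types; Dialyzer checks implementations against it.
@spec say_hello_to(name :: String.t()) :: String.t()
def say_hello_to(name) do
  "Hello, #{name}!"
end
```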

After setting up all the necessary prompts and programming the conversation flows between Max, Oscar, and Lysa, I entered this prompt:

I want to build a bank functionality. The bank can hold multiple accounts, and every account can check their balance, deposit and withdraw money. There’s no need to plan for persistence for the moment.
This is what the specs of the bank methods look like:

Use a GenServer in your implementation.
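The specs given to the pair aren't reproduced above; they would have looked roughly like this (a reconstruction based on the method descriptions below; the exact success and error return shapes are assumptions):

```elixir
# Assumed typespecs for the four bank methods; errors follow {:error, reason}.
@spec create_account(String.t()) :: :ok | {:error, term()}
@spec balance(String.t()) :: {:ok, number()} | {:error, term()}
@spec deposit(String.t(), number()) :: {:ok, number()} | {:error, term()}
@spec withdraw(String.t(), number()) :: {:ok, number()} | {:error, term()}
```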

You can see that I defined 4 methods:

  • A create_account method that takes a string account identifier.
  • A balance method that returns the balance for a given account.
  • A deposit method that increases the balance of a given account.
  • And finally, a withdraw method that decreases the balance of a given account.

These methods can fail as well as succeed, returning failure reasons in the format {:error, reason}. I also provided some specifics on the desired implementation, namely to use a GenServer.
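To make the shape of the task concrete, here is a minimal GenServer sketch of such a bank, written for this article as an illustration of the requested API; it is not the code Max and Oscar produced, and the state shape and error reasons are assumptions:

```elixir
defmodule Bank do
  use GenServer

  # Client API. State is assumed to be a map of account id => balance.
  def start_link(_opts), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  def create_account(id), do: GenServer.call(__MODULE__, {:create_account, id})
  def balance(id), do: GenServer.call(__MODULE__, {:balance, id})
  def deposit(id, amount), do: GenServer.call(__MODULE__, {:deposit, id, amount})
  def withdraw(id, amount), do: GenServer.call(__MODULE__, {:withdraw, id, amount})

  # Server callbacks.
  @impl true
  def init(accounts), do: {:ok, accounts}

  @impl true
  def handle_call({:create_account, id}, _from, accounts) do
    if Map.has_key?(accounts, id) do
      {:reply, {:error, :account_already_exists}, accounts}
    else
      {:reply, :ok, Map.put(accounts, id, 0)}
    end
  end

  def handle_call({:balance, id}, _from, accounts) do
    case Map.fetch(accounts, id) do
      {:ok, balance} -> {:reply, {:ok, balance}, accounts}
      :error -> {:reply, {:error, :account_not_found}, accounts}
    end
  end

  def handle_call({:deposit, id, amount}, _from, accounts) do
    case Map.fetch(accounts, id) do
      {:ok, balance} ->
        {:reply, {:ok, balance + amount}, Map.put(accounts, id, balance + amount)}

      :error ->
        {:reply, {:error, :account_not_found}, accounts}
    end
  end

  def handle_call({:withdraw, id, amount}, _from, accounts) do
    case Map.fetch(accounts, id) do
      {:ok, balance} when balance >= amount ->
        {:reply, {:ok, balance - amount}, Map.put(accounts, id, balance - amount)}

      {:ok, _balance} ->
        {:reply, {:error, :insufficient_funds}, accounts}

      :error ->
        {:reply, {:error, :account_not_found}, accounts}
    end
  end
end
```

The GenServer serializes all calls, so concurrent deposits and withdrawals on the same account cannot race.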

After that, I set the pair to work on the given task. I observed their interactions on multiple occasions and experimented with various GPT models. While some iterations were more successful than others, I have chosen to share the following example as it aptly represents the average outcome.

Two quick observations:

  • A full iteration to complete the exercise cost around $6.
  • GPT-3 was unable to finish the task, either failing to keep the two personas distinct or not properly following instructions.

What can we conclude from this experiment?

Firstly, the AI's ability to adopt roles and follow specific instructions is undeniably impressive. I can only imagine the advancements that will emerge within the next few months, if not weeks.

Secondly, while automated code generation might succeed in small experiments like this, it will likely not yield the same results in larger and more complex projects. Even within this small-scale project, there are tests and implementation details that I would have approached differently. It's crucial to emphasize the importance of developers curating AI and its outputs.

Ultimately, this confirms my belief that developers utilizing XP methodologies are on the right path. Fostering synergy between humans and AI in Test-Driven Development (TDD) and, more broadly, Extreme Programming (XP) methodologies can significantly refine, enhance, and streamline developers' daily interactions with their codebases.



