How to Make Your AI Less Flaky


They say trust is earned, but how do you trust your AI?

On the Mechanical Orchard R&D team, we’ve been using AI as much as possible to help us understand and analyze complex legacy mainframe systems.[1] [2] I’ve asked DALL-E to illustrate our general approach:

It’s mostly right, sort of. It’s not great at spelling all the words, and step 3 traditionally comes after step 2, but you get the point. Or do you?

Therein lies the problem. While LLMs can dramatically assist us in our everyday work, we can only rely on them to the extent that they are, well… reliable. And AI can be notoriously unreliable.

We recently made a bespoke AI agent that can answer complex questions about the structure and statistics of the code in our client’s mainframe system. At first, it wasn’t very accurate or reliable. Here are some of the ways we fixed it so that we could trust what the agent says.

The use case

Through some of our prior AI efforts, we have a graph database of all of the jobs, source code, files, and database tables in the mainframe, including the relations between them and generated summaries of what they all do. It’s pretty big. Here is a view of the results from querying a subset of the nodes and relationships in our graph database (we use Memgraph):

Cool, right?

We can write queries to find out the answers to interesting questions, like “Which source file is used by the most jobs?” It can be very powerful, but the querying syntax is a bit esoteric and the results can be hard to interpret.

Now imagine that instead of this, you have a chat interface where you can simply ask your questions in plain English and get the answers you need. It also makes follow-up questions effortless: you don’t lose your train of thought by context-switching to how to formulate the proper query. By removing the friction of crafting queries, users can stay in the flow and focus on the business value.

The proposed solution

Sounds like a job for RAG (Retrieval-Augmented Generation). The basic idea of RAG is to give the AI access to the graph database and let it figure out how to query it and interpret the results. It’s the bread and butter of the AI world, but it never quite seems to “just work.”
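
In our setting, the RAG flow looks roughly like the loop below. This is a minimal sketch: `ask_llm` and `run_cypher` are hypothetical stand-ins that return canned values, not a real model or database API, and the job count is made up.

```python
# Minimal sketch of the graph-RAG loop: generate a query from the question,
# run it against the graph, then summarize the results for the user.

def ask_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; returns a canned Cypher query here.
    return "MATCH (j:Job) RETURN count(j) AS jobs"

def run_cypher(query: str) -> list[dict]:
    # Stand-in for a real Memgraph session; returns canned rows here.
    return [{"jobs": 717}]

def answer(question: str) -> str:
    query = ask_llm(f"Write a Cypher query for: {question}")
    rows = run_cypher(query)
    return f"Ran `{query}` and got {rows[0]['jobs']} jobs."

print(answer("How many jobs are in the mainframe?"))
```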

Where things go wrong

Let’s take a look at all of the places things could go wrong. Keep in mind: this is a multi-stage process and any errors in early stages propagate to later stages, compounding inaccuracies in the answer you are seeking.

  1. Understanding the user’s question. The question could be phrased poorly. The user might ask about something that isn’t even in the graph. The LLM could misunderstand.
  2. Query generation. The LLM might be terrible at generating queries that actually work. Or it could have been trained on a different version of the query spec, so it uses invalid syntax. Maybe the query works, but the LLM is answering a question that isn’t what you meant, and you’re unable to tell that it got the question, not the answer, wrong.
  3. Data interpretation. The results could be too large, or too obscure. Or they could be wrong or empty, but the LLM thinks they are correct and misleads the user.
  4. Flow confusion. The LLM might ‘forget’ what it is supposed to do, reverting to its explainer comfort zone instead of using your tools.
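
Because errors in early stages compound, it helps to check each stage’s output before the next stage consumes it. A toy sketch of such guards (the specific checks are illustrative, not from our codebase):

```python
# Cheap sanity checks between stages, so a bad query or a suspicious
# result set is caught early instead of compounding downstream.

def check_query(query: str) -> None:
    # Reject anything that doesn't look like a read-only graph query.
    if not query.strip().upper().startswith("MATCH"):
        raise ValueError(f"Suspicious query: {query!r}")

def check_results(rows: list) -> None:
    # An empty result often means the query answered the wrong question.
    if not rows:
        raise ValueError("Query returned no rows; re-check the question.")

check_query("MATCH (j:Job) RETURN j.name")        # passes
check_results([{"name": "KLMDAILY"}])             # passes
```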

Getting grounded

We used several design patterns to give us the best chance of reliable results:

1. Multiple Steps, Multiple LLMs

While we could have easily designed our agent as a single prompt, we decided to break it up into a series of steps in a chain, with separate LLMs for each step.

This architecture has 3 phases:

  1. Understanding the user’s question, specifically in terms of the schema of the graph.
  2. Turning the question into a query and running it (fixing it if needed).
  3. Presenting the results and handling follow-up questions.
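
The three phases above can be sketched as a simple chain, with a separate model configuration per phase. The `LLM` class, phase names, and prompts below are illustrative stand-ins, not our actual code:

```python
from dataclasses import dataclass

@dataclass
class LLM:
    # One configuration per phase: its own prompt and model parameters.
    name: str
    system_prompt: str
    temperature: float

    def run(self, text: str) -> str:
        # Stand-in for a real model call; tags output with the phase name.
        return f"[{self.name}] {text}"

# Hypothetical per-phase configurations, not our actual prompts.
understand = LLM("understand", "Restate the question in terms of the graph schema.", 0.0)
generate = LLM("generate", "Write a Cypher query for the restated question.", 0.0)
present = LLM("present", "Summarize the query results for the user.", 0.7)

def chain(question: str) -> str:
    restated = understand.run(question)
    query = generate.run(restated)
    return present.run(query)

print(chain("Which source file is used by the most jobs?"))
```

Each phase can now be evaluated and tuned in isolation, which is the point of splitting them.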

The colored boxes in the diagram represent each phase’s own LLM, complete with its custom prompt, memory context, model parameters, and underlying model.

By using separate phases, we can tailor each LLM to focus on a specific task for the best output, evaluating each phase individually and fine-tuning specific phases’ LLMs as needed.

2. Curated Context

Each LLM is only given the context it needs, for example, its relevant graph schema and the output from previous steps in a templated prompt. This keeps things more focused, as compared to having to share the full conversation history and all context with each LLM.
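
Concretely, each phase’s prompt is built from a template that interpolates only the pieces that phase needs. A sketch (the template text and schema snippet are illustrative):

```python
# Only the relevant schema slice and the previous step's output are
# interpolated into the prompt; the full chat history stays out of it.

QUERY_PROMPT = """You write Cypher queries.
Graph schema (relevant part only):
{schema}

Restated question from the previous step:
{restated}

Return a single query."""

def build_prompt(schema: str, restated: str) -> str:
    return QUERY_PROMPT.format(schema=schema, restated=restated)

prompt = build_prompt(
    schema="(:Job)-[:USES]->(:SourceFile)",
    restated="Count Job nodes grouped by the first three letters of 'name'.",
)
print(prompt)
```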

3. Human Input and Iteration

If you can’t trust your AI, hopefully you can trust your users. We give users multiple opportunities to keep the LLM agent on the right track, like confirming or refining its understanding of the question before moving on.[3]

We also explicitly show the generated query at the end, so that the user can not only run it themselves if they want, but also so that they can eyeball it and see if it seems right.
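
The confirmation step can be as simple as looping until the user accepts the agent’s restatement of their question. A sketch, with the user interaction passed in as a callback so it can be simulated (`restate` is a stand-in for the understanding-phase LLM):

```python
# Loop until the user confirms the agent's restatement of their question.

def restate(question: str, feedback) -> str:
    # Stand-in for the understanding-phase LLM call.
    base = f"You want: {question}"
    return base if feedback is None else f"{base} ({feedback})"

def confirm_understanding(question: str, ask_user) -> str:
    feedback = None
    while True:
        restated = restate(question, feedback)
        reply = ask_user(f"{restated} -- correct? (yes/no + details)")
        if reply.strip().lower().startswith("yes"):
            return restated
        feedback = reply  # fold the correction into the next restatement

# Simulated user: one correction, then a confirmation.
answers = iter(["no, include counts per prefix", "yes"])
final = confirm_understanding("list 3-letter job prefixes", lambda _: next(answers))
print(final)
```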

4. Prompt Design Patterns

We have a few tricks up our sleeves for getting better results. Restating the question in terms of the graph schema makes generating a query more likely to succeed, compared to only basing it directly on the user’s wording of the question.[4] [5]

We use the same pattern when handling errors internally, asking for the error to be explained first, which helps the LLM correctly fix the broken query.[6]
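
That explain-then-fix pattern can be sketched as a retry loop: the model is first asked to explain the failure, and the explanation is fed into the repair attempt. All three model calls below are stubs, and the broken-query example is invented:

```python
# Explain-then-fix retry loop: the error is explained before the model is
# asked to repair the query, anchoring the fix on the actual failure.

def explain_error(query: str, error: str) -> str:
    # Stand-in for an LLM call explaining what went wrong.
    return f"The query failed because: {error}"

def fix_query(query: str, explanation: str) -> str:
    # Stand-in for an LLM repair call; a real one would use the
    # explanation, this stub just patches the known typo.
    return query.replace("substring(,", "substring(j.name,")

def run_with_repair(query: str, run, max_tries: int = 3) -> list:
    for _ in range(max_tries):
        try:
            return run(query)
        except ValueError as err:
            explanation = explain_error(query, str(err))
            query = fix_query(query, explanation)
    raise RuntimeError("Could not repair the query.")

def fake_run(q: str) -> list:
    # Simulated database: rejects the broken query, accepts the fixed one.
    if "substring(," in q:
        raise ValueError("syntax error near ','")
    return [{"Prefix": "KLM", "JobsCount": 343}]

print(run_with_repair("MATCH (j:Job) WITH substring(, 0, 3) AS p RETURN p", fake_run))
```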

We also pulled out some decision-making logic from our main prompts and put it into an extra “hidden” LLM, whose sole job is to decide if it has enough context to answer a follow-up question or if it needs to run a new query.[7] This makes sure the flow loops back to the beginning instead of letting the agent get stuck in Phase 3 and hallucinate follow-up answers.
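
The router can be a tiny classifier-style step whose only output is a route name. A sketch, with a hard-coded heuristic standing in for the hidden LLM (the two route names and the decision logic are illustrative):

```python
# A dedicated "router" step decides whether the current context can answer
# a follow-up, or whether the flow should loop back and run a new query.

def route(followup: str, context_rows: list) -> str:
    # Stand-in for a small LLM whose only job is to pick a route.
    if not context_rows or "which" in followup.lower():
        return "new_query"            # loop back to Phase 1
    return "answer_from_context"      # stay in Phase 3

print(route("Which jobs use the KLM prefix?", [{"Prefix": "KLM"}]))
print(route("So KLM is the biggest one?", [{"Prefix": "KLM", "JobsCount": 343}]))
```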

The Results

We’ve had good results so far. Based on our team’s experience, the agent has reliably helped save time and provide fresh insights. We have also set up an automated evaluation loop to identify and improve any lapses in reliability.

As an example, here is an actual conversation that happened:

Agent: Hi, I am an agent with access to the mainframe dependency graph. Try asking me a question about mainframe jobs and I will do my best to answer.
User: I'd like to know which 3-letter prefixes our jobs have
Agent: To ensure I understand your request correctly, you want to find out the unique three-letter prefixes that are present in the 'name' property of 'Job' nodes in our database, correct?
User: Yes. Oh, also can you include the number of jobs per prefix?
Agent: To clarify, you're asking for the distinct three-letter prefixes from the 'name' property of 'Job' nodes and also, you'd like to know the count of jobs that start with each of these specific prefixes in our database. Is this correct?
User: that's right
Agent: Great, give me a minute to try to answer that for you, please wait...
The original question was:
What are the unique three-letter prefixes used in the 'name' property of 'Job' nodes in our database and what is the count of jobs that use each of these prefixes?

The query used to get this information was:
MATCH (j:Job)
WITH substring(j.name, 0, 3) AS Prefix, count(*) AS JobsCount
RETURN Prefix, JobsCount

Based on the results of the query, the three-letter prefixes used in the 'name' property of 'Job' nodes and the count of jobs that use each of these prefixes are as follows:
KLM: Used by 343 jobs
NIT: Used by 170 jobs
UUT: Used by 80 jobs
ACS: Used by 57 jobs
BAK: Used by 45 jobs
GGA: Used by 17 jobs
IRM: Used by 5 jobs

The prefix "KLM" is the most commonly used, appearing in the names of 343 jobs.
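
As a sanity check, the aggregation that query performs is easy to reproduce by hand. The job names below are invented examples, not data from the real graph:

```python
from collections import Counter

# Invented example job names; the real graph has hundreds per prefix.
job_names = ["KLMDAILY", "KLMNIGHT", "NITBATCH", "UUTLOAD", "KLMPURGE"]

# Group by the first three letters of each name, mirroring
# substring(j.name, 0, 3) in the generated query.
prefix_counts = Counter(name[:3] for name in job_names)

for prefix, count in prefix_counts.most_common():
    print(prefix, count)
```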


Perhaps the takeaway here is that trusting an LLM is not that different from trusting a human, as long as you give them a way to be honest, communicate clearly, rectify their mistakes, and listen to your needs.




[1]: We experiment with the technology frequently, to see how we can weave Generative AI into our XP practices. Our CTO, Roberto Ostinelli, goes into how one might perform pair programming with an AI companion on our blog.

[2]: Our VP of Product, Dan Podsedly, explains how we use generative AI to help us with some of the heavy lifting with reverse engineering in our legacy modernization process.

[3]: “Flipped interaction” pattern – Lets the LLM ask the user questions. Flipped interactions often help capture the best context for the upcoming interactions.

[4]: “Question refinement” pattern – A common best practice in RAG applications; it lets the LLM restate your prompt in a format better suited for the upcoming retrieval step.

[5]: “Plan and execute” pattern – Improves generated results. “Plan-and-execute” first plans what to do, then executes the sub-tasks, inspired by research into agents like BabyAGI.

[6]: “Self-correcting” pattern – Gives the LLM a way to recover from an error by retrying a failed task with additional context.

[7]: “LLM Router” pattern – Lets the LLM decide the most appropriate task to take next based on the current context.
