LLM Toolkit: Validation is all you need


Forget chains—structured output and validation are all you need.

At Mechanical Orchard, we’re building complex, bespoke, explainable AI agents to interrogate legacy mainframe systems autonomously. We’ve used a wide range of popular AI libraries and frameworks to bring LLM best practices into our codebase. The one I’ve been absolutely loving is Instructor, because modeling data is so much more powerful than modeling prompts. In this technical article, I’ll show you why.

Use case

We have a RAG tool that takes a plain-English question, rephrases it for better results based on a known schema, converts it to a graph database query, runs the query, retries on errors by self-healing from the error messages, interprets the results, and returns a conversational-style answer with citations for transparency and groundedness.

Earlier, I wrote about our first approach, which used a hard-coded 9-step chain to do this reliably. With Instructor, we can achieve equivalent (and sometimes better) results in only 20 lines of code!

Implementation

Instructor works by making LLMs format their output in a way that can be run through Pydantic, Python’s premier data validation library. This simple trick unlocks many powerful features:

  • Structured output with well-defined types
  • Validation at the field or model level
  • Common LLM enhancement patterns that are easy to apply

Let me show you how:
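What follows is a condensed sketch of the pattern rather than our production code verbatim: the field descriptions, the model name, and the GRAPH_SCHEMA constant are illustrative, and call_graph_db stands in for whatever function runs queries against your graph database.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field, model_validator

GRAPH_SCHEMA = "..."  # your graph's node and relationship types, loaded however you like


def call_graph_db(query: str) -> list:
    """Run the query against the graph database; raises if the query is invalid."""
    ...


class GraphDatabaseResult(BaseModel):
    rephrased_question: str = Field(description="The question, rephrased to match the graph schema")
    reasoning: str = Field(description="Step-by-step reasoning for how to build the query")
    query: str = Field(description="The graph database query to run")
    result: list = Field(default_factory=list, description="Raw rows returned by the database")

    @model_validator(mode="after")
    def run_query(self):
        # A database exception here fails validation, and Instructor feeds the
        # error message back to the LLM so it can repair the query and retry.
        self.result = call_graph_db(self.query)
        return self


client = instructor.from_openai(OpenAI())


def ask_the_graph(question: str) -> GraphDatabaseResult:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=GraphDatabaseResult,
        max_retries=3,
        messages=[
            {"role": "system", "content": f"Graph schema:\n{GRAPH_SCHEMA}"},
            {"role": "user", "content": question},
        ],
    )
```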

This short chunk of code is doing a lot! Let’s walk through it:

  • First, notice the lack of an obvious prompt. Instead, we provide a declarative annotated class that inherits from Pydantic. Instructor tells the LLM to make this GraphDatabaseResult class. Its shape, types, and field descriptions provide the context for the LLM. 
  • Now look at its fields. Both rephrased_question and reasoning are a form of “Chain of Thought” (CoT) to prime the LLM output with sound logic before it gets to generating actual query tokens. Adding CoT couldn’t be easier!
  • The model_validator runs automatically after the LLM returns the raw response that Instructor feeds into GraphDatabaseResult. This is where the magic happens. All it does is actually run the query. call_graph_db will raise a database exception if there is a problem with the query, which fails the validator, which in turn makes Instructor send the error back to the LLM to try again. If all goes well, it stores the result in the result field and finishes. That’s powerful!
  • Finally, the LLM call is very much like a standard OpenAI chat completion, with just a few tweaks from Instructor such as the response_model and max_retries arguments. Also notice how minimal the messages are: just the graph schema, so the LLM knows what is in the database, and the user’s question.

You might feel a little odd about hitting the database as part of validation, but let me ask: how else would you know the query is valid without actually running it? You could parse the query instead, assuming you have a robust parser for the flavor of graph query language you are using, but you might as well just run it and get self-healing for free.

And that’s it. Here’s what the output looks like:
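Shape-wise, you get back a fully populated GraphDatabaseResult; the question below is a made-up example, and the field values will depend entirely on your schema and data:

```python
answer = ask_the_graph("How many distinct programs call the BILLING module?")
print(answer.model_dump_json(indent=2))
# Prints the populated model: the rephrased question, the reasoning (which
# explicitly mentions returning DISTINCT values), the generated query, and
# the raw result rows.
```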

Notice how the reasoning calls out “distinct”, which often got left out of the query when generated without a reasoning step and used to cause erroneous results. For reference, this took about 5 seconds to run. You don’t see it here, but if the query were misconstructed, Instructor would silently try to fix it and run it again, up to 3 times. I hope you can see how powerful structured output plus validation can be!

Do you still need chains?

What if you want an LLM to summarize the results for you? How would you chain that together?

That’s a trick question: you don’t need chains, just normal functions. Notice how ask_the_graph returns a GraphDatabaseResult, which you can pass straight into any function that takes that type, just like you normally would in Python. This pattern lets you mix AI and traditional code with ease.

Here’s an example (with a little extra AI pizzazz):
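This is a sketch of that function; its body-less, docstring-driven style follows the library’s README, and the exact signature (the question plus the full GraphDatabaseResult from the earlier sketch) is a choice, not a requirement:

```python
from fructose import Fructose

ai = Fructose()


@ai()
def interpret_results(question: str, result: GraphDatabaseResult) -> str:
    """
    Given a user's question and the graph database result that answers it
    (rephrased question, reasoning, query, and raw rows), write a short,
    conversational answer that cites the query used.
    """
```

Calling it is then just interpret_results(question, ask_the_graph(question)).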

This uses Fructose, a slick, tiny library that hides away the LLM call for you when you put @ai on a function, building the prompt from the function name, arguments, types, and docstring. It even coerces the output into a structured type like Instructor does, though only for basic types. You could write out a full LLM call by hand if you prefer, but I like the brevity and high signal-to-noise ratio that Fructose provides.

You can see how minimal and expressive the code is, and how you can compose functions effortlessly. GraphDatabaseResult can automatically be represented as a string, so Fructose can simply interpolate it into the prompt it builds without requiring any additional processing. You can also choose a specific LLM model with Fructose, so it would be possible to use a smaller, cheaper, or fine-tuned model for this interpretation step.

Let’s push this pattern a little further. If you’d rather have the results interpreted as part of the GraphDatabaseResult directly, and you don’t care about getting bogged down in the semantics of what “validation” means, read on to the next section.

Advanced wizardry

Warning: everything in this section technically works, but it comes with implicit trade-offs. Nonetheless, it shows how to add more LLM patterns with minimal effort by stretching what validation means.

Here’s how you can interpret the raw database results directly in GraphDatabaseResult by just adding a response field and using the nifty Fructose interpret_results function from above in the existing model_validator:
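A sketch of that change, building on the class from before (descriptions trimmed; interpret_results is the Fructose-decorated function above):

```python
class GraphDatabaseResult(BaseModel):
    rephrased_question: str
    reasoning: str
    query: str
    result: list = Field(default_factory=list)
    response: str = Field(default="", description="Conversational answer to the user's question")

    @model_validator(mode="after")
    def run_query(self):
        self.result = call_graph_db(self.query)
        # Turn the raw rows into prose as part of "validation".
        self.response = interpret_results(self.rephrased_question, self)
        return self
```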

Simple. Now let’s push the envelope a little further. Let’s add human-in-the-loop validation to make sure the user approves of the rephrased question. This adds a field validator and sticks an input() call inside of it (human validation!):
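Something along these lines; the prompt wording and the rejection message are up to you:

```python
from pydantic import field_validator


class GraphDatabaseResult(BaseModel):
    rephrased_question: str
    reasoning: str
    query: str
    result: list = Field(default_factory=list)
    response: str = ""

    @field_validator("rephrased_question")
    @classmethod
    def confirm_rephrasing(cls, value: str) -> str:
        # Human in the loop: a rejection raises, and Instructor sends the
        # feedback back to the LLM for another attempt at rephrasing.
        approved = input(f"Is this rephrasing OK? {value!r} (y/n) ")
        if approved.strip().lower() != "y":
            raise ValueError("The user rejected this rephrasing; try a different wording.")
        return value

    # ... run_query model_validator as above ...
```

Because field validators run before model validators, the human gets to approve the rephrasing before the query is ever executed.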

Here’s an example interaction:
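Something like this, hypothetically (the question is made up for illustration):

```text
Is this rephrasing OK? 'How many distinct programs call the BILLING module?' (y/n) y
```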

To finish it off, let’s add some self-critique by adding a certainty score and critique reason, based on a “judge” LLM that determines how well the results answer the original question. 

This time, you add a second model validator, after the first one, that uses the Fructose @ai trick. Also notice how the user’s original question gets passed as validation_context, so you can validate against data that lies outside of what the LLM generated:
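Here’s a sketch. The 0.7 threshold and the judge’s prompt are illustrative, and for brevity the judge only returns a certainty score here; wiring a critique-reason field through works the same way, provided your Fructose version can coerce the judge’s return type:

```python
from pydantic import ValidationInfo


@ai()
def judge_answer(question: str, result: str) -> float:
    """
    Given a user's question and a graph database result (rephrased question,
    reasoning, query, and raw rows), score from 0.0 to 1.0 how well the
    result actually answers the question. Be strict.
    """


class GraphDatabaseResult(BaseModel):
    rephrased_question: str
    reasoning: str
    query: str
    result: list = Field(default_factory=list)
    response: str = ""
    certainty: float = 0.0

    # ... confirm_rephrasing and run_query validators as above ...

    @model_validator(mode="after")
    def judge_results(self, info: ValidationInfo):
        # The original question arrives via validation_context, outside of
        # anything the LLM generated.
        question = (info.context or {}).get("question", self.rephrased_question)
        self.certainty = judge_answer(question, str(self))
        if self.certainty < 0.7:
            raise ValueError(
                f"A judge scored this answer {self.certainty:.2f} for the question "
                f"{question!r}; rework the query so the results actually answer it."
            )
        return self


def ask_the_graph(question: str) -> GraphDatabaseResult:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=GraphDatabaseResult,
        max_retries=3,
        validation_context={"question": question},  # exposed to validators via info.context
        messages=[
            {"role": "system", "content": f"Graph schema:\n{GRAPH_SCHEMA}"},
            {"role": "user", "content": question},
        ],
    )
```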

And there you have it. With just a little more work, you get live self-evaluation to make sure answers are grounded, reliable and relevant, with automatic retries and self-healing if the score falls below a threshold.

Conclusion

The final full code example can be found here.

Rather than letting the LLM “ramble on” in prose, you can now get high-quality, well-typed responses thanks to Instructor and Pydantic. And rather than building complex chains, you can get solid LLM performance-improvement patterns with some simple validation hacks. Not only will your code be more compact and powerful, but your outputs will be of higher quality too.
