Mainframe modernization is a verification problem

Author: Sam Sanders, Field CTO at Mechanical Orchard
June 3, 2026

Mainframe modernization is a verification problem: success depends on achieving behavioral equivalence, the property that a new system produces the same outputs as the legacy system for the same inputs. Workflows relying solely on Large Language Models (LLMs) often fail because they overlook critical, low-level invariants that reside "below the business rules layer," such as data structure, numeric encoding, and runtime integration details. Manually coaching the model for a large program portfolio is not a scalable operating model.

The solution requires a commitment to automated, repeatable verification using the diamond pattern: the same input is fed to both the original COBOL and the candidate modern code, with the running COBOL acting as the test harness. Platforms such as Imogen codify this approach with tooling for deterministic source discovery, byte-accurate data type simulation, and a characterization test harness to assert byte-level equivalence. The viability of large-scale modernization depends on implementing a workflow that grounds verification in the running original system, transforming modernization from a risk into a predictable strategic investment.

Introduction

Two Java replicas of the same mainframe job sit side by side. Both started from the same AWS CardDemo source tree — the open-source credit card management application AWS publishes as a reference for modernization tooling. Same COBOL programs, same JCL, same copybooks, same adjacent assembler, same sample data. The first was produced by a prompt to a capable, general-purpose large language model. The second was produced by an Imogen-driven workflow over the same tree. Both compile, and both run, but only one is verifiable.

This post is about what the difference looks like in practice, and why it matters at portfolio scale. In this example, I've chosen the program READACCT, the CardDemo job that reads the account master file and writes a formatted account report. It is small enough to be legible in a blog post and large enough to expose the categories of failure that tend to recur across enterprise mainframe modernizations.

A note before the argument starts. Both runs had access to the full CardDemo source tree. Source availability was constant. The variable was the workflow and how the workflow used the available source. Availability is not utilization, and that distinction is where most of this story lives.

What happened when I used LLMs for forward engineering

What does an LLM-only workflow actually deliver, and what does it leave behind?

It delivers Java that compiles, executes, and reads plausibly to a senior engineer. Compilation is a low bar for mainframe modernization. The harder problem, and the one that determines whether a cutover survives contact with production, is behavioral equivalence: the property that the new system produces the same outputs as the old system for the same inputs, across the long tail of inputs that real workloads actually carry.

Modernization fails in three ways. Programs are too slow to deliver. The new code carries subtle defects that surface in production. The act of cutting over disrupts the systems around the workload. An LLM-only workflow is plausibly an answer to the first risk, on a single program, in isolation. It does not by itself answer either of the other two. The reason traces back to what the modernization workflow chose to verify, and how, rather than to any property of the model.

What the workflows chose to verify shows up as concrete divergences in the code the LLM-only workflow produced.

The ReadAcct.java produced by the LLM-only workflow is a single file. Its header self-documents its scope as a Java replication of CBACT01C.cbl and READACCT.jcl. The job was named in the prompt; the surrounding source tree was inferred. For most of the program logic, that inference is fine. The invariants that sit below the business-rules layer are the source of every category of failure that follows. For example:

Data structure. Reading a record by key is structurally different from scanning a file by byte offset — closer to a database lookup than to reading a CSV line by line. The COBOL declares its input as a key-indexed file (VSAM KSDS) with an 11-byte record key; the LLM-only replica opens it as a flat byte stream of fixed-length blocks. The test passes because it feeds the replica flat-file fixtures the LLM invented. Real account data would crash the program.

Numeric encoding. The COBOL fields are EBCDIC-encoded with non-ASCII representations for signed and packed-decimal numbers; the LLM-only replica treats everything as ASCII and emits negatives as a leading-minus string. Tests pass against fixtures matching the replica's own assumptions; real mainframe bytes look nothing like that.

Subroutine call semantics. READACCT calls a small assembler routine COBDATFT with two valid input shapes and an error path; the LLM-only replica renders the call as replace("-", ""), matching only one branch. The other valid shape and the error path are silently absent, and will be missed in production exactly when they matter.

Runtime integrations. READACCT issues an abend with code 999 on failure; the LLM-only replica substitutes System.exit(12) on any exception. Schedulers and operators watching for 999 now watch for the wrong number.

All four examples sit below the layer of business rules. They are encoding rules, file-organization rules, external-call semantics, and runtime integration. None of them appears in the prompt. All of them appear in the original source tree.

The LLM-only workflow did not consume them, despite their availability. The operator's prompt had no mechanism to require the inspection that would have surfaced them.
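
To make the numeric encoding point concrete, here is a minimal Java sketch of decoding the two representations mentioned above: zoned decimal with a trailing sign overpunch, and COMP-3 packed decimal. The class and method names are hypothetical, not taken from CardDemo or Imogen, and the sketch assumes the common convention in which a sign nibble of 0xD (or 0xB) marks a negative value.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical helper for illustration; not CardDemo or Imogen code.
final class MainframeNumericSketch {

    // Decodes an EBCDIC zoned decimal whose sign is overpunched on the last byte.
    static BigDecimal decodeZoned(byte[] field, int scale) {
        StringBuilder digits = new StringBuilder(field.length);
        for (byte b : field) {
            digits.append(digitChar(b & 0x0F)); // low nibble carries the digit
        }
        int zone = (field[field.length - 1] >> 4) & 0x0F; // high nibble of last byte carries the sign
        return signed(digits, zone, scale);
    }

    // Decodes COMP-3 packed decimal: two digits per byte, sign in the final nibble.
    static BigDecimal decodePacked(byte[] field, int scale) {
        StringBuilder digits = new StringBuilder(field.length * 2);
        for (int i = 0; i < field.length; i++) {
            digits.append(digitChar((field[i] >> 4) & 0x0F));
            if (i < field.length - 1) {
                digits.append(digitChar(field[i] & 0x0F));
            }
        }
        int sign = field[field.length - 1] & 0x0F;
        return signed(digits, sign, scale);
    }

    private static char digitChar(int nibble) {
        if (nibble > 9) throw new IllegalArgumentException("not a decimal digit: " + nibble);
        return (char) ('0' + nibble);
    }

    // Assumption: 0xD or 0xB means negative; any other sign nibble is treated as non-negative.
    private static BigDecimal signed(CharSequence digits, int signNibble, int scale) {
        BigInteger unscaled = new BigInteger(digits.toString());
        boolean negative = signNibble == 0x0D || signNibble == 0x0B;
        return new BigDecimal(negative ? unscaled.negate() : unscaled, scale);
    }

    // Convenience for writing byte literals above 0x7F without casts.
    static byte[] bytes(int... values) {
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) out[i] = (byte) values[i];
        return out;
    }

    public static void main(String[] args) {
        // -1025.00 as a twelve-digit zoned decimal: the last byte's zone nibble is 0xD.
        byte[] zoned = bytes(0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF1, 0xF0, 0xF2, 0xF5, 0xF0, 0xD0);
        // The same value as COMP-3: one pad nibble, twelve digit nibbles, then the sign nibble.
        byte[] packed = bytes(0x00, 0x00, 0x00, 0x01, 0x02, 0x50, 0x0D);
        System.out.println(decodeZoned(zoned, 2));   // prints -1025.00
        System.out.println(decodePacked(packed, 2)); // prints -1025.00
    }
}
```

Neither byte sequence contains an ASCII minus sign or ASCII digits, which is why a replica that parses the field as text only passes against fixtures it invented itself.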

The limitations of coaching the LLM

The obvious objection: A skilled operator could have coached the agent further. A second prompt could have asked the model to inspect the assembler routine. A third could have surfaced the EBCDIC and overpunch handling. A fourth could have caught the ABEND code. A skilled operator can absolutely coach a single LLM session into a much better replica of one program.

But what about the next thousand sessions across a typical portfolio?

A real enterprise mainframe portfolio includes many programs, many copybooks shared across them, JCL chains, dated dataset generations, DB2 schemas, shared job-step libraries, control cards, catalog dataset definitions, and small assembler routines whose semantics matter exactly when nothing else has caught the edge case. Each program sits inside a web of these dependencies. Each one carries its own version of the encoding, file-organization, runtime-integration, and external-call problems described above. None of those problems present themselves in the prompt the operator types.

If every program needs an attentive operator to coach the model toward the right files, the right copybooks, the right runtime semantics, and the right verification approach, the bottleneck of modernization becomes the supply of attentive operators. That bottleneck is what kills modernization timelines and what makes risk impossible to bound. The question for portfolio modernization is the operating model. Can the right work happen, repeatedly, across many workstreams, without each nuance being hand-coached?

That is a workflow problem, and it cannot be solved by a smarter prompt.

The limitations of business rules extraction we encountered

A more sophisticated pitch is to use one LLM to extract the business rules from the COBOL. Use another LLM to reimplement the rules in Java. The intermediate artifact looks reassuring. The pipeline feels cleaner. But it has the same structural failure as the LLM-only approach.

The extracted rules are disconnected from coverage of the codebase: each rule comes with examples, but how much of the code's behavior the rules cover is difficult to measure. In my experience, the next question is "how well do these rules map to testable behavior?", and answering it requires extensive human involvement and energy. In a way, the extracted rules call out problems without proposing concrete solutions or providing deterministic tests to validate the behavior of the new system.

The categories from the previous section illustrate this. VSAM KSDS organization is a file-format detail. Business rules extraction will identify that VSAM KSDS is used, but it leaves open the question of how to implement the internal details: there is no rule that describes KSDS internals (keys, indexes, CI/CA structure, and so on).

EBCDIC, the overpunched sign nibble, and COMP-3 are encoding details. When my teams have worked with business rules extraction (BRE) in the past, implementing these details required human experience, and testing (or troubleshooting) became a difficult problem to constrain or shrink. The COBDATFT branches are external-call semantics, described in the rules for those external programs. Testing cannot be done within the program's boundaries alone; testing behavior for each individual workflow requires simulating and validating the behavior of those other components. The abend code is a runtime-integration detail, and it will be implemented in an entirely different manner in the new architecture. How it gets implemented requires new thinking, which should neither be abdicated to an AI nor done in a piecemeal manner. The BRE extract might identify these as problems, but the real work is solving them and concretely rooting the solutions in tests that are connected back to coverage of the systems' internals (code, services) and to behavior in the real world.

These details sit below the layer at which extraction operates, which is why extracted rules are often handed to humans or AI systems to attempt forward engineering.

There is a deeper problem. A two-stage LLM pipeline is two inferences in series with no executable check from the running original system. In plain terms, the natural-language rules are an abstraction that lacks the implementation tests needed to validate that a new system built from that abstraction really behaves the same way. The Java is judged against a rule list the first LLM wrote down. The rule list is judged against the COBOL source the first LLM read. At no point does the candidate Java meet the actual behavior of the actual program. The pipeline can be perfectly self-consistent and still be wrong about every encoding, file, runtime, and integration invariant the rule list omitted.

What credible verification looks like

The fix lives one level up from extraction quality. It requires a workflow whose unit of verification is the running original system. The verification pattern that delivers this is the diamond: the same input flows to both the original COBOL and the candidate Java, and both outputs reconverge at an equivalence check.

This verification primitive is small enough to apply to a single program and rich enough to scale across thousands.

The COBOL path is the runtime source of truth: the executable behavior of the original system. The COBOL source is the specification, and the running COBOL is what tells you whether the Java matches it.

Two more terms before the rest of the post relies on them. A characterization test is a test that captures observed program behavior as runnable assertions and runs the same logic against both implementations. A verification harness is the test infrastructure that supplies inputs, runs both paths of the diamond, and compares outputs. The harness is what makes the diamond repeatable. Without it, you have an idea of verification. With it, you have an executable check.
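
As a rough illustration of the diamond, the sketch below feeds one input to two paths and reports the first output byte that disagrees. Everything here is an assumption made for illustration: the class name, the method name, and especially the lambdas standing in for the two paths. In a real harness the left path would execute the original program rather than a Java function.

```java
import java.util.Arrays;
import java.util.OptionalInt;
import java.util.function.Function;

// Hypothetical names; a real harness would run the original program, not a lambda.
final class DiamondSketch {

    // Feeds the same input to both paths and returns the index of the first
    // differing output byte, or empty when the outputs are byte-identical.
    static OptionalInt firstDivergence(byte[] input,
                                       Function<byte[], byte[]> originalPath,
                                       Function<byte[], byte[]> candidatePath) {
        byte[] expected = originalPath.apply(input.clone());  // left side of the diamond
        byte[] actual = candidatePath.apply(input.clone());   // right side of the diamond
        int index = Arrays.mismatch(expected, actual);        // -1 when the arrays are identical
        return index < 0 ? OptionalInt.empty() : OptionalInt.of(index);
    }

    public static void main(String[] args) {
        // Zoned decimal -123 in EBCDIC: digits 1 and 2, then 3 with a negative overpunch.
        byte[] input = {(byte) 0xF1, (byte) 0xF2, (byte) 0xD3};

        // Stand-in for the original system: passes the field through unchanged.
        Function<byte[], byte[]> original = in -> in;
        // Stand-in for a faithful candidate: also preserves the bytes.
        Function<byte[], byte[]> faithful = in -> Arrays.copyOf(in, in.length);
        // Stand-in for a naive candidate: writes the value as ASCII text instead.
        Function<byte[], byte[]> naive = in -> new byte[] {'-', '1', '2', '3'};

        System.out.println(firstDivergence(input, original, faithful).isPresent());
        System.out.println(firstDivergence(input, original, naive).isPresent());
    }
}
```

The point of the sketch is only the shape: the expected bytes come from running the original path, never from a fixture the candidate's author wrote down.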

The LLM-only replica has no diamond. Its tests create their own ASCII fixtures and assert against strings the LLM picked. The Java is being compared to itself. That is a perfectly serviceable unit-testing pattern in greenfield development. It is not verification of an existing system, because the existing system is not in the loop.

How Imogen addresses the gaps

Imogen is a platform that codifies a workflow Mechanical Orchard ran by hand on engagements for years before automating it. The platform's capabilities map directly to the technical problems raised in previous sections, shifting the verification discipline from human inspection to automated tooling.

  • Deterministic source & dependency mapping answers source discovery. It performs deterministic static analysis of the mainframe tree into entry points, call trees, data flows, and output relationships. This capability follows CALL statements wherever they go, ensuring that assembler dependencies, copybook chains, JCL PROC includes, and DB2 schema references are automatically included. Source discovery becomes a property of the tooling rather than a property of the operator's prompt.
  • Byte-accurate data type encoding answers the numeric encoding problem. This capability provides byte-accurate COBOL-to-Java codecs for EBCDIC, zoned decimal with sign overpunch, COMP-3, REDEFINES, OCCURS, and group-item layouts. The negative balance that the LLM-only replica emitted as -00000102500 is encoded by the codec as the bytes a mainframe consumer expects, every time, by spec. There is no per-program guesswork about how a PIC S9(10)V99 clause becomes bytes on disk.
  • Local mainframe simulation answers the runtime source of truth problem. This process translates COBOL into a developer-testable build, alongside shims for runtime services such as CEE3ABD and adjacent assembler routines such as COBDATFT. The original program runs in a unit-test loop. The diamond's left path is no longer hypothetical; it executes on the same input the right path sees.
  • The characterization SDK answers the missing-diamond problem. It is a JUnit framework that runs the same test against both the COBOL and Java paths and asserts byte-level equivalence using per-field comparison rules. A characterization test written against the SDK is a single class that names the input, the expected datasets, and the comparison schema. The framework runs both paths and reports any byte that disagrees, making the diamond an executable, repeatable artifact.
  • Modern batch framework & data access layer answers the integration and data gravity problems. This provides step processors, dataset wiring, job execution, restart, and observability. Crucially, it includes the ability to access legacy data stores, such as VSAM KSDS, DB2, and IMS, via standardized APIs during phased cutover, ensuring the new Java programs maintain data continuity with the legacy environment.
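
To show what "by spec" can mean for a PIC S9(10)V99 field, here is a small encoder sketch. It is not Imogen's codec. It assumes the field is stored as zoned decimal (USAGE DISPLAY) with a trailing embedded sign, that the zone nibble is 0xD for negative and 0xC for non-negative values, and the class and method names are hypothetical.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical encoder for illustration; not Imogen's codec.
final class ZonedDecimalEncoderSketch {

    // Encodes a value as EBCDIC zoned decimal with the sign overpunched on the last byte.
    // For PIC S9(10)V99, totalDigits is 12 and scale is 2.
    static byte[] encode(BigDecimal value, int totalDigits, int scale) {
        // setScale without a rounding mode throws if digits would be lost.
        BigInteger unscaled = value.setScale(scale).unscaledValue();
        String digits = unscaled.abs().toString();
        if (digits.length() > totalDigits) {
            throw new ArithmeticException("value does not fit in " + totalDigits + " digits");
        }
        byte[] out = new byte[totalDigits];
        int pad = totalDigits - digits.length();
        for (int i = 0; i < totalDigits; i++) {
            int digit = i < pad ? 0 : digits.charAt(i - pad) - '0';
            out[i] = (byte) (0xF0 | digit); // EBCDIC digits are 0xF0 through 0xF9
        }
        // Assumption: zone 0xD marks negative, zone 0xC marks non-negative.
        int signZone = unscaled.signum() < 0 ? 0xD0 : 0xC0;
        int last = totalDigits - 1;
        out[last] = (byte) (signZone | (out[last] & 0x0F));
        return out;
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encode(new BigDecimal("-1025.00"), 12, 2)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        System.out.println(hex); // prints F0F0F0F0F0F0F1F0F2F5F0D0
    }
}
```

Throwing when a value does not fit is a choice made for this sketch; a production codec would need to follow whatever the original program does in that case.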

These components together turn the diamond from a discipline a team has to invent into a framework a team can adopt. Any modernization workflow without an executable test specification, without spec-driven codecs, and without deterministic source discovery inherits the same verification gap on any program where the runtime environment is not threaded into the workflow's context.
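
Per-field comparison rules can be pictured with a sketch like the one below. This is not the characterization SDK's API, which the post does not show; the `Field` layout, the `Rule` enum, and the field names are all hypothetical. The idea is that most fields must match byte for byte, while a field that legitimately varies between runs (a run timestamp, for instance) can be excluded explicitly rather than by loosening the whole comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical types for illustration; not the characterization SDK's API.
final class FieldComparisonSketch {

    enum Rule { EXACT, IGNORE }

    // One field of a fixed-layout record: where it starts, how long it is, how to compare it.
    static final class Field {
        final String name;
        final int offset;
        final int length;
        final Rule rule;

        Field(String name, int offset, int length, Rule rule) {
            this.name = name;
            this.offset = offset;
            this.length = length;
            this.rule = rule;
        }
    }

    // Returns the names of the fields whose bytes disagree between the two records.
    static List<String> disagreements(List<Field> layout, byte[] expected, byte[] actual) {
        List<String> failures = new ArrayList<>();
        if (expected.length != actual.length) {
            failures.add("record length");
            return failures;
        }
        for (Field field : layout) {
            if (field.rule == Rule.IGNORE) {
                continue; // explicitly excluded from the equivalence check
            }
            int end = field.offset + field.length;
            if (!Arrays.equals(expected, field.offset, end, actual, field.offset, end)) {
                failures.add(field.name);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<Field> layout = List.of(
                new Field("account-id", 0, 3, Rule.EXACT),
                new Field("run-date", 3, 2, Rule.IGNORE),
                new Field("balance", 5, 2, Rule.EXACT));
        byte[] fromOriginal = {1, 2, 3, 9, 9, 7, 8};
        byte[] fromCandidate = {1, 2, 3, 0, 0, 7, 0};
        System.out.println(disagreements(layout, fromOriginal, fromCandidate)); // prints [balance]
    }
}
```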

Why workflow is the bet that scales

A skilled operator can produce a good Java replica of one mainframe program with a capable LLM. It will look reasonable to senior reviewers. It will pass tests that the same operator wrote. It will compile and run. But the properties that determine whether it survives cutover sit outside the scope of what the workflow itself delivered: behavioral equivalence against the running original, performance equivalence under representative load, and integration equivalence with the surrounding systems. Instead, these properties have to be added afterward, by hand, for each program, by another skilled operator.

That is the shape of the problem at portfolio scale. Modernization gets stuck at the supply of skilled operators who can hand-coach each workstream. Risk gets stuck because no two replicas were verified the same way. The schedule gets stuck because each program is a one-off.

A platform that makes the right work happen repeatedly looks different. Source discovery is deterministic. Encoding rules live in shared components. The original system runs as an executable runtime source of truth. A characterization harness applies the same diamond verification to every program. The integration and orchestration layer is delivered as a framework, so it does not have to be reinvented for each program.

The output of one program's modernization is a verified replica plus a harness that grows with production data, which means the new code can be safely refactored and improved over time. Move first, then improve.

The real decision in front of a technical leader isn't whether LLMs are useful (after all, they are deeply embedded in the Imogen platform); it is what workflow surrounds them. A workflow whose verification rests on the operator's prompt will produce a sequence of one-off replicas, each as good as the operator coaching it that day. A workflow whose verification rests on the running original system, applied through shared components across every program, will produce a portfolio of verified replicas that an organization can actually cut over to.

This commitment to automated, running verification is the only way to transform modernization from an unmanageable risk into a predictable strategic investment. It is the bet that pays off: it solves the systemic mainframe talent crisis, confidently unlocks 40 years of core business IP, and frees up budget currently absorbed by strong vendor monopolies and the high cost of change to fund genuine business agility. The platform that delivers zero-risk cutover, repeatably and at portfolio scale, is the foundation for long-term market competitiveness.

More information about Imogen can be found at www.mechanical-orchard.com.
