Gary Brach
4/23/2026
5 min read

You Reached for AI. Here's How to Make It Work for the Enterprise

When mainframe organizations started feeling the weight of lost knowledge about their applications, the instinct to reach for AI was not just reasonable — it was correct.

Large language models are genuinely remarkable. They can read code, explain logic in plain language, answer questions conversationally, and synthesize complex information faster than any human. For organizations sitting on millions of lines of COBOL code and a shrinking pool of people who understand it, the promise was obvious: finally, a tool that could navigate the morass.

Everyone moved fast. That was the right call. The tools were new, the potential was real, and the only way to learn was to deploy.

The Hype Cycle Hits Reality

The hype cycle for enterprise AI is following a familiar arc. Early enthusiasm. Rapid deployment. Then, as the tools meet the reality of production systems at scale, the limitations come into focus. This isn't a failure of AI. It's a normal and necessary part of how organizations learn to apply new technology intelligently.

The question now is not whether to use AI. It is where AI earns its keep in real modernization programs.

When Anthropic published its blog post on using Claude Code for COBOL understanding, it generated significant conversation in the legacy modernization community. What followed was equally significant — independent testing began asking harder questions. What happens when you run the same analysis twice? How consistent are the results?

The answer was unsettling. *Testing on real COBOL programs found that running identical analyses produced results that varied by up to 42% between runs.* The same code. The same model. Materially different outputs.
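As a rough illustration (not drawn from the original testing), run-to-run drift can be quantified with a simple text-similarity measure. The sample outputs below are invented placeholders standing in for two analyses of the same program:

```python
# Hypothetical sketch: quantifying drift between two LLM analyses of
# the same COBOL program. The two "runs" below are invented examples,
# not real model output.
from difflib import SequenceMatcher

run_a = (
    "PAY-CALC computes gross pay from hours and rate, "
    "then applies the overtime multiplier for hours over 40."
)
run_b = (
    "PAY-CALC derives gross wages from rate and hours worked; "
    "overtime premium logic appears unused."
)

def divergence(a: str, b: str) -> float:
    """Fraction of the text that differs between two runs (0.0 = identical)."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

print(f"divergence between runs: {divergence(run_a, run_b):.0%}")
```

A stable tool would score near zero on identical inputs; the testing described above is the equivalent of this number coming back materially nonzero, run after run.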

This is not an edge case. Academic research comparing LLM performance on synthetic benchmarks versus real-world production code found a gap of 53 to 62 percentage points — models that score 84-89% on controlled benchmarks achieve only 25-34% accuracy on actual enterprise codebases. The complexity, proprietary conventions, and evolved logic of real legacy systems expose a structural limitation that better prompting, more documentation, and RAG pipelines cannot close.

And yet organizations deploy these tools anyway — because the alternative, doing nothing, feels worse. So they build compensating architectures. Retrieval-augmented generation (RAG) pipelines to feed the model better context. Prompt engineering frameworks to constrain its responses. Human verification workflows to catch errors before they propagate into consequential decisions.

The Hallucination Prevention Tax

This is the hallucination prevention tax. And here is the question it raises: how do you build a quality assurance process on top of a tool whose output changes materially between identical runs? You can't test quality into a result that isn't stable enough to test.

The tax doesn't shrink as your AI program matures. It grows as your ambitions do. Every new use case, every new stakeholder, every new question you want answered reliably requires more engineering to manage the gap between what the model confidently says and what you can actually prove.

There is a final irony worth noting. Rigorous studies of experienced developers using AI tools on complex legacy tasks found a 19% decrease in productivity — not an increase. The modest time saved generating outputs was entirely consumed by the time spent prompting, waiting, reviewing, debugging, and discarding results that couldn't be trusted. You are not paying once to solve the problem. You are paying indefinitely to manage it.

Where AI Actually Works

None of this means the instinct was wrong. LLMs are genuinely powerful — they have a real and valuable role to play in solving the knowledge access problem. But that role is specific. When they are asked to do something they were not built for — to provide deterministic, provable answers about deterministic systems — they are not reliable.

At Phase Change, we have developed COBOL Colleague, which builds a deterministic, causal understanding of the business functionality of COBOL applications (context as a platform). COBOL Colleague provides a layer of truth and context, enabling LLMs to reason over causal meaning rather than raw code, just as a true domain expert would. The results are startling (and reliable).

If you would like to know more about COBOL Colleague, explore our Strategy articles, review our Technology approach, or contact us.
