I tried plugging AI into our automation workflow to generate failure analysis reports. It seemed straightforward. It wasn't — and the problems ran deeper than I expected. Here's what broke, and here's what I actually did about it.
Coming from an automation background, my expectation was simple: give the AI the raw data, describe the format I want, and get a consistent, ready-to-use report every time. That expectation did not survive contact with reality.
CHALLENGE 01
Output format is never guaranteed
My naive approach: feed the AI the content, write a detailed format prompt, and expect a clean, consistent report. What I got was a different version of the same report on every single run: same data, different structure. For automation, inconsistency is a dealbreaker. You can't build a pipeline on output you can't predict.
CHALLENGE 02
Hallucination on technical content
One test case failed because an element was no longer present on the page. Its XPath looked like this: contains(text(), 'deep'). The AI latched onto the ,' inside the expression and flagged it as a syntax error, as though the comma were a stray semicolon. It confidently reported an invalid XPath as the root cause, when in reality the element had simply disappeared from the UI. That's a subtle but significant misread, and I had no way of knowing how many similar ones had slipped through undetected.
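To see the misread concretely, here's a hedged sketch of what actually happened, assuming a Selenium-based test and a placeholder page (your stack may differ). The XPath the model flagged is perfectly valid syntax:

```python
# Hypothetical reproduction. The comma the model flagged is standard
# XPath syntax; the expression itself is fine.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

try:
    driver.find_element(By.XPATH, "//*[contains(text(), 'deep')]")
except NoSuchElementException:
    # The real root cause: the element was removed from the UI,
    # not a broken XPath expression.
    print("Element not found: the UI changed, the XPath did not.")
finally:
    driver.quit()
```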
CHALLENGE 03
Token cost is a real constraint nobody talks about
While everyone pitches AI adoption, very few people ask: how many tokens are we actually burning? Feeding the AI the entire dataset to produce a report might give richer output — but at what cost? Not every step of a pipeline needs to go through a model. If a piece of logic can be handled with code, running it through AI is waste, not progress.
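A quick back-of-envelope check makes the point. This is a minimal sketch, assuming a hypothetical test_results.json file and the rough ~4 characters/token heuristic; a real tokenizer would give exact counts:

```python
import json

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4

with open("test_results.json") as f:  # hypothetical results file
    payload = json.dumps(json.load(f))

# This cost is paid on every pipeline run, every retry, every re-prompt.
print(f"~{estimate_tokens(payload)} tokens per run just for the input")
```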
THE QUESTIONS THIS RAISED
How much should AI actually own?
Full report generation, or just a structured payload that our own template consumes? The naive path is to let AI do everything — but that's exactly where inconsistency and runaway token usage creep in.
Is a detailed prompt ever enough?
Even with instructions, constraints, data, and a format spec, the AI still misses a rule sometimes. At what point do we stop tuning the prompt and go back to code? A prompt simply cannot provide a guarantee.
The more I constrained the prompt, the more I was essentially writing code — but in natural language, with no compiler to catch mistakes and no guarantee of consistent output.
THE BOAT ANALOGY
Here's the best way I can describe where we are right now with AI in engineering workflows:
We're trying to cross a river in a massive, expensive ship — one we're still on a trial license for. Meanwhile, we have a perfectly good small boat of our own. The question isn't whether the ship is impressive. The question is: do we actually need it for this crossing? And more importantly — could we use what we learn from the ship to make our own boat better, rather than becoming dependent on it entirely?
Token cost is the fare for that ship. Every large context window, every full dataset you pass in — that's money and latency you're spending. When AI is genuinely doing something your code can't, the fare is worth it. When AI is just formatting data a template could handle in milliseconds, you're paying ship prices for a rowboat job.
WHAT I ACTUALLY BUILT
I stopped asking AI to do the whole job. Instead, I took back control — deliberately and specifically.
First, I narrowed the input. Instead of passing the entire test run to the model, I filtered down to only the failed test cases. That single change significantly reduced token consumption and, crucially, reduced the surface area for hallucination. Less noise in means fewer wrong inferences out.
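In code, the narrowing step is small. A minimal sketch, assuming each result is a dict with status/name/error-style fields (the exact names will vary by framework):

```python
def failed_cases(results: list[dict]) -> list[dict]:
    # Only failures are worth the model's attention (and the tokens).
    return [t for t in results if t.get("status") == "FAILED"]

def to_model_payload(test: dict) -> dict:
    # Trim each case to what root-cause analysis actually needs.
    return {
        "name": test.get("name"),
        "error": test.get("error"),
        "failedStep": test.get("failedStep"),
        "stackTrace": test.get("stackTrace"),
    }
```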
Second, I stopped asking AI to format the report. Instead, I asked it to return a structured JSON array — one object per failed test case — with a fixed schema:
| FIELD | WHAT AI FILLS IN |
| --- | --- |
| testCaseName | The name of the failed test case |
| failures | What specifically failed: error, assertion, element, step |
| rootCauseAnalysis | The AI's interpretation of why it failed. NA if not determinable. |
| fixes | Suggested code or config changes to resolve the failure. NA if not applicable. |
| suggestions | Broader improvements: test coverage, stability, edge cases. NA if none. |
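For illustration, a single entry in that array might look like this (all values invented for the example):

```json
[
  {
    "testCaseName": "loginWithExpiredSession",
    "failures": "NoSuchElementException on step 4: logout button not found",
    "rootCauseAnalysis": "The logout control was removed from the header in the latest UI build; the locator targets an element that no longer exists.",
    "fixes": "Update the locator to the new menu structure, or add a stable test id to the control.",
    "suggestions": "Add a smoke check that verifies critical header elements after each UI deploy."
  }
]
```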
The report structure — how it looks, how it's laid out, what gets highlighted — I designed that myself. The AI fills the fields. My code owns everything else.
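As a sketch of that split, here is one way code can own the layout. The HTML template is hypothetical; yours might be Markdown, HTML, or a dashboard:

```python
import json
from string import Template

# The layout lives in code, not in the prompt. Change it freely;
# the model's contract (the JSON schema) never changes.
ROW = Template(
    "<tr><td>$testCaseName</td><td>$failures</td>"
    "<td>$rootCauseAnalysis</td><td>$fixes</td><td>$suggestions</td></tr>"
)

def render_report(ai_json: str) -> str:
    rows = [ROW.substitute(entry) for entry in json.loads(ai_json)]
    return "<table>" + "".join(rows) + "</table>"
```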
WHY THIS WORKS
Consistency fixed
A fixed JSON schema means the output is predictable every run. No more formatting surprises, and the schema is cheap to enforce in code (see the sketch below).
Hallucination reduced
Narrower input means fewer opportunities for the model to make confident wrong inferences.
Token cost controlled
Only failed cases go to the model. Everything else stays in code where it belongs.
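Here is the enforcement sketch mentioned above. It's a minimal check, assuming the five fields from the schema table; a library such as jsonschema or pydantic would do the same job more thoroughly:

```python
import json

REQUIRED = {"testCaseName", "failures", "rootCauseAnalysis", "fixes", "suggestions"}

def parse_model_output(raw: str) -> list[dict]:
    data = json.loads(raw)  # raises ValueError on non-JSON output
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of failure objects")
    for i, entry in enumerate(data):
        missing = REQUIRED - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing fields: {sorted(missing)}")
    return data
```

If validation fails, the pipeline can re-prompt or fall back instead of shipping a malformed report.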
THE PATTERN THIS FOLLOWS
In AI engineering this is sometimes called "AI as extractor, not presenter." The model does the hard cognitive work — root cause analysis, pattern recognition, generating fix suggestions — and returns raw structured data. Your system does everything else. It's the right division of labor.
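Put together, the extractor pattern is only a few lines of orchestration. A minimal sketch, assuming a generic call_model callable as a stand-in for whichever LLM client you use, plus the helpers from the earlier sketches:

```python
import json
from typing import Callable

PROMPT = """Analyze these failed automated test cases.
Return ONLY a JSON array, one object per test case, with exactly these keys:
testCaseName, failures, rootCauseAnalysis, fixes, suggestions.
Use "NA" where a field is not determinable.

Failed cases:
{cases}
"""

def analyze_failures(results: list[dict],
                     call_model: Callable[[str], str]) -> list[dict]:
    # Narrow the input, call the model, validate the contract.
    payload = [to_model_payload(t) for t in failed_cases(results)]
    raw = call_model(PROMPT.format(cases=json.dumps(payload, indent=2)))
    return parse_model_output(raw)  # fixed schema in, fixed schema out
```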
TAKEAWAY
Don't hand the whole job to the model. Identify exactly where AI earns its place — the reasoning, the interpretation, the analysis — and own everything around it yourself. Design the schema. Own the report. Control the tokens. Use the ship only for what the small boat genuinely can't do.