I tried plugging AI into our automation workflow to generate failure analysis reports. It seemed straightforward. It wasn't — and the problems ran deeper than I expected. Here's what broke, and here's what I actually did about it.
Coming from an automation background, my expectation was simple: give the AI the raw data, describe the format I want, and get a consistent, ready-to-use report every time. That expectation did not survive contact with reality.
CHALLENGE 01
Output format is never guaranteed
My naive approach: feed the AI the content, write a detailed format prompt, and expect a clean, consistent report. What I got was a different version of the same report on every single run: same data, different structure. For automation, inconsistency is a dealbreaker. You can't build a pipeline on output you can't predict.
CHALLENGE 02
Hallucination on technical content
One test case failed because an element was no longer present on the page. Its XPath looked like this: contains(text(), 'deep'). The AI latched onto the ,' inside the expression and flagged it as a syntax error, as though the comma were a stray semicolon. It confidently reported an invalid XPath as the root cause, when in reality the element had simply disappeared from the UI. That's a subtle but significant misread, and I had no way of knowing how many similar ones had slipped through undetected.
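To see the misread concretely, here's a hedged sketch of what actually happened, assuming a Selenium-based test and a placeholder page (your stack may differ). The XPath the model flagged is perfectly valid syntax:

```python
# Hypothetical reproduction. The comma the model flagged is standard
# XPath syntax; the expression itself is fine.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

try:
    driver.find_element(By.XPATH, "//*[contains(text(), 'deep')]")
except NoSuchElementException:
    # The real root cause: the element was removed from the UI,
    # not a broken XPath expression.
    print("Element not found: the UI changed, the XPath did not.")
finally:
    driver.quit()
```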
CHALLENGE 03
Token cost is a real constraint nobody talks about
While everyone pitches AI adoption, very few people ask: how many tokens are we actually burning? Feeding the AI the entire dataset to produce a report might give richer output — but at what cost? Not every step of a pipeline needs to go through a model. If a piece of logic can be handled with code, running it through AI is waste, not progress.
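A quick back-of-envelope check makes the point. This is a minimal sketch, assuming a hypothetical test_results.json file and the rough ~4 characters/token heuristic; a real tokenizer would give exact counts:

```python
import json

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4

with open("test_results.json") as f:  # hypothetical results file
    payload = json.dumps(json.load(f))

# This cost is paid on every pipeline run, every retry, every re-prompt.
print(f"~{estimate_tokens(payload)} tokens per run just for the input")
```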
THE QUESTIONS THIS RAISED
How much should AI actually own?
Full report generation, or just a structured payload that our own template consumes? The naive path is to let AI do everything — but that's exactly where inconsistency and runaway token usage creep in.
Is a detailed prompt ever enough?
Even with instructions, constraints, data, and a format spec, the AI still misses a rule sometimes. At what point do we stop tuning the prompt and go back to code? A prompt simply cannot provide a guarantee.
The more I constrained the prompt, the more I was essentially writing code — but in natural language, with no compiler to catch mistakes and no guarantee of consistent output.
THE BOAT ANALOGY
Here's the best way I can describe where we are right now with AI in engineering workflows:
We're trying to cross a river in a massive, expensive ship — one we're still on a trial license for. Meanwhile, we have a perfectly good small boat of our own. The question isn't whether the ship is impressive. The question is: do we actually need it for this crossing? And more importantly — could we use what we learn from the ship to make our own boat better, rather than becoming dependent on it entirely?
Token cost is the fare for that ship. Every large context window, every full dataset you pass in — that's money and latency you're spending. When AI is genuinely doing something your code can't, the fare is worth it. When AI is just formatting data a template could handle in milliseconds, you're paying ship prices for a rowboat job.
WHAT I ACTUALLY BUILT
I stopped asking AI to do the whole job. Instead, I took back control — deliberately and specifically.
First, I narrowed the input. Instead of passing the entire test run to the model, I filtered down to only the failed test cases. That single change significantly reduced token consumption and, crucially, reduced the surface area for hallucination. Less noise in means fewer wrong inferences out.
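In code, the narrowing step is small. A minimal sketch, assuming each result is a dict with status/name/error-style fields (the exact names will vary by framework):

```python
def failed_cases(results: list[dict]) -> list[dict]:
    # Only failures are worth the model's attention (and the tokens).
    return [t for t in results if t.get("status") == "FAILED"]

def to_model_payload(test: dict) -> dict:
    # Trim each case to what root-cause analysis actually needs.
    return {
        "name": test.get("name"),
        "error": test.get("error"),
        "failedStep": test.get("failedStep"),
        "stackTrace": test.get("stackTrace"),
    }
```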
Second, I stopped asking AI to format the report. Instead, I asked it to return a structured JSON array — one object per failed test case — with a fixed schema:
| FIELD | WHAT AI FILLS IN |
| --- | --- |
| testCaseName | The name of the failed test case |
| failures | What specifically failed: error, assertion, element, step |
| rootCauseAnalysis | The AI's interpretation of why it failed. NA if not determinable. |
| fixes | Suggested code or config changes to resolve the failure. NA if not applicable. |
| suggestions | Broader improvements: test coverage, stability, edge cases. NA if none. |
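For illustration, a single entry in that array might look like this (all values invented for the example):

```json
[
  {
    "testCaseName": "loginWithExpiredSession",
    "failures": "NoSuchElementException on step 4: logout button not found",
    "rootCauseAnalysis": "The logout control was removed from the header in the latest UI build; the locator targets an element that no longer exists.",
    "fixes": "Update the locator to the new menu structure, or add a stable test id to the control.",
    "suggestions": "Add a smoke check that verifies critical header elements after each UI deploy."
  }
]
```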
The report structure — how it looks, how it's laid out, what gets highlighted — I designed that myself. The AI fills the fields. My code owns everything else.
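As a sketch of that split, here is one way code can own the layout. The HTML template is hypothetical; yours might be Markdown, HTML, or a dashboard:

```python
import json
from string import Template

# The layout lives in code, not in the prompt. Change it freely;
# the model's contract (the JSON schema) never changes.
ROW = Template(
    "<tr><td>$testCaseName</td><td>$failures</td>"
    "<td>$rootCauseAnalysis</td><td>$fixes</td><td>$suggestions</td></tr>"
)

def render_report(ai_json: str) -> str:
    rows = [ROW.substitute(entry) for entry in json.loads(ai_json)]
    return "<table>" + "".join(rows) + "</table>"
```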
WHY THIS WORKS
Consistency fixed
A fixed JSON schema means the output is predictable every run. No more formatting surprises, and the schema is cheap to enforce in code (see the sketch below).
Hallucination reduced
Narrower input means fewer opportunities for the model to make confident wrong inferences.
Token cost controlled
Only failed cases go to the model. Everything else stays in code where it belongs.
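Here is the enforcement sketch mentioned above. It's a minimal check, assuming the five fields from the schema table; a library such as jsonschema or pydantic would do the same job more thoroughly:

```python
import json

REQUIRED = {"testCaseName", "failures", "rootCauseAnalysis", "fixes", "suggestions"}

def parse_model_output(raw: str) -> list[dict]:
    data = json.loads(raw)  # raises ValueError on non-JSON output
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of failure objects")
    for i, entry in enumerate(data):
        missing = REQUIRED - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing fields: {sorted(missing)}")
    return data
```

If validation fails, the pipeline can re-prompt or fall back instead of shipping a malformed report.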
THE PATTERN THIS FOLLOWS
In AI engineering this is sometimes called "AI as extractor, not presenter." The model does the hard cognitive work — root cause analysis, pattern recognition, generating fix suggestions — and returns raw structured data. Your system does everything else. It's the right division of labor.
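Put together, the extractor pattern is only a few lines of orchestration. A minimal sketch, assuming a generic call_model callable as a stand-in for whichever LLM client you use, plus the helpers from the earlier sketches:

```python
import json
from typing import Callable

PROMPT = """Analyze these failed automated test cases.
Return ONLY a JSON array, one object per test case, with exactly these keys:
testCaseName, failures, rootCauseAnalysis, fixes, suggestions.
Use "NA" where a field is not determinable.

Failed cases:
{cases}
"""

def analyze_failures(results: list[dict],
                     call_model: Callable[[str], str]) -> list[dict]:
    # Narrow the input, call the model, validate the contract.
    payload = [to_model_payload(t) for t in failed_cases(results)]
    raw = call_model(PROMPT.format(cases=json.dumps(payload, indent=2)))
    return parse_model_output(raw)  # fixed schema in, fixed schema out
```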
TAKEAWAY
Don't hand the whole job to the model. Identify exactly where AI earns its place — the reasoning, the interpretation, the analysis — and own everything around it yourself. Design the schema. Own the report. Control the tokens. Use the ship only for what the small boat genuinely can't do.