IQL Practical #2 – You’re Testing AI Wrong

Most teams are running their existing playbook against a fundamentally different kind of system. This post in the Intelligent Quality Leadership series covers the five dimensions that separate AI testing strategy from AI testing theatre.

Let me describe a pattern I have seen more than once now, and I suspect you have too.

A product team ships an AI-powered feature. It goes through the normal quality process. There are unit tests around the integration layer. There is some functional testing to check the UI works. Someone runs a few prompts through it manually and confirms the responses look reasonable. Performance gets a cursory check. The feature ships. Everyone moves on.

Three months later, the model gets updated. Outputs subtly change. Nobody notices because nobody is running consistency checks across model versions. A user in a different region gets a response that is culturally inappropriate. Nobody catches it because nobody tested with diverse inputs. The AI starts hallucinating a statistic that sounds authoritative but is completely fabricated. Nobody catches it because nobody asked the model to defend its reasoning. And the cost-per-inference has quietly doubled because the new model is heavier, but nobody is monitoring that either.

The team tested the feature. They did not test the AI.

That distinction matters more than most teams realise. Software testing asks: does this feature function correctly? AI testing asks something fundamentally different: does this system behave correctly, consistently, safely, explainably, and at a cost we can sustain? These are not the same question, and you cannot answer the second one with the tools and approaches designed for the first.

The Gap Between Functional and Behavioural

The reason this gap exists is not laziness or incompetence. It is because most quality engineering practices were designed for deterministic systems. You give a function an input, you expect a specific output, you write an assertion. That mental model has served us well for decades. But AI is probabilistic. The same input can produce different outputs on different runs. The “correct” answer is often a range rather than a point. And the failure modes are not crashes and error codes but subtle degradations in reasoning, tone, accuracy, or fairness that a standard test suite will never catch.

I have spent a lot of time over the past year thinking about what a complete AI testing strategy actually looks like. Not just for the model in isolation, but for AI as it exists inside a real product, used by real people, maintained by a real team. What I keep coming back to is that there are five distinct dimensions you need to cover, and most teams are only addressing one or two of them.

If your AI testing strategy is essentially your existing test strategy with an AI component bolted on, you do not have an AI testing strategy. You have a gap you have not found yet.

Five Dimensions of AI Testing

Each of these dimensions addresses a different failure mode specific to AI systems. Missing any one of them leaves a blind spot that will eventually surface in production. They are not a maturity model or a sequence. They are a coverage checklist. If you are not testing across all five, you are not testing AI. You are testing software that happens to contain AI.

🔄 Reliability

Consistency and stability of outputs over time and across runs

This is the dimension that catches teams out first. In a deterministic system, if a test passed yesterday and you have not changed the code, it will pass today. With AI, that guarantee does not exist. The same input can produce meaningfully different outputs across runs, and “it passed last time” is not evidence of anything.

Reliability testing means running the same inputs across multiple runs and measuring whether the variance stays within acceptable boundaries. It means maintaining golden datasets with expected outputs that you run on every release. It means explicitly testing what happens when the model version changes, because a model update is not a patch. It is a behavioural change that can alter the personality of your product.

The question most teams are not asking: have you defined what “acceptable variance” actually means for your product? If you have not drawn that line, you cannot test against it.

⚡ Performance

Speed, throughput, cost and graceful degradation

Performance testing for AI is familiar territory for most teams, but there is a dimension that traditional performance testing misses entirely: cost. Every AI inference has a direct financial cost, and that cost changes when you change models, adjust prompts, or modify context windows. A model upgrade that improves output quality but doubles your inference cost is not automatically a good trade. You need to know.

Beyond cost, AI performance has characteristics that standard load testing does not capture well. Cold start latency versus warm cache behaviour. Queue dynamics under concurrent requests. And critically, graceful degradation: what does your product do when the model is slow, or down, or returning errors? If the answer is “we have not tested that,” you are one outage away from finding out in production.

🛡️ Governance and Ethics

Safety, fairness, appropriate use and regulatory compliance

I wrote about this at length in the Strategic Disruption post earlier in this series, so I will not repeat the full argument here. But it bears restating in the context of a testing strategy: governance is not a policy document. It is a set of tests you run.

Red teaming with adversarial prompts. Bias evaluation across demographic and linguistic variation. PII and data leakage detection. Guardrail validation to confirm your safety layers actually catch what they should. Appropriate use boundary testing to check the model stays within its intended scope.

This is the dimension where the gap between “we have a policy” and “we have tests” is widest. From the conversations I have had with other quality leaders, most organisations have an AI ethics statement of some kind. Very few have automated red team suites that run on every release. The statement without the tests is a press release, not a safety net.

The question for quality leaders: who owns this testing in your organisation? If the answer is “nobody specifically,” then nobody is doing it.

🔗 Integration

How the AI component connects to and impacts the wider system

Here is something I have learned the hard way: integration failures, not model failures, are the most common source of AI-related production incidents. The model works fine. The API contract changed. The fallback behaviour was never tested. A downstream consumer of the model output broke because the response format shifted slightly between versions.

Integration testing for AI means asking: what happens when the model is removed from the equation entirely? Does your product degrade gracefully, or does it fall over? What happens when the model returns something unexpected, malformed, or empty? Have you tested the full user journey with the AI in the path, not just the AI in isolation?

And then there is the security dimension: prompt injection testing. Can a malicious user manipulate the system prompt through their input? This is the SQL injection of the AI era, and it needs the same rigour.

🔍 Explainability

Whether the system can explain and defend its outputs under challenge

I have already published the TRACE framework separately in this series, so I will point you there rather than re-explain the full model. But I want to make one argument for why explainability deserves its own dimension rather than being folded into reliability or governance.

A reliable model can still be a black box. A governed model can still produce outputs nobody understands. Explainability is the dimension that asks: can this system show its working? Can it defend its reasoning when challenged? Does it know what it does not know? These are not edge cases. For any AI feature that informs a decision, provides advice, or presents information as factual, explainability is the difference between a useful tool and a confident liar.

If you have been in quality engineering long enough, you will recognise the parallel here. Explainability testing is to AI what exploratory testing is to traditional software. Could you automate it? Technically, yes. You could script a set of TRACE prompts and run them on every build. But the moment you do, you lose the cognitive freedom that makes it valuable. The whole point of exploratory testing is that a skilled tester follows their instinct, notices something unexpected, and pulls on the thread. Explainability testing works the same way. You challenge the model, it responds, and your next question depends on what it just said. That adaptive, generative dialogue is where the real findings live, and it cannot be scripted in advance.

TRACE gives you the structure: twist the question, rewrite the scenario, argue back, check the limits, evidence check. Five distinct challenge types, each surfacing a different failure mode. But the structure is a starting point, not a script. The value is in what happens when a thoughtful tester uses it with the freedom to follow their curiosity.

The Automation Spectrum

One of the questions that has come up in the few conversations I have had about this framework is: how much of this can we automate? And this connects directly to what I wrote about in the Cognitive Automation post earlier in this series.

The honest answer is: it depends on the dimension. Reliability and performance testing sit naturally in your CI/CD pipeline. You can automate golden dataset runs, latency benchmarks, cost-per-inference monitoring, and consistency checks across model versions with relatively standard tooling. Integration testing is similarly automatable, including prompt injection test suites.

Governance sits in an interesting middle ground. You can absolutely automate the execution of red team prompts, bias evaluation datasets, and PII detection scans. But the design of those tests still requires human judgment. Deciding what adversarial scenarios to test, what demographic variations matter, what constitutes “appropriate use” for your specific product context. That is craft work. It requires someone who understands your users, your domain, and the specific ways your AI could cause harm.

Explainability is the dimension I would fight hardest to keep human. I drew the parallel to exploratory testing above, and I think it holds all the way through. Just as the best exploratory testers find things that scripted regression never will, the best explainability testing happens when a skilled person has the cognitive freedom to follow their instinct. You can script the initial TRACE prompts, but the follow-up, the intuition about where to dig deeper, the moment where an answer feels a bit too smooth and you decide to push on it. That is where the value lives, and automating it away is like automating your exploratory testing and wondering why you stopped finding interesting bugs.

The automation question is not “can we automate this?” It is “where does automation add intelligence, and where does it just add speed?” Speed without intelligence is how you build confidence theatre at scale.

The Hallucination in Your Toolchain

There is a meta-problem here that I think deserves more attention than it currently gets, and it is this: if you are using AI to help you test AI, you have introduced a second source of non-determinism into your quality process.

This is not a theoretical concern. Teams are already using LLMs to generate test cases, to evaluate model outputs, to write assertions. And in many cases that is genuinely useful. But you need to think carefully about where you trust AI-generated test logic and where you do not. An LLM that generates a test case might hallucinate an expected output. An AI evaluator that assesses “was this response appropriate?” might have its own biases about what appropriate means. A generated assertion might look correct, pass code review, and quietly miss the thing it was supposed to catch.

The hallucination risk does not just live in your product. It can live in your testing toolchain. And if your tests are hallucinating, your confidence metrics are fiction.

This does not mean you should not use AI in your testing. It means you need a clear-eyed view of where AI adds genuine value to your quality process and where it introduces risk you have not accounted for. For me, the line is roughly this: use AI to generate breadth (more test scenarios, more input variations, more edge cases to explore) but keep humans accountable for depth (the reasoning behind what you test, the judgment about what matters, the final call on whether something is actually working).

What This Means for Quality Culture

If you have been following this series, you will recognise that this post is where the threads start to converge. The Quality Culture post argued that quality has to be a whole-team discipline, not a gatekeeping function. The Strategic Disruption post argued that AI governance is a quality leadership responsibility. The Cognitive Automation post argued that AI should augment your thinking, not replace it.

This framework is what all of that looks like in practice.

A team that only covers reliability and performance has a functional testing habit. They are treating AI like any other software component, and they are missing the dimensions that make AI genuinely different. A team that covers all five dimensions has started to build a quality culture around AI. They are asking not just “does it work?” but “does it behave responsibly, consistently, and explicably?” That second question is the one that matters.

And here is the leadership challenge: most of these dimensions do not have obvious owners in a typical engineering organisation. Reliability sits with QE. Performance sits with platform or SRE. Governance sits… somewhere. Ethics might be legal, might be product, might be nobody. Explainability is usually not even on the radar. The quality leader who maps these dimensions to their organisation, identifies the gaps, and builds a strategy that covers all five is the one who is actually leading in an AI world. Everyone else is just testing software that happens to contain AI.

The Uncomfortable Question

Take the five dimensions above and honestly assess your current AI testing coverage. How many are you genuinely addressing? Not aspirationally. Not “we should be doing that.” Actually, right now, in your current test strategy. If the answer is fewer than three, your AI testing strategy has gaps that will eventually surface in production. The only question is when.

I would love to hear where your teams are on this. Which dimensions are you covering well? Which ones are you struggling with? And for the quality leaders reading this: have you had the conversation with your organisation about who owns governance and explainability testing? Or is it still sitting in the gap between teams, waiting for someone to claim it?

That gap is the opportunity. And I think quality leaders are the right people to fill it.

IQL Practical #2 – You’re Testing AI Wrong

The Gap Between Functional and Behavioural

Five Dimensions of AI Testing

🔄 Reliability

⚡ Performance

🛡️ Governance and Ethics

🔗 Integration

🔍 Explainability

The Automation Spectrum

The Hallucination in Your Toolchain

What This Means for Quality Culture

The Uncomfortable Question

Published by siprior

Leave a comment Cancel reply

IQL Practical #2 – You’re Testing AI Wrong

The Gap Between Functional and Behavioural

Five Dimensions of AI Testing

🔄 Reliability

⚡ Performance

🛡️ Governance and Ethics

🔗 Integration

🔍 Explainability

The Automation Spectrum

The Hallucination in Your Toolchain

What This Means for Quality Culture

The Uncomfortable Question

Tell Someone

Related

Published by siprior

Leave a comment Cancel reply