Not the policy. Not the framework deck. The actual Tuesday morning when something matters.
This is part of the Practical track within the Intelligent Quality Leadership series. It builds directly on AI Governance and Guardrails, which made the case for why quality leaders need to be in this conversation. This post is about what that actually looks like when you get there.
There’s a version of AI governance that lives in slide decks. It has a risk matrix, a set of principles, maybe a responsible AI statement on the company website. It gets presented to the board once a quarter. Everyone nods.
And then a product team ships a new AI feature and nobody from quality was involved until two days before release.
That gap between the governance that exists on paper and the governance that actually operates in practice is where most of the real risk lives. And it’s the gap that quality leaders are uniquely placed to close, if we’re willing to do the unglamorous work of embedding ourselves in the moments that matter rather than just advocating for a seat at the table in the abstract.
So let’s follow that through. Not as a framework. As a story.
Tuesday morning, ten weeks before release
The product manager has sent a calendar invite titled “AI feature scoping kickoff.” You’re in it. That’s already a win, because six months ago you probably wouldn’t have been.
The feature in question is a contextual recommendation engine. It will surface suggestions to users based on their behaviour patterns. The engineering lead has a working prototype. The PM has a launch date. Everyone is excited.
You are the person in this room who is going to ask the questions nobody has thought to ask yet.
This is the first governance moment, and it happens before a single test is written. The quality leader who understands AI governance doesn’t wait for a test plan to think about risk. They start asking qualitative questions at the point where the answers can still change the design.
What data is this recommendation engine trained on? How was that data collected, and do users understand that their behaviour is being used this way? What happens at the edges: the new user with no history, the power user whose behaviour is atypical, the person who has been using the product through a difficult period in their life and whose behaviour doesn’t represent their actual preferences?
None of these are testing questions in the traditional sense. They’re governance questions. And asking them in week one is fundamentally different from discovering the answers in week nine.
Principle 01: Governance starts at discovery, not at sign-off
The most valuable governance work happens before the architecture is locked. Getting into scoping conversations, design reviews, and early technical discussions is how quality moves from reactive to genuinely protective. If you’re only reviewing AI features before release, you’re already too late to fix the things that matter most.
Three weeks in: the first uncomfortable answer
One of your engineers has been exploring the training data. She’s noticed something. The recommendations skew heavily toward behaviours from a specific demographic segment. Not because anyone designed it that way. Because that segment generated the most data, and the model learned accordingly.
This isn’t a bug, though. The system is working exactly as intended. That’s the problem.
You have a choice about how to raise this. You could write it up as a defect and put it in the backlog, where it will be triaged, deprioritised, and eventually closed as “by design.” Or you could walk across to the product manager’s desk and have a conversation that probably makes both of you uncomfortable.
You walk across to the desk.
This is the moment that separates governance that works from governance that gets filed away. Good governance requires people who are willing to name things clearly, bring evidence, and hold the conversation even when the timelines are tight and the room would rather move on.
The principle of fairness isn’t abstract here. It’s a concrete question about which users the product serves well and which users it quietly deprioritises. Quality leaders who understand AI governance can translate that principle into specific, testable scenarios. They can show, not just assert, that there’s a problem worth solving.
The difference between raising a concern and raising a credible concern is evidence. Governance conversations that stick are the ones that come with data, scenarios, and a clear articulation of what’s at stake for real users.
Principle 02: Fairness is a testable property, not a value statement
One of the most practical contributions quality leaders can make to AI governance is translating ethical principles into test conditions. “This should be fair” is a stance. “Here are five user profiles across different demographic segments, and here is what the recommendations look like for each of them” is evidence. Build the latter. It’s what makes the conversation impossible to dismiss.
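To make that concrete, here’s a minimal sketch of what “build the latter” might look like. Everything in it is illustrative: the synthetic profiles, the recommend() stub standing in for the real engine, and the relevance measure are placeholders for your actual system under test and whatever signal you treat as “served well”.

```python
# A minimal, illustrative sketch of fairness-as-evidence. The profiles,
# the recommend() stub, and the relevance measure are all placeholders
# for the real system under test and the real quality signal.
from collections import defaultdict
from statistics import mean

# Synthetic profiles, one or more per segment you care about.
PROFILES = [
    {"id": "u1", "segment": "segment_a", "history_len": 240},
    {"id": "u2", "segment": "segment_a", "history_len": 180},
    {"id": "u3", "segment": "segment_b", "history_len": 12},
    {"id": "u4", "segment": "segment_b", "history_len": 3},
    {"id": "u5", "segment": "new_user", "history_len": 0},
]

def recommend(profile):
    """Stand-in for the recommendation engine under test."""
    # Pretend recommendation quality tracks how much history the model has,
    # which is roughly the skew the engineer in the story noticed.
    return min(profile["history_len"] / 100, 1.0)

def relevance(profile, recommendation_quality):
    """Stand-in for whatever you treat as 'served well' (clicks, ratings...)."""
    return recommendation_quality

scores = defaultdict(list)
for p in PROFILES:
    scores[p["segment"]].append(relevance(p, recommend(p)))

per_segment = {seg: round(mean(vals), 2) for seg, vals in scores.items()}
gap = max(per_segment.values()) - min(per_segment.values())

print("relevance by segment:", per_segment)
print("best-to-worst gap:", round(gap, 2))  # the number you take to the PM
```

The output is the artefact the story above turns on: a per-segment comparison and a single gap figure that makes “this should be fair” into something the room can’t wave away.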
Six weeks in: the guardrail you didn’t build
The fairness issue has been addressed, partially. A decision has been made to weight the training data differently. It’s not a perfect solution but it’s a meaningful improvement, and the team has committed to a post-launch audit.
Now a different question has surfaced. Someone in an exploratory session has discovered that the recommendation engine can be nudged. If you interact with the product in a specific, deliberate way, the recommendations shift in a direction that looks less like “what’s useful for this user” and more like “what someone wanted this user to see.”
This wasn’t a designed capability. It emerged from the model’s behaviour. And nobody had thought to test for it because nobody had written down that it was a risk.
You’re looking at a guardrail that wasn’t built because it wasn’t in the requirements.
This is one of the defining challenges of governing AI systems: the risks you need to account for often aren’t in the spec, because the system’s emergent behaviour wasn’t predictable from the spec. Traditional quality approaches, which work from requirements and acceptance criteria, aren’t sufficient on their own. You need an additional layer of adversarial and exploratory thinking.
This is precisely where the TRACE framework earns its place. I’ve written about it separately as a practical tool for testing AI explainability and emergent behaviour, and it’s worth reading alongside this post if you’re building out your adversarial thinking.
In organisations where governance is working well, this kind of discovery doesn’t happen by accident. It happens because someone has a standing brief to think about what the system could do that nobody intended. That role might sit in the QE team, or in a dedicated red team function, or it might be distributed across disciplines. What matters is that it’s explicitly owned and resourced rather than hoped for.
What good governance produces at this stage:
- An adversarial test suite that explicitly targets emergent and unintended behaviours, not just specified requirements. The five dimensions from the AI testing strategy post give you a practical structure for building this.
- A documented threat model for the AI system: who might misuse it, how, and what the downstream effects look like.
- A set of defined guardrails with clear owners: what the system must never do, what triggers a human review, what constitutes an automatic block.
- A lightweight incident classification framework so that when something unexpected happens post-launch, the team knows how to triage it (both this and the guardrails are sketched below).
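To illustrate the last two items, here’s one lightweight way guardrails and incident triage might be written down so they’re owned and checkable rather than implicit. The guardrail names, actions, and owners are placeholders, not a prescribed taxonomy.

```python
# An illustrative sketch of guardrails as an explicit, owned artefact.
# The names, actions, and owners below are placeholders.
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    BLOCK = "automatic block"          # the system must never do this
    HUMAN_REVIEW = "route to a human"  # allowed, but only with oversight
    LOG_AND_MONITOR = "log and watch"  # acceptable, tracked post-launch

@dataclass
class Guardrail:
    name: str
    description: str
    action: Action
    owner: str  # a named person or team, not "everyone"

GUARDRAILS = [
    Guardrail(
        name="manipulated_recommendations",
        description="Recommendations shaped by deliberate nudging of the model",
        action=Action.BLOCK,
        owner="quality-engineering",
    ),
    Guardrail(
        name="fairness_drift",
        description="Relevance gap between segments exceeds the agreed threshold",
        action=Action.HUMAN_REVIEW,
        owner="product",
    ),
    Guardrail(
        name="unexpected_usage_pattern",
        description="Interaction patterns outside the documented threat model",
        action=Action.LOG_AND_MONITOR,
        owner="data-science",
    ),
]

def triage(signal_name: str) -> str:
    """Lightweight incident classification: map a post-launch signal to the
    guardrail that owns it, so nobody invents a process mid-incident."""
    for g in GUARDRAILS:
        if g.name == signal_name:
            return f"{g.action.value}; escalate to {g.owner}"
    return "unclassified; escalate to the incident review group"

print(triage("fairness_drift"))  # -> "route to a human; escalate to product"
```

The value isn’t the code. It’s that every guardrail has a name, an action, and an owner that someone had to agree to before launch.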
Principle 03: Guardrails need to be designed, not discovered
The most important guardrails for an AI system are rarely in the original requirements. They come from dedicated adversarial thinking: what could go wrong that we haven’t planned for? Quality leaders who embed this question into their process, rather than waiting to discover the answer in production, are doing governance work that genuinely reduces risk. The goal is to find the gaps before your users do.
Two weeks before release: the conversation about what “ready” means
The PM wants to ship. The engineering lead thinks the feature is solid. The commercial team has a campaign lined up. There is momentum, and momentum in tech companies has a weight to it that can be difficult to push against.
You’ve got two things that aren’t resolved to your satisfaction. The post-launch fairness audit has no owner and no date. And the adversarial testing turned up a low-probability, high-impact scenario that the team has agreed is “acceptable risk” without, as far as you can tell, really reckoning with what acceptable means.
You’re not trying to block the release. You’re trying to make sure the people making the decision understand what they’re deciding.
This is the hardest part of practical governance, and the part that relies most heavily on the quality leader’s credibility and relationships rather than any process or framework.
The goal in this conversation isn’t to be the person who says no. It’s to be the person who makes the risk visible and specific enough that the decision-makers are genuinely informed. “We’ve identified a scenario where the recommendations could be deliberately manipulated. Here’s what it looks like, here’s how likely we think it is, here’s what the user impact would be, and here’s what we’d need to do to mitigate it. Given that, do you still want to ship on this date?”
Sometimes the answer is yes. Sometimes the risk really is acceptable in context. But having the conversation explicitly, with evidence, and with the outcome documented, is fundamentally different from the risk just being carried silently because nobody raised it clearly enough.
The quality leader’s job isn’t always to prevent the risk. It’s to make sure the organisation is choosing its risks consciously rather than stumbling into them accidentally.
Principle 04: Make the risk visible enough to be chosen, not just carried
AI systems carry risks that are genuinely difficult to see until someone with the right framing looks for them. Quality leaders who do that work have an obligation to surface what they find clearly and specifically enough that decision-makers can engage with it properly. This isn’t about saying no. It’s about making sure that when the organisation accepts a risk, it does so with eyes open. Document what was raised, what was decided, and why. That record matters.
After launch: governance doesn’t stop at release
The feature ships. Users engage with it. Some things work better than expected. A couple of things don’t.
In an organisation where governance is working, the quality team isn’t finished at this point. They’re monitoring. They have defined signals that would trigger a review: response patterns that suggest unexpected use, complaint volumes above a threshold, specific categories of feedback that flag against the risks that were documented pre-launch.
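As a rough illustration, those signals can live as explicit, reviewable rules rather than in someone’s head. The signal names, thresholds, and the snapshot numbers below are made up; in practice they’d come from your analytics pipeline and the risks documented before launch.

```python
# An illustrative sketch of post-launch review triggers. The signal names,
# thresholds, and the example snapshot are made-up placeholders.

REVIEW_TRIGGERS = {
    "complaint_rate_per_1k_sessions": 2.0,   # above this, open a review
    "segment_relevance_gap": 0.15,           # fairness drift vs. the pre-launch audit
    "anomalous_interaction_share": 0.05,     # patterns matching the threat model
}

def signals_needing_review(metrics: dict) -> list[str]:
    """Return the documented signals that have crossed their thresholds."""
    return [
        name for name, threshold in REVIEW_TRIGGERS.items()
        if metrics.get(name, 0.0) > threshold
    ]

# Example snapshot a few weeks after launch (made-up numbers).
snapshot = {
    "complaint_rate_per_1k_sessions": 1.4,
    "segment_relevance_gap": 0.22,
    "anomalous_interaction_share": 0.01,
}

for signal in signals_needing_review(snapshot):
    print("review triggered:", signal)
```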
The fairness audit happens. An owner has been assigned (you made sure of that before the release conversation ended). The results go back to the product team with a recommendation, not just data.
The adversarial scenario that was accepted as manageable risk gets reviewed at the three-month mark. It hasn’t materialised. That’s noted. The threat model is updated.
This is governance as a living practice rather than a pre-launch checklist. It’s less dramatic than the ten-week story above. It’s also what makes the difference between a team that catches problems early and a team that finds out on Twitter.
What this requires of us
Everything in this story depends on a few things being true that aren’t always true by default.
It requires that quality is in the room at the start of the conversation, not brought in to validate decisions that have already been made. That means quality leaders actively building relationships with product, engineering, legal, and data science, and making the case for early involvement before the high-stakes conversations happen rather than during them.
It requires that the quality team has developed the language to talk about AI risk in terms that resonate with the people making decisions. Not just test coverage metrics, but fairness, reliability, adversarial exposure, user impact. That’s a capability development challenge, and it doesn’t happen automatically.
And it requires a particular kind of professional courage: the willingness to raise the uncomfortable finding, to hold the difficult conversation, to be the person in the room who asks the question that slows things down, while still being a constructive and trusted part of the team.
None of that is new. It’s been true of good quality leadership for a long time. What’s changed is the stakes. AI systems can have impact at a scale and speed that makes the cost of getting governance wrong much higher than it used to be. That’s not an argument for slowing everything down. It’s an argument for getting into the conversation early enough to make the difference.
I’d love to hear where you’re at with this. Are you in the room early enough in your organisation, or is quality still being brought in to validate decisions that are already made? What’s the governance practice that’s actually made a difference where you are? The real-world examples are always more instructive than the hypotheticals.
Tags: #AIGovernance #QualityEngineering #IntelligentQualityLeadership #ResponsibleAI #TechLeadership #QualityLeadership #AITesting #EngineeringLeadership
