What a career in QA, a homegrown BDD framework, and a late-night testing session taught me about where software quality is actually heading.
By Adam Goss, Founder — Iter8 IT Consulting
Executive Summary
There is a growing belief in the AI-accelerated software world that quality assurance is a relic. Build fast. Ship fast. Iterate fast. Who has time for testing when the market won't wait and AI can regenerate the whole thing if it breaks?
I've heard versions of this argument my whole career, and it has always been wrong. What's different now is that the people making it have better tools to hide behind.
I've managed QA organizations. I've led teams of 20 testers. I've built test automation frameworks from the ground up. I spent years trying to get engineering teams and product organizations to treat Acceptance Criteria as the foundation of everything — design, development, and testing — from a single source of truth. I was largely unsuccessful, not because the idea was wrong, but because the human adoption layer kept collapsing.
About fifteen years ago, a colleague and I built a homegrown BDD system we called Testopia. The idea was sound. The tooling worked. Leadership mandated it. And it still failed — because we couldn't get product owners to write the scenarios that would have made the whole chain run.
Recently, working about two hours on a weeknight, I watched that idea finally work the way it was always supposed to. Not because anything changed about the concept. Because the tools caught up.
This is what I learned.
The Myth That Quality Is Optional
Before we get into the tools, the belief needs addressing directly.
The "move fast, ship broken, iterate" philosophy isn't new. It has been dressed up in different clothes across different eras — lean startup, minimum viable product, continuous deployment — and each time it contains a kernel of genuine insight wrapped around a dangerous assumption. The insight is that shipping something real and learning from it beats spending years perfecting something theoretical. The assumption is that quality is a dial you can turn down when speed matters more.
That assumption has always been wrong. What it actually describes is a transfer of cost, not an elimination of it. The cost of poor quality doesn't disappear when you skip testing. It moves. It moves to your users, who encounter the bugs. It moves to your support team, which fields the complaints. It moves to your engineers, who spend their next sprint fixing what should have been caught in the last one. And it moves to your reputation, which is the hardest cost of all to recover.
In a world where AI can generate code faster than any human team, the temptation to skip quality discipline is stronger than it has ever been. The pace is genuinely faster. The output is genuinely more voluminous. And the gap between "we shipped something" and "we shipped something good" is wider than ever, because the volume of output has outpaced the human capacity to review it without structure.
This is not an argument for slowing down. It is an argument for bringing the same discipline to quality that you bring to everything else. Quality is not a phase at the end of the pipeline. It is a property of how you build.
Testopia: The Right Idea, Wrong Era
Around 2010, a colleague and I were frustrated with the same problem that was frustrating every engineering team trying to implement behavior-driven development. Cucumber was just getting started. The idea of writing test scenarios in plain English that could drive both development and testing was compelling in theory. In practice, it kept breaking down at the same point: the people who should be writing the scenarios — the product owners, the business stakeholders — wouldn't do it. It was "extra work." It wasn't their job. There were always more pressing things.
So we built our own system. We called it Testopia.
The architecture was straightforward. Product owners wrote scenarios in plain English. Testopia tokenized the input and mapped it to C# classes and methods in a test project. A developer building a feature was responsible for delivering both the application code and the Testopia implementations that fulfilled the mappings. The idea was a red-green-refactor cycle where the scenarios started failing (red), the developer made them pass (green), and the codebase was cleaner for it.
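The tokenize-and-map step can be sketched in a few lines. This is a minimal Python illustration, not Testopia's actual code (which mapped to C# classes and methods); the `step` decorator, `run_scenario`, the patterns, and the event scenario are all hypothetical names invented for the example.

```python
import re

# Registry of (compiled pattern, implementation) pairs. In Testopia the targets
# were C# classes and methods; plain Python functions stand in for them here.
STEPS = []


def step(pattern):
    """Register a function as the implementation of a plain-English step."""
    def register(func):
        STEPS.append((re.compile(pattern, re.IGNORECASE), func))
        return func
    return register


def run_scenario(text, context):
    """Run each line of a scenario; return the lines that have no implementation."""
    unmatched = []
    for line in filter(None, (raw.strip() for raw in text.splitlines())):
        # Drop the leading keyword so "Given X" and "And X" map to the same step.
        body = re.sub(r"^(given|when|then|and)\s+", "", line, flags=re.IGNORECASE)
        for pattern, func in STEPS:
            match = pattern.fullmatch(body)
            if match:
                func(context, *match.groups())
                break
        else:
            # "Red": a scenario step that nobody has implemented yet.
            unmatched.append(line)
    return unmatched


# Hypothetical step implementations a developer would deliver with the feature.
@step(r'an event named "(.+)"')
def create_event(context, name):
    context["event"] = {"name": name, "saved": False}


@step(r"the admin saves the event")
def save_event(context):
    context["event"]["saved"] = True


# Hypothetical scenario, as a product owner might write it.
scenario = """
Given an event named "Trail Day"
When the admin saves the event
Then the event appears in the public list
"""

context = {}
missing = run_scenario(scenario, context)
print(context["event"])
print(missing)
```

In this sketch the third line has no matching implementation, so it comes back in `missing`. That is the red state: the scenario exists first, and the developer's job is to make the list empty.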
We got leadership buy-in that Testopia was part of the process and required. What we never got was buy-in that the product team needed to supply the scenarios. Without that, the chain was broken before it started. Developers ended up writing their own scenarios, which defeated the purpose — you don't get much value from a test written by the same person who wrote the code it's testing.
Testopia worked technically. It failed organizationally. The concept was sound. The human adoption layer collapsed.
I thought about Testopia a lot during that recent two-hour session.
Phase 13: What I Actually Did
I've been building a production web application called Duck Duck Jeep Tracker on evenings and weekends, using a structured AI-augmented development workflow. I wrote about that process in a previous article. Recently I was working on Phase 13, an event management system for the Jeep community: the ability to create, manage, and discover Jeep events and meetups.
The phase was organized into three features: Foundation (database migrations, functions, shared components), Admin Pages (an event list page and a create/edit form), and a Public List page. I planned the work and broke it into stories the same way I always do: user stories with explicit Acceptance Criteria, handed off to Claude Code for execution, reviewed by me before merge.
When the Admin features were complete in my local development environment, I did something I hadn't done in this project before. Instead of just doing a manual spot-check myself, I opened Claude Cowork and directed it to test the Admin pages.
What happened next was genuinely surprising — not because it worked, but because of how it worked.
I pointed Cowork at the running app and told it what to verify. It navigated the pages, filled in forms, submitted data, checked results, and reported back. When something didn't behave as expected, it told me what it observed and asked how I wanted to proceed. When a field label was ambiguous, it flagged it. When a validation message wasn't firing correctly, it caught it and described exactly what it saw.
Then I pushed to TEST and did it again. Same process, live environment, real data. Then to PROD.
And then — this is the part I keep thinking about — I asked Cowork to populate the system with real event data. It researched actual Jeep events and meetups on the internet, gathered the relevant details for about 20 events, and entered them into my newly shipped production form. My event management system went from empty to populated with real, researched content in a single session.
That is not a testing story. That is a new category of what AI agents can do in a software workflow.
Cowork vs. Selenium: An Honest Comparison
I want to be direct about the tradeoffs, because anyone with a serious automation background will ask.
Cowork is slower than Selenium. Meaningfully slower. A Selenium suite that runs 200 test cases in three minutes will outperform a Cowork session by a significant margin on raw execution speed. If you are running regression tests on a CI pipeline that triggers on every commit, Selenium is still the right instrument.
What Cowork offers that Selenium cannot is adaptability and dialogue.
A Selenium script passes, fails an assertion, or throws an exception. The failure tells you which locator or assertion broke, not what the test was trying to accomplish for the user. It does not ask you whether the behavior it observed was intentional. It does not notice that a label is confusing or that a user flow feels awkward and surface that observation unprompted.
Cowork does all of those things. It is closer to having a QA engineer in the room than it is to running an automated script. The back-and-forth is real. The observations are contextual. And critically — you do not have to build a page object model. You do not have to maintain CSS selectors and element IDs that break every time a developer renames a class. You describe what you want verified in plain language, and the agent figures out how to interact with the interface to verify it.
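To make that maintenance cost concrete, here is a minimal sketch of the page-object pattern. It is deliberately self-contained rather than real Selenium: `FakeBrowser` is a stand-in for a browser driver, and the page class, selectors, and markup are hypothetical.

```python
class FakeBrowser:
    """Stand-in for a browser driver: maps CSS selectors to element text."""

    def __init__(self, elements):
        self.elements = elements

    def text_of(self, selector):
        if selector not in self.elements:
            raise LookupError(f"no element matches {selector!r}")
        return self.elements[selector]


class EventListPage:
    """Page object: the one place that knows how the page is marked up."""

    # Hypothetical selectors; these are what someone has to keep in sync.
    TITLE = "h1.page-title"
    FIRST_EVENT = ".event-row .event-name"

    def __init__(self, browser):
        self.browser = browser

    def title(self):
        return self.browser.text_of(self.TITLE)

    def first_event_name(self):
        return self.browser.text_of(self.FIRST_EVENT)


# The check passes against the markup it was written for.
page = EventListPage(FakeBrowser({
    "h1.page-title": "Events",
    ".event-row .event-name": "Trail Day",
}))
assert page.first_event_name() == "Trail Day"

# A developer renames the row class. Nothing changes for users,
# but the stored selector no longer matches anything.
renamed = EventListPage(FakeBrowser({
    "h1.page-title": "Events",
    ".event-card .event-name": "Trail Day",
}))
try:
    renamed.first_event_name()
    broke = False
except LookupError as error:
    broke = True
    print(f"test broke without a product bug: {error}")
```

The page object limits the fix to one constant, but someone still has to notice the break and make the change. That upkeep is the cost the plain-language approach avoids.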
For exploratory testing, UAT-style validation, and smoke testing new features in real environments, Cowork changes what is possible for a small team or a solo developer. The tradeoff is speed. The gain is intelligence.
The right answer is not Cowork instead of Selenium. It is understanding which instrument belongs at which stage of the pipeline.
One Artifact, Three Purposes
Here is where Testopia comes back.
The thing that killed Testopia fifteen years ago was the cost of the scenario-writing step. Product owners had to learn a process, invest time up front, and trust that the downstream benefit was worth the upstream effort. Most of them concluded it wasn't — or more accurately, they never had enough organizational pressure to find out.
What I realized during that two-hour session is that the AC I am already writing — the same Acceptance Criteria that drives story development, that Claude Code uses to implement features, that I use to review pull requests — is the same artifact that drives the Cowork testing session.
I did not write separate test cases. I did not maintain a separate test plan. I pointed Cowork at the AC that already existed and said: verify this. It knew what to check because the AC told it what the feature was supposed to do.
One artifact. Three purposes.
Design and UX teams use it to understand intent and build the right interface. Developers use it to implement the right behavior. QA — whether human or AI agent — uses it to verify the right outcome was delivered.
This is what BDD was always trying to achieve. The reason it struggled is that the chain required humans at every link, and humans have competing priorities, inconsistent discipline, and limited patience for process overhead that doesn't pay off immediately. The AI agent doesn't have those problems. It reads the AC the same way every time. It doesn't skip steps because it's busy. It doesn't interpret "the form should validate required fields" charitably when the validation isn't working.
The discipline I have been advocating for my entire career — write the AC clearly, write it up front, write it with enough precision that anyone downstream can act on it — turns out to be exactly the discipline that makes AI-assisted quality work. Not as a nice-to-have. As a structural requirement.
I spent years trying to convince humans. I never had to convince the AI.
Closing the Loop: From Exploration to Regression
There is one more idea here that I am still working through, but I think it is important enough to name.
Cowork is the right instrument for exploratory and functional testing during active development. It is adaptive, conversational, and doesn't require a framework. But it is not the right instrument for a regression suite that runs on every commit and guards against unintentional breakage over time.
For that, you still want Selenium. Or Playwright. Or whatever browser automation framework your team uses.
Here is the idea: the same AC that drives Cowork testing during development can be handed to Claude Code to generate the Selenium scripts for the regression suite.
You write the AC once. Claude Code implements the feature against it. Cowork validates the feature in real environments using it. Claude Code writes the Selenium scripts based on it. The regression suite enforces it on every pipeline trigger.
The AC is the source of truth for the entire quality chain — from the first design conversation to the automated test that runs six months after the feature shipped.
That is what Testopia was trying to be. That is what BDD was trying to be. That is what every "shift left on quality" initiative has been reaching toward for two decades.
The tools are finally there. The concept was never wrong.
What This Means for Engineering Teams
The conclusion I keep arriving at is not that AI makes QA unnecessary. It is the opposite. AI makes quality discipline more important, more visible, and more enforceable than it has ever been.
The teams that will struggle are the ones that treat AI as a reason to skip process. Generate the code, ship it, see what breaks. That approach produces volume without quality, and volume without quality is just faster failure.
The teams that will thrive are the ones that bring the same rigor to AI-augmented development that good engineering has always required. Clear requirements. Explicit acceptance criteria. Structured handoffs between planning, development, and testing. Human judgment at the gates that matter.
None of that is new. What is new is that AI can now participate meaningfully at every stage of that chain — planning, coding, testing, and automation — if you give it the structure to work within.
Quality is not a phase. It is not a team. It is not a tool. It is a decision you make about what you owe the people using what you build.
That decision has not changed. Everything else has.
Working With Iter8
Iter8 IT Consulting helps engineering teams build the systems and practices that make software delivery both fast and trustworthy. If your organization is trying to figure out how to adopt AI development tools without sacrificing the quality standards your customers depend on, that is exactly the work we do.
We have walked this path. We know where the discipline pays off and where teams typically cut corners they will regret. We would be glad to help yours build something worth being proud of.
Reach us at iter8itconsulting.com
Adam Goss is the founder of Iter8 IT Consulting, LLC, an engineering leadership and SDLC optimization practice based in Indiana. Duck Duck Jeep Tracker is live at duckduckj.com.

