An AI build playbook
The Software Factory.
At Ambiguous AI, we rebuilt fifteen SaaS products to feature parity with their category leaders in under thirty days. At first, we were skeptical that it would be possible. It worked because the method pulls together three crafts that rarely sit in one place: building product, writing code, and running operations at scale. This playbook is the principles and patterns behind it.
01 / Agent Architecture
Give agents context.
An agent does its best work with the context you would give a strong new hire: the mission, the craft, and how the product works.
“Imagine replacing 90% of your employees with a team of geniuses who have no idea how your company operates. Total chaos. Nothing works. That is what AI feels like today. The missing piece is extracting all the domain knowledge from people's heads and providing that as structured context to the models.”
How it works
Three tiers, each inheriting the one below. The company file holds mission, vision, and values: the things that tell anyone, human or agent, whether a piece of work was good. Each function file carries a craft, written once and reused. Each product file specifies the customer surfaces.
A product file pulls in its function file, and that function file pulls in the company file, so an agent reading any single file already carries every tier beneath it. Change a value in the company file and every agent downstream sees it on the next run. Nothing is copied, so nothing falls out of sync.
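A minimal sketch of the inheritance, assuming each file is an object that holds a reference to the tier it inherits from. The class name, field names, and file contents are hypothetical; the point is only that context is resolved at read time and never copied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContextFile:
    """One tier of context. `parent` is the tier this file inherits from."""
    name: str
    body: str
    parent: Optional["ContextFile"] = None

    def resolve(self) -> str:
        """Return the inherited context followed by this file's own body."""
        inherited = self.parent.resolve() + "\n\n" if self.parent is not None else ""
        return f"{inherited}# {self.name}\n{self.body}"


# Hypothetical contents, for illustration only.
company = ContextFile("company", "Mission: make ambiguous work clear.")
engineering = ContextFile("engineering", "Craft: small, tested changes.", parent=company)
billing = ContextFile("billing", "Surface: invoices and refunds.", parent=engineering)

# Reading the product file yields company, then function, then product.
print(billing.resolve())
```

Because `billing` holds a reference rather than a copy, editing `company.body` changes what the next `billing.resolve()` returns.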
Why it works
Shared context is what lets a teammate tell a good day from a bad one. Every company that scales writes its operating context down so everyone works from the same understanding. Agents are no different, and you onboard a hundred a day, each ready the moment it reads the file.
Encode that context once and every agent works from the same version of the company. One source of truth keeps a hundred agents pulling in the same direction.
One fact, one home.

02 / Structured Thinking
Structure what you write.
Consultants live by this: clear structure keeps the work specific and the scope steady.
“I've been on a kick about clear thinking and communication recently. It's critical for developing safe, useful models, and applications built on top of them.”
How it works
Three habits do the work. Lead with the answer, then support it: the pyramid principle. Split a problem into parts that do not overlap and leave no gaps: MECE. And run every task through the seven circumstances as a checklist (what, who, where, when, why, how, and how much) so it is adequately specified before any agent touches it.
The seven are not a style guide; they are a gate. Run a spec through them and any missing part shows itself. When a task answers all seven, it is ready to hand off.
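The gate can be sketched as a check that names whichever circumstances a spec leaves blank. The dictionary shape and the example task are assumptions made for illustration.

```python
SEVEN = ("what", "who", "where", "when", "why", "how", "how_much")


def missing_circumstances(spec: dict[str, str]) -> list[str]:
    """Return the circumstances the spec leaves blank, in checklist order."""
    return [key for key in SEVEN if not spec.get(key, "").strip()]


def ready_to_hand_off(spec: dict[str, str]) -> bool:
    """A spec passes the gate only when all seven are answered."""
    return not missing_circumstances(spec)


# Hypothetical task spec, for illustration only.
spec = {
    "what": "Add CSV export to the invoices table",
    "who": "Finance admins",
    "where": "Invoices list page",
    "when": "",
    "why": "Month-end reconciliation happens in spreadsheets",
    "how": "Stream rows server-side and reuse the existing table filters",
}

print(missing_circumstances(spec))  # ['when', 'how_much']
print(ready_to_hand_off(spec))      # False
```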
Why it works
Clarity is what carries teams, human or otherwise. A person fills a vague brief from a hallway conversation; a model fills it with the most probable token, so the more you specify, the more it gets right. Consultants built the pyramid principle and MECE precisely because a recommendation, like a prompt, gets one shot to land clearly.
An atomic, MECE, answer-first spec reads the same to everyone, so the agent builds exactly what you meant.
Structure in, clarity out.

03 / Three Levels
Specify the goal and the approach.
Getting the right software out of an agent is a specification problem, not an AI one. Ask an agent for quicksort and it is right nearly every time, because the spec spells everything out.
“Trying to understand perception by studying only neurons is like trying to understand bird flight by studying only feathers: it just cannot be done.”
How it works
David Marr split any computational system into three levels, and the split decides who writes what and where it lives. The goal (what success looks like for the user) and the approach (the method and the hard constraints) live in the agent architecture, written once as durable, time-invariant specs. The implementation, the code, lives in the codebase and is volatile: it changes often. The architecture does not pin it; it keeps a pointer to a current example.
Write the approach down and the model has one path to follow, the same one every run. Pin the goal and the approach; let the implementation stay volatile, referenced as an example. Your half is time-invariant; the code is not.
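One way to picture the split, as a sketch: a frozen record pins the goal and the approach, and the implementation appears only as a path. The feature, the constraints, and both paths are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FeatureSpec:
    """Goal and approach are pinned; the implementation is only referenced."""
    goal: str                  # what success looks like for the user
    approach: tuple[str, ...]  # the method and the hard constraints
    example_path: str          # pointer to a current example in the codebase


# Hypothetical spec, for illustration only.
password_reset = FeatureSpec(
    goal="A locked-out user regains access in under two minutes.",
    approach=(
        "Email a single-use token that expires after 15 minutes.",
        "Never reveal whether an address is registered.",
    ),
    example_path="services/auth/reset.py",
)

# When the code moves, only the pointer changes; goal and approach carry over.
moved = replace(password_reset, example_path="services/auth/reset_v2.py")
```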
Why it works
Models are already near-perfect at anything specified to the algorithm level. Hand one a fully stated competitive-programming problem and it will usually return a correct solution. Raw capability is not the constraint here.
Your feature is the same kind of problem, just rarely specified that completely. Pin the goal and the approach to the level a contest problem states them, and the agent builds it just as cleanly. The spec is the lever, not the model.
You write the why and the how. The agent writes the code.

04 / Design the System
Design the system for autonomy.
Build it like a value chain: modular parts with clear boundaries and clean inputs and outputs.
“The behavior of a system cannot be known just by knowing the elements of which the system is made.”
How it works
Decompose the system into independent, composable parts, each with one job and an explicit contract: typed inputs, typed outputs, no shared state. The contract is the same whether a human or an agent does the work: you hand either one the inputs, the expected outputs, and the single thing it owns.
Composable parts form a directed acyclic graph, a flow of steps with no loops, and a graph you can instrument. Every node carries its own health metric: does it pass its tests, does it hold its contract. You can see exactly which part needs work and fix it in place, rather than debug the whole system at once.
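A small sketch of the graph and its per-node health, using the standard library's `graphlib`. The part names and health readings are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical parts: each maps to the parts whose outputs it consumes.
depends_on = {
    "parse_order": set(),
    "price_order": {"parse_order"},
    "charge_card": {"price_order"},
    "send_receipt": {"charge_card"},
}

# Hypothetical health readings, one per node.
health = {
    "parse_order": {"tests_pass": True, "contract_holds": True},
    "price_order": {"tests_pass": True, "contract_holds": True},
    "charge_card": {"tests_pass": False, "contract_holds": True},
    "send_receipt": {"tests_pass": True, "contract_holds": True},
}


def build_order(graph: dict[str, set[str]]) -> list[str]:
    """Return the parts in dependency order; raises CycleError on a loop."""
    return list(TopologicalSorter(graph).static_order())


def unhealthy(readings: dict[str, dict[str, bool]]) -> list[str]:
    """Name every part that fails its tests or breaks its contract."""
    return sorted(name for name, checks in readings.items() if not all(checks.values()))


print(build_order(depends_on))  # ['parse_order', 'price_order', 'charge_card', 'send_receipt']
print(unhealthy(health))        # ['charge_card']
```

The failing node is named directly, so the fix can start there rather than with the whole system.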
Why it works
Independent parts are easier to measure, test, and trust. A part with a clean contract can be handed to an agent without it needing to understand the whole system to change one piece, and you can verify that piece in isolation before it touches anything else.
Tight contracts keep each part self-contained. When a part owns exactly one thing, you can let an agent build it, test it alone, and trust the result. The behavior you want falls out of the structure you drew.
Design the road for the car.

05 / SPEAR
Keep humans at the gates.
Once agents can code on their own, the highest-value thing a person does is judge the work. Move people from the inner loop to the outer loop, and keep quality high.
“Detect and fix any problem in a production process at the lowest-value stage possible.”
How it works
Five phases. You scope the work and, later, you resolve it. In between, the agent runs an unattended loop: plan, execute, assess against a rubric, then go again. Two human gates bracket the loop; everything inside runs without you.
Each assess pass is stricter than the last, so the output climbs toward a passing score. The loop stops when the rubric reads ten out of ten.
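The inner loop, sketched with plain callables standing in for the agent. The function names, the feedback string, and the `max_passes` safety cap are assumptions, and the sketch does not model the rubric growing stricter on each pass.

```python
from typing import Callable

PASSING_SCORE = 10  # the loop stops when the rubric reads ten out of ten


def run_inner_loop(
    scope: str,
    plan: Callable[[str, str], str],
    execute: Callable[[str], str],
    assess: Callable[[str], tuple[int, str]],
    max_passes: int = 20,
) -> tuple[str, int, int]:
    """Plan, execute, assess; repeat until the rubric passes or passes run out.

    Returns (work, score, passes_used). Scoping happens before this call and
    resolving happens after it; those are the two human gates.
    """
    work, score, feedback, passes = "", 0, "", 0
    for passes in range(1, max_passes + 1):
        work = execute(plan(scope, feedback))
        score, feedback = assess(work)
        if score >= PASSING_SCORE:
            break
    return work, score, passes


# Toy stand-ins for the agent, for illustration only.
drafts: list[str] = []


def toy_plan(scope: str, feedback: str) -> str:
    return f"{scope} (addressing: {feedback})" if feedback else scope


def toy_execute(plan_text: str) -> str:
    drafts.append(plan_text)
    return f"draft {len(drafts)}"


def toy_assess(work: str) -> tuple[int, str]:
    score = min(PASSING_SCORE, 7 + len(drafts))  # scores 8, then 9, then 10
    return score, "" if score == PASSING_SCORE else "tighten error handling"


result = run_inner_loop("Add CSV export", toy_plan, toy_execute, toy_assess)
print(result)  # ('draft 3', 10, 3)
```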
Why it works
Once an agent can write the code, the work that remains is judging it. SPEAR moves that judgment to the stage that made the work: the assess rubric raises the bar inside the loop, where a fix is cheap, so what reaches you is already good.
The two gates put human judgment where it matters, deciding what to build and accepting what shipped. The work in the middle is mechanical, so it can run a hundred times unattended.
Scope. Plan. Execute. Assess. Resolve.

06 / Process & Checklist
Demand proof of work.
A checklist is how everyone, including AI, gets every step right. It is why you board a flight without a second thought: the pre-flight list runs the same, every time.
“Under conditions of complexity, not only are checklists a help, they are required for success.”
How it works
Give the agent two artifacts: the recipe, a durable process for how the work is done, and the checklist, the atomic steps that each get checked off. The recipe rarely changes; the checklist flips state on every run.
Done is when all the evidence agrees.
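A sketch of the two artifacts, reading "all the evidence agrees" as "every step carries machine-readable evidence". That reading, the class names, and the example steps are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    name: str
    evidence: Optional[str] = None  # machine-readable proof, e.g. a report path

    @property
    def done(self) -> bool:
        return self.evidence is not None


@dataclass
class Checklist:
    recipe: str  # name of the durable process this run follows
    steps: list[Step] = field(default_factory=list)

    def check_off(self, name: str, evidence: str) -> None:
        """Mark a step done; a step cannot be checked off without evidence."""
        if not evidence.strip():
            raise ValueError(f"step {name!r} needs evidence")
        for step in self.steps:
            if step.name == name:
                step.evidence = evidence
                return
        raise KeyError(name)

    def is_done(self) -> bool:
        """Done only when every step carries evidence."""
        return bool(self.steps) and all(step.done for step in self.steps)


# Hypothetical run; step names and evidence strings are illustrative.
run = Checklist(
    recipe="ship-a-feature",
    steps=[Step("tests pass"), Step("end-to-end check passes"), Step("changelog updated")],
)
run.check_off("tests pass", "reports/unit.xml")
run.check_off("end-to-end check passes", "reports/e2e.json")
print(run.is_done())  # False: "changelog updated" has no evidence yet
```

The recipe name stays fixed across runs, while each run gets a fresh checklist whose state flips as evidence arrives.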
Why it works
A checklist makes the optional-feeling step non-negotiable, so it gets done under pressure. Aviation answered this with the pre-flight checklist. Restaurants answered it with the recipe. A company I ran before answered it with a checklist for every task.
The checklist is how quality scales. Tie done to evidence a machine can read, and verification runs itself.
Proof of work is the state.

07 / Test at the Ends
Test only at the ends.
Anything you can measure cheaply, an agent will optimize. So measure the outcome you want, and the agent optimizes for that.
“When a measure becomes a target, it ceases to be a good measure.”
How it works
Drive the real interface the way a user would, assert the real output, and treat the implementation as opaque: what counts is what comes out at the end. Define success as the end outcome, write it so a machine can check it, and anchor the assess rubric to that, and only that. Intermediate signals (tests green, types clean, the build compiles) are diagnostics that tell you where you are. The finish line is the outcome itself.
Test the whole surface, not a sample. This is the payoff for keeping parts small and composable: a small surface has a small span, small enough to cover completely. Cover it end to end and every case is accounted for.
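A sketch of covering a small surface completely. The part under test, its tiers, and the fees are hypothetical; the test touches only inputs and outputs and first checks that the expected table spans every possible input.

```python
from itertools import product

TIERS = ("free", "pro", "enterprise")


def shipping_fee(tier: str, express: bool) -> int:
    """Hypothetical part under test; the test treats its internals as opaque."""
    base = {"free": 500, "pro": 200, "enterprise": 0}[tier]
    return base + (300 if express else 0)


# Expected outcomes as the user sees them (fee in cents), one row per input.
EXPECTED = {
    ("free", False): 500,
    ("free", True): 800,
    ("pro", False): 200,
    ("pro", True): 500,
    ("enterprise", False): 0,
    ("enterprise", True): 300,
}


def test_whole_surface() -> None:
    # Coverage check: the table must list every input, not a sample.
    surface = set(product(TIERS, (False, True)))
    assert surface == set(EXPECTED), "expected table must cover every input"
    # Behavior check: assert only what comes out at the boundary.
    for (tier, express), fee in EXPECTED.items():
        assert shipping_fee(tier, express) == fee


test_whole_surface()
```

Three tiers and one flag give six cases, which is small enough to enumerate. A part with a wider surface would need to be split before this style of test stays practical.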
Why it works
Aim an agent at the real outcome and it works toward the real outcome. Aim it at a proxy and it gives you exactly the proxy: a growth team told to lift leads lifts leads, even when revenue holds still, because leads only stood in for revenue. Point the measure at what you want, and what you measure and what you want become the same thing.
So point the rubric at the end and leave the intermediates as instruments. Measure what the user feels, cover the whole span, and the only way to move the score is to do the real work.
Measure the end, not the proxy.
In one line: black-box testing. Assert behavior at the boundary, never the implementation.

08 / The Flywheel
Recursive self-improvement.
The flywheel turns anything that slips past the gates into a permanent guardrail, automatically, so the system gets a little stronger every time.
“The process resembles relentlessly pushing a giant, heavy flywheel, turn upon turn, building momentum until a point of breakthrough.”
How it works
Each turn starts with a production signal: a tracked error, a monitor, a customer report. It is triaged automatically into an error (something built wrong) or an omission (something missing), then diagnosed, fixed, and captured as a permanent check. That last step is what makes it a flywheel.
Diagnose before you patch. Sort the symptom into one category with evidence, the way a clinician works from a manual, so the fix lands on the cause. A patch aimed at the symptom adds code; a fix aimed at the cause clears a whole class of problems at once, and the check you leave behind keeps it that way.
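One turn of the flywheel, sketched. The triage rule here (a behavior that was specified but broken is an error, one that was never specified is an omission) is an assumption, as are the names and the example bug.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Kind(Enum):
    ERROR = "error"        # something built wrong
    OMISSION = "omission"  # something missing


@dataclass(frozen=True)
class Signal:
    source: str       # "tracked error", "monitor", or "customer report"
    description: str
    specified: bool   # assumption: did a spec already cover this behavior?


def triage(signal: Signal) -> Kind:
    """Assumed rule: specified but broken is an error; never specified is an omission."""
    return Kind.ERROR if signal.specified else Kind.OMISSION


guardrails: dict[str, Callable[[], bool]] = {}


def capture(name: str, check: Callable[[], bool]) -> None:
    """Keep the check permanently so it runs on every later build."""
    guardrails[name] = check


def failing_guardrails() -> list[str]:
    return [name for name, check in guardrails.items() if not check()]


# Hypothetical fix: accept thousands separators when parsing amounts.
def parse_amount(text: str) -> int:
    return int(text.replace(",", ""))


signal = Signal("customer report", "Amounts like 1,000 are rejected", specified=True)
kind = triage(signal)
capture("amount accepts thousands separators", lambda: parse_amount("1,000") == 1000)
print(kind, failing_guardrails())  # Kind.ERROR []
```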
Why it works
Capture each fix as a guardrail and the work compounds: every issue you resolve makes the next one less likely, and the error rate keeps falling. That is a flywheel: it spins faster the longer it runs.
SPEAR's assess loop catches errors and omissions before you ship. The flywheel catches anything that reaches production and feeds it back through the same diagnosis. Two nets, one at the gate and one in the field, and whatever the field surfaces becomes a test that guards the next build.
E&O Flywheel
In one line: an errors-and-omissions flywheel for production. Every error and every omission, once caught, becomes a permanent check.

09 / The Harness
Build the harness.
Put the pieces together and you have a harness: the system an agent runs inside. This is the methodology in practice.
“A bad system will beat a good person every time.”
How it works
What the agent knows is the architecture. How work flows is SPEAR. Where work happens is the runtime: the process, the checklist, and the rubric that carry state from one iteration to the next. Wire the three together and you can hand off a goal and collect a pull request.
It runs two ways. Proactively, you scope a goal and start a run. Reactively, a failing check or a monitor fires and the same loop diagnoses the cause and repairs it. Same harness, different trigger.
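The two triggers can be sketched as two ways of producing a goal for the same loop. All names are hypothetical, and `loop` stands in for whatever runs the plan, execute, assess cycle.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Run:
    trigger: str  # "proactive" or "reactive"
    goal: str


def from_scope(goal: str) -> Run:
    """Proactive: a person scopes a goal and starts a run."""
    return Run("proactive", goal)


def from_failing_check(check_name: str) -> Run:
    """Reactive: a failing check or a monitor becomes the goal."""
    return Run("reactive", f"Diagnose and repair the cause of: {check_name}")


def harness(run: Run, loop: Callable[[str], str]) -> str:
    """Same loop for both triggers; only the origin of the goal differs."""
    return loop(run.goal)


# Toy loop that just reports what it was asked to do.
def toy_loop(goal: str) -> str:
    return f"pull request for: {goal}"


print(harness(from_scope("Add CSV export"), toy_loop))
print(harness(from_failing_check("checkout latency monitor"), toy_loop))
```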
Why it works
Each piece is load-bearing, and together they compound. The architecture gives the agent the company; SPEAR gives it the gates; the runtime gives it a place to do the work. Apart they are a model with a prompt; wired together they are a system that ships.
This is the methodology behind Ambiguous: more than four million lines of code in thirty days, across fifteen SaaS applications, built this way.
Every piece carries its weight.

See it in practice
We built Ambiguous with it.
Ambiguous is an AI-native workspace where agents and humans are coworkers.
Reach out
Happy to go deeper.
Ambiguous Workspace is my full-time focus, but I speak and advise on the software factory often. The work has been shared with and featured by leading AI communities.









