How I Added 2+2 Using 97 AI Agents

I built 97 blocks of YAML, 8 AI specialists, and 18 phases of deliberation to settle a question every five-year-old can answer: what is 2 + 2?

It took the system about four minutes to reach a unanimous verdict: 4. Along the way, a mathematician invoked Peano axioms, a philosopher questioned whether numbers exist, a poet called four "the universe's quietest poem," and a child said "it's 4, everybody knows that."

The most interesting character opened at only 72% confidence.

This is how I built Rein, an open-source orchestrator for multi-agent AI workflows, and why I tested it on the most over-engineered arithmetic problem ever created.

Read the full transcript: 89 blocks, all specialist outputs

The problem

I have a dozen AI agents on my server. Research agents, code reviewers, content writers, critics. Each one works fine alone.

Getting them to work together is the problem. One massive prompt with all the instructions works for simple tasks. For anything with more than two steps, the model tends to lose track of the early instructions by the time it reaches the end, and when something fails you re-run everything from scratch.

I needed a way to split work into steps, hand each step to a different specialist, and have results flow automatically from one to the next. The same structure that debates 2+2 can debate "should we use Kafka or RabbitMQ" -- the patterns are identical, only the question changes.

Eight specialists walk into a debate

Before the deliberation, I verified the engine with a 12-block flow that adds 2+2 using Python scripts -- no AI, just parallel validation, compute, verify, conditional branch. That was the unit test. The deliberation is the integration test.

I defined 8 AI specialists, each with a different lens on the world:

Specialist	Perspective	What they bring
Mathematician	Formal proof	Peano axioms, logical certainty
Philosopher	Ontological	"Do numbers even exist?"
Physicist	Empirical	Two stones + two stones = four stones
Child	Naive common sense	"It's 4, everybody knows that"
Skeptic	Challenges everything	Modular arithmetic, Gödel, doubt
Historian	5000 years of consensus	Babylonians knew this too
Poet	Aesthetic truth	Beauty in numerical harmony
Engineer	Practical precision	4, +/- 0.01 tolerance

Each specialist is a Markdown file with a system prompt. About 10 lines. The team YAML enforces brevity: "respond in 1-2 sentences, output JSON."
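The "output JSON" rule is easy to check mechanically. Here is a minimal Python sketch of such a check, assuming each reply carries an `answer` string and a `confidence` number between 0 and 100. Both field names and the `parse_reply` helper are illustrative assumptions, not Rein's actual schema or API.

```python
import json


def parse_reply(raw: str) -> dict:
    """Parse one specialist reply and check an assumed two-field schema."""
    # json.loads raises json.JSONDecodeError (a ValueError subclass) on non-JSON text.
    reply = json.loads(raw)
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")

    answer = reply.get("answer")          # hypothetical field name
    confidence = reply.get("confidence")  # hypothetical field name

    if not isinstance(answer, str) or not answer.strip():
        raise ValueError("missing or empty 'answer'")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("'confidence' must be a number")
    if not 0 <= confidence <= 100:
        raise ValueError("'confidence' must be between 0 and 100")

    return {"answer": answer.strip(), "confidence": float(confidence)}
```

A check like this fits naturally in a per-block validation step, so a malformed reply fails at the block that produced it instead of corrupting every downstream prompt.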

97 blocks, 18 phases

The workflow YAML is 500+ lines. Here's the structure:

Phase 1:  OPENING STATEMENTS ............ 8 blocks (parallel)
Phase 2:  CROSS-REVIEW LEFT ............. 8 blocks (parallel)
Phase 3:  CROSS-REVIEW RIGHT ............ 8 blocks (parallel)
Phase 4:  REVISED POSITIONS ............. 8 blocks (parallel)
Phase 5:  STRUCTURED DEBATES ............ 8 blocks (4 pairs)
Phase 6:  DEBATE JUDGES ................. 4 blocks (parallel)
Phase 7:  POST-DEBATE POSITIONS ......... 8 blocks (parallel)
Phase 8:  SKEPTIC CHALLENGES ............ 7 blocks (parallel)
Phase 9:  RESPONSES TO SKEPTIC .......... 7 blocks (parallel)
Phase 10: GROUP SYNTHESIS ............... 2 blocks (parallel)
Phase 11: GROUP DEBATE .................. 2 blocks (sequential)
Phase 12: MODERATOR SYNTHESIS ........... 1 block
Phase 13: FINAL VOTE .................... 8 blocks (parallel)
Phase 14: VOTE COUNT + CONSENSUS ........ 2 blocks (sequential)
Phase 15: EMERGENCY ROUND ............... 4 blocks (conditional)
Phase 16: EMERGENCY VOTE ................ 8 blocks (conditional)
Phase 17: FORMAT ANSWER ................. 1 block
Phase 18: QUALITY + DELIVER ............. 3 blocks (sequential)

Every phase runs in parallel where possible. The dependency graph looks like a circuit board: wide parallel bands feeding into narrow synthesis points, then fanning out again.
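To make the "parallel bands" concrete, here is a small Python sketch that groups blocks into waves, where every block in a wave has all of its dependencies satisfied by earlier waves. It uses the standard-library `graphlib` module; the `execution_waves` helper and most of the block names are made up for illustration, and this is not necessarily how Rein schedules blocks internally.

```python
from graphlib import TopologicalSorter


def execution_waves(depends_on: dict[str, list[str]]) -> list[list[str]]:
    """Group blocks into waves that could each run fully in parallel."""
    sorter = TopologicalSorter(depends_on)  # maps block -> its prerequisites
    sorter.prepare()                        # raises graphlib.CycleError on cycles
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every block whose deps are done
        waves.append(ready)
        sorter.done(*ready)                 # unlocks the next wave
    return waves


# Hypothetical miniature of the deliberation: two openings, two reviews, one synthesis.
graph = {
    "opening_mathematician": [],
    "opening_philosopher": [],
    "review_math": ["opening_mathematician"],
    "review_phil": ["opening_philosopher"],
    "synthesis": ["review_math", "review_phil"],
}
waves = execution_waves(graph)
```

For this toy graph the result is three waves: both openings, then both reviews, then the single synthesis block. That is the wide-band-into-narrow-point shape in miniature.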

The conditional branch at Phase 14 is the interesting part. If at least 75% of specialists agree, skip to formatting. If not, trigger emergency rounds. In this run, consensus was 100%, so the emergency path never fired. But it's there.
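The consensus check itself is a few lines of logic. A minimal Python sketch, assuming the threshold is inclusive (exactly 75% counts as consensus) and using hypothetical block names `format_answer` and `emergency_round` for the two branches:

```python
from collections import Counter


def consensus(votes: dict[str, str], threshold: float = 0.75) -> tuple[str, float, str]:
    """Return (leading answer, share of votes, name of the next block)."""
    if not votes:
        raise ValueError("no votes to count")
    # most_common(1) gives the single most frequent answer and its count.
    answer, count = Counter(votes.values()).most_common(1)[0]
    share = count / len(votes)
    # Assumption: the comparison is inclusive, so 6 of 8 (75%) passes.
    next_block = "format_answer" if share >= threshold else "emergency_round"
    return answer, share, next_block


final_votes = {name: "4" for name in (
    "mathematician", "philosopher", "physicist", "child",
    "skeptic", "historian", "poet", "engineer",
)}
result = consensus(final_votes)
```

With the eight unanimous votes from this run, `result` is `("4", 1.0, "format_answer")`. Had three specialists held out for another answer, the share would drop to 0.625 and the emergency branch would be selected.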

What actually happened

Opening statements

Eight specialists answered simultaneously. Most said "4" without hesitation. The Mathematician cited Peano axioms. The Engineer said "4, fundamental arithmetic identity." The Historian traced it to Babylonian mathematics circa 3000 BCE.

Two stood out.

The Poet:

"Two and two make four -- a truth as old as counting stars, where symmetry finds its most perfect mirror in the doubling of the pair. It is the first lesson of arithmetic, and perhaps the universe's quietest poem."

And the Skeptic, at 72% confidence:

"While 2+2=4 is treated as self-evident, this relies entirely on accepting the axioms of Peano arithmetic. In modular arithmetic mod 4, 2+2=0. The answer is framework-dependent, not absolute truth."

Everyone else was at 97-100%; the Skeptic opened at 72%. That gap drives the entire deliberation.

The debates

Two rounds of cross-review planted the seeds -- the Philosopher called the Mathematician's axiomatic dependence "a significant philosophical concession," the Physicist dismissed modular arithmetic as "contextually irrelevant." Then four structured debates ran in parallel.

Mathematics vs. Philosophy. The Mathematician argued that Peano axioms aren't arbitrary -- they capture our actual concept of natural numbers. The Philosopher rebutted: logical necessity within a system isn't the same as necessity full stop. "If these axioms, then this result" is not the same as "this result, period."

They partially converged. Both agreed on 4; disagreed on whether it's a mind-independent truth or a consequence of a chosen framework.

Physics vs. Skepticism. This was the turning point. The Physicist invoked empirical evidence: two stones plus two stones always makes four stones. Two photons plus two photons, same thing. Every measurement confirms it.

The Skeptic conceded:

"Physical reality does provide a non-circular grounding for why standard arithmetic is the privileged context, making '4' not merely conventional but empirically anchored."

The circularity -- "Peano axioms formalize standard arithmetic, which is standard because Peano axioms formalize it" -- was broken by pointing outside mathematics entirely. Physics grounds the math.

Poetry vs. Engineering. The Poet argued mathematics without meaning is "a skeleton without flesh." The Engineer countered that wonder validates the arithmetic, not the other way around. Then the Engineer surprised everyone:

"Precision alone is insufficient for full understanding. Context, meaning, and human experience are legitimate dimensions of knowledge, just orthogonal to the arithmetic fact itself."

The Skeptic's challenge round

Phase 8 was the Skeptic's chance to attack everyone individually. Seven targeted challenges, deployed in parallel.

To the Mathematician, the Skeptic invoked Gödel: "If the consistency of Peano arithmetic cannot be proven within Peano arithmetic itself, your confidence rests on an unprovable meta-assumption."

To the Child: "You're importing a specific foundational framework as if it were neutral."

To the Poet, the sharpest attack: "If 2+2=4 is merely system-relative, then so is the beauty you invoke to defend it. Your certainty is self-undermining."

All seven stood firm. The Mathematician pointed out that Gödel's limitations apply equally to every alternative system -- there's no better option to switch to. The Child cut through the philosophy: "The question '2+2=?' implicitly operates within standard arithmetic. Acknowledging that other systems exist doesn't undermine the answer within the default context."

The Poet made the most elegant counter-move:

"A poet who says 'all roses are red within this garden' need not pretend to stand outside all gardens to say so with conviction."

The Skeptic's arc

Opening: 72%. "The answer is framework-dependent, not absolute truth."

After the Physics debate: ~85%. The moment two stones plus two stones became the argument, the Skeptic's circularity attack lost its teeth. He conceded that standard arithmetic is "empirically anchored" -- not just a convention we happen to use.

Post-debate reflection: 88%. "The philosophical nuance about axiomatic choice is valid but doesn't undermine the practical answer." The hedging shrank. The confidence grew.

Final vote: 100%. "2+2=4 is a fundamental arithmetic fact, universally true in standard integer arithmetic."

The Skeptic didn't cave to pressure. He engaged every argument, found the best counterarguments, and changed his mind when the evidence warranted it. His journey from principled doubt to informed certainty is the intellectual backbone of the entire deliberation.

Final vote

Specialist	Answer	Confidence
Mathematician	4	100%
Philosopher	4	100%
Physicist	4	100%
Child	4	100%
Skeptic	4	100%
Historian	4	100%
Poet	4	100%
Engineer	4	100%

Unanimous. 8/8, 100% confidence. The consensus check passed instantly. Emergency rounds never triggered.

Transcript highlights

These are real outputs from a real run. 89 of 97 blocks completed. Here are 10 moments from the deliberation:

Child, opening statement:

"It's 4! Everyone knows that, it's easy!"

Skeptic, opening statement (72% confidence):

"While 2+2=4 is treated as self-evident, this relies entirely on accepting the axioms of Peano arithmetic. In modular arithmetic mod 4, 2+2=0. The answer is framework-dependent, not absolute truth."

Poet, opening statement:

"Two and two make four -- a truth as old as counting stars, where symmetry finds its most perfect mirror in the doubling of the pair. It is the first lesson of arithmetic, and perhaps the universe's quietest poem."

Physicist, arguing against the Skeptic:

"Every empirical counting operation -- two stones plus two stones, two photons plus two photons -- consistently yields four. The reason standard arithmetic is the assumed context is precisely because it models physical reality most faithfully."

Skeptic, conceding to the Physicist:

"I concede: physical reality does provide a non-circular grounding for why standard arithmetic is the privileged context, making '4' not merely conventional but empirically anchored."

Skeptic, challenging the Poet:

"If 2+2=4 is merely system-relative, then so is the beauty you invoke to defend it. Your certainty is self-undermining."

Poet, responding to the Skeptic:

"A poet who says 'all roses are red within this garden' need not pretend to stand outside all gardens to say so with conviction."

Engineer, opening statement:

"2+2 equals 4, a fundamental arithmetic identity that holds across all standard number systems and is foundational to integer arithmetic in every computing architecture."

Mathematician, opening statement:

"2+2=4 is a provable theorem in Peano arithmetic, following directly from the axioms of successor and addition."

Skeptic, final vote (100% confidence):

"2+2=4 is a fundamental arithmetic fact, universally true in standard integer arithmetic."

Full transcript: 89 blocks, all specialist outputs

How it actually works

Under the hood, Rein scans the YAML, builds a dependency graph, and executes blocks in parallel wherever dependencies allow. State lives in SQLite. Each block writes output to a JSON file. Downstream blocks read upstream outputs automatically via template injection:

- name: review_math
  specialist: philosopher
  depends_on: [opening_mathematician]
  prompt: "Review this position: {{ opening_mathematician.json }}"

{{ opening_mathematician.json }} expands to the full content of the mathematician's output file. Data flows between blocks through the file system: no message passing and no shared in-memory state, and because each block writes only its own output file, blocks don't race on writes.
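The expansion step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Rein's actual code: it assumes each block's output lives at `<outputs_dir>/<block_name>.json`, and the `render_prompt` helper and the directory layout are hypothetical.

```python
import re
from pathlib import Path

# Matches placeholders such as "{{ opening_mathematician.json }}".
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_]\w*)\.json\s*\}\}")


def render_prompt(template: str, outputs_dir: Path) -> str:
    """Replace every {{ name.json }} with the text of that block's output file."""

    def substitute(match: re.Match) -> str:
        # Assumed layout: one JSON file per block, named after the block.
        path = outputs_dir / f"{match.group(1)}.json"
        # A missing upstream file raises FileNotFoundError, which is the right
        # failure: the dependency has not produced its output yet.
        return path.read_text(encoding="utf-8")

    # Passing a function as the replacement inserts its return value literally,
    # so backslashes inside the JSON are not treated as regex escapes.
    return PLACEHOLDER.sub(substitute, template)
```

Placeholders that don't end in `.json`, such as `{{ task.input.topic }}`, are left untouched by this pattern and would need their own resolver.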

The execution engine handles:

Parallel execution with a configurable semaphore (max_parallel: 8)
Conditional branching (next: if/else/goto)
Correction loops with max_runs safety valves
Custom logic at four points per block: pre-hook, post-hook, validate, and full custom replacement
Recovery: if block 45 of 97 fails, fix it and rerun -- blocks that already completed are skipped
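Two of the items above, bounded parallelism and recovery, can be sketched together with `asyncio`. This is a simplified illustration, not Rein's engine: the `run_workflow` function is hypothetical, the set of completed blocks stands in for the SQLite state, every dependency is assumed to appear as a key in `depends_on`, and failure handling is left out.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def run_workflow(
    depends_on: dict[str, list[str]],
    run_block: Callable[[str], Awaitable[None]],
    completed: Iterable[str] = (),
    max_parallel: int = 8,
) -> list[str]:
    """Run blocks in dependency order, never more than max_parallel at once."""
    done = set(completed)                         # finished in an earlier run
    semaphore = asyncio.Semaphore(max_parallel)   # caps concurrent blocks
    finished = {name: asyncio.Event() for name in depends_on}
    executed: list[str] = []

    async def worker(name: str) -> None:
        # Wait until every upstream block has signalled completion.
        for dep in depends_on[name]:
            await finished[dep].wait()
        if name not in done:                      # recovery: skip completed work
            async with semaphore:
                await run_block(name)
            done.add(name)
            executed.append(name)
        finished[name].set()

    await asyncio.gather(*(worker(name) for name in depends_on))
    return executed
```

Calling `asyncio.run(run_workflow(graph, call_model, completed={"opening_mathematician"}))` would rerun everything except the block already marked complete, where `call_model` is whatever coroutine actually executes a block.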

Three layers, all text files:

Specialists (Markdown)  ->  what each agent does
Teams (YAML)            ->  groups + shared tone
Workflows (YAML)        ->  execution flow

A specialist is 10 lines of Markdown defining a persona. A team groups specialists and enforces style ("respond in JSON, 1-2 sentences max"). A workflow defines the dependency graph.

Rein is not an agent framework. There's no memory management, no role-playing abstractions, no conversation loops. It's a workflow runner: you describe what happens in what order, and it executes it. If you want LangGraph's stateful graphs or CrewAI's autonomous agents, those solve different problems. Rein solves "I have 97 steps with dependencies and I need them to run correctly."

It works with any LLM: Claude, GPT, Ollama (local, free), OpenRouter. Set provider: anthropic in YAML and an API key in the environment. The deliberation ran on Claude Sonnet -- 97 API calls total. You can mix models within a single workflow: cheap model for initial opinions, expensive one for synthesis.

What I learned

Specialist prompts matter more than architecture. I rewrote the "critic" specialist seven times. The first version produced vague praise ("good work, minor issues"). The second was too harsh -- it rejected everything. Versions three through six oscillated between the two. The seventh finally worked: I gave it a scoring rubric with explicit thresholds. The lesson applies everywhere: a badly defined specialist produces garbage regardless of how sophisticated the workflow is.
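An explicit threshold also gives a correction loop a clear exit condition. Here is a minimal Python sketch of a score-and-revise loop with a `max_runs` cap. The `correction_loop` helper, the 0-10 scale, and the default threshold of 8 are illustrative assumptions, not the rubric used in the actual critic specialist.

```python
from typing import Callable


def correction_loop(
    draft: str,
    score: Callable[[str], float],
    revise: Callable[[str, float], str],
    pass_score: float = 8.0,   # hypothetical threshold on a 0-10 rubric
    max_runs: int = 3,         # safety valve: never score more than this many times
) -> tuple[str, float, int]:
    """Score a draft and revise it until it passes or max_runs is reached."""
    if max_runs < 1:
        raise ValueError("max_runs must be at least 1")
    for run in range(1, max_runs + 1):
        current = score(draft)
        # Stop on a passing score, or on the last allowed run even if it failed.
        if current >= pass_score or run == max_runs:
            return draft, current, run
        draft = revise(draft, current)
    raise AssertionError("unreachable: the loop always returns")
```

The caller can tell a pass from a forced stop by comparing the returned score with `pass_score`, which is the point of a hard threshold: "good enough" stops being a judgment call inside the prompt.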

The Skeptic was the best investment. Adding a specialist whose job is to disagree forced every other agent to sharpen their arguments. Without the Skeptic, the deliberation would have been eight agents agreeing politely. With the Skeptic, it became an actual discussion.

YAML is the right call. I tried Python DSLs, JSON, even a custom language. YAML is boring. Everyone knows it, it diffs well in git, and you can read a 500-line workflow without documentation. Boring tools win.

Physical grounding broke the deadlock. The philosophical arguments went in circles until the Physicist pointed at actual stones. In multi-agent systems, you need at least one specialist who anchors abstract reasoning to concrete evidence.

The Child was surprisingly effective. "It's 4, everybody knows that" is not a sophisticated argument. But it kept cutting through philosophical hedging and reminding everyone that context matters. Sometimes the simplest perspective is the most powerful.

Want to try it?

[Image: Rein orchestrator executing a multi-agent deliberation workflow]

If you've ever copy-pasted a 3000-token prompt trying to make one model do five things at once, this is the workflow engine you wanted. Three text files: a specialist, a team, a workflow. That's the entire mental model.

GitHub: github.com/dklymentiev/rein-orchestrator

pip install git+https://github.com/dklymentiev/rein-orchestrator
export ANTHROPIC_API_KEY=sk-...

cd examples/01-hello-world
rein --agents-dir ./agents workflow.yaml --no-ui

# workflow.yaml -- three specialists, two dependencies
provider: anthropic
model: claude-sonnet-4-20250514
team: content-team

blocks:
  - name: research
    specialist: researcher
    prompt: "Analyze: {{ task.input.topic }}"

  - name: write
    specialist: writer
    depends_on: [research]
    prompt: "Write based on: {{ research.json }}"

  - name: review
    specialist: critic
    depends_on: [write]
    prompt: "Review: {{ write.json }}"

Ten progressive examples in the repo, from hello-world to multi-phase deliberation. Works with Claude, GPT, Ollama (local/free), OpenRouter (100+ models). Also runs as an MCP server:

claude mcp add rein -- rein-mcp

Open source. MIT license. Questions and contributions welcome on GitHub.