When LLM Agents Break: Why I Ended Up with a DAG and a Watchdog

At some point, I stopped trying to improve agents.

Not because they are bad, but because they fail in very predictable ways, regardless of the model, prompt, or role.

The problems always looked the same:

  • an agent hangs and goes silent
  • an agent "thinks" forever
  • a pipeline breaks in the middle
  • retries turn into chaos
  • after 20 minutes, it's unclear where things are stuck

Trying to fix this with more "intelligence" didn't help. Planners, self-reflection, better prompts — all of that looks nice, but it doesn't make execution reliable.


A shift in perspective

Eventually, I stopped treating an agent as an intelligent entity.

I started treating it as a process.

A process has a few objective signals:

  • it is running or it is not
  • it produces output or it is silent
  • its log grows or it doesn't
  • it exits with code 0 or with an error

Everything else is interpretation.

If an agent doesn't write to its log, I don't care whether it is "thinking" or stuck. From the system's point of view, it's the same thing.
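
In code, the list of objective signals is short. A sketch, assuming a subprocess.Popen handle for the agent and a known log path (both hypothetical here):

    import os

    def signals(proc, log_path):
        """The only facts the system trusts about an agent."""
        return {
            "alive": proc.poll() is None,           # running or not
            "log_bytes": os.path.getsize(log_path)
                         if os.path.exists(log_path) else 0,  # output or silence
            "exit_code": proc.returncode,           # None until the process exits
        }

Comparing two snapshots of log_bytes answers "is the log growing". Everything beyond these facts is, as above, interpretation.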


Agent as a Unix process

I ended up with a very simple contract:

  • an agent is an executable
  • it accepts arguments
  • it writes to stdout/stderr
  • it returns an exit code

That's it.

No special trust. No assumptions about honesty or self-awareness.

If it hangs, it can be killed. If it crashes, it can be restarted. If it lies, that's irrelevant: I only look at physical signals.
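
A minimal sketch of that contract in Python. The agent binary, its arguments, and the log path are all made up; any executable that honors the contract works the same way:

    import subprocess

    log = open("logs/research.log", "ab")  # append-only; the watchdog reads this file

    try:
        proc = subprocess.run(
            ["./agents/research", "--task", "collect-sources"],
            stdout=log,
            stderr=subprocess.STDOUT,
            timeout=600,  # hard wall-clock bound; a hung process gets killed
        )
        ok = proc.returncode == 0  # the only success signal that matters
    except subprocess.TimeoutExpired:
        ok = False  # it hung; subprocess.run already killed it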


Why a DAG instead of a chain

Linear chains of agents are fragile.

Real workflows almost always look like this:

  • some steps can run in parallel
  • some must be synchronized
  • some stages are optional
  • others are critical

A DAG is not about "workflow tooling". It's about constraining chaos.

It answers simple questions:

  • what must happen first
  • what can happen in parallel
  • where things converge
  • what counts as success

Without this structure, pipelines slowly drift into undefined behavior.
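
The mechanics are small. A sketch with made-up step names, where each step declares what it depends on:

    # Steps and their dependencies form the DAG.
    deps = {
        "fetch": set(),
        "parse": {"fetch"},
        "summarize": {"parse"},
        "extract-links": {"parse"},                # parallel with summarize
        "report": {"summarize", "extract-links"},  # convergence point
    }

    def ready(done: set[str]) -> list[str]:
        """Steps whose dependencies are all satisfied and that haven't run yet."""
        return [step for step, d in deps.items()
                if step not in done and d <= done]

    # ready(set())              -> ["fetch"]
    # ready({"fetch", "parse"}) -> ["summarize", "extract-links"]

"What counts as success" then reduces to: the sink steps ("report" here) exited with code 0.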


The watchdog matters more than orchestration

The most important component in the whole setup is not the DAG.

It's the watchdog.

The watchdog doesn't ask the agent how it feels. It doesn't read reasoning traces. It doesn't trust reported status.

It looks at facts:

  • is the process alive?
  • is the log file growing?
  • when was the last update?

If the log doesn't change for several minutes, the process is considered stalled. It gets killed and restarted.

This is crude — but it works.
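
The whole check fits in a few lines. A sketch, with an illustrative stall threshold:

    import os, time

    STALL_SECONDS = 300  # "several minutes"; tune per pipeline

    def is_stalled(log_path: str) -> bool:
        """Stalled = the log file hasn't been written to recently."""
        try:
            last_write = os.path.getmtime(log_path)
        except FileNotFoundError:
            return True  # no log at all is also a bad sign
        return time.time() - last_write > STALL_SECONDS

A process that is alive (proc.poll() is None) but stalled by this definition gets killed and restarted. The watchdog never needs to know why it went silent.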


Why YAML and not more code

I didn't rewrite agents into a unified framework.

Instead, I described execution in a manifest:

  • which command to run
  • with which arguments
  • what it depends on
  • where it logs
  • how long it is allowed to run

YAML describes what happens, not how it happens.

All execution logic lives in the orchestrator. The manifest stays declarative and versioned.

If the pipeline changes, the YAML changes — not the agents.
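
The field names below are illustrative rather than a fixed schema, but a manifest in this spirit covers the whole list above:

    steps:
      fetch:
        command: ./agents/fetch
        args: ["--source", "inbox"]
        log: logs/fetch.log
        timeout: 10m
      summarize:
        command: ./agents/summarize
        depends_on: [fetch]
        log: logs/summarize.log
        timeout: 20m

Nothing in it says how a step runs; that knowledge stays in the orchestrator.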


A minimal mental model

Visually, the idea looks like this:

[Figure: agent orchestration architecture]

The orchestrator decides what should run. The watchdog verifies what is actually happening.
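
That division of labor is a single loop. A sketch reusing ready() and is_stalled() from above; start(), log_path(), and all_done() are hypothetical helpers:

    import time

    done, running = set(), {}

    while not all_done(done):
        # Orchestrator: decide what should run.
        for step in ready(done):
            if step not in running:
                running[step] = start(step)  # returns a subprocess.Popen

        # Watchdog: verify what is actually happening.
        for step, proc in list(running.items()):
            if proc.poll() is not None:          # exited
                if proc.returncode == 0:
                    done.add(step)               # success unblocks dependents
                running.pop(step)                # a failed step becomes ready again
            elif is_stalled(log_path(step)):     # alive but silent
                proc.kill()
                running[step] = start(step)

        time.sleep(2)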


What this is not

To be explicit about the boundaries:

  • this is not an Agent OS
  • not autonomous agents
  • not a reasoning framework

It's an execution and supervision layer.

It doesn't make agents smarter. It makes the process predictable.


Closing thought

I didn't try to solve reasoning.

I solved a simpler problem:

being able to start, stop, restart agent pipelines
and always know where and why they broke

For that, a DAG and a watchdog turned out to be enough.

Sometimes you don't need more intelligence. Sometimes you just need control.
