Health

Eight layers. Two classes of failure. One discipline: push and pull.

Multi-agent systems fail in two completely different ways. The rules drift from their schema and an agent acts on a bad instruction. Or the system runs but stops actually doing anything, and nobody notices for days.

Both kinds of failure are silent. Both compound. Both kill trust in the system if they are caught after the fact instead of at the source. Persek OS runs eight layers of checks across two classes (change-time gates and runtime observability) to catch each kind where it actually starts.

← Back to Persek OS Next subsystem · Knowledge →

x:860 y:0 Two failures / 02

What we're catching

Two families of failure.

Static rot and operational silence. Different shapes, different timescales, different fixes.

Static rot

The markdown that defines an agent's behavior drifts from its schema. A field gets renamed but a few rule files don't get updated. The agent loads the stale file, runs on a bad instruction, and produces a wrong decision mid-session. Caught at change-time, before the bad rule lands.

Operational silence

The system runs every night but stops actually producing output. Process liveness reports "ran," but the file it should have written isn't there. Or it's there but it hasn't changed in three days. Process tells you it's alive; output tells you it's working. Both have to be checked.

x:0 y:720 Static layers / 03

Change-time gates

Static layers.

Three layers that catch rot before a bad rule ever loads into a session.

Credit The static-layer framing came from this post by @nvk on X. Direct lift, adapted for agent-rule markdown instead of code.

L1 · Structural

The fastest layer. Does the file parse? Does the schema hold? Pattern: each rule has a golden fixture that should pass and a defect fixture that should fail. Both directions verified. Blocks the merge on failure.

L2 · Semantic

An LLM-assisted check. The rule still parses, but does it still mean what it said before the edit? Categories: behavior preservation, no contradictions with adjacent rules, no silent regressions. Runs on the PR.

L3 · Workflow (deferred)

End-to-end multi-turn agent scenarios. Test full workflows as units, not just rule files. Deferred. Needs tooling improvements first. Tracked, not forgotten.

x:880 y:720 Operational layers / 04

Runtime observability

Operational layers.

Five layers that catch silence after the system is running.

L0 · Output checks

The most basic check. Did the system actually produce the expected output? Expected outputs have to be present and current.

L1 · Cost checks

Did this run cost roughly what it usually costs? Cost drift can indicate a prompt or scope problem, so it gets reviewed early.

L2 · Quality checks

Are the agents still good at the things they're supposed to be good at? Nightly evals that compare current agent output against pinned expected behavior across a battery of prompts. Catches drift that slips past structural and semantic checks.

L3 · Trend checks

Is the trendline of system health up or down? Aggregates the discrete events from the lower layers into a regular trend review. Where is signal getting noisier? Where is quality slipping? Where are we stable?

L4 · Freshness checks

Is what an agent loaded into context actually current, or is it operating on a stale snapshot of the system? Catches the drift between change-time and load-time. The gap where an L1/L2 fix has shipped but an active session is still running on the old version.

x:0 y:1700 Discipline / 05

Push and pull

One discipline that ties the layers together.

Layers without discipline produce alert fatigue. The push/pull rule keeps signal high and noise low.

Push: only the things that need a response right now push to me. A serious structural failure, cost anomaly, or missing critical output. These interrupt.

Pull: everything else lives in a regular review. The slow trends, entries flagged for review, and freshness state are pulled together in one place when I am ready to look.

The discipline is what prevents the eight layers from becoming eight notification streams. Most signal is for me to notice, not to react to. Pull is the default. Push is the exception, reserved for things that genuinely cannot wait until morning.

→next dragpan 0fit all Eschome