Agent Orchestration Service

The Agent Orchestration Service transforms Substrate from a passive governance tool into an active automation platform. It defines, executes, and audits multi-step agentic workflows that coordinate the other five Substrate services — Ingestion, Governance, Reasoning, Proactive Maintenance, and Simulation — toward completing tasks that require both AI judgment and human approval.

Responsibility

Manage agent workflow state machines, enforce human-in-the-loop (HITL) approval gates, maintain an immutable audit trail of every action taken by any agent or human in the system, and provide rollback capability when workflows fail or are rejected.


What Makes This Different

Substrate is not a chatbot. It is a state machine.

The difference between a conventional AI assistant and Substrate's Agent Orchestration Service is the difference between "Substrate detected a violation" and:

"Substrate detected the violation, explained it using the relevant ADR and post-mortem, waited for developer approval, generated a fix using Qwen2.5-Coder, verified the fix resolves the violation without introducing new ones via the Simulation Service, opened the Fix PR on GitHub, waited for developer merge approval, re-ingested the merged code, confirmed the original violation is resolved, and logged the complete decision trail with timestamps, confidence scores, and human approver identities."

Every workflow is a deterministic state machine. Every transition is logged. Every human decision is captured. Every automated action is reversible.


Workflow Architecture

State Machine Model (AOC-01)

Each workflow is defined as a state machine with:

  • States: Named stages in the workflow (e.g., Explaining, AwaitingApproval, GeneratingFix, Verifying, OpeningPR, AwaitingMerge, Confirming, Complete)
  • Transitions: Allowed state-to-state movements, each triggered by an event or guard condition
  • Guards: Boolean conditions that must be true for a transition to fire (e.g., "simulation passed", "confidence > 0.9", "human approved")
  • Triggers: Events from NATS, human approvals, webhook callbacks, or timer expiry

Workflow state is persisted to PostgreSQL at every transition. If the Orchestration Service restarts, all in-flight workflows resume from their persisted state.

workflow_id       UUID PRIMARY KEY
workflow_type     TEXT
current_state     TEXT
context           JSONB   -- all state data carried through the workflow
created_at        TIMESTAMPTZ
updated_at        TIMESTAMPTZ
created_by        TEXT    -- actor who triggered the workflow
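
The model above can be sketched in a few lines of Python. This is a minimal illustration rather than the service's actual implementation: the `Transition` and `Workflow` classes and the trigger names (`explanation_ready`, `simulation_done`, and so on) are hypothetical, and persistence is reduced to a comment.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Context = dict[str, Any]
Guard = Callable[[Context], bool]


@dataclass(frozen=True)
class Transition:
    source: str
    trigger: str                      # event name (hypothetical naming)
    target: str
    guard: Guard = lambda ctx: True   # boolean condition; always true by default


@dataclass
class Workflow:
    transitions: list[Transition]
    current_state: str
    context: Context = field(default_factory=dict)

    def fire(self, trigger: str) -> bool:
        """Advance if a transition matches the current state, trigger, and guard."""
        for t in self.transitions:
            if (t.source == self.current_state
                    and t.trigger == trigger
                    and t.guard(self.context)):
                self.current_state = t.target
                # A real service would persist current_state and context here.
                return True
        return False


# Hypothetical fragment of a fix-PR workflow definition.
FIX_PR = [
    Transition("Explaining", "explanation_ready", "AwaitingApproval"),
    Transition("AwaitingApproval", "human_approved", "GeneratingFix"),
    Transition("GeneratingFix", "fix_generated", "Verifying"),
    Transition("Verifying", "simulation_done", "OpeningPR",
               guard=lambda ctx: ctx.get("simulation_passed", False)),
]
```

Because a trigger that has no matching transition simply returns `False`, an out-of-order event leaves the workflow where it was instead of advancing it.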

Human Approval Gates (AOC-02)

At designated approval gate states, the workflow suspends execution and sends a notification to the assigned owner via:

  • Slack DM with a deep link to the approval UI
  • In-app notification badge in the Substrate dashboard

The owner sees the full context: the violation or proposed action, the AI's reasoning and confidence score, the simulation result (if applicable), and the institutional memory context. They can:

  • Approve: Workflow advances to the next state
  • Reject with reason: Workflow transitions to a Rejected terminal state; reason is logged
  • Request modification: Workflow returns to a configurable earlier state with the modification note attached
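
The three outcomes can be expressed as a small pure function. The sketch below is illustrative only: the `Decision` enum, the function name, and the context keys `rejection_reason` and `modification_note` are assumptions, not names taken from the service.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    REQUEST_MODIFICATION = "request_modification"


def apply_gate_decision(decision: Decision, *, next_state: str, rework_state: str,
                        context: dict, note: Optional[str] = None) -> tuple[str, dict]:
    """Return (new_state, updated_context) for a decision made at an approval gate."""
    updated = dict(context)  # copy so the caller's context is left untouched
    if decision is Decision.APPROVE:
        return next_state, updated
    if decision is Decision.REJECT:
        if not note:
            raise ValueError("a rejection must carry a reason")
        updated["rejection_reason"] = note      # hypothetical context key
        return "Rejected", updated
    # REQUEST_MODIFICATION: go back to the configured earlier state with the note.
    updated["modification_note"] = note         # hypothetical context key
    return rework_state, updated
```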

If no response is received within 24 hours, an escalation is triggered: the next person in the service's ownership chain is notified. Escalations continue on the standard 24-hour cadence until a decision is made or the workflow times out at 7 days and enters a Timed Out terminal state.
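
The escalation timing reduces to simple interval arithmetic. The following sketch assumes a function name and signature of its own, and it assumes that once the ownership chain is exhausted the last owner remains the notified party; the text does not specify that case.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence

ESCALATION_INTERVAL = timedelta(hours=24)   # from the text: 24-hour cadence
WORKFLOW_TIMEOUT = timedelta(days=7)        # from the text: 7-day timeout


def gate_status(suspended_at: datetime, now: datetime,
                ownership_chain: Sequence[str]) -> tuple[str, Optional[str]]:
    """Return (state, owner currently notified) for a suspended gate."""
    elapsed = now - suspended_at
    if elapsed >= WORKFLOW_TIMEOUT:
        return "Timed Out", None
    # timedelta // timedelta yields the number of whole 24-hour periods elapsed.
    level = elapsed // ESCALATION_INTERVAL
    # Assumption: once the chain is exhausted, the last owner stays notified.
    owner = ownership_chain[min(level, len(ownership_chain) - 1)]
    return "AwaitingApproval", owner
```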

Confidence-Based Auto-Proceed (AOC-05)

For high-confidence workflows, human gates can be bypassed. When every step in a workflow carries a confidence score above 90%, and the workflow type is configured for auto-proceed, the workflow executes end-to-end without waiting for human approval at each gate.

Auto-proceed is:

  • Configurable per workflow type (not globally)
  • Configurable per team (some teams may require manual approval regardless of confidence)
  • Logged in the same audit format as human-approved workflows, except that the audit record indicates approved_by: system (auto-proceed, confidence: 0.94) rather than a human approver ID
  • Reversible: the team can disable auto-proceed for a workflow type at any time
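
One way to read these rules is as a single predicate over the workflow type, the owning team, and the per-step confidence scores. The sketch below is an assumption about how such a check could look; `AutoProceedPolicy` and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

CONFIDENCE_THRESHOLD = 0.90  # from the text: every step must score above 90%


@dataclass(frozen=True)
class AutoProceedPolicy:
    enabled_workflow_types: frozenset[str]   # opt-in per workflow type
    manual_only_teams: frozenset[str]        # teams that always require a human


def can_auto_proceed(policy: AutoProceedPolicy, workflow_type: str, team: str,
                     step_confidences: Sequence[float]) -> bool:
    """True only when the type is opted in, the team allows it, and every step qualifies."""
    if workflow_type not in policy.enabled_workflow_types:
        return False
    if team in policy.manual_only_teams:
        return False
    # An empty score list never qualifies; all() alone would return True for it.
    return bool(step_confidences) and all(
        score > CONFIDENCE_THRESHOLD for score in step_confidences
    )
```

Disabling auto-proceed for a workflow type then amounts to removing it from `enabled_workflow_types`; nothing in the workflow definition itself changes.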

Rollback Capability (AOC-04)

If a later step in a workflow fails or a human rejects the outcome, the service reverses all automated actions taken by preceding steps. Rollback actions are determined at workflow definition time and stored as the inverse of each automated step:

  • A graph mutation is rolled back by applying its inverse mutation
  • An opened GitHub PR is closed
  • A generated fix branch is deleted
  • A posted PR comment is edited to indicate the action was withdrawn

Rollback is executed as a new forward action, not by time-travelling the database — each rollback step is itself logged in the audit trail.
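
This is the familiar compensating-action pattern. A minimal sketch follows, with hypothetical names (`Step`, `run_with_rollback`) and a plain list standing in for the audit trail. Undoing steps in reverse order is an assumption; the text only says that all preceding automated actions are reversed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    inverse: Callable[[], None]   # declared up front, at workflow definition time


def run_with_rollback(steps: list[Step], audit_log: list[str]) -> bool:
    """Run steps in order; on failure, undo the completed ones as new logged actions."""
    completed: list[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            audit_log.append(f"failed:{step.name}:{exc}")
            # Assumption: compensate in reverse order of execution.
            for done in reversed(completed):
                done.inverse()
                audit_log.append(f"rollback:{done.name}")
            return False
        audit_log.append(f"did:{step.name}")
        completed.append(step)
    return True
```

Each inverse runs as an ordinary forward call and adds its own log entry, so the history only ever grows.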


The Fix PR Generation Workflow

The Fix PR Generation Workflow is the canonical example of Substrate's agentic capability. It is a 10-step process that takes a governance violation from detection to verified resolution with full human oversight.

Step-by-Step Execution

Step 1 — Violation Explanation

The Reasoning Service is called with the violation details. It retrieves the relevant ADR context, post-mortem history, and any prior exceptions for the same policy and service. The Dense explain-lora adapter produces a plain English explanation.

Output: A structured explanation document with evidence nodes, confidence score, and institutional memory links.

Step 2 — HITL Gate: Developer Approval to Proceed

The workflow suspends. The developer receives a Slack DM and in-app notification:

"PaymentService violates substrate/api-gateway-first: CALLS edge to AuthService bypasses the declared API gateway. This pattern caused the October 2024 auth cascade failure (PM-019). Substrate can generate a fix — approve to proceed."

The developer reviews the explanation and approves or rejects. No code is generated until this approval is received.

Step 3 — Fix Generation

The Governance Service packages the violation details and the relevant code context (the files containing the violating edge, the gateway import path, the service's CODEOWNERS). Qwen2.5-Coder (DGX Spark port 8002) generates a fix diff: the minimal code change that removes the boundary violation.

Output: A git diff in unified format, with file path, line ranges, and the proposed change.

Step 4 — Simulation Verification

The Simulation Service is called with the fix diff translated into a mutation spec. It clones the sandbox, applies the mutation (the change that the fix diff would make to the graph), runs the full OPA policy suite, and returns the structured delta.

Expected result: The substrate/api-gateway-first violation is in policies_newly_satisfied. No new violations appear in policies_newly_violated.

If the simulation shows new violations, the workflow returns to Step 3. The failure reason (which new violation would be introduced) is appended to the context so Qwen2.5-Coder can generate a revised fix. The retry loop is bounded at 3 attempts before the workflow enters a Manual Resolution Required state.
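
The generate-and-verify loop of Steps 3 and 4 can be sketched as follows. The callables `generate_fix` and `simulate` are hypothetical stand-ins for the Qwen2.5-Coder and Simulation Service calls, and the feedback strings are illustrative. The sketch also retries when the original violation is not resolved even though no new violation appears, which is an assumption the text does not state.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # from the text: the retry loop is bounded at 3 attempts


def generate_verified_fix(
    violated_policy: str,
    generate_fix: Callable[[list[str]], str],
    simulate: Callable[[str], dict],
) -> Optional[str]:
    """Return a verified diff, or None when manual resolution is required."""
    feedback: list[str] = []
    for _ in range(MAX_ATTEMPTS):
        diff = generate_fix(feedback)
        delta = simulate(diff)
        introduced = delta["policies_newly_violated"]
        resolved = violated_policy in delta["policies_newly_satisfied"]
        if resolved and not introduced:
            return diff
        # Carry the failure reason into the next generation attempt.
        if introduced:
            feedback.append(f"fix would newly violate: {', '.join(introduced)}")
        else:
            feedback.append(f"fix does not resolve {violated_policy}")
    return None
```

A `None` result corresponds to the Manual Resolution Required state.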

Step 5 — GitHub PR Opening

Using the GitHub API, the service creates a new branch from the PR's base branch, commits the fix diff, and opens a pull request. The PR body includes:

  • The violation explanation from Step 1
  • The simulation result from Step 4 (showing the violation will be resolved)
  • A link to the Substrate workflow audit trail
  • The institutional memory context (relevant ADR, post-mortem reference)

Step 6 — Fix PR Lifecycle Monitoring

The Governance Service sets up a watch on the Fix PR's check_run events. It monitors whether the Fix PR passes all CI checks independently.

Step 7 — HITL Gate: Developer Merge Approval

The workflow suspends again. The developer reviews the Fix PR in GitHub. They can:

  • Merge: Workflow advances to Step 8
  • Request changes: The agent receives the review comment, returns to the Step 3 generation loop to regenerate the fix, and pushes an updated commit to the Fix PR branch
  • Close without merging: Workflow transitions to Rejected by Developer; rationale is captured from the closing comment

Step 8 — Re-ingestion

On Fix PR merge, the Ingestion Service webhook fires. The Ingestion Service re-parses the changed files, computes the incremental graph delta, and publishes the update to NATS. The graph is updated with the new state.

Step 9 — Violation Confirmation

The Governance Service re-evaluates the policy that was violated, now against the updated graph. If the violation is resolved, the violation record in PostgreSQL is updated to resolution_status: resolved with a resolved_at timestamp.

If the violation is still present (the fix did not resolve it as expected), the workflow enters a Verification Failed state and the developer is notified with the discrepancy.

Step 10 — Audit Trail Completion

The Agent Orchestration Service marks the workflow Complete and writes the final audit record. The complete 10-step trace — every action, every model call, every human decision, every timestamp and confidence score — is committed to the immutable audit log.


Immutable Audit Trail (AOC-03)

Every action in every workflow produces an audit record. Audit records are written to an append-only PostgreSQL table with no UPDATE or DELETE permissions granted to any application role:

audit_id             UUID PRIMARY KEY
workflow_id          UUID REFERENCES workflows(workflow_id)
workflow_step        TEXT
actor_id             TEXT    -- human user ID or "system:<service_name>"
action_type          TEXT    -- see action type taxonomy below
timestamp            TIMESTAMPTZ
input_payload        JSONB
output_result        JSONB
confidence_score     FLOAT
human_approver_id    TEXT    -- NULL if automated; user ID if HITL gate
graph_mutation_id    UUID    -- NULL if no graph change; mutation ID if graph was modified
reasoning_trace      TEXT    -- plain English explanation of why this action was taken
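
For illustration, the record shape above can be mirrored in application code as an immutable value plus a store that exposes only append and read operations. This is a sketch under assumptions: `AuditRecord` and `AuditLog` are hypothetical names, and the in-memory list merely stands in for the append-only PostgreSQL table.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional
from uuid import UUID, uuid4


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class AuditRecord:
    workflow_id: UUID
    workflow_step: str
    actor_id: str                 # human user ID or "system:<service_name>"
    action_type: str
    input_payload: dict[str, Any]
    output_result: dict[str, Any]
    reasoning_trace: str
    confidence_score: Optional[float] = None
    human_approver_id: Optional[str] = None   # None if automated
    graph_mutation_id: Optional[UUID] = None  # None if no graph change
    audit_id: UUID = field(default_factory=uuid4)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class AuditLog:
    """In-memory stand-in for the table: append and read only, no update or delete."""

    def __init__(self) -> None:
        self._records: list[AuditRecord] = []

    def append(self, record: AuditRecord) -> None:
        self._records.append(record)

    def for_workflow(self, workflow_id: UUID) -> tuple[AuditRecord, ...]:
        return tuple(r for r in self._records if r.workflow_id == workflow_id)
```

In the database itself the guarantee comes from the role permissions described above, not from application code; the frozen dataclass only keeps a record from being altered between creation and insert.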

Action Type Taxonomy

| Action Type | Description |
| --- | --- |
| Extraction | Entity or knowledge extracted from a source document |
| Simulation | Sandbox simulation executed; mutation applied and evaluated |
| Code Generation | Fix diff generated by Qwen2.5-Coder |
| Graph Mutation | Node or edge added, removed, or updated in production graph |
| HITL Gate: Approved | Human approved a workflow gate |
| HITL Gate: Rejected | Human rejected a workflow gate |
| Policy Evaluation | OPA policy suite evaluated against subgraph |
| Explanation Generated | Violation or decision explanation produced |
| PR Opened | GitHub PR created by the agent |
| PR Merged | GitHub PR merged (webhook-confirmed) |
| Rollback | Preceding automated action reversed |
| Escalation | Approval gate escalated to next owner in chain |
| Auto-Proceed | HITL gate bypassed due to confidence threshold met |

Cross-Service Coordination (AOC-06)

The Agent Orchestration Service makes all inter-service calls in the defined workflow sequence. It is the only service that calls other Substrate services in an orchestrated chain. The service-to-service call order is enforced by the state machine definition, not by ad-hoc coupling between services.

Service calls are made via internal HTTP APIs (not NATS) for synchronous steps that require a response before the workflow can advance. For asynchronous steps (e.g., waiting for an ingestion event to complete), the service subscribes to the relevant NATS subject and advances the state machine when the expected event arrives.
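
The two calling styles can be contrasted with a small asyncio sketch. Everything here is a stand-in: `InMemoryBus` imitates a subscription and is not the NATS client API, the `simulate` coroutine replaces an internal HTTP call, and the subject name `ingestion.completed` is hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable


class InMemoryBus:
    """Stand-in for a message-bus subscription; not the NATS client API."""

    def __init__(self) -> None:
        self._waiters: dict[str, asyncio.Future] = {}

    def expect(self, subject: str) -> asyncio.Future:
        fut = asyncio.get_running_loop().create_future()
        self._waiters[subject] = fut
        return fut

    def publish(self, subject: str, payload: Any) -> None:
        fut = self._waiters.pop(subject, None)
        if fut is not None and not fut.done():
            fut.set_result(payload)


async def run_step(call: Callable[[], Awaitable[Any]]) -> Any:
    """Synchronous step: the workflow waits for the service's response."""
    return await call()


async def await_event(bus: InMemoryBus, subject: str, timeout: float) -> Any:
    """Asynchronous step: the workflow parks until the expected event arrives."""
    return await asyncio.wait_for(bus.expect(subject), timeout)


async def demo() -> tuple[Any, Any]:
    bus = InMemoryBus()

    async def simulate() -> dict:          # hypothetical internal HTTP call
        return {"passed": True}

    sim = await run_step(simulate)
    waiter = asyncio.create_task(await_event(bus, "ingestion.completed", 1.0))
    await asyncio.sleep(0)                 # let the waiter register its subscription
    bus.publish("ingestion.completed", {"status": "done"})
    return sim, await waiter
```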


Workflow Status Events (AOC-07)

At every state transition, the Agent Orchestration Service publishes a WorkflowStatusEvent to NATS under the substrate.workflows.<workflow_id>.status subject. The event includes:

{
  "workflow_id": "wf_abc123",
  "workflow_type": "fix_pr_generation",
  "previous_state": "GeneratingFix",
  "current_state": "Verifying",
  "timestamp": "2026-03-24T14:23:01Z",
  "progress_percent": 40,
  "awaiting_human": false,
  "message": "Simulation in progress — verifying fix does not introduce new violations"
}

The UI consumes these events via WebSocket and displays a live progress timeline of the workflow.
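
A publisher might assemble the subject and payload as shown below. The function names and signature are assumptions made for illustration; the field names, the subject pattern, and the timestamp format follow the example event above.

```python
import json
from datetime import datetime, timezone


def status_subject(workflow_id: str) -> str:
    """Subject pattern from the text: substrate.workflows.<workflow_id>.status"""
    return f"substrate.workflows.{workflow_id}.status"


def build_status_event(workflow_id: str, workflow_type: str, previous_state: str,
                       current_state: str, progress_percent: int,
                       awaiting_human: bool, message: str, at: datetime) -> bytes:
    """Serialize a WorkflowStatusEvent as UTF-8 JSON, ready to publish."""
    if not 0 <= progress_percent <= 100:
        raise ValueError("progress_percent must be between 0 and 100")
    event = {
        "workflow_id": workflow_id,
        "workflow_type": workflow_type,
        "previous_state": previous_state,
        "current_state": current_state,
        # Normalize to UTC and render in the same format as the example event.
        "timestamp": at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "progress_percent": progress_percent,
        "awaiting_human": awaiting_human,
        "message": message,
    }
    return json.dumps(event).encode("utf-8")
```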


Daily Activity Feed (AOC-08)

Each team member's Substrate-related agent actions, approvals, and decisions are aggregated into a daily activity feed. The feed is persisted in PostgreSQL and accessible via the Substrate dashboard and REST API. It includes:

  • Approval gates actioned (approved or rejected, with timestamps)
  • Workflows triggered by the user's PRs
  • Escalations the user received
  • Auto-proceed events on workflows the user owns
  • Violations resolved through agent workflows on their services

This provides a compliance-grade record of every developer's interaction with Substrate's governance automation, without requiring manual logging.
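
If the feed is derived from audit-style records, the aggregation is a group-by on actor and calendar day. The sketch below rests on that assumption; the text does not say how the feed is built, and both the function name and the choice to bucket days in UTC are hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime, timezone


def daily_activity_feed(records: list[dict]) -> dict[tuple[str, date], list[str]]:
    """Group records by (actor_id, UTC calendar day), ordered by timestamp."""
    feed: dict[tuple[str, date], list[str]] = defaultdict(list)
    for record in sorted(records, key=lambda r: r["timestamp"]):
        # Assumption: days are bucketed in UTC, not the user's local timezone.
        day = record["timestamp"].astimezone(timezone.utc).date()
        feed[(record["actor_id"], day)].append(record["action_type"])
    return dict(feed)
```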


Functional Requirements

| ID | Requirement | Priority |
| --- | --- | --- |
| AOC-01 | Define workflow as a state machine (states, transitions, triggers, guards); persist state to PostgreSQL | Must Have |
| AOC-02 | Human approval gates: suspend workflow; notify owner via Slack + in-app; timeout escalation if no response in 24h | Must Have |
| AOC-03 | Complete immutable audit trail: every agent action timestamped with actor, action type, input, output, confidence, and reasoning | Must Have |
| AOC-04 | Rollback capability: if a later step fails or is rejected, revert all preceding automated actions | Must Have |
| AOC-05 | Confidence-based auto-proceed: workflows with every step above 90% confidence can proceed without human gates (configurable) | Must Have |
| AOC-06 | Cross-service coordination: orchestrate calls to Ingestion, Governance, Reasoning, Simulation in the correct sequence | Must Have |
| AOC-07 | Publish workflow status events to NATS for real-time UI progress display | Must Have |
| AOC-08 | Daily task log: every team member's Substrate-related agent actions, approvals, and decisions logged to their activity feed | Must Have |

Infrastructure Dependencies

| Component | Role in Agent Orchestration Service |
| --- | --- |
| PostgreSQL 16 | Workflow state persistence; immutable audit log (append-only table) |
| NATS JetStream | Receives trigger events; publishes workflow status events; listens for ingestion completion confirmations |
| Redis 7 | Workflow lock to prevent duplicate execution on restart |
| Slack API | HITL approval gate notifications and escalation messages |
| GitHub API | Fix PR creation, branch management, PR status monitoring |
| Reasoning Service | Violation explanation, institutional memory retrieval |
| Governance Service | Policy evaluation, blast radius, Fix PR diff generation coordination |
| Simulation Service | Fix verification before PR is opened |
| Ingestion Service | Re-ingestion trigger after Fix PR merge |
| DGX Spark port 8002 | Qwen2.5-Coder: fix diff generation |