Agent Orchestration Service
The Agent Orchestration Service transforms Substrate from a passive governance tool into an active automation platform. It defines, executes, and audits multi-step agentic workflows that coordinate the other five Substrate services — Ingestion, Governance, Reasoning, Proactive Maintenance, and Simulation — toward completing tasks that require both AI judgment and human approval.
Responsibility
Manage agent workflow state machines, enforce human-in-the-loop (HITL) approval gates, maintain an immutable audit trail of every action taken by any agent or human in the system, and provide rollback capability when workflows fail or are rejected.
What Makes This Different
Substrate is not a chatbot. It is a state machine.
The difference between a conventional AI assistant and Substrate's Agent Orchestration Service is the difference between "Substrate detected a violation" and:
"Substrate detected the violation, explained it using the relevant ADR and post-mortem, waited for developer approval, generated a fix using Qwen2.5-Coder, verified the fix resolves the violation without introducing new ones via the Simulation Service, opened the Fix PR on GitHub, waited for developer merge approval, re-ingested the merged code, confirmed the original violation is resolved, and logged the complete decision trail with timestamps, confidence scores, and human approver identities."
Every workflow is a deterministic state machine. Every transition is logged. Every human decision is captured. Every automated action is reversible.
Workflow Architecture
State Machine Model (AOC-01)
Each workflow is defined as a state machine with:
- States: Named stages in the workflow (e.g., Explaining, AwaitingApproval, GeneratingFix, Verifying, Opening PR, AwaitingMerge, Confirming, Complete)
- Transitions: Allowed state-to-state movements, each triggered by an event or guard condition
- Guards: Boolean conditions that must be true for a transition to fire (e.g., "simulation passed", "confidence > 0.9", "human approved")
- Triggers: Events from NATS, human approvals, webhook callbacks, or timer expiry
Workflow state is persisted to PostgreSQL at every transition. If the Orchestration Service restarts, all in-flight workflows resume from their persisted state.
workflow_id UUID PRIMARY KEY
workflow_type TEXT
current_state TEXT
context JSONB -- all state data carried through the workflow
created_at TIMESTAMPTZ
updated_at TIMESTAMPTZ
created_by TEXT -- actor who triggered the workflow
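The model above can be sketched in a few lines of Python. This is a minimal illustration rather than the service's actual implementation: the `Workflow` class, the `TRANSITIONS` table, the trigger names, and the `fire` function are all hypothetical, and persistence is left out (a real service would write `current_state` and `context` to PostgreSQL in the same step).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

Guard = Callable[[dict], bool]


@dataclass
class Workflow:
    """In-memory stand-in for one row of the workflow table (hypothetical)."""
    workflow_type: str
    current_state: str
    context: dict = field(default_factory=dict)


# (from_state, trigger) -> (to_state, guard).
# Hypothetical subset of the Fix PR workflow; trigger names are assumptions.
TRANSITIONS: Dict[Tuple[str, str], Tuple[str, Guard]] = {
    ("AwaitingApproval", "human_decision"):
        ("GeneratingFix", lambda ctx: ctx.get("human_approved") is True),
    ("GeneratingFix", "fix_generated"):
        ("Verifying", lambda ctx: "fix_diff" in ctx),
    ("Verifying", "simulation_finished"):
        ("Opening PR", lambda ctx: ctx.get("simulation_passed") is True),
}


def fire(workflow: Workflow, trigger: str) -> bool:
    """Advance the workflow if a transition exists and its guard holds.

    Returns True when the state changed, False otherwise.
    """
    entry = TRANSITIONS.get((workflow.current_state, trigger))
    if entry is None:
        return False  # no transition defined for this state/trigger pair
    to_state, guard = entry
    if not guard(workflow.context):
        return False  # guard condition not satisfied; stay in place
    workflow.current_state = to_state
    return True
```

Keeping transitions in a lookup table rather than in branching code is one way to make the allowed movements explicit and easy to audit.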
Human Approval Gates (AOC-02)
At designated approval gate states, the workflow suspends execution and sends a notification to the assigned owner via:
- Slack DM with a deep link to the approval UI
- In-app notification badge in the Substrate dashboard
The owner sees the full context: the violation or proposed action, the AI's reasoning and confidence score, the simulation result (if applicable), and the institutional memory context. They can:
- Approve: Workflow advances to the next state
- Reject with reason: Workflow transitions to a Rejected terminal state; reason is logged
- Request modification: Workflow returns to a configurable earlier state with the modification note attached
If no response is received within 24 hours, an escalation is triggered: the next person in the service's ownership chain is notified. Escalations continue on the standard 24-hour cadence until a decision is made or the workflow times out at 7 days and enters a Timed Out terminal state.
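As a rough sketch of the escalation timing, the function below works out who should currently be notified for an unanswered gate. It rests on assumptions the text does not settle: the 7-day timeout is measured from when the gate opened, and once the ownership chain is exhausted the last owner stays notified. The names `gate_status` and `ownership_chain` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple

ESCALATION_INTERVAL = timedelta(hours=24)
WORKFLOW_TIMEOUT = timedelta(days=7)


def gate_status(
    opened_at: datetime, now: datetime, ownership_chain: List[str]
) -> Tuple[str, Optional[str]]:
    """Return (state, person_to_notify) for an approval gate with no response."""
    if not ownership_chain:
        raise ValueError("ownership_chain must contain at least one owner")
    elapsed = now - opened_at
    if elapsed >= WORKFLOW_TIMEOUT:
        return ("Timed Out", None)
    # Number of full 24-hour intervals elapsed; timedelta // timedelta is an int.
    level = elapsed // ESCALATION_INTERVAL
    # Assumption: stay on the last owner once the chain runs out.
    index = min(level, len(ownership_chain) - 1)
    return ("AwaitingApproval", ownership_chain[index])
```

For example, with a chain of three owners, a gate opened 25 hours ago resolves to the second owner, and a gate opened 7 days ago resolves to the Timed Out state.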
Confidence-Based Auto-Proceed (AOC-05)
For high-confidence workflows, human gates can be bypassed. When every step in a workflow carries a confidence score above 90%, and the workflow type is configured for auto-proceed, the workflow executes end-to-end without waiting for human approval at each gate.
Auto-proceed is:
- Configurable per workflow type (not globally)
- Configurable per team (some teams may require manual approval regardless of confidence)
- Logged identically to human-approved workflows: the audit record indicates approved_by: system (auto-proceed, confidence: 0.94) rather than a human approver ID
- Reversible: the team can disable auto-proceed for a workflow type at any time
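The auto-proceed rules above could be checked along these lines. The function and parameter names are hypothetical, and two details are assumptions: configuration is modelled as plain sets, and the confidence recorded in the audit label is the lowest step score.

```python
from typing import Iterable, Set

CONFIDENCE_THRESHOLD = 0.90  # "above 90%" is read as strictly greater than 0.90


def can_auto_proceed(
    workflow_type: str,
    team: str,
    step_confidences: Iterable[float],
    enabled_types: Set[str],
    manual_only_teams: Set[str],
) -> bool:
    """Decide whether human gates may be bypassed for this workflow."""
    if workflow_type not in enabled_types:
        return False  # auto-proceed is opt-in per workflow type
    if team in manual_only_teams:
        return False  # team requires manual approval regardless of confidence
    scores = list(step_confidences)
    # Every step must clear the threshold; an empty list never qualifies.
    return bool(scores) and all(s > CONFIDENCE_THRESHOLD for s in scores)


def approver_label(step_confidences: Iterable[float]) -> str:
    """Build the approved_by value for an auto-proceeded gate.

    Assumption: the recorded confidence is the minimum across steps.
    """
    return f"system (auto-proceed, confidence: {min(step_confidences):.2f})"
```

A single step at exactly 0.90 blocks auto-proceed under this reading, since the threshold is exclusive.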
Rollback Capability (AOC-04)
If a later step in a workflow fails or a human rejects the outcome, the service reverses all automated actions taken by preceding steps. Rollback actions are determined at workflow definition time and stored as the inverse of each automated step:
- A graph mutation is rolled back by applying its inverse mutation
- An opened GitHub PR is closed
- A generated fix branch is deleted
- A posted PR comment is edited to indicate the action was withdrawn
Rollback is executed as a new forward action, not by time-travelling the database — each rollback step is itself logged in the audit trail.
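A minimal sketch of this compensating-action pattern follows. The `CompletedStep` type and `roll_back` function are hypothetical, and undoing steps newest-first is an assumption; the text only says that each step has a stored inverse and that each rollback is logged.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CompletedStep:
    """An automated step that already ran, paired with its predefined inverse."""
    name: str
    undo: Callable[[], None]  # inverse action fixed at workflow definition time


def roll_back(completed: List[CompletedStep], audit_log: List[dict]) -> None:
    """Run each inverse action as a new forward action and log it.

    Steps are undone in reverse order of execution (an assumption), and one
    audit entry is appended per undone step. Nothing is deleted from the log.
    """
    for step in reversed(completed):
        step.undo()
        audit_log.append({"action_type": "Rollback", "workflow_step": step.name})
```

Undoing in reverse order means, for instance, that a PR is closed before the branch it was opened from is deleted.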
The Fix PR Generation Workflow
The Fix PR Generation Workflow is the canonical example of Substrate's agentic capability. It is a 10-step process that takes a governance violation from detection to verified resolution with full human oversight.
Step-by-Step Execution
Step 1 — Violation Explanation
The Reasoning Service is called with the violation details. It retrieves the relevant ADR context, post-mortem history, and any prior exceptions for the same policy and service. The Dense explain-lora adapter produces a plain English explanation.
Output: A structured explanation document with evidence nodes, confidence score, and institutional memory links.
Step 2 — HITL Gate: Developer Approval to Proceed
The workflow suspends. The developer receives a Slack DM and in-app notification:
"PaymentService violates substrate/api-gateway-first: CALLS edge to AuthService bypasses the declared API gateway. This pattern caused the October 2024 auth cascade failure (PM-019). Substrate can generate a fix — approve to proceed."
The developer reviews the explanation and approves or rejects. No code is generated until this approval is received.
Step 3 — Fix Generation
The Governance Service packages the violation details and the relevant code context (the files containing the violating edge, the gateway import path, the service's CODEOWNERS). Qwen2.5-Coder (DGX Spark port 8002) generates a fix diff: the minimal code change that removes the boundary violation.
Output: A git diff in unified format, with file path, line ranges, and the proposed change.
Step 4 — Simulation Verification
The Simulation Service is called with the fix diff translated into a mutation spec. It clones the sandbox, applies the mutation (the change that the fix diff would make to the graph), runs the full OPA policy suite, and returns the structured delta.
Expected result: The substrate/api-gateway-first violation is in policies_newly_satisfied. No new violations appear in policies_newly_violated.
If the simulation shows new violations, the workflow returns to Step 3. The failure reason (which new violation would be introduced) is appended to the context so Qwen2.5-Coder can generate a revised fix. The retry loop is bounded at 3 attempts before the workflow enters a Manual Resolution Required state.
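The generate-and-verify loop between Steps 3 and 4 might look like the sketch below. The `generate` and `simulate` callables stand in for the Qwen2.5-Coder and Simulation Service calls and are hypothetical, as is the feedback string format. Two assumptions are made: "3 attempts" means three generation attempts in total, and a run in which the target policy is not newly satisfied also counts as a failure.

```python
from typing import Callable, Dict, List, Optional, Tuple

MAX_ATTEMPTS = 3

Delta = Dict[str, List[str]]


def simulation_passed(delta: Delta, policy: str) -> bool:
    """True when the target policy is newly satisfied and nothing new is violated."""
    return (
        policy in delta.get("policies_newly_satisfied", [])
        and not delta.get("policies_newly_violated", [])
    )


def generate_verified_fix(
    policy: str,
    generate: Callable[[List[str]], str],
    simulate: Callable[[str], Delta],
) -> Tuple[str, Optional[str]]:
    """Return (next_state, diff); diff is None when all attempts fail."""
    feedback: List[str] = []  # failure reasons carried into the next attempt
    for _ in range(MAX_ATTEMPTS):
        diff = generate(feedback)
        delta = simulate(diff)
        if simulation_passed(delta, policy):
            return ("Opening PR", diff)
        violated = ", ".join(delta.get("policies_newly_violated", []))
        feedback.append(f"new violations: {violated}")
    return ("Manual Resolution Required", None)
```

Passing the accumulated feedback into each new attempt mirrors the text's point that the failure reason is appended to the workflow context before regeneration.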
Step 5 — GitHub PR Opening
The Agent Orchestration Service uses the GitHub API to create a new branch from the PR's base branch, commit the fix diff, and open a pull request. The PR body includes:
- The violation explanation from Step 1
- The simulation result from Step 4 (showing the violation will be resolved)
- A link to the Substrate workflow audit trail
- The institutional memory context (relevant ADR, post-mortem reference)
Step 6 — Fix PR Lifecycle Monitoring
The Governance Service sets up a watch on the Fix PR's check_run events. It monitors whether the Fix PR passes all CI checks independently.
Step 7 — HITL Gate: Developer Merge Approval
The workflow suspends again. The developer reviews the Fix PR in GitHub. They can:
- Merge: Workflow advances to Step 8
- Request changes: The agent receives the review comment, regenerates the fix (returns to Step 3 loop), and pushes an updated commit to the Fix PR branch
- Close without merging: Workflow transitions to Rejected by Developer; rationale is captured from the closing comment
Step 8 — Re-ingestion
On Fix PR merge, the Ingestion Service webhook fires. The Ingestion Service re-parses the changed files, computes the incremental graph delta, and publishes the update to NATS. The graph is updated with the new state.
Step 9 — Violation Confirmation
The Governance Service re-evaluates the policy that was violated, now against the updated graph. If the violation is resolved, the violation record in PostgreSQL is updated to resolution_status: resolved with a resolved_at timestamp.
If the violation is still present (the fix did not resolve it as expected), the workflow enters a Verification Failed state and the developer is notified with the discrepancy.
Step 10 — Audit Trail Completion
The Agent Orchestration Service marks the workflow Complete and writes the final audit record. The complete 10-step trace — every action, every model call, every human decision, every timestamp and confidence score — is committed to the immutable audit log.
Immutable Audit Trail (AOC-03)
Every action in every workflow produces an audit record. Audit records are written to an append-only PostgreSQL table with no UPDATE or DELETE permissions granted to any application role:
audit_id UUID PRIMARY KEY
workflow_id UUID REFERENCES workflows(workflow_id)
workflow_step TEXT
actor_id TEXT -- human user ID or "system:<service_name>"
action_type TEXT -- see action type taxonomy below
timestamp TIMESTAMPTZ
input_payload JSONB
output_result JSONB
confidence_score FLOAT
human_approver_id TEXT -- NULL if automated; user ID if HITL gate
graph_mutation_id UUID -- NULL if no graph change; mutation ID if graph was modified
reasoning_trace TEXT -- plain English explanation of why this action was taken
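On the application side, a record with these columns could be modelled as a frozen dataclass, as in the sketch below. The `AuditRecord` class and `system_actor` helper are hypothetical. Freezing the dataclass only prevents attribute reassignment in Python; it does not deep-freeze the payload dictionaries and is not a substitute for the database-level permissions described above.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def system_actor(service_name: str) -> str:
    """Format an automated actor as described for the actor_id column."""
    return f"system:{service_name}"


@dataclass(frozen=True)
class AuditRecord:
    """One audit row; field names follow the columns listed above."""
    workflow_id: str
    workflow_step: str
    actor_id: str
    action_type: str
    reasoning_trace: str
    input_payload: dict = field(default_factory=dict)
    output_result: dict = field(default_factory=dict)
    confidence_score: Optional[float] = None
    human_approver_id: Optional[str] = None  # None when the action was automated
    graph_mutation_id: Optional[str] = None  # None when the graph was not modified
    audit_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

A record for an automated simulation step would then be created with `AuditRecord("wf_abc123", "Verifying", system_actor("simulation"), "Simulation", "...")`, leaving `human_approver_id` as `None`.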
Action Type Taxonomy
| Action Type | Description |
|---|---|
| Extraction | Entity or knowledge extracted from a source document |
| Simulation | Sandbox simulation executed; mutation applied and evaluated |
| Code Generation | Fix diff generated by Qwen2.5-Coder |
| Graph Mutation | Node or edge added, removed, or updated in production graph |
| HITL Gate: Approved | Human approved a workflow gate |
| HITL Gate: Rejected | Human rejected a workflow gate |
| Policy Evaluation | OPA policy suite evaluated against subgraph |
| Explanation Generated | Violation or decision explanation produced |
| PR Opened | GitHub PR created by the agent |
| PR Merged | GitHub PR merged (webhook-confirmed) |
| Rollback | Preceding automated action reversed |
| Escalation | Approval gate escalated to next owner in chain |
| Auto-Proceed | HITL gate bypassed due to confidence threshold met |
Cross-Service Coordination (AOC-06)
The Agent Orchestration Service makes all inter-service calls in the defined workflow sequence. It is the only service that calls other Substrate services in an orchestrated chain. The service-to-service call order is enforced by the state machine definition, not by ad-hoc coupling between services.
Service calls are made via internal HTTP APIs (not NATS) for synchronous steps that require a response before the workflow can advance. For asynchronous steps (e.g., waiting for an ingestion event to complete), the service subscribes to the relevant NATS subject and advances the state machine when the expected event arrives.
Workflow Status Events (AOC-07)
At every state transition, the Agent Orchestration Service publishes a WorkflowStatusEvent to NATS under the substrate.workflows.<workflow_id>.status subject. The event includes:
{
"workflow_id": "wf_abc123",
"workflow_type": "fix_pr_generation",
"previous_state": "GeneratingFix",
"current_state": "Verifying",
"timestamp": "2026-03-24T14:23:01Z",
"progress_percent": 40,
"awaiting_human": false,
"message": "Simulation in progress — verifying fix does not introduce new violations"
}
The UI consumes these events via WebSocket and displays a live progress timeline of the workflow.
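A sketch of how the subject and payload for such an event could be assembled is shown below. The helper names are hypothetical, and the actual publish call is omitted because the NATS client in use is not specified; the function only returns the bytes that would be published.

```python
import json
from datetime import datetime, timezone


def status_subject(workflow_id: str) -> str:
    """NATS subject for one workflow's status events."""
    return f"substrate.workflows.{workflow_id}.status"


def build_status_event(
    workflow_id: str,
    workflow_type: str,
    previous_state: str,
    current_state: str,
    timestamp: datetime,
    progress_percent: int,
    awaiting_human: bool,
    message: str,
) -> bytes:
    """Serialize a WorkflowStatusEvent as UTF-8 JSON."""
    event = {
        "workflow_id": workflow_id,
        "workflow_type": workflow_type,
        "previous_state": previous_state,
        "current_state": current_state,
        # Normalize to UTC and render with a trailing "Z", as in the sample event.
        "timestamp": timestamp.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "progress_percent": progress_percent,
        "awaiting_human": awaiting_human,
        "message": message,
    }
    return json.dumps(event).encode("utf-8")
```

Because the workflow ID is a subject token, a consumer interested in every workflow could subscribe with a wildcard in that position, such as `substrate.workflows.*.status`.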
Daily Activity Feed (AOC-08)
Each team member's Substrate-related agent actions, approvals, and decisions are aggregated into a daily activity feed. The feed is persisted in PostgreSQL and accessible via the Substrate dashboard and REST API. It includes:
- Approval gates actioned (approved or rejected, with timestamps)
- Workflows triggered by the user's PRs
- Escalations the user received
- Auto-proceed events on workflows the user owns
- Violations resolved through agent workflows on their services
This provides a compliance-grade record of every developer's interaction with Substrate's governance automation, without requiring manual logging.
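One simple way to derive part of such a feed is to group audit records by user and calendar day, as sketched below. This is an assumption about how the aggregation could work, not a description of the actual pipeline: the `daily_feed` function is hypothetical, records are plain dictionaries with the audit-trail field names, and only actions a human performed directly are covered (items such as auto-proceed events on owned workflows would need an ownership lookup that is not shown).

```python
from collections import defaultdict
from datetime import date, datetime, timezone
from typing import Dict, Iterable, List, Tuple


def daily_feed(records: Iterable[dict]) -> Dict[Tuple[str, date], List[dict]]:
    """Group human-attributed audit records by (user, day), oldest first."""
    feed: Dict[Tuple[str, date], List[dict]] = defaultdict(list)
    for rec in records:
        # Prefer the approver ID for gate decisions, else fall back to the actor.
        user = rec.get("human_approver_id") or rec["actor_id"]
        if user.startswith("system:"):
            continue  # automated actions have no human to attribute them to
        feed[(user, rec["timestamp"].date())].append(rec)
    for items in feed.values():
        items.sort(key=lambda r: r["timestamp"])
    return dict(feed)
```

Each key in the result identifies one user's feed for one day, and the value is that day's entries in chronological order.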
Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| AOC-01 | Define workflow as a state machine (states, transitions, triggers, guards); persist state to PostgreSQL | Must Have |
| AOC-02 | Human approval gates: suspend workflow; notify owner via Slack + in-app; timeout escalation if no response in 24h | Must Have |
| AOC-03 | Complete immutable audit trail: every agent action timestamped with actor, action type, input, output, confidence, and reasoning | Must Have |
| AOC-04 | Rollback capability: if a later step fails or is rejected, revert all preceding automated actions | Must Have |
| AOC-05 | Confidence-based auto-proceed: workflows with every step above 90% confidence can proceed without human gates (configurable) | Must Have |
| AOC-06 | Cross-service coordination: orchestrate calls to Ingestion, Governance, Reasoning, Simulation in the correct sequence | Must Have |
| AOC-07 | Publish workflow status events to NATS for real-time UI progress display | Must Have |
| AOC-08 | Daily task log: every team member's Substrate-related agent actions, approvals, and decisions logged to their activity feed | Must Have |
Infrastructure Dependencies
| Component | Role in Agent Orchestration Service |
|---|---|
| PostgreSQL 16 | Workflow state persistence; immutable audit log (append-only table) |
| NATS JetStream | Receives trigger events; publishes workflow status events; listens for ingestion completion confirmations |
| Redis 7 | Workflow lock to prevent duplicate execution on restart |
| Slack API | HITL approval gate notifications and escalation messages |
| GitHub API | Fix PR creation, branch management, PR status monitoring |
| Reasoning Service | Violation explanation, institutional memory retrieval |
| Governance Service | Policy evaluation, blast radius, Fix PR diff generation coordination |
| Simulation Service | Fix verification before PR is opened |
| Ingestion Service | Re-ingestion trigger after Fix PR merge |
| DGX Spark port 8002 | Qwen2.5-Coder: fix diff generation |