Governance Service

The Governance Service is the immune system of the Substrate platform. It enforces structural and policy intent against the Observed Graph, blocks non-compliant code changes at the CI/CD gate, and generates human-readable explanations grounded in institutional memory.

Responsibility

The Governance Service evaluates the Observed Graph against all active Rego policies on every PR open event. When a violation is found, the service determines its severity, blocks or annotates the GitHub CI pipeline accordingly, and generates a plain English explanation with linked ADR and post-mortem context. For violations with deterministic fixes, it also coordinates with the Agent Orchestration Service to open a Fix PR via Qwen2.5-Coder.

Unlike traditional static analysis tools (SAST/DAST), Substrate's Governance Service operates on structural intent: it reasons about the architecture graph — who calls whom, who owns what, what is declared versus what is actually running — rather than scanning raw source text for pattern matches.


Architecture Overview

The Governance Service is composed of three tightly integrated components:

  1. OPA Evaluation Engine: An OPA server process with policies loaded from the PostgreSQL policy store via the OPA bundle mechanism. Policies are hot-reloaded when updated; no server restart is required (GOV-01).

  2. Graph Query Layer: A Neo4j read client that fetches the relevant subgraph for each policy evaluation. Blast radius computation uses Neo4j's native reachability traversal with PageRank-weighted criticality scoring (GOV-08).

  3. Explanation Generator: Uses the Dense explain-lora adapter to translate structured violation data into plain English PR comments. The generator queries the Reasoning Service for linked ADRs, post-mortems, and prior exception nodes to include as supporting context in every explanation (GOV-04, GOV-09).

On each PR open event (delivered via NATS from the Ingestion Service), the Governance Service:

  1. Fetches the affected subgraph delta from Neo4j
  2. Submits the subgraph to the OPA server for policy evaluation
  3. Receives the list of violations, their severity levels, and the affected node IDs
  4. Computes blast radius for each affected node via Neo4j traversal
  5. Queries the Reasoning Service for institutional memory context
  6. Writes violation records to PostgreSQL
  7. Posts the GitHub Checks API result (block or annotate)
  8. Posts a PR comment with the plain English explanation

The full evaluation pipeline must complete in under 2 seconds (GOV-02).


Policy Enforcement Model

Severity Levels and CI Behaviour

| Severity | CI Behaviour | Example Policies |
| --- | --- | --- |
| Hard-mandatory | GitHub Check fails; PR is blocked from merge | No circular dependencies, API gateway routing, license compliance |
| Soft-mandatory | GitHub Check passes with warning annotation; PR comment posted | Service ownership, documentation required, test coverage |
| Advisory | PR comment only; no CI impact | SOLID principles, DRY enforcement |

Hard-mandatory violations are enforced via the GitHub Checks API (GOV-03): the Governance Service posts a check_run with conclusion: failure and a detailed output.summary. Soft-mandatory violations leave the check passing and attach warning annotations. Advisory violations post conclusion: neutral with annotations, which does not block the merge.
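The severity-to-conclusion mapping can be sketched as a small payload builder. This is a minimal illustration, not the service's actual code: the check name, the violation dictionary shape, and the choice of "success" for soft-mandatory violations are assumptions. The field names (name, head_sha, status, conclusion, output.title, output.summary) follow the GitHub REST API for creating a check run.

```python
# Hypothetical mapping from Substrate severity to a Checks API conclusion.
# "failure" and "neutral" come from the text above; "success" for
# soft-mandatory is an assumption based on "check passes with warning".
CONCLUSION_BY_SEVERITY = {
    "hard-mandatory": "failure",
    "soft-mandatory": "success",
    "advisory": "neutral",
}
# Ordered from least to most severe, so the worst violation wins.
SEVERITY_ORDER = ["advisory", "soft-mandatory", "hard-mandatory"]


def build_check_run(head_sha: str, violations: list[dict]) -> dict:
    """Build a request body for creating a check run on a commit."""
    if violations:
        worst = max((v["severity"] for v in violations), key=SEVERITY_ORDER.index)
        conclusion = CONCLUSION_BY_SEVERITY[worst]
    else:
        conclusion = "success"
    summary = "\n".join(
        f"- [{v['severity']}] {v['policy_id']}: {v['message']}" for v in violations
    )
    return {
        "name": "substrate-governance",  # hypothetical check name
        "head_sha": head_sha,
        "status": "completed",
        "conclusion": conclusion,
        "output": {
            "title": f"{len(violations)} policy violation(s)",
            "summary": summary or "No policy violations found.",
        },
    }
```

When a PR carries violations of mixed severity, this sketch lets the most severe one decide the conclusion, so a single hard-mandatory violation is enough to fail the check.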

Policy Store and Hot-Reload

All Rego policies are stored in a dedicated policies table in PostgreSQL with the following schema:

policy_id        TEXT PRIMARY KEY
pack_id          TEXT
name             TEXT
rego_source      TEXT
severity         TEXT  -- hard-mandatory | soft-mandatory | advisory
enabled          BOOLEAN
created_at       TIMESTAMPTZ
updated_at       TIMESTAMPTZ

The OPA server polls a custom bundle endpoint that is generated from this table. Whenever a policy row is updated, the bundle is regenerated and the OPA server loads it on its next poll, without requiring a restart. All incoming Rego source is validated via the OPA API's compile endpoint before being saved (GOV-13); invalid Rego is rejected with a structured error response.
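As an illustration of what the bundle endpoint might serve, the sketch below packs the enabled rows of the policies table into a gzipped tarball, which is the container format OPA bundles use. The one-file-per-policy layout and the helper names are assumptions, and the rows are plain dictionaries standing in for database records.

```python
import io
import tarfile


def build_bundle(rows: list[dict]) -> bytes:
    """Pack enabled policies into a gzipped tarball of .rego files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for row in rows:
            if not row["enabled"]:
                continue  # disabled policies are left out of the bundle
            data = row["rego_source"].encode("utf-8")
            # Hypothetical layout: one .rego file named after the policy_id.
            info = tarfile.TarInfo(name=f"{row['policy_id']}.rego")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def bundle_names(blob: bytes) -> list[str]:
    """List the file names inside a bundle, for inspection."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        return tar.getnames()
```

Regenerating the bundle only from enabled rows means that toggling the enabled column takes effect the next time OPA fetches the bundle.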

Blast Radius Computation (GOV-08)

For every violation, the Governance Service computes the blast radius: the set of all nodes reachable from any affected node via outbound CALLS, DEPENDS_ON, HOSTS, and ROUTES_TO edges up to 5 hops. The result includes:

  • Total count of reachable nodes
  • Hop distance distribution
  • Criticality weighting via PageRank score (nodes with PageRank > 0.3 are marked critical)
  • A list of critical path services in the blast radius

Blast radius results are attached to the violation record in PostgreSQL and included in the PR comment explanation.
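The computation described above can be sketched in memory as a bounded breadth-first search. In the service itself this runs as a Neo4j traversal; the function name, the edge tuple shape, and the result keys below are assumptions made for illustration.

```python
from collections import Counter, deque

MAX_HOPS = 5
CRITICAL_PAGERANK = 0.3
TRAVERSED = {"CALLS", "DEPENDS_ON", "HOSTS", "ROUTES_TO"}


def blast_radius(edges, pagerank, start_nodes):
    """edges: iterable of (src, edge_type, dst); pagerank: node -> score."""
    adjacency = {}
    for src, edge_type, dst in edges:
        if edge_type in TRAVERSED:  # other edge types are not followed
            adjacency.setdefault(src, []).append(dst)

    hops = {node: 0 for node in start_nodes}
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        if hops[node] == MAX_HOPS:
            continue  # do not expand beyond the hop limit
        for nxt in adjacency.get(node, []):
            if nxt not in hops:  # first visit is the shortest hop distance
                hops[nxt] = hops[node] + 1
                queue.append(nxt)

    reached = {n: h for n, h in hops.items() if h > 0}
    return {
        "count": len(reached),
        "hop_distribution": dict(Counter(reached.values())),
        "critical": sorted(
            n for n in reached if pagerank.get(n, 0.0) > CRITICAL_PAGERANK
        ),
    }
```

Because breadth-first search visits nodes in order of distance, each node is recorded at its minimum hop count, which is what the hop distance distribution reports.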

Runtime Drift Enforcement (GOV-12)

When the Ingestion Service's SSH Runtime Connector publishes a RuntimeDriftEvent to NATS, the Governance Service evaluates the drift magnitude against the substrate/runtime-drift policy threshold. If a host's observed state diverges from the graph-declared state beyond the configured threshold (measured as the fraction of divergent checks out of total checks run), a RuntimeViolation is raised. Runtime violations are treated as hard-mandatory and assigned to the service owner for immediate resolution.
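The drift measure is a simple ratio, sketched below. The function names and the default threshold of 0.2 are hypothetical; the actual threshold is whatever the substrate/runtime-drift policy configures.

```python
def drift_fraction(checks: dict[str, bool]) -> float:
    """checks maps a check name to True when observed state matches the graph."""
    if not checks:
        return 0.0  # no checks run, so no measurable drift
    divergent = sum(1 for matches in checks.values() if not matches)
    return divergent / len(checks)


def is_runtime_violation(checks: dict[str, bool], threshold: float = 0.2) -> bool:
    """True when the divergent fraction exceeds the (hypothetical) threshold."""
    return drift_fraction(checks) > threshold
```

For example, one divergent check out of four gives a fraction of 0.25, which exceeds the illustrative 0.2 threshold and would raise a RuntimeViolation.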

Policy Exception Capture (GOV-11)

When an engineer or architect approves a policy exception through the HITL approval gate in the Agent Orchestration Service, the Governance Service creates an ExceptionNode in the graph with:

  • rationale property (the human-provided justification)
  • approved_by property (the approver's identity)
  • expires_at property (optional expiry date)
  • WHY edge to the violating service node
  • EXCEPTION_TO edge to the policy node

This exception is surfaced by the Reasoning Service in future violation explanations for the same service and policy pair, preserving the institutional context of why the exception was granted.


Official Policy Packs (MVP)

Substrate ships nine official policy packs that provide immediate governance value without requiring custom Rego authorship.

substrate/no-circular-deps — Hard-mandatory

Detects any directed cycle in the DEPENDS_ON graph using Johnson's algorithm. A circular dependency is defined as any path from node A that returns to node A through DEPENDS_ON edges. Cycles of any length are flagged; the violation output includes the full cycle path.
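To illustrate the violation output, the sketch below finds one cycle and returns its full path. It uses a plain depth-first search with visit states rather than Johnson's algorithm, which enumerates every elementary cycle; the adjacency dictionary is an assumed in-memory stand-in for the DEPENDS_ON graph.

```python
def find_cycle(depends_on: dict[str, list[str]]):
    """Return one cycle as a node path [a, ..., a], or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in depends_on.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is on the current path, so the path loops back to it.
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(depends_on):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

The recursion depth grows with the longest dependency chain, so a production version over a large graph would likely need an iterative traversal.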

substrate/api-gateway-first — Hard-mandatory

Enforces that all inter-service communication routes through a declared gateway node. The policy evaluates every ACTUALLY_CALLS edge in the Observed Graph: if a direct call exists between two service nodes without an intermediate gateway node on the path, the call is a violation. The ACTUALLY_CALLS edge type is populated by the SSH Runtime Connector from observed network connections, not from declared intent.
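A simplified version of this check is sketched below: any observed call whose source and target are both service nodes is reported. The node type labels and the data shapes are assumptions; the real policy is written in Rego and evaluated against the subgraph.

```python
def direct_service_calls(node_types: dict[str, str], actually_calls: list[tuple[str, str]]):
    """Return ACTUALLY_CALLS edges that connect two service nodes directly."""
    violations = []
    for src, dst in actually_calls:
        # A call into or out of a gateway node is routed and therefore allowed.
        if node_types.get(src) == "service" and node_types.get(dst) == "service":
            violations.append((src, dst))
    return violations
```

Because ACTUALLY_CALLS edges come from observed network connections, this catches calls that bypass the gateway even when the declared architecture says otherwise.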

substrate/service-ownership — Soft-mandatory

Every service node must have at least one OWNS edge connecting it to a Developer or Team node. Services with no ownership edge are flagged. The policy also queries the CODEOWNERS file (ingested as part of the GitHub connector) as a secondary ownership signal.

substrate/license-compliance — Hard-mandatory

No DEPENDS_ON edge from a commercial service node may lead to a dependency node carrying a GPL or AGPL license tag. The connector populates license information from package.json, requirements.txt, go.mod, and Cargo.toml manifest files. The license whitelist and blacklist are configurable per tenant in the policy store.

substrate/no-undocumented-services — Soft-mandatory

Every service node must have at least one DOCUMENTS edge pointing to a documentation node (a GitHub Pages site, a Confluence page, or a README node). Services with no outbound DOCUMENTS edges are flagged with the violation text: "ServiceName has no documentation — assign documentation or link an existing page."

substrate/solid-principles — Advisory

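A composite pack that checks the five SOLID principles as graph properties. Three are measured directly with graph metrics; Open/Closed and Liskov Substitution are checked together via the inheritance graph: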

Single Responsibility Principle (SRP): A service node with more than 5 outbound DEPENDS_ON edges (efferent coupling, or fan-out) is a violation. The instability index I = Ce / (Ce + Ca) is also computed for each node, where Ce is efferent coupling and Ca is afferent coupling (fan-in). I > 0.8 marks a highly unstable, high-risk component.
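The coupling metrics can be computed from the edge list alone, as in the following sketch. The function name and result keys are assumptions, and the edge list is assumed to contain each DEPENDS_ON edge once.

```python
from collections import Counter


def coupling_metrics(depends_on_edges: list[tuple[str, str]]) -> dict[str, dict]:
    """Compute Ce, Ca, and instability I = Ce / (Ce + Ca) for each node."""
    ce, ca = Counter(), Counter()
    nodes = set()
    for src, dst in depends_on_edges:
        ce[src] += 1  # outbound edge: efferent coupling of the source
        ca[dst] += 1  # inbound edge: afferent coupling of the target
        nodes.update((src, dst))

    metrics = {}
    for node in nodes:
        total = ce[node] + ca[node]
        instability = ce[node] / total if total else 0.0
        metrics[node] = {
            "ce": ce[node],
            "ca": ca[node],
            "instability": instability,
            "srp_violation": ce[node] > 5,
            "high_risk": instability > 0.8,
        }
    return metrics
```

A service with six outbound dependencies and one inbound dependency has I = 6/7, roughly 0.86, so it trips both the fan-out limit and the instability threshold.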

Dependency Inversion Principle (DIP): Layer ranks are assigned as: presentation=0, application=1, domain=2, infrastructure=3. Any DEPENDS_ON edge from a lower-rank node to a higher-rank node (domain depending on infrastructure, for example) is a violation.

Interface Segregation Principle (ISP): Interface nodes with more than 8 methods, or with consumers that use fewer than 50% of the interface's methods, are flagged as over-broad interfaces.

Open/Closed and Liskov Substitution Principles: Checked via inheritance graph traversal where class hierarchies are present in the AST-parsed code graph.

substrate/dry-patterns — Advisory

Dependency-level DRY: Computes Jaccard similarity |A ∩ B| / |A ∪ B| between the dependency sets of any two service nodes. A score above 0.80 flags the pair as consolidation candidates: they depend on the same libraries to a degree suggesting they could be merged or share a library layer.
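The dependency-level check reduces to a pairwise set comparison, sketched here. The function names and the input shape (service name mapped to a set of dependency names) are assumptions for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets score 0.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def consolidation_candidates(deps: dict[str, set], threshold: float = 0.80):
    """Return (service, service, score) for every pair scoring above the threshold."""
    candidates = []
    for left, right in combinations(sorted(deps), 2):
        score = jaccard(deps[left], deps[right])
        if score > threshold:
            candidates.append((left, right, score))
    return candidates
```

Comparing every pair is quadratic in the number of services, which is acceptable for a sketch but may need pre-filtering on a large graph.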

Source-level DRY: Source duplication above 5% across modules (measured via normalized AST subtree similarity) flags a refactoring recommendation.

substrate/tdd-coverage — Soft-mandatory

Every service node must carry a test_coverage property (populated by the Ingestion Service from CI test reports) of at least 80%. Services with PageRank above 0.3 (critical services) require at least 90% coverage. Service nodes with no test_coverage property at all are treated as soft-mandatory violations.
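The threshold logic can be written as a single function. The dictionary keys and the returned messages are hypothetical; the 80%, 90%, and 0.3 values are the ones stated above.

```python
from typing import Optional


def coverage_violation(service: dict) -> Optional[str]:
    """Return a violation message, or None when the service meets its threshold."""
    coverage = service.get("test_coverage")
    if coverage is None:
        return "missing test_coverage property"
    # Critical services (PageRank above 0.3) must reach 90%, others 80%.
    required = 90.0 if service.get("pagerank", 0.0) > 0.3 else 80.0
    if coverage < required:
        return f"coverage {coverage}% is below the required {required}%"
    return None
```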

substrate/api-first — Soft-mandatory

Every REST service node must have a linked OpenAPISpecNode via a DESCRIBED_BY edge. The spec's last_modified timestamp must be within 30 days of the service's most recent deployment. Stale or absent OpenAPI specs are flagged with a soft-mandatory violation. This policy follows the Spego OPA bundle pattern as a reference implementation.
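The freshness rule can be sketched as a timestamp comparison. This version assumes "within 30 days" means the absolute gap between the spec's last_modified and the latest deployment, in either direction; the function name and status strings are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def spec_status(
    spec_last_modified: Optional[datetime],
    last_deployed: datetime,
    max_gap: timedelta = timedelta(days=30),
) -> str:
    """Classify a service's OpenAPI spec as "absent", "stale", or "ok"."""
    if spec_last_modified is None:
        return "absent"  # no DESCRIBED_BY edge to an OpenAPISpecNode
    if abs(last_deployed - spec_last_modified) > max_gap:
        return "stale"
    return "ok"
```

Both timestamps should be timezone-aware (or both naive), since Python raises a TypeError when subtracting a naive datetime from an aware one.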


Institutional Memory in Violation Explanations (GOV-09)

Every violation explanation generated for a PR comment includes a structured context block retrieved from the Reasoning Service:

  • Linked ADRs: Any DecisionNode with a WHY edge to an affected service node, presented as "This decision was made because: [rationale]."
  • Post-mortem references: Any FailurePattern node with a CAUSED_BY or AFFECTED edge involving the same service or edge type as the current violation.
  • Prior exceptions: Any ExceptionNode on the same policy and service combination, with the original approver and rationale.
  • Related policy context: The full text of the policy being violated, with a link to its definition in the policy store.

This context transforms a raw policy violation message into an explanation that communicates not just what is wrong, but why the rule exists, what happened last time it was ignored, and whether it has been excepted before.


Fix PR Generation (GOV-14)

For violations with deterministic structural fixes (missing gateway import, missing CODEOWNERS entry, outdated OpenAPI spec path), the Governance Service coordinates with the Agent Orchestration Service to generate a Fix PR:

  1. The Governance Service packages the violation details and the affected code context.
  2. The Agent Orchestration Service invokes the 10-step Fix PR Generation Workflow (see Agent Orchestration Service documentation).
  3. Qwen2.5-Coder (DGX Spark port 8002) generates the fix diff.
  4. The Simulation Service verifies the fix resolves the violation without introducing new ones.
  5. The GitHub API opens the Fix PR with the simulation result attached.

Fix PR generation is marked as Nice to Have (MVP) and requires explicit developer approval at Step 2 before any code is generated.


Violation Storage (GOV-10)

Every violation is persisted to the violations table in PostgreSQL:

violation_id        UUID PRIMARY KEY
policy_id           TEXT
pr_number           INTEGER
repo_full_name      TEXT
severity            TEXT
affected_nodes      JSONB   -- array of node IDs and types
blast_radius_count  INTEGER
timestamp           TIMESTAMPTZ
resolution_status   TEXT    -- open | exception_granted | resolved | auto_resolved
resolved_at         TIMESTAMPTZ
exception_node_id   UUID REFERENCES exceptions(exception_id)

Violations are queryable via the Governance Service REST API for dashboards, trend analysis, and audit purposes.


Functional Requirements

| ID | Requirement | Priority |
| --- | --- | --- |
| GOV-01 | OPA server running with policies loaded from and hot-reloaded from PostgreSQL policy store via bundle mechanism | Must Have |
| GOV-02 | Evaluate Observed Graph against all active policies on every PR open event; complete in under 2 seconds | Must Have |
| GOV-03 | Block GitHub CI/CD pipeline on hard-mandatory violation via GitHub Checks API; pass with annotation on advisory violations | Must Have |
| GOV-04 | Generate plain English violation explanation in PR comment; include linked ADR and post-mortem context if available | Must Have |
| GOV-05 | Detect architectural boundary violations: service calls service without routing through declared gateway | Must Have |
| GOV-06 | Detect new dependency license conflicts against defined license whitelist/blacklist policy | Must Have |
| GOV-07 | Detect Terraform/K8s state deviation from last known intended infrastructure state | Must Have |
| GOV-08 | Compute blast radius from any graph node via Neo4j reachability traversal; return affected nodes, hop count, criticality (PageRank-weighted) | Must Have |
| GOV-09 | Surface linked institutional memory in every violation explanation: relevant ADRs, post-mortems, prior exceptions | Must Have |
| GOV-10 | Store all violations with timestamp, affected nodes, policy ID, severity, and resolution status in PostgreSQL | Must Have |
| GOV-11 | When policy exception approved: capture rationale as ExceptionNode with WHY edge; link to policy and violating service | Must Have |
| GOV-12 | Enforce SSH-verified runtime drift: if host state diverges from graph-declared state beyond threshold, raise runtime violation | Must Have |
| GOV-13 | Validate Rego syntax via OPA API before saving any new or updated policy | Must Have |
| GOV-14 | Generate Fix PR suggestion using Qwen2.5-Coder for violations with deterministic structural fixes | Nice to Have (MVP) |
| GOV-15 | Detect conflicts between two Rego policies using logical contradiction analysis | Nice to Have (v1.1) |

Infrastructure Dependencies

| Component | Role in Governance Service |
| --- | --- |
| OPA Server | Policy evaluation engine; receives subgraph inputs, returns violation lists |
| PostgreSQL 16 | Policy store, violation log, exception records |
| Neo4j 5.x | Subgraph fetch for policy inputs; blast radius traversal |
| NATS JetStream | Receives PR open events from Ingestion Service; publishes violation events |
| GitHub Checks API | Posts check_run results to block or annotate PRs |
| DGX Spark port 8001 | Dense 70B + explain-lora: violation explanation generation |
| DGX Spark port 8002 | Qwen2.5-Coder: fix diff generation (GOV-14) |
| Reasoning Service | Institutional memory retrieval for violation context |
| Agent Orchestration Service | Fix PR workflow coordination |