
Simulation Service

The Simulation Service provides a sandboxed what-if environment that lets developers, architects, and SREs explore the architectural consequences of proposed changes before writing a single line of code. It is purely advisory: it reads the production knowledge base to build its sandbox but never writes to it.

Responsibility

Accept a proposed mutation, expressed either as a structured JSON specification or as natural language. Clone the current Observed and Intended graphs into an ephemeral Neo4j sandbox, apply the mutation atomically, run the full active OPA policy suite against the sandbox, and compute blast radius and PageRank deltas. Return a structured before/after comparison with a plain English summary and relevant institutional memory context.


Why Simulation Is Its Own Service

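The Simulation Service is architecturally separate from the Governance Service. The separation is not just code organisation: the two services differ in the graph they operate on, when they run, what they affect, what triggers them, and who consumes the result: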

| Dimension | Governance Service | Simulation Service |
| --- | --- | --- |
| Graph it operates on | Real, live production graph | Hypothetical, ephemeral sandbox graph |
| Timing | Blocking and synchronous: must complete before CI gate | Advisory and asynchronous: developer can proceed while simulation runs |
| Effect on knowledge base | Acts on the production UMKB | Never touches the production knowledge base |
| Trigger | Fires on actual changes that exist in a PR | Fires on proposed/hypothetical changes not yet written |
| Audience | CI pipeline and compliance record | Developer, architect, or PM exploring options |

The Governance Service answers: "Does what you wrote comply with our policies?" The Simulation Service answers: "If you were to make this change, what would break, what would improve, and what should you know before you start?"


Mutation Input Formats

The Simulation Service accepts two input formats for describing a proposed change:

Structured JSON Specification (SIM-01)

A mutation spec is a JSON document describing one or more graph operations:

{
  "mutations": [
    {
      "op": "add_node",
      "type": "ServiceNode",
      "properties": { "name": "RecommendationService", "language": "python" }
    },
    {
      "op": "add_edge",
      "type": "CALLS",
      "from": "ServiceNode:OrderService",
      "to": "ServiceNode:RecommendationService"
    },
    {
      "op": "remove_edge",
      "type": "DEPENDS_ON",
      "from": "ServiceNode:OrderService",
      "to": "LibraryNode:legacy-xml-parser"
    }
  ]
}

Supported operations: add_node, remove_node, add_edge, remove_edge, update_node_properties, split_node, merge_nodes.
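
A caller might want to check a spec before submitting it. The following is a minimal sketch of such a check, not the service's own validator: the operation names come from the list above, while the `REQUIRED_FIELDS` table and the `validate_spec` helper are assumptions inferred from the example spec.

```python
import json

# Operation names are taken from the supported-operations list above.
SUPPORTED_OPS = {
    "add_node", "remove_node", "add_edge", "remove_edge",
    "update_node_properties", "split_node", "merge_nodes",
}

# ASSUMPTION: required fields are inferred from the example spec only;
# the document does not publish a schema for the other operations.
REQUIRED_FIELDS = {
    "add_node": {"type", "properties"},
    "add_edge": {"type", "from", "to"},
    "remove_edge": {"type", "from", "to"},
}


def validate_spec(raw: str) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    spec = json.loads(raw)
    mutations = spec.get("mutations")
    if not isinstance(mutations, list) or not mutations:
        return ["spec must contain a non-empty 'mutations' list"]
    problems = []
    for i, m in enumerate(mutations):
        op = m.get("op")
        if op not in SUPPORTED_OPS:
            problems.append(f"mutation {i}: unsupported op {op!r}")
            continue
        missing = REQUIRED_FIELDS.get(op, set()) - set(m)
        if missing:
            problems.append(f"mutation {i}: {op} is missing {sorted(missing)}")
    return problems


example = json.dumps({"mutations": [
    {"op": "add_edge", "type": "CALLS",
     "from": "ServiceNode:OrderService",
     "to": "ServiceNode:RecommendationService"},
    {"op": "move_edge", "type": "CALLS"},        # not a supported op
    {"op": "add_node", "type": "ServiceNode"},   # no properties given
]})
for problem in validate_spec(example):
    print(problem)
```

Collecting every problem instead of stopping at the first one lets the requester fix the whole spec in one pass.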

Natural Language Description (SIM-02)

A developer can describe the change in plain English: "What happens if I split OrderService into separate OrderCreation and OrderFulfillment services?"

The MoE Scout (Llama 4 Scout, port 8000) translates the natural language description into a structured mutation spec. Before the simulation runs, the generated spec is returned to the requester for confirmation:

Interpreted mutation:
  1. Add node: ServiceNode "OrderCreationService"
  2. Add node: ServiceNode "OrderFulfillmentService"
  3. Move edges: CALLS edges from OrderService to OrderCreationService (3 found)
  4. Move edges: CALLS edges from OrderService to OrderFulfillmentService (2 found)
  5. Remove node: ServiceNode "OrderService"

Confirm? [Yes / Edit spec / Cancel]

This human confirmation step prevents the simulation from running against a misinterpreted mutation. No sandbox is created until the spec is confirmed.
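
The confirmation gate can be sketched as two small functions: one that renders a generated spec in the numbered form shown above, and one that decides whether sandbox creation may proceed. Both function names are hypothetical, only node operations are rendered, and reading the node name from `properties.name` is an assumption based on the JSON example.

```python
def render_confirmation(mutations: list[dict]) -> str:
    """Format a generated spec as a numbered confirmation prompt."""
    # ASSUMPTION: only node operations are labelled here; any other op
    # falls back to its raw name.
    labels = {"add_node": "Add node", "remove_node": "Remove node"}
    lines = ["Interpreted mutation:"]
    for i, m in enumerate(mutations, start=1):
        label = labels.get(m["op"], m["op"])
        name = m.get("properties", {}).get("name", "?")
        lines.append(f'  {i}. {label}: {m["type"]} "{name}"')
    lines += ["", "Confirm? [Yes / Edit spec / Cancel]"]
    return "\n".join(lines)


def may_create_sandbox(reply: str) -> bool:
    """Only an explicit 'Yes' unlocks sandbox creation."""
    return reply.strip().lower() == "yes"


generated = [
    {"op": "add_node", "type": "ServiceNode",
     "properties": {"name": "OrderCreationService"}},
    {"op": "add_node", "type": "ServiceNode",
     "properties": {"name": "OrderFulfillmentService"}},
]
print(render_confirmation(generated))
print(may_create_sandbox("Edit spec"))
```

Treating anything other than an explicit "Yes" as a refusal keeps the default safe: an ambiguous reply never creates a sandbox.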


Simulation Execution Pipeline

Step 1: Sandbox Creation (SIM-03)

An ephemeral Neo4j database is created for the sandbox using:

CREATE DATABASE sim_<uuid> IF NOT EXISTS

The Observed Graph and Intended Graph are cloned into this named database using Neo4j's database copy mechanism. The clone is fully isolated — no read or write operations on the sandbox can affect the production neo4j database.

Step 2: Mutation Application (SIM-04)

The mutation spec is applied to the sandbox graph as a single atomic transaction. If any operation in the spec fails (e.g., a remove_node on a node that does not exist), the entire mutation is rolled back and the error is returned to the caller with a diagnostic message.
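
The all-or-nothing behaviour can be illustrated on an in-memory graph. This is a sketch under stated assumptions, not the service's implementation: a deep copy stands in for the database transaction, nodes are identified by a hypothetical `id` field, and only three operations are modelled.

```python
import copy


class MutationError(Exception):
    """Raised when one operation in a spec cannot be applied."""


def apply_op(graph: dict, m: dict) -> None:
    """Apply a single operation in place (three ops sketched)."""
    nodes = graph["nodes"]
    if m["op"] == "add_node":
        nodes.add(m["id"])
    elif m["op"] == "remove_node":
        if m["id"] not in nodes:
            raise MutationError(f"remove_node: {m['id']} does not exist")
        nodes.remove(m["id"])
        # Drop every edge that touched the removed node.
        graph["edges"] = {e for e in graph["edges"]
                          if m["id"] not in (e[0], e[2])}
    elif m["op"] == "add_edge":
        if m["from"] not in nodes or m["to"] not in nodes:
            raise MutationError(
                f"add_edge: missing endpoint for {m['from']} -> {m['to']}")
        graph["edges"].add((m["from"], m["type"], m["to"]))
    else:
        raise MutationError(f"unsupported op in this sketch: {m['op']}")


def apply_atomically(graph: dict, mutations: list[dict]) -> dict:
    """Return a mutated copy, or raise and leave the input untouched."""
    working = copy.deepcopy(graph)  # stand-in for a transaction
    for index, m in enumerate(mutations):
        try:
            apply_op(working, m)
        except MutationError as exc:
            raise MutationError(
                f"mutation {index} failed, nothing applied: {exc}") from exc
    return working


sandbox = {"nodes": {"A", "B"}, "edges": {("A", "CALLS", "B")}}
try:
    apply_atomically(sandbox, [
        {"op": "add_node", "id": "C"},
        {"op": "remove_node", "id": "Z"},  # fails: Z does not exist
    ])
except MutationError as exc:
    print(exc)
print(sorted(sandbox["nodes"]))  # the earlier add_node was not kept
```

Against a real database the same effect would come from running every operation inside one transaction and rolling back on the first error.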

Step 3: Policy Evaluation (SIM-05)

The full active OPA policy suite is run against the sandbox graph using the same evaluation pathway as the Governance Service, but against the sandbox Neo4j database instead of production. This produces:

  • A list of policies newly violated by the mutation (were passing before, now failing)
  • A list of policies newly satisfied by the mutation (were failing before, now passing)
  • A list of policies unchanged — still failing or still passing after the mutation
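
The three lists above amount to a comparison of two pass/fail maps. A minimal sketch, assuming each evaluation run is reduced to a dictionary from policy ID to a boolean (the `policy_delta` name and that input shape are illustrative, not the OPA response format):

```python
def policy_delta(before: dict[str, bool],
                 after: dict[str, bool]) -> dict[str, list[str]]:
    """Classify policies by how their pass/fail state changed.

    `before` and `after` map policy_id -> True when the policy passes.
    """
    delta = {
        "newly_violated": [],
        "newly_satisfied": [],
        "unchanged_failing": [],
        "unchanged_passing": [],
    }
    for policy_id in sorted(before.keys() & after.keys()):
        was_passing, is_passing = before[policy_id], after[policy_id]
        if was_passing and not is_passing:
            delta["newly_violated"].append(policy_id)
        elif not was_passing and is_passing:
            delta["newly_satisfied"].append(policy_id)
        elif is_passing:
            delta["unchanged_passing"].append(policy_id)
        else:
            delta["unchanged_failing"].append(policy_id)
    return delta


pre = {"substrate/service-ownership": True,
       "substrate/solid-principles": False}
post = {"substrate/service-ownership": False,
        "substrate/solid-principles": True}
print(policy_delta(pre, post))
```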

Step 4: Blast Radius Delta Computation (SIM-06)

The blast radius is computed twice — once against the pre-mutation sandbox state and once against the post-mutation state — using the same Neo4j reachability traversal and PageRank weighting as the Governance Service.

The delta is expressed as:

  • Before: N nodes in blast radius, M critical nodes
  • After: N' nodes in blast radius, M' critical nodes
  • Delta: +/- X nodes affected, +/- Y critical nodes
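
The before/after comparison can be sketched with a plain breadth-first reachability traversal. This is an illustration only: it assumes the blast radius is every node reachable from the changed node along the supplied adjacency lists, and that the set of critical nodes is given rather than derived from PageRank.

```python
from collections import deque


def blast_radius(adjacency: dict[str, list[str]], start: str) -> set[str]:
    """Nodes reachable from `start`, excluding `start` itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return seen


def radius_delta(before_adj, after_adj, start, critical: set[str]) -> dict:
    """Build a before/after/delta record for one changed node."""
    before = blast_radius(before_adj, start)
    after = blast_radius(after_adj, start)
    return {
        "before": {"total_nodes": len(before),
                   "critical_nodes": len(before & critical)},
        "after": {"total_nodes": len(after),
                  "critical_nodes": len(after & critical)},
        "delta_nodes": len(after) - len(before),
        "delta_critical": len(after & critical) - len(before & critical),
    }


pre_graph = {"Order": ["Auth", "Billing"], "Billing": ["Ledger"]}
post_graph = {"Order": ["Auth"]}
print(radius_delta(pre_graph, post_graph, "Order", {"Auth", "Ledger"}))
```

A negative delta means the mutation shrinks the set of affected nodes.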

Step 5: PageRank Impact (SIM-07)

The GDS PageRank algorithm is re-run on the sandbox post-mutation to detect any shifts in the criticality rankings of core services. A service that gains many new inbound CALLS edges may become a new architectural bottleneck. A split service may redistribute criticality across two nodes. These changes are included in the structured delta.
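
To see why extra inbound edges raise a node's criticality, the effect can be reproduced with a textbook power-iteration PageRank. This sketch is not the GDS implementation, and its normalised scores will not match GDS output. The damping factor, iteration count, and reporting threshold are illustrative assumptions.

```python
def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Textbook PageRank; scores sum to 1. Dangling mass is spread evenly."""
    nodes = sorted(nodes)
    n = len(nodes)
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out[v])
        nxt = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for dst in out[v]:
                nxt[dst] += damping * rank[v] / len(out[v])
        rank = nxt
    return rank


def pagerank_impact(before, after, threshold=0.01):
    """List nodes whose score moved by at least `threshold`."""
    rows = []
    for node in sorted(before.keys() & after.keys()):
        change = after[node] - before[node]
        if abs(change) >= threshold:
            rows.append({"node": node,
                         "before": round(before[node], 2),
                         "after": round(after[node], 2),
                         "change": round(change, 2)})
    return rows


services = {"A", "B", "C", "Hub"}
pre_rank = pagerank(services, [("A", "Hub"), ("B", "Hub")])
post_rank = pagerank(services, [("A", "Hub"), ("B", "Hub"), ("C", "Hub")])
for row in pagerank_impact(pre_rank, post_rank):
    print(row)
```

In this toy graph the third inbound CALLS edge raises the score of `Hub` and lowers the scores of its callers, which is the bottleneck signal described above.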

Step 6: Institutional Memory Retrieval (SIM-09)

The Reasoning Service is queried for ADRs and post-mortems relevant to the mutation context: the services being modified, added, or removed. Relevant memory is returned in the simulation output as "context you should know before making this change."

Step 7: Result Return and Sandbox Destruction (SIM-07, SIM-10)

The structured result is returned to the caller. Immediately after the result is returned, the ephemeral sandbox database is dropped:

DROP DATABASE sim_<uuid>

In the normal flow, no sandbox persists beyond the response. As a safety net, sandbox lifetime is capped at 10 minutes: a scheduled cleanup job drops any sandbox that the normal flow failed to clean up.
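
The selection logic of the scheduled cleanup job can be sketched as below. How the service records sandbox creation times is not specified in this document, so the `registry` dictionary (database name to creation time) and the `expired_sandboxes` helper are assumptions. The sketch only picks the names to drop; issuing the DROP DATABASE statement is left out.

```python
from datetime import datetime, timedelta, timezone

MAX_SANDBOX_AGE = timedelta(minutes=10)  # the bound stated in the text


def expired_sandboxes(created_at: dict[str, datetime],
                      now: datetime) -> list[str]:
    """Return sandbox names past the age bound, oldest first.

    Names that do not carry the sim_ prefix are never selected, so the
    production database cannot end up on the drop list.
    """
    stale = [(created, name) for name, created in created_at.items()
             if name.startswith("sim_") and now - created > MAX_SANDBOX_AGE]
    return [name for _, name in sorted(stale)]


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
registry = {
    "sim_a1": now - timedelta(minutes=25),
    "sim_b2": now - timedelta(minutes=11),
    "sim_c3": now - timedelta(minutes=2),
    "neo4j": now - timedelta(days=400),
}
print(expired_sandboxes(registry, now))
```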


Simulation Output

Structured Delta (SIM-07)

{
  "simulation_id": "sim_abc123",
  "mutation_summary": "Split OrderService into OrderCreationService and OrderFulfillmentService",
  "policies_newly_violated": [
    {
      "policy_id": "substrate/service-ownership",
      "severity": "soft-mandatory",
      "message": "OrderCreationService has no OWNS edge — assign ownership before deploying"
    }
  ],
  "policies_newly_satisfied": [
    {
      "policy_id": "substrate/solid-principles",
      "message": "SRP: OrderCreationService efferent coupling = 3 (was 7 in OrderService)"
    }
  ],
  "unchanged_violations": [],
  "blast_radius_delta": {
    "before": { "total_nodes": 14, "critical_nodes": 2 },
    "after": { "total_nodes": 11, "critical_nodes": 1 },
    "delta_nodes": -3,
    "delta_critical": -1
  },
  "pagerank_impact": [
    { "node": "AuthService", "before": 0.31, "after": 0.28, "change": -0.03 }
  ],
  "memory_context": [
    {
      "type": "DecisionNode",
      "id": "ADR-031",
      "title": "Order domain split deferred in 2023",
      "summary": "Split was deferred due to shared database — verify DB ownership is separated before proceeding"
    }
  ]
}
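
A consumer of this output may want to confirm that the reported deltas agree with the before and after values. The following sketch checks only the arithmetic of the fields shown above; `check_delta` is a hypothetical helper, not part of the service API.

```python
import math


def check_delta(result: dict) -> list[str]:
    """Return a list of arithmetic inconsistencies in a structured delta."""
    problems = []
    radius = result["blast_radius_delta"]
    before, after = radius["before"], radius["after"]
    if radius["delta_nodes"] != after["total_nodes"] - before["total_nodes"]:
        problems.append("delta_nodes does not equal after - before")
    if radius["delta_critical"] != (after["critical_nodes"]
                                    - before["critical_nodes"]):
        problems.append("delta_critical does not equal after - before")
    for row in result["pagerank_impact"]:
        expected = row["after"] - row["before"]
        # Compare with a tolerance: the scores are floats.
        if not math.isclose(row["change"], expected, abs_tol=1e-9):
            problems.append(f"pagerank change mismatch for {row['node']}")
    return problems


sample = {
    "blast_radius_delta": {
        "before": {"total_nodes": 14, "critical_nodes": 2},
        "after": {"total_nodes": 11, "critical_nodes": 1},
        "delta_nodes": -3,
        "delta_critical": -1,
    },
    "pagerank_impact": [
        {"node": "AuthService", "before": 0.31, "after": 0.28, "change": -0.03}
    ],
}
print(check_delta(sample))
```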

Plain English Summary by Role (SIM-08)

The Dense explain-lora adapter (Dense 70B, port 8001) generates a plain English summary tailored to the requesting role. Each role's summary emphasises a different part of the structured delta:

| Output | Metric | Role |
| --- | --- | --- |
| Blast Radius Delta | Number of downstream nodes affected | Architect: identifies high-impact changes early |
| Policy Delta | List of policies newly violated/satisfied | Developer: understands compliance before coding |
| Criticality Impact | Change in PageRank for core services | SRE: identifies new structural bottlenecks |
| Memory Context | Relevant past ADRs/post-mortems | PM: understands history of modified area |

Example developer summary: "Splitting OrderService into two services will resolve the SOLID SRP violation (coupling drops from 7 to 3) and reduce blast radius by 3 nodes. However, you will need to assign ownership for the two new services before deployment. Note: ADR-031 from 2023 flagged shared database as a blocker for this split — verify that is resolved."

Example architect summary: "This split reduces blast radius from 14 to 11 nodes and eliminates the highest coupling violation in the order domain. AuthService's PageRank drops slightly (0.31 → 0.28) which reduces the bottleneck risk. The primary risk is that two new services will need governance bootstrapping (ownership, documentation, test coverage)."


Result Persistence (SIM-11)

Simulation results are stored in PostgreSQL for 90 days and are queryable via the Simulation Service API. This enables:

  • Sprint planning sessions where multiple scenarios are compared side by side
  • Audit of what simulations were run before a major architectural change
  • Training data for improving the NL→mutation spec translation

Multi-Step Simulation (SIM-12 — v1.1)

A multi-step simulation applies mutations sequentially: mutation A, then B, then C, evaluating policy state after each step. This enables migration roadmap planning: "If we first extract the database layer (step 1), then split the service (step 2), then introduce the gateway (step 3), at what point do all policies pass?"
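
The roadmap question ("at what point do all policies pass?") reduces to a loop that evaluates every policy after each step. In this sketch the graph is a plain dictionary, each step is a function returning a new graph, and each policy is a Python predicate. All three are stand-ins for the sandbox graph, mutation specs, and Rego policies, and every name is illustrative.

```python
from typing import Callable, Optional

Graph = dict
Step = tuple[str, Callable[[Graph], Graph]]
Policy = Callable[[Graph], bool]


def run_steps(graph: Graph, steps: list[Step],
              policies: dict[str, Policy]) -> list[dict]:
    """Apply steps in order and record the failing policies after each one."""
    timeline = []
    current = graph
    for name, step in steps:
        current = step(current)
        failing = sorted(pid for pid, check in policies.items()
                         if not check(current))
        timeline.append({"step": name, "failing": failing})
    return timeline


def first_all_passing(timeline: list[dict]) -> Optional[int]:
    """1-based index of the first step after which nothing fails."""
    for number, entry in enumerate(timeline, start=1):
        if not entry["failing"]:
            return number
    return None


start = {"shared_db": True, "services": 1, "gateway": False}
roadmap = [
    ("extract database layer", lambda g: {**g, "shared_db": False}),
    ("split service", lambda g: {**g, "services": 2}),
    ("introduce gateway", lambda g: {**g, "gateway": True}),
]
rules = {
    "no-shared-db": lambda g: not g["shared_db"],
    "gateway-required": lambda g: g["gateway"],
}
timeline = run_steps(start, roadmap, rules)
for entry in timeline:
    print(entry)
print(first_all_passing(timeline))
```

Because each step returns a new graph, the starting state stays intact and the same roadmap can be replayed in a different order.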


Policy Impact Simulation (SIM-13 — v1.1)

A policy impact simulation runs a proposed new Rego policy against the current production graph — without modifying the graph — and returns the list of all services that would currently be in violation. This answers: "If we adopt this new policy today, what already breaks?" before the policy is activated in the Governance Service.
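
The shape of a policy impact run can be sketched with a Python predicate standing in for the proposed Rego policy. The service list, the `owner` field, and the `policy_impact` helper are hypothetical; the point is that the graph data is only read and the result is the list of services already in violation.

```python
from typing import Callable


def policy_impact(services: list[dict],
                  policy: Callable[[dict], bool]) -> list[str]:
    """Return the names of services that would violate `policy` today.

    The input is only read, never modified.
    """
    return sorted(s["name"] for s in services if not policy(s))


def has_owner(service: dict) -> bool:
    """Stand-in for a proposed rule: every service must have an owner."""
    return bool(service.get("owner"))


current_services = [
    {"name": "OrderService", "owner": "orders-team"},
    {"name": "AuthService", "owner": None},
    {"name": "RecommendationService"},
]
print(policy_impact(current_services, has_owner))
```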


Use Cases

| ID | Scenario | Mutation Input | Key Output |
| --- | --- | --- | --- |
| SIM-UC-01 | "What happens if I split OrderService?" | NL: split node | Blast radius delta, newly violated/satisfied policies, ADR-031 context |
| SIM-UC-02 | "What breaks if I upgrade axios to 1.x?" | JSON: update dependency version | License policy check, dependency traversal impact |
| SIM-UC-03 | "Blast radius of removing the API gateway?" | JSON: remove node | Full reachability delta, mTLS policy implications |
| SIM-UC-04 | "If we add this policy, what currently breaks?" | Policy impact simulation | List of currently non-compliant services |
| SIM-UC-05 | Sprint planning: model proposed new services | JSON: add nodes + edges | Policy readiness before sprint starts |

Functional Requirements

| ID | Requirement | Priority |
| --- | --- | --- |
| SIM-01 | Accept proposed mutation as structured JSON spec (add/remove node/edge, split node, merge nodes, update properties) | Must Have |
| SIM-02 | Accept natural language description and translate to mutation spec via MoE Scout; return spec for human confirmation before execution | Must Have |
| SIM-03 | Clone Observed + Intended graphs into ephemeral Neo4j named graph using CREATE DATABASE IF NOT EXISTS | Must Have |
| SIM-04 | Apply mutation spec to sandbox graph atomically | Must Have |
| SIM-05 | Run full active OPA policy suite against sandbox graph | Must Have |
| SIM-06 | Compute blast radius delta: before and after affected node count, with criticality weighting | Must Have |
| SIM-07 | Return structured delta: policies newly violated, newly satisfied, unchanged violations, blast radius delta, PageRank impact | Must Have |
| SIM-08 | Return plain English summary appropriate to requesting role (developer vs architect vs DevOps) | Must Have |
| SIM-09 | Surface relevant institutional memory: ADRs and post-mortems relevant to the proposed change context | Must Have |
| SIM-10 | Destroy ephemeral sandbox named graph after result returned; no sandbox persists | Must Have |
| SIM-11 | Simulation results stored in PostgreSQL for 90 days; queryable via API | Must Have |
| SIM-12 | Multi-step simulation: apply mutation A, then B, then evaluate, for migration roadmap planning | Nice to Have (v1.1) |
| SIM-13 | Policy impact simulation: "if we add this policy, what currently breaks?" without changing the graph | Nice to Have (v1.1) |

Infrastructure Dependencies

| Component | Role in Simulation Service |
| --- | --- |
| Neo4j 5.x | Sandbox named graph creation, graph cloning, mutation application, policy input subgraph export |
| PostgreSQL 16 | Simulation result persistence (90-day retention) |
| OPA Server | Policy evaluation against sandbox graph |
| DGX Spark port 8000 | Llama 4 Scout (MoE): NL→mutation spec translation |
| DGX Spark port 8001 | Dense 70B + explain-lora: plain English summary generation by role |
| Reasoning Service | Institutional memory retrieval for ADR and post-mortem context |
| Governance Service | PageRank and blast radius computation logic (shared) |