Deterministic Platform Functions (Non-LLM) — Design Reference¶
Overview¶
Not every Substrate function requires a language model. A substantial portion of Substrate's highest-value capabilities are fully deterministic — graph algorithms, infrastructure verification, cryptographic checks, policy evaluation, and job scheduling. These functions are faster, cheaper, and more auditable than their LLM counterparts and must not be replaced by probabilistic inference.
The governing principle: LLMs are for extraction, explanation, and reasoning. Deterministic functions are for enforcement, verification, and measurement.
This distinction is not academic. A governance decision that blocks a PR must be traceable to an exact policy clause and graph condition — not to a model's judgment. A drift score must be reproducible from the same inputs. An SSH verification result must not vary between runs on the same host state. Determinism is what makes Substrate legally and operationally defensible.
Graph Algorithms (Neo4j GDS)¶
PageRank¶
Algorithm: Neo4j GDS PageRank on the CALLS and DEPENDS_ON graph.
What it measures: The relative architectural importance of each service node. A service with high PageRank is called by many other important services — it is architecturally critical in the sense that failures or changes cascade broadly.
How it is used:
- High-PageRank services receive heavier weight in blast radius calculations (GOV-UC-04)
- PageRank is stored as page_rank on every Service node after each run
- Violation severity is scaled by the PageRank of the affected service — a boundary violation in a high-PageRank service triggers a higher urgency alert than the same violation in an isolated service
Trigger: Nightly Celery beat job + after any major graph change (more than 50 node or 200 edge additions/removals in a single ingestion batch)
Cypher invocation:
CALL gds.pageRank.write('service-dependency-graph', {
  writeProperty: 'page_rank',
  maxIterations: 20,
  dampingFactor: 0.85
})
Betweenness Centrality¶
Algorithm: Neo4j GDS Betweenness Centrality via Brandes' algorithm, O(|V| × |E|) time complexity.
What it measures: How often a service node lies on the shortest path between two other service nodes. A high-betweenness service is a structural bottleneck — changes to it ripple the furthest because the most dependency paths pass through it.
How it is used:
- High-betweenness services are flagged as structural bottlenecks in the Architecture Graph with a distinct visual indicator
- Betweenness is stored as betweenness on every Service node
- Proactive alerts are generated when a new service crosses the betweenness threshold (PRO-UC-01), indicating the architecture is creating a new bottleneck
Trigger: Nightly Celery beat job.
Note on scale: For large graphs (>100,000 edges), Brandes' exact algorithm becomes expensive. For these graphs, Substrate switches to NetworKit's ApproxBetweenness algorithm, which provides ε-approximation with configurable error bounds. The switch is automatic based on graph size.
Cypher invocation:
CALL gds.betweenness.write('service-dependency-graph', {
  writeProperty: 'betweenness'
})
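The size-based switch described in the scale note can be sketched with NetworkX as a stand-in. The production path uses GDS and NetworKit; the edge threshold, pivot sample size, and function name below are illustrative only:

```python
import networkx as nx

def betweenness_scores(G: nx.DiGraph, edge_threshold: int = 100_000,
                       sample_k: int = 256) -> dict:
    # Illustrative stand-in for the GDS/NetworKit split: exact Brandes on
    # small graphs, pivot-sampled approximation once the edge count crosses
    # the threshold. Threshold and sample size are assumed values.
    if G.number_of_edges() <= edge_threshold:
        return nx.betweenness_centrality(G)  # exact Brandes, O(|V| * |E|)
    k = min(sample_k, G.number_of_nodes())   # number of sampled pivot nodes
    return nx.betweenness_centrality(G, k=k, seed=42)
```

Sampling trades exactness for bounded runtime, which mirrors the ε-approximation trade-off NetworKit's ApproxBetweenness makes.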
Cycle Detection¶
Algorithm: Johnson's algorithm via NetworkX (nx.simple_cycles(G)) on the DEPENDS_ON subgraph.
What it measures: The presence of circular dependencies — Module A depends on Module B which depends on Module A. Circular dependencies in the DEPENDS_ON graph are always policy violation candidates because they create tight coupling, make independent deployment impossible, and are a leading indicator of architectural debt.
How it is used:
- Any cycle detected in the DEPENDS_ON subgraph triggers a policy violation candidate (OPA evaluates whether the cycle violates the active SOLID or boundary policy pack)
- Cycles are surfaced in the PR Check Detail with a visual cycle path annotation showing the exact nodes involved
- Cycles in the Intended Graph are a configuration error and trigger an immediate Architect alert
Trigger: Synchronously on every PR graph delta (before OPA evaluation — cycle detection is a prerequisite for the full policy evaluation pass)
Implementation note: The DEPENDS_ON subgraph is extracted from Neo4j as a NetworkX DiGraph for cycle detection. The full Neo4j graph is not passed to NetworkX — only the DEPENDS_ON projection for the PR's affected modules. This bounds the computation to the relevant subgraph.
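The extraction-and-detect step can be sketched as follows, assuming the DEPENDS_ON projection has already been pulled from Neo4j as a list of edges:

```python
import networkx as nx

def find_dependency_cycles(depends_on_edges: list[tuple[str, str]]) -> list[list[str]]:
    # Build the DEPENDS_ON projection as a DiGraph; as noted above, only the
    # PR's affected modules are extracted, never the full Neo4j graph.
    G = nx.DiGraph(depends_on_edges)
    # nx.simple_cycles implements Johnson's algorithm for directed graphs.
    return list(nx.simple_cycles(G))
```

Each returned list is the exact node sequence of one cycle, which is what the PR Check Detail annotation renders.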
Connected Components¶
Algorithm: Neo4j GDS Weakly Connected Components (Union-Find).
What it measures: Services that are entirely disconnected from the main dependency graph — no DEPENDS_ON, ACTUALLY_CALLS, or HOSTS edges connecting them to any other service.
How it is used:
- Services in isolated components are candidates for deprecation (no consumers, no producers) or integration (should be connected but aren't)
- Isolated components appear in the Proactive Maintenance feed with a "potentially orphaned service" alert
- Useful for detecting services that were shut down in production but still exist in the Intended Graph
Trigger: Nightly Celery beat job.
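A minimal sketch of the isolation check, using NetworkX weakly connected components as a stand-in for the GDS Union-Find run (function name and data shapes are illustrative):

```python
import networkx as nx

def orphan_candidates(edges: list[tuple[str, str]], all_services: set[str]) -> set[str]:
    # Weakly connected components (the same partition GDS WCC produces);
    # everything outside the largest component is a potential orphan.
    G = nx.DiGraph(edges)
    G.add_nodes_from(all_services)  # include services with no edges at all
    components = sorted(nx.weakly_connected_components(G), key=len, reverse=True)
    return {s for comp in components[1:] for s in comp}
```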
Graph Diffing¶
Algorithm: Set-difference of node and edge sets between two timestamped graph snapshots stored in PostgreSQL.
What it measures: The structural delta between two points in time — nodes added, removed, or modified; edges added, removed, or modified.
How it is used:
- Powers the diff view in the PR Check Detail screen (ING-UC-01): shows exactly what the PR changes in the graph
- Powers the temporal query capability (RSN-UC-06): "What changed before last Friday's incident?"
- Powers the Simulation Panel before/after comparison (SIM-UC-01 through SIM-UC-05)
Implementation: Two approaches used depending on graph size:
- Simple set-difference (default): node and edge sets are compared in both directions — snapshot_B − snapshot_A yields additions, snapshot_A − snapshot_B yields removals. O(n) using hash sets. Sufficient for PR-scoped diffs.
- Laplacian Anomaly Detection: For large-scale diffs where the number of changes is unknown, Laplacian spectral analysis detects structural anomalies (graph topology shifts) that simple set-difference would miss. Used in the nightly drift trend analysis.
Trigger: On PR open (for PR diff view), on demand (for simulation and temporal query).
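The default set-difference approach is small enough to show in full. A sketch, assuming snapshots carry hashable node IDs and edge tuples (the exact snapshot schema is not specified here):

```python
def graph_diff(snapshot_a: dict, snapshot_b: dict) -> dict:
    # Assumed snapshot shape: {"nodes": set[str], "edges": set[tuple]}
    # Both directions are computed so removals are captured, not just additions.
    return {
        "nodes_added":   snapshot_b["nodes"] - snapshot_a["nodes"],
        "nodes_removed": snapshot_a["nodes"] - snapshot_b["nodes"],
        "edges_added":   snapshot_b["edges"] - snapshot_a["edges"],
        "edges_removed": snapshot_a["edges"] - snapshot_b["edges"],
    }
```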
Leiden Community Detection¶
Algorithm: Leiden algorithm (Traag et al. 2019) via Neo4j GDS. Improvement over Louvain — produces communities with stronger internal connectivity and avoids the Louvain "internally disconnected community" problem.
What it measures: Natural community structure in the graph — groups of services that are more densely connected to each other than to the rest of the graph.
How it is used:
- GraphRAG community summaries: communities define the scope of RAPTOR map-reduce summarization (RSN-UC-04: "What are our top 3 architectural risks?")
- Community membership is stored on each Service node and used to scope LLM context windows — instead of passing the entire graph to the model, only the relevant community subgraph is included
- Community boundaries that don't align with declared domain boundaries are flagged as structural tension indicators
Efficiency optimization: On graph updates, community detection is re-run only for the communities containing the modified nodes; full re-detection runs nightly. The scoped re-run avoids the expensive O(|E| log |E|) full re-computation on every PR merge.
Topological Sort¶
Algorithm: Kahn's algorithm (BFS-based) for topological ordering of DAGs.
What it measures: The valid deployment order for services — an ordering where every service is deployed before the services that depend on it.
How it is used:
- Deployment order generation for CD pipelines: given a set of services to deploy, returns the safe sequential order
- Prerequisite check for simulation: before simulating "what happens if I remove service X?", topological sort identifies all services that must be redeployed if X changes
- When cycle detection finds a cycle, topological sort is blocked and the cycle is reported as the blocker
Trigger: On demand via FastAPI gateway endpoint; used by Simulation Service.
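The ordering and the cycle-blocking behavior can both be sketched with the standard library's graphlib, which processes ready nodes in the same Kahn-style fashion (the function name is illustrative):

```python
from graphlib import TopologicalSorter, CycleError

def deployment_order(depends_on: dict[str, set[str]]) -> list[str]:
    # depends_on maps each service to the set of services it depends on;
    # static_order() emits every dependency before its dependents, i.e.
    # the safe sequential deployment order.
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # Mirrors the behavior above: a cycle blocks ordering entirely and
        # the cycle itself is reported as the blocker.
        raise ValueError(f"deployment order blocked by cycle: {exc.args[1]}")
```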
Jaccard Similarity¶
Formula: J(A, B) = |A ∩ B| / |A ∪ B| where A and B are dependency sets.
What it measures: The overlap between the dependency sets of two services. High Jaccard similarity between two services' dependency sets indicates they share most of the same dependencies and are candidates for the DRY policy violation check.
How it is used:
- DRY (Don't Repeat Yourself) policy detection: services with Jaccard similarity > 0.80 on their dependency sets are flagged as candidates for extraction into a shared library
- Also used in entity resolution as a supplementary signal: two service nodes with high Jaccard similarity on their connectivity patterns are more likely to be the same canonical entity
Trigger: Nightly batch as part of the proactive maintenance scan.
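The formula and the DRY threshold scan translate directly into code. A sketch (function names and the pairwise scan strategy are illustrative; the 0.80 threshold is from the policy above):

```python
def jaccard_similarity(deps_a: set[str], deps_b: set[str]) -> float:
    # J(A, B) = |A intersect B| / |A union B|; two empty sets are treated
    # as sharing nothing measurable.
    if not deps_a and not deps_b:
        return 0.0
    return len(deps_a & deps_b) / len(deps_a | deps_b)

def dry_candidates(dep_sets: dict[str, set[str]], threshold: float = 0.80):
    # Pairwise scan over service dependency sets; pairs above the DRY
    # threshold are flagged for shared-library extraction.
    names = sorted(dep_sets)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if jaccard_similarity(dep_sets[a], dep_sets[b]) > threshold]
```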
Infrastructure and Verification¶
SSH Runtime Verification¶
What it does: Connects to a target host, runs a pre-approved inspection script, parses the JSON output, and diffs the result against the declared infrastructure topology.
Architecture (fully agentless, no LLM):
1. Vault AppRole authentication → ephemeral Ed25519 certificate (5-minute TTL)
2. SSH connection via ProxyJump; ForceCommand restricts session to the inspection script
3. Inspection script collects: running processes, open ports, active network connections, systemd service state, container state if Docker/Podman is present
4. Output is returned as structured JSON to the SSH Runtime Connector
5. Connector diffs observed state against declared topology (InfraResource nodes with declared_ports and expected services)
6. Discrepancies (undeclared services, unexpected ports, stopped declared services) are written to NATS as substrate.governance.ssh_drift_detected events
No LLM involved: The inspection script, JSON parsing, and diff computation are entirely deterministic. The LLM is used only if an Architect requests a plain-English explanation of the SSH diff result.
Trigger: 15-minute Celery beat schedule; on-demand via FastAPI gateway endpoint.
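Step 5, the deterministic diff of observed against declared state, can be sketched as follows. The dictionary shapes and finding kinds are illustrative; the real connector consumes the inspection script's JSON and the InfraResource node properties:

```python
def diff_host_state(observed: dict, declared: dict) -> list[dict]:
    # Assumed shapes: observed = {"open_ports": set[int], "services": set[str]}
    # from the inspection script; declared = {"declared_ports": ..., "services": ...}
    # from the InfraResource nodes.
    findings = []
    for port in sorted(observed.get("open_ports", set()) - declared.get("declared_ports", set())):
        findings.append({"kind": "unexpected_port", "port": port})
    for svc in sorted(observed.get("services", set()) - declared.get("services", set())):
        findings.append({"kind": "undeclared_service", "service": svc})
    for svc in sorted(declared.get("services", set()) - observed.get("services", set())):
        findings.append({"kind": "declared_service_stopped", "service": svc})
    return findings  # non-empty results become ssh_drift_detected events
```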
Webhook HMAC Verification¶
What it does: Validates the HMAC-SHA256 signature on all inbound webhooks before the payload is forwarded to NATS.
Implementation:
import hmac, hashlib

def verify_github_webhook(payload_body: bytes, signature_header: str, secret: bytes) -> bool:
    expected = "sha256=" + hmac.new(secret, payload_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)
hmac.compare_digest is mandatory — a direct string comparison leaks timing information that enables signature forgery.
On failure: Return HTTP 403, log to audit table with source IP and truncated payload hash (not full payload — avoids logging attacker-controlled content).
Trigger: Synchronously on every inbound webhook request; blocks processing until verified.
Embedding Deduplication¶
What it does: Before inserting a new node embedding into PostgreSQL pgvector, performs a cosine similarity check against existing embeddings to detect duplicate nodes that slipped through entity resolution.
Implementation:
SELECT node_id, 1 - (embedding <=> $new_embedding) AS similarity
FROM node_embeddings
ORDER BY embedding <=> $new_embedding
LIMIT 5;
If similarity > 0.98, the candidate is flagged as a probable duplicate and routed to the Verification Queue rather than inserted as a new node.
Trigger: On every new node embedding insertion; runs as a pre-insert check in the Ingestion Service.
JWT and OIDC Token Validation¶
What it does: Validates incoming JWT access tokens on every API request.
Implementation (local validation path):
1. Extract kid from JWT header
2. Fetch corresponding public key from cached JWKS (Redis TTL: 1 hour)
3. Verify RS256/ES256 signature
4. Verify iss matches configured Keycloak realm URL
5. Verify aud contains substrate-backend
6. Verify exp has not passed
7. Extract groups and realm_access.roles claims for RBAC
No network call for the common path. JWKS cache miss triggers a fetch from Keycloak, which adds ~10–30ms on the miss but is amortized across all requests in the cache window.
Introspection exception: For high-consequence actions (PR block, Fix PR trigger, exception approval), the token is introspected against Keycloak's /protocol/openid-connect/token/introspect endpoint to detect revocation.
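Steps 3 through 7 of the local validation path can be sketched with PyJWT, where a single decode() call enforces signature, iss, aud, and exp. The function name is illustrative, and the signing key is assumed to have been resolved from the token's kid via the cached JWKS (steps 1 and 2):

```python
import jwt  # PyJWT

def validate_access_token(token: str, signing_key, issuer: str) -> dict:
    # signing_key is the JWKS public key matching the token's kid header
    # (cached in Redis in production). decode() verifies the RS256/ES256
    # signature and the iss, aud, and exp claims in one call, raising
    # jwt.InvalidTokenError on any failure.
    claims = jwt.decode(
        token,
        signing_key,
        algorithms=["RS256", "ES256"],
        audience="substrate-backend",
        issuer=issuer,
    )
    # Step 7: extract the claims RBAC needs.
    return {
        "groups": claims.get("groups", []),
        "roles": claims.get("realm_access", {}).get("roles", []),
    }
```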
SCIM Lifecycle Events¶
What it does: Translates SCIM 2.0 user and group events from the IdP into atomic Neo4j graph mutations.
Event mapping (fully deterministic, no LLM):
| SCIM Event | Graph Mutation | Side Effect |
|---|---|---|
| POST /Users | CREATE (:Developer {scim_id, keycloak_id, github_handle, active: true}) | Publish substrate.identity.user_onboarded |
| PATCH /Users/{id} active=false | SET developer.active = false | Immediate key-person risk scan; CRITICAL flag on orphaned services |
| POST /Groups | CREATE (:Team {name, scim_group_id}) | None |
| PATCH /Groups/{id} add member | CREATE (developer)-[:MEMBER_OF {since: now(), role: "member"}]->(team) | None |
| PATCH /Groups/{id} remove member | DELETE the specific MEMBER_OF edge | Key-person risk check for team-owned services |
All mutations are wrapped in Neo4j transactions. If the transaction fails, the SCIM endpoint returns 500 and the IdP will retry.
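A hypothetical mapper sketch of the table above, translating event names into (Cypher, parameters) pairs for a single transaction. The event names and payload keys are illustrative, not the SCIM connector's actual wire format:

```python
def scim_to_mutation(event: str, payload: dict) -> tuple[str, dict]:
    # Hypothetical mapping layer: SCIM event -> (Cypher statement, parameters).
    # The real connector also publishes the NATS side effects listed above.
    if event == "user_created":
        return ("CREATE (:Developer {scim_id: $scim_id, active: true})",
                {"scim_id": payload["id"]})
    if event == "user_deactivated":
        return ("MATCH (d:Developer {scim_id: $scim_id}) SET d.active = false",
                {"scim_id": payload["id"]})
    if event == "member_added":
        return ("MATCH (d:Developer {scim_id: $user_id}), "
                "(t:Team {scim_group_id: $group_id}) "
                "CREATE (d)-[:MEMBER_OF {since: datetime(), role: 'member'}]->(t)",
                {"user_id": payload["user_id"], "group_id": payload["group_id"]})
    raise ValueError(f"unmapped SCIM event: {event}")
```

An unmapped event raises rather than silently dropping, so the IdP receives a 500 and retries, consistent with the transaction-failure behavior above.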
License Validation (Air-Gap)¶
What it does: Verifies the offline JWT license against the pre-distributed Ed25519 public key. No network call.
Implementation:
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey
import jwt  # PyJWT (assumed here) verifies EdDSA via its cryptography backend

def validate_license(token: str, public_key: Ed25519PublicKey) -> LicenseClaims:
    # Decode JWT, verify Ed25519 signature, check exp claim;
    # raises jwt.InvalidTokenError on any failure
    claims = jwt.decode(token, public_key, algorithms=["EdDSA"])
    # Claims carry: customer_id, tier, expiry, enabled_features
    return LicenseClaims(**claims)
On failure: If the signature is invalid or the JWT is expired, Substrate enters read-only mode. Ingestion stops, governance checks stop, but existing graph data remains readable. This allows incident response to continue even if a license renewal is delayed during an outage.
OPA Policy Evaluation¶
What it does: Evaluates the current Observed Graph context against the active OPA/Rego policy packs to produce a deterministic pass/fail result with violation details.
Architecture:
1. Governance Service serializes the relevant graph subgraph as JSON (affected service, its neighbors, its dependencies, its declared and observed state)
2. JSON context is passed to the OPA server via HTTP (POST /v1/data/substrate/violations)
3. OPA evaluates all active policy packs against the context; returns a structured JSON result listing all violations, the specific rule clause, and the graph path that triggered it
4. Governance Service writes violations to PostgreSQL and publishes to NATS
The LLM boundary: OPA evaluation is fully deterministic. The LLM (explain-lora) is invoked only to generate the plain English explanation of the violation for the PR comment — never to make the pass/fail decision. This is the critical distinction that makes Substrate governance legally defensible.
Trigger: Synchronously on PR open webhook; on Terraform apply webhook; on SSH drift event.
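The request and response handling around the OPA call (steps 2 and 3) is plain JSON plumbing. A sketch of the two pure pieces, with the HTTP POST itself left to the caller (helper names are illustrative):

```python
import json

def build_opa_request(subgraph_context: dict) -> bytes:
    # OPA's Data API expects the context document under the "input" key;
    # this body is POSTed to /v1/data/substrate/violations (step 2 above).
    return json.dumps({"input": subgraph_context}).encode()

def parse_opa_response(response_body: bytes) -> list[dict]:
    # The decision comes back under "result"; an absent key means the
    # violations rule produced nothing, i.e. a clean pass.
    return json.loads(response_body).get("result", [])
```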
Structural Tension Score¶
Formula:
tension_score = |intended_weight - observed_weight| / intended_weight
Where:
- intended_weight is the normalized strength of an edge in the Intended Graph (0.0–1.0; derived from policy priority and declaration strength)
- observed_weight is the normalized strength of the corresponding edge in the Observed Graph (call frequency, confidence score)
Special cases:
- Edge exists in Observed Graph but not in Intended Graph: tension_score = 1.0 (maximum tension — this is an undeclared dependency)
- Edge exists in Intended Graph but not in Observed Graph: tension_score = intended_weight (the intended dependency is missing from reality)
- Edge exists in both with matching weights: tension_score = 0.0
Aggregation: Domain-level tension score is the weighted average of all service tension scores in the domain, weighted by PageRank (higher-PageRank services contribute more to the domain score).
Trigger: Computed for every observed edge immediately after ingestion; stored on the edge in Neo4j and in the drift score time-series table in PostgreSQL.
Persistence and Job Coordination¶
Celery Beat Job Dispatch¶
What it does: Dispatches scheduled jobs for nightly and periodic background tasks.
Schedule:
| Job | Schedule | Target |
|---|---|---|
| Nightly ingestion (all connectors) | 2:00 AM daily | Ingestion Service |
| Entity resolution pass | 2:15 AM daily (after ingestion) | Ingestion Service |
| Drift score computation | 2:45 AM daily (after resolution) | Governance Service |
| PageRank + betweenness update | 3:00 AM daily | GDS batch job |
| Leiden community re-detection | 3:15 AM daily | GDS batch job |
| Verification queue auto-check | 4:00 AM daily | Reasoning Service |
| ADR gap detection (PRO-UC-06) | 4:30 AM daily | Proactive Maintenance |
| Key-person risk scan (PRO-UC-09) | 5:00 AM weekly (Monday) | Proactive Maintenance |
| Duplicate doc detection (PRO-UC-10) | 3:00 AM daily | Proactive Maintenance |
| Daily digest generation (PRO-UC-11) | 8:30 AM daily | Proactive Maintenance |
All Celery jobs use the Redis broker with AOF persistence. If Redis restarts mid-job, the job is re-queued from the AOF log rather than silently lost.
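A few rows of the schedule above, expressed as Celery beat_schedule config. The task module paths are illustrative placeholders; only the crontab timings come from the table:

```python
from celery.schedules import crontab

# Config sketch: task names are assumptions, schedules match the table above.
beat_schedule = {
    "nightly-ingestion": {
        "task": "ingestion.run_all_connectors",        # hypothetical path
        "schedule": crontab(hour=2, minute=0),
    },
    "drift-score-computation": {
        "task": "governance.compute_drift_scores",     # hypothetical path
        "schedule": crontab(hour=2, minute=45),
    },
    "key-person-risk-scan": {
        "task": "proactive.key_person_risk",           # hypothetical path
        "schedule": crontab(hour=5, minute=0, day_of_week="mon"),
    },
}
```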
Redis Prefix Cache Invalidation¶
What it does: Invalidates hot subgraph cache entries in Redis when the underlying graph data changes.
Implementation:
1. Ingestion Service writes graph update to Neo4j
2. Publishes substrate.cache.invalidate event to NATS with affected node IDs and edge types
3. Cache Service subscribes, computes cache key patterns affected by the update, issues Redis DEL or TTL reset for matching keys
Cache key namespace: subgraph:{query_type}:{node_id}:{depth}. Cache invalidation is key-pattern based, not full-cache flush, to avoid thrashing the cache on every ingestion event.
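Pattern computation for step 3 can be sketched directly from the key namespace. The query_type values here are illustrative placeholders:

```python
def invalidation_patterns(node_ids: list[str],
                          query_types: tuple = ("neighbors", "paths")) -> list[str]:
    # Key namespace from above: subgraph:{query_type}:{node_id}:{depth}.
    # A trailing * matches every cached depth for the node; in production
    # the patterns would feed SCAN + DEL rather than a blocking KEYS call.
    return [f"subgraph:{qt}:{nid}:*" for nid in node_ids for qt in query_types]
```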
Flyway Migrations (PostgreSQL)¶
Flyway manages all PostgreSQL schema evolution. Non-negotiable from day one.
Conventions:
- Migration files: migrations/postgres/V{version_number}__{description}.sql
- Versioning: sequential integers (V1, V2, ... not timestamps, to prevent merge conflicts)
- All migrations are forward-only (no rollback scripts)
- Breaking changes use a multi-step migration: (1) add new column, (2) backfill, (3) add constraint, (4) deprecate old column in a later version
Startup check: make migrate runs Flyway before any application service starts. If any migration fails, startup aborts with exit code 1 and logs the failing statement.
neo4j-migrations (Graph Schema)¶
neo4j-migrations manages Neo4j schema constraints and indexes.
Conventions:
- Migration files: migrations/neo4j/V{version_number}__{description}.cypher
- All constraint creation: CREATE CONSTRAINT IF NOT EXISTS FOR (n:NodeLabel) REQUIRE n.property IS UNIQUE
- All index creation: CREATE INDEX IF NOT EXISTS FOR (n:NodeLabel) ON (n.property)
- Label additions are additive; removals require explicit deprecation migration to avoid breaking existing queries
Critical indexes (must exist from day one):
CREATE INDEX service_name IF NOT EXISTS FOR (s:Service) ON (s.name);
CREATE INDEX service_domain IF NOT EXISTS FOR (s:Service) ON (s.domain);
CREATE INDEX developer_keycloak IF NOT EXISTS FOR (d:Developer) ON (d.keycloak_id);
CREATE INDEX decision_source IF NOT EXISTS FOR (d:DecisionNode) ON (d.source_url);
NATS JetStream Stream Replay¶
What it does: Provides at-least-once event delivery with replay capability for event-sourced graph rebuilds.
Stream configuration:
| Stream Name | Subject Filter | Retention | Consumers |
|---|---|---|---|
| INGESTION | substrate.ingestion.> | 7 days | Ingestion Service workers |
| GOVERNANCE | substrate.governance.> | 7 days | Governance Service workers |
| IDENTITY | substrate.identity.> | 30 days | Graph mutation workers |
| AUDIT | substrate.audit.> | 90 days | Audit log writer + SIEM export |
| CACHE | substrate.cache.> | 1 hour | Cache invalidation workers |
Replay semantics: If the Ingestion Service crashes mid-parse, the unacknowledged message is replayed after the ackWait timeout (default: 30 seconds). The worker re-processes the event idempotently — all graph writes use MERGE rather than CREATE to prevent duplicate nodes on replay.
Idempotency requirement: Every NATS consumer must be idempotent. Graph writes use MERGE (n:Service {name: $name}) ON CREATE SET n += $props ON MATCH SET n.last_updated = timestamp(). This ensures that replaying an event twice produces the same result as processing it once.