Security & Authentication — Design Reference

Zero Data Leaves the Building

This is the foundational security policy of Substrate. All source code, architecture diagrams, dependency graphs, policy logic, institutional memory, and audit logs stay within the customer's own infrastructure. There are no calls to OpenAI, Anthropic, or any external LLM API. There are no telemetry callbacks, license phone-home requests, or external data pipelines.

This policy is enforced at the infrastructure level, not the application level. The DGX Spark runs all AI inference locally. The air-gap license verification requires no outbound network. Every connector pulls from customer-internal systems. The Docker images are distributed for offline import.

The consequence of this policy is that every security control described in this document can be audited, configured, and verified by the customer without involving Substrate's team after deployment.


Authentication

OIDC via Keycloak

Substrate uses Keycloak as the identity provider, accessed via OpenID Connect. Keycloak is bundled in the Docker Compose deployment and the Helm umbrella chart, but customers may substitute an existing Keycloak instance or any OIDC-compliant IdP.

Realm Configuration

Substrate operates in a dedicated substrate realm, never the Keycloak master realm. This isolates Substrate's users, groups, roles, and clients from any other applications sharing the Keycloak instance.

OIDC Clients

Three clients are registered in the substrate realm:

Client ID | Type | Flow | Purpose
substrate-spa | Public | Authorization Code + PKCE | React SPA; no client secret; PKCE prevents authorization code interception
substrate-backend | Confidential | Client Credentials | FastAPI backend service authenticating to Keycloak for token introspection and user lookups
substrate-service | Service Account | Client Credentials | Admin API + SCIM provisioning; used by automated onboarding/offboarding workflows

Authorization Code Flow with PKCE

The SPA uses the Authorization Code flow with PKCE (Proof Key for Code Exchange):

  1. SPA generates a code_verifier (random 256-bit value, base64url-encoded) and derives code_challenge as the base64url-encoded SHA-256 hash of that verifier
  2. SPA redirects to Keycloak /auth with code_challenge and code_challenge_method=S256
  3. User authenticates; Keycloak redirects to SPA with authorization_code
  4. SPA exchanges code for tokens, sending code_verifier; Keycloak verifies the challenge before issuing tokens
  5. Access token used for all API requests; refresh token used for silent renewal

This flow ensures no client secret is embedded in the SPA bundle. Even if an authorization code is intercepted, it cannot be exchanged without the original code_verifier.

JWT Claims Structure

The Keycloak JWT access token carries the following claims relevant to Substrate authorization:

{
  "sub": "uuid",
  "iss": "https://auth.internal/realms/substrate",
  "aud": ["substrate-backend"],
  "exp": 1234567890,
  "groups": ["/Engineering/Backend", "/Engineering"],
  "realm_access": {
    "roles": ["developer"]
  },
  "resource_access": {
    "substrate-backend": {
      "roles": ["graph:read", "policy:read"]
    }
  }
}

The groups claim carries full paths (e.g., /Engineering/Backend). This is achieved by configuring a Group Membership mapper, with the full group path option enabled, at the client scope level in Keycloak. The full path enables direct mapping to the Neo4j team hierarchy without additional lookups.

Token Validation — Hybrid Approach

Substrate uses a hybrid validation strategy rather than pure local validation or pure introspection:

  • Most requests: Local JWT signature verification against cached JWKS (fetched from Keycloak's JWKS endpoint, cached in Redis with 1-hour TTL). Validates: RS256/ES256 signature, iss matches configured realm, aud matches configured client ID, exp has not passed. This requires no round trip to Keycloak per request.
  • Introspection: Used only when immediate revocation detection is required — specifically when the Governance Service makes a policy decision that triggers an action with significant side effects (blocking a PR, triggering a Fix PR workflow). Introspection adds ~5–15ms but confirms the token was not revoked since issuance.

This hybrid approach avoids the failure mode of "revoked user's token is accepted for up to 1 hour" for high-consequence actions, while maintaining sub-millisecond validation for the vast majority of read requests.
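
The split between the two paths can be sketched as follows. This is an illustrative outline, not Substrate's gateway code: it assumes the signature has already been verified against the cached JWKS, and the action names in HIGH_CONSEQUENCE_ACTIONS are hypothetical labels for the two examples given above.

```python
import time
from typing import Optional

# Hypothetical action identifiers for "blocking a PR" and
# "triggering a Fix PR workflow"; the real names are not specified.
HIGH_CONSEQUENCE_ACTIONS = {"pr:block", "fix_pr:trigger"}


def validate_claims(claims: dict, issuer: str, audience: str,
                    now: Optional[float] = None) -> bool:
    """Check iss, aud, and exp on claims whose signature is already verified."""
    now = time.time() if now is None else now
    aud = claims.get("aud", [])
    audiences = [aud] if isinstance(aud, str) else list(aud)
    exp = claims.get("exp")
    return (
        claims.get("iss") == issuer
        and audience in audiences
        and isinstance(exp, (int, float))
        and exp > now
    )


def needs_introspection(action: str) -> bool:
    """Only high-consequence actions pay the extra introspection round trip."""
    return action in HIGH_CONSEQUENCE_ACTIONS
```

A request handler would run validate_claims on every request and call Keycloak's introspection endpoint only when needs_introspection returns True.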


Authorization (RBAC)

Roles

Role | Capabilities
Admin | Full access; user management; system configuration; license management
Architect | Full graph access; policy authoring; simulation; exception approval; graph mutation
Developer | PR check details; own service dependency graph; intent mismatch alerts; read-only graph for owned services
Viewer | Read-only access to graphs, dashboards, and reports; no policy or graph mutation
Service Account (CI/CD) | API-only; PR check submission; webhook delivery; no UI access

Group-Based Role Mapping

JWT groups claim maps to Substrate RBAC roles via a configurable mapping table in the FastAPI gateway. Example:

/Engineering/Platform → Architect
/Engineering/Backend → Developer
/Engineering/Frontend → Developer
/Leadership → Viewer

This mapping is stored in PostgreSQL and editable by Admin role users via the API.
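
A lookup against such a table might look like the sketch below. Two points are assumptions rather than documented behavior: group paths are matched exactly (so /Engineering alone maps to nothing here), and when a user belongs to several mapped groups the most privileged role wins, using the hypothetical ROLE_PRECEDENCE order.

```python
from typing import Iterable, Mapping, Optional

# Example mapping copied from the table above; in production it would be
# loaded from PostgreSQL.
GROUP_ROLE_MAP = {
    "/Engineering/Platform": "Architect",
    "/Engineering/Backend": "Developer",
    "/Engineering/Frontend": "Developer",
    "/Leadership": "Viewer",
}

# Assumed precedence, most privileged first.
ROLE_PRECEDENCE = ["Admin", "Architect", "Developer", "Viewer"]


def resolve_role(groups: Iterable[str],
                 mapping: Mapping[str, str] = GROUP_ROLE_MAP) -> Optional[str]:
    """Return the most privileged role granted by any of the user's groups."""
    granted = {mapping[g] for g in groups if g in mapping}
    for role in ROLE_PRECEDENCE:
        if role in granted:
            return role
    return None  # no mapped group: caller decides how to treat the user
```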

Graph-Level Ownership Permissions

Beyond role-based access, graph ownership edges define fine-grained permissions:

  • A Developer can see detailed component-level graph data for services where an OWNS edge exists from their Developer node (directly or via MEMBER_OF → Team → OWNS)
  • A Developer cannot see internal component detail for services owned by other teams without Viewer-level access being explicitly granted
  • Architect role bypasses ownership-scoped filtering — Architects see the full graph

CODEOWNERS files from GitHub repositories are ingested as an additional ownership signal with 0.70 confidence weight, supplementing the SCIM-derived OWNS edges.
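
The ownership rules above can be illustrated with a small in-memory model. This is a sketch only: the real check would be a Neo4j traversal, and the dictionary shapes, the viewer_grants set, and the function names are all hypothetical.

```python
def owned_services(developer, owns, member_of):
    """Services reachable via OWNS directly or via MEMBER_OF -> Team -> OWNS.

    owns:      dict mapping an owner (developer or team) to a set of services
    member_of: dict mapping a developer to a set of teams
    """
    owners = {developer} | member_of.get(developer, set())
    return {svc for owner in owners for svc in owns.get(owner, set())}


def can_view_component_detail(role, developer, service, owns, member_of,
                              viewer_grants=frozenset()):
    """Apply the ownership-scoped filter for component-level graph data."""
    if role in ("Admin", "Architect"):
        return True  # full graph access, no ownership filtering
    if service in owned_services(developer, owns, member_of):
        return True
    # Explicit Viewer-level grant on a service owned by another team.
    return (developer, service) in viewer_grants
```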


Lifecycle Management (SCIM 2.0)

Overview

SCIM 2.0 is used for automated user and group lifecycle management. Every SCIM event from the IdP triggers an atomic graph mutation in Neo4j, keeping the identity graph in sync with the authoritative identity source.

Plugin

Substrate targets the Captain-P-Goldfish/scim-for-keycloak plugin for Keycloak SCIM 2.0 support. Native Keycloak SCIM support is expected approximately mid-2026 in Keycloak 26.6. In the interim the plugin provides equivalent functionality; it will be replaced by native support when available, without requiring any changes to Substrate's SCIM consumer implementation.

Onboarding Flow

When a new user is provisioned in the IdP:

  1. IdP sends POST /scim/v2/Users with user attributes (display name, email, groups, external ID)
  2. Substrate SCIM endpoint creates Developer node in Neo4j with active: true
  3. For each group membership in the SCIM payload, a MEMBER_OF edge is created: (developer)-[:MEMBER_OF {since: now(), role: "member"}]->(team)
  4. If the team node does not yet exist, it is created with verification_status = Unverified (team metadata ingested separately from Terraform/GitHub)
  5. A substrate.identity.user_onboarded event is published to NATS

Offboarding Flow

When a user is deactivated in the IdP:

  1. IdP sends PATCH /scim/v2/Users/{id} with active: false
  2. Substrate sets Developer.active = false in Neo4j
  3. Immediate key-person risk scan: Substrate queries for all Service nodes where this Developer is the sole or primary owner (direct OWNS or inherited via MEMBER_OF → Team → OWNS)
  4. Any service with no remaining active owner is immediately flagged CRITICAL in the Verification Queue
  5. Notifications are sent to all Architects (in-app + Slack if configured)
  6. A substrate.identity.user_offboarded event is published to NATS with the list of affected services

The key-person risk scan runs synchronously within the SCIM handler — offboarding must not return 200 OK until the risk scan has completed and alerts have been queued.
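
The core of the scan in steps 3 and 4 is a set difference: services the departing developer could reach through ownership, minus services still reachable by some active developer. The sketch below shows that logic over plain dictionaries; it is an assumption-laden stand-in for the actual Neo4j query, and all names are hypothetical.

```python
def services_without_active_owner(offboarded, active, owns, member_of):
    """Return services left with no active owner after a developer is offboarded.

    active:    dict mapping developer -> bool (Developer.active)
    owns:      dict mapping owner (developer or team) -> set of services
    member_of: dict mapping developer -> set of teams
    """
    def services_of(dev):
        owners = {dev} | member_of.get(dev, set())
        return {svc for owner in owners for svc in owns.get(owner, set())}

    affected = services_of(offboarded)
    still_covered = set()
    for dev, is_active in active.items():
        if is_active and dev != offboarded:
            still_covered |= services_of(dev)
    # Sorted so the CRITICAL flags and the NATS event payload are stable.
    return sorted(affected - still_covered)
```

In this model a service owned by a team stays covered as long as the team has at least one other active member, which matches the inherited-ownership path described in step 3.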

Team Membership Events

SCIM Event | Graph Mutation
POST /Groups (new group) | CREATE (:Team {name: $group_name})
PATCH /Groups/{id} add member | CREATE (user)-[:MEMBER_OF {since: now(), role: "member"}]->(team)
PATCH /Groups/{id} remove member | DELETE the specific MEMBER_OF edge; re-run key-person risk check for all services that team owns
DELETE /Groups/{id} | Mark team verification_status = Deprecated; queue all owned services for ownership reassignment

Security Controls

TLS and mTLS

  • External TLS: All HTTP traffic from clients to the FastAPI gateway uses TLS 1.3. Certificates managed by Let's Encrypt (internet-accessible) or customer-provided PKI (air-gapped deployments).
  • Internal mTLS: All inter-service traffic on the Docker bridge network uses mutual TLS. Certificates are short-lived (24-hour TTL) issued by the Vault CA. Services renew certificates automatically via Vault Agent Sidecar before expiry.
  • Model endpoints: All vLLM endpoints on localhost ports 8000–8004 are protected by mTLS when accessed from application services.

HashiCorp Vault SSH Secrets Engine

Substrate uses Vault's SSH Secrets Engine to issue short-lived certificates for the SSH Runtime Connector. This is the only way the Substrate application touches live infrastructure.

Architecture:

  1. SSH Runtime Connector authenticates to Vault using AppRole (role_id + secret_id stored as Docker secrets)
  2. Vault issues an ephemeral Ed25519 keypair and signs the public key with the Substrate SSH CA
  3. Signed certificate has a 5-minute TTL — it expires before any manual session could be established
  4. Connector SSHes to target host via ProxyJump (no direct exposure; no agent forwarding enabled)
  5. ForceCommand on the target host restricts the SSH session to running only the pre-approved inspection script
  6. Inspection script outputs JSON; connector reads output, diffs against declared topology, writes result to NATS
  7. Certificate expires automatically; no cleanup required

This architecture means:

  • No long-lived SSH keys exist anywhere in the system
  • A stolen certificate becomes unusable within 5 minutes of issuance, when its TTL expires
  • ForceCommand prevents the connector from running arbitrary commands even while the certificate is valid
  • With agent forwarding disabled, a compromised target host cannot reuse the connector's credentials for lateral movement
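
As an illustration of steps 4 and 5, the connector's SSH invocation could be assembled roughly as below. The options used (-J, CertificateFile, ForwardAgent, IdentitiesOnly, BatchMode) are standard OpenSSH client options; the function name, the default user name, and the example paths are hypothetical.

```python
def build_ssh_argv(target, jump_host, key_path, cert_path,
                   user="substrate-inspect"):
    """Build an ssh command line for a certificate-authenticated inspection run.

    No remote command is appended: ForceCommand on the target decides what
    runs, so anything passed here would be ignored by the server anyway.
    """
    return [
        "ssh",
        "-J", jump_host,                       # reach the target via ProxyJump
        "-i", key_path,                        # ephemeral private key
        "-o", f"CertificateFile={cert_path}",  # short-lived Vault-signed cert
        "-o", "ForwardAgent=no",               # never forward an agent
        "-o", "IdentitiesOnly=yes",            # offer only the key given above
        "-o", "BatchMode=yes",                 # fail instead of prompting
        f"{user}@{target}",
    ]


argv = build_ssh_argv("10.0.0.12", "bastion.internal",
                      "/tmp/id_ed25519", "/tmp/id_ed25519-cert.pub")
```

The resulting list can be passed to subprocess.run without a shell, which avoids quoting problems with host names or paths.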

Webhook HMAC Verification

All inbound webhooks are verified before the payload is processed:

Source | Header | Algorithm
GitHub | X-Hub-Signature-256 | HMAC-SHA256 of payload body with webhook secret
Jira | X-Hub-Signature or custom header | HMAC-SHA256
Confluence | Custom X-Webhook-Token header | HMAC-SHA256

Verification uses a constant-time comparison (hmac.compare_digest) to prevent timing attacks. Any webhook that fails verification returns 403 immediately and logs the rejection to the audit table with the source IP.
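
For the GitHub case, the check can be sketched as follows. GitHub sends the header as sha256= followed by the hex digest of the raw request body; the function name is hypothetical, and the Jira and Confluence variants would differ in header name and format.

```python
import hashlib
import hmac


def verify_github_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Validate an X-Hub-Signature-256 header against the raw request body."""
    if not header_value or not header_value.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in time independent of where the inputs differ.
    return hmac.compare_digest(expected.encode("utf-8"),
                               header_value.encode("utf-8"))
```

The body must be the exact bytes received, before any JSON parsing or re-serialization, or the digest will not match. A caller would return 403 and write the audit entry whenever this function returns False.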

Audit Log Immutability

All user and system actions are written to a PostgreSQL audit table with the following properties:

  • Append-only: The application user has INSERT privileges only, with no UPDATE or DELETE. This is enforced by database privileges on the table, not by application logic.
  • Fields: event_id (UUID), timestamp (timestamptz), actor (user ID or service account), action (string), resource_type, resource_id, input_hash (SHA-256 of inputs), output_hash (SHA-256 of outputs), confidence (for agent actions), reasoning (for LLM-involved actions)
  • Partitioned: pg_partman partitions by month for query performance; no partition is ever dropped within the 2-year retention window
  • Exported: Audit events are published to NATS substrate.audit.> so external SIEM systems can subscribe

This audit log is how a blocked developer can trace the exact policy clause and graph condition that caused their PR to be blocked. It is also the basis for the Agent Orchestration immutable action log (AOC-UC-05).
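
An audit row with the fields listed above might be assembled as in the sketch below. The canonicalization used for input_hash and output_hash (JSON with sorted keys and compact separators) is an assumption, since the document does not specify how inputs are serialized before hashing; the function names are hypothetical.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def sha256_of(value) -> str:
    """Hash a JSON-serializable value in a key-order-independent way."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_audit_event(actor, action, resource_type, resource_id,
                      inputs, outputs, confidence=None, reasoning=None):
    """Return a dict matching the audit table's columns, ready for INSERT."""
    return {
        "event_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "resource_type": resource_type,
        "resource_id": resource_id,
        "input_hash": sha256_of(inputs),
        "output_hash": sha256_of(outputs),
        "confidence": confidence,  # set for agent actions
        "reasoning": reasoning,    # set for LLM-involved actions
    }
```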

Air-Gap License Verification

License validation requires no outbound network:

  1. At build/deployment time, Substrate's licensing system generates a license JWT signed with an Ed25519 private key
  2. The customer receives: the signed license JWT + the corresponding Ed25519 public key
  3. At runtime, Substrate verifies the license JWT signature locally against the pre-distributed public key
  4. The JWT contains: customer ID, license tier, expiry date, enabled feature flags
  5. No outbound call is made; no license server is queried

If the license JWT is expired or the signature fails verification, Substrate enters a read-only mode that allows viewing the existing graph but blocks all ingestion and governance actions. This allows incident response to continue even if a license renewal is delayed.
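
The degraded-mode decision can be sketched like this. The Ed25519 check itself is left as a caller-supplied callable (signature_is_valid is a hypothetical stand-in for whatever verification routine is used), and the sketch assumes the expiry date is carried in the standard exp claim as a Unix timestamp.

```python
import base64
import json
import time


def decode_jwt_payload(token: str) -> dict:
    """Decode the middle JWT segment. This does NOT verify the signature."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def license_mode(token: str, signature_is_valid, now=None) -> str:
    """Return "full" for a valid, unexpired license, otherwise "read_only"."""
    now = time.time() if now is None else now
    # Verify first; claims from an unverified token must not be trusted.
    if not signature_is_valid(token):
        return "read_only"
    exp = decode_jwt_payload(token).get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        return "read_only"
    return "full"
```

Ingestion and governance endpoints would consult this mode before acting, while graph read endpoints stay available in both modes.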

Credential Storage

  • All secrets (database passwords, webhook secrets, Vault AppRole credentials, JWT signing keys) are stored as Docker secrets and mounted as files at /run/secrets/
  • Application reads secrets from files, never from environment variables
  • No secrets in plain text, Dockerfile ENV instructions, or source control
  • .env files are in .gitignore and are never committed; .env.example files contain only placeholder values
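
Reading a file-mounted secret can be as small as the sketch below. The helper name and the secret name in the usage note are hypothetical; the only detail taken from the text is the /run/secrets/ mount point.

```python
from pathlib import Path

SECRETS_DIR = Path("/run/secrets")


def read_secret(name: str, secrets_dir: Path = SECRETS_DIR) -> str:
    """Read one secret from its mounted file, stripping the trailing newline."""
    # Reject anything that is not a plain file name, to avoid path traversal.
    if name in ("", ".", "..") or Path(name).name != name:
        raise ValueError(f"invalid secret name: {name!r}")
    return (secrets_dir / name).read_text(encoding="utf-8").rstrip("\n")
```

A service would call, for example, read_secret("postgres_password") at startup and keep the value in memory, never copying it into os.environ.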

Security Architecture Summary

Concern | Control | Implementation
Data locality | Zero external LLM calls | Local DGX Spark inference; no API keys for external models
Authentication | OIDC/PKCE | Keycloak substrate realm; Authorization Code + PKCE for SPA
Authorization | RBAC + graph ownership | JWT roles + OWNS edge traversal
Identity lifecycle | SCIM 2.0 | Atomic graph mutations on onboard/offboard events
SSH access | Ephemeral certificates | Vault SSH CA; 5-minute Ed25519 cert TTL; ForceCommand
Inter-service auth | mTLS | 24-hour Vault-issued certificates; auto-rotation
Webhook integrity | HMAC-SHA256 | Constant-time comparison; 403 + audit log on failure
Audit trail | Append-only PostgreSQL | No DELETE privilege; 2-year retention; NATS export
License verification | Ed25519 offline JWT | No outbound call; read-only degraded mode on expiry
Credentials | Docker secrets | File-mounted; never in environment variables or source control