Security & Authentication — Design Reference¶

Zero Data Leaves the Building¶

This is the foundational security policy of Substrate. All source code, architecture diagrams, dependency graphs, policy logic, institutional memory, and audit logs stay within the customer's own infrastructure. There are no calls to OpenAI, Anthropic, or any external LLM API. There are no telemetry callbacks, license phone-home requests, or external data pipelines.

This policy is enforced at the infrastructure level, not the application level. The DGX Spark runs all AI inference locally. Air-gap license verification requires no outbound network access. Every connector pulls from customer-internal systems. The Docker images are distributed for offline import.

The consequence of this policy is that every security control described in this document can be audited, configured, and verified by the customer without involving Substrate's team after deployment.

Authentication¶

OIDC via Keycloak¶

Substrate uses Keycloak as the identity provider, accessed via OpenID Connect. Keycloak is bundled in the Docker Compose deployment and the Helm umbrella chart, but customers may substitute an existing Keycloak instance or any OIDC-compliant IdP.

Realm Configuration¶

Substrate operates in a dedicated substrate realm, never the Keycloak master realm. This isolates Substrate's users, groups, roles, and clients from any other applications sharing the Keycloak instance.

OIDC Clients¶

Three clients are registered in the substrate realm:

Client ID	Type	Flow	Purpose
`substrate-spa`	Public	Authorization Code + PKCE	React SPA — no client secret; PKCE prevents authorization code interception
`substrate-backend`	Confidential	Client Credentials	FastAPI backend service authenticating to Keycloak for token introspection and user lookups
`substrate-service`	Service Account	Client Credentials	Admin API + SCIM provisioning; used by automated onboarding/offboarding workflows

Authorization Code Flow with PKCE¶

The SPA uses the Authorization Code flow with PKCE (Proof Key for Code Exchange):

SPA generates a code_verifier (random 256-bit value, base64url-encoded) and derives code_challenge as the base64url-encoded SHA-256 hash of the verifier
SPA redirects to Keycloak's authorization endpoint (/auth) with code_challenge and code_challenge_method=S256
User authenticates; Keycloak redirects to SPA with authorization_code
SPA exchanges code for tokens, sending code_verifier; Keycloak verifies the challenge before issuing tokens
Access token used for all API requests; refresh token used for silent renewal

This flow ensures no client secret is embedded in the SPA bundle. Even if an authorization code is intercepted, it cannot be exchanged without the original code_verifier.

JWT Claims Structure¶

The Keycloak JWT access token carries the following claims relevant to Substrate authorization:

{
  "sub": "uuid",
  "iss": "https://auth.internal/realms/substrate",
  "aud": ["substrate-backend"],
  "exp": 1234567890,
  "groups": ["/Engineering/Backend", "/Engineering"],
  "realm_access": {
    "roles": ["developer"]
  },
  "resource_access": {
    "substrate-backend": {
      "roles": ["graph:read", "policy:read"]
    }
  }
}

The groups claim carries full paths (e.g., /Engineering/Backend) — this is achieved by configuring a Group Membership Mapper at the client scope level in Keycloak. The full path enables direct mapping to the Neo4j team hierarchy without additional lookups.
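
As a small illustration of why full paths avoid extra lookups, a group path can be expanded into its chain of ancestor teams directly. The helper below is a hypothetical sketch and assumes group names themselves contain no "/" characters.

```python
from typing import List


def team_ancestry(group_path: str) -> List[str]:
    """Expand a full group path into the path of each ancestor, root first."""
    parts = [p for p in group_path.split("/") if p]
    return ["/" + "/".join(parts[: i + 1]) for i in range(len(parts))]


# "/Engineering/Backend" expands to the parent team followed by the team itself.
ancestry = team_ancestry("/Engineering/Backend")
```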

Token Validation — Hybrid Approach¶

Substrate uses a hybrid validation strategy rather than pure local validation or pure introspection:

Most requests: Local JWT signature verification against cached JWKS (fetched from Keycloak's JWKS endpoint, cached in Redis with 1-hour TTL). Validates: RS256/ES256 signature, iss matches configured realm, aud matches configured client ID, exp has not passed. This adds no round trip to Keycloak per request.
Introspection: Used only when immediate revocation detection is required — specifically when the Governance Service makes a policy decision that triggers an action with significant side effects (blocking a PR, triggering a Fix PR workflow). Introspection adds ~5–15ms but confirms the token was not revoked since issuance.

This hybrid approach avoids the failure mode of a revoked user's token being accepted for the rest of its lifetime on high-consequence actions, while maintaining sub-millisecond validation for the vast majority of read requests.
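
The claim checks and the decision of when to introspect can be sketched as follows. This is an illustrative outline only: it assumes the signature has already been verified against the cached JWKS by a JWT library, and the action names in HIGH_CONSEQUENCE_ACTIONS are hypothetical labels for the cases named above.

```python
import time
from typing import Any, Mapping, Optional

# Hypothetical action names for the high-consequence cases (blocking a PR, Fix PR workflow).
HIGH_CONSEQUENCE_ACTIONS = {"block_pr", "trigger_fix_pr"}


def claims_are_valid(claims: Mapping[str, Any], issuer: str, audience: str,
                     now: Optional[float] = None) -> bool:
    """Check iss, aud, and exp on claims whose signature was already verified."""
    now = time.time() if now is None else now
    aud = claims.get("aud", [])
    audiences = [aud] if isinstance(aud, str) else list(aud)
    exp = claims.get("exp")
    return (
        claims.get("iss") == issuer
        and audience in audiences
        and isinstance(exp, (int, float))
        and now < exp
    )


def needs_introspection(action: str) -> bool:
    """Only high-consequence actions pay for the extra round trip to Keycloak."""
    return action in HIGH_CONSEQUENCE_ACTIONS
```

A request handler would call claims_are_valid on every request and add an introspection call only when needs_introspection returns True.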

Authorization (RBAC)¶

Roles¶

Role	Capabilities
Admin	Full access; user management; system configuration; license management
Architect	Full graph access; policy authoring; simulation; exception approval; graph mutation
Developer	PR check details; own service dependency graph; intent mismatch alerts; read-only graph for owned services
Viewer	Read-only access to graphs, dashboards, and reports; no policy or graph mutation
Service Account (CI/CD)	API-only; PR check submission; webhook delivery; no UI access

Group-Based Role Mapping¶

The JWT groups claim maps to Substrate RBAC roles via a configurable mapping table in the FastAPI gateway. Example:

/Engineering/Platform → Architect
/Engineering/Backend → Developer
/Engineering/Frontend → Developer
/Leadership → Viewer

This mapping is stored in PostgreSQL and editable by Admin role users via the API.
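
A minimal sketch of the lookup, with the example rows held in a dictionary instead of PostgreSQL. The source does not say how a user with several mapped groups is resolved; this sketch assumes the most privileged role wins, and ROLE_RANK is a hypothetical ordering introduced for that purpose.

```python
from typing import Dict, Iterable, Optional

# Example rows from the mapping above; in Substrate the table lives in PostgreSQL.
GROUP_ROLE_MAP: Dict[str, str] = {
    "/Engineering/Platform": "Architect",
    "/Engineering/Backend": "Developer",
    "/Engineering/Frontend": "Developer",
    "/Leadership": "Viewer",
}

# Assumed precedence, least to most privileged; the source does not define tie-breaking.
ROLE_RANK = {"Viewer": 0, "Developer": 1, "Architect": 2, "Admin": 3}


def resolve_role(groups: Iterable[str],
                 mapping: Dict[str, str] = GROUP_ROLE_MAP) -> Optional[str]:
    """Return the most privileged role mapped from any of the user's groups, or None."""
    roles = [mapping[g] for g in groups if g in mapping]
    return max(roles, key=ROLE_RANK.__getitem__) if roles else None
```

With the groups claim from the JWT example, resolve_role(["/Engineering/Backend", "/Engineering"]) yields "Developer" because only the first path has a mapping row.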

Graph-Level Ownership Permissions¶

Beyond role-based access, graph ownership edges define fine-grained permissions:

A Developer can see detailed component-level graph data for services where an OWNS edge exists from their Developer node (directly or via MEMBER_OF → Team → OWNS)
A Developer cannot see internal component detail for services owned by other teams unless Viewer-level access has been explicitly granted
Architect role bypasses ownership-scoped filtering — Architects see the full graph

CODEOWNERS files from GitHub repositories are ingested as an additional ownership signal with 0.70 confidence weight, supplementing the SCIM-derived OWNS edges.
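
The ownership rule can be illustrated with plain dictionaries standing in for the Neo4j edges. This is a sketch under stated assumptions: the data structures and function name are hypothetical, explicit Viewer-level grants and confidence weights are not modelled, and Admin is treated like Architect because the roles table gives it full access.

```python
from typing import Dict, Set


def can_view_components(developer: str, service: str, role: str,
                        owns: Dict[str, Set[str]],
                        member_of: Dict[str, Set[str]]) -> bool:
    """Return True if the caller may see component-level detail for the service.

    owns maps an owner (developer or team) to the services it owns;
    member_of maps a developer to the teams they belong to.
    """
    if role in ("Architect", "Admin"):
        return True  # these roles bypass ownership-scoped filtering
    if service in owns.get(developer, set()):
        return True  # direct OWNS edge
    # Inherited ownership: MEMBER_OF -> Team -> OWNS
    return any(service in owns.get(team, set())
               for team in member_of.get(developer, set()))
```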

Lifecycle Management (SCIM 2.0)¶

Overview¶

SCIM 2.0 is used for automated user and group lifecycle management. Every SCIM event from the IdP triggers an atomic graph mutation in Neo4j, keeping the identity graph in sync with the authoritative identity source.

Plugin¶

Substrate targets the Captain-P-Goldfish/scim-for-keycloak plugin for Keycloak SCIM 2.0 support. Native Keycloak SCIM support is expected approximately mid-2026 in Keycloak 26.6. Until then the plugin provides equivalent functionality; once native support is available it will replace the plugin without requiring any changes to Substrate's SCIM consumer implementation.

Onboarding Flow¶

When a new user is provisioned in the IdP:

IdP sends POST /scim/v2/Users with user attributes (display name, email, groups, external ID)
Substrate SCIM endpoint creates Developer node in Neo4j with active: true
For each group membership in the SCIM payload, a MEMBER_OF edge is created: (developer)-[:MEMBER_OF {since: datetime(), role: "member"}]->(team)
If the team node does not yet exist, it is created with verification_status = Unverified (team metadata ingested separately from Terraform/GitHub)
A substrate.identity.user_onboarded event is published to NATS
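
The onboarding steps above can be sketched as a pure function that turns a SCIM User payload into parameterised Cypher statements. This is an illustration, not Substrate's handler: the node property names (external_id, display_name, email) are assumptions, and the sketch uses MERGE rather than CREATE so that a retried request does not duplicate nodes, which is a design choice of this example.

```python
from typing import Any, Dict, List, Tuple

Statement = Tuple[str, Dict[str, Any]]

ONBOARDED_SUBJECT = "substrate.identity.user_onboarded"


def onboarding_statements(scim_user: Dict[str, Any]) -> List[Statement]:
    """Translate a SCIM User payload into parameterised Cypher statements.

    The caller is expected to run all statements in one Neo4j transaction so
    the mutation stays atomic, then publish an event to ONBOARDED_SUBJECT.
    """
    external_id = scim_user["externalId"]
    emails = scim_user.get("emails", [])
    statements: List[Statement] = [(
        "MERGE (d:Developer {external_id: $external_id}) "
        "SET d.active = true, d.display_name = $display_name, d.email = $email",
        {
            "external_id": external_id,
            "display_name": scim_user.get("displayName"),
            "email": emails[0]["value"] if emails else None,
        },
    )]
    for group in scim_user.get("groups", []):
        params = {"external_id": external_id, "team": group["display"]}
        # Unknown teams are created as Unverified; existing teams are left untouched.
        statements.append((
            "MERGE (t:Team {name: $team}) "
            "ON CREATE SET t.verification_status = 'Unverified'",
            params,
        ))
        statements.append((
            "MATCH (d:Developer {external_id: $external_id}), (t:Team {name: $team}) "
            "MERGE (d)-[m:MEMBER_OF]->(t) "
            "ON CREATE SET m.since = datetime(), m.role = 'member'",
            params,
        ))
    return statements
```

Keeping the translation separate from the database driver makes the mapping easy to unit-test without a running Neo4j instance.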

Offboarding Flow¶

When a user is deactivated in the IdP:

IdP sends PATCH /scim/v2/Users/{id} with active: false
Substrate sets Developer.active = false in Neo4j
Immediate key-person risk scan: Substrate queries for all Service nodes where this Developer is the sole or primary owner (direct OWNS or inherited via MEMBER_OF → Team → OWNS)
Any service with no remaining active owner is immediately flagged CRITICAL in the Verification Queue
Notifications are sent to all Architects (in-app + Slack if configured)
A substrate.identity.user_offboarded event is published to NATS with the list of affected services

The key-person risk scan runs synchronously within the SCIM handler — offboarding must not return 200 OK until the risk scan has completed and alerts have been queued.
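
The core of the key-person risk scan, finding services left with no active owner, can be illustrated in memory. The dictionaries below stand in for the OWNS and MEMBER_OF edges and the Developer.active property; the function name and data shapes are hypothetical, and a real implementation would express this as a Neo4j query.

```python
from typing import Dict, List, Set


def services_without_active_owner(offboarded: str,
                                  owns: Dict[str, Set[str]],
                                  member_of: Dict[str, Set[str]],
                                  active: Dict[str, bool]) -> List[str]:
    """Return services touched by the offboarded developer that have no active owner left."""
    # Services the developer owned directly or through team membership.
    affected = set(owns.get(offboarded, set()))
    for team in member_of.get(offboarded, set()):
        affected |= owns.get(team, set())

    def has_active_owner(service: str) -> bool:
        for dev, is_active in active.items():
            if not is_active or dev == offboarded:
                continue
            owners = {dev} | member_of.get(dev, set())
            if any(service in owns.get(owner, set()) for owner in owners):
                return True
        return False

    return sorted(s for s in affected if not has_active_owner(s))


owns = {"alice": {"billing"}, "team-payments": {"ledger"}, "team-core": {"gateway"}}
member_of = {"alice": {"team-payments", "team-core"}, "bob": {"team-core"}}
active = {"alice": False, "bob": True}
critical = services_without_active_owner("alice", owns, member_of, active)
```

In this example "gateway" still has an active owner through bob's team membership, so only "billing" and "ledger" would be flagged.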

Team Membership Events¶

SCIM Event	Graph Mutation
`POST /Groups` (new group)	`CREATE (:Team {name: $group_name})`
`PATCH /Groups/{id}` add member	`CREATE (user)-[:MEMBER_OF {since: datetime(), role: "member"}]->(team)`
`PATCH /Groups/{id}` remove member	`DELETE` the specific MEMBER_OF edge; re-run key-person risk check for all services that team owns
`DELETE /Groups/{id}`	Mark team `verification_status = Deprecated`; queue all owned services for ownership reassignment

Security Controls¶

TLS and mTLS¶

External TLS: All HTTP traffic from clients to the FastAPI gateway uses TLS 1.3. Certificates are issued by Let's Encrypt (internet-accessible deployments) or by customer-provided PKI (air-gapped deployments).
Internal mTLS: All inter-service traffic on the Docker bridge network uses mutual TLS. Certificates are short-lived (24-hour TTL) issued by the Vault CA. Services renew certificates automatically via Vault Agent Sidecar before expiry.
Model endpoints: All vLLM endpoints on localhost ports 8000–8004 are protected by mTLS when accessed from application services.

HashiCorp Vault SSH Secrets Engine¶

Substrate uses Vault's SSH Secrets Engine to issue short-lived certificates for the SSH Runtime Connector. This is the only way the Substrate application touches live infrastructure.

Architecture:

SSH Runtime Connector authenticates to Vault using AppRole (role_id + secret_id stored as Docker secrets)
Vault issues an ephemeral Ed25519 keypair and signs the public key with the Substrate SSH CA
Signed certificate has a 5-minute TTL, which covers a single automated inspection run and leaves almost no window for later reuse
Connector SSHes to target host via ProxyJump (no direct exposure; no agent forwarding enabled)
ForceCommand on the target host restricts the SSH session to running only the pre-approved inspection script
Inspection script outputs JSON; connector reads output, diffs against declared topology, writes result to NATS
Certificate expires automatically; no cleanup required

This architecture means:

No long-lived SSH keys exist anywhere in the system
A stolen certificate is usable for at most 5 minutes before it expires
ForceCommand prevents the connector from running arbitrary commands even while the certificate is valid
Disabling agent forwarding prevents the connector's credentials from being reused for lateral movement from the target host

Webhook HMAC Verification¶

All inbound webhooks are verified before the payload is processed:

Source	Header	Algorithm
GitHub	`X-Hub-Signature-256`	HMAC-SHA256 of payload body with webhook secret
Jira	`X-Hub-Signature` or custom header	HMAC-SHA256
Confluence	Custom `X-Webhook-Token` header	HMAC-SHA256

Verification uses a constant-time comparison (hmac.compare_digest) to prevent timing attacks. Any webhook that fails verification returns 403 immediately and logs the rejection to the audit table with the source IP.
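
For the GitHub case, the check can be sketched as below. GitHub sends the header value as "sha256=" followed by the hex digest; the function name is hypothetical, and the Jira and Confluence headers would need their own parsing.

```python
import hashlib
import hmac


def github_signature_is_valid(secret: bytes, body: bytes, header_value: str) -> bool:
    """Check an X-Hub-Signature-256 header of the form 'sha256=<hex digest>'."""
    if not header_value.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison; bytes are used so non-ASCII input cannot raise.
    return hmac.compare_digest(expected.encode("utf-8"), header_value.encode("utf-8"))
```

The digest must be computed over the raw request body exactly as received, before any JSON parsing or re-serialisation, otherwise a valid signature will not match.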

Audit Log Immutability¶

All user and system actions are written to a PostgreSQL audit table with the following properties:

Append-only: The application user has INSERT privileges only, with no UPDATE or DELETE. This is enforced by database privileges, not by application logic.
Fields: event_id (UUID), timestamp (timestamptz), actor (user ID or service account), action (string), resource_type, resource_id, input_hash (SHA-256 of inputs), output_hash (SHA-256 of outputs), confidence (for agent actions), reasoning (for LLM-involved actions)
Partitioned: pg_partman partitions by month for query performance; no partition is ever dropped within the 2-year retention window
Exported: Audit events are published to NATS substrate.audit.> so external SIEM systems can subscribe

This audit log is how a blocked developer can trace the exact policy clause and graph condition that caused their PR to be blocked. It is also the basis for the Agent Orchestration immutable action log (AOC-UC-05).
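
An audit record with the fields listed above might be assembled as follows. This is a sketch: the source specifies SHA-256 for input_hash and output_hash but not how inputs are serialised, so the sorted-key JSON encoding here is an assumption, as are the function names.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def sha256_of(value: Any) -> str:
    """Hash a JSON-serialisable value using a canonical (sorted-key) encoding."""
    encoded = json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def build_audit_event(actor: str, action: str, resource_type: str, resource_id: str,
                      inputs: Any, outputs: Any,
                      confidence: Optional[float] = None,
                      reasoning: Optional[str] = None) -> Dict[str, Any]:
    """Assemble one audit row; the caller INSERTs it and never updates it."""
    return {
        "event_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "resource_type": resource_type,
        "resource_id": resource_id,
        "input_hash": sha256_of(inputs),
        "output_hash": sha256_of(outputs),
        "confidence": confidence,   # populated for agent actions
        "reasoning": reasoning,     # populated for LLM-involved actions
    }
```

A canonical encoding matters because two logically identical inputs should always produce the same hash, regardless of key order.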

Air-Gap License Verification¶

License validation requires no outbound network:

At build/deployment time, Substrate's licensing system generates a license JWT signed with an Ed25519 private key
The customer receives: the signed license JWT + the corresponding Ed25519 public key
At runtime, Substrate verifies the license JWT signature locally against the pre-distributed public key
The JWT contains: customer ID, license tier, expiry date, enabled feature flags
No outbound call is made; no license server is queried

If the license JWT is expired or the signature fails verification, Substrate enters a read-only mode that allows viewing the existing graph but blocks all ingestion and governance actions. This allows incident response to continue even if a license renewal is delayed.
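
The degraded-mode decision can be sketched separately from the signature check itself. The example assumes the Ed25519 verification has already been performed by a JWT library and passed in as a boolean; the mode strings and action kinds are hypothetical names.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical action kinds blocked while the license is invalid or expired.
BLOCKED_IN_READ_ONLY = {"ingest", "governance"}


def license_mode(signature_ok: bool, expires_at: datetime,
                 now: Optional[datetime] = None) -> str:
    """Return 'full' for a valid, unexpired license, otherwise 'read_only'."""
    if now is None:
        now = datetime.now(timezone.utc)
    return "full" if signature_ok and now < expires_at else "read_only"


def action_allowed(mode: str, action_kind: str) -> bool:
    """Viewing stays available in read-only mode; ingestion and governance do not."""
    return mode == "full" or action_kind not in BLOCKED_IN_READ_ONLY
```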

Credential Storage¶

All secrets (database passwords, webhook secrets, Vault AppRole credentials, JWT signing keys) are stored as Docker secrets and mounted as files at /run/secrets/
Application reads secrets from files, never from environment variables
No secrets in plain text, Dockerfile ENV instructions, or source control
.env files are in .gitignore and are never committed; .env.example files contain only placeholder values
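
Reading a file-mounted secret can be as small as the helper below. It is a sketch with a hypothetical function name; the secrets_dir parameter exists only so the helper can be exercised outside a container.

```python
from pathlib import Path

SECRETS_DIR = Path("/run/secrets")


def read_secret(name: str, secrets_dir: Path = SECRETS_DIR) -> str:
    """Return the contents of a Docker secret file, without its trailing newline."""
    if not name or Path(name).name != name:
        # Reject empty names and anything containing path separators.
        raise ValueError(f"invalid secret name: {name!r}")
    return (secrets_dir / name).read_text(encoding="utf-8").rstrip("\n")
```

Reading from files rather than environment variables keeps secrets out of process listings, crash dumps, and container inspection output.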

Security Architecture Summary¶

Concern	Control	Implementation
Data locality	Zero external LLM calls	Local DGX Spark inference; no API keys for external models
Authentication	OIDC/PKCE	Keycloak substrate realm; Authorization Code + PKCE for SPA
Authorization	RBAC + graph ownership	JWT roles + OWNS edge traversal
Identity lifecycle	SCIM 2.0	Atomic graph mutations on onboard/offboard events
SSH access	Ephemeral certificates	Vault SSH CA; 5-minute Ed25519 cert TTL; ForceCommand
Inter-service auth	mTLS	24-hour Vault-issued certificates; auto-rotation
Webhook integrity	HMAC-SHA256	Constant-time comparison; 403 + audit log on failure
Audit trail	Append-only PostgreSQL	INSERT-only privileges (no UPDATE or DELETE); 2-year retention; NATS export
License verification	Ed25519 offline JWT	No outbound call; read-only degraded mode on expiry
Credentials	Docker secrets	File-mounted; never in environment variables or source control