Security & Authentication — Design Reference¶
Zero Data Leaves the Building¶
This is the foundational security policy of Substrate. All source code, architecture diagrams, dependency graphs, policy logic, institutional memory, and audit logs stay within the customer's own infrastructure. There are no calls to OpenAI, Anthropic, or any external LLM API. There are no telemetry callbacks, license phone-home requests, or external data pipelines.
This policy is enforced at the infrastructure level, not the application level. The DGX Spark runs all AI inference locally. The air-gap license verification requires no outbound network. Every connector pulls from customer-internal systems. The Docker images are distributed for offline import.
The consequence of this policy is that every security control described in this document can be audited, configured, and verified by the customer without involving Substrate's team after deployment.
Authentication¶
OIDC via Keycloak¶
Substrate uses Keycloak as the identity provider, accessed via OpenID Connect. Keycloak is bundled in the Docker Compose deployment and the Helm umbrella chart, but customers may substitute an existing Keycloak instance or any OIDC-compliant IdP.
Realm Configuration¶
Substrate operates in a dedicated substrate realm, never the Keycloak master realm. This isolates Substrate's users, groups, roles, and clients from any other applications sharing the Keycloak instance.
OIDC Clients¶
Three clients are registered in the substrate realm:
| Client ID | Type | Flow | Purpose |
|---|---|---|---|
| `substrate-spa` | Public | Authorization Code + PKCE | React SPA; no client secret, and PKCE prevents authorization code interception |
| `substrate-backend` | Confidential | Client Credentials | FastAPI backend service authenticating to Keycloak for token introspection and user lookups |
| `substrate-service` | Service Account | Client Credentials | Admin API + SCIM provisioning; used by automated onboarding/offboarding workflows |
Authorization Code Flow with PKCE¶
The SPA uses the Authorization Code flow with PKCE (Proof Key for Code Exchange):
- SPA generates a `code_verifier` (random 256-bit value) and its SHA-256 hash `code_challenge`
- SPA redirects to Keycloak `/auth` with `code_challenge` and `code_challenge_method=S256`
- User authenticates; Keycloak redirects to SPA with `authorization_code`
- SPA exchanges code for tokens, sending `code_verifier`; Keycloak verifies the challenge before issuing tokens
- Access token used for all API requests; refresh token used for silent renewal
This flow ensures no client secret is embedded in the SPA bundle. Even if an authorization code is intercepted, it cannot be exchanged without the original code_verifier.
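The verifier/challenge relationship can be illustrated with a minimal Python sketch. The SPA itself would do this in the browser; Python is used here only to show the arithmetic of the `S256` method. The function names are hypothetical and not part of Substrate.

```python
import base64
import hashlib
import secrets


def challenge_for(verifier: str) -> str:
    """BASE64URL(SHA256(verifier)) with padding stripped, as the S256 method defines."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def make_pkce_pair() -> tuple[str, str]:
    """Return (code_verifier, code_challenge)."""
    # 32 random bytes = 256 bits; base64url without padding gives 43 characters,
    # the minimum verifier length permitted by RFC 7636.
    raw = secrets.token_bytes(32)
    verifier = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return verifier, challenge_for(verifier)


def verify_pkce(verifier: str, expected_challenge: str) -> bool:
    """The check an authorization server performs at the token endpoint."""
    return secrets.compare_digest(challenge_for(verifier), expected_challenge)
```

An intercepted authorization code is useless on its own because `verify_pkce` fails for any verifier other than the one the SPA generated.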
JWT Claims Structure¶
The Keycloak JWT access token carries the following claims relevant to Substrate authorization:
    {
      "sub": "uuid",
      "iss": "https://auth.internal/realms/substrate",
      "aud": ["substrate-backend"],
      "exp": 1234567890,
      "groups": ["/Engineering/Backend", "/Engineering"],
      "realm_access": {
        "roles": ["developer"]
      },
      "resource_access": {
        "substrate-backend": {
          "roles": ["graph:read", "policy:read"]
        }
      }
    }
The groups claim carries full paths (e.g., /Engineering/Backend) — this is achieved by configuring a Group Membership Mapper at the client scope level in Keycloak. The full path enables direct mapping to the Neo4j team hierarchy without additional lookups.
Token Validation — Hybrid Approach¶
Substrate uses a hybrid validation strategy rather than pure local validation or pure introspection:
- Most requests: Local JWT signature verification against cached JWKS (fetched from Keycloak's JWKS endpoint, cached in Redis with 1-hour TTL). Validates: RS256/ES256 signature, `iss` matches configured realm, `aud` matches configured client ID, `exp` has not passed. This adds zero network latency per request.
- Introspection: Used only when immediate revocation detection is required, specifically when the Governance Service makes a policy decision that triggers an action with significant side effects (blocking a PR, triggering a Fix PR workflow). Introspection adds ~5–15ms but confirms the token was not revoked since issuance.
This hybrid approach avoids the failure mode of "revoked user's token is accepted for up to 1 hour" for high-consequence actions, while maintaining sub-millisecond validation for the vast majority of read requests.
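A minimal sketch of the decision logic, assuming the signature has already been verified by a JWT library against the cached JWKS. The constants and action names below are hypothetical placeholders; only the claim checks (`iss`, `aud`, `exp`) come from the description above.

```python
import time
from typing import Optional

# Hypothetical configuration; issuer and audience mirror the sample token.
EXPECTED_ISSUER = "https://auth.internal/realms/substrate"
EXPECTED_AUDIENCE = "substrate-backend"
# Hypothetical names for actions with significant side effects.
HIGH_CONSEQUENCE_ACTIONS = {"block_pr", "trigger_fix_pr"}


def claims_are_valid(claims: dict, now: Optional[float] = None) -> bool:
    """Check iss, aud, and exp on claims whose signature was already verified."""
    now = time.time() if now is None else now
    aud = claims.get("aud", [])
    # "aud" may be a single string or a list of strings.
    audiences = [aud] if isinstance(aud, str) else list(aud)
    exp = claims.get("exp")
    return (
        claims.get("iss") == EXPECTED_ISSUER
        and EXPECTED_AUDIENCE in audiences
        and isinstance(exp, (int, float))
        and exp > now
    )


def needs_introspection(action: str) -> bool:
    """Only high-consequence actions pay the extra round trip to Keycloak."""
    return action in HIGH_CONSEQUENCE_ACTIONS
```

A request handler would call `claims_are_valid` on every request and call the introspection endpoint in addition only when `needs_introspection` returns true.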
Authorization (RBAC)¶
Roles¶
| Role | Capabilities |
|---|---|
| Admin | Full access; user management; system configuration; license management |
| Architect | Full graph access; policy authoring; simulation; exception approval; graph mutation |
| Developer | PR check details; own service dependency graph; intent mismatch alerts; read-only graph for owned services |
| Viewer | Read-only access to graphs, dashboards, and reports; no policy or graph mutation |
| Service Account (CI/CD) | API-only; PR check submission; webhook delivery; no UI access |
Group-Based Role Mapping¶
The JWT groups claim maps to Substrate RBAC roles via a configurable mapping table in the FastAPI gateway. Example:
/Engineering/Platform → Architect
/Engineering/Backend → Developer
/Engineering/Frontend → Developer
/Leadership → Viewer
This mapping is stored in PostgreSQL and editable by Admin role users via the API.
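The lookup can be sketched as follows. The mapping is copied from the example; the precedence rule (the highest-privilege role wins when several groups match) is an assumption, since the source does not say how conflicts are resolved.

```python
from typing import Optional

# Copied from the example mapping; in Substrate this lives in PostgreSQL.
GROUP_TO_ROLE = {
    "/Engineering/Platform": "Architect",
    "/Engineering/Backend": "Developer",
    "/Engineering/Frontend": "Developer",
    "/Leadership": "Viewer",
}
# Assumed precedence, lowest privilege first. Not specified by the source.
ROLE_RANK = ["Viewer", "Developer", "Architect", "Admin"]


def resolve_role(groups: list[str]) -> Optional[str]:
    """Return the highest-ranked role mapped from the token's group paths."""
    matched = [GROUP_TO_ROLE[g] for g in groups if g in GROUP_TO_ROLE]
    if not matched:
        return None  # no mapped group: the caller decides how to deny access
    return max(matched, key=ROLE_RANK.index)
```

With the groups from the sample token, `/Engineering` has no mapping and is ignored, so the result is driven by `/Engineering/Backend` alone.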
Graph-Level Ownership Permissions¶
Beyond role-based access, graph ownership edges define fine-grained permissions:
- A Developer can see detailed component-level graph data for services where an OWNS edge exists from their Developer node (directly or via MEMBER_OF → Team → OWNS)
- A Developer cannot see internal component detail for services owned by other teams without Viewer-level access being explicitly granted
- Architect role bypasses ownership-scoped filtering — Architects see the full graph
CODEOWNERS files from GitHub repositories are ingested as an additional ownership signal with 0.70 confidence weight, supplementing the SCIM-derived OWNS edges.
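The ownership rules can be sketched against an in-memory stand-in for the graph. This is illustrative only: the real check would be a Neo4j traversal, and the class, the function, and the `explicit_grants` parameter are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OwnershipGraph:
    """In-memory stand-in for the MEMBER_OF and OWNS edges."""
    member_of: dict[str, set[str]] = field(default_factory=dict)  # developer -> teams
    dev_owns: dict[str, set[str]] = field(default_factory=dict)   # developer -> services
    team_owns: dict[str, set[str]] = field(default_factory=dict)  # team -> services

    def owned_services(self, developer: str) -> set[str]:
        """Services reachable via OWNS directly or via MEMBER_OF -> Team -> OWNS."""
        services = set(self.dev_owns.get(developer, set()))
        for team in self.member_of.get(developer, set()):
            services |= self.team_owns.get(team, set())
        return services


def can_view_component_detail(
    graph: OwnershipGraph,
    role: str,
    developer: str,
    service: str,
    explicit_grants: Optional[set[tuple[str, str]]] = None,
) -> bool:
    # Architects bypass ownership filtering; Admin has full access per the roles table.
    if role in {"Architect", "Admin"}:
        return True
    # An explicit (developer, service) grant overrides the ownership check.
    if explicit_grants and (developer, service) in explicit_grants:
        return True
    return role == "Developer" and service in graph.owned_services(developer)
```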
Lifecycle Management (SCIM 2.0)¶
Overview¶
SCIM 2.0 is used for automated user and group lifecycle management. Every SCIM event from the IdP triggers an atomic graph mutation in Neo4j, maintaining the identity graph in sync with the authoritative identity source.
Plugin¶
Substrate targets the Captain-P-Goldfish/scim-for-keycloak plugin for Keycloak SCIM 2.0 support. Native Keycloak SCIM support is expected around mid-2026 in Keycloak 26.6. Until then the plugin provides equivalent functionality, and the later switch to native support is intended to require no changes to Substrate's SCIM consumer implementation.
Onboarding Flow¶
When a new user is provisioned in the IdP:
- IdP sends `POST /scim/v2/Users` with user attributes (display name, email, groups, external ID)
- Substrate SCIM endpoint creates `Developer` node in Neo4j with `active: true`
- For each group membership in the SCIM payload, a `MEMBER_OF` edge is created: `(developer)-[:MEMBER_OF {since: now(), role: "member"}]->(team)`
- If the team node does not yet exist, it is created with `verification_status = Unverified` (team metadata ingested separately from Terraform/GitHub)
- A `substrate.identity.user_onboarded` event is published to NATS
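The onboarding steps can be sketched against a plain dictionary standing in for Neo4j. The dictionary layout and function name are hypothetical; a real handler would run these mutations in one Neo4j transaction and publish the returned event to NATS.

```python
from datetime import datetime, timezone


def onboard_user(graph: dict, user_id: str, groups: list[str]) -> dict:
    """Apply the onboarding mutations and return the event to publish."""
    now = datetime.now(timezone.utc).isoformat()
    graph["developers"][user_id] = {"active": True}
    for group in groups:
        # Create the team only if it is missing; existing teams keep their status.
        graph["teams"].setdefault(group, {"verification_status": "Unverified"})
        graph["member_of"].append(
            {"developer": user_id, "team": group, "since": now, "role": "member"}
        )
    return {"subject": "substrate.identity.user_onboarded", "user_id": user_id}
```

Using `setdefault` keeps the handler idempotent with respect to teams: a team that was already verified is not reset to `Unverified` when a new member joins.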
Offboarding Flow¶
When a user is deactivated in the IdP:
- IdP sends `PATCH /scim/v2/Users/{id}` with `active: false`
- Substrate sets `Developer.active = false` in Neo4j
- Immediate key-person risk scan: Substrate queries for all Service nodes where this Developer is the sole or primary owner (direct OWNS or inherited via MEMBER_OF → Team → OWNS)
- Any service with no remaining active owner is immediately flagged `CRITICAL` in the Verification Queue
- Notifications are sent to all Architects (in-app + Slack if configured)
- A `substrate.identity.user_offboarded` event is published to NATS with the list of affected services
The key-person risk scan runs synchronously within the SCIM handler — offboarding must not return 200 OK until the risk scan has completed and alerts have been queued.
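A minimal sketch of the key-person risk scan, using dictionaries in place of the graph. All names are hypothetical, and the sketch covers only the "no remaining active owner" case, not the "primary owner" notion, which the source does not define precisely.

```python
def active_owners(service, developers, member_of, dev_owns, team_owns) -> set:
    """Active developers who own the service directly or through a team."""
    owners = set()
    for dev, is_active in developers.items():
        if not is_active:
            continue
        direct = service in dev_owns.get(dev, set())
        via_team = any(
            service in team_owns.get(team, set()) for team in member_of.get(dev, set())
        )
        if direct or via_team:
            owners.add(dev)
    return owners


def offboard_user(user_id, developers, member_of, dev_owns, team_owns) -> dict:
    """Deactivate the user, scan the services they owned, and build the event."""
    affected = set(dev_owns.get(user_id, set()))
    for team in member_of.get(user_id, set()):
        affected |= team_owns.get(team, set())
    developers[user_id] = False  # Developer.active = false
    critical = sorted(
        s for s in affected
        if not active_owners(s, developers, member_of, dev_owns, team_owns)
    )
    return {
        "subject": "substrate.identity.user_offboarded",
        "user_id": user_id,
        "affected_services": sorted(affected),
        "critical": critical,
    }
```

Because the scan is a plain function call inside the handler, the SCIM response is naturally held until the scan result exists, which matches the synchronous requirement above.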
Team Membership Events¶
| SCIM Event | Graph Mutation |
|---|---|
| `POST /Groups` (new group) | `CREATE (:Team {name: $group_name})` |
| `PATCH /Groups/{id}` add member | `CREATE (user)-[:MEMBER_OF {since: now(), role: "member"}]->(team)` |
| `PATCH /Groups/{id}` remove member | `DELETE` the specific `MEMBER_OF` edge; re-run key-person risk check for all services that team owns |
| `DELETE /Groups/{id}` | Mark team `verification_status = Deprecated`; queue all owned services for ownership reassignment |
Security Controls¶
TLS and mTLS¶
- External TLS: All HTTP traffic from clients to the FastAPI gateway uses TLS 1.3. Certificates managed by Let's Encrypt (internet-accessible) or customer-provided PKI (air-gapped deployments).
- Internal mTLS: All inter-service traffic on the Docker bridge network uses mutual TLS. Certificates are short-lived (24-hour TTL) issued by the Vault CA. Services renew certificates automatically via Vault Agent Sidecar before expiry.
- Model endpoints: All vLLM endpoints on localhost ports 8000–8004 are protected by mTLS when accessed from application services.
HashiCorp Vault SSH Secrets Engine¶
Substrate uses Vault's SSH Secrets Engine to issue short-lived certificates for the SSH Runtime Connector. This is the only way the Substrate application touches live infrastructure.
Architecture:
- SSH Runtime Connector authenticates to Vault using AppRole (role_id + secret_id stored as Docker secrets)
- Vault issues an ephemeral Ed25519 keypair and signs the public key with the Substrate SSH CA
- Signed certificate has a 5-minute TTL — it expires before any manual session could be established
- Connector SSHes to target host via ProxyJump (no direct exposure; no agent forwarding enabled)
- `ForceCommand` on the target host restricts the SSH session to running only the pre-approved inspection script
- Inspection script outputs JSON; connector reads output, diffs against declared topology, writes result to NATS
- Certificate expires automatically; no cleanup required
This architecture means:
- No long-lived SSH keys anywhere in the system
- A stolen certificate becomes unusable once its 5-minute TTL elapses
- ForceCommand prevents the connector from running arbitrary commands even if the certificate is valid
- Disabling agent forwarding removes forwarded-agent hijacking as a path for lateral movement
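As a rough sketch of how a connector might assemble the SSH invocation, the function below builds an argument list using standard OpenSSH client options (`-J` for ProxyJump, `CertificateFile`, `ForwardAgent=no`, `BatchMode=yes`). The function names, host names, and file paths are hypothetical, and obtaining the signed certificate from Vault is out of scope here.

```python
CERT_TTL_SECONDS = 5 * 60  # the 5-minute TTL described above


def build_ssh_argv(target: str, jump_host: str, key_path: str, cert_path: str) -> list[str]:
    """Argument list for an ssh call through a jump host with agent forwarding off."""
    return [
        "ssh",
        "-J", jump_host,                      # ProxyJump: no direct exposure of the target
        "-i", key_path,                       # ephemeral private key
        "-o", f"CertificateFile={cert_path}", # certificate signed by the SSH CA
        "-o", "ForwardAgent=no",              # never forward an agent
        "-o", "BatchMode=yes",                # fail instead of prompting
        target,
        # No remote command is passed: ForceCommand on the host decides what runs.
    ]


def cert_is_usable(issued_at: float, now: float, ttl: int = CERT_TTL_SECONDS) -> bool:
    """True while the certificate is still inside its validity window."""
    return 0 <= now - issued_at < ttl
```

The list form is meant to be passed to `subprocess.run` without a shell, which avoids quoting problems with host names and paths.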
Webhook HMAC Verification¶
All inbound webhooks are verified before the payload is processed:
| Source | Header | Algorithm |
|---|---|---|
| GitHub | `X-Hub-Signature-256` | HMAC-SHA256 of payload body with webhook secret |
| Jira | `X-Hub-Signature` or custom header | HMAC-SHA256 |
| Confluence | Custom `X-Webhook-Token` header | HMAC-SHA256 |
Verification uses a constant-time comparison (hmac.compare_digest) to prevent timing attacks. Any webhook that fails verification returns 403 immediately and logs the rejection to the audit table with the source IP.
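For the GitHub case, the check can be sketched as below. GitHub sends the header value as `sha256=` followed by the hex digest of the raw request body; the function name is hypothetical, and the Jira and Confluence variants would differ in header name and format.

```python
import hashlib
import hmac
from typing import Optional


def github_signature_is_valid(secret: bytes, body: bytes, header_value: Optional[str]) -> bool:
    """Verify X-Hub-Signature-256 against the raw request body."""
    if not header_value or not header_value.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Compare as bytes so that unexpected non-ASCII header input cannot raise.
    return hmac.compare_digest(expected.encode("utf-8"), header_value.encode("utf-8"))
```

The digest must be computed over the bytes exactly as received, before any JSON parsing, since re-serializing the payload can change whitespace and invalidate the signature. A caller would return 403 and write the audit entry when this function returns false.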
Audit Log Immutability¶
All user and system actions are written to a PostgreSQL audit table with the following properties:
- Append-only: The application user has INSERT privileges only — no UPDATE or DELETE. Schema-level enforcement, not application-level.
- Fields: `event_id` (UUID), `timestamp` (timestamptz), `actor` (user ID or service account), `action` (string), `resource_type`, `resource_id`, `input_hash` (SHA-256 of inputs), `output_hash` (SHA-256 of outputs), `confidence` (for agent actions), `reasoning` (for LLM-involved actions)
- Partitioned: pg_partman partitions by month for query performance; no partition is ever dropped within the 2-year retention window
- Exported: Audit events are published to NATS `substrate.audit.>` so external SIEM systems can subscribe
This audit log is how a blocked developer can trace the exact policy clause and graph condition that caused their PR to be blocked. It is also the basis for the Agent Orchestration immutable action log (AOC-UC-05).
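A sketch of how one audit record with the listed fields might be assembled before insertion. The canonical JSON serialization used for hashing is an assumption (the source only says SHA-256 of inputs and outputs), and the function names are hypothetical.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def sha256_of(value) -> str:
    """Hash a JSON-serializable value using a canonical form (sorted keys, no spaces)."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def build_audit_event(actor, action, resource_type, resource_id,
                      inputs, outputs, confidence=None, reasoning=None) -> dict:
    """Assemble one row for the append-only audit table."""
    return {
        "event_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "resource_type": resource_type,
        "resource_id": resource_id,
        "input_hash": sha256_of(inputs),
        "output_hash": sha256_of(outputs),
        "confidence": confidence,  # set only for agent actions
        "reasoning": reasoning,    # set only for LLM-involved actions
    }
```

Sorting keys before hashing means two logically identical inputs produce the same `input_hash` regardless of dictionary ordering, which makes hashes comparable across services.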
Air-Gap License Verification¶
License validation requires no outbound network:
- At build/deployment time, Substrate's licensing system generates a license JWT signed with an Ed25519 private key
- The customer receives: the signed license JWT + the corresponding Ed25519 public key
- At runtime, Substrate verifies the license JWT signature locally against the pre-distributed public key
- The JWT contains: customer ID, license tier, expiry date, enabled feature flags
- No outbound call is made; no license server is queried
If the license JWT is expired or the signature fails verification, Substrate enters a read-only mode that allows viewing the existing graph but blocks all ingestion and governance actions. This allows incident response to continue even if a license renewal is delayed.
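The degraded-mode decision can be sketched as follows. Signature verification is assumed to be done separately by an Ed25519-capable library and passed in as `signature_ok`; the `exp` claim name and the mode strings are assumptions, since the source only says the JWT contains an expiry date.

```python
import base64
import json
import time
from typing import Optional


def decode_payload(token: str) -> dict:
    """Decode the middle (payload) segment of a compact JWT without verifying it."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def operating_mode(token: str, signature_ok: bool, now: Optional[float] = None) -> str:
    """Return "full" for a valid, unexpired license, otherwise "read_only"."""
    now = time.time() if now is None else now
    if not signature_ok:
        return "read_only"  # never trust claims from an unverified token
    claims = decode_payload(token)
    return "full" if claims.get("exp", 0) > now else "read_only"
```

The payload is read only after the signature check passes, so a forged token cannot extend its own expiry.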
Credential Storage¶
- All secrets (database passwords, webhook secrets, Vault AppRole credentials, JWT signing keys) are stored as Docker secrets and mounted as files at `/run/secrets/`
- Application reads secrets from files, never from environment variables
- No secrets in plain text, Dockerfile `ENV` instructions, or source control
- `.env` files are in `.gitignore` and are never committed; `.env.example` files contain only placeholder values
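Reading a file-mounted secret can be sketched in a few lines. The helper name and the secret name in the usage example are hypothetical; only the `/run/secrets/` mount point comes from the text above.

```python
from pathlib import Path

SECRETS_DIR = Path("/run/secrets")


def read_secret(name: str, secrets_dir: Path = SECRETS_DIR) -> str:
    """Return the contents of a mounted secret file, without the trailing newline."""
    # Reject anything that is not a plain file name, to block path traversal.
    if "/" in name or name in {"", ".", ".."}:
        raise ValueError(f"invalid secret name: {name!r}")
    return (secrets_dir / name).read_text(encoding="utf-8").strip()
```

A service would call, for example, `read_secret("db_password")` at startup. Unlike environment variables, the value does not appear in process listings or in `docker inspect` output.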
Security Architecture Summary¶
| Concern | Control | Implementation |
|---|---|---|
| Data locality | Zero external LLM calls | Local DGX Spark inference; no API keys for external models |
| Authentication | OIDC/PKCE | Keycloak substrate realm; Authorization Code + PKCE for SPA |
| Authorization | RBAC + graph ownership | JWT roles + OWNS edge traversal |
| Identity lifecycle | SCIM 2.0 | Atomic graph mutations on onboard/offboard events |
| SSH access | Ephemeral certificates | Vault SSH CA; 5-minute Ed25519 cert TTL; ForceCommand |
| Inter-service auth | mTLS | 24-hour Vault-issued certificates; auto-rotation |
| Webhook integrity | HMAC-SHA256 | Constant-time comparison; 403 + audit log on failure |
| Audit trail | Append-only PostgreSQL | No DELETE privilege; 2-year retention; NATS export |
| License verification | Ed25519 offline JWT | No outbound call; read-only degraded mode on expiry |
| Credentials | Docker secrets | File-mounted; never in environment variables or source control |