Phase 1: Deterministic Ingestion & Graph Population — Design Spec

Date: 2026-03-26 Status: Approved Scope: Build the ingestion pipeline, 8 data connectors, entity resolution, confidence scoring, OPA governance foundation, SCIM/RBAC, and dynamic graph rendering — all without LLM dependency.


1. Decisions Summary

Decision Choice Alternatives Considered
LLM dependency Deterministic-first — no LLM required Blueprint-faithful (requires DGX Spark), Connector-focused (narrower scope)
Connector scope All 8 sources in 3 batches Fewer connectors first; deferred due to all sources being available
Build order Infrastructure-first Vertical-slice per connector (visible progress faster), Hybrid (minimal infra then first connector). Chose infra-first for stable foundation.
OPA governance Include foundation — 3 starter packs Defer entirely to Phase 2. Included so policies validate against real graph data early.
RBAC Full — role-based + ownership scoping Route protection only (simpler), Defer entirely. Chose full because SCIM is needed anyway.
SCIM Custom Keycloak Dockerfile with SCIM plugin Manual user management. SCIM chosen for automated identity lifecycle.
Confidence pipeline Full — source weights, thresholds, queue routing Scoring only without routing. Chose full because Queue module already exists.
Graph layout ELK.js Sugiyama + force-directed, user-selectable Sugiyama only, Force-directed only. Both gives flexibility.
Semantic zoom All 3 levels (Domain → Service → Component) Domain + Service only. Chose 3 levels because AST CLI delivers Function/Module nodes.
Event bus Full NATS event-driven pipeline Direct Celery dispatch (simpler), Hybrid. Chose full NATS for replay and decoupling.
Testing Route + unit tests, mocked services Integration tests with testcontainers (deferred to Phase 2).
GitHub target GitHub.com only Self-hosted GitHub Enterprise deferred.

2. Architecture Overview

Phase 1 adds 4 subsystems to the Phase 0 backend:

Webhook Receivers (FastAPI routes)
       ↓ publish
NATS JetStream (event bus)
       ↓ consume
Celery Workers (async processing)
       ↓ write
Graph Writers (Neo4j + PostgreSQL)
       ↓ emit
NATS completion events
       ↓ trigger
Downstream consumers (cache invalidation, governance, confidence scoring)

New Backend Modules

Module Purpose
ingestion/ Connector framework, webhook receivers, NATS publishers
connectors/ 8 connector implementations (GitHub, Git, Terraform, K8s, SSH, Projects v2, Confluence, Jira)
graph_writer/ Canonical graph write logic — entity resolution, confidence scoring, Neo4j MERGE patterns
governance/ OPA client, policy evaluation pipeline, violation writer
identity/ SCIM endpoint, Developer/Team graph mutations, key-person risk
rbac/ Role-based access control middleware, ownership-scoped filtering
scheduler/ Celery app, beat config, task definitions
ast_parser/ Rust CLI wrapper for tree-sitter + stack-graphs AST parsing

Existing Modules Updated

  • graph/ — Dynamic layout (ELK.js data), 3-level semantic zoom queries
  • queue/ — Wired to real confidence-based routing instead of seed data
  • dashboard/ — Metrics from real computed data
  • community/ — Communities from graph structure (manual initially, Leiden deferred)

3. Ingestion Pipeline

Event Sources

Source Type Trigger Example
Webhook receivers HTTP POST from external service GitHub push, Jira issue created
Pollers Celery beat scheduled jobs K8s API (15 min), GitHub Pages (6 hours)
SSH Runtime Celery beat every 15 minutes Vault ephemeral cert → SSH → inspection script
Git-only Celery beat or on-demand Clone/fetch repos, run AST parsing

NATS Subject Hierarchy

substrate.ingestion.github.pr_opened
substrate.ingestion.github.push
substrate.ingestion.github.codeowners_changed
substrate.ingestion.terraform.state_updated
substrate.ingestion.k8s.resources_polled
substrate.ingestion.ssh.inspection_completed
substrate.ingestion.git.repo_parsed
substrate.ingestion.confluence.page_updated
substrate.ingestion.jira.issue_created
substrate.ingestion.jira.sprint_closed
substrate.ingestion.github_projects.item_updated
substrate.ingestion.github_pages.build_completed
substrate.governance.violation_raised
substrate.governance.policy_evaluated
substrate.identity.user_onboarded
substrate.identity.user_offboarded
substrate.cache.invalidate
substrate.graph.node_written
substrate.graph.edge_written

NATS JetStream Streams

Stream Subject Filter Retention Purpose
INGESTION substrate.ingestion.> 7 days All connector events
GOVERNANCE substrate.governance.> 7 days Policy evaluation results
IDENTITY substrate.identity.> 30 days SCIM lifecycle events
GRAPH substrate.graph.> 24 hours Node/edge write notifications
CACHE substrate.cache.> 1 hour Cache invalidation signals

Celery Worker Pools

Pool Concurrency Consumes From Writes To
ingestion-worker 4 INGESTION stream Neo4j, PostgreSQL
governance-worker 2 GOVERNANCE stream PostgreSQL
identity-worker 1 IDENTITY stream Neo4j

Connector Base Class

from abc import ABC, abstractmethod

class BaseConnector(ABC):
    source_name: str
    confidence_weight: float

    @abstractmethod
    async def ingest(self, event: dict) -> list[GraphDelta]: ...

    @abstractmethod
    async def poll(self) -> list[GraphDelta]: ...
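
A self-contained sketch of subclassing the base class. `StubDelta` and `EchoConnector` are hypothetical names; the real `GraphDelta` (defined in Section 7) is reduced to a stub so the snippet stands alone:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class StubDelta:  # stand-in for the real GraphDelta (Section 7)
    nodes: list = field(default_factory=list)

class BaseConnector(ABC):
    source_name: str
    confidence_weight: float

    @abstractmethod
    async def ingest(self, event: dict) -> list[StubDelta]: ...

    @abstractmethod
    async def poll(self) -> list[StubDelta]: ...

class EchoConnector(BaseConnector):
    """Illustrative connector: turns one event into one single-node delta."""
    source_name = "echo"
    confidence_weight = 0.5

    async def ingest(self, event: dict) -> list[StubDelta]:
        return [StubDelta(nodes=[event.get("name", "unknown")])]

    async def poll(self) -> list[StubDelta]:
        return []  # webhook-only connectors return nothing on poll
```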

Idempotency

All graph writes use MERGE:

MERGE (s:Service {name: $name})
ON CREATE SET s += $props, s.created_at = timestamp()
ON MATCH SET s += $props, s.updated_at = timestamp()

4. Connector Specifications

Batch 1 — Foundation

GitHub Connector (repos + PRs + CODEOWNERS)

  • Source: GitHub.com REST API v3 + Webhooks
  • Confidence weight: 0.95
  • Webhook events: push, pull_request.opened, pull_request.synchronize, pull_request.merged
  • Produces:
  • Service nodes — from repo structure, package manifests, Dockerfiles
  • Function/Module nodes — from AST parsing via Rust CLI
  • DEPENDS_ON edges — from package manifests (package.json, requirements.txt, go.mod)
  • CALLS/IMPORTS edges — from AST analysis
  • Developer nodes — from commit authors + CODEOWNERS
  • OWNS edges — from CODEOWNERS (confidence 0.70)
  • PR metadata — written to PostgreSQL
  • HMAC: X-Hub-Signature-256, SHA-256
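
Webhook signature verification can be sketched with the standard library alone. Assumes the raw request body and shared secret are available; uses a constant-time compare:

```python
import hashlib
import hmac

def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check an X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position via timing
    return hmac.compare_digest(expected, signature_header)
```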

Git-Only Connector

  • Source: Raw git clone/fetch, no GitHub API
  • Confidence weight: 0.95
  • Trigger: Celery beat or on-demand
  • Produces: Same as GitHub (Service, Function, Module, edges) but from local/bare repos. No PR metadata, no webhooks, no CODEOWNERS.
  • Use case: Air-gap environments, non-GitHub repos

Terraform Connector

  • Source: Terraform state files + optional post-apply webhook
  • Confidence weight: 0.85
  • Trigger: File watcher + Celery poll + optional webhook
  • Produces:
  • InfraResource nodes — from terraform show -json
  • HOSTS edges — InfraResource → Service (matched by name/tag conventions)
  • Declared ports, regions, provider metadata
  • State diff on each apply

Batch 2 — Enrichment

Kubernetes Connector

  • Source: K8s API (Deployments, Services, Pods, Ingress, ConfigMaps)
  • Confidence weight: 0.85
  • Trigger: K8s Watch API (continuous) + Celery poll every 15 min
  • Produces:
  • Service nodes — from K8s Service/Deployment resources
  • InfraResource nodes — from Pod/Node resources
  • HOSTS edges — Pod → Service
  • Reconciliation with Terraform state
  • Running container image versions

SSH Runtime Connector

  • Source: SSH to target hosts via Vault ephemeral certificates
  • Confidence weight: 0.90
  • Trigger: Celery beat every 15 min + on-demand
  • Produces:
  • Running processes, open ports, active network connections
  • Observed vs declared diff
  • Undeclared service detection
  • substrate.governance.ssh_drift_detected events
  • Security: Vault SSH CA, 5-min Ed25519 cert TTL, ForceCommand, no agent forwarding

GitHub Projects v2 Connector

  • Source: GitHub GraphQL API + projects_v2_item webhook
  • Confidence weight: 0.80
  • Trigger: Webhook + hourly GraphQL poll
  • Produces:
  • SprintNode — from project iterations
  • IntentAssertion — from project items
  • Sprint health updates

Batch 3 — Memory & Docs

Confluence Connector

  • Source: Confluence REST API + webhooks (page_created, page_updated)
  • Confidence weight: 0.70
  • Trigger: Webhook + on-demand
  • Produces:
  • DecisionNode — from pages matching ADR patterns (deterministic title/label matching)
  • FailurePattern — from pages matching post-mortem patterns
  • WHY edges — linking decisions to services mentioned in content
  • CAUSED_BY edges — linking failures to affected services
  • Note: LLM semantic extraction (MoE Scout) deferred. Pages ingested via deterministic pattern matching.

GitHub Pages Connector

  • Source: GitHub Pages build status API
  • Confidence weight: 0.70
  • Trigger: Celery poll every 6 hours
  • Produces:
  • Documentation staleness delta
  • Doc coverage score updates on linked Service nodes

Jira Connector

  • Source: Jira REST API + webhooks (issue_created, sprint_closed)
  • Confidence weight: 0.70
  • Trigger: Webhook + on-demand
  • Produces:
  • IntentAssertion nodes — from ticket descriptions
  • SprintNode updates — on sprint close events

Options Not Selected (Out of Scope)

Option Reason Deferred
Slack connector Blueprint v1.1; high noise (0.30 confidence), needs keyword triggers
Datadog connector Blueprint v1.1; runtime alert signals, not core graph data
Self-hosted GitHub Enterprise Separate connector variant; different API base URL + auth
Bitbucket / GitLab connectors Not in blueprint; different API surface
ServiceNow connector Enterprise ITSM; not in MVP scope
LLM-powered semantic extraction Phase 2+; requires DGX Spark with MoE Scout / Dense extract-lora

5. Entity Resolution (Deterministic)

Pipeline

Raw name from connector
       ↓
Step 1: Tokenization normalization
       ↓
Step 2: Alias registry lookup
       ↓
Step 3: Jaccard similarity on dependency sets
       ↓
Step 4: Confidence-based routing

Step 1 — Tokenization normalization:
  • Strip prefixes/suffixes: srv-, -service, -svc, -api, -app
  • Normalize separators: hyphens, underscores, dots → spaces
  • CamelCase split: PaymentService → payment service
  • Lowercase, sort tokens alphabetically
  • Example: srv-payment, payment-service, PaymentService, payments → payment
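
A minimal sketch of Step 1. The affix list comes from the spec; the regexes and the crude trailing-`s` singularization (needed so `payments` collapses to `payment`) are illustrative choices, not the definitive implementation:

```python
import re

# Affix tokens to drop after splitting (from the spec's prefix/suffix list)
AFFIX_TOKENS = {"srv", "service", "svc", "api", "app"}

def normalize(raw: str) -> str:
    # Split CamelCase boundaries, then map separators to spaces
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", raw)
    name = re.sub(r"[-_.]", " ", name).lower()
    # Drop affix tokens; crude singularization so plurals collide (illustrative)
    tokens = [t for t in name.split() if t not in AFFIX_TOKENS]
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return " ".join(sorted(tokens))
```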

Step 2 — Alias registry:

CREATE TABLE entity_aliases (
    raw_name TEXT PRIMARY KEY,
    canonical_id UUID NOT NULL REFERENCES services(id),
    source TEXT NOT NULL,
    confidence FLOAT NOT NULL,
    resolved_at TIMESTAMPTZ DEFAULT now()
);

Populated automatically on first resolution, manually editable via API.

Step 3 — Jaccard similarity (fallback):
  • J(A, B) = |A ∩ B| / |A ∪ B| on dependency sets
  • Similarity > 0.80 → auto-merge (confidence 0.85)
  • Similarity 0.50–0.80 → route to Verification Queue
  • Similarity < 0.50 → treat as new entity
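
Step 3 as code — the formula and thresholds are the spec's; the routing labels are illustrative:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B| on dependency sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def route(similarity: float) -> str:
    if similarity > 0.80:
        return "auto-merge"          # written with confidence 0.85
    if similarity >= 0.50:
        return "verification-queue"  # ambiguous: human review
    return "new-entity"              # treat as a new canonical node
```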

Step 4 — Confidence routing:

Resolution Method Confidence Action
Exact alias match 0.95 Auto-merge, no review
Normalization match 0.90 Auto-merge, no review
Jaccard > 0.80 0.85 Auto-merge, log for audit
Jaccard 0.50–0.80 0.65 Write as Unverified, route to Queue
No match 1.0 (new entity) Create new canonical node

Nightly Resolution Pass (Celery beat, 2:15 AM)

  1. Collect nodes added/modified in last 24 hours
  2. Run normalization + Jaccard against existing canonical nodes
  3. Auto-merge high-confidence matches
  4. Route ambiguous candidates to Verification Queue
  5. Log all decisions to audit table

LLM Upgrade Path (Phase 2+)

  • Add Step 1.5: bge-m3 embedding similarity (cosine > 0.92 → auto-merge)
  • Add Step 2.5: Dense resolve-lora classification for Jaccard 0.50–0.80 range
  • Existing normalization and alias registry remain as fast-path shortcuts

6. Confidence Scoring & Verification Queue

Source Trust Weights

Source Type Confidence Weight Rationale
Code analysis (AST) 0.95 Deterministic, source of truth
SSH runtime inspection 0.90 Direct observation
CI/CD data 0.90 Automated, reliable
Infrastructure (Terraform, K8s) 0.85 Declarative, occasionally stale
GitHub Projects v2 0.80 Structured but user-maintained
Documentation (Confluence, Pages) 0.70 Human-authored, may be outdated
CODEOWNERS ownership 0.70 Heuristic
Jira ticket components 0.70 Loosely structured
Entity resolution (Jaccard) 0.65–0.85 Varies by match strength

Confidence Thresholds

Range verification_status Action
≥ 0.90 Verified Auto-accepted
0.60–0.89 Unverified Written, queued for periodic review
0.50–0.59 Unverified Written, immediately routed to Queue
< 0.50 Rejected Not written, logged to audit
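
The threshold table maps directly to a classifier in the graph writer. Thresholds and statuses are from the table; the function and action names are illustrative:

```python
def classify(confidence: float) -> tuple[str, str]:
    """Return (verification_status, action) for a confidence score."""
    if confidence >= 0.90:
        return ("Verified", "auto-accept")
    if confidence >= 0.60:
        return ("Unverified", "periodic-review")
    if confidence >= 0.50:
        return ("Unverified", "queue-immediately")
    return ("Rejected", "audit-only")  # not written to the graph
```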

Verification Queue Wiring

Queue item sources:
  • Entity resolution ambiguous matches
  • Cross-source conflicts
  • Ownership disputes
  • New entities from low-confidence sources

Queue actions:

Action Graph Effect
Accept verification_status = Verified, confidence → 0.95
Reject verification_status = Deprecated, log rejection
Edit Update properties inline, then Accept
Escalate Route to Architect role

Escalation routing:
  • Items ≥ 0.60 → service owner (via OWNS edge)
  • Items < 0.60 → Architect role
  • Target: 10–15% human review rate

Node/Edge Confidence Schema (Neo4j)

Every node and edge carries:
  • confidence: float (0.0–1.0)
  • source: string (connector name)
  • extraction_timestamp: datetime
  • last_verification_timestamp: datetime
  • verification_status: enum (Verified / Unverified / Disputed / Stale / Deprecated)

Staleness rules (Celery beat):
  • Documentation: Stale after 30 days
  • Infrastructure: Stale after 7 days
  • SSH-verified runtime: Stale after 1 day
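
The staleness rules reduce to a per-category age check. Thresholds are the spec's; the category keys and helper name are illustrative:

```python
from datetime import timedelta

# Staleness thresholds per source category (from the rules above)
STALE_AFTER = {
    "documentation": timedelta(days=30),
    "infrastructure": timedelta(days=7),
    "ssh_runtime": timedelta(days=1),
}

def is_stale(category: str, age: timedelta) -> bool:
    """True if a node's last verification is older than its category allows."""
    limit = STALE_AFTER.get(category)
    return limit is not None and age > limit
```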


7. Graph Writer & Neo4j Schema

Graph Writer Module

Central module all connectors write through:

class GraphWriter:
    async def apply_delta(self, delta: GraphDelta) -> WriteResult:
        # 1. Entity resolution on all node names
        # 2. Apply confidence scores from source weights
        # 3. MERGE nodes and edges in a single Neo4j transaction
        # 4. Route low-confidence items to Verification Queue
        # 5. Publish substrate.graph.node_written / edge_written to NATS
        # 6. Publish substrate.cache.invalidate
        # 7. Write audit log entry to PostgreSQL

GraphDelta Schema

@dataclass
class NodeDelta:
    label: str              # "Service", "Function", "InfraResource", etc.
    properties: dict
    merge_key: str          # Property to MERGE on
    source: str
    confidence: float

@dataclass
class EdgeDelta:
    rel_type: str           # "DEPENDS_ON", "CALLS", "OWNS", etc.
    source_label: str
    source_key: dict
    target_label: str
    target_key: dict
    properties: dict
    source: str
    confidence: float

@dataclass
class GraphDelta:
    nodes: list[NodeDelta]
    edges: list[EdgeDelta]
    source_connector: str
    event_id: str           # For idempotency

Node Labels

Label Source Merge Key
Service All connectors name
Function GitHub/Git AST signature
Module GitHub/Git AST name (qualified)
InfraResource Terraform, K8s resource_id
Developer SCIM, GitHub, Git github_handle
Team SCIM name
DecisionNode Confluence, GitHub id
FailurePattern Confluence id
ExceptionNode Governance id
SprintNode GitHub Projects, Jira sprint_id
IntentAssertion GitHub Projects, Jira linked_ticket
Community Manual id
Policy OPA/PostgreSQL policy_id

Edge Types

Relationship From → To Source
CALLS Function → Function AST
IMPORTS Module → Module AST
HOSTS InfraResource → Service Terraform, K8s
ACTUALLY_CALLS Service → Service SSH, K8s
DEPENDS_ON Service → Service Package manifests
OWNS Developer/Team → Service CODEOWNERS, SCIM
MEMBER_OF Developer → Team SCIM
CHILD_OF Team → Team SCIM
WHY DecisionNode → Service/Policy Confluence
CAUSED_BY FailurePattern → Service Confluence
PREVENTED_BY Service → Policy Governance
CONTAINS Community → Service Manual

New Constraints and Indexes

CREATE CONSTRAINT function_sig IF NOT EXISTS FOR (f:Function) REQUIRE f.signature IS UNIQUE;
CREATE CONSTRAINT module_name IF NOT EXISTS FOR (m:Module) REQUIRE m.name IS UNIQUE;
CREATE CONSTRAINT infra_resource_id IF NOT EXISTS FOR (r:InfraResource) REQUIRE r.resource_id IS UNIQUE;
CREATE CONSTRAINT sprint_id IF NOT EXISTS FOR (s:SprintNode) REQUIRE s.sprint_id IS UNIQUE;
CREATE CONSTRAINT intent_ticket IF NOT EXISTS FOR (i:IntentAssertion) REQUIRE i.linked_ticket IS UNIQUE;
CREATE CONSTRAINT exception_id IF NOT EXISTS FOR (e:ExceptionNode) REQUIRE e.id IS UNIQUE;

CREATE INDEX function_file IF NOT EXISTS FOR (f:Function) ON (f.file);
CREATE INDEX infra_type IF NOT EXISTS FOR (r:InfraResource) ON (r.type);
CREATE INDEX infra_provider IF NOT EXISTS FOR (r:InfraResource) ON (r.provider);
CREATE INDEX developer_keycloak IF NOT EXISTS FOR (d:Developer) ON (d.keycloak_id);
CREATE INDEX decision_source IF NOT EXISTS FOR (d:DecisionNode) ON (d.source_url);
CREATE INDEX sprint_dates IF NOT EXISTS FOR (s:SprintNode) ON (s.start_date);
CREATE INDEX node_source IF NOT EXISTS FOR (s:Service) ON (s.source);
CREATE INDEX node_confidence IF NOT EXISTS FOR (s:Service) ON (s.confidence);
CREATE INDEX node_verification IF NOT EXISTS FOR (s:Service) ON (s.verification_status);

Graph Algorithm Jobs (Celery beat, 3:00 AM)

PageRank:

CALL gds.pageRank.write('service-dependency-graph', {
  writeProperty: 'page_rank', maxIterations: 20, dampingFactor: 0.85
})

Betweenness centrality:

CALL gds.betweenness.write('service-dependency-graph', {
  writeProperty: 'betweenness'
})

Weakly connected components:

CALL gds.wcc.write('service-dependency-graph', {
  writeProperty: 'component_id'
})

Cycle detection:
  • DEPENDS_ON subgraph → NetworkX DiGraph → nx.simple_cycles(G)
  • Run synchronously on PR ingestion (before governance evaluation)
  • Cycles written as policy violation candidates
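
The cycle-detection step can be sketched as a thin wrapper over NetworkX; the edge-list input shape is an assumption, the library calls are standard:

```python
import networkx as nx

def find_dependency_cycles(edges: list[tuple[str, str]]) -> list[list[str]]:
    """Load DEPENDS_ON edges into a DiGraph and enumerate simple cycles."""
    g = nx.DiGraph()
    g.add_edges_from(edges)
    return list(nx.simple_cycles(g))
```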


8. AST Parser (Rust CLI)

Architecture

Standalone Rust binary invoked by ingestion workers:

ingestion-worker receives push/PR event
       ↓
clones/fetches repo to temp directory
       ↓
substrate-ast-cli parse --repo /tmp/repo --output json
       ↓
tree-sitter parses each file, stack-graphs resolves cross-file refs
       ↓
JSON output: nodes + edges
       ↓
ingestion-worker builds GraphDelta, sends to graph_writer

Technology

  • tree-sitter — Incremental parsing, grammars for Python, TypeScript, Go, Java, Rust, C#
  • stack-graphs — GitHub's cross-file name resolution library

Supported Languages (Phase 1)

Language Grammar stack-graphs Priority
Python tree-sitter-python Yes High
TypeScript/JavaScript tree-sitter-typescript Yes High
Go tree-sitter-go Yes High
Java tree-sitter-java Yes Medium
Rust tree-sitter-rust Yes Medium
C# tree-sitter-c-sharp Partial Low

CLI Output Format

{
  "nodes": [
    {
      "label": "Function",
      "properties": {
        "signature": "app.services.payment.process_payment",
        "file": "src/services/payment.py",
        "line": 42,
        "end_line": 87,
        "hash": "sha256:abc123...",
        "language": "python",
        "visibility": "public"
      }
    },
    {
      "label": "Module",
      "properties": {
        "name": "app.services.payment",
        "file": "src/services/payment.py",
        "language": "python"
      }
    }
  ],
  "edges": [
    {
      "rel_type": "CALLS",
      "source": "app.services.payment.process_payment",
      "target": "app.services.order.create_order",
      "properties": { "static": true }
    },
    {
      "rel_type": "IMPORTS",
      "source": "app.services.payment",
      "target": "app.services.order",
      "properties": { "static": true, "dynamic": false }
    }
  ]
}

Service Detection Heuristics

Signal Confidence Example
Dockerfile present 0.95 services/payment/Dockerfile → PaymentService
docker-compose service entry 0.95 services.payment in compose file
Package manifest at root 0.90 services/payment/package.json
K8s deployment YAML 0.85 k8s/payment-deployment.yaml
Directory with main.* 0.80 services/payment/main.py
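
A sketch of applying the heuristics table to a repo file listing, keeping the strongest signal per directory. Signal weights are from the table; the matching logic and `detect_services` name are illustrative (only a subset of signals is shown):

```python
import posixpath

# (filename signal, confidence) — subset of the heuristics table above
SIGNALS = [
    ("Dockerfile", 0.95),
    ("docker-compose.yml", 0.95),
    ("package.json", 0.90),
]

def detect_services(paths: list[str]) -> dict[str, float]:
    """Map candidate service directory -> best confidence signal found."""
    found: dict[str, float] = {}
    for path in paths:
        directory, filename = posixpath.split(path)
        conf = None
        for signal, weight in SIGNALS:
            if filename == signal:
                conf = weight
        if filename.startswith("main."):          # e.g. main.py, main.go
            conf = max(conf or 0.0, 0.80)
        if conf is not None:
            found[directory] = max(found.get(directory, 0.0), conf)
    return found
```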

Build & Distribution

  • Static binary: cargo build --release
  • Added to backend Docker image at /usr/local/bin/substrate-ast-cli
  • Separate Cargo.toml at tools/ast-cli/

Incremental Parsing

On PR events, only changed files are parsed:

substrate-ast-cli parse --repo /tmp/repo --files src/payment.py,src/order.py --output json

9. OPA Governance Foundation

Deployment

OPA server in Docker Compose:

opa:
  image: openpolicyagent/opa:latest
  command: ["run", "--server", "--bundle", "/policies"]
  ports: ["8181:8181"]
  volumes: ["./backend/policies:/policies"]

Evaluation Pipeline

Graph write event (substrate.graph.node_written)
       ↓
Governance worker picks up event
       ↓
Fetches affected subgraph from Neo4j
       ↓
Serializes subgraph as JSON context
       ↓
POST /v1/data/substrate/violations → OPA
       ↓
Returns violations with rule IDs, severity, affected nodes
       ↓
Violations written to PostgreSQL policy_violations
       ↓
Publishes substrate.governance.violation_raised to NATS

Starter Policy Packs (3 of 9)

1. Domain Boundary (GOV-BOUNDARY)

package substrate.boundary

violation[msg] {
    edge := input.edges[_]
    edge.type == "ACTUALLY_CALLS"
    source := input.nodes[edge.source]
    target := input.nodes[edge.target]
    source.domain != target.domain
    not edge.via_gateway
    msg := sprintf("%s (%s) calls %s (%s) directly without gateway",
        [source.name, source.domain, target.name, target.domain])
}

2. Circular Dependency (GOV-CIRCULAR)

package substrate.circular

violation[msg] {
    cycle := input.cycles[_]
    msg := sprintf("Circular dependency detected: %s", [concat(" → ", cycle)])
}

Cycles pre-computed by NetworkX and passed in OPA input context.

3. Ownership Completeness (GOV-OWNERSHIP)

package substrate.ownership

violation[msg] {
    service := input.services[_]
    service.verification_status != "Deprecated"
    count(service.owners) == 0
    msg := sprintf("%s has no active owner", [service.name])
}

Policy Packs Deferred to Phase 2

Pack ID Reason
SOLID Single Responsibility GOV-SRP Needs efferent coupling metrics
SOLID Open/Closed GOV-OCP Needs extension pattern detection
SOLID Dependency Inversion GOV-DIP Needs abstract vs concrete classification
TDD Coverage GOV-TDD Needs CI test coverage data
API-First GOV-API Needs OpenAPI spec detection
License Compatibility GOV-LICENSE Needs dependency license resolution

No GitHub Checks API in Phase 1

Violations are recorded and visible in the UI. GitHub Checks API integration (PR blocking, violation comments) is Phase 2.


10. SCIM, RBAC & Identity

Custom Keycloak Dockerfile

FROM quay.io/keycloak/keycloak:latest
ADD --chown=keycloak:keycloak \
    https://github.com/Captain-P-Goldfish/scim-for-keycloak/releases/latest/download/scim-for-keycloak.jar \
    /opt/keycloak/providers/
RUN /opt/keycloak/bin/kc.sh build

SCIM Endpoints

Method Path Purpose
POST /scim/v2/Users Create Developer node
PATCH /scim/v2/Users/{id} Update/deactivate Developer
POST /scim/v2/Groups Create Team node
PATCH /scim/v2/Groups/{id} Add/remove MEMBER_OF edges
DELETE /scim/v2/Groups/{id} Mark Team as Deprecated

Onboarding (POST /Users):
  1. Create Developer node (active: true)
  2. Create MEMBER_OF edges per group
  3. Create Team if missing (verification_status = Unverified)
  4. Publish substrate.identity.user_onboarded

Offboarding (PATCH /Users active=false):
  1. Set Developer.active = false
  2. Query orphaned services:

MATCH (d:Developer {keycloak_id: $kid, active: false})-[:OWNS]->(s:Service)
WHERE NOT EXISTS {
  MATCH (other:Developer {active: true})-[:OWNS]->(s)
}
AND NOT EXISTS {
  MATCH (t:Team)-[:OWNS]->(s)
  WHERE EXISTS { MATCH (m:Developer {active: true})-[:MEMBER_OF]->(t) }
}
RETURN s.name AS orphaned_service
  3. Flag orphaned services CRITICAL in Queue
  4. Publish substrate.identity.user_offboarded

RBAC Roles

Role Capabilities
admin Full access, user management, system configuration
architect Full graph, policy authoring, simulation, exception approval
developer Own service graph, PR details, intent mismatch alerts
viewer Read-only graphs, dashboards, reports
service-account API-only, webhook delivery, no UI

Role Enforcement

def require_role(*roles: str):
    def checker(user: UserInfo = Depends(get_current_user)):
        if not any(r in user.realm_roles for r in roles):
            raise HTTPException(403, "Insufficient role")
        return user
    return Depends(checker)

Ownership-Scoped Filtering

Developer role queries filtered to owned services:

async def get_owned_services(user: UserInfo, neo4j_session) -> list[str]:
    result = await neo4j_session.run("""
        MATCH (d:Developer {keycloak_id: $kid})
        OPTIONAL MATCH (d)-[:OWNS]->(s1:Service)
        OPTIONAL MATCH (d)-[:MEMBER_OF]->(t:Team)-[:OWNS]->(s2:Service)
        RETURN collect(DISTINCT s1.name) + collect(DISTINCT s2.name) AS services
    """, kid=user.sub)
    record = await result.single()
    return record["services"] if record else []

Route-Level Requirements

Endpoint Required Role
GET /dashboard, communities, graph Any authenticated
GET /policies Any authenticated
POST /policies, PUT /policies/{id} architect, admin
GET /queue developer, architect, admin
PATCH /queue/{id} (Accept/Reject) architect, admin
PATCH /queue/{id} (Edit) developer (own), architect, admin
POST /memory developer, architect, admin
POST /simulation/run architect, admin
SCIM endpoints service-account only
Webhook receivers HMAC-verified, no JWT

Group-to-Role Mapping

CREATE TABLE role_mappings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    keycloak_group TEXT NOT NULL UNIQUE,
    substrate_role TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
);

11. UI Changes

New Dependencies

Package Purpose
elkjs Sugiyama layout algorithm
web-worker Run layout off main thread

Graph Rendering

Layout engines (user-selectable):
  • Sugiyama (default): Dependencies flow top-to-bottom via ELK.js. Computed in a web worker.
  • Force-directed: Nodes cluster by relationship density. React Flow built-in.

Semantic Zoom

Detected from viewport scale, not explicit user action. 300ms ease-in-out transitions.

Scale Level Rendered
< 0.4 Far (Domain) Domain super-nodes. Aggregate health badge, violation count, tension score.
0.4–0.8 Mid (Service) Service nodes with tension ring, violation badge, owner label. PageRank-weighted sizing.
> 0.8 Close (Component) Function/Module nodes inside service boundaries. File path, line number, call counts.
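
The level selection is a pure function of viewport scale. A sketch of the table's thresholds (the UI implements this in TypeScript; the Python here and the `zoom_level` name are illustrative):

```python
def zoom_level(scale: float) -> str:
    """Map viewport scale to the semantic zoom level from the table above."""
    if scale < 0.4:
        return "domain"     # Far: domain super-nodes
    if scale <= 0.8:
        return "service"    # Mid: service nodes
    return "component"      # Close: Function/Module nodes
```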

Graph Toolbar

[Layout: Sugiyama ▾] [Filter ▾] [Minimap ☐] [Fit View]

Filter sidebar:
  • Domain / community
  • Violation type (boundary, circular, ownership)
  • Confidence threshold (slider, default 0.60)
  • Owner / team
  • Verification status
  • Time window (7d / 14d / 30d)

Verification Queue UI Updates

  • Items show source connector, confidence badge (red < 0.60, yellow 0.60–0.79, green ≥ 0.80)
  • Accept/Reject/Edit/Escalate call real API endpoints
  • Items grouped by resolution_type

Dashboard Updates

All metrics computed from real data:
  • Violations from policy_violations WHERE resolved = false
  • Tension from graph algorithm output
  • Memory entries from real ingestion
  • Memory gaps from services with no WHY edges

RBAC in UI

  • Role from JWT claims
  • Nav items conditionally rendered
  • Policy editor: read-only for Developer/Viewer
  • Simulation Run: hidden for Developer/Viewer
  • Queue actions: role-appropriate visibility

Pages NOT Changed

Page Reason
Search Canned results until GraphRAG (Phase 3)
Simulation rendering Seeded results until simulation engine (Phase 4)

12. Celery Beat Schedule

Job Schedule Description
Nightly connector poll 2:00 AM daily Full poll on all polling connectors
Entity resolution pass 2:15 AM daily Normalization + Jaccard on last 24h nodes
PageRank update 3:00 AM daily GDS PageRank on service graph
Betweenness centrality 3:00 AM daily GDS betweenness on service graph
Connected components 3:00 AM daily GDS WCC for orphan detection
Staleness check 4:00 AM daily Mark nodes Stale per threshold rules
K8s API poll Every 15 min K8s Watch API resource changes
SSH Runtime inspection Every 15 min Vault cert → SSH → inspection → diff
GitHub Pages poll Every 6 hours Pages build timestamps vs code commits
GitHub Projects poll Every 1 hour GraphQL project item changes

Distributed Locking

async def acquire_lock(redis, event_id: str, ttl: int = 300) -> bool:
    return await redis.set(f"lock:ingestion:{event_id}", "1", nx=True, ex=ttl)

13. Infrastructure & Docker Compose

New Services

Service Image Purpose
opa openpolicyagent/opa:latest Policy evaluation
keycloak Custom Dockerfile Identity + SCIM 2.0
celery-worker Backend image Ingestion processing
celery-beat Backend image Job scheduling
celery-governance Backend image Policy evaluation workers
vault hashicorp/vault:latest SSH CA

Startup Order

1. PostgreSQL, Neo4j, Redis, NATS         (databases)
2. Flyway migrations                       (schema)
3. Keycloak (custom, SCIM plugin)          (identity)
4. Vault                                   (SSH CA)
5. OPA                                     (policy engine)
6. Backend (FastAPI)                        (API layer)
7. Celery beat                             (scheduler)
8. Celery workers (ingestion, governance)  (async processing)
9. UI (Nginx)                              (frontend)

Backend Dockerfile Updates

Multi-stage build:
  1. ast-builder stage — Compiles Rust CLI from tools/ast-cli/
  2. python stage — Backend app with CLI binary, git, SSH client

New Environment Variables

NATS_URL=nats://nats:4222
OPA_URL=http://opa:8181
CELERY_BROKER_URL=redis://redis:6379/1
CELERY_RESULT_BACKEND=redis://redis:6379/2
VAULT_ADDR=http://vault:8200
VAULT_TOKEN=substrate-dev-token
VAULT_SSH_ROLE=substrate-ssh
GITHUB_APP_ID=
GITHUB_APP_PRIVATE_KEY_PATH=
GITHUB_WEBHOOK_SECRET=
TERRAFORM_STATE_PATHS=
KUBECONFIG_PATH=
SSH_TARGET_HOSTS=
CONFLUENCE_URL=
CONFLUENCE_API_TOKEN=
CONFLUENCE_WEBHOOK_SECRET=
JIRA_URL=
JIRA_API_TOKEN=
JIRA_WEBHOOK_SECRET=

Project Directory Structure (Phase 1 additions)

substrate/
  backend/
    app/
      modules/
        ingestion/          # Webhook receivers, NATS publishers
        connectors/         # 8 connector implementations
        graph_writer/       # Entity resolution, confidence, MERGE logic
        governance/         # OPA client, evaluation pipeline
        identity/           # SCIM endpoint, Developer/Team mutations
        rbac/               # Role enforcement, ownership filtering
        scheduler/          # Celery app, beat config, tasks
        ast_parser/         # Rust CLI wrapper
      core/
        nats.py             # NATS client helpers
    policies/               # OPA Rego packs
      boundary/policy.rego
      circular/policy.rego
      ownership/policy.rego
    db/
      postgres/V4__phase1_schema.sql
      neo4j/V2__phase1_constraints.cypher
  tools/
    ast-cli/                # Rust AST parser
      Cargo.toml
      src/main.rs
  keycloak/
    Dockerfile              # Custom Keycloak with SCIM
  ui/
    src/
      components/graph/
        ElkLayout.ts
        SemanticZoom.tsx
        LayoutSelector.tsx
        FilterSidebar.tsx

14. Database Migrations

PostgreSQL: V4__phase1_schema.sql

CREATE TABLE entity_aliases (
    raw_name TEXT PRIMARY KEY,
    canonical_id UUID NOT NULL,
    source TEXT NOT NULL,
    confidence FLOAT NOT NULL,
    resolved_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE ingestion_events (
    event_id TEXT PRIMARY KEY,
    source_connector TEXT NOT NULL,
    event_type TEXT NOT NULL,
    payload_hash TEXT NOT NULL,
    status TEXT DEFAULT 'pending',
    nodes_written INT DEFAULT 0,
    edges_written INT DEFAULT 0,
    created_at TIMESTAMPTZ DEFAULT now(),
    completed_at TIMESTAMPTZ
);

CREATE TABLE audit_log (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    timestamp TIMESTAMPTZ DEFAULT now(),
    actor TEXT NOT NULL,
    action TEXT NOT NULL,
    resource_type TEXT,
    resource_id TEXT,
    input_hash TEXT,
    output_hash TEXT,
    confidence FLOAT,
    detail JSONB
);

CREATE TABLE webhook_configs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    connector_type TEXT NOT NULL,
    endpoint_path TEXT NOT NULL,
    hmac_secret TEXT NOT NULL,
    active BOOLEAN DEFAULT true,
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE role_mappings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    keycloak_group TEXT NOT NULL UNIQUE,
    substrate_role TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE connector_state (
    connector_type TEXT PRIMARY KEY,
    last_poll_at TIMESTAMPTZ,
    last_success_at TIMESTAMPTZ,
    cursor_state JSONB,
    error_count INT DEFAULT 0,
    last_error TEXT,
    active BOOLEAN DEFAULT true
);

ALTER TABLE queue_items ADD COLUMN IF NOT EXISTS source_connector TEXT;
ALTER TABLE queue_items ADD COLUMN IF NOT EXISTS candidate_data JSONB;
ALTER TABLE queue_items ADD COLUMN IF NOT EXISTS conflicting_data JSONB;
ALTER TABLE queue_items ADD COLUMN IF NOT EXISTS neo4j_node_id TEXT;
ALTER TABLE queue_items ADD COLUMN IF NOT EXISTS resolution_type TEXT;

Neo4j: V2__phase1_constraints.cypher

See Section 7 for full constraint and index definitions.

Seed Data Transition

  • Seed data nodes have source: "seed", confidence: 0.80
  • Connector data arrives with higher confidence and overwrites
  • After all connectors live, cleanup job removes unmatched seed-only nodes
  • No breaking migration — Phase 0 UI continues working throughout

pyproject.toml Additions

"celery[redis]>=5.3"
"nats-py>=2.7"
"networkx>=3.0"
"aiohttp>=3.9"
"paramiko>=3.4"
"hvac>=2.0"
"kubernetes-asyncio>=30"

15. Testing Strategy

Test Layers

Layer What How Target
Route tests Status codes, shapes, auth, RBAC httpx, mocked services ~60
Unit — connectors Parsing, events, GraphDelta Mock APIs, fixture payloads ~40
Unit — entity resolution Normalization, alias, Jaccard Pure functions ~15
Unit — graph writer MERGE patterns, confidence, validation Mock Neo4j ~10
Unit — governance OPA input, violation parsing Mock OPA HTTP ~10
Unit — RBAC Role checks, ownership filtering Mock user, mock Neo4j ~10
Unit — SCIM Event parsing, graph mutations Mock Neo4j ~10
Unit — AST CLI JSON parsing, service detection Fixture repos ~15

Total: ~170 tests

Connector Fixtures

tests/fixtures/{connector}/ — sample payloads per connector
tests/fixtures/ast/ — small fixture repos for CLI testing

Not Testing in Phase 1

  • Integration tests with testcontainers
  • End-to-end webhook → Neo4j flows
  • OPA with real server
  • Load/performance testing

16. Explicitly Out of Scope

Item Target Phase
LLM entity resolution (Dense resolve-lora) Phase 2+
bge-m3 embeddings / pgvector semantic search Phase 2+
GitHub Checks API (PR blocking) Phase 2
Violation PR comments Phase 2
explain-lora violation explanations Phase 2
Leiden community detection Phase 2
GraphRAG / NL search Phase 3
Simulation engine (Neo4j sandboxes) Phase 4
Agent orchestration (Fix PR) Phase 4
Slack connector v1.1
Datadog connector v1.1
Self-hosted GitHub Enterprise Later
Bitbucket / GitLab connectors Not in blueprint
ServiceNow connector Not in MVP
Integration tests with testcontainers Phase 2
Load/performance testing Phase 2+
WebGL 3D graph (Sigma.js) v2
Multi-team graph isolation v2