Phase 1: Deterministic Ingestion & Graph Population — Design Spec¶
Date: 2026-03-26 Status: Approved Scope: Build the ingestion pipeline, 8 data connectors, entity resolution, confidence scoring, OPA governance foundation, SCIM/RBAC, and dynamic graph rendering — all without LLM dependency.
1. Decisions Summary¶
| Decision | Choice | Alternatives Considered |
|---|---|---|
| LLM dependency | Deterministic-first — no LLM required | Blueprint-faithful (requires DGX Spark), Connector-focused (narrower scope) |
| Connector scope | All 8 sources in 3 batches | Fewer connectors first; deferred due to all sources being available |
| Build order | Infrastructure-first | Vertical-slice per connector (visible progress faster), Hybrid (minimal infra then first connector). Chose infra-first for stable foundation. |
| OPA governance | Include foundation — 3 starter packs | Defer entirely to Phase 2. Included so policies validate against real graph data early. |
| RBAC | Full — role-based + ownership scoping | Route protection only (simpler), Defer entirely. Chose full because SCIM is needed anyway. |
| SCIM | Custom Keycloak Dockerfile with SCIM plugin | Manual user management. SCIM chosen for automated identity lifecycle. |
| Confidence pipeline | Full — source weights, thresholds, queue routing | Scoring only without routing. Chose full because Queue module already exists. |
| Graph layout | ELK.js Sugiyama + force-directed, user-selectable | Sugiyama only, Force-directed only. Both gives flexibility. |
| Semantic zoom | All 3 levels (Domain → Service → Component) | Domain + Service only. Chose 3 levels because AST CLI delivers Function/Module nodes. |
| Event bus | Full NATS event-driven pipeline | Direct Celery dispatch (simpler), Hybrid. Chose full NATS for replay and decoupling. |
| Testing | Route + unit tests, mocked services | Integration tests with testcontainers (deferred to Phase 2). |
| GitHub target | GitHub.com only | Self-hosted GitHub Enterprise deferred. |
2. Architecture Overview¶
Phase 1 adds 4 subsystems to the Phase 0 backend:
Webhook Receivers (FastAPI routes)
↓ publish
NATS JetStream (event bus)
↓ consume
Celery Workers (async processing)
↓ write
Graph Writers (Neo4j + PostgreSQL)
↓ emit
NATS completion events
↓ trigger
Downstream consumers (cache invalidation, governance, confidence scoring)
New Backend Modules¶
| Module | Purpose |
|---|---|
| `ingestion/` | Connector framework, webhook receivers, NATS publishers |
| `connectors/` | 8 connector implementations (GitHub, Git, Terraform, K8s, SSH, Projects v2, Confluence, Jira) |
| `graph_writer/` | Canonical graph write logic — entity resolution, confidence scoring, Neo4j MERGE patterns |
| `governance/` | OPA client, policy evaluation pipeline, violation writer |
| `identity/` | SCIM endpoint, Developer/Team graph mutations, key-person risk |
| `rbac/` | Role-based access control middleware, ownership-scoped filtering |
| `scheduler/` | Celery app, beat config, task definitions |
| `ast_parser/` | Rust CLI wrapper for tree-sitter + stack-graphs AST parsing |
Existing Modules Updated¶
- `graph/` — Dynamic layout (ELK.js data), 3-level semantic zoom queries
- `queue/` — Wired to real confidence-based routing instead of seed data
- `dashboard/` — Metrics from real computed data
- `community/` — Communities from graph structure (manual initially, Leiden deferred)
3. Ingestion Pipeline¶
Event Sources¶
| Source Type | Trigger | Example |
|---|---|---|
| Webhook receivers | HTTP POST from external service | GitHub push, Jira issue created |
| Pollers | Celery beat scheduled jobs | K8s API (15 min), GitHub Pages (6 hours) |
| SSH Runtime | Celery beat every 15 minutes | Vault ephemeral cert → SSH → inspection script |
| Git-only | Celery beat or on-demand | Clone/fetch repos, run AST parsing |
NATS Subject Hierarchy¶
substrate.ingestion.github.pr_opened
substrate.ingestion.github.push
substrate.ingestion.github.codeowners_changed
substrate.ingestion.terraform.state_updated
substrate.ingestion.k8s.resources_polled
substrate.ingestion.ssh.inspection_completed
substrate.ingestion.git.repo_parsed
substrate.ingestion.confluence.page_updated
substrate.ingestion.jira.issue_created
substrate.ingestion.jira.sprint_closed
substrate.ingestion.github_projects.item_updated
substrate.ingestion.github_pages.build_completed
substrate.governance.violation_raised
substrate.governance.policy_evaluated
substrate.identity.user_onboarded
substrate.identity.user_offboarded
substrate.cache.invalidate
substrate.graph.node_written
substrate.graph.edge_written
NATS JetStream Streams¶
| Stream | Subject Filter | Retention | Purpose |
|---|---|---|---|
| `INGESTION` | `substrate.ingestion.>` | 7 days | All connector events |
| `GOVERNANCE` | `substrate.governance.>` | 7 days | Policy evaluation results |
| `IDENTITY` | `substrate.identity.>` | 30 days | SCIM lifecycle events |
| `GRAPH` | `substrate.graph.>` | 24 hours | Node/edge write notifications |
| `CACHE` | `substrate.cache.>` | 1 hour | Cache invalidation signals |
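The stream filters above rely on NATS wildcard subjects, where `>` matches one or more trailing tokens. A minimal pure-Python sketch of that matching rule, for illustration only (the real routing happens inside JetStream):

```python
def subject_matches(filter_: str, subject: str) -> bool:
    """NATS-style subject matching: '*' matches exactly one token,
    '>' matches one or more trailing tokens."""
    f_tokens = filter_.split(".")
    s_tokens = subject.split(".")
    for i, ft in enumerate(f_tokens):
        if ft == ">":
            return len(s_tokens) > i  # '>' must cover at least one token
        if i >= len(s_tokens):
            return False
        if ft != "*" and ft != s_tokens[i]:
            return False
    return len(s_tokens) == len(f_tokens)
```

An event published to `substrate.ingestion.github.push` therefore lands in the `INGESTION` stream and no other.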
Celery Worker Pools¶
| Pool | Concurrency | Consumes From | Writes To |
|---|---|---|---|
| `ingestion-worker` | 4 | `INGESTION` stream | Neo4j, PostgreSQL |
| `governance-worker` | 2 | `GOVERNANCE` stream | PostgreSQL |
| `identity-worker` | 1 | `IDENTITY` stream | Neo4j |
Connector Base Class¶
from abc import ABC, abstractmethod

class BaseConnector(ABC):
source_name: str
confidence_weight: float
@abstractmethod
async def ingest(self, event: dict) -> list[GraphDelta]: ...
@abstractmethod
async def poll(self) -> list[GraphDelta]: ...
Idempotency¶
All graph writes use MERGE:
MERGE (s:Service {name: $name})
ON CREATE SET s += $props, s.created_at = timestamp()
ON MATCH SET s += $props, s.updated_at = timestamp()
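A query-builder sketch showing how a writer might render this MERGE pattern for any node label and merge key. This is illustrative, not the spec's GraphWriter: `render_merge` and the `$key`/`$props` parameter names are assumptions, and label/key interpolation is safe only because both come from trusted connector code, never user input:

```python
def render_merge(label: str, merge_key: str) -> str:
    """Render the idempotent MERGE pattern for one node label.
    Property values stay as driver parameters ($key, $props);
    only the trusted label and merge-key identifiers are interpolated."""
    return (
        f"MERGE (n:{label} {{{merge_key}: $key}})\n"
        "ON CREATE SET n += $props, n.created_at = timestamp()\n"
        "ON MATCH SET n += $props, n.updated_at = timestamp()"
    )
```

Replaying the same event re-runs the same MERGE and simply takes the `ON MATCH` branch, which is what makes the pipeline idempotent.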
4. Connector Specifications¶
Batch 1 — Foundation¶
GitHub Connector (repos + PRs + CODEOWNERS)¶
- Source: GitHub.com REST API v3 + Webhooks
- Confidence weight: 0.95
- Webhook events: `push`, `pull_request.opened`, `pull_request.synchronize`, `pull_request.merged`
- Produces:
- Service nodes — from repo structure, package manifests, Dockerfiles
- Function/Module nodes — from AST parsing via Rust CLI
- DEPENDS_ON edges — from package manifests (package.json, requirements.txt, go.mod)
- CALLS/IMPORTS edges — from AST analysis
- Developer nodes — from commit authors + CODEOWNERS
- OWNS edges — from CODEOWNERS (confidence 0.70)
- PR metadata — written to PostgreSQL
- HMAC verification: `X-Hub-Signature-256` header, SHA-256
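GitHub signs each webhook payload with HMAC-SHA-256 over the raw request body and sends the hex digest in `X-Hub-Signature-256` prefixed with `sha256=`. A receiver-side check (function name is illustrative; the constant-time comparison is the important part):

```python
import hashlib
import hmac

def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Validate the X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking where the comparison diverges (timing side channel)
    return hmac.compare_digest(expected, signature_header)
```

The body must be the exact bytes received — re-serializing parsed JSON before hashing will break verification.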
Git-Only Connector¶
- Source: Raw git clone/fetch, no GitHub API
- Confidence weight: 0.95
- Trigger: Celery beat or on-demand
- Produces: Same as GitHub (Service, Function, Module, edges) but from local/bare repos. No PR metadata, no webhooks, no CODEOWNERS.
- Use case: Air-gap environments, non-GitHub repos
Terraform Connector¶
- Source: Terraform state files + optional post-apply webhook
- Confidence weight: 0.85
- Trigger: File watcher + Celery poll + optional webhook
- Produces:
- InfraResource nodes — from `terraform show -json`
- HOSTS edges — InfraResource → Service (matched by name/tag conventions)
- Declared ports, regions, provider metadata
- State diff on each apply
Batch 2 — Enrichment¶
Kubernetes Connector¶
- Source: K8s API (Deployments, Services, Pods, Ingress, ConfigMaps)
- Confidence weight: 0.85
- Trigger: K8s Watch API (continuous) + Celery poll every 15 min
- Produces:
- Service nodes — from K8s Service/Deployment resources
- InfraResource nodes — from Pod/Node resources
- HOSTS edges — Pod → Service
- Reconciliation with Terraform state
- Running container image versions
SSH Runtime Connector¶
- Source: SSH to target hosts via Vault ephemeral certificates
- Confidence weight: 0.90
- Trigger: Celery beat every 15 min + on-demand
- Produces:
- Running processes, open ports, active network connections
- Observed vs declared diff
- Undeclared service detection → `substrate.governance.ssh_drift_detected` events
- Security: Vault SSH CA, 5-min Ed25519 cert TTL, ForceCommand, no agent forwarding
GitHub Projects v2 Connector¶
- Source: GitHub GraphQL API + `projects_v2_item` webhook
- Confidence weight: 0.80
- Trigger: Webhook + hourly GraphQL poll
- Produces:
- SprintNode — from project iterations
- IntentAssertion — from project items
- Sprint health updates
Batch 3 — Memory & Docs¶
Confluence Connector¶
- Source: Confluence REST API + webhooks (`page_created`, `page_updated`)
- Confidence weight: 0.70
- Trigger: Webhook + on-demand
- Produces:
- DecisionNode — from pages matching ADR patterns (deterministic title/label matching)
- FailurePattern — from pages matching post-mortem patterns
- WHY edges — linking decisions to services mentioned in content
- CAUSED_BY edges — linking failures to affected services
- Note: LLM semantic extraction (MoE Scout) deferred. Pages ingested via deterministic pattern matching.
GitHub Pages Connector¶
- Source: GitHub Pages build status API
- Confidence weight: 0.70
- Trigger: Celery poll every 6 hours
- Produces:
- Documentation staleness delta
- Doc coverage score updates on linked Service nodes
Jira Connector¶
- Source: Jira REST API + webhooks (`issue_created`, `sprint_closed`)
- Confidence weight: 0.70
- Trigger: Webhook + on-demand
- Produces:
- IntentAssertion nodes — from ticket descriptions
- SprintNode updates — on sprint close events
Options Not Selected (Out of Scope)¶
| Option | Reason Deferred |
|---|---|
| Slack connector | Blueprint v1.1; high noise (0.30 confidence), needs keyword triggers |
| Datadog connector | Blueprint v1.1; runtime alert signals, not core graph data |
| Self-hosted GitHub Enterprise | Separate connector variant; different API base URL + auth |
| Bitbucket / GitLab connectors | Not in blueprint; different API surface |
| ServiceNow connector | Enterprise ITSM; not in MVP scope |
| LLM-powered semantic extraction | Phase 2+; requires DGX Spark with MoE Scout / Dense extract-lora |
5. Entity Resolution (Deterministic)¶
Pipeline¶
Raw name from connector
↓
Step 1: Tokenization normalization
↓
Step 2: Alias registry lookup
↓
Step 3: Jaccard similarity on dependency sets
↓
Step 4: Confidence-based routing
Step 1 — Tokenization normalization:
- Strip prefixes/suffixes: srv-, -service, -svc, -api, -app
- Normalize separators: hyphens, underscores, dots → spaces
- CamelCase split: PaymentService → payment service
- Lowercase, sort tokens alphabetically
- Example: srv-payment, payment-service, PaymentService, payments → payment
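One reasonable implementation of Step 1 (the stopword token set and rule ordering are assumptions — the spec lists the rules but not their order; note that plural folding such as `payments` → `payment` would need a stemming rule beyond what is listed):

```python
import re

# Prefix/suffix tokens dropped during normalization (assumed set, per the rules above)
STOP_TOKENS = {"srv", "service", "svc", "api", "app"}

def normalize_name(raw: str) -> str:
    """Tokenization normalization: camelCase split, separator folding,
    stopword removal, lowercase, alphabetical token sort."""
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", raw)  # PaymentService -> Payment Service
    s = re.sub(r"[-_.]+", " ", s).lower()            # hyphens/underscores/dots -> spaces
    tokens = [t for t in s.split() if t not in STOP_TOKENS]
    return " ".join(sorted(tokens))
```

With this sketch, `srv-payment`, `payment-service`, and `PaymentService` all normalize to `payment`, so they hit the same canonical entity in later steps.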
Step 2 — Alias registry:
CREATE TABLE entity_aliases (
raw_name TEXT PRIMARY KEY,
canonical_id UUID NOT NULL REFERENCES services(id),
source TEXT NOT NULL,
confidence FLOAT NOT NULL,
resolved_at TIMESTAMPTZ DEFAULT now()
);
Populated automatically on first resolution, manually editable via API.
Step 3 — Jaccard similarity (fallback):
- J(A, B) = |A ∩ B| / |A ∪ B| on dependency sets
- Similarity > 0.80 → auto-merge (confidence 0.85)
- Similarity 0.50–0.80 → route to Verification Queue
- Similarity < 0.50 → treat as new entity
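The Jaccard fallback and its thresholds as a function (threshold constants transcribed from the list above; the action names are illustrative):

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B| over dependency sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def jaccard_decision(a: set[str], b: set[str]) -> str:
    """Route a candidate pair by dependency-set similarity."""
    j = jaccard(a, b)
    if j > 0.80:
        return "auto-merge"          # written with confidence 0.85
    if j >= 0.50:
        return "verification-queue"  # ambiguous: human review
    return "new-entity"
```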
Step 4 — Confidence routing:
| Resolution Method | Confidence | Action |
|---|---|---|
| Exact alias match | 0.95 | Auto-merge, no review |
| Normalization match | 0.90 | Auto-merge, no review |
| Jaccard > 0.80 | 0.85 | Auto-merge, log for audit |
| Jaccard 0.50–0.80 | 0.65 | Write as Unverified, route to Queue |
| No match | 1.0 (new entity) | Create new canonical node |
Nightly Resolution Pass (Celery beat, 2:15 AM)¶
- Collect nodes added/modified in last 24 hours
- Run normalization + Jaccard against existing canonical nodes
- Auto-merge high-confidence matches
- Route ambiguous candidates to Verification Queue
- Log all decisions to audit table
LLM Upgrade Path (Phase 2+)¶
- Add Step 1.5: bge-m3 embedding similarity (cosine > 0.92 → auto-merge)
- Add Step 2.5: Dense resolve-lora classification for Jaccard 0.50–0.80 range
- Existing normalization and alias registry remain as fast-path shortcuts
6. Confidence Scoring & Verification Queue¶
Source Trust Weights¶
| Source Type | Confidence Weight | Rationale |
|---|---|---|
| Code analysis (AST) | 0.95 | Deterministic, source of truth |
| SSH runtime inspection | 0.90 | Direct observation |
| CI/CD data | 0.90 | Automated, reliable |
| Infrastructure (Terraform, K8s) | 0.85 | Declarative, occasionally stale |
| GitHub Projects v2 | 0.80 | Structured but user-maintained |
| Documentation (Confluence, Pages) | 0.70 | Human-authored, may be outdated |
| CODEOWNERS ownership | 0.70 | Heuristic |
| Jira ticket components | 0.70 | Loosely structured |
| Entity resolution (Jaccard) | 0.65–0.85 | Varies by match strength |
Confidence Thresholds¶
| Range | verification_status | Action |
|---|---|---|
| ≥ 0.90 | `Verified` | Auto-accepted |
| 0.60–0.89 | `Unverified` | Written, queued for periodic review |
| 0.50–0.59 | `Unverified` | Written, immediately routed to Queue |
| < 0.50 | Rejected | Not written, logged to audit |
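The threshold table as a routing function — a direct transcription; the returned action names are illustrative:

```python
def route_by_confidence(confidence: float) -> tuple[str, str]:
    """Map a computed confidence to (verification_status, action)."""
    if confidence >= 0.90:
        return ("Verified", "auto-accept")
    if confidence >= 0.60:
        return ("Unverified", "write-and-schedule-review")
    if confidence >= 0.50:
        return ("Unverified", "write-and-queue-now")
    return ("Rejected", "audit-log-only")  # never written to the graph
```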
Verification Queue Wiring¶
Queue item sources:
- Entity resolution ambiguous matches
- Cross-source conflicts
- Ownership disputes
- New entities from low-confidence sources
Queue actions:
| Action | Graph Effect |
|---|---|
| Accept | verification_status = Verified, confidence → 0.95 |
| Reject | verification_status = Deprecated, log rejection |
| Edit | Update properties inline, then Accept |
| Escalate | Route to Architect role |
Escalation routing:
- Items ≥ 0.60 → service owner (via OWNS edge)
- Items < 0.60 → Architect role
- Target: 10–15% human review rate
Node/Edge Confidence Schema (Neo4j)¶
Every node and edge carries:
- confidence: float (0.0–1.0)
- source: string (connector name)
- extraction_timestamp: datetime
- last_verification_timestamp: datetime
- verification_status: enum (Verified / Unverified / Disputed / Stale / Deprecated)
Staleness rules (Celery beat):
- Documentation: Stale after 30 days
- Infrastructure: Stale after 7 days
- SSH-verified runtime: Stale after 1 day
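The staleness sweep reduces to a pure function over these thresholds. A sketch — the `SOURCE_TTL_DAYS` keys are illustrative names, not schema values:

```python
from datetime import datetime, timedelta, timezone

# Days until a node of each source class is marked Stale (from the rules above)
SOURCE_TTL_DAYS = {"documentation": 30, "infrastructure": 7, "ssh_runtime": 1}

def is_stale(source_class: str, last_verified: datetime, now: datetime) -> bool:
    """True if the node's last verification is older than its class TTL."""
    ttl = SOURCE_TTL_DAYS[source_class]
    return now - last_verified > timedelta(days=ttl)
```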
7. Graph Writer & Neo4j Schema¶
Graph Writer Module¶
Central module all connectors write through:
class GraphWriter:
async def apply_delta(self, delta: GraphDelta) -> WriteResult:
# 1. Entity resolution on all node names
# 2. Apply confidence scores from source weights
# 3. MERGE nodes and edges in a single Neo4j transaction
# 4. Route low-confidence items to Verification Queue
# 5. Publish substrate.graph.node_written / edge_written to NATS
# 6. Publish substrate.cache.invalidate
# 7. Write audit log entry to PostgreSQL
GraphDelta Schema¶
from dataclasses import dataclass

@dataclass
class NodeDelta:
label: str # "Service", "Function", "InfraResource", etc.
properties: dict
merge_key: str # Property to MERGE on
source: str
confidence: float
@dataclass
class EdgeDelta:
rel_type: str # "DEPENDS_ON", "CALLS", "OWNS", etc.
source_label: str
source_key: dict
target_label: str
target_key: dict
properties: dict
source: str
confidence: float
@dataclass
class GraphDelta:
nodes: list[NodeDelta]
edges: list[EdgeDelta]
source_connector: str
event_id: str # For idempotency
Node Labels¶
| Label | Source | Merge Key |
|---|---|---|
| `Service` | All connectors | `name` |
| `Function` | GitHub/Git AST | `signature` |
| `Module` | GitHub/Git AST | `name` (qualified) |
| `InfraResource` | Terraform, K8s | `resource_id` |
| `Developer` | SCIM, GitHub, Git | `github_handle` |
| `Team` | SCIM | `name` |
| `DecisionNode` | Confluence, GitHub | `id` |
| `FailurePattern` | Confluence | `id` |
| `ExceptionNode` | Governance | `id` |
| `SprintNode` | GitHub Projects, Jira | `sprint_id` |
| `IntentAssertion` | GitHub Projects, Jira | `linked_ticket` |
| `Community` | Manual | `id` |
| `Policy` | OPA/PostgreSQL | `policy_id` |
Edge Types¶
| Relationship | From → To | Source |
|---|---|---|
| `CALLS` | Function → Function | AST |
| `IMPORTS` | Module → Module | AST |
| `HOSTS` | InfraResource → Service | Terraform, K8s |
| `ACTUALLY_CALLS` | Service → Service | SSH, K8s |
| `DEPENDS_ON` | Service → Service | Package manifests |
| `OWNS` | Developer/Team → Service | CODEOWNERS, SCIM |
| `MEMBER_OF` | Developer → Team | SCIM |
| `CHILD_OF` | Team → Team | SCIM |
| `WHY` | DecisionNode → Service/Policy | Confluence |
| `CAUSED_BY` | FailurePattern → Service | Confluence |
| `PREVENTED_BY` | Service → Policy | Governance |
| `CONTAINS` | Community → Service | Manual |
New Constraints and Indexes¶
CREATE CONSTRAINT function_sig IF NOT EXISTS FOR (f:Function) REQUIRE f.signature IS UNIQUE;
CREATE CONSTRAINT module_name IF NOT EXISTS FOR (m:Module) REQUIRE m.name IS UNIQUE;
CREATE CONSTRAINT infra_resource_id IF NOT EXISTS FOR (r:InfraResource) REQUIRE r.resource_id IS UNIQUE;
CREATE CONSTRAINT sprint_id IF NOT EXISTS FOR (s:SprintNode) REQUIRE s.sprint_id IS UNIQUE;
CREATE CONSTRAINT intent_ticket IF NOT EXISTS FOR (i:IntentAssertion) REQUIRE i.linked_ticket IS UNIQUE;
CREATE CONSTRAINT exception_id IF NOT EXISTS FOR (e:ExceptionNode) REQUIRE e.id IS UNIQUE;
CREATE INDEX function_file IF NOT EXISTS FOR (f:Function) ON (f.file);
CREATE INDEX infra_type IF NOT EXISTS FOR (r:InfraResource) ON (r.type);
CREATE INDEX infra_provider IF NOT EXISTS FOR (r:InfraResource) ON (r.provider);
CREATE INDEX developer_keycloak IF NOT EXISTS FOR (d:Developer) ON (d.keycloak_id);
CREATE INDEX decision_source IF NOT EXISTS FOR (d:DecisionNode) ON (d.source_url);
CREATE INDEX sprint_dates IF NOT EXISTS FOR (s:SprintNode) ON (s.start_date);
CREATE INDEX node_source IF NOT EXISTS FOR (s:Service) ON (s.source);
CREATE INDEX node_confidence IF NOT EXISTS FOR (s:Service) ON (s.confidence);
CREATE INDEX node_verification IF NOT EXISTS FOR (s:Service) ON (s.verification_status);
Graph Algorithm Jobs (Celery beat, 3:00 AM)¶
PageRank:
CALL gds.pageRank.write('service-dependency-graph', {
writeProperty: 'page_rank', maxIterations: 20, dampingFactor: 0.85
})
Betweenness centrality:
CALL gds.betweenness.write('service-dependency-graph', {
writeProperty: 'betweenness'
})
Weakly connected components:
CALL gds.wcc.write('service-dependency-graph', {
writeProperty: 'component_id'
})
Cycle detection:
- DEPENDS_ON subgraph → NetworkX DiGraph → nx.simple_cycles(G)
- Run synchronously on PR ingestion (before governance evaluation)
- Cycles written as policy violation candidates
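The spec uses NetworkX's `simple_cycles` for this; a dependency-free sketch of the same check over a DEPENDS_ON edge list (iterative DFS with white/gray/black coloring, returning only whether any cycle exists rather than enumerating them):

```python
from collections import defaultdict

def has_cycle(edges: list[tuple[str, str]]) -> bool:
    """True if the directed DEPENDS_ON graph contains a cycle."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)  # defaults to WHITE
    for start in list(graph):
        if color[start] != WHITE:
            continue
        stack = [(start, iter(graph[start]))]
        color[start] = GRAY
        while stack:
            node, it = stack[-1]
            advanced = False
            for nxt in it:
                if color[nxt] == GRAY:
                    return True  # back edge into the current DFS path -> cycle
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(graph[nxt])))
                    advanced = True
                    break
            if not advanced:
                color[node] = BLACK  # all descendants explored
                stack.pop()
    return False
```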
8. AST Parser (Rust CLI)¶
Architecture¶
Standalone Rust binary invoked by ingestion workers:
ingestion-worker receives push/PR event
↓
clones/fetches repo to temp directory
↓
substrate-ast-cli parse --repo /tmp/repo --output json
↓
tree-sitter parses each file, stack-graphs resolves cross-file refs
↓
JSON output: nodes + edges
↓
ingestion-worker builds GraphDelta, sends to graph_writer
Technology¶
- tree-sitter — Incremental parsing, grammars for Python, TypeScript, Go, Java, Rust, C#
- stack-graphs — GitHub's cross-file name resolution library
Supported Languages (Phase 1)¶
| Language | Grammar | stack-graphs | Priority |
|---|---|---|---|
| Python | `tree-sitter-python` | Yes | High |
| TypeScript/JavaScript | `tree-sitter-typescript` | Yes | High |
| Go | `tree-sitter-go` | Yes | High |
| Java | `tree-sitter-java` | Yes | Medium |
| Rust | `tree-sitter-rust` | Yes | Medium |
| C# | `tree-sitter-c-sharp` | Partial | Low |
CLI Output Format¶
{
"nodes": [
{
"label": "Function",
"properties": {
"signature": "app.services.payment.process_payment",
"file": "src/services/payment.py",
"line": 42,
"end_line": 87,
"hash": "sha256:abc123...",
"language": "python",
"visibility": "public"
}
},
{
"label": "Module",
"properties": {
"name": "app.services.payment",
"file": "src/services/payment.py",
"language": "python"
}
}
],
"edges": [
{
"rel_type": "CALLS",
"source": "app.services.payment.process_payment",
"target": "app.services.order.create_order",
"properties": { "static": true }
},
{
"rel_type": "IMPORTS",
"source": "app.services.payment",
"target": "app.services.order",
"properties": { "static": true, "dynamic": false }
}
]
}
Service Detection Heuristics¶
| Signal | Confidence | Example |
|---|---|---|
| Dockerfile present | 0.95 | services/payment/Dockerfile → PaymentService |
| docker-compose service entry | 0.95 | services.payment in compose file |
| Package manifest at root | 0.90 | services/payment/package.json |
| K8s deployment YAML | 0.85 | k8s/payment-deployment.yaml |
| Directory with `main.*` | 0.80 | services/payment/main.py |
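The heuristics table as a scorer over a repo file listing. A sketch under stated assumptions: the exact filename patterns are mine, and the compose signal is approximated by file presence rather than parsing `services.*` entries as the table actually specifies:

```python
def service_confidence(paths: list[str]) -> float:
    """Return the strongest service-detection signal present in a file listing."""
    best = 0.0
    for p in paths:
        name = p.rsplit("/", 1)[-1]
        if name == "Dockerfile":
            best = max(best, 0.95)
        elif name in ("docker-compose.yml", "docker-compose.yaml"):
            best = max(best, 0.95)  # approximation: table keys on service entries
        elif name in ("package.json", "requirements.txt", "go.mod"):
            best = max(best, 0.90)
        elif name.endswith("-deployment.yaml"):
            best = max(best, 0.85)
        elif name.startswith("main."):
            best = max(best, 0.80)
    return best
```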
Build & Distribution¶
- Static binary: `cargo build --release`
- Added to backend Docker image at `/usr/local/bin/substrate-ast-cli`
- Separate `Cargo.toml` at `tools/ast-cli/`
Incremental Parsing¶
On PR events, only changed files are parsed:
substrate-ast-cli parse --repo /tmp/repo --files src/payment.py,src/order.py --output json
9. OPA Governance Foundation¶
Deployment¶
OPA server in Docker Compose:
opa:
image: openpolicyagent/opa:latest
command: ["run", "--server", "--bundle", "/policies"]
ports: ["8181:8181"]
volumes: ["./backend/policies:/policies"]
Evaluation Pipeline¶
Graph write event (substrate.graph.node_written)
↓
Governance worker picks up event
↓
Fetches affected subgraph from Neo4j
↓
Serializes subgraph as JSON context
↓
POST /v1/data/substrate/violations → OPA
↓
Returns violations with rule IDs, severity, affected nodes
↓
Violations written to PostgreSQL policy_violations
↓
Publishes substrate.governance.violation_raised to NATS
Starter Policy Packs (3 of 9)¶
1. Domain Boundary (GOV-BOUNDARY)
package substrate.boundary
violation[msg] {
edge := input.edges[_]
edge.type == "ACTUALLY_CALLS"
source := input.nodes[edge.source]
target := input.nodes[edge.target]
source.domain != target.domain
not edge.via_gateway
msg := sprintf("%s (%s) calls %s (%s) directly without gateway",
[source.name, source.domain, target.name, target.domain])
}
2. Circular Dependency (GOV-CIRCULAR)
package substrate.circular
violation[msg] {
cycle := input.cycles[_]
msg := sprintf("Circular dependency detected: %s", [concat(" → ", cycle)])
}
Cycles pre-computed by NetworkX and passed in OPA input context.
3. Ownership Completeness (GOV-OWNERSHIP)
package substrate.ownership
violation[msg] {
service := input.services[_]
service.verification_status != "Deprecated"
count(service.owners) == 0
msg := sprintf("%s has no active owner", [service.name])
}
Policy Packs Deferred to Phase 2¶
| Pack | ID | Reason |
|---|---|---|
| SOLID Single Responsibility | `GOV-SRP` | Needs efferent coupling metrics |
| SOLID Open/Closed | `GOV-OCP` | Needs extension pattern detection |
| SOLID Dependency Inversion | `GOV-DIP` | Needs abstract vs concrete classification |
| TDD Coverage | `GOV-TDD` | Needs CI test coverage data |
| API-First | `GOV-API` | Needs OpenAPI spec detection |
| License Compatibility | `GOV-LICENSE` | Needs dependency license resolution |
No GitHub Checks API in Phase 1¶
Violations are recorded and visible in the UI. GitHub Checks API integration (PR blocking, violation comments) is Phase 2.
10. SCIM, RBAC & Identity¶
Custom Keycloak Dockerfile¶
FROM quay.io/keycloak/keycloak:latest
ADD --chown=keycloak:keycloak \
https://github.com/Captain-P-Goldfish/scim-for-keycloak/releases/latest/download/scim-for-keycloak.jar \
/opt/keycloak/providers/
RUN /opt/keycloak/bin/kc.sh build
SCIM Endpoints¶
| Method | Path | Purpose |
|---|---|---|
| `POST` | `/scim/v2/Users` | Create Developer node |
| `PATCH` | `/scim/v2/Users/{id}` | Update/deactivate Developer |
| `POST` | `/scim/v2/Groups` | Create Team node |
| `PATCH` | `/scim/v2/Groups/{id}` | Add/remove MEMBER_OF edges |
| `DELETE` | `/scim/v2/Groups/{id}` | Mark Team as Deprecated |
Onboarding (POST /Users):
1. Create Developer node (active: true)
2. Create MEMBER_OF edges per group
3. Create Team if missing (verification_status = Unverified)
4. Publish substrate.identity.user_onboarded
Offboarding (PATCH /Users active=false):
1. Set Developer.active = false
2. Query orphaned services:
MATCH (d:Developer {keycloak_id: $kid, active: false})-[:OWNS]->(s:Service)
WHERE NOT EXISTS {
MATCH (other:Developer {active: true})-[:OWNS]->(s)
}
AND NOT EXISTS {
MATCH (t:Team)-[:OWNS]->(s)
WHERE EXISTS { MATCH (m:Developer {active: true})-[:MEMBER_OF]->(t) }
}
RETURN s.name AS orphaned_service
3. Publish substrate.identity.user_offboarded
RBAC Roles¶
| Role | Capabilities |
|---|---|
| `admin` | Full access, user management, system configuration |
| `architect` | Full graph, policy authoring, simulation, exception approval |
| `developer` | Own service graph, PR details, intent mismatch alerts |
| `viewer` | Read-only graphs, dashboards, reports |
| `service-account` | API-only, webhook delivery, no UI |
Role Enforcement¶
from fastapi import Depends, HTTPException

def require_role(*roles: str):
def checker(user: UserInfo = Depends(get_current_user)):
if not any(r in user.realm_roles for r in roles):
raise HTTPException(403, "Insufficient role")
return user
return Depends(checker)
Ownership-Scoped Filtering¶
Developer role queries filtered to owned services:
async def get_owned_services(user: UserInfo, neo4j_session) -> list[str]:
result = await neo4j_session.run("""
MATCH (d:Developer {keycloak_id: $kid})
OPTIONAL MATCH (d)-[:OWNS]->(s1:Service)
OPTIONAL MATCH (d)-[:MEMBER_OF]->(t:Team)-[:OWNS]->(s2:Service)
RETURN collect(DISTINCT s1.name) + collect(DISTINCT s2.name) AS services
""", kid=user.sub)
record = await result.single()
return record["services"] if record else []
Route-Level Requirements¶
| Endpoint | Required Role |
|---|---|
| `GET /dashboard`, communities, graph | Any authenticated |
| `GET /policies` | Any authenticated |
| `POST /policies`, `PUT /policies/{id}` | architect, admin |
| `GET /queue` | developer, architect, admin |
| `PATCH /queue/{id}` (Accept/Reject) | architect, admin |
| `PATCH /queue/{id}` (Edit) | developer (own), architect, admin |
| `POST /memory` | developer, architect, admin |
| `POST /simulation/run` | architect, admin |
| SCIM endpoints | service-account only |
| Webhook receivers | HMAC-verified, no JWT |
Group-to-Role Mapping¶
CREATE TABLE role_mappings (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
keycloak_group TEXT NOT NULL UNIQUE,
substrate_role TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT now()
);
11. UI Changes¶
New Dependencies¶
| Package | Purpose |
|---|---|
elkjs |
Sugiyama layout algorithm |
web-worker |
Run layout off main thread |
Graph Rendering¶
Layout engines (user-selectable):
- Sugiyama (default): Dependencies flow top-to-bottom via ELK.js. Computed in a web worker.
- Force-directed: Nodes cluster by relationship density. React Flow built-in.
Semantic Zoom¶
Detected from viewport scale, not explicit user action. 300ms ease-in-out transitions.
| Scale | Level | Rendered |
|---|---|---|
| < 0.4 | Far (Domain) | Domain super-nodes. Aggregate health badge, violation count, tension score. |
| 0.4–0.8 | Mid (Service) | Service nodes with tension ring, violation badge, owner label. PageRank-weighted sizing. |
| > 0.8 | Close (Component) | Function/Module nodes inside service boundaries. File path, line number, call counts. |
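The scale-to-level mapping written out as a function. Boundary handling at exactly 0.4 and 0.8 is an assumption — the table's ranges leave the endpoints ambiguous (this logic would live in the TypeScript viewport handler; Python is used here to match the rest of the spec's examples):

```python
def zoom_level(scale: float) -> str:
    """Map viewport scale to the semantic zoom level to render."""
    if scale < 0.4:
        return "domain"     # domain super-nodes with aggregate badges
    if scale <= 0.8:
        return "service"    # service nodes, PageRank-weighted sizing
    return "component"      # Function/Module nodes inside service boundaries
```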
Graph Toolbar¶
[Layout: Sugiyama ▾] [Filter ▾] [Minimap ☐] [Fit View]
Filter sidebar:
- Domain / community
- Violation type (boundary, circular, ownership)
- Confidence threshold (slider, default 0.60)
- Owner / team
- Verification status
- Time window (7d / 14d / 30d)
Verification Queue UI Updates¶
- Items show source connector, confidence badge (red < 0.60, yellow 0.60–0.79, green ≥ 0.80)
- Accept/Reject/Edit/Escalate call real API endpoints
- Items grouped by `resolution_type`
Dashboard Updates¶
All metrics computed from real data:
- Violations from policy_violations WHERE resolved = false
- Tension from graph algorithm output
- Memory entries from real ingestion
- Memory gaps from services with no WHY edges
RBAC in UI¶
- Role from JWT claims
- Nav items conditionally rendered
- Policy editor: read-only for Developer/Viewer
- Simulation Run: hidden for Developer/Viewer
- Queue actions: role-appropriate visibility
Pages NOT Changed¶
| Page | Reason |
|---|---|
| Search | Canned results until GraphRAG (Phase 3) |
| Simulation rendering | Seeded results until simulation engine (Phase 4) |
12. Celery Beat Schedule¶
| Job | Schedule | Description |
|---|---|---|
| Nightly connector poll | 2:00 AM daily | Full poll on all polling connectors |
| Entity resolution pass | 2:15 AM daily | Normalization + Jaccard on last 24h nodes |
| PageRank update | 3:00 AM daily | GDS PageRank on service graph |
| Betweenness centrality | 3:00 AM daily | GDS betweenness on service graph |
| Connected components | 3:00 AM daily | GDS WCC for orphan detection |
| Staleness check | 4:00 AM daily | Mark nodes Stale per threshold rules |
| K8s API poll | Every 15 min | K8s Watch API resource changes |
| SSH Runtime inspection | Every 15 min | Vault cert → SSH → inspection → diff |
| GitHub Pages poll | Every 6 hours | Pages build timestamps vs code commits |
| GitHub Projects poll | Every 1 hour | GraphQL project item changes |
Distributed Locking¶
async def acquire_lock(redis, event_id: str, ttl: int = 300) -> bool:
return await redis.set(f"lock:ingestion:{event_id}", "1", nx=True, ex=ttl)
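The `SET NX EX` contract can be exercised without Redis. An in-memory stand-in (illustrative only — `FakeRedis` and the synchronous `acquire_lock` variant are not the production code path) showing that only the first worker wins a given event's lock until the TTL lapses:

```python
import time

class FakeRedis:
    """Just enough of redis.set(nx=True, ex=ttl) to demonstrate the locking contract."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        current = self._store.get(key)
        if current is not None and current[1] > now and nx:
            return None  # key exists and is live: NX refuses to overwrite
        expires = now + ex if ex else float("inf")
        self._store[key] = (value, expires)
        return True

def acquire_lock(redis, event_id: str, ttl: int = 300):
    # Mirrors the async version above, synchronously, for demonstration
    return redis.set(f"lock:ingestion:{event_id}", "1", nx=True, ex=ttl)
```

Because the lock key embeds `event_id`, a redelivered NATS message hits an existing live key and is skipped, while distinct events proceed in parallel.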
13. Infrastructure & Docker Compose¶
New Services¶
| Service | Image | Purpose |
|---|---|---|
| `opa` | `openpolicyagent/opa:latest` | Policy evaluation |
| `keycloak` | Custom Dockerfile | Identity + SCIM 2.0 |
| `celery-worker` | Backend image | Ingestion processing |
| `celery-beat` | Backend image | Job scheduling |
| `celery-governance` | Backend image | Policy evaluation workers |
| `vault` | `hashicorp/vault:latest` | SSH CA |
Startup Order¶
1. PostgreSQL, Neo4j, Redis, NATS (databases)
2. Flyway migrations (schema)
3. Keycloak (custom, SCIM plugin) (identity)
4. Vault (SSH CA)
5. OPA (policy engine)
6. Backend (FastAPI) (API layer)
7. Celery beat (scheduler)
8. Celery workers (ingestion, governance) (async processing)
9. UI (Nginx) (frontend)
Backend Dockerfile Updates¶
Multi-stage build:
1. ast-builder stage — Compiles Rust CLI from tools/ast-cli/
2. python stage — Backend app with CLI binary, git, SSH client
New Environment Variables¶
NATS_URL=nats://nats:4222
OPA_URL=http://opa:8181
CELERY_BROKER_URL=redis://redis:6379/1
CELERY_RESULT_BACKEND=redis://redis:6379/2
VAULT_ADDR=http://vault:8200
VAULT_TOKEN=substrate-dev-token
VAULT_SSH_ROLE=substrate-ssh
GITHUB_APP_ID=
GITHUB_APP_PRIVATE_KEY_PATH=
GITHUB_WEBHOOK_SECRET=
TERRAFORM_STATE_PATHS=
KUBECONFIG_PATH=
SSH_TARGET_HOSTS=
CONFLUENCE_URL=
CONFLUENCE_API_TOKEN=
CONFLUENCE_WEBHOOK_SECRET=
JIRA_URL=
JIRA_API_TOKEN=
JIRA_WEBHOOK_SECRET=
Project Directory Structure (Phase 1 additions)¶
substrate/
backend/
app/
modules/
ingestion/ # Webhook receivers, NATS publishers
connectors/ # 8 connector implementations
graph_writer/ # Entity resolution, confidence, MERGE logic
governance/ # OPA client, evaluation pipeline
identity/ # SCIM endpoint, Developer/Team mutations
rbac/ # Role enforcement, ownership filtering
scheduler/ # Celery app, beat config, tasks
ast_parser/ # Rust CLI wrapper
core/
nats.py # NATS client helpers
policies/ # OPA Rego packs
boundary/policy.rego
circular/policy.rego
ownership/policy.rego
db/
postgres/V4__phase1_schema.sql
neo4j/V2__phase1_constraints.cypher
tools/
ast-cli/ # Rust AST parser
Cargo.toml
src/main.rs
keycloak/
Dockerfile # Custom Keycloak with SCIM
ui/
src/
components/graph/
ElkLayout.ts
SemanticZoom.tsx
LayoutSelector.tsx
FilterSidebar.tsx
14. Database Migrations¶
PostgreSQL: V4__phase1_schema.sql¶
CREATE TABLE entity_aliases (
raw_name TEXT PRIMARY KEY,
canonical_id UUID NOT NULL,
source TEXT NOT NULL,
confidence FLOAT NOT NULL,
resolved_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE ingestion_events (
event_id TEXT PRIMARY KEY,
source_connector TEXT NOT NULL,
event_type TEXT NOT NULL,
payload_hash TEXT NOT NULL,
status TEXT DEFAULT 'pending',
nodes_written INT DEFAULT 0,
edges_written INT DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT now(),
completed_at TIMESTAMPTZ
);
CREATE TABLE audit_log (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
timestamp TIMESTAMPTZ DEFAULT now(),
actor TEXT NOT NULL,
action TEXT NOT NULL,
resource_type TEXT,
resource_id TEXT,
input_hash TEXT,
output_hash TEXT,
confidence FLOAT,
detail JSONB
);
CREATE TABLE webhook_configs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
connector_type TEXT NOT NULL,
endpoint_path TEXT NOT NULL,
hmac_secret TEXT NOT NULL,
active BOOLEAN DEFAULT true,
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE role_mappings (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
keycloak_group TEXT NOT NULL UNIQUE,
substrate_role TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE connector_state (
connector_type TEXT PRIMARY KEY,
last_poll_at TIMESTAMPTZ,
last_success_at TIMESTAMPTZ,
cursor_state JSONB,
error_count INT DEFAULT 0,
last_error TEXT,
active BOOLEAN DEFAULT true
);
ALTER TABLE queue_items ADD COLUMN IF NOT EXISTS source_connector TEXT;
ALTER TABLE queue_items ADD COLUMN IF NOT EXISTS candidate_data JSONB;
ALTER TABLE queue_items ADD COLUMN IF NOT EXISTS conflicting_data JSONB;
ALTER TABLE queue_items ADD COLUMN IF NOT EXISTS neo4j_node_id TEXT;
ALTER TABLE queue_items ADD COLUMN IF NOT EXISTS resolution_type TEXT;
Neo4j: V2__phase1_constraints.cypher¶
See Section 7 for full constraint and index definitions.
Seed Data Transition¶
- Seed data nodes have `source: "seed"`, `confidence: 0.80`
- Connector data arrives with higher confidence and overwrites
- After all connectors live, cleanup job removes unmatched seed-only nodes
- No breaking migration — Phase 0 UI continues working throughout
pyproject.toml Additions¶
"celery[redis]>=5.3"
"nats-py>=2.7"
"networkx>=3.0"
"aiohttp>=3.9"
"paramiko>=3.4"
"hvac>=2.0"
"kubernetes-asyncio>=30"
15. Testing Strategy¶
Test Layers¶
| Layer | What | How | Target |
|---|---|---|---|
| Route tests | Status codes, shapes, auth, RBAC | httpx, mocked services | ~60 |
| Unit — connectors | Parsing, events, GraphDelta | Mock APIs, fixture payloads | ~40 |
| Unit — entity resolution | Normalization, alias, Jaccard | Pure functions | ~15 |
| Unit — graph writer | MERGE patterns, confidence, validation | Mock Neo4j | ~10 |
| Unit — governance | OPA input, violation parsing | Mock OPA HTTP | ~10 |
| Unit — RBAC | Role checks, ownership filtering | Mock user, mock Neo4j | ~10 |
| Unit — SCIM | Event parsing, graph mutations | Mock Neo4j | ~10 |
| Unit — AST CLI | JSON parsing, service detection | Fixture repos | ~15 |
Total: ~170 tests
Connector Fixtures¶
tests/fixtures/{connector}/ — sample payloads per connector
tests/fixtures/ast/ — small fixture repos for CLI testing
Not Testing in Phase 1¶
- Integration tests with testcontainers
- End-to-end webhook → Neo4j flows
- OPA with real server
- Load/performance testing
16. Explicitly Out of Scope¶
| Item | Target Phase |
|---|---|
| LLM entity resolution (Dense resolve-lora) | Phase 2+ |
| bge-m3 embeddings / pgvector semantic search | Phase 2+ |
| GitHub Checks API (PR blocking) | Phase 2 |
| Violation PR comments | Phase 2 |
| explain-lora violation explanations | Phase 2 |
| Leiden community detection | Phase 2 |
| GraphRAG / NL search | Phase 3 |
| Simulation engine (Neo4j sandboxes) | Phase 4 |
| Agent orchestration (Fix PR) | Phase 4 |
| Slack connector | v1.1 |
| Datadog connector | v1.1 |
| Self-hosted GitHub Enterprise | Later |
| Bitbucket / GitLab connectors | Not in blueprint |
| ServiceNow connector | Not in MVP |
| Integration tests with testcontainers | Phase 2 |
| Load/performance testing | Phase 2+ |
| WebGL 3D graph (Sigma.js) | v2 |
| Multi-team graph isolation | v2 |