Phase 1: Deterministic Ingestion & Graph Population — Design Spec¶
Date: 2026-03-26 Status: Approved Scope: Build the ingestion pipeline, 8 data connectors, entity resolution, confidence scoring, OPA governance foundation, SCIM/RBAC, and dynamic graph rendering — all without LLM dependency.
1. Decisions Summary¶
| Decision | Choice | Alternatives Considered |
|---|---|---|
| LLM dependency | Deterministic-first — no LLM required | Blueprint-faithful (requires DGX Spark), Connector-focused (narrower scope) |
| Connector scope | All 8 sources in 3 batches | Fewer connectors first; deferred due to all sources being available |
| Build order | Infrastructure-first | Vertical-slice per connector (visible progress faster), Hybrid (minimal infra then first connector). Chose infra-first for stable foundation. |
| OPA governance | Include foundation — 3 starter packs | Defer entirely to Phase 2. Included so policies validate against real graph data early. |
| RBAC | Full — role-based + ownership scoping | Route protection only (simpler), Defer entirely. Chose full because SCIM is needed anyway. |
| SCIM | Custom Keycloak Dockerfile with SCIM plugin | Manual user management. SCIM chosen for automated identity lifecycle. |
| Confidence pipeline | Full — source weights, thresholds, queue routing | Scoring only without routing. Chose full because Queue module already exists. |
| Graph layout | ELK.js Sugiyama + force-directed, user-selectable | Sugiyama only, Force-directed only. Both gives flexibility. |
| Semantic zoom | All 3 levels (Domain → Service → Component) | Domain + Service only. Chose 3 levels because AST CLI delivers Function/Module nodes. |
| Event bus | Full NATS event-driven pipeline | Direct Celery dispatch (simpler), Hybrid. Chose full NATS for replay and decoupling. |
| Testing | Route + unit tests, mocked services | Integration tests with testcontainers (deferred to Phase 2). |
| GitHub target | GitHub.com only | Self-hosted GitHub Enterprise deferred. |
2. Architecture Overview¶
Phase 1 adds 4 subsystems to the Phase 0 backend:
Webhook Receivers (FastAPI routes)
↓ publish
NATS JetStream (event bus)
↓ consume
Celery Workers (async processing)
↓ write
Graph Writers (Neo4j + PostgreSQL)
↓ emit
NATS completion events
↓ trigger
Downstream consumers (cache invalidation, governance, confidence scoring)
New Backend Modules¶
| Module | Purpose |
|---|---|
| `ingestion/` | Connector framework, webhook receivers, NATS publishers |
| `connectors/` | 8 connector implementations (GitHub, Git, Terraform, K8s, SSH, Projects v2, Confluence, Jira) |
| `graph_writer/` | Canonical graph write logic — entity resolution, confidence scoring, Neo4j MERGE patterns |
| `governance/` | OPA client, policy evaluation pipeline, violation writer |
| `identity/` | SCIM endpoint, Developer/Team graph mutations, key-person risk |
| `rbac/` | Role-based access control middleware, ownership-scoped filtering |
| `scheduler/` | Celery app, beat config, task definitions |
| `ast_parser/` | Rust CLI wrapper for tree-sitter + stack-graphs AST parsing |
Existing Modules Updated¶
- `graph/` — Dynamic layout (ELK.js data), 3-level semantic zoom queries
- `queue/` — Wired to real confidence-based routing instead of seed data
- `dashboard/` — Metrics from real computed data
- `community/` — Communities from graph structure (manual initially, Leiden deferred)
3. Ingestion Pipeline¶
Event Sources¶
| Source Type | Trigger | Example |
|---|---|---|
| Webhook receivers | HTTP POST from external service | GitHub push, Jira issue created |
| Pollers | Celery beat scheduled jobs | K8s API (15 min), GitHub Pages (6 hours) |
| SSH Runtime | Celery beat every 15 minutes | Vault ephemeral cert → SSH → inspection script |
| Git-only | Celery beat or on-demand | Clone/fetch repos, run AST parsing |
NATS Subject Hierarchy¶
substrate.ingestion.github.pr_opened
substrate.ingestion.github.push
substrate.ingestion.github.codeowners_changed
substrate.ingestion.terraform.state_updated
substrate.ingestion.k8s.resources_polled
substrate.ingestion.ssh.inspection_completed
substrate.ingestion.git.repo_parsed
substrate.ingestion.confluence.page_updated
substrate.ingestion.jira.issue_created
substrate.ingestion.jira.sprint_closed
substrate.ingestion.github_projects.item_updated
substrate.ingestion.github_pages.build_completed
substrate.governance.violation_raised
substrate.governance.policy_evaluated
substrate.identity.user_onboarded
substrate.identity.user_offboarded
substrate.cache.invalidate
substrate.graph.node_written
substrate.graph.edge_written
NATS JetStream Streams¶
| Stream | Subject Filter | Retention | Purpose |
|---|---|---|---|
| `INGESTION` | `substrate.ingestion.>` | 7 days | All connector events |
| `GOVERNANCE` | `substrate.governance.>` | 7 days | Policy evaluation results |
| `IDENTITY` | `substrate.identity.>` | 30 days | SCIM lifecycle events |
| `GRAPH` | `substrate.graph.>` | 24 hours | Node/edge write notifications |
| `CACHE` | `substrate.cache.>` | 1 hour | Cache invalidation signals |
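The stream filters above rely on NATS wildcard subjects, where `>` matches one or more trailing tokens. A minimal pure-Python sketch of that matching rule, for illustration only (the real routing happens inside JetStream):

```python
def subject_matches(filter_: str, subject: str) -> bool:
    """NATS-style subject matching: '*' matches exactly one token,
    '>' matches one or more trailing tokens."""
    f_tokens = filter_.split(".")
    s_tokens = subject.split(".")
    for i, ft in enumerate(f_tokens):
        if ft == ">":
            return len(s_tokens) > i  # '>' must cover at least one token
        if i >= len(s_tokens):
            return False
        if ft != "*" and ft != s_tokens[i]:
            return False
    return len(s_tokens) == len(f_tokens)
```

An event published to `substrate.ingestion.github.push` therefore lands in the `INGESTION` stream and no other.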
Celery Worker Pools¶
| Pool | Concurrency | Consumes From | Writes To |
|---|---|---|---|
| `ingestion-worker` | 4 | `INGESTION` stream | Neo4j, PostgreSQL |
| `governance-worker` | 2 | `GOVERNANCE` stream | PostgreSQL |
| `identity-worker` | 1 | `IDENTITY` stream | Neo4j |
Connector Base Class¶
from abc import ABC, abstractmethod

class BaseConnector(ABC):
source_name: str
confidence_weight: float
@abstractmethod
async def ingest(self, event: dict) -> list[GraphDelta]: ...
@abstractmethod
async def poll(self) -> list[GraphDelta]: ...
Idempotency¶
All graph writes use MERGE:
MERGE (s:Service {name: $name})
ON CREATE SET s += $props, s.created_at = timestamp()
ON MATCH SET s += $props, s.updated_at = timestamp()
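A query-builder sketch showing how a writer might render this MERGE pattern for any node label and merge key. This is illustrative, not the spec's GraphWriter: `render_merge` and the `$key`/`$props` parameter names are assumptions, and label/key interpolation is safe only because both come from trusted connector code, never user input:

```python
def render_merge(label: str, merge_key: str) -> str:
    """Render the idempotent MERGE pattern for one node label.
    Property values stay as driver parameters ($key, $props);
    only the trusted label and merge-key identifiers are interpolated."""
    return (
        f"MERGE (n:{label} {{{merge_key}: $key}})\n"
        "ON CREATE SET n += $props, n.created_at = timestamp()\n"
        "ON MATCH SET n += $props, n.updated_at = timestamp()"
    )
```

Replaying the same event re-runs the same MERGE and simply takes the `ON MATCH` branch, which is what makes the pipeline idempotent.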
4. Connector Specifications¶
Batch 1 — Foundation¶
GitHub Connector (repos + PRs + CODEOWNERS)¶
- Source: GitHub.com REST API v3 + Webhooks
- Confidence weight: 0.95
- Webhook events: `push`, `pull_request.opened`, `pull_request.synchronize`, `pull_request.merged`
- Produces:
- Service nodes — from repo structure, package manifests, Dockerfiles
- Function/Module nodes — from AST parsing via Rust CLI
- DEPENDS_ON edges — from package manifests (package.json, requirements.txt, go.mod)
- CALLS/IMPORTS edges — from AST analysis
- Developer nodes — from commit authors + CODEOWNERS
- OWNS edges — from CODEOWNERS (confidence 0.70)
- PR metadata — written to PostgreSQL
- HMAC verification: `X-Hub-Signature-256` header, SHA-256
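GitHub signs each webhook payload with HMAC-SHA-256 over the raw request body and sends the hex digest in `X-Hub-Signature-256` prefixed with `sha256=`. A receiver-side check (function name is illustrative; the constant-time comparison is the important part):

```python
import hashlib
import hmac

def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Validate the X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking where the comparison diverges (timing side channel)
    return hmac.compare_digest(expected, signature_header)
```

The body must be the exact bytes received — re-serializing parsed JSON before hashing will break verification.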
Git-Only Connector¶
- Source: Raw git clone/fetch, no GitHub API
- Confidence weight: 0.95
- Trigger: Celery beat or on-demand
- Produces: Same as GitHub (Service, Function, Module, edges) but from local/bare repos. No PR metadata, no webhooks, no CODEOWNERS.
- Use case: Air-gap environments, non-GitHub repos
Terraform Connector¶
- Source: Terraform state files + optional post-apply webhook
- Confidence weight: 0.85
- Trigger: File watcher + Celery poll + optional webhook
- Produces:
- InfraResource nodes — from `terraform show -json`
- HOSTS edges — InfraResource → Service (matched by name/tag conventions)
- Declared ports, regions, provider metadata
- State diff on each apply
Batch 2 — Enrichment¶
Kubernetes Connector¶
- Source: K8s API (Deployments, Services, Pods, Ingress, ConfigMaps)
- Confidence weight: 0.85
- Trigger: K8s Watch API (continuous) + Celery poll every 15 min
- Produces:
- Service nodes — from K8s Service/Deployment resources
- InfraResource nodes — from Pod/Node resources
- HOSTS edges — Pod → Service
- Reconciliation with Terraform state
- Running container image versions
SSH Runtime Connector¶
- Source: SSH to target hosts via Vault ephemeral certificates
- Confidence weight: 0.90
- Trigger: Celery beat every 15 min + on-demand
- Produces:
- Running processes, open ports, active network connections
- Observed vs declared diff
- Undeclared service detection → `substrate.governance.ssh_drift_detected` events
- Security: Vault SSH CA, 5-min Ed25519 cert TTL, ForceCommand, no agent forwarding
GitHub Projects v2 Connector¶
- Source: GitHub GraphQL API + `projects_v2_item` webhook
- Confidence weight: 0.80
- Trigger: Webhook + hourly GraphQL poll
- Produces:
- SprintNode — from project iterations
- IntentAssertion — from project items
- Sprint health updates
Batch 3 — Memory & Docs¶
Confluence Connector¶
- Source: Confluence REST API + webhooks (`page_created`, `page_updated`)
- Confidence weight: 0.70
- Trigger: Webhook + on-demand
- Produces:
- DecisionNode — from pages matching ADR patterns (deterministic title/label matching)
- FailurePattern — from pages matching post-mortem patterns
- WHY edges — linking decisions to services mentioned in content
- CAUSED_BY edges — linking failures to affected services
- Note: LLM semantic extraction (MoE Scout) deferred. Pages ingested via deterministic pattern matching.
GitHub Pages Connector¶
- Source: GitHub Pages build status API
- Confidence weight: 0.70
- Trigger: Celery poll every 6 hours
- Produces:
- Documentation staleness delta
- Doc coverage score updates on linked Service nodes
Jira Connector¶
- Source: Jira REST API + webhooks (`issue_created`, `sprint_closed`)
- Confidence weight: 0.70
- Trigger: Webhook + on-demand
- Produces:
- IntentAssertion nodes — from ticket descriptions
- SprintNode updates — on sprint close events
Options Not Selected (Out of Scope)¶
| Option | Reason Deferred |
|---|---|
| Slack connector | Blueprint v1.1; high noise (0.30 confidence), needs keyword triggers |
| Datadog connector | Blueprint v1.1; runtime alert signals, not core graph data |
| Self-hosted GitHub Enterprise | Separate connector variant; different API base URL + auth |
| Bitbucket / GitLab connectors | Not in blueprint; different API surface |
| ServiceNow connector | Enterprise ITSM; not in MVP scope |
| LLM-powered semantic extraction | Phase 2+; requires DGX Spark with MoE Scout / Dense extract-lora |
5. Entity Resolution (Deterministic)¶
Pipeline¶
Raw name from connector
↓
Step 1: Tokenization normalization
↓
Step 2: Alias registry lookup
↓
Step 3: Jaccard similarity on dependency sets
↓
Step 4: Confidence-based routing
Step 1 — Tokenization normalization:
- Strip prefixes/suffixes: srv-, -service, -svc, -api, -app
- Normalize separators: hyphens, underscores, dots → spaces
- CamelCase split: PaymentService → payment service
- Lowercase, sort tokens alphabetically
- Example: srv-payment, payment-service, PaymentService, payments → payment
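One reasonable implementation of Step 1 (the stopword token set and rule ordering are assumptions — the spec lists the rules but not their order; note that plural folding such as `payments` → `payment` would need a stemming rule beyond what is listed):

```python
import re

# Prefix/suffix tokens dropped during normalization (assumed set, per the rules above)
STOP_TOKENS = {"srv", "service", "svc", "api", "app"}

def normalize_name(raw: str) -> str:
    """Tokenization normalization: camelCase split, separator folding,
    stopword removal, lowercase, alphabetical token sort."""
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", raw)  # PaymentService -> Payment Service
    s = re.sub(r"[-_.]+", " ", s).lower()            # hyphens/underscores/dots -> spaces
    tokens = [t for t in s.split() if t not in STOP_TOKENS]
    return " ".join(sorted(tokens))
```

With this sketch, `srv-payment`, `payment-service`, and `PaymentService` all normalize to `payment`, so they hit the same canonical entity in later steps.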
Step 2 — Alias registry:
CREATE TABLE entity_aliases (
raw_name TEXT PRIMARY KEY,
canonical_id UUID NOT NULL REFERENCES services(id),
source TEXT NOT NULL,
confidence FLOAT NOT NULL,
resolved_at TIMESTAMPTZ DEFAULT now()
);
Populated automatically on first resolution, manually editable via API.
Step 3 — Jaccard similarity (fallback):
- J(A, B) = |A ∩ B| / |A ∪ B| on dependency sets
- Similarity > 0.80 → auto-merge (confidence 0.85)
- Similarity 0.50–0.80 → route to Verification Queue
- Similarity < 0.50 → treat as new entity
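The Jaccard fallback and its thresholds as a function (threshold constants transcribed from the list above; the action names are illustrative):

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B| over dependency sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def jaccard_decision(a: set[str], b: set[str]) -> str:
    """Route a candidate pair by dependency-set similarity."""
    j = jaccard(a, b)
    if j > 0.80:
        return "auto-merge"          # written with confidence 0.85
    if j >= 0.50:
        return "verification-queue"  # ambiguous: human review
    return "new-entity"
```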
Step 4 — Confidence routing:
| Resolution Method | Confidence | Action |
|---|---|---|
| Exact alias match | 0.95 | Auto-merge, no review |
| Normalization match | 0.90 | Auto-merge, no review |
| Jaccard > 0.80 | 0.85 | Auto-merge, log for audit |
| Jaccard 0.50–0.80 | 0.65 | Write as Unverified, route to Queue |
| No match | 1.0 (new entity) | Create new canonical node |
Nightly Resolution Pass (Celery beat, 2:15 AM)¶
- Collect nodes added/modified in last 24 hours
- Run normalization + Jaccard against existing canonical nodes
- Auto-merge high-confidence matches
- Route ambiguous candidates to Verification Queue
- Log all decisions to audit table
LLM Upgrade Path (Phase 2+)¶
- Add Step 1.5: bge-m3 embedding similarity (cosine > 0.92 → auto-merge)
- Add Step 2.5: Dense resolve-lora classification for Jaccard 0.50–0.80 range
- Existing normalization and alias registry remain as fast-path shortcuts
6. Confidence Scoring & Verification Queue¶
Source Trust Weights¶
| Source Type | Confidence Weight | Rationale |
|---|---|---|
| Code analysis (AST) | 0.95 | Deterministic, source of truth |
| SSH runtime inspection | 0.90 | Direct observation |
| CI/CD data | 0.90 | Automated, reliable |
| Infrastructure (Terraform, K8s) | 0.85 | Declarative, occasionally stale |
| GitHub Projects v2 | 0.80 | Structured but user-maintained |
| Documentation (Confluence, Pages) | 0.70 | Human-authored, may be outdated |
| CODEOWNERS ownership | 0.70 | Heuristic |
| Jira ticket components | 0.70 | Loosely structured |
| Entity resolution (Jaccard) | 0.65–0.85 | Varies by match strength |
Confidence Thresholds¶
| Range | verification_status | Action |
|---|---|---|
| ≥ 0.90 | `Verified` | Auto-accepted |
| 0.60–0.89 | `Unverified` | Written, queued for periodic review |
| 0.50–0.59 | `Unverified` | Written, immediately routed to Queue |
| < 0.50 | Rejected | Not written, logged to audit |
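The threshold table as a routing function — a direct transcription; the returned action names are illustrative:

```python
def route_by_confidence(confidence: float) -> tuple[str, str]:
    """Map a computed confidence to (verification_status, action)."""
    if confidence >= 0.90:
        return ("Verified", "auto-accept")
    if confidence >= 0.60:
        return ("Unverified", "write-and-schedule-review")
    if confidence >= 0.50:
        return ("Unverified", "write-and-queue-now")
    return ("Rejected", "audit-log-only")  # never written to the graph
```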
Verification Queue Wiring¶
Queue item sources:
- Entity resolution ambiguous matches
- Cross-source conflicts
- Ownership disputes
- New entities from low-confidence sources
Queue actions:
| Action | Graph Effect |
|---|---|
| Accept | verification_status = Verified, confidence → 0.95 |
| Reject | verification_status = Deprecated, log rejection |
| Edit | Update properties inline, then Accept |
| Escalate | Route to Architect role |
Escalation routing:
- Items ≥ 0.60 → service owner (via OWNS edge)
- Items < 0.60 → Architect role
- Target: 10–15% human review rate
Node/Edge Confidence Schema (Neo4j)¶
Every node and edge carries:
- confidence: float (0.0–1.0)
- source: string (connector name)
- extraction_timestamp: datetime
- last_verification_timestamp: datetime
- verification_status: enum (Verified / Unverified / Disputed / Stale / Deprecated)
Staleness rules (Celery beat):
- Documentation: Stale after 30 days
- Infrastructure: Stale after 7 days
- SSH-verified runtime: Stale after 1 day
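The staleness sweep reduces to a pure function over these thresholds. A sketch — the `SOURCE_TTL_DAYS` keys are illustrative names, not schema values:

```python
from datetime import datetime, timedelta, timezone

# Days until a node of each source class is marked Stale (from the rules above)
SOURCE_TTL_DAYS = {"documentation": 30, "infrastructure": 7, "ssh_runtime": 1}

def is_stale(source_class: str, last_verified: datetime, now: datetime) -> bool:
    """True if the node's last verification is older than its class TTL."""
    ttl = SOURCE_TTL_DAYS[source_class]
    return now - last_verified > timedelta(days=ttl)
```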
7. Graph Writer & Neo4j Schema¶
Graph Writer Module¶
Central module all connectors write through:
class GraphWriter:
async def apply_delta(self, delta: GraphDelta) -> WriteResult:
# 1. Entity resolution on all node names
# 2. Apply confidence scores from source weights
# 3. MERGE nodes and edges in a single Neo4j transaction
# 4. Route low-confidence items to Verification Queue
# 5. Publish substrate.graph.node_written / edge_written to NATS
# 6. Publish substrate.cache.invalidate
# 7. Write audit log entry to PostgreSQL
GraphDelta Schema¶
from dataclasses import dataclass

@dataclass
class NodeDelta:
label: str # "Service", "Function", "InfraResource", etc.
properties: dict
merge_key: str # Property to MERGE on
source: str
confidence: float
@dataclass
class EdgeDelta:
rel_type: str # "DEPENDS_ON", "CALLS", "OWNS", etc.
source_label: str
source_key: dict
target_label: str
target_key: dict
properties: dict
source: str
confidence: float
@dataclass
class GraphDelta:
nodes: list[NodeDelta]
edges: list[EdgeDelta]
source_connector: str
event_id: str # For idempotency
Node Labels¶
| Label | Source | Merge Key |
|---|---|---|
| `Service` | All connectors | `name` |
| `Function` | GitHub/Git AST | `signature` |
| `Module` | GitHub/Git AST | `name` (qualified) |
| `InfraResource` | Terraform, K8s | `resource_id` |
| `Developer` | SCIM, GitHub, Git | `github_handle` |
| `Team` | SCIM | `name` |
| `DecisionNode` | Confluence, GitHub | `id` |
| `FailurePattern` | Confluence | `id` |
| `ExceptionNode` | Governance | `id` |
| `SprintNode` | GitHub Projects, Jira | `sprint_id` |
| `IntentAssertion` | GitHub Projects, Jira | `linked_ticket` |
| `Community` | Manual | `id` |
| `Policy` | OPA/PostgreSQL | `policy_id` |
Edge Types¶
| Relationship | From → To | Source |
|---|---|---|
| `CALLS` | Function → Function | AST |
| `IMPORTS` | Module → Module | AST |
| `HOSTS` | InfraResource → Service | Terraform, K8s |
| `ACTUALLY_CALLS` | Service → Service | SSH, K8s |
| `DEPENDS_ON` | Service → Service | Package manifests |
| `OWNS` | Developer/Team → Service | CODEOWNERS, SCIM |
| `MEMBER_OF` | Developer → Team | SCIM |
| `CHILD_OF` | Team → Team | SCIM |
| `WHY` | DecisionNode → Service/Policy | Confluence |
| `CAUSED_BY` | FailurePattern → Service | Confluence |
| `PREVENTED_BY` | Service → Policy | Governance |
| `CONTAINS` | Community → Service | Manual |
New Constraints and Indexes¶
CREATE CONSTRAINT function_sig IF NOT EXISTS FOR (f:Function) REQUIRE f.signature IS UNIQUE;
CREATE CONSTRAINT module_name IF NOT EXISTS FOR (m:Module) REQUIRE m.name IS UNIQUE;
CREATE CONSTRAINT infra_resource_id IF NOT EXISTS FOR (r:InfraResource) REQUIRE r.resource_id IS UNIQUE;
CREATE CONSTRAINT sprint_id IF NOT EXISTS FOR (s:SprintNode) REQUIRE s.sprint_id IS UNIQUE;
CREATE CONSTRAINT intent_ticket IF NOT EXISTS FOR (i:IntentAssertion) REQUIRE i.linked_ticket IS UNIQUE;
CREATE CONSTRAINT exception_id IF NOT EXISTS FOR (e:ExceptionNode) REQUIRE e.id IS UNIQUE;
CREATE INDEX function_file IF NOT EXISTS FOR (f:Function) ON (f.file);
CREATE INDEX infra_type IF NOT EXISTS FOR (r:InfraResource) ON (r.type);
CREATE INDEX infra_provider IF NOT EXISTS FOR (r:InfraResource) ON (r.provider);
CREATE INDEX developer_keycloak IF NOT EXISTS FOR (d:Developer) ON (d.keycloak_id);
CREATE INDEX decision_source IF NOT EXISTS FOR (d:DecisionNode) ON (d.source_url);
CREATE INDEX sprint_dates IF NOT EXISTS FOR (s:SprintNode) ON (s.start_date);
CREATE INDEX node_source IF NOT EXISTS FOR (s:Service) ON (s.source);
CREATE INDEX node_confidence IF NOT EXISTS FOR (s:Service) ON (s.confidence);
CREATE INDEX node_verification IF NOT EXISTS FOR (s:Service) ON (s.verification_status);
Graph Algorithm Jobs (Celery beat, 3:00 AM)¶
PageRank:
CALL gds.pageRank.write('service-dependency-graph', {
writeProperty: 'page_rank', maxIterations: 20, dampingFactor: 0.85
})
Betweenness centrality:
CALL gds.betweenness.write('service-dependency-graph', {
writeProperty: 'betweenness'
})
Weakly connected components:
CALL gds.wcc.write('service-dependency-graph', {
writeProperty: 'component_id'
})
Cycle detection:
- DEPENDS_ON subgraph → NetworkX DiGraph → nx.simple_cycles(G)
- Run synchronously on PR ingestion (before governance evaluation)
- Cycles written as policy violation candidates
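The spec uses NetworkX's `simple_cycles` for this; a dependency-free sketch of the same check over a DEPENDS_ON edge list (iterative DFS with white/gray/black coloring, returning only whether any cycle exists rather than enumerating them):

```python
from collections import defaultdict

def has_cycle(edges: list[tuple[str, str]]) -> bool:
    """True if the directed DEPENDS_ON graph contains a cycle."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)  # defaults to WHITE
    for start in list(graph):
        if color[start] != WHITE:
            continue
        stack = [(start, iter(graph[start]))]
        color[start] = GRAY
        while stack:
            node, it = stack[-1]
            advanced = False
            for nxt in it:
                if color[nxt] == GRAY:
                    return True  # back edge into the current DFS path -> cycle
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(graph[nxt])))
                    advanced = True
                    break
            if not advanced:
                color[node] = BLACK  # all descendants explored
                stack.pop()
    return False
```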
8. AST Parser (Rust CLI)¶
Architecture¶
Standalone Rust binary invoked by ingestion workers:
ingestion-worker receives push/PR event
↓
clones/fetches repo to temp directory
↓
substrate-ast-cli parse --repo /tmp/repo --output json
↓
tree-sitter parses each file, stack-graphs resolves cross-file refs
↓
JSON output: nodes + edges
↓
ingestion-worker builds GraphDelta, sends to graph_writer
Technology¶
- tree-sitter — Incremental parsing, grammars for Python, TypeScript, Go, Java, Rust, C#
- stack-graphs — GitHub's cross-file name resolution library
Supported Languages (Phase 1)¶
| Language | Grammar | stack-graphs | Priority |
|---|---|---|---|
| Python | `tree-sitter-python` | Yes | High |
| TypeScript/JavaScript | `tree-sitter-typescript` | Yes | High |
| Go | `tree-sitter-go` | Yes | High |
| Java | `tree-sitter-java` | Yes | Medium |
| Rust | `tree-sitter-rust` | Yes | Medium |
| C# | `tree-sitter-c-sharp` | Partial | Low |
CLI Output Format¶
{
"nodes": [
{
"label": "Function",
"properties": {
"signature": "app.services.payment.process_payment",
"file": "src/services/payment.py",
"line": 42,
"end_line": 87,
"hash": "sha256:abc123...",
"language": "python",
"visibility": "public"
}
},
{
"label": "Module",
"properties": {
"name": "app.services.payment",
"file": "src/services/payment.py",
"language": "python"
}
}
],
"edges": [
{
"rel_type": "CALLS",
"source": "app.services.payment.process_payment",
"target": "app.services.order.create_order",
"properties": { "static": true }
},
{
"rel_type": "IMPORTS",
"source": "app.services.payment",
"target": "app.services.order",
"properties": { "static": true, "dynamic": false }
}
]
}
Service Detection Heuristics¶
| Signal | Confidence | Example |
|---|---|---|
| Dockerfile present | 0.95 | services/payment/Dockerfile → PaymentService |
| docker-compose service entry | 0.95 | services.payment in compose file |
| Package manifest at root | 0.90 | services/payment/package.json |
| K8s deployment YAML | 0.85 | k8s/payment-deployment.yaml |
| Directory with `main.*` | 0.80 | services/payment/main.py |
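The heuristics table as a scorer over a repo file listing. A sketch under stated assumptions: the exact filename patterns are mine, and the compose signal is approximated by file presence rather than parsing `services.*` entries as the table actually specifies:

```python
def service_confidence(paths: list[str]) -> float:
    """Return the strongest service-detection signal present in a file listing."""
    best = 0.0
    for p in paths:
        name = p.rsplit("/", 1)[-1]
        if name == "Dockerfile":
            best = max(best, 0.95)
        elif name in ("docker-compose.yml", "docker-compose.yaml"):
            best = max(best, 0.95)  # approximation: table keys on service entries
        elif name in ("package.json", "requirements.txt", "go.mod"):
            best = max(best, 0.90)
        elif name.endswith("-deployment.yaml"):
            best = max(best, 0.85)
        elif name.startswith("main."):
            best = max(best, 0.80)
    return best
```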
Build & Distribution¶
- Static binary: `cargo build --release`
- Added to backend Docker image at `/usr/local/bin/substrate-ast-cli`
- Separate `Cargo.toml` at `tools/ast-cli/`
Incremental Parsing¶
On PR events, only changed files are parsed:
substrate-ast-cli parse --repo /tmp/repo --files src/payment.py,src/order.py --output json
9. OPA Governance Foundation¶
Deployment¶
OPA server in Docker Compose:
opa:
image: openpolicyagent/opa:latest
command: ["run", "--server", "--bundle", "/policies"]
ports: ["8181:8181"]
volumes: ["./backend/policies:/policies"]
Evaluation Pipeline¶
Graph write event (substrate.graph.node_written)
↓
Governance worker picks up event
↓
Fetches affected subgraph from Neo4j
↓
Serializes subgraph as JSON context
↓
POST /v1/data/substrate/violations → OPA
↓
Returns violations with rule IDs, severity, affected nodes
↓
Violations written to PostgreSQL policy_violations
↓
Publishes substrate.governance.violation_raised to NATS
Starter Policy Packs (3 of 9)¶
1. Domain Boundary (GOV-BOUNDARY)
package substrate.boundary
violation[msg] {
edge := input.edges[_]
edge.type == "ACTUALLY_CALLS"
source := input.nodes[edge.source]
target := input.nodes[edge.target]
source.domain != target.domain
not edge.via_gateway
msg := sprintf("%s (%s) calls %s (%s) directly without gateway",
[source.name, source.domain, target.name, target.domain])
}
2. Circular Dependency (GOV-CIRCULAR)
package substrate.circular
violation[msg] {
cycle := input.cycles[_]
msg := sprintf("Circular dependency detected: %s", [concat(" → ", cycle)])
}
Cycles pre-computed by NetworkX and passed in OPA input context.
3. Ownership Completeness (GOV-OWNERSHIP)
package substrate.ownership
violation[msg] {
service := input.services[_]
service.verification_status != "Deprecated"
count(service.owners) == 0
msg := sprintf("%s has no active owner", [service.name])
}
Policy Packs Deferred to Phase 2¶
| Pack | ID | Reason |
|---|---|---|
| SOLID Single Responsibility | `GOV-SRP` | Needs efferent coupling metrics |
| SOLID Open/Closed | `GOV-OCP` | Needs extension pattern detection |
| SOLID Dependency Inversion | `GOV-DIP` | Needs abstract vs concrete classification |
| TDD Coverage | `GOV-TDD` | Needs CI test coverage data |
| API-First | `GOV-API` | Needs OpenAPI spec detection |
| License Compatibility | `GOV-LICENSE` | Needs dependency license resolution |
No GitHub Checks API in Phase 1¶
Violations are recorded and visible in the UI. GitHub Checks API integration (PR blocking, violation comments) is Phase 2.
10. SCIM, RBAC & Identity¶
Custom Keycloak Dockerfile¶
FROM quay.io/keycloak/keycloak:latest
ADD --chown=keycloak:keycloak \
https://github.com/Captain-P-Goldfish/scim-for-keycloak/releases/latest/download/scim-for-keycloak.jar \
/opt/keycloak/providers/
RUN /opt/keycloak/bin/kc.sh build
SCIM Endpoints¶
| Method | Path | Purpose |
|---|---|---|
| `POST` | `/scim/v2/Users` | Create Developer node |
| `PATCH` | `/scim/v2/Users/{id}` | Update/deactivate Developer |
| `POST` | `/scim/v2/Groups` | Create Team node |
| `PATCH` | `/scim/v2/Groups/{id}` | Add/remove MEMBER_OF edges |
| `DELETE` | `/scim/v2/Groups/{id}` | Mark Team as Deprecated |
Onboarding (POST /Users):
1. Create Developer node (active: true)
2. Create MEMBER_OF edges per group
3. Create Team if missing (verification_status = Unverified)
4. Publish substrate.identity.user_onboarded
Offboarding (PATCH /Users active=false):
1. Set Developer.active = false
2. Query orphaned services:
MATCH (d:Developer {keycloak_id: $kid, active: false})-[:OWNS]->(s:Service)
WHERE NOT EXISTS {
MATCH (other:Developer {active: true})-[:OWNS]->(s)
}
AND NOT EXISTS {
MATCH (t:Team)-[:OWNS]->(s)
WHERE EXISTS { MATCH (m:Developer {active: true})-[:MEMBER_OF]->(t) }
}
RETURN s.name AS orphaned_service
3. Publish substrate.identity.user_offboarded
RBAC Roles¶
| Role | Capabilities |
|---|---|
| `admin` | Full access, user management, system configuration |
| `architect` | Full graph, policy authoring, simulation, exception approval |
| `developer` | Own service graph, PR details, intent mismatch alerts |
| `viewer` | Read-only graphs, dashboards, reports |
| `service-account` | API-only, webhook delivery, no UI |
Role Enforcement¶
from fastapi import Depends, HTTPException

def require_role(*roles: str):
def checker(user: UserInfo = Depends(get_current_user)):
if not any(r in user.realm_roles for r in roles):
raise HTTPException(403, "Insufficient role")
return user
return Depends(checker)
Ownership-Scoped Filtering¶
Developer role queries filtered to owned services:
async def get_owned_services(user: UserInfo, neo4j_session) -> list[str]:
result = await neo4j_session.run("""
MATCH (d:Developer {keycloak_id: $kid})
OPTIONAL MATCH (d)-[:OWNS]->(s1:Service)
OPTIONAL MATCH (d)-[:MEMBER_OF]->(t:Team)-[:OWNS]->(s2:Service)
RETURN collect(DISTINCT s1.name) + collect(DISTINCT s2.name) AS services
""", kid=user.sub)
record = await result.single()
return record["services"] if record else []
Route-Level Requirements¶
| Endpoint | Required Role |
|---|---|
| `GET /dashboard`, communities, graph | Any authenticated |
| `GET /policies` | Any authenticated |
| `POST /policies`, `PUT /policies/{id}` | architect, admin |
| `GET /queue` | developer, architect, admin |
| `PATCH /queue/{id}` (Accept/Reject) | architect, admin |
| `PATCH /queue/{id}` (Edit) | developer (own), architect, admin |
| `POST /memory` | developer, architect, admin |
| `POST /simulation/run` | architect, admin |
| SCIM endpoints | service-account only |
| Webhook receivers | HMAC-verified, no JWT |
Group-to-Role Mapping¶
CREATE TABLE role_mappings (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
keycloak_group TEXT NOT NULL UNIQUE,
substrate_role TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT now()
);
11. UI Changes¶
New Dependencies¶
| Package | Purpose |
|---|---|
elkjs |
Sugiyama layout algorithm |
web-worker |
Run layout off main thread |
Graph Rendering¶
Layout engines (user-selectable):
- Sugiyama (default): Dependencies flow top-to-bottom via ELK.js. Computed in a web worker.
- Force-directed: Nodes cluster by relationship density. React Flow built-in.
Semantic Zoom¶
Detected from viewport scale, not explicit user action. 300ms ease-in-out transitions.
| Scale | Level | Rendered |
|---|---|---|
| < 0.4 | Far (Domain) | Domain super-nodes. Aggregate health badge, violation count, tension score. |
| 0.4–0.8 | Mid (Service) | Service nodes with tension ring, violation badge, owner label. PageRank-weighted sizing. |
| > 0.8 | Close (Component) | Function/Module nodes inside service boundaries. File path, line number, call counts. |
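The scale-to-level mapping written out as a function. Boundary handling at exactly 0.4 and 0.8 is an assumption — the table's ranges leave the endpoints ambiguous (this logic would live in the TypeScript viewport handler; Python is used here to match the rest of the spec's examples):

```python
def zoom_level(scale: float) -> str:
    """Map viewport scale to the semantic zoom level to render."""
    if scale < 0.4:
        return "domain"     # domain super-nodes with aggregate badges
    if scale <= 0.8:
        return "service"    # service nodes, PageRank-weighted sizing
    return "component"      # Function/Module nodes inside service boundaries
```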
Graph Toolbar¶
[Layout: Sugiyama ▾] [Filter ▾] [Minimap ☐] [Fit View]
Filter sidebar:
- Domain / community
- Violation type (boundary, circular, ownership)
- Confidence threshold (slider, default 0.60)
- Owner / team
- Verification status
- Time window (7d / 14d / 30d)
Verification Queue UI Updates¶
- Items show source connector, confidence badge (red < 0.60, yellow 0.60–0.79, green ≥ 0.80)
- Accept/Reject/Edit/Escalate call real API endpoints
- Items grouped by `resolution_type`
Dashboard Updates¶
All metrics computed from real data:
- Violations from policy_violations WHERE resolved = false
- Tension from graph algorithm output
- Memory entries from real ingestion
- Memory gaps from services with no WHY edges
RBAC in UI¶
- Role from JWT claims
- Nav items conditionally rendered
- Policy editor: read-only for Developer/Viewer
- Simulation Run: hidden for Developer/Viewer
- Queue actions: role-appropriate visibility
Pages NOT Changed¶
| Page | Reason |
|---|---|
| Search | Canned results until GraphRAG (Phase 3) |
| Simulation rendering | Seeded results until simulation engine (Phase 4) |
12. Celery Beat Schedule¶
| Job | Schedule | Description |
|---|---|---|
| Nightly connector poll | 2:00 AM daily | Full poll on all polling connectors |
| Entity resolution pass | 2:15 AM daily | Normalization + Jaccard on last 24h nodes |
| PageRank update | 3:00 AM daily | GDS PageRank on service graph |
| Betweenness centrality | 3:00 AM daily | GDS betweenness on service graph |
| Connected components | 3:00 AM daily | GDS WCC for orphan detection |
| Staleness check | 4:00 AM daily | Mark nodes Stale per threshold rules |
| K8s API poll | Every 15 min | K8s Watch API resource changes |
| SSH Runtime inspection | Every 15 min | Vault cert → SSH → inspection → diff |
| GitHub Pages poll | Every 6 hours | Pages build timestamps vs code commits |
| GitHub Projects poll | Every 1 hour | GraphQL project item changes |
Distributed Locking¶
async def acquire_lock(redis, event_id: str, ttl: int = 300) -> bool:
return await redis.set(f"lock:ingestion:{event_id}", "1", nx=True, ex=ttl)
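The `SET NX EX` contract can be exercised without Redis. An in-memory stand-in (illustrative only — `FakeRedis` and the synchronous `acquire_lock` variant are not the production code path) showing that only the first worker wins a given event's lock until the TTL lapses:

```python
import time

class FakeRedis:
    """Just enough of redis.set(nx=True, ex=ttl) to demonstrate the locking contract."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        current = self._store.get(key)
        if current is not None and current[1] > now and nx:
            return None  # key exists and is live: NX refuses to overwrite
        expires = now + ex if ex else float("inf")
        self._store[key] = (value, expires)
        return True

def acquire_lock(redis, event_id: str, ttl: int = 300):
    # Mirrors the async version above, synchronously, for demonstration
    return redis.set(f"lock:ingestion:{event_id}", "1", nx=True, ex=ttl)
```

Because the lock key embeds `event_id`, a redelivered NATS message hits an existing live key and is skipped, while distinct events proceed in parallel.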
13. Infrastructure & Docker Compose¶
New Services¶
| Service | Image | Purpose |
|---|---|---|
| `opa` | `openpolicyagent/opa:latest` | Policy evaluation |
| `keycloak` | Custom Dockerfile | Identity + SCIM 2.0 |
| `celery-worker` | Backend image | Ingestion processing |
| `celery-beat` | Backend image | Job scheduling |
| `celery-governance` | Backend image | Policy evaluation workers |
| `vault` | `hashicorp/vault:latest` | SSH CA |
Startup Order¶
1. PostgreSQL, Neo4j, Redis, NATS (databases)
2. Flyway migrations (schema)
3. Keycloak (custom, SCIM plugin) (identity)
4. Vault (SSH CA)
5. OPA (policy engine)
6. Backend (FastAPI) (API layer)
7. Celery beat (scheduler)
8. Celery workers (ingestion, governance) (async processing)
9. UI (Nginx) (frontend)
Backend Dockerfile Updates¶
Multi-stage build:
1. ast-builder stage — Compiles Rust CLI from tools/ast-cli/
2. python stage — Backend app with CLI binary, git, SSH client
New Environment Variables¶
NATS_URL=nats://nats:4222
OPA_URL=http://opa:8181
CELERY_BROKER_URL=redis://redis:6379/1
CELERY_RESULT_BACKEND=redis://redis:6379/2
VAULT_ADDR=http://vault:8200
VAULT_TOKEN=substrate-dev-token
VAULT_SSH_ROLE=substrate-ssh
GITHUB_APP_ID=
GITHUB_APP_PRIVATE_KEY_PATH=
GITHUB_WEBHOOK_SECRET=
TERRAFORM_STATE_PATHS=
KUBECONFIG_PATH=
SSH_TARGET_HOSTS=
CONFLUENCE_URL=
CONFLUENCE_API_TOKEN=
CONFLUENCE_WEBHOOK_SECRET=
JIRA_URL=
JIRA_API_TOKEN=
JIRA_WEBHOOK_SECRET=
Project Directory Structure (Phase 1 additions)¶
substrate/
backend/
app/
modules/
ingestion/ # Webhook receivers, NATS publishers
connectors/ # 8 connector implementations
graph_writer/ # Entity resolution, confidence, MERGE logic
governance/ # OPA client, evaluation pipeline
identity/ # SCIM endpoint, Developer/Team mutations
rbac/ # Role enforcement, ownership filtering
scheduler/ # Celery app, beat config, tasks
ast_parser/ # Rust CLI wrapper
core/
nats.py # NATS client helpers
policies/ # OPA Rego packs
boundary/policy.rego
circular/policy.rego
ownership/policy.rego
db/
postgres/V4__phase1_schema.sql
neo4j/V2__phase1_constraints.cypher
tools/
ast-cli/ # Rust AST parser
Cargo.toml
src/main.rs
keycloak/
Dockerfile # Custom Keycloak with SCIM
ui/
src/
components/graph/
ElkLayout.ts
SemanticZoom.tsx
LayoutSelector.tsx
FilterSidebar.tsx
14. Database Migrations¶
PostgreSQL: V4__phase1_schema.sql¶
CREATE TABLE entity_aliases (
raw_name TEXT PRIMARY KEY,
canonical_id UUID NOT NULL,
source TEXT NOT NULL,
confidence FLOAT NOT NULL,
resolved_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE ingestion_events (
event_id TEXT PRIMARY KEY,
source_connector TEXT NOT NULL,
event_type TEXT NOT NULL,
payload_hash TEXT NOT NULL,
status TEXT DEFAULT 'pending',
nodes_written INT DEFAULT 0,
edges_written INT DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT now(),
completed_at TIMESTAMPTZ
);
CREATE TABLE audit_log (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
timestamp TIMESTAMPTZ DEFAULT now(),
actor TEXT NOT NULL,
action TEXT NOT NULL,
resource_type TEXT,
resource_id TEXT,
input_hash TEXT,
output_hash TEXT,
confidence FLOAT,
detail JSONB
);
CREATE TABLE webhook_configs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
connector_type TEXT NOT NULL,
endpoint_path TEXT NOT NULL,
hmac_secret TEXT NOT NULL,
active BOOLEAN DEFAULT true,
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE role_mappings (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
keycloak_group TEXT NOT NULL UNIQUE,
substrate_role TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE connector_state (
connector_type TEXT PRIMARY KEY,
last_poll_at TIMESTAMPTZ,
last_success_at TIMESTAMPTZ,
cursor_state JSONB,
error_count INT DEFAULT 0,
last_error TEXT,
active BOOLEAN DEFAULT true
);
ALTER TABLE queue_items ADD COLUMN IF NOT EXISTS source_connector TEXT;
ALTER TABLE queue_items ADD COLUMN IF NOT EXISTS candidate_data JSONB;
ALTER TABLE queue_items ADD COLUMN IF NOT EXISTS conflicting_data JSONB;
ALTER TABLE queue_items ADD COLUMN IF NOT EXISTS neo4j_node_id TEXT;
ALTER TABLE queue_items ADD COLUMN IF NOT EXISTS resolution_type TEXT;
Neo4j: V2__phase1_constraints.cypher¶
See Section 7 for full constraint and index definitions.
Seed Data Transition¶
- Seed data nodes have `source: "seed"`, `confidence: 0.80`
- Connector data arrives with higher confidence and overwrites
- After all connectors live, cleanup job removes unmatched seed-only nodes
- No breaking migration — Phase 0 UI continues working throughout
pyproject.toml Additions¶
"celery[redis]>=5.3"
"nats-py>=2.7"
"networkx>=3.0"
"aiohttp>=3.9"
"paramiko>=3.4"
"hvac>=2.0"
"kubernetes-asyncio>=30"
15. Testing Strategy¶
Test Layers¶
| Layer | What | How | Target |
|---|---|---|---|
| Route tests | Status codes, shapes, auth, RBAC | httpx, mocked services | ~60 |
| Unit — connectors | Parsing, events, GraphDelta | Mock APIs, fixture payloads | ~40 |
| Unit — entity resolution | Normalization, alias, Jaccard | Pure functions | ~15 |
| Unit — graph writer | MERGE patterns, confidence, validation | Mock Neo4j | ~10 |
| Unit — governance | OPA input, violation parsing | Mock OPA HTTP | ~10 |
| Unit — RBAC | Role checks, ownership filtering | Mock user, mock Neo4j | ~10 |
| Unit — SCIM | Event parsing, graph mutations | Mock Neo4j | ~10 |
| Unit — AST CLI | JSON parsing, service detection | Fixture repos | ~15 |
Total: ~170 tests
Connector Fixtures¶
tests/fixtures/{connector}/ — sample payloads per connector
tests/fixtures/ast/ — small fixture repos for CLI testing
Not Testing in Phase 1¶
- Integration tests with testcontainers
- End-to-end webhook → Neo4j flows
- OPA with real server
- Load/performance testing
16. Explicitly Out of Scope¶
| Item | Target Phase |
|---|---|
| LLM entity resolution (Dense resolve-lora) | Phase 2+ |
| bge-m3 embeddings / pgvector semantic search | Phase 2+ |
| GitHub Checks API (PR blocking) | Phase 2 |
| Violation PR comments | Phase 2 |
| explain-lora violation explanations | Phase 2 |
| Leiden community detection | Phase 2 |
| GraphRAG / NL search | Phase 3 |
| Simulation engine (Neo4j sandboxes) | Phase 4 |
| Agent orchestration (Fix PR) | Phase 4 |
| Slack connector | v1.1 |
| Datadog connector | v1.1 |
| Self-hosted GitHub Enterprise | Later |
| Bitbucket / GitLab connectors | Not in blueprint |
| ServiceNow connector | Not in MVP |
| Integration tests with testcontainers | Phase 2 |
| Load/performance testing | Phase 2+ |
| WebGL 3D graph (Sigma.js) | v2 |
| Multi-team graph isolation | v2 |