Unified Multimodal Knowledge Base (UMKB) — Design Reference¶
Overview¶
The UMKB is the central artifact of the Substrate platform. Every feature — governance checks, NL queries, simulation, proactive alerts, sprint retro insights — derives its value from the fidelity and completeness of what is stored here. The UMKB is not a reporting database; it is a living, continuously updated representation of an engineering organization's architecture, intent, and institutional memory.
The UMKB is built on four purpose-selected data stores, each serving a distinct and non-overlapping role. Nothing is deployed out of convenience or familiarity — every technology choice is justified by a specific capability requirement.
Core Problems Solved¶
Substrate's UMKB exists to solve three failures that repeatedly break engineering organizations:
- Structural drift: The runtime architecture diverges from intended design and policy boundaries.
- Memory loss: Decision rationale, incident lessons, and tacit implementation context decay or disappear.
- Active governance gap: Policies exist as static documents but are not continuously enforced, validated, or used to proactively keep documentation and related artifacts current.
The third problem is treated as a first-class capability, not a reporting add-on. The UMKB continuously labels artifacts, checks policy compliance, and routes remediation work to responsible users.
Database Selection¶
| Database | Role | Why This Tool |
|---|---|---|
| Neo4j 5.x | Observed, Intended, and Memory Graphs | Native graph storage; Cypher query language; GDS library provides Leiden community detection, PageRank, betweenness centrality, and weakly connected components; named graph sandboxing enables simulation via CREATE DATABASE IF NOT EXISTS without touching production data |
| PostgreSQL 16 | Policy store; node embeddings; drift scores; audit log; timestamped graph snapshots | pgvector extension with HNSW index enables sub-millisecond ANN search over node embeddings; pg_partman handles time-series partitioning of drift scores and audit events; append-only audit table enforces immutability at the schema level |
| Redis 7 | Job queue broker; vLLM prefix cache; hot subgraph cache; distributed locks | Sub-millisecond read latency; TTL-based cache eviction aligned with graph update events; SET NX EX for distributed locking across Celery workers; AOF persistence ensures job queue survives process restart |
| NATS JetStream | Event bus; at-least-once delivery; subject-based routing | Stream replay enables event-sourced graph rebuilds after Ingestion Service crash; substrate.> subject hierarchy allows fine-grained subscription by service type; consumer group load balancing across multiple Ingestion workers |
Why Not a Single Database¶
Running four data stores rather than one is not a design flaw; it is a deliberate separation of concerns. Neo4j is irreplaceable for multi-hop graph traversals that would require unwieldy recursive CTEs in SQL. PostgreSQL is irreplaceable for ACID-compliant transactional writes, pgvector HNSW index performance at scale, and the immutable append-only audit log. Redis is irreplaceable for sub-millisecond hot-path reads and Celery job coordination. NATS is irreplaceable for reliable event delivery with replay. Each technology does one thing exceptionally well, and the integration overhead is bounded and predictable.
UML Component View¶
```mermaid
classDiagram
direction LR
class IngestionService
class GovernanceService
class ProactiveMaintenanceService
class ReasoningService
class PolicyEngine
class Neo4j
class PostgreSQL
class Redis
class NATS
IngestionService --> NATS : publishes ingestion events
IngestionService --> Neo4j : writes entities and edges
IngestionService --> PostgreSQL : writes relational metadata and pgvector embeddings
IngestionService --> Redis : dedup locks and cache keys
GovernanceService --> PolicyEngine : evaluates policy packs
GovernanceService --> Neo4j : reads graph deltas
GovernanceService --> PostgreSQL : writes violations, exceptions, audit
ProactiveMaintenanceService --> Neo4j : reads gaps and stale links
ProactiveMaintenanceService --> PostgreSQL : writes lifecycle labels and queue state
ProactiveMaintenanceService --> NATS : publishes proactive tasks
ReasoningService --> Neo4j : graph retrieval and Leiden communities
ReasoningService --> PostgreSQL : semantic search via pgvector
ReasoningService --> NATS : emits result events
```
Neo4j Graph Schema¶
Node Labels¶
Service¶
The primary unit of architectural concern. Every discovered or declared service in the organization maps to a Service node.
| Property | Type | Description |
|---|---|---|
| domain | string | Bounded context this service belongs to (e.g., "payments", "auth") |
| api_type | string | REST, gRPC, GraphQL, event-driven, internal |
| test_coverage | float | Current test coverage percentage (0.0–1.0) |
| page_rank | float | GDS PageRank score; updated nightly |
| betweenness | float | GDS betweenness centrality score; updated nightly |
| tension_score | float | Current structural tension against Intended Graph |
| confidence | float | Confidence that this node represents a canonical service |
| source | string | Originating data source (github, terraform, kubernetes, ssh) |
| verification_status | enum | Verified / Unverified / Disputed / Stale / Deprecated |
| last_verification_timestamp | datetime | When this node was last confirmed accurate |
| extraction_timestamp | datetime | When this node was first created |
Function / Module¶
Represents a discrete code unit: a function, class, or module within a service. Populated by AST parsing via the Rust CLI during PR ingestion.
| Property | Type | Description |
|---|---|---|
| signature | string | Fully qualified function or module name |
| file | string | Repository-relative file path |
| line | integer | Line number within file |
| hash | string | SHA-256 of the function body; change detection |
| confidence | float | Confidence in extracted metadata |
| source | string | Originating repository + commit |
InfraResource¶
Represents a declared or observed infrastructure component: VM, container, load balancer, database, message queue.
| Property | Type | Description |
|---|---|---|
| type | string | ec2, rds, k8s_pod, k8s_service, load_balancer, s3, etc. |
| provider | string | aws, gcp, azure, bare_metal, on_prem |
| state | string | running, stopped, pending, drifted, undeclared |
| region | string | Deployment region or availability zone |
| declared_ports | list[integer] | Ports declared in Terraform / K8s spec |
| observed_ports | list[integer] | Ports discovered via SSH inspection |
| confidence | float | Confidence in current state accuracy |
DecisionNode¶
Represents a captured architectural decision — ADR, RFC, design review outcome, or approved exception. The institutional memory backbone.
| Property | Type | Description |
|---|---|---|
| rationale | string | Full decision rationale text |
| source_url | string | Canonical link (GitHub ADR file, Confluence page, Jira ticket) |
| confidence | float | Confidence in extraction accuracy |
| verified_at | datetime | When a human confirmed this decision is still active |
| reviewed_at | datetime | Last review timestamp (manual or automated) |
| decision_date | date | When the original decision was made |
| status | string | active, superseded, deprecated |
FailurePattern¶
Captures post-mortem lessons and incident root causes. Links causally to the services affected and the policies that were (or should have been) in place.
| Property | Type | Description |
|---|---|---|
| incident_date | date | Date of the incident |
| root_cause | string | Summary of the root cause |
| affected_domains | list[string] | Bounded contexts impacted |
| source_url | string | Link to post-mortem document |
| severity | string | P1 / P2 / P3 |
| lessons | string | Extracted lessons verbatim |
| confidence | float | Extraction confidence |
MemoryNode¶
Captures informal design rationale that does not rise to the level of a formal ADR: PR review comments, Slack design discussions, inline code documentation that explains a non-obvious decision.
| Property | Type | Description |
|---|---|---|
| content | string | Extracted decision or rationale text |
| source_type | string | pr_comment, slack_message, code_comment, readme |
| confidence | float | Confidence in relevance and accuracy |
| extraction_model | string | Model used for extraction (dense-extract, moe-scout) |
| source_url | string | Link to originating content |
ExceptionNode¶
Represents an approved policy exception: a deliberate, time-bounded acknowledgment that a service is violating a policy for legitimate reasons.
| Property | Type | Description |
|---|---|---|
| rationale | string | Why this exception was approved |
| approved_by | string | Keycloak user ID of approving Architect |
| expires_at | datetime | Mandatory expiry; Substrate re-raises violation after this timestamp |
| policy_id | string | OPA policy pack and rule identifier |
| created_at | datetime | Timestamp of approval |
Developer¶
Represents a human member of the engineering organization. Linked to Keycloak and SCIM identity.
| Property | Type | Description |
|---|---|---|
| github_handle | string | GitHub username |
| scim_id | string | SCIM external ID from IdP |
| keycloak_id | string | Keycloak user UUID |
| active | boolean | Set false on SCIM deactivation; triggers key-person risk scan |
| name | string | Display name |
| email | string | Primary email |
Team¶
Represents an engineering team. Supports hierarchical ownership via CHILD_OF edges.
| Property | Type | Description |
|---|---|---|
| name | string | Canonical team name |
| child_of | string | Parent team name (denormalized for fast lookup) |
| owns_services | list[string] | Denormalized list for quick key-person risk queries |
SprintNode¶
Represents a sprint or iteration in GitHub Projects v2 or Jira. Linked to structural debt reports generated at sprint close.
| Property | Type | Description |
|---|---|---|
| sprint_id | string | Unique identifier from source system |
| health | string | healthy / at_risk / critical |
| start_date | date | Sprint start |
| end_date | date | Sprint end |
| velocity | integer | Story points completed |
| violation_delta | integer | Change in violation count over sprint |
| debt_score | float | Aggregate structural debt score at sprint close |
IntentAssertion¶
Represents a captured engineering intent: what a developer declared they were building, extracted from a Jira ticket, GitHub Projects item, or PR description. Used for intent mismatch detection against actual code changes.
| Property | Type | Description |
|---|---|---|
| linked_ticket | string | Jira or GitHub Projects item ID |
| intent_embedding | vector(1024) | bge-m3 embedding of intent text; stored in PostgreSQL pgvector table |
| confidence | float | Confidence that this accurately captures intent |
| source_text | string | Original intent description |
DocumentAsset¶
Represents any user-contributed artifact: ADRs, design docs, epics, user stories, source files, PR comments, runbooks, sprint notes, and policy documents.
| Property | Type | Description |
|---|---|---|
| document_id | string | Canonical document identifier across source systems |
| source_system | string | github, jira, confluence, slack, git, runtime, etc. |
| source_path | string | Path, URL, or tool-native locator |
| lifecycle_state | enum | latest, active, stale, outdated, incomplete, archived, oldest_snapshot |
| llm_labeled_at | datetime | Last LLM-based lifecycle labeling timestamp |
| owner_ref | string | Developer, Team, or role accountable for updates |
| confidence | float | Confidence in extracted metadata and state label |
DocumentChunk¶
Represents a chunked segment of a DocumentAsset, optimized for vector search and graph linking.
| Property | Type | Description |
|---|---|---|
| chunk_id | string | Stable ID for deduplication and updates |
| chunk_profile | string | code_ast, adr_section, ticket_thread, runtime_window, markdown_heading |
| token_count | integer | Tokenized size of chunk |
| overlap_tokens | integer | Overlap applied to preserve context continuity |
| embedding_ref | string | Foreign key to PostgreSQL pgvector embedding row |
| confidence | float | Confidence in chunk boundaries and extracted entities |
KnowledgeCommunity¶
Represents an automatically generated Leiden community grouping related architecture and delivery artifacts.
| Property | Type | Description |
|---|---|---|
| community_id | string | Stable Leiden partition identifier |
| summary | string | LLM-generated synopsis of the community |
| dominant_domain | string | Dominant bounded context represented |
| refresh_timestamp | datetime | Last recompute time |
| member_count | integer | Count of nodes/chunks in the community |
Key Edge Types¶
CALLS (Function → Function)¶
Represents a runtime or static call relationship between two code functions. Populated by AST analysis during PR ingestion.
| Property | Type | Description |
|---|---|---|
| count | integer | Call frequency from trace data or static analysis |
| last_seen | datetime | Last observed call timestamp |
| confidence | float | Confidence based on analysis method (static vs dynamic) |
DEPENDS_ON (Module → Module)¶
Package or module-level dependency relationship. Populated from package manifests (package.json, requirements.txt, go.mod, etc.).
| Property | Type | Description |
|---|---|---|
| version | string | Declared version constraint |
| confidence | float | Confidence in dependency accuracy |
| last_verified | datetime | Last verification timestamp |
IMPORTS (Module → Module)¶
Static import relationship. Distinguishes static from dynamic imports for cycle detection and blast radius analysis.
| Property | Type | Description |
|---|---|---|
| static | boolean | True for compile-time imports |
| dynamic | boolean | True for runtime/conditional imports |
HOSTS (InfraResource → Service)¶
Declares that an infrastructure resource runs a given service.
| Property | Type | Description |
|---|---|---|
| port | integer | Port on which service is hosted |
| protocol | string | TCP, UDP, HTTP, HTTPS, gRPC |
| declared | boolean | True if declared in Terraform/K8s; false if discovered via SSH inspection |
ACTUALLY_CALLS (Service → Service)¶
Observed runtime call relationship between services, as distinct from the declared or intended topology. The gap between ACTUALLY_CALLS edges and declared DEPENDS_ON edges is a primary source of structural tension.
| Property | Type | Description |
|---|---|---|
| direct | boolean | Direct HTTP/gRPC call without intermediary |
| via_gateway | boolean | Routed through API gateway |
| verified_at | datetime | Last SSH or trace verification timestamp |
WHY (DecisionNode → Service/Policy)¶
Links a captured decision to the architectural artifact it explains. The primary edge for answering "why was this decision made?" queries and surfacing ADR references in violation comments.
| Property | Type | Description |
|---|---|---|
| context | string | Relationship context (why this decision applies to this service) |
| rationale_excerpt | string | Verbatim excerpt linking decision to artifact |
OWNS (Developer/Team → Service)¶
Ownership relationship. Populated from CODEOWNERS files, SCIM team membership, and confidence-weighted heuristics. Critical for key-person risk detection on Developer deactivation.
| Property | Type | Description |
|---|---|---|
| since | datetime | Ownership start date |
| primary | boolean | True for primary owner; false for secondary/on-call |
| confidence | float | Confidence in ownership accuracy |
| last_verified | datetime | Last verification timestamp |
CAUSES (FailurePattern → Service)¶
Links a documented failure pattern to the service it affected. Used in blast radius analysis and violation explanation generation.
| Property | Type | Description |
|---|---|---|
| severity | string | P1 / P2 / P3 |
| date | date | Date of the causation event |
PREVENTED_BY (Service → Policy)¶
Declares that a specific policy is in place to prevent a class of failures. Populated on ADR ingestion when a policy is linked to a post-mortem lesson.
| Property | Type | Description |
|---|---|---|
| via | string | Policy pack ID and rule name |
| effective_since | datetime | When the policy became effective |
MEMBER_OF (Developer → Team)¶
Team membership edge. Created atomically on SCIM POST /Groups event.
| Property | Type | Description |
|---|---|---|
| since | datetime | Membership start date |
| role | string | member, lead, on_call |
CHILD_OF (Team → Team)¶
Team hierarchy edge. Enables transitive ownership queries across the full organizational tree without denormalization.
| Property | Type | Description |
|---|---|---|
| hierarchy_depth | integer | Depth from root team; used for Cypher query optimization |
CHUNK_OF (DocumentChunk → DocumentAsset)¶
Connects each chunk to its source artifact for traceability and deterministic rehydration.
| Property | Type | Description |
|---|---|---|
| order_index | integer | Stable chunk ordering within a source document |
| chunking_version | string | Versioned chunking strategy used for the split |
IN_COMMUNITY (Service/DecisionNode/DocumentChunk → KnowledgeCommunity)¶
Connects heterogeneous entities to automatically discovered Leiden communities.
| Property | Type | Description |
|---|---|---|
| membership_score | float | Strength of membership in the community |
| assigned_by | string | gds-leiden or llm-community-curator |
| assigned_at | datetime | Assignment timestamp |
NEEDS_ATTENTION (DocumentAsset → Developer/Team)¶
Represents governance-driven delegation for stale, outdated, or incomplete artifacts.
| Property | Type | Description |
|---|---|---|
| reason | string | stale, policy_gap, outdated, incomplete, conflict |
| requested_at | datetime | When verification/update was requested |
| sla_due_at | datetime | Resolution target timestamp |
| status | string | open, acknowledged, resolved, escalated |
The Two Graph Layers¶
The UMKB maintains two structurally distinct representations of the system simultaneously. The entire value proposition of Substrate depends on the separation and continuous comparison of these two layers.
The Intended Graph¶
The Intended Graph is what the organization has declared the system should look like. It contains:
- Policies: OPA/Rego rules expressed as graph constraints (e.g., "no service in the payments domain may call auth directly without going through the gateway")
- ADRs: Architectural Decision Records ingested as DecisionNode nodes with WHY edges to the services and policies they govern
- Golden Paths: Declared preferred patterns for service-to-service communication, deployment topology, and data access
- Desired topology: Declared DEPENDS_ON and HOSTS relationships representing the intended architecture
- Structural constraints: Explicit rules about which service-to-service calls are permitted, which dependencies are approved, and what the correct ownership model is
- Approved exceptions: ExceptionNode records that explicitly acknowledge known violations with bounded expiry dates
Updated by: Architecture team via policy authoring UI (CodeMirror 6 Rego editor); ADR ingestion pipeline on git push webhook; Governance Service on exception approval; manual Architect-role graph mutations via FastAPI gateway.
The Observed Graph¶
The Observed Graph is what the system actually looks like right now. It contains:
- Runtime services: Service nodes discovered via SSH inspection, Kubernetes API, and service registry
- Live dependencies: ACTUALLY_CALLS edges derived from SSH-captured network state and service mesh traces
- Deployed infrastructure: InfraResource nodes from Terraform state files and Kubernetes resource queries
- PR deltas: Function and Module node changes ingested on every PR open event via the Rust AST CLI
- SSH-verified host state: Port mappings, running process list, and declared vs observed state diffs
- Project signals: SprintNode and IntentAssertion nodes from GitHub Projects v2 and Jira
Updated by: Ingestion Service (GitHub, Terraform, Kubernetes, Jira connectors); SSH Runtime Connector (15-minute scheduled + on-demand); Celery workers processing NATS JetStream substrate.ingestion.> events.
The Drift¶
The Drift is the measurable, continuously computed delta between the Intended Graph and the Observed Graph. It is not a binary "compliant/non-compliant" flag — it is a spectrum of tension scores, violation counts, and confidence-weighted discrepancies.
Every governance check, structural tension score, proactive alert, and simulation result is derived from comparing these two layers. The Drift is computed in two forms:
Structural Tension Score: Computed for every observed edge immediately after ingestion.
```
tension_score = |intended_weight - observed_weight| / intended_weight
```
Stored as a float on the edge and aggregated per domain for dashboard display. A service may have structural tension without a policy violation (it diverged from the preferred pattern but not from a hard rule).
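The per-edge computation can be sketched in Python. The document does not specify how a zero intended weight is handled (an edge that was never intended at all), so the clamp below is an assumption:

```python
def tension_score(intended_weight: float, observed_weight: float) -> float:
    """Structural tension: relative deviation of observed from intended weight.

    Assumption (not specified in this document): when intended_weight is 0,
    any observed traffic on the edge is treated as maximal tension.
    """
    if intended_weight == 0:
        return 1.0 if observed_weight > 0 else 0.0
    return abs(intended_weight - observed_weight) / intended_weight
```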
Policy Violation: A boolean result from OPA/Rego evaluation against the Observed Graph serialized as JSON. Violations are the subset of Drift that exceeds an explicit policy rule. A service can violate a policy without high structural tension (a new forbidden edge was added that did not substantially change the overall graph structure).
Institutional Memory Graph Layer¶
The Memory Graph is a third layer overlaid on both the Intended and Observed Graphs. It captures the organizational knowledge that explains why things are the way they are — decisions made, mistakes learned from, rationale embedded in PR reviews, and exceptions deliberately approved.
Without the Memory Graph, Substrate can detect drift but cannot explain it. With the Memory Graph, every violation comment can cite the ADR that established the rule, every blast radius analysis can cite the post-mortem that established why the dependency is dangerous, and every policy exception is traceable to the Architect who approved it and when it expires.
| Memory Type | Source | Node Type | Linked To | Captured By |
|---|---|---|---|---|
| ADR | GitHub repo / Confluence | DecisionNode | Service nodes, Policy nodes | Ingestion Service on git push webhook (ADR commit) |
| Post-mortem lesson | Confluence / GitHub Pages | FailurePattern | Affected service nodes | Ingestion Service on Confluence webhook or GitHub Pages build |
| Design rationale | PR review comments | MemoryNode | Function nodes, Module nodes | Ingestion Service on PR merge event |
| Policy exception | Governance Service approval flow | ExceptionNode | Policy node, violating Service node | Governance Service on Architect approval action |
| Sprint retro insight | Jira sprint close / GitHub Projects iteration close | SprintInsight | Domain node, SprintNode | Proactive Maintenance Service on sprint close webhook |
| Informal decision | Slack keyword trigger (configurable patterns) | IntentAssertion | Service node | Ingestion Service channel watch (v1.1) |
| Tribal knowledge gap | Proactive scanner — services with no WHY edges | MemoryGap flag | Service node | Proactive Maintenance Service nightly scan (PRO-UC-06) |
Entity Resolution¶
Entity resolution is the most technically challenging component of the ingestion pipeline. The same service will appear under different names across different source systems:
- GitHub repository name: `payment-service`
- Terraform resource label: `srv-payment`
- Kubernetes service name: `payments`
- Jira component name: `Payment Service`
- Slack channel reference: `#payment`
- CODEOWNERS entry: `service-payments`
Without canonicalization, the UMKB becomes a fragmented collection of duplicate nodes that cannot be reliably queried or compared. Entity resolution is what makes the UMKB a single coherent graph rather than a multi-source data dump.
Resolution Strategy¶
Substrate uses the Dense resolve-lora LoRA adapter fine-tuned against real-world service naming conventions to perform entity resolution. The adapter is applied to all inbound entity names during the nightly resolution pass (ING-UC-09, Celery beat at 2am) and at ingestion time for high-confidence cases.
The resolution pipeline:
1. Tokenization normalization: strip common prefixes and suffixes (`srv-`, `-service`, `-svc`); normalize separators (hyphens, underscores, camelCase to space-separated); lowercase all tokens
2. Embedding similarity: bge-m3 embedding of the normalized name against all existing canonical node names stored in PostgreSQL pgvector; cosine similarity above 0.92 triggers auto-merge
3. LLM resolution: for the similarity range 0.70–0.92, Dense resolve-lora classifies same-entity vs. different-entity with a confidence score
4. Human escalation: below 0.70 similarity, or when resolver confidence falls below 0.70, the entity candidate is queued in the Verification Queue for human review
5. Canonical ID assignment: on resolution, all source-system identifiers are stored as alias properties on the canonical node; subsequent lookups resolve immediately without re-running the pipeline
Ownership Schema¶
The ownership model uses a three-tier graph structure that supports both direct and inherited ownership:
```cypher
(:Developer)-[:MEMBER_OF]->(:Team)-[:OWNS]->(:Service)
(:Team)-[:CHILD_OF]->(:Team)
```
Transitive ownership queries traverse the full organizational hierarchy:
```cypher
MATCH (t:Team {name: "Engineering"})<-[:CHILD_OF*0..]-(sub:Team)-[:OWNS]->(s:Service)
RETURN sub.name AS owning_team, s.name AS service
```
This enables key-person risk detection: when a Developer node is deactivated via SCIM, Substrate traverses all OWNS edges (direct and team-inherited) to identify services that have lost their sole or primary owner. These services are immediately flagged CRITICAL in the Verification Queue.
Chunking and Vectorization Strategy¶
The UMKB receives data from heterogeneous tools and human inputs. A single chunking profile is insufficient. Substrate applies source-aware chunk profiles before vectorization:
| Input Type | Chunk Profile | Default Window | Overlap | Output |
|---|---|---|---|---|
| Source code and configs | `code_ast` | function/class/module boundaries, max 350 tokens | 40 tokens | DocumentChunk + code entity edges |
| ADRs, RFCs, architecture docs | `adr_section` | heading-based, max 700 tokens | 120 tokens | rationale chunks with WHY and PREVENTED_BY links |
| Jira/GitHub issue text, epics, user stories | `ticket_thread` | semantic paragraph windows, max 450 tokens | 80 tokens | intent and requirement chunks |
| PR comments, commit messages, chat decisions | `conversation_window` | turn-group windows, max 300 tokens | 60 tokens | memory and intent chunks |
| Runtime logs and telemetry summaries | `runtime_window` | time-sliced windows, max 500 tokens | 50 tokens | drift and anomaly chunks |
Each chunk is embedded with bge-m3 and stored in PostgreSQL (pgvector HNSW index). PostgreSQL remains the relational system of record for chunk metadata, embeddings, lifecycle labels, and policy/audit traces.
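A minimal sliding-window chunker illustrating the window/overlap mechanics the profiles share; the production profiles split on structure (AST nodes, headings, conversation turns), not raw token counts:

```python
def chunk_tokens(tokens: list, max_tokens: int, overlap: int) -> list:
    """Split a token sequence into windows of at most max_tokens,
    where consecutive windows share `overlap` tokens of context."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than the window")
    chunks, start, step = [], 0, max_tokens - overlap
    while start < len(tokens):
        chunks.append(tokens[start : start + max_tokens])
        if start + max_tokens >= len(tokens):
            break
        start += step
    return chunks
```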
Automatic Community Construction¶
When prerequisite graph density is reached, communities and their architectural subgraphs are generated automatically:
- Build projected graph with services, ADRs, incidents, source chunks, sprint artifacts, epics, and runtime signals.
- Run Leiden community detection (Neo4j GDS) to produce dense clusters.
- Ask the LLM community curator to produce:
- cluster summary,
- candidate boundary name,
- missing edge recommendations,
- policy gap hypotheses.
- Persist
KnowledgeCommunitynodes andIN_COMMUNITYedges. - Re-run community refresh incrementally as new entities arrive.
This clustering brings ADRs, source code, runtime reality, project documentation, sprint boards, user stories, and epics into explainable groups for retrieval, simulation, and governance.
Policies can be authored manually by architects or generated as draft policies automatically from repeated violation patterns and post-mortem lessons. Auto-generated policies always pass through a human approval gate before activation.
Confidence Scoring Model¶
Every node and edge in the UMKB carries a confidence score. Confidence is not optional — it is a first-class schema property that governs whether data is included in the graph, whether it requires human verification, and how much weight it carries in blast radius calculations and tension scores.
Source Trust Weights¶
| Source Type | Confidence Weight | Rationale |
|---|---|---|
| Code analysis (AST, static analysis) | 0.95 | Deterministic; directly derived from source of truth |
| CI/CD data (test results, build metadata) | 0.90 | Automated, reliable, low noise |
| Infrastructure state (Terraform, K8s API) | 0.85 | Declarative sources; occasionally stale between applies |
| Documentation (ADRs, READMEs, Confluence) | 0.70 | Human-authored; may be outdated |
| PR review comments | 0.60 | Contextual but informal; requires extraction model |
| Slack conversations | 0.30 | High noise; valuable only when explicitly flagged by keyword trigger |
Confidence Thresholds¶
| Range | Action |
|---|---|
| ≥ 0.90 | Auto-accepted; written to graph without review; verification_status = Verified |
| 0.60–0.89 | Written to graph with verification_status = Unverified; queued for periodic automated re-check |
| 0.50–0.59 | Written to graph with verification_status = Unverified; queued for human review in Verification Queue |
| < 0.50 | Rejected; not written to graph; logged to PostgreSQL audit table with rejection reason and source |
The minimum confidence threshold of 0.50 is derived from the Diffby knowledge graph inclusion model, which established that below-0.50 confidence produces more noise than signal in downstream queries.
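The thresholds table translates directly into routing logic. A sketch follows; the return shape and queue names are illustrative, not the actual service contract:

```python
def route_by_confidence(confidence: float) -> dict:
    """Map an extraction confidence score to the ingestion action
    defined in the confidence thresholds table."""
    if confidence >= 0.90:
        return {"write": True, "status": "Verified", "queue": None}
    if confidence >= 0.60:
        # Written but queued for periodic automated re-check
        return {"write": True, "status": "Unverified", "queue": "auto_recheck"}
    if confidence >= 0.50:
        # Written but queued for human review
        return {"write": True, "status": "Unverified", "queue": "verification_queue"}
    # Rejected; logged to the audit table with a rejection reason
    return {"write": False, "status": None, "queue": "audit_reject"}
```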
Node and Edge Confidence Schema¶
Every node and edge in the UMKB carries the following confidence-related properties:
| Property | Type | Description |
|---|---|---|
| confidence | float (0.0–1.0) | Confidence score at time of extraction |
| source | string | Originating data source identifier |
| extraction_timestamp | datetime | When this node or edge was first created |
| last_verification_timestamp | datetime | When this node or edge was last confirmed accurate |
| verification_status | enum | Verified / Unverified / Disputed / Stale / Deprecated |
Stale is set automatically when last_verification_timestamp exceeds the configured staleness threshold: 30 days for documentation sources, 7 days for infrastructure sources, 1 day for SSH-verified runtime sources.
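A sketch of the staleness check, with the three thresholds from the text hard-coded (in the real system they are configurable):

```python
from datetime import datetime, timedelta, timezone

# Staleness thresholds per source class, taken from the text above.
STALENESS_DAYS = {"documentation": 30, "infrastructure": 7, "runtime": 1}

def is_stale(source_class, last_verification, now=None):
    """True when last_verification_timestamp exceeds the configured
    staleness threshold for the node's source class."""
    now = now or datetime.now(timezone.utc)
    return now - last_verification > timedelta(days=STALENESS_DAYS[source_class])
```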
Disputed is set when two conflicting sources provide different values for the same property, or when a human reviewer explicitly marks the data as inaccurate pending resolution. Disputed nodes appear in the Verification Queue with both conflicting values shown side-by-side.
Graph Query Patterns¶
Multi-hop Dependency Traversal (RSN-UC-01)¶
```cypher
MATCH (s:Service {name: $service_name})-[:DEPENDS_ON*1..5]->(dep:Service)
WHERE dep.confidence >= 0.60
RETURN dep.name, dep.domain, dep.tension_score
ORDER BY dep.tension_score DESC
```
Blast Radius (GOV-UC-04)¶
```cypher
MATCH (target:Service {name: $target})<-[:DEPENDS_ON|ACTUALLY_CALLS*1..3]-(caller:Service)
RETURN caller.name, caller.domain, caller.page_rank
ORDER BY caller.page_rank DESC
```
Key-Person Risk Detection (PRO-UC-09)¶
```cypher
MATCH (d:Developer {active: false})-[:OWNS]->(s:Service)
WHERE NOT EXISTS {
  MATCH (other:Developer {active: true})-[:OWNS]->(s)
}
RETURN s.name AS orphaned_service, d.github_handle AS departing_owner
```
Transitive Team Ownership (Entity Resolution)¶
```cypher
MATCH (t:Team {name: "Engineering"})<-[:CHILD_OF*0..]-(sub:Team)-[:OWNS]->(s:Service)
RETURN sub.name AS owning_team, s.name AS service
```
ADR Gap Detection (PRO-UC-06)¶
```cypher
MATCH (s:Service)
WHERE NOT EXISTS {
  MATCH (d:DecisionNode)-[:WHY]->(s)
}
AND s.confidence >= 0.60
RETURN s.name, s.domain, s.tension_score
ORDER BY s.tension_score DESC
```
Data Lifecycle¶
Snapshot Strategy¶
PostgreSQL stores timestamped graph snapshots as serialized JSON in a partitioned table managed by pg_partman. Snapshots are captured:
- On every PR merge (pre- and post-merge state)
- On every Terraform apply (pre- and post-apply state)
- Nightly at 2am (baseline snapshot for drift trend tracking)
- On demand via FastAPI gateway endpoint (for simulation setup)
Snapshots power the temporal query capability (RSN-UC-06: "What changed before last Friday's incident?") and the diff view in the PR Check Detail UI screen.
Retention Policy¶
| Data Type | Retention | Storage |
|---|---|---|
| Graph snapshots | 90 days | PostgreSQL (pg_partman partitioned by month) |
| Audit log events | 2 years | PostgreSQL (append-only, no DELETE privilege for app user) |
| Node embeddings | Indefinite (versioned) | PostgreSQL (pgvector HNSW index) |
| Drift scores | 90 days | PostgreSQL (time-series partitioned) |
| NATS stream events | 7 days | NATS JetStream (configurable per stream) |
| Redis subgraph cache | TTL-based per query type | Redis (AOF for job queue durability) |
Cache Invalidation¶
When the Ingestion Service writes a graph update, it publishes a substrate.cache.invalidate event to NATS. The Cache Service subscribes to this event and issues TTL resets or explicit DEL commands to the Redis subgraph cache for all affected query patterns. This prevents stale cache responses from masking newly ingested drift without requiring full cache flushes on every write.
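A sketch of the Cache Service's invalidation step: the NATS subject (`substrate.cache.invalidate`) comes from the text above, while the Redis key naming scheme (`subgraph:<pattern>:<node>`) and the query-pattern list are assumed conventions:

```python
# Query patterns whose cached subgraphs may reference a changed node.
# These names are illustrative placeholders.
QUERY_PATTERNS = ("dependency_traversal", "blast_radius", "ownership")

def keys_to_invalidate(event: dict) -> list[str]:
    """Map one substrate.cache.invalidate event to the affected
    subgraph cache keys, so only touched entries are deleted
    instead of flushing the whole cache on every write."""
    return [
        f"subgraph:{pattern}:{node_id}"
        for node_id in event.get("affected_nodes", [])
        for pattern in QUERY_PATTERNS
    ]

# With a redis-py client, the subscriber callback body would be roughly:
#   keys = keys_to_invalidate(json.loads(msg.data))
#   if keys:
#       r.delete(*keys)
```

Keeping the key computation pure makes it easy to unit-test the invalidation scope without a live Redis or NATS connection.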
Document Lifecycle Governance¶
All role-contributed artifacts are lifecycle-labeled after LLM inspection and deterministic freshness checks. The same item can move between states over time as new evidence appears.
Lifecycle State Model¶
stateDiagram-v2
[*] --> latest : first validated version
latest --> active : superseded but still referenced
active --> stale : freshness threshold breached
stale --> outdated : contradicted by newer code/runtime evidence
outdated --> incomplete : missing required policy/traceability fields
incomplete --> needs_verification : delegation to responsible user
needs_verification --> curated_update : user responds and evidence supplied
curated_update --> latest : LLM formats and applies validated update
stale --> archived : de-prioritized from active reasoning
archived --> oldest_snapshot : retained for history/audit only
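The diagram above can be encoded as a transition table; a guard like this keeps writers from producing lifecycle labels the state model does not allow (the function name is illustrative):

```python
# Allowed lifecycle transitions, transcribed from the state diagram above.
TRANSITIONS = {
    "latest": {"active"},
    "active": {"stale"},
    "stale": {"outdated", "archived"},
    "outdated": {"incomplete"},
    "incomplete": {"needs_verification"},
    "needs_verification": {"curated_update"},
    "curated_update": {"latest"},
    "archived": {"oldest_snapshot"},
    "oldest_snapshot": set(),  # terminal: history/audit only
}

def assert_transition(current: str, target: str) -> None:
    """Reject lifecycle updates the state model does not permit."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(
            f"illegal lifecycle transition: {current} -> {target}")
```

Note that `stale` is the only branching state: it can either escalate toward remediation (`outdated`) or drop out of active reasoning (`archived`).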
Governance Delegation Workflow (UML Sequence)¶
sequenceDiagram
participant Connector as Source Connector
participant ING as Ingestion Service
participant CH as Chunker/Embedder
participant KB as UMKB (Neo4j + PostgreSQL)
participant GOV as Governance Service
participant PRO as Proactive Maintenance
participant USER as Responsible User
participant CUR as Curator LLM
Connector->>ING: New/updated artifact event
ING->>CH: Select chunk profile and split content
CH->>KB: Write chunks, embeddings, and graph links
GOV->>KB: Evaluate active policies on changed scope
PRO->>KB: Inspect lifecycle state and completeness
PRO->>USER: Verification/update request with SLA
USER-->>PRO: Response and supporting context
PRO->>CUR: Normalize and format accepted response
CUR->>KB: Update affected docs/files and lifecycle labels
KB-->>GOV: Re-run impacted policy checks
This workflow operationalizes active governance: it not only detects issues but routes them to the correct owner and closes the loop with policy-aligned updates.
Functional Requirements (Cross-Cutting UMKB)¶
| ID | Requirement | Priority |
|---|---|---|
| KB-01 | Store relational metadata, policy data, audit records, and vectors in PostgreSQL 16 with pgvector HNSW indexes | Must Have |
| KB-02 | Store graph entities, relationships, and community topology in Neo4j 5.x | Must Have |
| KB-03 | Apply source-aware chunking profiles for code, ADRs, tickets, conversations, and runtime data before embedding | Must Have |
| KB-04 | Label every DocumentAsset with lifecycle state (latest, active, stale, outdated, incomplete, archived, oldest_snapshot) after LLM + deterministic checks | Must Have |
| KB-05 | Run Leiden community detection and auto-create KnowledgeCommunity nodes when minimum graph density is reached | Must Have |
| KB-06 | Generate community summaries and candidate policy gaps from LLM community curator output | Must Have |
| KB-07 | Support manual policy authoring and automatic draft policy generation from repeated incident/violation motifs | Must Have |
| KB-08 | Require human approval before activating any auto-generated policy | Must Have |
| KB-09 | Detect stale/outdated/incomplete artifacts and create NEEDS_ATTENTION delegation edges to the accountable role | Must Have |
| KB-10 | On user response, curate and format updates before applying document/file changes and re-running policy checks | Must Have |
| KB-11 | Track full lifecycle transitions and remediation actions in append-only audit logs | Must Have |
| KB-12 | Provide time-windowed retrieval of latest, archived, and oldest snapshots for audit and simulation replay | Must Have |