Unified Multimodal Knowledge Base (UMKB) — Design Reference

Overview

The UMKB is the central artifact of the Substrate platform. Every feature — governance checks, NL queries, simulation, proactive alerts, sprint retro insights — derives its value from the fidelity and completeness of what is stored here. The UMKB is not a reporting database; it is a living, continuously updated representation of an engineering organization's architecture, intent, and institutional memory.

The UMKB is built on four purpose-selected data stores, each serving a distinct and non-overlapping role. Nothing is deployed out of convenience or familiarity — every technology choice is justified by a specific capability requirement.


Core Problems Solved

Substrate's UMKB exists to solve three failures that repeatedly break engineering organizations:

  1. Structural drift: The runtime architecture diverges from intended design and policy boundaries.
  2. Memory loss: Decision rationale, incident lessons, and tacit implementation context decay or disappear.
  3. Active governance gap: Policies exist as static documents but are not continuously enforced, validated, or used to proactively maintain docs and files.

The third problem is treated as a first-class capability, not a reporting add-on. The UMKB continuously labels artifacts, checks policy compliance, and routes remediation work to responsible users.


Database Selection

| Database | Role | Why This Tool |
|---|---|---|
| Neo4j 5.x | Observed, Intended, and Memory Graphs | Native graph storage; Cypher query language; GDS library provides Leiden community detection, PageRank, betweenness centrality, and weakly connected components; named graph sandboxing enables simulation via CREATE DATABASE IF NOT EXISTS without touching production data |
| PostgreSQL 16 | Policy store; node embeddings; drift scores; audit log; timestamped graph snapshots | pgvector extension with HNSW index enables sub-millisecond ANN search over node embeddings; pg_partman handles time-series partitioning of drift scores and audit events; append-only audit table enforces immutability at the schema level |
| Redis 7 | Job queue broker; vLLM prefix cache; hot subgraph cache; distributed locks | Sub-millisecond read latency; TTL-based cache eviction aligned with graph update events; SET NX EX for distributed locking across Celery workers; AOF persistence ensures job queue survives process restart |
| NATS JetStream | Event bus; at-least-once delivery; subject-based routing | Stream replay enables event-sourced graph rebuilds after Ingestion Service crash; substrate.> subject hierarchy allows fine-grained subscription by service type; consumer group load balancing across multiple Ingestion workers |
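The substrate.> subject hierarchy relies on NATS wildcard semantics: `*` matches exactly one dot-separated token, while `>` matches one or more trailing tokens. The matching is performed by the NATS server itself; the following is only a minimal Python sketch of those rules for illustration:

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style subject matching.

    '*' matches exactly one dot-separated token;
    '>' matches one or more trailing tokens.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must match at least one remaining token
            return len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p not in ("*", s_tokens[i]):
            return False
    return len(p_tokens) == len(s_tokens)
```

For example, a consumer subscribed to `substrate.ingestion.*` receives `substrate.ingestion.github` events but not `substrate.governance.violation` events.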

Why Not a Single Database

Running four specialized stores instead of one is not a design flaw; it is a deliberate separation of concerns. Neo4j is irreplaceable for multi-hop graph traversals that would require recursive CTEs in SQL. PostgreSQL is irreplaceable for ACID-compliant transactional writes, pgvector HNSW index performance at scale, and the immutable append-only audit log. Redis is irreplaceable for sub-millisecond hot-path reads and Celery job coordination. NATS is irreplaceable for reliable event delivery with replay. Each technology does one thing exceptionally well. The integration overhead is bounded and predictable.


UML Component View

classDiagram
direction LR

class IngestionService
class GovernanceService
class ProactiveMaintenanceService
class ReasoningService
class PolicyEngine
class Neo4j
class PostgreSQL
class Redis
class NATS

IngestionService --> NATS : publishes ingestion events
IngestionService --> Neo4j : writes entities and edges
IngestionService --> PostgreSQL : writes relational metadata and pgvector embeddings
IngestionService --> Redis : dedup locks and cache keys

GovernanceService --> PolicyEngine : evaluates policy packs
GovernanceService --> Neo4j : reads graph deltas
GovernanceService --> PostgreSQL : writes violations, exceptions, audit

ProactiveMaintenanceService --> Neo4j : reads gaps and stale links
ProactiveMaintenanceService --> PostgreSQL : writes lifecycle labels and queue state
ProactiveMaintenanceService --> NATS : publishes proactive tasks

ReasoningService --> Neo4j : graph retrieval and Leiden communities
ReasoningService --> PostgreSQL : semantic search via pgvector
ReasoningService --> NATS : emits result events

Neo4j Graph Schema

Node Labels

Service

The primary unit of architectural concern. Every discovered or declared service in the organization maps to a Service node.

| Property | Type | Description |
|---|---|---|
| domain | string | Bounded context this service belongs to (e.g., "payments", "auth") |
| api_type | string | REST, gRPC, GraphQL, event-driven, internal |
| test_coverage | float | Current test coverage percentage (0.0–1.0) |
| page_rank | float | GDS PageRank score; updated nightly |
| betweenness | float | GDS betweenness centrality score; updated nightly |
| tension_score | float | Current structural tension against Intended Graph |
| confidence | float | Confidence that this node represents a canonical service |
| source | string | Originating data source (github, terraform, kubernetes, ssh) |
| verification_status | enum | Verified / Unverified / Disputed / Stale / Deprecated |
| last_verification_timestamp | datetime | When this node was last confirmed accurate |
| extraction_timestamp | datetime | When this node was first created |

Function / Module

Represents a discrete code unit: a function, class, or module within a service. Populated by AST parsing via the Rust CLI during PR ingestion.

| Property | Type | Description |
|---|---|---|
| signature | string | Fully qualified function or module name |
| file | string | Repository-relative file path |
| line | integer | Line number within file |
| hash | string | SHA-256 of the function body; change detection |
| confidence | float | Confidence in extracted metadata |
| source | string | Originating repository + commit |

InfraResource

Represents a declared or observed infrastructure component: VM, container, load balancer, database, message queue.

| Property | Type | Description |
|---|---|---|
| type | string | ec2, rds, k8s_pod, k8s_service, load_balancer, s3, etc. |
| provider | string | aws, gcp, azure, bare_metal, on_prem |
| state | string | running, stopped, pending, drifted, undeclared |
| region | string | Deployment region or availability zone |
| declared_ports | list[integer] | Ports declared in Terraform / K8s spec |
| observed_ports | list[integer] | Ports discovered via SSH inspection |
| confidence | float | Confidence in current state accuracy |

DecisionNode

Represents a captured architectural decision — ADR, RFC, design review outcome, or approved exception. The institutional memory backbone.

| Property | Type | Description |
|---|---|---|
| rationale | string | Full decision rationale text |
| source_url | string | Canonical link (GitHub ADR file, Confluence page, Jira ticket) |
| confidence | float | Confidence in extraction accuracy |
| verified_at | datetime | When a human confirmed this decision is still active |
| reviewed_at | datetime | Last review timestamp (manual or automated) |
| decision_date | date | When the original decision was made |
| status | string | active, superseded, deprecated |

FailurePattern

Captures post-mortem lessons and incident root causes. Links causally to the services affected and the policies that were (or should have been) in place.

| Property | Type | Description |
|---|---|---|
| incident_date | date | Date of the incident |
| root_cause | string | Summary of the root cause |
| affected_domains | list[string] | Bounded contexts impacted |
| source_url | string | Link to post-mortem document |
| severity | string | P1 / P2 / P3 |
| lessons | string | Extracted lessons verbatim |
| confidence | float | Extraction confidence |

MemoryNode

Captures informal design rationale that does not rise to the level of a formal ADR: PR review comments, Slack design discussions, inline code documentation that explains a non-obvious decision.

| Property | Type | Description |
|---|---|---|
| content | string | Extracted decision or rationale text |
| source_type | string | pr_comment, slack_message, code_comment, readme |
| confidence | float | Confidence in relevance and accuracy |
| extraction_model | string | Model used for extraction (dense-extract, moe-scout) |
| source_url | string | Link to originating content |

ExceptionNode

Represents an approved policy exception: a deliberate, time-bounded acknowledgment that a service is violating a policy for legitimate reasons.

| Property | Type | Description |
|---|---|---|
| rationale | string | Why this exception was approved |
| approved_by | string | Keycloak user ID of approving Architect |
| expires_at | datetime | Mandatory expiry; Substrate re-raises violation after this timestamp |
| policy_id | string | OPA policy pack and rule identifier |
| created_at | datetime | Timestamp of approval |

Developer

Represents a human member of the engineering organization. Linked to Keycloak and SCIM identity.

| Property | Type | Description |
|---|---|---|
| github_handle | string | GitHub username |
| scim_id | string | SCIM external ID from IdP |
| keycloak_id | string | Keycloak user UUID |
| active | boolean | Set false on SCIM deactivation; triggers key-person risk scan |
| name | string | Display name |
| email | string | Primary email |

Team

Represents an engineering team. Supports hierarchical ownership via CHILD_OF edges.

| Property | Type | Description |
|---|---|---|
| name | string | Canonical team name |
| child_of | string | Parent team name (denormalized for fast lookup) |
| owns_services | list[string] | Denormalized list for quick key-person risk queries |

SprintNode

Represents a sprint or iteration in GitHub Projects v2 or Jira. Linked to structural debt reports generated at sprint close.

| Property | Type | Description |
|---|---|---|
| sprint_id | string | Unique identifier from source system |
| health | string | healthy / at_risk / critical |
| start_date | date | Sprint start |
| end_date | date | Sprint end |
| velocity | integer | Story points completed |
| violation_delta | integer | Change in violation count over sprint |
| debt_score | float | Aggregate structural debt score at sprint close |

IntentAssertion

Represents a captured engineering intent: what a developer declared they were building, extracted from a Jira ticket, GitHub Projects item, or PR description. Used for intent mismatch detection against actual code changes.

| Property | Type | Description |
|---|---|---|
| linked_ticket | string | Jira or GitHub Projects item ID |
| intent_embedding | vector(1024) | bge-m3 embedding of intent text; stored in PostgreSQL pgvector table |
| confidence | float | Confidence that this accurately captures intent |
| source_text | string | Original intent description |

DocumentAsset

Represents any user-contributed artifact: ADRs, design docs, epics, user stories, source files, PR comments, runbooks, sprint notes, and policy documents.

| Property | Type | Description |
|---|---|---|
| document_id | string | Canonical document identifier across source systems |
| source_system | string | github, jira, confluence, slack, git, runtime, etc. |
| source_path | string | Path, URL, or tool-native locator |
| lifecycle_state | enum | latest, active, stale, outdated, incomplete, archived, oldest_snapshot |
| llm_labeled_at | datetime | Last LLM-based lifecycle labeling timestamp |
| owner_ref | string | Developer, Team, or role accountable for updates |
| confidence | float | Confidence in extracted metadata and state label |

DocumentChunk

Represents a chunked segment of a DocumentAsset, optimized for vector search and graph linking.

| Property | Type | Description |
|---|---|---|
| chunk_id | string | Stable ID for deduplication and updates |
| chunk_profile | string | code_ast, adr_section, ticket_thread, runtime_window, markdown_heading |
| token_count | integer | Tokenized size of chunk |
| overlap_tokens | integer | Overlap applied to preserve context continuity |
| embedding_ref | string | Foreign key to PostgreSQL pgvector embedding row |
| confidence | float | Confidence in chunk boundaries and extracted entities |

KnowledgeCommunity

Represents an automatically generated Leiden community grouping related architecture and delivery artifacts.

| Property | Type | Description |
|---|---|---|
| community_id | string | Stable Leiden partition identifier |
| summary | string | LLM-generated synopsis of the community |
| dominant_domain | string | Dominant bounded context represented |
| refresh_timestamp | datetime | Last recompute time |
| member_count | integer | Count of nodes/chunks in the community |

Key Edge Types

CALLS (Function → Function)

Represents a runtime or static call relationship between two code functions. Populated by AST analysis during PR ingestion.

| Property | Type | Description |
|---|---|---|
| count | integer | Call frequency from trace data or static analysis |
| last_seen | datetime | Last observed call timestamp |
| confidence | float | Confidence based on analysis method (static vs dynamic) |

DEPENDS_ON (Module → Module)

Package or module-level dependency relationship. Populated from package manifests (package.json, requirements.txt, go.mod, etc.).

| Property | Type | Description |
|---|---|---|
| version | string | Declared version constraint |
| confidence | float | Confidence in dependency accuracy |
| last_verified | datetime | Last verification timestamp |

IMPORTS (Module → Module)

Static import relationship. Distinguishes static from dynamic imports for cycle detection and blast radius analysis.

| Property | Type | Description |
|---|---|---|
| static | boolean | True for compile-time imports |
| dynamic | boolean | True for runtime/conditional imports |

HOSTS (InfraResource → Service)

Declares that an infrastructure resource runs a given service.

| Property | Type | Description |
|---|---|---|
| port | integer | Port on which service is hosted |
| protocol | string | TCP, UDP, HTTP, HTTPS, gRPC |
| declared | boolean | True if declared in Terraform/K8s; false if discovered via SSH inspection |

ACTUALLY_CALLS (Service → Service)

Observed runtime call relationship between services, as distinct from the declared or intended topology. The gap between ACTUALLY_CALLS edges and declared DEPENDS_ON edges is a primary source of structural tension.

| Property | Type | Description |
|---|---|---|
| direct | boolean | Direct HTTP/gRPC call without intermediary |
| via_gateway | boolean | Routed through API gateway |
| verified_at | datetime | Last SSH or trace verification timestamp |

WHY (DecisionNode → Service/Policy)

Links a captured decision to the architectural artifact it explains. The primary edge for answering "why was this decision made?" queries and surfacing ADR references in violation comments.

| Property | Type | Description |
|---|---|---|
| context | string | Relationship context (why this decision applies to this service) |
| rationale_excerpt | string | Verbatim excerpt linking decision to artifact |

OWNS (Developer/Team → Service)

Ownership relationship. Populated from CODEOWNERS files, SCIM team membership, and confidence-weighted heuristics. Critical for key-person risk detection on Developer deactivation.

| Property | Type | Description |
|---|---|---|
| since | datetime | Ownership start date |
| primary | boolean | True for primary owner; false for secondary/on-call |
| confidence | float | Confidence in ownership accuracy |
| last_verified | datetime | Last verification timestamp |

CAUSES (FailurePattern → Service)

Links a documented failure pattern to the service it affected. Used in blast radius analysis and violation explanation generation.

| Property | Type | Description |
|---|---|---|
| severity | string | P1 / P2 / P3 |
| date | date | Date of the causation event |

PREVENTED_BY (Service → Policy)

Declares that a specific policy is in place to prevent a class of failures. Populated on ADR ingestion when a policy is linked to a post-mortem lesson.

| Property | Type | Description |
|---|---|---|
| via | string | Policy pack ID and rule name |
| effective_since | datetime | When the policy became effective |

MEMBER_OF (Developer → Team)

Team membership edge. Created atomically on SCIM POST /Groups event.

| Property | Type | Description |
|---|---|---|
| since | datetime | Membership start date |
| role | string | member, lead, on_call |

CHILD_OF (Team → Team)

Team hierarchy edge. Enables transitive ownership queries across the full organizational tree without denormalization.

| Property | Type | Description |
|---|---|---|
| hierarchy_depth | integer | Depth from root team; used for Cypher query optimization |

CHUNK_OF (DocumentChunk → DocumentAsset)

Connects each chunk to its source artifact for traceability and deterministic rehydration.

| Property | Type | Description |
|---|---|---|
| order_index | integer | Stable chunk ordering within a source document |
| chunking_version | string | Versioned chunking strategy used for the split |

IN_COMMUNITY (Service/DecisionNode/DocumentChunk → KnowledgeCommunity)

Connects heterogeneous entities to automatically discovered Leiden communities.

| Property | Type | Description |
|---|---|---|
| membership_score | float | Strength of membership in the community |
| assigned_by | string | gds-leiden or llm-community-curator |
| assigned_at | datetime | Assignment timestamp |

NEEDS_ATTENTION (DocumentAsset → Developer/Team)

Represents governance-driven delegation for stale, outdated, or incomplete artifacts.

| Property | Type | Description |
|---|---|---|
| reason | string | stale, policy_gap, outdated, incomplete, conflict |
| requested_at | datetime | When verification/update was requested |
| sla_due_at | datetime | Resolution target timestamp |
| status | string | open, acknowledged, resolved, escalated |

The Two Graph Layers

The UMKB maintains two structurally distinct representations of the system simultaneously. The entire value proposition of Substrate depends on the separation and continuous comparison of these two layers.

The Intended Graph

The Intended Graph is what the organization has declared the system should look like. It contains:

  • Policies: OPA/Rego rules expressed as graph constraints (e.g., "no service in the payments domain may call auth directly without going through the gateway")
  • ADRs: Architectural Decision Records ingested as DecisionNode nodes with WHY edges to the services and policies they govern
  • Golden Paths: Declared preferred patterns for service-to-service communication, deployment topology, and data access
  • Desired topology: Declared DEPENDS_ON and HOSTS relationships representing the intended architecture
  • Structural constraints: Explicit rules about which service-to-service calls are permitted, which dependencies are approved, and what the correct ownership model is
  • Approved exceptions: ExceptionNode records that explicitly acknowledge known violations with bounded expiry dates

Updated by: Architecture team via policy authoring UI (CodeMirror 6 Rego editor); ADR ingestion pipeline on git push webhook; Governance Service on exception approval; manual Architect-role graph mutations via FastAPI gateway.

The Observed Graph

The Observed Graph is what the system actually looks like right now. It contains:

  • Runtime services: Service nodes discovered via SSH inspection, Kubernetes API, and service registry
  • Live dependencies: ACTUALLY_CALLS edges derived from SSH-captured network state and service mesh traces
  • Deployed infrastructure: InfraResource nodes from Terraform state files and Kubernetes resource queries
  • PR deltas: Function and Module node changes ingested on every PR open event via the Rust AST CLI
  • SSH-verified host state: Port mappings, running process list, and declared vs observed state diffs
  • Project signals: SprintNode and IntentAssertion nodes from GitHub Projects v2 and Jira

Updated by: Ingestion Service (GitHub, Terraform, Kubernetes, Jira connectors); SSH Runtime Connector (15-minute scheduled + on-demand); Celery workers processing NATS JetStream substrate.ingestion.> events.

The Drift

The Drift is the measurable, continuously computed delta between the Intended Graph and the Observed Graph. It is not a binary "compliant/non-compliant" flag — it is a spectrum of tension scores, violation counts, and confidence-weighted discrepancies.

Every governance check, structural tension score, proactive alert, and simulation result is derived from comparing these two layers. The Drift is computed in two forms:

Structural Tension Score: Computed for every observed edge immediately after ingestion.

tension_score = |intended_weight - observed_weight| / intended_weight

Stored as a float on the edge and aggregated per domain for dashboard display. A service may have structural tension without a policy violation (it diverged from the preferred pattern but not from a hard rule).
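The tension formula can be sketched directly in Python. Note that the handling of a zero intended weight (an edge that exists only in the Observed Graph) is not specified above; returning maximal tension in that case is an illustrative convention, not documented behavior:

```python
def tension_score(intended_weight: float, observed_weight: float) -> float:
    """Normalized divergence of an observed edge weight from its intended weight."""
    if intended_weight == 0:
        # Assumption for illustration: an edge with no intended counterpart
        # is treated as maximally tense rather than dividing by zero.
        return 1.0
    return abs(intended_weight - observed_weight) / intended_weight
```

A fully compliant edge scores 0.0; an observed weight at half or double the intended weight scores 0.5 and 1.0 respectively.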

Policy Violation: A boolean result from OPA/Rego evaluation against the Observed Graph serialized as JSON. Violations are the subset of Drift that exceeds an explicit policy rule. A service can violate a policy without high structural tension (a new forbidden edge was added that did not substantially change the overall graph structure).


Institutional Memory Graph Layer

The Memory Graph is a third layer overlaid on both the Intended and Observed Graphs. It captures the organizational knowledge that explains why things are the way they are — decisions made, mistakes learned from, rationale embedded in PR reviews, and exceptions deliberately approved.

Without the Memory Graph, Substrate can detect drift but cannot explain it. With the Memory Graph, every violation comment can cite the ADR that established the rule, every blast radius analysis can cite the post-mortem that established why the dependency is dangerous, and every policy exception is traceable to the Architect who approved it and when it expires.

| Memory Type | Source | Node Type | Linked To | Captured By |
|---|---|---|---|---|
| ADR | GitHub repo / Confluence | DecisionNode | Service nodes, Policy nodes | Ingestion Service on git push webhook (ADR commit) |
| Post-mortem lesson | Confluence / GitHub Pages | FailurePattern | Affected service nodes | Ingestion Service on Confluence webhook or GitHub Pages build |
| Design rationale | PR review comments | MemoryNode | Function nodes, Module nodes | Ingestion Service on PR merge event |
| Policy exception | Governance Service approval flow | ExceptionNode | Policy node, violating Service node | Governance Service on Architect approval action |
| Sprint retro insight | Jira sprint close / GitHub Projects iteration close | SprintInsight | Domain node, SprintNode | Proactive Maintenance Service on sprint close webhook |
| Informal decision | Slack keyword trigger (configurable patterns) | IntentAssertion | Service node | Ingestion Service channel watch (v1.1) |
| Tribal knowledge gap | Proactive scanner (services with no WHY edges) | MemoryGap flag | Service node | Proactive Maintenance Service nightly scan (PRO-UC-06) |

Entity Resolution

Entity resolution is the most technically challenging component of the ingestion pipeline. The same service will appear under different names across different source systems:

  • GitHub repository name: payment-service
  • Terraform resource label: srv-payment
  • Kubernetes service name: payments
  • Jira component name: Payment Service
  • Slack channel reference: #payment
  • CODEOWNERS entry: service-payments

Without canonicalization, the UMKB becomes a fragmented collection of duplicate nodes that cannot be reliably queried or compared. Entity resolution is what makes the UMKB a single coherent graph rather than a multi-source data dump.

Resolution Strategy

Substrate uses the Dense resolve-lora LoRA adapter fine-tuned against real-world service naming conventions to perform entity resolution. The adapter is applied to all inbound entity names during the nightly resolution pass (ING-UC-09, Celery beat at 2am) and at ingestion time for high-confidence cases.

The resolution pipeline:

  1. Tokenization normalization: Strip common prefixes and suffixes (srv-, -service, -svc), normalize separators (hyphens, underscores, camelCase to space-separated), lowercase all tokens
  2. Embedding similarity: bge-m3 embedding of normalized name against all existing canonical node names stored in PostgreSQL pgvector; cosine similarity above 0.92 triggers auto-merge
  3. LLM resolution: For similarity range 0.70–0.92, Dense resolve-lora classifies same-entity vs different-entity with confidence score
  4. Human escalation: Below 0.70 similarity, or when resolver confidence falls below 0.70, the entity candidate is queued in the Verification Queue for human review
  5. Canonical ID assignment: On resolution, all source-system identifiers are stored as alias properties on the canonical node; subsequent lookups resolve immediately without re-running the pipeline
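The deterministic parts of the pipeline (step 1 normalization, plus the threshold routing of steps 2–4) can be sketched as follows. The prefix/suffix lists are illustrative examples from this section; the bge-m3 similarity and resolve-lora confidence are stubbed as plain parameters, since those models cannot be reproduced here:

```python
import re

# Illustrative affix lists; the real pipeline is configurable.
PREFIXES = ("srv-",)
SUFFIXES = ("-service", "-svc")

def normalize_name(raw: str) -> str:
    """Step 1: strip affixes, split camelCase and separators, lowercase."""
    name = raw.strip().lstrip("#")  # drop Slack channel markers
    for p in PREFIXES:
        if name.startswith(p):
            name = name[len(p):]
    for s in SUFFIXES:
        if name.endswith(s):
            name = name[: -len(s)]
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)  # camelCase -> spaced
    name = re.sub(r"[-_]+", " ", name)                    # separators -> spaces
    return " ".join(name.lower().split())

def route(similarity, resolver_confidence=None):
    """Steps 2-4: auto-merge, LLM resolution, or human escalation."""
    if similarity > 0.92:
        return "auto_merge"
    if similarity >= 0.70:
        if resolver_confidence is not None and resolver_confidence < 0.70:
            return "human_review"
        return "llm_resolve"
    return "human_review"
```

Under these rules, `srv-payment` and `payment-service` both normalize to `payment` before any embedding comparison is needed.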

Ownership Schema

The ownership model uses a three-tier graph structure that supports both direct and inherited ownership:

(:Developer)-[:MEMBER_OF]->(:Team)-[:OWNS]->(:Service)
(:Team)-[:CHILD_OF]->(:Team)

Transitive ownership queries traverse the full organizational hierarchy:

MATCH (t:Team {name: "Engineering"})<-[:CHILD_OF*0..]-(sub:Team)-[:OWNS]->(s:Service)
RETURN sub.name AS owning_team, s.name AS service

This enables key-person risk detection: when a Developer node is deactivated via SCIM, Substrate traverses all OWNS edges (direct and team-inherited) to identify services that have lost their sole or primary owner. These services are immediately flagged CRITICAL in the Verification Queue.


Chunking and Vectorization Strategy

The UMKB receives data from heterogeneous tools and human inputs. A single chunking profile is insufficient. Substrate applies source-aware chunk profiles before vectorization:

| Input Type | Chunk Profile | Default Window | Overlap | Output |
|---|---|---|---|---|
| Source code and configs | code_ast | function/class/module boundaries, max 350 tokens | 40 tokens | DocumentChunk + code entity edges |
| ADRs, RFCs, architecture docs | adr_section | heading-based, max 700 tokens | 120 tokens | rationale chunks with WHY and PREVENTED_BY links |
| Jira/GitHub issue text, epics, user stories | ticket_thread | semantic paragraph windows, max 450 tokens | 80 tokens | intent and requirement chunks |
| PR comments, commit messages, chat decisions | conversation_window | turn-group windows, max 300 tokens | 60 tokens | memory and intent chunks |
| Runtime logs and telemetry summaries | runtime_window | time-sliced windows, max 500 tokens | 50 tokens | drift and anomaly chunks |

Each chunk is embedded with bge-m3 and stored in PostgreSQL (pgvector HNSW index). PostgreSQL remains the relational system of record for chunk metadata, embeddings, lifecycle labels, and policy/audit traces.
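As a simplified illustration of the overlap mechanics only (not the boundary-aware code_ast or adr_section splitters), a fixed sliding window over a pre-tokenized stream can be sketched as:

```python
def window_chunks(tokens, max_tokens, overlap):
    """Split a token stream into fixed-size windows that share `overlap` tokens.

    Illustrative sketch of the windowed profiles (conversation_window,
    runtime_window); real profiles also respect semantic boundaries.
    """
    if max_tokens <= overlap:
        raise ValueError("window must be larger than overlap")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break
    return chunks
```

Each consecutive pair of chunks shares exactly `overlap` tokens, which is what preserves context continuity across chunk boundaries at embedding time.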


Automatic Community Construction

When prerequisite graph density is reached, communities and their architectural subgraphs are generated automatically:

  1. Build projected graph with services, ADRs, incidents, source chunks, sprint artifacts, epics, and runtime signals.
  2. Run Leiden community detection (Neo4j GDS) to produce dense clusters.
  3. Ask the LLM community curator to produce:
     • cluster summary,
     • candidate boundary name,
     • missing edge recommendations,
     • policy gap hypotheses.
  4. Persist KnowledgeCommunity nodes and IN_COMMUNITY edges.
  5. Re-run community refresh incrementally as new entities arrive.

This clustering brings ADRs, source code, runtime reality, project documentation, sprint boards, user stories, and epics into explainable groups for retrieval, simulation, and governance.

Policies can be authored manually by architects or generated as draft policies automatically from repeated violation patterns and post-mortem lessons. Auto-generated policies always pass through a human approval gate before activation.


Confidence Scoring Model

Every node and edge in the UMKB carries a confidence score. Confidence is not optional — it is a first-class schema property that governs whether data is included in the graph, whether it requires human verification, and how much weight it carries in blast radius calculations and tension scores.

Source Trust Weights

| Source Type | Confidence Weight | Rationale |
|---|---|---|
| Code analysis (AST, static analysis) | 0.95 | Deterministic; directly derived from source of truth |
| CI/CD data (test results, build metadata) | 0.90 | Automated, reliable, low noise |
| Infrastructure state (Terraform, K8s API) | 0.85 | Declarative sources; occasionally stale between applies |
| Documentation (ADRs, READMEs, Confluence) | 0.70 | Human-authored; may be outdated |
| PR review comments | 0.60 | Contextual but informal; requires extraction model |
| Slack conversations | 0.30 | High noise; valuable only when explicitly flagged by keyword trigger |

Confidence Thresholds

| Range | Action |
|---|---|
| ≥ 0.90 | Auto-accepted; written to graph without review; verification_status = Verified |
| 0.60–0.89 | Written to graph with verification_status = Unverified; queued for periodic automated re-check |
| 0.50–0.59 | Written to graph with verification_status = Unverified; queued for human review in Verification Queue |
| < 0.50 | Rejected; not written to graph; logged to PostgreSQL audit table with rejection reason and source |

The minimum confidence threshold of 0.50 is derived from the Diffby knowledge graph inclusion model, which established that below-0.50 confidence produces more noise than signal in downstream queries.
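The threshold table translates directly into a routing function. The action names used here (accept, accept_recheck, and so on) are illustrative labels, not Substrate API identifiers:

```python
def ingest_action(confidence):
    """Map an extraction confidence score to (action, verification_status).

    Thresholds follow the confidence table: >= 0.90 auto-accept,
    0.60-0.89 automated re-check, 0.50-0.59 human review, < 0.50 reject.
    """
    if confidence >= 0.90:
        return ("accept", "Verified")
    if confidence >= 0.60:
        return ("accept_recheck", "Unverified")
    if confidence >= 0.50:
        return ("accept_human_review", "Unverified")
    return ("reject", None)
```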

Node and Edge Confidence Schema

Every node and edge in the UMKB carries the following confidence-related properties:

| Property | Type | Description |
|---|---|---|
| confidence | float (0.0–1.0) | Confidence score at time of extraction |
| source | string | Originating data source identifier |
| extraction_timestamp | datetime | When this node or edge was first created |
| last_verification_timestamp | datetime | When this node or edge was last confirmed accurate |
| verification_status | enum | Verified / Unverified / Disputed / Stale / Deprecated |

Stale is set automatically when last_verification_timestamp exceeds the configured staleness threshold: 30 days for documentation sources, 7 days for infrastructure sources, 1 day for SSH-verified runtime sources.

Disputed is set when two conflicting sources provide different values for the same property, or when a human reviewer explicitly marks the data as inaccurate pending resolution. Disputed nodes appear in the Verification Queue with both conflicting values shown side-by-side.
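The staleness rules above reduce to a small deterministic check. The source-class keys used here are illustrative labels for the three threshold classes described above:

```python
from datetime import datetime, timedelta, timezone

# Staleness thresholds per source class, as described above.
STALENESS_THRESHOLDS = {
    "documentation": timedelta(days=30),
    "infrastructure": timedelta(days=7),
    "runtime": timedelta(days=1),       # SSH-verified runtime sources
}

def is_stale(source_class, last_verified, now=None):
    """True when last_verification_timestamp exceeds the class threshold."""
    now = now or datetime.now(timezone.utc)
    return now - last_verified > STALENESS_THRESHOLDS[source_class]
```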


Graph Query Patterns

Multi-hop Dependency Traversal (RSN-UC-01)

MATCH (s:Service {name: $service_name})-[:DEPENDS_ON*1..5]->(dep:Service)
WHERE dep.confidence >= 0.60
RETURN dep.name, dep.domain, dep.tension_score
ORDER BY dep.tension_score DESC

Blast Radius (GOV-UC-04)

MATCH (target:Service {name: $target})<-[:DEPENDS_ON|ACTUALLY_CALLS*1..3]-(caller:Service)
RETURN caller.name, caller.domain, caller.page_rank
ORDER BY caller.page_rank DESC

Key-Person Risk Detection (PRO-UC-09)

MATCH (d:Developer {active: false})-[:OWNS]->(s:Service)
WHERE NOT EXISTS {
  MATCH (other:Developer {active: true})-[:OWNS]->(s)
}
RETURN s.name AS orphaned_service, d.github_handle AS departing_owner

Transitive Team Ownership (Entity Resolution)

MATCH (t:Team {name: "Engineering"})<-[:CHILD_OF*0..]-(sub:Team)-[:OWNS]->(s:Service)
RETURN sub.name AS owning_team, s.name AS service

ADR Gap Detection (PRO-UC-06)

MATCH (s:Service)
WHERE NOT EXISTS {
  MATCH (d:DecisionNode)-[:WHY]->(s)
}
AND s.confidence >= 0.60
RETURN s.name, s.domain, s.tension_score
ORDER BY s.tension_score DESC

Data Lifecycle

Snapshot Strategy

PostgreSQL stores timestamped graph snapshots as serialized JSON in a partitioned table managed by pg_partman. Snapshots are captured:

  • On every PR merge (pre- and post-merge state)
  • On every Terraform apply (pre- and post-apply state)
  • Nightly at 2am (baseline snapshot for drift trend tracking)
  • On demand via FastAPI gateway endpoint (for simulation setup)

Snapshots power the temporal query capability (RSN-UC-06: "What changed before last Friday's incident?") and the diff view in the PR Check Detail UI screen.
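Assuming snapshots are serialized as JSON objects carrying an `edges` list of (source, relationship, target) triples (a hypothetical shape for illustration; the exact serialization is not specified in this document), the pre/post diff backing the PR Check Detail view can be sketched as:

```python
def diff_snapshots(before, after):
    """Diff two serialized graph snapshots into added and removed edges.

    Assumes each snapshot is a dict with an "edges" list of
    [source, relationship_type, target] triples (hypothetical shape).
    """
    b = {tuple(e) for e in before["edges"]}
    a = {tuple(e) for e in after["edges"]}
    return {"added": sorted(a - b), "removed": sorted(b - a)}
```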

Retention Policy

| Data Type | Retention | Storage |
|---|---|---|
| Graph snapshots | 90 days | PostgreSQL (pg_partman partitioned by month) |
| Audit log events | 2 years | PostgreSQL (append-only, no DELETE privilege for app user) |
| Node embeddings | Indefinite (versioned) | PostgreSQL (pgvector HNSW index) |
| Drift scores | 90 days | PostgreSQL (time-series partitioned) |
| NATS stream events | 7 days | NATS JetStream (configurable per stream) |
| Redis subgraph cache | TTL-based per query type | Redis (AOF for job queue durability) |

Cache Invalidation

When the Ingestion Service writes a graph update, it publishes a substrate.cache.invalidate event to NATS. The Cache Service subscribes to this event and issues TTL resets or explicit DEL commands to the Redis subgraph cache for all affected query patterns. This prevents stale cache responses from masking newly ingested drift without requiring full cache flushes on every write.
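The fan-out from an invalidation event to Redis keys can be sketched as follows, assuming a hypothetical `subgraph:<query_type>:<canonical_node>` key scheme (the actual key layout is not specified in this document):

```python
# Hypothetical query-type namespaces for cached subgraph results.
QUERY_TYPES = ("blast_radius", "dependency_traversal", "ownership")

def invalidation_keys(changed_nodes):
    """Expand a substrate.cache.invalidate payload into Redis keys to DEL.

    Assumes cache keys are namespaced as subgraph:<query_type>:<node>.
    """
    return {f"subgraph:{q}:{n}" for q in QUERY_TYPES for n in changed_nodes}
```

The Cache Service would pass the resulting set to a single `DEL` call, touching only the affected query patterns instead of flushing the whole cache.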


Document Lifecycle Governance

All role-contributed artifacts are lifecycle-labeled after LLM inspection and deterministic freshness checks. The same item can move between states over time as new evidence appears.

Lifecycle State Model

stateDiagram-v2
[*] --> latest : first validated version
latest --> active : superseded but still referenced
active --> stale : freshness threshold breached
stale --> outdated : contradicted by newer code/runtime evidence
outdated --> incomplete : missing required policy/traceability fields
incomplete --> needs_verification : delegation to responsible user
needs_verification --> curated_update : user responds and evidence supplied
curated_update --> latest : LLM formats and applies validated update
stale --> archived : de-prioritized from active reasoning
archived --> oldest_snapshot : retained for history/audit only
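The state model above can be encoded as an explicit transition table, which makes illegal transitions testable. This is a direct transcription of the diagram, in Python for illustration:

```python
# Allowed lifecycle transitions, transcribed from the state diagram above.
ALLOWED_TRANSITIONS = {
    "latest": {"active"},
    "active": {"stale"},
    "stale": {"outdated", "archived"},
    "outdated": {"incomplete"},
    "incomplete": {"needs_verification"},
    "needs_verification": {"curated_update"},
    "curated_update": {"latest"},
    "archived": {"oldest_snapshot"},
    "oldest_snapshot": set(),  # terminal: history/audit only
}

def can_transition(current, target):
    """True when the lifecycle state machine permits current -> target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```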

Governance Delegation Workflow (UML Sequence)

sequenceDiagram
participant Connector as Source Connector
participant ING as Ingestion Service
participant CH as Chunker/Embedder
participant KB as UMKB (Neo4j + PostgreSQL)
participant GOV as Governance Service
participant PRO as Proactive Maintenance
participant USER as Responsible User
participant CUR as Curator LLM

Connector->>ING: New/updated artifact event
ING->>CH: Select chunk profile and split content
CH->>KB: Write chunks, embeddings, and graph links
GOV->>KB: Evaluate active policies on changed scope
PRO->>KB: Inspect lifecycle state and completeness
PRO->>USER: Verification/update request with SLA
USER-->>PRO: Response and supporting context
PRO->>CUR: Normalize and format accepted response
CUR->>KB: Update affected docs/files and lifecycle labels
KB-->>GOV: Re-run impacted policy checks

This workflow operationalizes active governance: it not only detects issues, it routes them to the correct owner and closes the loop with policy-aligned updates.


Functional Requirements (Cross-Cutting UMKB)

| ID | Requirement | Priority |
|---|---|---|
| KB-01 | Store relational metadata, policy data, audit records, and vectors in PostgreSQL 16 with pgvector HNSW indexes | Must Have |
| KB-02 | Store graph entities, relationships, and community topology in Neo4j 5.x | Must Have |
| KB-03 | Apply source-aware chunking profiles for code, ADRs, tickets, conversations, and runtime data before embedding | Must Have |
| KB-04 | Label every DocumentAsset with lifecycle state (latest, active, stale, outdated, incomplete, archived, oldest_snapshot) after LLM + deterministic checks | Must Have |
| KB-05 | Run Leiden community detection and auto-create KnowledgeCommunity nodes when minimum graph density is reached | Must Have |
| KB-06 | Generate community summaries and candidate policy gaps from LLM community curator output | Must Have |
| KB-07 | Support manual policy authoring and automatic draft policy generation from repeated incident/violation motifs | Must Have |
| KB-08 | Require human approval before activating any auto-generated policy | Must Have |
| KB-09 | Detect stale/outdated/incomplete artifacts and create NEEDS_ATTENTION delegation edges to the accountable role | Must Have |
| KB-10 | On user response, curate and format updates before applying document/file changes and re-running policy checks | Must Have |
| KB-11 | Track full lifecycle transitions and remediation actions in append-only audit logs | Must Have |
| KB-12 | Provide time-windowed retrieval of latest, archived, and oldest snapshots for audit and simulation replay | Must Have |