Unified Multimodal Knowledge Base (UMKB) — Design Reference¶
Overview¶
The UMKB is the central artifact of the Substrate platform. Every feature — governance checks, NL queries, simulation, proactive alerts, sprint retro insights — derives its value from the fidelity and completeness of what is stored here. The UMKB is not a reporting database; it is a living, continuously updated representation of an engineering organization's architecture, intent, and institutional memory.
The UMKB is built on four purpose-selected data stores, each serving a distinct and non-overlapping role. Nothing is deployed out of convenience or familiarity — every technology choice is justified by a specific capability requirement.
Core Problems Solved¶
Substrate's UMKB exists to solve three failures that repeatedly break engineering organizations:
- Structural drift: The runtime architecture diverges from intended design and policy boundaries.
- Memory loss: Decision rationale, incident lessons, and tacit implementation context decay or disappear.
- Active governance gap: Policies exist as static documents but are not continuously enforced, validated, or used to proactively keep documentation and related artifacts current.
The third problem is treated as a first-class capability, not a reporting add-on. The UMKB continuously labels artifacts, checks policy compliance, and routes remediation work to responsible users.
Database Selection¶
| Database | Role | Why This Tool |
|---|---|---|
| Neo4j 5.x | Observed, Intended, and Memory Graphs | Native graph storage; Cypher query language; GDS library provides Leiden community detection, PageRank, betweenness centrality, and weakly connected components; named graph sandboxing enables simulation via CREATE DATABASE IF NOT EXISTS without touching production data |
| PostgreSQL 16 | Policy store; node embeddings; drift scores; audit log; timestamped graph snapshots | pgvector extension with HNSW index enables sub-millisecond ANN search over node embeddings; pg_partman handles time-series partitioning of drift scores and audit events; append-only audit table enforces immutability at the schema level |
| Redis 7 | Job queue broker; vLLM prefix cache; hot subgraph cache; distributed locks | Sub-millisecond read latency; TTL-based cache eviction aligned with graph update events; SET NX EX for distributed locking across Celery workers; AOF persistence ensures job queue survives process restart |
| NATS JetStream | Event bus; at-least-once delivery; subject-based routing | Stream replay enables event-sourced graph rebuilds after Ingestion Service crash; substrate.> subject hierarchy allows fine-grained subscription by service type; consumer group load balancing across multiple Ingestion workers |
Why Not a Single Database¶
Running four data stores rather than one is not a design flaw; it is a deliberate separation of concerns. Neo4j is irreplaceable for multi-hop graph traversals that would require unwieldy recursive CTEs in SQL. PostgreSQL is irreplaceable for ACID-compliant transactional writes, pgvector HNSW index performance at scale, and the immutable append-only audit log. Redis is irreplaceable for sub-millisecond hot-path reads and Celery job coordination. NATS is irreplaceable for reliable event delivery with replay. Each technology does one thing exceptionally well, and the integration overhead is bounded and predictable.
UML Component View¶
```mermaid
classDiagram
direction LR
class IngestionService
class GovernanceService
class ProactiveMaintenanceService
class ReasoningService
class PolicyEngine
class Neo4j
class PostgreSQL
class Redis
class NATS
IngestionService --> NATS : publishes ingestion events
IngestionService --> Neo4j : writes entities and edges
IngestionService --> PostgreSQL : writes relational metadata and pgvector embeddings
IngestionService --> Redis : dedup locks and cache keys
GovernanceService --> PolicyEngine : evaluates policy packs
GovernanceService --> Neo4j : reads graph deltas
GovernanceService --> PostgreSQL : writes violations, exceptions, audit
ProactiveMaintenanceService --> Neo4j : reads gaps and stale links
ProactiveMaintenanceService --> PostgreSQL : writes lifecycle labels and queue state
ProactiveMaintenanceService --> NATS : publishes proactive tasks
ReasoningService --> Neo4j : graph retrieval and Leiden communities
ReasoningService --> PostgreSQL : semantic search via pgvector
ReasoningService --> NATS : emits result events
```
Neo4j Graph Schema¶
Node Labels¶
Service¶
The primary unit of architectural concern. Every discovered or declared service in the organization maps to a Service node.
| Property | Type | Description |
|---|---|---|
| domain | string | Bounded context this service belongs to (e.g., "payments", "auth") |
| api_type | string | REST, gRPC, GraphQL, event-driven, internal |
| test_coverage | float | Current test coverage percentage (0.0–1.0) |
| page_rank | float | GDS PageRank score; updated nightly |
| betweenness | float | GDS betweenness centrality score; updated nightly |
| tension_score | float | Current structural tension against Intended Graph |
| confidence | float | Confidence that this node represents a canonical service |
| source | string | Originating data source (github, terraform, kubernetes, ssh) |
| verification_status | enum | Verified / Unverified / Disputed / Stale / Deprecated |
| last_verification_timestamp | datetime | When this node was last confirmed accurate |
| extraction_timestamp | datetime | When this node was first created |
Function / Module¶
Represents a discrete code unit: a function, class, or module within a service. Populated by AST parsing via the Rust CLI during PR ingestion.
| Property | Type | Description |
|---|---|---|
| signature | string | Fully qualified function or module name |
| file | string | Repository-relative file path |
| line | integer | Line number within file |
| hash | string | SHA-256 of the function body; change detection |
| confidence | float | Confidence in extracted metadata |
| source | string | Originating repository + commit |
InfraResource¶
Represents a declared or observed infrastructure component: VM, container, load balancer, database, message queue.
| Property | Type | Description |
|---|---|---|
| type | string | ec2, rds, k8s_pod, k8s_service, load_balancer, s3, etc. |
| provider | string | aws, gcp, azure, bare_metal, on_prem |
| state | string | running, stopped, pending, drifted, undeclared |
| region | string | Deployment region or availability zone |
| declared_ports | list[integer] | Ports declared in Terraform / K8s spec |
| observed_ports | list[integer] | Ports discovered via SSH inspection |
| confidence | float | Confidence in current state accuracy |
DecisionNode¶
Represents a captured architectural decision — ADR, RFC, design review outcome, or approved exception. The institutional memory backbone.
| Property | Type | Description |
|---|---|---|
| rationale | string | Full decision rationale text |
| source_url | string | Canonical link (GitHub ADR file, Confluence page, Jira ticket) |
| confidence | float | Confidence in extraction accuracy |
| verified_at | datetime | When a human confirmed this decision is still active |
| reviewed_at | datetime | Last review timestamp (manual or automated) |
| decision_date | date | When the original decision was made |
| status | string | active, superseded, deprecated |
FailurePattern¶
Captures post-mortem lessons and incident root causes. Links causally to the services affected and the policies that were (or should have been) in place.
| Property | Type | Description |
|---|---|---|
| incident_date | date | Date of the incident |
| root_cause | string | Summary of the root cause |
| affected_domains | list[string] | Bounded contexts impacted |
| source_url | string | Link to post-mortem document |
| severity | string | P1 / P2 / P3 |
| lessons | string | Extracted lessons verbatim |
| confidence | float | Extraction confidence |
MemoryNode¶
Captures informal design rationale that does not rise to the level of a formal ADR: PR review comments, Slack design discussions, inline code documentation that explains a non-obvious decision.
| Property | Type | Description |
|---|---|---|
| content | string | Extracted decision or rationale text |
| source_type | string | pr_comment, slack_message, code_comment, readme |
| confidence | float | Confidence in relevance and accuracy |
| extraction_model | string | Model used for extraction (dense-extract, moe-scout) |
| source_url | string | Link to originating content |
ExceptionNode¶
Represents an approved policy exception: a deliberate, time-bounded acknowledgment that a service is violating a policy for legitimate reasons.
| Property | Type | Description |
|---|---|---|
| rationale | string | Why this exception was approved |
| approved_by | string | Keycloak user ID of approving Architect |
| expires_at | datetime | Mandatory expiry; Substrate re-raises violation after this timestamp |
| policy_id | string | OPA policy pack and rule identifier |
| created_at | datetime | Timestamp of approval |
Developer¶
Represents a human member of the engineering organization. Linked to Keycloak and SCIM identity.
| Property | Type | Description |
|---|---|---|
| github_handle | string | GitHub username |
| scim_id | string | SCIM external ID from IdP |
| keycloak_id | string | Keycloak user UUID |
| active | boolean | Set false on SCIM deactivation; triggers key-person risk scan |
| name | string | Display name |
| email | string | Primary email |
Team¶
Represents an engineering team. Supports hierarchical ownership via CHILD_OF edges.
| Property | Type | Description |
|---|---|---|
| name | string | Canonical team name |
| child_of | string | Parent team name (denormalized for fast lookup) |
| owns_services | list[string] | Denormalized list for quick key-person risk queries |
SprintNode¶
Represents a sprint or iteration in GitHub Projects v2 or Jira. Linked to structural debt reports generated at sprint close.
| Property | Type | Description |
|---|---|---|
| sprint_id | string | Unique identifier from source system |
| health | string | healthy / at_risk / critical |
| start_date | date | Sprint start |
| end_date | date | Sprint end |
| velocity | integer | Story points completed |
| violation_delta | integer | Change in violation count over sprint |
| debt_score | float | Aggregate structural debt score at sprint close |
IntentAssertion¶
Represents a captured engineering intent: what a developer declared they were building, extracted from a Jira ticket, GitHub Projects item, or PR description. Used for intent mismatch detection against actual code changes.
| Property | Type | Description |
|---|---|---|
| linked_ticket | string | Jira or GitHub Projects item ID |
| intent_embedding | vector(1024) | bge-m3 embedding of intent text; stored in PostgreSQL pgvector table |
| confidence | float | Confidence that this accurately captures intent |
| source_text | string | Original intent description |
DocumentAsset¶
Represents any user-contributed artifact: ADRs, design docs, epics, user stories, source files, PR comments, runbooks, sprint notes, and policy documents.
| Property | Type | Description |
|---|---|---|
| document_id | string | Canonical document identifier across source systems |
| source_system | string | github, jira, confluence, slack, git, runtime, etc. |
| source_path | string | Path, URL, or tool-native locator |
| lifecycle_state | enum | latest, active, stale, outdated, incomplete, archived, oldest_snapshot |
| llm_labeled_at | datetime | Last LLM-based lifecycle labeling timestamp |
| owner_ref | string | Developer, Team, or role accountable for updates |
| confidence | float | Confidence in extracted metadata and state label |
DocumentChunk¶
Represents a chunked segment of a DocumentAsset, optimized for vector search and graph linking.
| Property | Type | Description |
|---|---|---|
| chunk_id | string | Stable ID for deduplication and updates |
| chunk_profile | string | code_ast, adr_section, ticket_thread, runtime_window, markdown_heading |
| token_count | integer | Tokenized size of chunk |
| overlap_tokens | integer | Overlap applied to preserve context continuity |
| embedding_ref | string | Foreign key to PostgreSQL pgvector embedding row |
| confidence | float | Confidence in chunk boundaries and extracted entities |
KnowledgeCommunity¶
Represents an automatically generated Leiden community grouping related architecture and delivery artifacts.
| Property | Type | Description |
|---|---|---|
| community_id | string | Stable Leiden partition identifier |
| summary | string | LLM-generated synopsis of the community |
| dominant_domain | string | Dominant bounded context represented |
| refresh_timestamp | datetime | Last recompute time |
| member_count | integer | Count of nodes/chunks in the community |
Key Edge Types¶
CALLS (Function → Function)¶
Represents a runtime or static call relationship between two code functions. Populated by AST analysis during PR ingestion.
| Property | Type | Description |
|---|---|---|
| count | integer | Call frequency from trace data or static analysis |
| last_seen | datetime | Last observed call timestamp |
| confidence | float | Confidence based on analysis method (static vs dynamic) |
DEPENDS_ON (Module → Module)¶
Package or module-level dependency relationship. Populated from package manifests (package.json, requirements.txt, go.mod, etc.).
| Property | Type | Description |
|---|---|---|
| version | string | Declared version constraint |
| confidence | float | Confidence in dependency accuracy |
| last_verified | datetime | Last verification timestamp |
IMPORTS (Module → Module)¶
Static import relationship. Distinguishes static from dynamic imports for cycle detection and blast radius analysis.
| Property | Type | Description |
|---|---|---|
| static | boolean | True for compile-time imports |
| dynamic | boolean | True for runtime/conditional imports |
HOSTS (InfraResource → Service)¶
Declares that an infrastructure resource runs a given service.
| Property | Type | Description |
|---|---|---|
| port | integer | Port on which service is hosted |
| protocol | string | TCP, UDP, HTTP, HTTPS, gRPC |
| declared | boolean | True if declared in Terraform/K8s; false if discovered via SSH inspection |
ACTUALLY_CALLS (Service → Service)¶
Observed runtime call relationship between services, as distinct from the declared or intended topology. The gap between ACTUALLY_CALLS edges and declared DEPENDS_ON edges is a primary source of structural tension.
| Property | Type | Description |
|---|---|---|
| direct | boolean | Direct HTTP/gRPC call without intermediary |
| via_gateway | boolean | Routed through API gateway |
| verified_at | datetime | Last SSH or trace verification timestamp |
WHY (DecisionNode → Service/Policy)¶
Links a captured decision to the architectural artifact it explains. The primary edge for answering "why was this decision made?" queries and surfacing ADR references in violation comments.
| Property | Type | Description |
|---|---|---|
| context | string | Relationship context (why this decision applies to this service) |
| rationale_excerpt | string | Verbatim excerpt linking decision to artifact |
OWNS (Developer/Team → Service)¶
Ownership relationship. Populated from CODEOWNERS files, SCIM team membership, and confidence-weighted heuristics. Critical for key-person risk detection on Developer deactivation.
| Property | Type | Description |
|---|---|---|
| since | datetime | Ownership start date |
| primary | boolean | True for primary owner; false for secondary/on-call |
| confidence | float | Confidence in ownership accuracy |
| last_verified | datetime | Last verification timestamp |
CAUSES (FailurePattern → Service)¶
Links a documented failure pattern to the service it affected. Used in blast radius analysis and violation explanation generation.
| Property | Type | Description |
|---|---|---|
| severity | string | P1 / P2 / P3 |
| date | date | Date of the causation event |
PREVENTED_BY (Service → Policy)¶
Declares that a specific policy is in place to prevent a class of failures. Populated on ADR ingestion when a policy is linked to a post-mortem lesson.
| Property | Type | Description |
|---|---|---|
| via | string | Policy pack ID and rule name |
| effective_since | datetime | When the policy became effective |
MEMBER_OF (Developer → Team)¶
Team membership edge. Created atomically on SCIM POST /Groups event.
| Property | Type | Description |
|---|---|---|
| since | datetime | Membership start date |
| role | string | member, lead, on_call |
CHILD_OF (Team → Team)¶
Team hierarchy edge. Enables transitive ownership queries across the full organizational tree without denormalization.
| Property | Type | Description |
|---|---|---|
| hierarchy_depth | integer | Depth from root team; used for Cypher query optimization |
CHUNK_OF (DocumentChunk → DocumentAsset)¶
Connects each chunk to its source artifact for traceability and deterministic rehydration.
| Property | Type | Description |
|---|---|---|
| order_index | integer | Stable chunk ordering within a source document |
| chunking_version | string | Versioned chunking strategy used for the split |
IN_COMMUNITY (Service/DecisionNode/DocumentChunk → KnowledgeCommunity)¶
Connects heterogeneous entities to automatically discovered Leiden communities.
| Property | Type | Description |
|---|---|---|
| membership_score | float | Strength of membership in the community |
| assigned_by | string | gds-leiden or llm-community-curator |
| assigned_at | datetime | Assignment timestamp |
NEEDS_ATTENTION (DocumentAsset → Developer/Team)¶
Represents governance-driven delegation for stale, outdated, or incomplete artifacts.
| Property | Type | Description |
|---|---|---|
| reason | string | stale, policy_gap, outdated, incomplete, conflict |
| requested_at | datetime | When verification/update was requested |
| sla_due_at | datetime | Resolution target timestamp |
| status | string | open, acknowledged, resolved, escalated |
The Two Graph Layers¶
The UMKB maintains two structurally distinct representations of the system simultaneously. The entire value proposition of Substrate depends on the separation and continuous comparison of these two layers.
The Intended Graph¶
The Intended Graph is what the organization has declared the system should look like. It contains:
- Policies: OPA/Rego rules expressed as graph constraints (e.g., "no service in the payments domain may call auth directly without going through the gateway")
- ADRs: Architectural Decision Records ingested as DecisionNode nodes with WHY edges to the services and policies they govern
- Golden Paths: Declared preferred patterns for service-to-service communication, deployment topology, and data access
- Desired topology: Declared DEPENDS_ON and HOSTS relationships representing the intended architecture
- Structural constraints: Explicit rules about which service-to-service calls are permitted, which dependencies are approved, and what the correct ownership model is
- Approved exceptions: ExceptionNode records that explicitly acknowledge known violations with bounded expiry dates
Updated by: Architecture team via policy authoring UI (CodeMirror 6 Rego editor); ADR ingestion pipeline on git push webhook; Governance Service on exception approval; manual Architect-role graph mutations via FastAPI gateway.
The Observed Graph¶
The Observed Graph is what the system actually looks like right now. It contains:
- Runtime services: Service nodes discovered via SSH inspection, Kubernetes API, and service registry
- Live dependencies: ACTUALLY_CALLS edges derived from SSH-captured network state and service mesh traces
- Deployed infrastructure: InfraResource nodes from Terraform state files and Kubernetes resource queries
- PR deltas: Function and Module node changes ingested on every PR open event via the Rust AST CLI
- SSH-verified host state: Port mappings, running process list, and declared vs observed state diffs
- Project signals: SprintNode and IntentAssertion nodes from GitHub Projects v2 and Jira
Updated by: Ingestion Service (GitHub, Terraform, Kubernetes, Jira connectors); SSH Runtime Connector (15-minute scheduled + on-demand); Celery workers processing NATS JetStream substrate.ingestion.> events.
The Drift¶
The Drift is the measurable, continuously computed delta between the Intended Graph and the Observed Graph. It is not a binary "compliant/non-compliant" flag — it is a spectrum of tension scores, violation counts, and confidence-weighted discrepancies.
Every governance check, structural tension score, proactive alert, and simulation result is derived from comparing these two layers. The Drift is computed in two forms:
Structural Tension Score: Computed for every observed edge immediately after ingestion.
```
tension_score = |intended_weight - observed_weight| / intended_weight
```
Stored as a float on the edge and aggregated per domain for dashboard display. A service may have structural tension without a policy violation (it diverged from the preferred pattern but not from a hard rule).
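The per-edge computation can be sketched in Python. The document does not specify how a zero intended weight is handled (an edge that was never intended at all), so the clamp below is an assumption:

```python
def tension_score(intended_weight: float, observed_weight: float) -> float:
    """Structural tension: relative deviation of observed from intended weight.

    Assumption (not specified in this document): when intended_weight is 0,
    any observed traffic on the edge is treated as maximal tension.
    """
    if intended_weight == 0:
        return 1.0 if observed_weight > 0 else 0.0
    return abs(intended_weight - observed_weight) / intended_weight
```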
Policy Violation: A boolean result from OPA/Rego evaluation against the Observed Graph serialized as JSON. Violations are the subset of Drift that exceeds an explicit policy rule. A service can violate a policy without high structural tension (a new forbidden edge was added that did not substantially change the overall graph structure).
Institutional Memory Graph Layer¶
The Memory Graph is a third layer overlaid on both the Intended and Observed Graphs. It captures the organizational knowledge that explains why things are the way they are — decisions made, mistakes learned from, rationale embedded in PR reviews, and exceptions deliberately approved.
Without the Memory Graph, Substrate can detect drift but cannot explain it. With the Memory Graph, every violation comment can cite the ADR that established the rule, every blast radius analysis can cite the post-mortem that established why the dependency is dangerous, and every policy exception is traceable to the Architect who approved it and when it expires.
| Memory Type | Source | Node Type | Linked To | Captured By |
|---|---|---|---|---|
| ADR | GitHub repo / Confluence | DecisionNode | Service nodes, Policy nodes | Ingestion Service on git push webhook (ADR commit) |
| Post-mortem lesson | Confluence / GitHub Pages | FailurePattern | Affected service nodes | Ingestion Service on Confluence webhook or GitHub Pages build |
| Design rationale | PR review comments | MemoryNode | Function nodes, Module nodes | Ingestion Service on PR merge event |
| Policy exception | Governance Service approval flow | ExceptionNode | Policy node, violating Service node | Governance Service on Architect approval action |
| Sprint retro insight | Jira sprint close / GitHub Projects iteration close | SprintInsight | Domain node, SprintNode | Proactive Maintenance Service on sprint close webhook |
| Informal decision | Slack keyword trigger (configurable patterns) | IntentAssertion | Service node | Ingestion Service channel watch (v1.1) |
| Tribal knowledge gap | Proactive scanner — services with no WHY edges | MemoryGap flag | Service node | Proactive Maintenance Service nightly scan (PRO-UC-06) |
Entity Resolution¶
Entity resolution is the most technically challenging component of the ingestion pipeline. The same service will appear under different names across different source systems:
- GitHub repository name: `payment-service`
- Terraform resource label: `srv-payment`
- Kubernetes service name: `payments`
- Jira component name: `Payment Service`
- Slack channel reference: `#payment`
- CODEOWNERS entry: `service-payments`
Without canonicalization, the UMKB becomes a fragmented collection of duplicate nodes that cannot be reliably queried or compared. Entity resolution is what makes the UMKB a single coherent graph rather than a multi-source data dump.
Resolution Strategy¶
Substrate uses the Dense resolve-lora LoRA adapter fine-tuned against real-world service naming conventions to perform entity resolution. The adapter is applied to all inbound entity names during the nightly resolution pass (ING-UC-09, Celery beat at 2am) and at ingestion time for high-confidence cases.
The resolution pipeline:
1. Tokenization normalization: strip common prefixes and suffixes (`srv-`, `-service`, `-svc`); normalize separators (hyphens, underscores, camelCase to space-separated); lowercase all tokens
2. Embedding similarity: bge-m3 embedding of the normalized name against all existing canonical node names stored in PostgreSQL pgvector; cosine similarity above 0.92 triggers auto-merge
3. LLM resolution: for the similarity range 0.70–0.92, Dense resolve-lora classifies same-entity vs. different-entity with a confidence score
4. Human escalation: below 0.70 similarity, or when resolver confidence falls below 0.70, the entity candidate is queued in the Verification Queue for human review
5. Canonical ID assignment: on resolution, all source-system identifiers are stored as alias properties on the canonical node; subsequent lookups resolve immediately without re-running the pipeline
Ownership Schema¶
The ownership model uses a three-tier graph structure that supports both direct and inherited ownership:
```cypher
(:Developer)-[:MEMBER_OF]->(:Team)-[:OWNS]->(:Service)
(:Team)-[:CHILD_OF]->(:Team)
```
Transitive ownership queries traverse the full organizational hierarchy:
```cypher
MATCH (t:Team {name: "Engineering"})<-[:CHILD_OF*0..]-(sub:Team)-[:OWNS]->(s:Service)
RETURN sub.name AS owning_team, s.name AS service
```
This enables key-person risk detection: when a Developer node is deactivated via SCIM, Substrate traverses all OWNS edges (direct and team-inherited) to identify services that have lost their sole or primary owner. These services are immediately flagged CRITICAL in the Verification Queue.
Chunking and Vectorization Strategy¶
The UMKB receives data from heterogeneous tools and human inputs. A single chunking profile is insufficient. Substrate applies source-aware chunk profiles before vectorization:
| Input Type | Chunk Profile | Default Window | Overlap | Output |
|---|---|---|---|---|
| Source code and configs | `code_ast` | function/class/module boundaries, max 350 tokens | 40 tokens | DocumentChunk + code entity edges |
| ADRs, RFCs, architecture docs | `adr_section` | heading-based, max 700 tokens | 120 tokens | rationale chunks with WHY and PREVENTED_BY links |
| Jira/GitHub issue text, epics, user stories | `ticket_thread` | semantic paragraph windows, max 450 tokens | 80 tokens | intent and requirement chunks |
| PR comments, commit messages, chat decisions | `conversation_window` | turn-group windows, max 300 tokens | 60 tokens | memory and intent chunks |
| Runtime logs and telemetry summaries | `runtime_window` | time-sliced windows, max 500 tokens | 50 tokens | drift and anomaly chunks |
Each chunk is embedded with bge-m3 and stored in PostgreSQL (pgvector HNSW index). PostgreSQL remains the relational system of record for chunk metadata, embeddings, lifecycle labels, and policy/audit traces.
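A minimal sliding-window chunker illustrating the window/overlap mechanics the profiles share; the production profiles split on structure (AST nodes, headings, conversation turns), not raw token counts:

```python
def chunk_tokens(tokens: list, max_tokens: int, overlap: int) -> list:
    """Split a token sequence into windows of at most max_tokens,
    where consecutive windows share `overlap` tokens of context."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than the window")
    chunks, start, step = [], 0, max_tokens - overlap
    while start < len(tokens):
        chunks.append(tokens[start : start + max_tokens])
        if start + max_tokens >= len(tokens):
            break
        start += step
    return chunks
```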
Automatic Community Construction¶
When prerequisite graph density is reached, communities and their architectural subgraphs are generated automatically:
- Build projected graph with services, ADRs, incidents, source chunks, sprint artifacts, epics, and runtime signals.
- Run Leiden community detection (Neo4j GDS) to produce dense clusters.
- Ask the LLM community curator to produce:
- cluster summary,
- candidate boundary name,
- missing edge recommendations,
- policy gap hypotheses.
- Persist
KnowledgeCommunitynodes andIN_COMMUNITYedges. - Re-run community refresh incrementally as new entities arrive.
This clustering brings ADRs, source code, runtime reality, project documentation, sprint boards, user stories, and epics into explainable groups for retrieval, simulation, and governance.
Policies can be authored manually by architects or generated as draft policies automatically from repeated violation patterns and post-mortem lessons. Auto-generated policies always pass through a human approval gate before activation.
Confidence Scoring Model¶
Every node and edge in the UMKB carries a confidence score. Confidence is not optional — it is a first-class schema property that governs whether data is included in the graph, whether it requires human verification, and how much weight it carries in blast radius calculations and tension scores.
Source Trust Weights¶
| Source Type | Confidence Weight | Rationale |
|---|---|---|
| Code analysis (AST, static analysis) | 0.95 | Deterministic; directly derived from source of truth |
| CI/CD data (test results, build metadata) | 0.90 | Automated, reliable, low noise |
| Infrastructure state (Terraform, K8s API) | 0.85 | Declarative sources; occasionally stale between applies |
| Documentation (ADRs, READMEs, Confluence) | 0.70 | Human-authored; may be outdated |
| PR review comments | 0.60 | Contextual but informal; requires extraction model |
| Slack conversations | 0.30 | High noise; valuable only when explicitly flagged by keyword trigger |
Confidence Thresholds¶
| Range | Action |
|---|---|
| ≥ 0.90 | Auto-accepted; written to graph without review; verification_status = Verified |
| 0.60–0.89 | Written to graph with verification_status = Unverified; queued for periodic automated re-check |
| 0.50–0.59 | Written to graph with verification_status = Unverified; queued for human review in Verification Queue |
| < 0.50 | Rejected; not written to graph; logged to PostgreSQL audit table with rejection reason and source |
The minimum confidence threshold of 0.50 is derived from the Diffby knowledge graph inclusion model, which established that below-0.50 confidence produces more noise than signal in downstream queries.
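The thresholds table translates directly into routing logic. A sketch follows; the return shape and queue names are illustrative, not the actual service contract:

```python
def route_by_confidence(confidence: float) -> dict:
    """Map an extraction confidence score to the ingestion action
    defined in the confidence thresholds table."""
    if confidence >= 0.90:
        return {"write": True, "status": "Verified", "queue": None}
    if confidence >= 0.60:
        # Written but queued for periodic automated re-check
        return {"write": True, "status": "Unverified", "queue": "auto_recheck"}
    if confidence >= 0.50:
        # Written but queued for human review
        return {"write": True, "status": "Unverified", "queue": "verification_queue"}
    # Rejected; logged to the audit table with a rejection reason
    return {"write": False, "status": None, "queue": "audit_reject"}
```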
Node and Edge Confidence Schema¶
Every node and edge in the UMKB carries the following confidence-related properties:
| Property | Type | Description |
|---|---|---|
| confidence | float (0.0–1.0) | Confidence score at time of extraction |
| source | string | Originating data source identifier |
| extraction_timestamp | datetime | When this node or edge was first created |
| last_verification_timestamp | datetime | When this node or edge was last confirmed accurate |
| verification_status | enum | Verified / Unverified / Disputed / Stale / Deprecated |
Stale is set automatically when last_verification_timestamp exceeds the configured staleness threshold: 30 days for documentation sources, 7 days for infrastructure sources, 1 day for SSH-verified runtime sources.
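A sketch of the staleness check, with the three thresholds from the text hard-coded (in the real system they are configurable):

```python
from datetime import datetime, timedelta, timezone

# Staleness thresholds per source class, taken from the text above.
STALENESS_DAYS = {"documentation": 30, "infrastructure": 7, "runtime": 1}

def is_stale(source_class, last_verification, now=None):
    """True when last_verification_timestamp exceeds the configured
    staleness threshold for the node's source class."""
    now = now or datetime.now(timezone.utc)
    return now - last_verification > timedelta(days=STALENESS_DAYS[source_class])
```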
Disputed is set when two conflicting sources provide different values for the same property, or when a human reviewer explicitly marks the data as inaccurate pending resolution. Disputed nodes appear in the Verification Queue with both conflicting values shown side-by-side.
Graph Query Patterns¶
Multi-hop Dependency Traversal (RSN-UC-01)¶
```cypher
MATCH (s:Service {name: $service_name})-[:DEPENDS_ON*1..5]->(dep:Service)
WHERE dep.confidence >= 0.60
RETURN dep.name, dep.domain, dep.tension_score
ORDER BY dep.tension_score DESC
```
Blast Radius (GOV-UC-04)¶
```cypher
MATCH (target:Service {name: $target})<-[:DEPENDS_ON|ACTUALLY_CALLS*1..3]-(caller:Service)
RETURN caller.name, caller.domain, caller.page_rank
ORDER BY caller.page_rank DESC
```
Key-Person Risk Detection (PRO-UC-09)¶
```cypher
MATCH (d:Developer {active: false})-[:OWNS]->(s:Service)
WHERE NOT EXISTS {
  MATCH (other:Developer {active: true})-[:OWNS]->(s)
}
RETURN s.name AS orphaned_service, d.github_handle AS departing_owner
```
Transitive Team Ownership (Entity Resolution)¶
```cypher
MATCH (t:Team {name: "Engineering"})<-[:CHILD_OF*0..]-(sub:Team)-[:OWNS]->(s:Service)
RETURN sub.name AS owning_team, s.name AS service
```
ADR Gap Detection (PRO-UC-06)¶
```cypher
MATCH (s:Service)
WHERE NOT EXISTS {
  MATCH (d:DecisionNode)-[:WHY]->(s)
}
AND s.confidence >= 0.60
RETURN s.name, s.domain, s.tension_score
ORDER BY s.tension_score DESC
```
Data Lifecycle¶
Snapshot Strategy¶
PostgreSQL stores timestamped graph snapshots as serialized JSON in a partitioned table managed by pg_partman. Snapshots are captured:
- On every PR merge (pre- and post-merge state)
- On every Terraform apply (pre- and post-apply state)
- Nightly at 2am (baseline snapshot for drift trend tracking)
- On demand via FastAPI gateway endpoint (for simulation setup)
Snapshots power the temporal query capability (RSN-UC-06: "What changed before last Friday's incident?") and the diff view in the PR Check Detail UI screen.
Retention Policy¶
| Data Type | Retention | Storage |
|---|---|---|
| Graph snapshots | 90 days | PostgreSQL (pg_partman partitioned by month) |
| Audit log events | 2 years | PostgreSQL (append-only, no DELETE privilege for app user) |
| Node embeddings | Indefinite (versioned) | PostgreSQL (pgvector HNSW index) |
| Drift scores | 90 days | PostgreSQL (time-series partitioned) |
| NATS stream events | 7 days | NATS JetStream (configurable per stream) |
| Redis subgraph cache | TTL-based per query type | Redis (AOF for job queue durability) |
Cache Invalidation¶
When the Ingestion Service writes a graph update, it publishes a substrate.cache.invalidate event to NATS. The Cache Service subscribes to this event and issues TTL resets or explicit DEL commands to the Redis subgraph cache for all affected query patterns. This prevents stale cache responses from masking newly ingested drift without requiring full cache flushes on every write.
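A sketch of the Cache Service's invalidation step: the NATS subject (`substrate.cache.invalidate`) comes from the text above, while the Redis key naming scheme (`subgraph:<pattern>:<node>`) and the query-pattern list are assumed conventions:

```python
# Query patterns whose cached subgraphs may reference a changed node.
# These names are illustrative placeholders.
QUERY_PATTERNS = ("dependency_traversal", "blast_radius", "ownership")

def keys_to_invalidate(event: dict) -> list[str]:
    """Map one substrate.cache.invalidate event to the affected
    subgraph cache keys, so only touched entries are deleted
    instead of flushing the whole cache on every write."""
    return [
        f"subgraph:{pattern}:{node_id}"
        for node_id in event.get("affected_nodes", [])
        for pattern in QUERY_PATTERNS
    ]

# With a redis-py client, the subscriber callback body would be roughly:
#   keys = keys_to_invalidate(json.loads(msg.data))
#   if keys:
#       r.delete(*keys)
```

Keeping the key computation pure makes it easy to unit-test the invalidation scope without a live Redis or NATS connection.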
Document Lifecycle Governance¶
All role-contributed artifacts are lifecycle-labeled after LLM inspection and deterministic freshness checks. The same item can move between states over time as new evidence appears.
Lifecycle State Model¶
stateDiagram-v2
[*] --> latest : first validated version
latest --> active : superseded but still referenced
active --> stale : freshness threshold breached
stale --> outdated : contradicted by newer code/runtime evidence
outdated --> incomplete : missing required policy/traceability fields
incomplete --> needs_verification : delegation to responsible user
needs_verification --> curated_update : user responds and evidence supplied
curated_update --> latest : LLM formats and applies validated update
stale --> archived : de-prioritized from active reasoning
archived --> oldest_snapshot : retained for history/audit only
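The diagram above can be encoded as a transition table; a guard like this keeps writers from producing lifecycle labels the state model does not allow (the function name is illustrative):

```python
# Allowed lifecycle transitions, transcribed from the state diagram above.
TRANSITIONS = {
    "latest": {"active"},
    "active": {"stale"},
    "stale": {"outdated", "archived"},
    "outdated": {"incomplete"},
    "incomplete": {"needs_verification"},
    "needs_verification": {"curated_update"},
    "curated_update": {"latest"},
    "archived": {"oldest_snapshot"},
    "oldest_snapshot": set(),  # terminal: history/audit only
}

def assert_transition(current: str, target: str) -> None:
    """Reject lifecycle updates the state model does not permit."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(
            f"illegal lifecycle transition: {current} -> {target}")
```

Note that `stale` is the only branching state: it can either escalate toward remediation (`outdated`) or drop out of active reasoning (`archived`).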
Governance Delegation Workflow (UML Sequence)¶
sequenceDiagram
participant Connector as Source Connector
participant ING as Ingestion Service
participant CH as Chunker/Embedder
participant KB as UMKB (Neo4j + PostgreSQL)
participant GOV as Governance Service
participant PRO as Proactive Maintenance
participant USER as Responsible User
participant CUR as Curator LLM
Connector->>ING: New/updated artifact event
ING->>CH: Select chunk profile and split content
CH->>KB: Write chunks, embeddings, and graph links
GOV->>KB: Evaluate active policies on changed scope
PRO->>KB: Inspect lifecycle state and completeness
PRO->>USER: Verification/update request with SLA
USER-->>PRO: Response and supporting context
PRO->>CUR: Normalize and format accepted response
CUR->>KB: Update affected docs/files and lifecycle labels
KB-->>GOV: Re-run impacted policy checks
This workflow operationalizes active governance: it not only detects issues but routes them to the correct owner and closes the loop with policy-aligned updates.
Functional Requirements (Cross-Cutting UMKB)¶
| ID | Requirement | Priority |
|---|---|---|
| KB-01 | Store relational metadata, policy data, audit records, and vectors in PostgreSQL 16 with pgvector HNSW indexes | Must Have |
| KB-02 | Store graph entities, relationships, and community topology in Neo4j 5.x | Must Have |
| KB-03 | Apply source-aware chunking profiles for code, ADRs, tickets, conversations, and runtime data before embedding | Must Have |
| KB-04 | Label every DocumentAsset with lifecycle state (latest, active, stale, outdated, incomplete, archived, oldest_snapshot) after LLM + deterministic checks | Must Have |
| KB-05 | Run Leiden community detection and auto-create KnowledgeCommunity nodes when minimum graph density is reached | Must Have |
| KB-06 | Generate community summaries and candidate policy gaps from LLM community curator output | Must Have |
| KB-07 | Support manual policy authoring and automatic draft policy generation from repeated incident/violation motifs | Must Have |
| KB-08 | Require human approval before activating any auto-generated policy | Must Have |
| KB-09 | Detect stale/outdated/incomplete artifacts and create NEEDS_ATTENTION delegation edges to the accountable role | Must Have |
| KB-10 | On user response, curate and format updates before applying document/file changes and re-running policy checks | Must Have |
| KB-11 | Track full lifecycle transitions and remediation actions in append-only audit logs | Must Have |
| KB-12 | Provide time-windowed retrieval of latest, archived, and oldest snapshots for audit and simulation replay | Must Have |