
Proactive Maintenance Service

The Proactive Maintenance Service is the platform's maintenance department and organisational memory curator. It continuously watches the graph for decay, routes uncertain knowledge through a human verification queue, detects memory gaps and staleness, and escalates findings to the right people at the right time.

Responsibility

Watch the Observed Graph over time to detect structural tension, drift patterns, knowledge gaps, and memory decay. Own the Verification Queue that gates uncertain entities before they enter the graph. Generate sprint insights, tribal knowledge extractions, and proactive alerts that surface architectural debt before it becomes an incident.


Architecture Overview

The Proactive Maintenance Service operates on three trigger types:

  1. Event-driven: Reacts to graph update events published by the Ingestion Service to NATS (substrate.ingestion.>) — re-scoring tension, re-evaluating staleness, and routing new low-confidence entities to the verification queue.

  2. Scheduled: Nightly batch jobs for PageRank recomputation, tribal knowledge extraction, community summary rebuilds, and escalation queue processing.

  3. Webhook-driven: Reacts to external webhooks (Jira sprint close, team membership changes) to generate sprint retro structural insights and key-person risk updates.

All proactive alerts are published to NATS subjects under substrate.proactive.> for downstream consumption by the UI WebSocket layer (PRO-17).


Structural Tension Scoring (PRO-01, PRO-02)

After each ingestion event, the Proactive Maintenance Service recomputes the structural tension score for every edge in the subgraph affected by that event.

Tension Formula

tension = |intended_weight - observed_weight| / intended_weight

Where:

  • intended_weight is the declared edge weight from the Intended Graph (architectural intent)
  • observed_weight is the empirically observed edge weight from runtime data, SSH connector output, or code analysis

A tension score of 0 means the observed state matches intent exactly. A score of 1.0 indicates severe divergence — for example, an intended edge that is entirely absent from observations (observed weight 0). Scores can exceed 1.0 when the observed weight is more than double the intended weight.
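
As a minimal sketch (assuming edge weights are simple positive floats), the formula translates directly:

```python
def tension_score(intended_weight: float, observed_weight: float) -> float:
    """Relative divergence of the observed edge weight from declared intent.

    0.0 means the observed state matches the Intended Graph exactly;
    1.0 arises e.g. when an intended edge is absent from observations.
    """
    if intended_weight <= 0:
        raise ValueError("intended_weight must be positive")
    return abs(intended_weight - observed_weight) / intended_weight
```

For instance, an intended weight of 10 against an observed weight of 5.8 yields the 42% divergence quoted in the alert narrative below.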

Tension scores are stored in a partitioned PostgreSQL table (edge_tension partitioned by month) with (edge_id, source_connector, timestamp) as the composite key. This preserves the historical trend for predictive analysis.

When a tension score exceeds the configurable threshold for its domain (default: 0.3 for service dependency edges, 0.5 for infrastructure edges), a drift alert is generated. The alert includes a plain English narrative generated by the Dense explain-lora adapter:

"PaymentService's observed dependency graph has diverged 42% from its declared intent. The CALLS edge to legacy-auth-v1 is present in runtime observations but declared as removed in the Intended Graph. Last updated 8 days ago."

Predictive Drift Detection (PRO-18 — v1.1)

By computing the rate of change of tension scores over time (stored in the partitioned PostgreSQL table), the service can identify domains with accelerating tension trends. If a domain's tension score is increasing at a rate that projects it to breach the alert threshold within two sprints, a predictive warning is issued before the threshold is actually crossed.
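
A sketch of the projection, assuming tension samples arrive as (day_offset, score) pairs and a sprint lasts 14 days (both assumptions; neither the sample shape nor the sprint length is fixed by the spec):

```python
def projected_breach(history, threshold, horizon_days=28):
    """Fit a least-squares slope to tension samples and report whether the
    trend line crosses `threshold` within `horizon_days` (two 14-day sprints).

    `history` is a list of (day_offset, tension) pairs, oldest first.
    """
    n = len(history)
    if n < 2:
        return False
    xs = [x for x, _ in history]
    ys = [y for _, y in history]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    if denom == 0:
        return False
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / denom
    if slope <= 0:
        return False  # tension flat or easing: no predictive warning
    last_x, last_y = history[-1]
    return last_y + slope * horizon_days >= threshold
```

A domain climbing from 0.10 to 0.20 over two weeks projects past a 0.3 threshold within the horizon and triggers the warning; a flat or falling series does not.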


Verification Queue (PRO-03)

Every entity extracted by the Ingestion Service carries a confidence score. Before writing to the Observed Graph, the entity passes through the verification queue routing logic:

Confidence Band Routing

| Confidence Band | Routing Decision | SLA |
| --- | --- | --- |
| Above 90% | Auto-accept — node enters graph immediately | Real-time |
| 60%–90% | Queue for human review — assigned to owning team member | Within 7 days |
| Below 60% | Escalate to expert review — assigned to team lead or architect | Within 3 days |
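
The band routing above reduces to a small threshold function. A sketch, assuming confidence is a 0–1 float and the decision/SLA labels are illustrative:

```python
def route_entity(confidence):
    """Map an extractor confidence score to a verification queue decision.

    Returns (decision, review_sla); thresholds follow the band table above.
    """
    if confidence > 0.90:
        return ("auto_accept", None)       # enters the graph immediately
    if confidence >= 0.60:
        return ("human_review", "7 days")  # owning team member
    return ("expert_review", "3 days")     # team lead or architect
```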

The target human escalation rate is 10–15% of all extracted entities. Significantly higher rates indicate degraded extractor quality; significantly lower rates may indicate over-permissive confidence thresholds.

Entities in the queue are displayed in the Substrate UI as pending items with their confidence score, the source document, the proposed node properties, and a diff against the existing graph state (if an existing node would be updated). Reviewers can accept, reject, or modify the proposed node.


Staleness Detection (PRO-04)

The service monitors all graph entities against per-type staleness thresholds. An entity is considered stale when the time since its last confirmed ingestion or review exceeds the threshold for its type.

Staleness Thresholds by Entity Type

| Entity Type | Staleness Threshold | Trigger |
| --- | --- | --- |
| Service dependency edge | 14 days since last ingestion | PR merge touching dependent service |
| API contract / OpenAPI spec | 30 days since last rebuild | Deployment to any environment |
| Service ownership (OWNS edge) | 90 days | Team org change or developer departure |
| Architecture Decision Record | 180 days | Sprint close or major release tag |
| Post-mortem lesson node | 365 days | New incident in same service domain |

When an entity crosses its staleness threshold, a staleness alert is generated and routed through the escalation chain.
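
The check itself is a per-type age comparison. A sketch, with the entity-type keys chosen for illustration (only the day counts come from the threshold table):

```python
from datetime import datetime, timedelta

# Day counts from the staleness table; the key names are assumptions.
STALENESS_DAYS = {
    "service_dependency_edge": 14,
    "api_contract": 30,
    "ownership_edge": 90,
    "adr": 180,
    "post_mortem_lesson": 365,
}

def is_stale(entity_type, last_confirmed, now):
    """True when the time since the last confirmed ingestion or review
    exceeds the per-type threshold."""
    return now - last_confirmed > timedelta(days=STALENESS_DAYS[entity_type])
```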


Memory Gap Detection (PRO-05 to PRO-09)

The service runs a suite of graph pattern queries to detect gaps in institutional memory:

Undocumented services (PRO-05): Service nodes with no outbound DOCUMENTS edge to any documentation node are flagged. Alert text: "ServiceName has no documentation — a documentation page must be linked or created."

Orphaned documentation (PRO-06): Documentation nodes with no incoming DOCUMENTS edge from any live service node are flagged as orphaned. Alert text: "This Confluence page describes a service that no longer exists in the graph — archive or update."

ADR gaps (PRO-07): Service nodes with no WHY edge connecting them to any DecisionNode have no recorded architectural rationale. Alert text: "ServiceName has no architectural decision record — assign an owner to document the design intent."

Stale ADRs (PRO-08): DecisionNode entries older than 180 days with no REVIEWED_AT property are flagged for review. Alert text: "ADR-032 is 22 months old and has never been reviewed — assign to architect for validation or deprecation."

Unencoded post-mortem lessons (PRO-09): FailurePattern nodes with no outbound edge to a Policy node represent incidents where lessons were recorded but never translated into governance rules. Alert text: "PM-019 identified an architectural failure but no policy was created from it — the lesson remains vulnerable to recurrence."
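
In production these patterns run as Cypher against Neo4j; the first and last (PRO-05 and PRO-09) can be sketched over an in-memory edge list to show the shape of the queries. Node labels and the triple representation here are illustrative:

```python
def detect_memory_gaps(nodes, edges):
    """Pure-Python sketch of two gap patterns.

    `nodes` maps name -> label; `edges` is a set of
    (source, relation, target) triples.
    Returns (services with no outbound DOCUMENTS edge,      # PRO-05
             FailurePattern nodes with no edge to a Policy) # PRO-09
    """
    documented = {s for (s, rel, _) in edges if rel == "DOCUMENTS"}
    undocumented = sorted(n for n, label in nodes.items()
                          if label == "Service" and n not in documented)
    encoded = {s for (s, _, t) in edges if nodes.get(t) == "Policy"}
    unencoded = sorted(n for n, label in nodes.items()
                       if label == "FailurePattern" and n not in encoded)
    return undocumented, unencoded
```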


Key-Person Risk Detection (PRO-10)

After each team membership change event (ingested from Jira or GitHub), the service queries for service nodes that are exclusively owned by a single developer (a single OWNS edge from that developer, no other developers or teams with OWNS edges to the same service).

These services are ranked by criticality (PageRank score). High-criticality services with exclusive single-developer ownership are treated as key-person risk. Alert text: "3 services exclusively owned by @alice who is leaving next sprint — redistribute ownership before their departure."
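
The exclusive-ownership pattern maps naturally onto a single Cypher query. The query below is a hypothetical sketch: the labels and properties (Developer, Service, handle, criticality) are assumptions, with criticality standing in for the nightly PageRank score:

```python
# Hypothetical Cypher; label and property names are illustrative.
KEY_PERSON_RISK_QUERY = """
MATCH (o)-[:OWNS]->(s:Service)
WITH s, collect(DISTINCT o) AS owners
WHERE size(owners) = 1 AND all(o IN owners WHERE o:Developer)
RETURN owners[0].handle AS developer, s.name AS service,
       s.criticality AS criticality
ORDER BY s.criticality DESC
"""
```

Grouping per service and requiring exactly one distinct owner captures "no other developers or teams with OWNS edges to the same service".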


Sprint Retro Structural Insights (PRO-11)

On Jira sprint close webhook, the service generates a structural retro report for the team. The report includes:

  • Violation delta: New violations opened vs. closed during the sprint; trend direction
  • Drift delta: Services whose tension scores increased or decreased during the sprint
  • New memory gaps: ADR gaps, undocumented services, and unencoded post-mortems discovered during the sprint
  • Coverage changes: Services whose test coverage dropped below threshold during the sprint
  • Key changes: New DecisionNode entries created (new institutional memory captured)

The report is published to the team's #substrate-insights Slack channel and persisted in PostgreSQL for historical comparison.


Graft Pattern Suggestions (PRO-12)

On PR open events, the service analyses the changed code's style and dependency patterns against the style graph of neighbouring files in the same module. If the changed code deviates from the dominant patterns of its neighbours (e.g., uses a different error handling pattern, a deprecated library, or a non-standard service call structure), a Graft Pattern suggestion is generated.

Graft Patterns are advisory suggestions included in the PR comment: "The surrounding files in this module use the Result<T, AppError> error handling pattern. Consider aligning this function for consistency."


Tribal Knowledge Extraction (PRO-13)

Every night, the service processes all ingested content from the previous 24 hours — PR review comments, Slack decision threads, post-mortem documents, ADR updates — and extracts persistent MemoryNode entries. A MemoryNode is a piece of institutional knowledge that is not tied to a specific code artifact but is important for understanding the system: historical design tradeoffs, rejected alternatives, known landmines, performance characteristics that are not captured in any spec.

MemoryNode extraction uses the Dense extract-lora adapter. Extracted nodes are routed through the verification queue with a target confidence threshold of 0.75 before entering the graph.


Duplicate Documentation Detection (PRO-14)

The service runs a nightly batch query against the pgvector HNSW index to find pairs of documentation nodes that:

  • Have an embedding cosine similarity above 0.85
  • Are both targets of a DOCUMENTS edge from the same service node

These pairs are flagged as duplicate documentation candidates. Alert text: "Two documentation pages describe InventoryService with 91% semantic similarity — merge or deprecate one to prevent knowledge fragmentation."
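
A hypothetical shape for the batch query, noting that pgvector's `<=>` operator returns cosine distance, so similarity is `1 - distance`. Table and column names are illustrative:

```python
# Hypothetical SQL over a pgvector-backed embeddings table.
DUPLICATE_DOC_CANDIDATES_SQL = """
SELECT a.doc_id AS doc_a,
       b.doc_id AS doc_b,
       1 - (a.embedding <=> b.embedding) AS similarity
FROM   doc_embeddings a
JOIN   doc_embeddings b
  ON   a.service_id = b.service_id   -- same documented service
 AND   a.doc_id < b.doc_id           -- each pair reported once
WHERE  1 - (a.embedding <=> b.embedding) > 0.85;
"""
```

The `a.doc_id < b.doc_id` join condition avoids self-pairs and mirror duplicates of the same candidate pair.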


Nightly Graph Analytics (PRO-15)

Each night, the service re-runs two graph analytics computations and updates the results as node properties:

PageRank: Applied to the full CALLS + DEPENDS_ON subgraph. The resulting PageRank score is written to the criticality property on each ServiceNode. Services with PageRank > 0.3 are treated as critical for governance purposes (higher test coverage requirements, blast radius weighting, key-person risk thresholds).

Betweenness Centrality: Applied to the same subgraph. High betweenness nodes are structural bottlenecks — services that lie on the shortest paths between many other services. These are flagged as architectural chokepoints and surfaced in the weekly structural health summary.

Both computations are run using Neo4j's native Graph Data Science (GDS) library and take approximately 2–5 minutes on a graph of typical enterprise size.
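
The nightly job can be expressed as a short sequence of GDS procedure calls. The projected graph name and the betweenness property below are assumptions; only the two algorithms and the criticality property come from the spec:

```python
# Hypothetical GDS call sequence; 'nightly' and 'betweenness' are
# illustrative names.
NIGHTLY_ANALYTICS = [
    # Project the CALLS + DEPENDS_ON subgraph into the GDS catalog.
    "CALL gds.graph.project('nightly', 'Service', ['CALLS', 'DEPENDS_ON'])",
    # Write PageRank to the criticality property on each ServiceNode.
    "CALL gds.pageRank.write('nightly', {writeProperty: 'criticality'})",
    # Flag structural bottlenecks via betweenness centrality.
    "CALL gds.betweenness.write('nightly', {writeProperty: 'betweenness'})",
    # Free the in-memory projection when done.
    "CALL gds.graph.drop('nightly')",
]
```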


Escalation Chain (PRO-16)

All alerts generated by the Proactive Maintenance Service follow a structured 5-step escalation chain:

| Day | Action | Channel |
| --- | --- | --- |
| 0 | Notify assigned owner | Slack DM + in-app badge |
| 3 | Reminder to owner + last editor | Slack DM |
| 7 | Escalate to team channel | #substrate-alerts |
| 14 | Critical escalation to team lead + architecture board | Slack + email |
| 30 | Auto-deprecation: node marked Stale, removed from active reasoning context | System action + notification |

Auto-deprecation at Day 30 means the node is tagged with status: stale and excluded from Reasoning Service retrieval results. It is not deleted; it remains in the graph for audit and historical query purposes but does not surface in active governance or reasoning responses.

Escalation state for each alert is persisted in PostgreSQL. If an alert is acknowledged and actioned at any step, the escalation chain is halted.
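
The chain logic reduces to picking the latest step an alert's age has reached, halting once it is acknowledged. A sketch with illustrative action names:

```python
# (day, action) pairs from the escalation table; action names are assumptions.
ESCALATION_STEPS = [
    (0, "owner_dm"),
    (3, "owner_and_last_editor_reminder"),
    (7, "team_channel"),
    (14, "team_lead_and_architecture_board"),
    (30, "auto_deprecate"),
]

def next_escalation(age_days, acknowledged):
    """Return the escalation action due for an alert of this age, or None
    when the alert was acknowledged and actioned (the chain halts)."""
    if acknowledged:
        return None
    due = [action for day, action in ESCALATION_STEPS if age_days >= day]
    return due[-1] if due else None
```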


Active Governance Loop (PRO-19 to PRO-22)

Beyond structural drift and memory decay, this service closes the active governance loop. It continuously scans for policy-required updates across all major contribution surfaces:

  • ADR files and architecture docs
  • Jira tickets, epics, and sprint artifacts
  • PR comments, commit messages, and review threads
  • Source code files and runtime discrepancy evidence
  • Runbooks, post-mortems, and platform documentation

When stale, outdated, incomplete, or conflicting artifacts are detected, the service resolves the accountable owner (a Developer, a Team, or a role), sends a targeted verification request, and tracks the response SLA. On response, a curator LLM reformats the human input into policy-compliant content before updating the affected artifact and its lifecycle labels.

Lifecycle Labels Managed by Proactive Maintenance

| Label | Meaning | Governance Action |
| --- | --- | --- |
| latest | Newest verified version | Used as default retrieval context |
| active | Still valid but not newest | Eligible for retrieval with lower rank |
| stale | Freshness SLA exceeded | Escalation chain started |
| outdated | Contradicted by newer code/runtime evidence | Blocked from policy evidence usage until verified |
| incomplete | Missing mandatory policy metadata or traceability | Delegated to accountable owner |
| archived | Historic, low-activity item | Excluded from active reasoning by default |
| oldest_snapshot | Long-term retained baseline | Audit/simulation-only access |
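
The spec does not fix a precedence among these labels; the sketch below assumes one plausible ordering (contradiction outranks missing metadata, which outranks freshness), with archived and oldest_snapshot assigned by separate retention logic:

```python
def lifecycle_label(is_newest, verified, freshness_exceeded,
                    contradicted, missing_metadata):
    """Assumed precedence for lifecycle classification; archived and
    oldest_snapshot are handled elsewhere by retention policy."""
    if contradicted:
        return "outdated"      # blocked from policy evidence until verified
    if missing_metadata:
        return "incomplete"    # delegated to accountable owner
    if freshness_exceeded:
        return "stale"         # escalation chain started
    if is_newest and verified:
        return "latest"        # default retrieval context
    return "active"            # retrievable with lower rank
```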

Delegated Remediation Flow

  1. Detect artifact requiring attention and classify reason.
  2. Resolve accountable owner via OWNS, MEMBER_OF, and role-routing rules.
  3. Create verification task with required evidence checklist.
  4. Notify owner and enforce escalation schedule.
  5. Receive response; run curator LLM for formatting and policy alignment.
  6. Update graph + relational records and re-evaluate impacted policies.

Memory Decay Signal Reference

| Signal | Detection Method | Example Alert Generated |
| --- | --- | --- |
| Service with no ADR | Service node with no WHY edge to any DecisionNode | "PaymentService has no architectural decision record — assign an owner to create ADR" |
| ADR references deleted service | DecisionNode linked to non-existent service node | "ADR-047 references srv-checkout which no longer exists — update or archive" |
| Stale decision never reviewed | DecisionNode age > 180 days, no REVIEWED_AT property | "ADR-032 is 22 months old and has never been reviewed — assign to architect" |
| Post-mortem not encoded as policy | FailurePattern node with no linked Policy node | "PM-019 identified an architectural failure but no policy was created from it" |
| Key person departure risk | Developer node deactivated; their services have no other OWNS edges | "3 services exclusively owned by @alice who is leaving next sprint" |
| Duplicate documentation | Two Confluence nodes with embedding similarity > 85% pointing to same service | "Two documentation pages describe InventoryService — merge or deprecate one" |

Functional Requirements

| ID | Requirement | Priority |
| --- | --- | --- |
| PRO-01 | Compute structural tension score for every observed edge after each ingestion event; store in PostgreSQL partitioned table | Must Have |
| PRO-02 | Alert when tension score exceeds configurable threshold per domain; generate plain English drift narrative | Must Have |
| PRO-03 | Run verification queue: assess all entities needing reverification; route by confidence band (auto-accept/review/expert escalate) | Must Have |
| PRO-04 | Staleness detection by entity type: dependencies 14d, API contracts 30d, ownership 90d, ADRs 180d | Must Have |
| PRO-05 | Detect undocumented services: service node with no DOCUMENTS edge to any documentation node | Must Have |
| PRO-06 | Detect orphaned documentation: doc node with no matching live service node | Must Have |
| PRO-07 | Detect ADR gaps: service nodes with no WHY edge to any DecisionNode | Must Have |
| PRO-08 | Detect stale ADRs: DecisionNode age > 180 days with no REVIEWED_AT property | Must Have |
| PRO-09 | Detect post-mortem lessons not encoded as policies: FailurePattern with no linked Policy node | Must Have |
| PRO-10 | Detect key-person risk: services exclusively owned by a single developer; weight by service criticality (PageRank) | Must Have |
| PRO-11 | Generate sprint retro structural insight on Jira sprint close webhook; include violation delta, drift delta, new memory gaps | Must Have |
| PRO-12 | Suggest Graft Pattern rewrite on PR open: match changed code patterns against style graph of neighboring files | Must Have |
| PRO-13 | Nightly tribal knowledge extraction from all ingested sources into persistent MemoryNodes | Must Have |
| PRO-14 | Detect duplicate documentation: node pairs with embedding similarity > 85% pointing to same service | Must Have |
| PRO-15 | Nightly PageRank + betweenness centrality recompute; update criticality properties on all service nodes | Must Have |
| PRO-16 | Escalation notifications: Day 0 owner DM, Day 3 reminder to owner + last editor, Day 7 team channel, Day 14 team lead + architecture board, Day 30 auto-deprecation | Must Have |
| PRO-17 | Publish all proactive alerts to NATS substrate.proactive.> for downstream UI WebSocket consumption | Must Have |
| PRO-18 | Predictive drift detection: identify domains with accelerating tension score trend likely to breach threshold within 2 sprints | Nice to Have (v1.1) |
| PRO-19 | Continuously classify artifact lifecycle states (latest, active, stale, outdated, incomplete, archived, oldest_snapshot) after LLM + deterministic checks | Must Have |
| PRO-20 | Detect policy-required gaps across ADRs, Jira artifacts, PR comments, commit messages, source code, and runtime evidence; delegate remediation to accountable role | Must Have |
| PRO-21 | Capture user verification responses, run curator formatting, and update affected artifact state/content with audit trail | Must Have |
| PRO-22 | Re-trigger governance evaluation automatically after curated remediation updates | Must Have |

Infrastructure Dependencies

| Component | Role in Proactive Maintenance Service |
| --- | --- |
| Neo4j 5.x | Graph pattern queries for memory gap detection; PageRank and betweenness centrality via GDS |
| PostgreSQL 16 + pgvector | Tension score time-series table (partitioned); verification queue; staleness tracking; escalation state |
| Redis 7 | Alert deduplication cache; scheduled job locks |
| NATS JetStream | Receives graph update events; publishes proactive alerts to substrate.proactive.> |
| Celery | Scheduled nightly batch jobs; event-driven re-scoring tasks |
| DGX Spark port 8001 | Dense 70B + extract-lora: tribal knowledge extraction; explain-lora: drift narrative generation |
| DGX Spark port 8003 | bge-m3: duplicate documentation similarity scoring |
| DGX Spark port 8004 | bge-reranker-v2-m3: verification queue candidate ranking |