Non-Functional Requirements (NFRs) — Roadmap Reference

Overview

Non-functional requirements define the operational envelope within which Substrate must function. They are not aspirational guidelines — they are acceptance criteria. A feature that meets its functional requirements but violates an NFR is not shippable.

NFRs fall into four categories: Performance, Reliability, Security, and Graph Accuracy. Graph accuracy is treated as a first-class NFR category because inaccurate graph data undermines every other capability. A fast system that surfaces wrong information is worse than a slow system that surfaces correct information.


Performance Targets

All latency targets are measured end-to-end from the triggering event or HTTP request to the result being available to the consumer (UI rendered, API response sent, GitHub Checks API result posted, NATS message published). p95 targets allow for 5% of requests to exceed the stated threshold due to infrastructure variance.

| Metric | Target | Context and Rationale |
| --- | --- | --- |
| PR governance check, end-to-end | < 2 seconds p95 | From GitHub PR webhook receipt to GitHub Checks API result posted. OPA/Rego evaluation on a 10,000-node graph is sub-100ms; the budget is consumed by graph context serialization, OPA HTTP call, explain-lora violation explanation generation, and GitHub API call. The 2-second threshold is psychological: 2 seconds feels like CI, 5 seconds feels like "something is thinking", 10+ seconds breaks developer flow. |
| Simple graph query (RSN-UC-01, RSN-UC-02) | < 1 second p95 | API response time for direct entity traversal queries (dependency list, ownership lookup). These are pure Cypher traversals with no LLM generation; 1 second is the upper bound for a user to perceive the query as "instant". |
| Complex reasoning (RSN-UC-04, RSN-UC-05) | < 8 seconds p95 | End-to-end token generation for multi-hop, multi-source reasoning queries (RAPTOR map-reduce, memory retrieval with ADR + post-mortem chain). 8 seconds is the threshold established by user research at which waiting becomes frustrating without a progress indicator. Substrate shows a streaming progress indicator for queries expected to take > 2 seconds. |
| Simulation result (SIM-UC-01 through SIM-UC-04) | < 15 seconds p95 | From simulation trigger to before/after delta table rendered in UI. Includes Neo4j sandbox creation, subgraph copy, Cypher mutation, OPA evaluation on sandbox, and result rendering. SIM-UC-05 allows 20 seconds for more complex sprint planning simulations. |
| Violation explanation generation | < 2 seconds | From OPA evaluation completion to PR comment posted. Included in the 2-second end-to-end PR governance budget. The explain-lora adapter on Dense 70B generates violation explanations at approximately 80 tokens/second; a 150-token explanation takes under 2 seconds. |
| Embedding generation per node | < 500ms | bge-m3 embedding throughput per node. At this rate, a 1,000-node repository is fully embedded within 500 seconds (~8 minutes) of initial ingestion — within the 1-hour Phase 1 success criterion. Batch embedding via the bge-m3 endpoint processes up to 32 nodes per request. |
| PR intent mismatch check (RSN-UC-03) | < 1 second | bge-m3 embedding of PR description + diff summary, cosine similarity comparison against IntentAssertion embedding, bge-reranker RRF fusion. The reranker adds approximately 200ms. |
| Ingestion event processing lag | < 30 seconds | From webhook receipt at FastAPI gateway to first graph node written in Neo4j. Includes HMAC verification, NATS publish, Celery worker pickup, entity resolution, and Neo4j write. For large PR diffs (>50 files), the Rust AST CLI parse may extend this to 60 seconds; this is acceptable for the first node written, with subsequent nodes arriving within 30 seconds thereafter. |
| SSH host inspection, end-to-end | < 3 minutes | From SSH connection initiation to diff result written to NATS. Includes Vault certificate issuance (~500ms), SSH connection establishment, ForceCommand script execution, JSON parsing, topology diff, and NATS publish. |
| CVE feed to affected nodes alert | < 5 minutes | From CVE poll completion to NATS alert published identifying affected dependency nodes. The 15-minute poll frequency means the worst-case lag from CVE publication to Substrate alert is 20 minutes. |
| Concurrent users supported | 12 at 128k context | The DGX Spark KV cache pool ceiling of 26.1 GB supports approximately 12 concurrent 128k-context sessions. Beyond 12 concurrent users, new requests queue until an active session completes. For read-heavy workloads (graph browsing, no LLM generation), the limit does not apply. |
| Nightly GDS batch (PageRank + betweenness) | < 5 minutes | PageRank and betweenness centrality on a 500,000-node graph using Neo4j GDS. This is the ceiling for the nightly maintenance window before the daily digest generation begins at 8:30 AM. |
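Two rows of the table lend themselves to a quick arithmetic sanity check. The sketch below uses only the rates quoted above (decode rate, per-node embedding cost, batch size); none of the numbers are measurements.

```python
import math

# Figures quoted in the performance table above.
TOKENS_PER_SECOND = 80.0   # explain-lora decode rate on Dense 70B
EMBED_MS_PER_NODE = 500    # bge-m3 per-node embedding cost
BATCH_SIZE = 32            # nodes per bge-m3 batch request

def generation_seconds(tokens: int) -> float:
    """Decode time for a violation explanation of the given length."""
    return tokens / TOKENS_PER_SECOND

def full_embed_seconds(nodes: int) -> float:
    """Worst case: every node pays the full per-node embedding cost."""
    return nodes * EMBED_MS_PER_NODE / 1000

assert generation_seconds(150) == 1.875       # under the 2-second budget
assert full_embed_seconds(1_000) == 500.0     # ~8 minutes for initial ingestion
assert math.ceil(1_000 / BATCH_SIZE) == 32    # batch requests required
```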

Reliability Targets

Reliability targets govern availability, data durability, event delivery, and recovery time. These targets assume a single-node DGX Spark deployment with local persistent storage.

| Requirement | Target | Implementation |
| --- | --- | --- |
| Service availability | 99.9% uptime | Bare metal vLLM endpoints under systemd with Restart=on-failure; Docker services with restart: on-failure; health check polling every 30 seconds; automated restart within 10 seconds of detected failure |
| Data durability | 99.999% | Neo4j data on persistent volume with nightly automated backup; PostgreSQL on persistent volume with nightly pg_dump + WAL archiving; Redis AOF persistence (fsync every second) |
| Event delivery | At-least-once | NATS JetStream with consumer acknowledgment; ackWait timeout of 30 seconds; unacknowledged messages replayed automatically; stream retained for 7 days to allow replay after extended outage |
| Job persistence | At-least-once | Celery with Redis broker; AOF persistence ensures job queue survives Redis restart; visibility timeout prevents duplicate processing under normal conditions |
| Recovery time objective (RTO) | < 60 seconds | Restart=on-failure and Docker restart: on-failure with no delay; database connections re-established on service restart; NATS stream replay restores in-flight events; target: all services healthy within 60 seconds of a single-service crash |
| Recovery point objective (RPO) | < 24 hours | Daily backup cadence; Redis AOF provides near-zero RPO for the job queue; PostgreSQL WAL archiving provides sub-minute RPO for relational data |
| Graph accuracy | > 95% within 1 hour of connection | Entity resolution must correctly canonicalize service names across all 6 connected sources; ≥ 95% of discoverable services must be present as correct Service nodes (not duplicates, not missing) within 1 hour of connector activation |
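The availability target translates directly into an allowed-downtime budget. A minimal sketch; the window lengths are examples, not operational windows:

```python
def allowed_downtime_minutes(availability: float, days: float) -> float:
    """Downtime budget implied by an availability target over a window."""
    return (1 - availability) * days * 24 * 60

# 99.9% leaves roughly 43 minutes per 30-day month, ~8.8 hours per year.
assert round(allowed_downtime_minutes(0.999, 30), 1) == 43.2
assert round(allowed_downtime_minutes(0.999, 365) / 60, 2) == 8.76
```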

Security Requirements

Security requirements are non-negotiable for a tool with access to source code, infrastructure topology, and organizational decision records.

Data Locality

Requirement: No inference data, source code, architecture diagrams, policy logic, or institutional memory ever leaves the DGX Spark hardware.

Implementation: All AI inference runs locally on DGX Spark vLLM endpoints. There are no API keys for OpenAI, Anthropic, or any external LLM service. No telemetry data is transmitted. License verification is offline via Ed25519 JWT. Docker image distribution is via signed OCI bundle, not a registry pull at runtime.

Verification: Network egress monitoring should show zero outbound connections to non-customer-controlled hosts after initial image pull. A firewall rule blocking all outbound traffic except to customer-internal systems must not break any Substrate functionality.

Credential Storage

Requirement: No credentials in plain text, Dockerfiles, environment variables, source control, or logs.

Implementation: All secrets (database passwords, webhook secrets, Vault AppRole credentials, JWT signing keys, Keycloak client secrets) stored as Docker secrets mounted at /run/secrets/. Application reads secrets from files. .env files contain only non-secret configuration (ports, feature flags, timeout values). A pre-commit hook blocks commits containing common secret patterns.
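A minimal sketch of file-based secret loading, assuming the conventional /run/secrets mount point; the secret names in the usage comment are illustrative, not Substrate's actual names:

```python
from pathlib import Path

SECRETS_DIR = Path("/run/secrets")  # conventional Docker secrets mount

def read_secret(name: str, base: Path = SECRETS_DIR) -> str:
    """Read a secret from its mounted file; never from the environment."""
    return (base / name).read_text().strip()

# Usage (names hypothetical):
#   db_password = read_secret("neo4j_password")
#   webhook_secret = read_secret("github_webhook_secret")
```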

Ephemeral SSH Certificates

Requirement: No long-lived SSH keys anywhere in the system. SSH access to infrastructure is certificate-only, time-bounded, and command-restricted.

Implementation: Vault SSH Secrets Engine issues Ed25519 certificates with a 5-minute TTL on every SSH Runtime Connector invocation. AppRole credentials (role_id + secret_id) stored as Docker secrets. ForceCommand on target hosts restricts SSH session to the inspection script. No agent forwarding. Certificate expires before any manual session could be established.

Inter-Service Authentication

Requirement: All traffic between application services is authenticated and encrypted.

Implementation: mTLS enforced on all inter-service traffic on the Docker bridge network. Certificates issued by the internal Vault CA with 24-hour TTL and automated rotation via Vault Agent Sidecar. The vLLM endpoints on localhost ports 8000–8004 accept connections only from application service IP ranges on the bridge network.

Authentication and Session Management

Requirement: OIDC authentication via Keycloak with short-lived tokens. No long-lived credentials for human users.

Implementation: Access tokens have a 15-minute TTL (configurable). Refresh tokens have a 1-hour TTL (configurable). SPA uses Authorization Code + PKCE. Local JWKS-based JWT validation for most requests (no per-request network call). Introspection only for high-consequence actions (PR block, Fix PR trigger, exception approval).
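The local-versus-introspection split can be read as a routing predicate. A minimal sketch; the action identifiers are illustrative stand-ins for the three high-consequence actions named above:

```python
# Actions that justify a per-request introspection round trip to Keycloak;
# everything else is validated locally against cached JWKS keys.
HIGH_CONSEQUENCE_ACTIONS = {"pr_block", "fix_pr_trigger", "exception_approval"}

def requires_introspection(action: str) -> bool:
    return action in HIGH_CONSEQUENCE_ACTIONS

assert requires_introspection("pr_block")
assert not requires_introspection("graph_query")
```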

Audit Immutability

Requirement: All user and system actions are permanently recorded and cannot be deleted or modified.

Implementation: PostgreSQL append-only audit table. Application user has INSERT privilege only — no UPDATE, DELETE, or TRUNCATE. Schema-level enforcement. Table partitioned by month for query performance; no partition is dropped within the 2-year retention window. Audit events published to NATS substrate.audit.> for external SIEM subscription.
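A schema-level enforcement sketch in PostgreSQL DDL. The table, column, and role names are illustrative; note that on a partitioned table the partition key must be part of the primary key.

```sql
-- Illustrative sketch: append-only audit table, partitioned by month.
CREATE TABLE audit_log (
    id          bigint GENERATED ALWAYS AS IDENTITY,
    occurred_at timestamptz NOT NULL DEFAULT now(),
    actor       text NOT NULL,
    action      text NOT NULL,
    detail      jsonb,
    PRIMARY KEY (id, occurred_at)
) PARTITION BY RANGE (occurred_at);

CREATE TABLE audit_log_2025_01 PARTITION OF audit_log
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- The application role can only append.
REVOKE ALL ON audit_log FROM substrate_app;
GRANT INSERT ON audit_log TO substrate_app;
```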

Webhook Security

Requirement: All inbound webhooks are verified before processing.

Implementation: HMAC-SHA256 verification using constant-time comparison (hmac.compare_digest) on all inbound webhooks. Failures return HTTP 403 with no information disclosure (no "signature mismatch" message — just 403). Failed webhook attempts logged to audit table with source IP and truncated request hash.
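A minimal verification sketch, assuming the GitHub-style `sha256=<hex>` signature header; other webhook sources may use a different header format:

```python
import hashlib
import hmac

def verify_webhook(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Constant-time HMAC-SHA256 check of an inbound webhook body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)

# On failure: return a bare 403, log source IP and a truncated request hash.
```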

Air-Gap License Verification

Requirement: License verification requires no outbound network connection.

Implementation: Ed25519-signed offline JWT verified against pre-distributed public key at runtime. On expiry or signature failure, Substrate enters read-only mode. No outbound call is made under any circumstances. License JWT and public key distributed with the offline OCI bundle.
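A sketch of the offline check using the `cryptography` package, with a compact `header.payload.signature` token standing in for the JWT (as an alg=EdDSA token would be laid out); the claim fields and read-only fallback are illustrative:

```python
import base64
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey, Ed25519PublicKey,
)

def _b64(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data)

def make_token(claims: dict, key: Ed25519PrivateKey) -> bytes:
    """Vendor side: Ed25519-sign header.payload (JWT alg=EdDSA style)."""
    signing_input = (_b64(json.dumps({"alg": "EdDSA"}).encode())
                     + b"." + _b64(json.dumps(claims).encode()))
    return signing_input + b"." + _b64(key.sign(signing_input))

def verify_license(token: bytes, public_key: Ed25519PublicKey):
    """Return claims if the signature is valid, else None (caller switches
    to read-only mode). No network call is made at any point."""
    signing_input, _, sig = token.rpartition(b".")
    try:
        public_key.verify(base64.urlsafe_b64decode(sig), signing_input)
    except InvalidSignature:
        return None
    return json.loads(base64.urlsafe_b64decode(signing_input.rpartition(b".")[2]))
```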


Graph Accuracy Requirements

Graph accuracy is a first-class NFR category. Without accurate graph data, every downstream capability — governance checks, NL reasoning, simulation, proactive alerts — produces wrong or misleading results. The system must earn and maintain trust by being provably accurate.

Accuracy Threshold

Requirement: Graph accuracy must exceed 95% within the first hour of connection and be maintained at >95% on an ongoing basis.

Definition of accuracy: The percentage of discoverable services (those visible via the connected sources) that are correctly represented as canonical, non-duplicate Service nodes in Neo4j with correct DEPENDS_ON and OWNS edges. An incorrect entity resolution (e.g., two different services merged into one, or one service split into two nodes) counts as an accuracy failure.

Entity Resolution Accuracy

Entity resolution is the most challenging accuracy component. The Dense resolve-lora adapter must be fine-tuned against real-world service naming conventions before v1.0 launch. Fine-tuning data must include:

  • Positive examples: different names for the same service (e.g., payment-service, srv-payment, payments, Payment Service) from real GitHub organizations
  • Negative examples: different services that have similar names (e.g., auth-service vs authorization-service when they are genuinely separate services)
  • Edge cases: monorepo services, versioned services (auth-v2), deprecated services still appearing in old Terraform state

The adapter's precision and recall on a held-out test set must exceed 94% before it is deployed to production. Precision failures (false merges) are more harmful than recall failures (false splits), because a false merge silently destroys information.
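Worked precision/recall arithmetic for the held-out gate, using made-up counts (a true positive here is a correct merge decision):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision penalizes false merges; recall penalizes false splits."""
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative counts: 470 correct merges, 10 false merges, 20 missed merges.
precision, recall = precision_recall(470, 10, 20)
assert round(precision, 3) == 0.979   # both sides clear the 94% gate,
assert round(recall, 3) == 0.959      # with precision held higher
```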

Confidence Thresholds for Trust

The Verification Queue maintains a 10–15% human escalation rate. This rate is not a bug — it is a feature:

  • Below 10% escalation: Humans feel disconnected from the graph. They do not trust data they were never asked to verify. Confidence thresholds are too permissive.
  • Above 15% escalation: Alert fatigue. Humans stop reviewing items. The queue becomes a dumping ground. Confidence thresholds are too restrictive.

| Confidence Range | Action | Human Involvement |
| --- | --- | --- |
| ≥ 0.90 | Auto-accept, verification_status = Verified | None (auto-proceed) |
| 0.60–0.89 | Accept with verification_status = Unverified, add to periodic review queue | Human reviews within 7 days; lower-range items flagged sooner |
| 0.50–0.59 | Accept with verification_status = Unverified, add to human review queue immediately | Human reviews within 48 hours |
| < 0.50 | Reject; not written to graph | Human reviews rejection log periodically for patterns |
| Disputed | Flagged in Verification Queue | Human resolves conflict between two sources; expert escalation if domain knowledge required |

The 10–15% escalation rate target means approximately 10–15% of accepted nodes should be in the 0.50–0.89 confidence range and actively under human review at any given time. If this rate falls below 10%, the confidence thresholds are too permissive and should be tightened. If it exceeds 15%, the entity resolution model needs retraining or the ingestion sources need better normalization.
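The threshold table can be read as a routing function. A minimal sketch; the status strings and queue identifiers are illustrative:

```python
def route(confidence: float, disputed: bool = False) -> tuple[str, str]:
    """Map a resolution confidence score to (verification_status, action)."""
    if disputed:
        return "Disputed", "flag_in_verification_queue"
    if confidence >= 0.90:
        return "Verified", "auto_accept"
    if confidence >= 0.60:
        return "Unverified", "periodic_review_7_days"
    if confidence >= 0.50:
        return "Unverified", "human_review_48_hours"
    return "Rejected", "rejection_log_only"

assert route(0.95) == ("Verified", "auto_accept")
assert route(0.75) == ("Unverified", "periodic_review_7_days")
assert route(0.55) == ("Unverified", "human_review_48_hours")
assert route(0.30) == ("Rejected", "rejection_log_only")
```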

Source Freshness Requirements

Graph accuracy degrades over time if sources are not re-verified. Staleness thresholds define when a node's data is considered potentially outdated:

| Source Type | Staleness Threshold | Action on Staleness |
| --- | --- | --- |
| SSH runtime verification | 1 day | Set verification_status = Stale; queue for re-inspection |
| Kubernetes API state | 1 day | Set verification_status = Stale; queue for re-poll |
| Terraform state file | 7 days | Set verification_status = Stale; alert Architect |
| GitHub repository data | 7 days | Set verification_status = Stale; queue for re-ingestion |
| Documentation (ADRs, READMEs) | 30 days | Set verification_status = Stale; flag in ADR gap detection |
| Jira/Projects data | 7 days | Set verification_status = Stale; queue for re-poll |

Stale nodes appear in the Architecture Graph with a visual staleness indicator (dimmed border). They are not removed from the graph — stale data is better than no data — but they are excluded from high-confidence enforcement decisions until re-verified.
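A minimal staleness check against the thresholds above; the source-type keys are illustrative slugs, not Substrate identifiers:

```python
from datetime import datetime, timedelta, timezone

# Thresholds from the staleness table above.
STALENESS = {
    "ssh_runtime": timedelta(days=1),
    "kubernetes_api": timedelta(days=1),
    "terraform_state": timedelta(days=7),
    "github_repo": timedelta(days=7),
    "documentation": timedelta(days=30),
    "jira_projects": timedelta(days=7),
}

def is_stale(source_type: str, last_verified: datetime,
             now: datetime = None) -> bool:
    """True when the node should be marked verification_status = Stale."""
    now = now or datetime.now(timezone.utc)
    return now - last_verified > STALENESS[source_type]
```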

Graph Accuracy Monitoring

The following metrics are tracked in PostgreSQL and displayed on the Drift and Alerts Dashboard:

| Metric | Target | Measured By |
| --- | --- | --- |
| Entity resolution success rate | > 96% | (correctly resolved entities) / (total entities processed) per nightly resolution pass |
| Verification Queue throughput | 10–15% escalation rate | (items escalated to human review) / (total items processed) per day |
| Graph staleness rate | < 5% stale nodes | (stale nodes) / (total nodes) computed nightly |
| Mean time to re-verify | < 7 days | Average age of items in the Verification Queue when marked Verified |
| Confidence distribution | P50 ≥ 0.80, P25 ≥ 0.65 | Distribution of confidence scores across all nodes in the graph |
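The first three metrics are simple ratios of the counters above. A sketch over illustrative nightly counts:

```python
def rate(part: int, whole: int) -> float:
    return part / whole

# Illustrative counts for one nightly pass.
resolution_success = rate(9_650, 10_000)   # target > 0.96
escalation = rate(130, 1_000)              # target 0.10–0.15
staleness = rate(1_200, 50_000)            # target < 0.05

assert resolution_success > 0.96
assert 0.10 <= escalation <= 0.15
assert staleness < 0.05
```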

Active Governance and Knowledge Lifecycle NFRs

These NFRs constrain the third core problem domain: active governance and proactive maintenance of organizational artifacts.

| Requirement | Target | Implementation |
| --- | --- | --- |
| Lifecycle labeling latency | < 10 seconds p95 per updated artifact | Event-driven labeling pipeline with deterministic freshness checks plus LLM classification |
| Lifecycle labeling accuracy | > 95% precision for stale, outdated, and incomplete labels on validated benchmark sets | Weekly labeled-sample evaluation with human adjudication loop |
| Delegation routing accuracy | > 98% correct owner routing (Developer/Team/Role) | Ownership graph traversal (OWNS, MEMBER_OF, fallback role maps) with nightly reconciliation |
| Remediation acknowledgment SLA | 90% of delegated tasks acknowledged within 24 hours | Escalation chain with Slack/in-app reminders and team-level escalation at day 7 |
| Curated update turnaround | < 30 seconds p95 from human response to formatted update proposal | Curator LLM with deterministic schema validation prior to apply |
| Post-remediation policy re-check | < 2 seconds p95 for scoped governance re-evaluation | Delta-scoped OPA evaluation on affected subgraph only |
| Chunk profile selection accuracy | > 97% correct profile assignment by source type and MIME/content heuristics | Deterministic router + confidence checks; fallback to review queue on ambiguity |
| Vector retrieval quality | Recall@20 ≥ 0.92 on role-specific benchmark queries | bge-m3 embeddings in PostgreSQL pgvector HNSW with periodic benchmark runs |
| Community refresh freshness | Incremental Leiden community updates visible within 2 minutes of significant graph delta | Incremental GDS job triggered by ingestion events; nightly full recompute |
| Community stability | Jaccard overlap ≥ 0.80 for unchanged regions across consecutive nightly runs | Partition drift monitoring; alert when stability drops below threshold |
| Governance digest timeliness | Weekly digest delivered by Monday 9:00 AM local tenant time | Scheduled digest pipeline with retry policy and delivery audit event |
| Evidence traceability completeness | 100% of lifecycle changes carry source refs, actor identity, and audit event IDs | Append-only audit model enforced at schema level in PostgreSQL |
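Recall@20 in the vector retrieval row is the fraction of all relevant documents that appear in the top 20 results. A worked sketch with made-up document IDs:

```python
def recall_at_k(relevant: set, ranked: list, k: int = 20) -> float:
    """Fraction of relevant documents found in the top-k ranked results."""
    return len(relevant & set(ranked[:k])) / len(relevant)

# Four relevant documents; three appear in the top 20.
relevant = {"d1", "d2", "d3", "d4"}
ranked = ["d1", "x", "d2", "y", "d3"] + [f"z{i}" for i in range(15)]
assert recall_at_k(relevant, ranked) == 0.75   # d4 missed
```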