Infrastructure & Deployment — Design Reference¶
Overview¶
Substrate is designed to be self-hosted first. All AI inference runs locally on an NVIDIA DGX Spark node with 128GB LPDDR5x unified memory; no calls are made to OpenAI, Anthropic, or any other external LLM API. The deployment model is Docker Compose for evaluation and a Helm umbrella chart for production Kubernetes, following the GitLab pattern for conditional external dependencies.
Every infrastructure decision is made against two constraints: the DGX Spark's unified memory budget is fixed at 128GB, so every byte must be accounted for; and the system must survive process crashes without losing events, jobs, or graph state.
DGX Spark Memory Budget¶
The DGX Spark provides 128GB LPDDR5x unified memory shared between CPU and GPU workloads. There is no discrete GPU VRAM — all model weights load directly into this shared pool. This requires explicit, non-overlapping memory allocation planning. The following budget is the authoritative allocation.
| Allocation | Size | Persistence | Primary Model / Use |
|---|---|---|---|
| OS and vLLM Runtime | 8.0 GB | Fixed | Kernel, systemd, Python runtime, vLLM process overhead |
| Llama 4 Scout (MoE) | 55.0 GB | Always resident | Global reasoning, RAPTOR community map-reduce |
| Dense 70B + LoRA adapters | 38.0 GB | Always resident | Extraction, explanation, NL→Cypher (extract-lora, explain-lora, cypher-lora, resolve-lora) |
| bge-m3 / bge-reranker | 0.9 GB | Always resident | Node embeddings and RRF fusion reranking |
| KV Cache Pool | 26.1 GB | Persistent | Approximately 12 concurrent users at 128k context window |
| Qwen2.5-Coder | 18.0 GB | On-demand (socket-activated) | AST enrichment, Fix PR generation |
| SDXL + ControlNet | 5.0 GB | Burst (on demand) | Executive architecture visualizations |
Total budgeted: 128.0 GB. The always-resident allocations alone (8.0 + 55.0 + 38.0 + 0.9 + 26.1) account for the full budget, so there is no headroom; Qwen2.5-Coder and SDXL are on-demand only and are never co-resident with each other or with other burst workloads. Socket activation ensures they are loaded only when a job is queued and evicted when idle.
KV Cache constraint: The 26.1 GB KV cache pool supports approximately 12 concurrent users at 128k context. This is the hard ceiling on concurrent LLM sessions and is reflected in the NFR for concurrent user support.
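As a sanity check on the table, the pool arithmetic can be run directly. The per-token figure that falls out is a derived estimate, not a design input; the design fixes the pool size, session count, and context window.

```python
# Back-of-envelope check on the KV cache budget in the table above.
POOL_BYTES = 26.1e9          # KV cache pool from the memory budget table
SESSIONS = 12                # target concurrent users
CONTEXT_TOKENS = 128 * 1024  # 128k-token context window

bytes_per_session = POOL_BYTES / SESSIONS
bytes_per_token = bytes_per_session / CONTEXT_TOKENS

print(f"per session: {bytes_per_session / 1e9:.2f} GB")   # roughly 2.2 GB
print(f"per token:   {bytes_per_token / 1024:.1f} KiB")   # roughly 16 KiB
```

At roughly 16 KiB of KV state per token, a thirteenth full-context session would overrun the pool, which is why the 12-user figure is a hard ceiling rather than a soft target.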
Why LPDDR5x Unified Memory Matters¶
The DGX Spark uses LPDDR5x unified memory, meaning there is a single memory pool shared between the ARM CPU and the GPU compute cores. This is architecturally distinct from a discrete GPU setup where model weights live in VRAM and CPU memory is separate. On the DGX Spark:
- All model weights, KV caches, activation buffers, and CPU working memory draw from the same 128GB pool
- NUMA-aware allocation is essential — container memory overhead wastes bandwidth that the GPU needs for weight loading
- This is why vLLM model endpoints run bare metal under systemd rather than in Docker
Bare Metal vs Container¶
The deployment splits cleanly into two categories based on the unified memory constraint.
vLLM Model Endpoints — Bare Metal Under systemd¶
All five vLLM serving processes run bare metal, managed by systemd units. This is a non-negotiable architectural decision driven by the DGX Spark unified memory architecture:
- Container runtimes introduce memory overhead and NUMA scheduling latency that reduces effective memory bandwidth for weight loading
- Direct bare metal allocation allows NUMA-aware placement that the GPU compute cores can access at peak bandwidth
- systemd provides Restart=on-failure semantics, socket activation for on-demand models, and journal-based logging without container orchestration overhead
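A minimal sketch of what one such unit might look like. The unit name, file paths, and adapter paths are illustrative, not the project's actual configuration; the `vllm serve` flags shown (`--enable-lora`, `--lora-modules`) are standard vLLM CLI options.

```ini
# /etc/systemd/system/vllm-dense70b.service (illustrative only)
[Unit]
Description=vLLM endpoint: Dense 70B with LoRA adapters (port 8001)
After=network-online.target
Wants=network-online.target

[Service]
# Bare metal: no container runtime between vLLM and the unified memory pool
ExecStart=/opt/vllm/bin/vllm serve /models/dense-70b \
    --port 8001 \
    --enable-lora \
    --lora-modules extract-lora=/models/lora/extract explain-lora=/models/lora/explain
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

The on-demand models would add a matching `.socket` unit so systemd holds the port open and starts the service only on the first connection.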
Application and Databases — Docker Compose¶
All application services and databases run in Docker Compose (evaluation) or Kubernetes (production):
- Neo4j 5.x
- PostgreSQL 16 (with pgvector and pg_partman extensions)
- Redis 7 (AOF persistence enabled)
- NATS JetStream
- FastAPI gateway
- Celery workers
- All Substrate microservices (Ingestion, Governance, Reasoning, Proactive Maintenance, Simulation, Agent Orchestration)
This split is explicit and permanent. The containerized application tier reaches the bare-metal model endpoints over TCP on host ports 8000–8004, routed through the Docker bridge network.
vLLM Endpoint Layout¶
| Port | Model | Role | LoRA Adapters | Activation |
|---|---|---|---|---|
| 8000 | Llama 4 Scout (MoE, 55GB) | Global reasoning, RAPTOR community map-reduce | — | Always-on |
| 8001 | Dense 70B | Extraction, explanation, NL→Cypher | extract-lora, explain-lora, cypher-lora, resolve-lora | Always-on |
| 8002 | Qwen2.5-Coder (18GB) | Fix PR generation, AST enrichment | — | Socket-activated (on-demand) |
| 8003 | bge-m3 | Node embedding generation | — | Always-on |
| 8004 | bge-reranker-v2-m3 | Reranking for RRF fusion | — | Always-on |
LoRA Adapter Details (Dense 70B, Port 8001)¶
The Dense 70B model loads four LoRA adapters that are hot-swapped per request without reloading base weights:
- extract-lora: Fine-tuned for structured extraction from unstructured text (ADRs, post-mortems, PR descriptions). Extracts entities, relationships, and confidence scores as JSON.
- explain-lora: Fine-tuned for generating plain English violation explanations from OPA evaluation results and graph context. Used in PR comments (GOV-UC-01, GOV-UC-05).
- cypher-lora: Fine-tuned for NL→Cypher translation. Accepts a natural language query and the current graph schema context; returns a valid Cypher query (RSN-UC-07).
- resolve-lora: Fine-tuned against real-world service naming conventions for entity resolution. Classifies whether two normalized entity names refer to the same canonical service (ING-UC-09).
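On vLLM's OpenAI-compatible API, each adapter registered via `--lora-modules` is exposed as its own model name, so a request routes to an adapter simply by naming it in the `model` field. A sketch of building such a request; the endpoint URL, system prompt, and document text are illustrative:

```python
import json

def build_extraction_request(document_text: str) -> dict:
    """Build a chat-completions payload targeting the extract-lora adapter."""
    return {
        "model": "extract-lora",  # adapter name registered via --lora-modules
        "messages": [
            {"role": "system",
             "content": "Extract entities, relationships, and confidence scores as JSON."},
            {"role": "user", "content": document_text},
        ],
        "temperature": 0.0,  # deterministic output for structured extraction
    }

payload = build_extraction_request("ADR-042: We will adopt NATS JetStream ...")
body = json.dumps(payload)  # POST to e.g. http://localhost:8001/v1/chat/completions
```

Because the base weights stay resident, switching from `extract-lora` to `cypher-lora` between two requests costs only an adapter swap, not a model reload.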
Startup Order¶
The startup sequence is strictly ordered. Services that depend on databases must not start until databases are healthy and migrations have completed.
1. Critical Persistent Databases¶
Start first, in any order (they have no inter-dependencies):
- PostgreSQL 16
- Neo4j 5.x
- Redis 7
- NATS JetStream
Wait for all four to pass health checks before proceeding.
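In Docker Compose, this gate can be expressed with health checks and `depends_on` conditions. A hedged fragment; the service names, images, and credentials are placeholders, not Substrate's actual Compose file:

```yaml
# Illustrative Compose fragment: gate the migration job on database health.
services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U substrate"]
      interval: 5s
      timeout: 3s
      retries: 10

  migrate:
    image: substrate/migrate:latest
    command: ["make", "migrate"]
    depends_on:
      postgres:
        condition: service_healthy
```

The `condition: service_healthy` form makes Compose wait for the health check to pass, not merely for the container to start.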
2. Migrations¶
Run migrations before any application service starts:
```shell
make migrate
```
This runs:
- Flyway for PostgreSQL schema migrations (non-negotiable from day one — schema changes not migration-controlled become invisible technical debt)
- neo4j-migrations for graph schema constraint management
Migrations are idempotent and versioned. A failed migration aborts startup and logs the failing SQL or Cypher to stderr.
3. Application Services¶
Start after migrations complete successfully:
- FastAPI gateway
- Celery workers (ingestion, governance, proactive)
- Substrate microservices (Ingestion, Governance, Reasoning, Proactive Maintenance, Simulation, Agent Orchestration)
4. vLLM Endpoints¶
Start last, in dependency order (embedding model must be available before extraction models begin processing queued jobs):
- bge-m3 (port 8003) — embedding; needed by all subsequent models for context
- bge-reranker-v2-m3 (port 8004)
- Dense 70B (port 8001) — extraction and explanation
- Llama 4 Scout (port 8000) — resident, always-on
- Qwen2.5-Coder (port 8002) — socket-activated; systemd starts listening on port but only loads model weights when a request arrives
Database Migration Policy¶
PostgreSQL — Flyway¶
Flyway manages all PostgreSQL schema migrations. This is non-negotiable from day one of development, not an "add it later when it gets complicated" decision. Schema changes that are not migration-controlled become invisible technical debt within weeks.
Conventions:
- Migrations live in migrations/postgres/V{version}__{description}.sql
- All migrations are forward-only (no rollback scripts — rollback is a new forward migration)
- Migration filenames use underscores, not hyphens
- Breaking changes (column drops, type changes) require a multi-step migration with a compatibility window
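A sketch of what the multi-step pattern looks like in practice. The table and column names are hypothetical, used only to illustrate the compatibility window:

```sql
-- V12__add_service_slug.sql  (step 1 of a hypothetical column rename)
-- Add the new column alongside the old one; both stay readable during
-- the compatibility window while readers and writers migrate.
ALTER TABLE services ADD COLUMN slug TEXT;
UPDATE services SET slug = legacy_name WHERE slug IS NULL;

-- V13__drop_legacy_name.sql  (step 2, a later forward migration shipped
-- only after every reader and writer has moved to slug)
ALTER TABLE services DROP COLUMN legacy_name;
```

Each step is its own forward-only versioned file; there is never a rollback script, only the next migration.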
Neo4j — neo4j-migrations¶
neo4j-migrations manages graph schema constraints, indexes, and node label definitions.
Conventions:
- Migrations live in migrations/neo4j/V{version}__{description}.cypher
- All constraint creation is idempotent (CREATE CONSTRAINT IF NOT EXISTS)
- Index creation uses CREATE INDEX IF NOT EXISTS
- Node label additions are additive; removals require explicit deprecation migration
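An illustrative migration file following these conventions, using Neo4j 5.x syntax; the constraint and index names are placeholders:

```cypher
// V03__service_constraints.cypher (illustrative migration file)
// Idempotent by construction: safe to re-run on a partially migrated graph.
CREATE CONSTRAINT service_name_unique IF NOT EXISTS
FOR (s:Service) REQUIRE s.name IS UNIQUE;

CREATE INDEX service_team_idx IF NOT EXISTS
FOR (s:Service) ON (s.team);
```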
Deployment Architecture¶
Self-Hosted First¶
Substrate is self-hosted first. There is no SaaS option in v1. The customer owns the hardware, the data, and the deployment. This is the "Zero Data Leaves the Building" policy enforced at the infrastructure level.
Evaluation: Docker Compose¶
For evaluation and small team deployments:
```shell
docker compose up -d
```
All application services and databases start in a single Compose file. The vLLM endpoints are managed by separate systemd units that must be started independently on the DGX Spark host.
Production: Helm Umbrella Chart¶
For production Kubernetes deployment, following the GitLab Helm chart pattern:
- Single umbrella chart with sub-charts for each service
- Conditional external dependencies:
  - `postgresql.enabled=true` → bundles the Bitnami PostgreSQL sub-chart
  - `postgresql.enabled=false` → reads the connection string from `externalDatabase.host` configuration
- Same pattern applies to Neo4j, Redis, and NATS
This enables customers to use their existing managed database infrastructure (RDS, Cloud SQL, Redis Enterprise) rather than being forced to run bundled instances.
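A sketch of the corresponding `values.yaml` shape; the key names beyond `postgresql.enabled` and `externalDatabase.host` (which appear above) are plausible placeholders in the GitLab-chart style, not the chart's actual schema:

```yaml
# Illustrative values.yaml fragment for the umbrella chart.
postgresql:
  enabled: true            # false -> use externalDatabase below instead

externalDatabase:
  host: pg.internal.example.com
  port: 5432
  database: substrate
  existingSecret: substrate-pg-credentials
```

Templates then branch on the flag, e.g. `{{- if not .Values.postgresql.enabled }}` renders the external connection settings instead of the bundled sub-chart's service name.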
Air-Gap Support¶
Substrate runs fully air-gapped with no outbound network dependencies:
- License verification: Ed25519-signed offline JWT verified against pre-distributed public key. No license server call. No outbound network required.
- All AI inference: Runs on local DGX Spark. Zero calls to external LLM APIs.
- All source connectors: Pull from customer-internal systems (GitHub Enterprise, self-hosted Confluence, self-hosted Jira).
- Container images: Distributed as a signed OCI bundle for offline import via `docker load`.
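The license flow above can be sketched end to end. This uses the `cryptography` package as an illustration; the claim names, key handling, and token layout are hypothetical, not Substrate's actual license format:

```python
import base64
import json
import time
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def b64url(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

# Vendor side (done once, before shipping): sign the license claims.
signing_key = Ed25519PrivateKey.generate()
public_key = signing_key.public_key()  # pre-distributed with the install bundle

header = b64url(json.dumps({"alg": "EdDSA", "typ": "JWT"}).encode())
claims = b64url(json.dumps({"tier": "scale", "exp": int(time.time()) + 86400}).encode())
signing_input = header + b"." + claims
token = signing_input + b"." + b64url(signing_key.sign(signing_input))

# Customer side (air-gapped): verify with the local public key only.
def verify_license(token: bytes, pub) -> dict:
    head, body, sig = token.split(b".")
    pub.verify(  # raises InvalidSignature on any tampering
        base64.urlsafe_b64decode(sig + b"=" * (-len(sig) % 4)),
        head + b"." + body,
    )
    decoded = json.loads(base64.urlsafe_b64decode(body + b"=" * (-len(body) % 4)))
    if decoded["exp"] < time.time():
        raise ValueError("license expired")
    return decoded

license_claims = verify_license(token, public_key)
```

The verification path touches no network: the public key ships with the install bundle, and expiry is checked against the local clock.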
Licensing Model¶
Substrate uses the BSL/FSL model following the GitLab and Sentry pattern:
- Released under Business Source License (BSL) at launch
- 2-year change date to Apache 2.0: All code automatically relicenses to Apache 2.0 two years after each version's release date
- This prevents hyperscaler commoditization during the commercial window while ensuring long-term open access
- Enterprise features (multi-team isolation, Scale Plan) gated by license tier, not code forks
Service Communication¶
All inter-service communication uses explicitly chosen transports. There are no undocumented or ad-hoc HTTP calls between services.
NATS JetStream (Event Bus)¶
All inter-service events flow through NATS JetStream:
- Subject hierarchy: `substrate.>` — e.g., `substrate.ingestion.pr_opened`, `substrate.governance.violation_raised`, `substrate.graph.node_written`
- Delivery guarantee: At-least-once; Celery workers ack after successful processing
- Stream replay: If the Ingestion Service crashes mid-parse, the event replays on restart without silent loss
- Consumer groups: Multiple Ingestion workers can consume from the same stream with load balancing
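The subject hierarchy follows standard NATS wildcard semantics: `*` matches exactly one token, `>` matches one or more trailing tokens. A small self-contained sketch of those matching rules (not code from the project):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style subject matching: '*' is one token, '>' is the rest."""
    p, s = pattern.split("."), subject.split(".")
    for i, tok in enumerate(p):
        if tok == ">":              # '>' is always last; matches >= 1 tokens
            return len(s) >= i + 1
        if i >= len(s):
            return False
        if tok != "*" and tok != s[i]:
            return False
    return len(p) == len(s)

# A consumer bound to 'substrate.>' sees every Substrate event:
assert subject_matches("substrate.>", "substrate.ingestion.pr_opened")
assert subject_matches("substrate.governance.*", "substrate.governance.violation_raised")
assert not subject_matches("substrate.>", "billing.invoice_created")
```

This is why one durable consumer on `substrate.>` can audit-log the entire system while narrower consumers subscribe per service.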
FastAPI Gateway (HTTP Ingress)¶
- All inbound webhooks (GitHub, Jira, Confluence, Terraform) arrive at the FastAPI gateway
- HMAC SHA-256 verification on all inbound webhooks before forwarding to NATS (GitHub `X-Hub-Signature-256`, Jira shared secret, Confluence webhook token)
- OAuth2/JWT validation for all API requests; JWT verified locally against cached JWKS
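The GitHub check is standard: the header carries `sha256=` plus the hex HMAC-SHA-256 of the raw body under the shared secret. A minimal sketch with a hypothetical secret and payload:

```python
import hashlib
import hmac

def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a GitHub X-Hub-Signature-256 header against the raw request body.

    compare_digest avoids leaking match position through timing.
    """
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)

# Hypothetical webhook delivery:
secret = b"webhook-shared-secret"
body = b'{"action": "opened", "number": 1}'
good = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

assert verify_github_signature(secret, body, good)
assert not verify_github_signature(secret, body + b" ", good)  # any body change fails
```

The check must run on the raw bytes before any JSON parsing, since re-serialization can change the byte sequence and break the digest.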
Redis (Cache and Job Coordination)¶
Three distinct uses, all through the same Redis 7 instance:
- Hot subgraph cache: Frequently queried subgraph results cached with TTL; invalidated on graph update events
- Celery job queue: All async jobs dispatched via Celery with Redis as broker; AOF persistence ensures job survival through Redis restarts
- Distributed locks: The `SET NX EX` pattern prevents concurrent Celery workers from running duplicate ingestion jobs for the same source
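The lock pattern above can be illustrated with a tiny in-memory stand-in for Redis. With a real client the acquire is `r.set(name, token, nx=True, ex=ttl)`, and the owner-checked release is normally a small atomic Lua script; the class and key names here are purely illustrative:

```python
import time
import uuid

class MiniLockStore:
    """In-memory emulation of SET key value NX EX ttl plus owner-checked release."""

    def __init__(self):
        self._data = {}  # key -> (token, expires_at)

    def set_nx_ex(self, key: str, token: str, ttl: int) -> bool:
        entry = self._data.get(key)
        if entry and entry[1] > time.monotonic():
            return False                       # lock held and not yet expired
        self._data[key] = (token, time.monotonic() + ttl)
        return True

    def release(self, key: str, token: str) -> bool:
        entry = self._data.get(key)
        if entry and entry[0] == token:        # only the owner may release
            del self._data[key]
            return True
        return False

store = MiniLockStore()
token = uuid.uuid4().hex
assert store.set_nx_ex("ingest:github:repo-a", token, ttl=30)        # worker 1 wins
assert not store.set_nx_ex("ingest:github:repo-a", "other", ttl=30)  # worker 2 skips
assert store.release("ingest:github:repo-a", token)
```

The random token matters: it stops a worker whose lock expired mid-job from deleting a lock that a second worker has since acquired.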
mTLS (Inter-Service Authentication)¶
All traffic between application services on the Docker bridge network is protected by mTLS:
- Mutual TLS certificates issued by the internal Vault CA (same Ed25519 CA used for SSH certificates)
- Certificate rotation handled by Vault Agent Sidecar on each service
- This applies to all service-to-service HTTP traffic; it does not apply to database connections (separate credential rotation path)