Infrastructure & Deployment — Design Reference¶
Overview¶
Substrate is designed to be self-hosted first. All AI inference runs locally on an NVIDIA DGX Spark node with 128GB LPDDR5x unified memory; no calls are made to OpenAI, Anthropic, or any other external LLM API. The deployment model is Docker Compose for evaluation and a Helm umbrella chart for production Kubernetes, following the GitLab pattern for conditional external dependencies.
Every infrastructure decision is made against two constraints: the DGX Spark's unified memory budget is fixed at 128GB, so every byte must be accounted for; and the system must survive process crashes without losing events, jobs, or graph state.
DGX Spark Memory Budget¶
The DGX Spark provides 128GB LPDDR5x unified memory shared between CPU and GPU workloads. There is no discrete GPU VRAM — all model weights load directly into this shared pool. This requires explicit, non-overlapping memory allocation planning. The following budget is the authoritative allocation.
| Allocation | Size | Persistence | Primary Model / Use |
|---|---|---|---|
| OS and vLLM Runtime | 8.0 GB | Fixed | Kernel, systemd, Python runtime, vLLM process overhead |
| Llama 4 Scout (MoE) | 55.0 GB | Always resident | Global reasoning, RAPTOR community map-reduce |
| Dense 70B + LoRA adapters | 38.0 GB | Always resident | Extraction, explanation, NL→Cypher (extract-lora, explain-lora, cypher-lora, resolve-lora) |
| bge-m3 / bge-reranker | 0.9 GB | Always resident | Node embeddings and RRF fusion reranking |
| KV Cache Pool | 26.1 GB | Persistent | Approximately 12 concurrent users at 128k context window |
| Qwen2.5-Coder | 18.0 GB | On-demand (socket-activated) | AST enrichment, Fix PR generation |
| SDXL + ControlNet | 5.0 GB | Burst (on demand) | Executive architecture visualizations |
Total budgeted: 128.0 GB. The always-resident allocations alone (8.0 + 55.0 + 38.0 + 0.9 + 26.1) account for the full budget, so there is no headroom; Qwen2.5-Coder and SDXL are on-demand only and are never co-resident with each other or with other burst workloads. Socket activation ensures they are loaded only when a job is queued and evicted when idle.
KV Cache constraint: The 26.1 GB KV cache pool supports approximately 12 concurrent users at 128k context. This is the hard ceiling on concurrent LLM sessions and is reflected in the NFR for concurrent user support.
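As a sanity check on the table, the pool arithmetic can be run directly. The per-token figure that falls out is a derived estimate, not a design input; the design fixes the pool size, session count, and context window.

```python
# Back-of-envelope check on the KV cache budget in the table above.
POOL_BYTES = 26.1e9          # KV cache pool from the memory budget table
SESSIONS = 12                # target concurrent users
CONTEXT_TOKENS = 128 * 1024  # 128k-token context window

bytes_per_session = POOL_BYTES / SESSIONS
bytes_per_token = bytes_per_session / CONTEXT_TOKENS

print(f"per session: {bytes_per_session / 1e9:.2f} GB")   # roughly 2.2 GB
print(f"per token:   {bytes_per_token / 1024:.1f} KiB")   # roughly 16 KiB
```

At roughly 16 KiB of KV state per token, a thirteenth full-context session would overrun the pool, which is why the 12-user figure is a hard ceiling rather than a soft target.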
Why LPDDR5x Unified Memory Matters¶
The DGX Spark uses LPDDR5x unified memory, meaning there is a single memory pool shared between the ARM CPU and the GPU compute cores. This is architecturally distinct from a discrete GPU setup where model weights live in VRAM and CPU memory is separate. On the DGX Spark:
- All model weights, KV caches, activation buffers, and CPU working memory draw from the same 128GB pool
- NUMA-aware allocation is essential — container memory overhead wastes bandwidth that the GPU needs for weight loading
- This is why vLLM model endpoints run bare metal under systemd rather than in Docker
Bare Metal vs Container¶
The deployment splits cleanly into two categories based on the unified memory constraint.
vLLM Model Endpoints — Bare Metal Under systemd¶
All five vLLM serving processes run bare metal, managed by systemd units. This is a non-negotiable architectural decision driven by the DGX Spark unified memory architecture:
- Container runtimes introduce memory overhead and NUMA scheduling latency that reduces effective memory bandwidth for weight loading
- Direct bare metal allocation allows NUMA-aware placement that the GPU compute cores can access at peak bandwidth
- systemd provides Restart=on-failure semantics, socket activation for on-demand models, and journal-based logging without container orchestration overhead
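A minimal sketch of what one such unit might look like. The unit name, file paths, and adapter paths are illustrative, not the project's actual configuration; the `vllm serve` flags shown (`--enable-lora`, `--lora-modules`) are standard vLLM CLI options.

```ini
# /etc/systemd/system/vllm-dense70b.service (illustrative only)
[Unit]
Description=vLLM endpoint: Dense 70B with LoRA adapters (port 8001)
After=network-online.target
Wants=network-online.target

[Service]
# Bare metal: no container runtime between vLLM and the unified memory pool
ExecStart=/opt/vllm/bin/vllm serve /models/dense-70b \
    --port 8001 \
    --enable-lora \
    --lora-modules extract-lora=/models/lora/extract explain-lora=/models/lora/explain
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

The on-demand models would add a matching `.socket` unit so systemd holds the port open and starts the service only on the first connection.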
Application and Databases — Docker Compose¶
All application services and databases run in Docker Compose (evaluation) or Kubernetes (production):
- Neo4j 5.x
- PostgreSQL 16 (with pgvector and pg_partman extensions)
- Redis 7 (AOF persistence enabled)
- NATS JetStream
- FastAPI gateway
- Celery workers
- All Substrate microservices (Ingestion, Governance, Reasoning, Proactive Maintenance, Simulation, Agent Orchestration)
This split is explicit and permanent. The containerized application tier reaches the bare-metal model endpoints over TCP on host ports 8000–8004, routed through the Docker bridge network.
vLLM Endpoint Layout¶
| Port | Model | Role | LoRA Adapters | Activation |
|---|---|---|---|---|
| 8000 | Llama 4 Scout (MoE, 55GB) | Global reasoning, RAPTOR community map-reduce | — | Always-on |
| 8001 | Dense 70B | Extraction, explanation, NL→Cypher | extract-lora, explain-lora, cypher-lora, resolve-lora | Always-on |
| 8002 | Qwen2.5-Coder (18GB) | Fix PR generation, AST enrichment | — | Socket-activated (on-demand) |
| 8003 | bge-m3 | Node embedding generation | — | Always-on |
| 8004 | bge-reranker-v2-m3 | Reranking for RRF fusion | — | Always-on |
LoRA Adapter Details (Dense 70B, Port 8001)¶
The Dense 70B model loads four LoRA adapters that are hot-swapped per request without reloading base weights:
- extract-lora: Fine-tuned for structured extraction from unstructured text (ADRs, post-mortems, PR descriptions). Extracts entities, relationships, and confidence scores as JSON.
- explain-lora: Fine-tuned for generating plain English violation explanations from OPA evaluation results and graph context. Used in PR comments (GOV-UC-01, GOV-UC-05).
- cypher-lora: Fine-tuned for NL→Cypher translation. Accepts a natural language query and the current graph schema context; returns a valid Cypher query (RSN-UC-07).
- resolve-lora: Fine-tuned against real-world service naming conventions for entity resolution. Classifies whether two normalized entity names refer to the same canonical service (ING-UC-09).
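On vLLM's OpenAI-compatible API, each adapter registered via `--lora-modules` is exposed as its own model name, so a request routes to an adapter simply by naming it in the `model` field. A sketch of building such a request; the endpoint URL, system prompt, and document text are illustrative:

```python
import json

def build_extraction_request(document_text: str) -> dict:
    """Build a chat-completions payload targeting the extract-lora adapter."""
    return {
        "model": "extract-lora",  # adapter name registered via --lora-modules
        "messages": [
            {"role": "system",
             "content": "Extract entities, relationships, and confidence scores as JSON."},
            {"role": "user", "content": document_text},
        ],
        "temperature": 0.0,  # deterministic output for structured extraction
    }

payload = build_extraction_request("ADR-042: We will adopt NATS JetStream ...")
body = json.dumps(payload)  # POST to e.g. http://localhost:8001/v1/chat/completions
```

Because the base weights stay resident, switching from `extract-lora` to `cypher-lora` between two requests costs only an adapter swap, not a model reload.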
Startup Order¶
The startup sequence is strictly ordered. Services that depend on databases must not start until databases are healthy and migrations have completed.
1. Critical Persistent Databases¶
Start first, in any order (they have no inter-dependencies):
- PostgreSQL 16
- Neo4j 5.x
- Redis 7
- NATS JetStream
Wait for all four to pass health checks before proceeding.
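In Docker Compose, this gate can be expressed with health checks and `depends_on` conditions. A hedged fragment; the service names, images, and credentials are placeholders, not Substrate's actual Compose file:

```yaml
# Illustrative Compose fragment: gate the migration job on database health.
services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U substrate"]
      interval: 5s
      timeout: 3s
      retries: 10

  migrate:
    image: substrate/migrate:latest
    command: ["make", "migrate"]
    depends_on:
      postgres:
        condition: service_healthy
```

The `condition: service_healthy` form makes Compose wait for the health check to pass, not merely for the container to start.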
2. Migrations¶
Run migrations before any application service starts:
```shell
make migrate
```
This runs:
- Flyway for PostgreSQL schema migrations (non-negotiable from day one — schema changes not migration-controlled become invisible technical debt)
- neo4j-migrations for graph schema constraint management
Migrations are idempotent and versioned. A failed migration aborts startup and logs the failing SQL or Cypher to stderr.
3. Application Services¶
Start after migrations complete successfully:
- FastAPI gateway
- Celery workers (ingestion, governance, proactive)
- Substrate microservices (Ingestion, Governance, Reasoning, Proactive Maintenance, Simulation, Agent Orchestration)
4. vLLM Endpoints¶
Start last, in dependency order (embedding model must be available before extraction models begin processing queued jobs):
- bge-m3 (port 8003) — embedding; needed by all subsequent models for context
- bge-reranker-v2-m3 (port 8004)
- Dense 70B (port 8001) — extraction and explanation
- Llama 4 Scout (port 8000) — resident, always-on
- Qwen2.5-Coder (port 8002) — socket-activated; systemd starts listening on port but only loads model weights when a request arrives
Database Migration Policy¶
PostgreSQL — Flyway¶
Flyway manages all PostgreSQL schema migrations. This is non-negotiable from day one of development, not an "add it later when it gets complicated" decision. Schema changes that are not migration-controlled become invisible technical debt within weeks.
Conventions:
- Migrations live in migrations/postgres/V{version}__{description}.sql
- All migrations are forward-only (no rollback scripts — rollback is a new forward migration)
- Migration filenames use underscores, not hyphens
- Breaking changes (column drops, type changes) require a multi-step migration with a compatibility window
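A sketch of what the multi-step pattern looks like in practice. The table and column names are hypothetical, used only to illustrate the compatibility window:

```sql
-- V12__add_service_slug.sql  (step 1 of a hypothetical column rename)
-- Add the new column alongside the old one; both stay readable during
-- the compatibility window while readers and writers migrate.
ALTER TABLE services ADD COLUMN slug TEXT;
UPDATE services SET slug = legacy_name WHERE slug IS NULL;

-- V13__drop_legacy_name.sql  (step 2, a later forward migration shipped
-- only after every reader and writer has moved to slug)
ALTER TABLE services DROP COLUMN legacy_name;
```

Each step is its own forward-only versioned file; there is never a rollback script, only the next migration.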
Neo4j — neo4j-migrations¶
neo4j-migrations manages graph schema constraints, indexes, and node label definitions.
Conventions:
- Migrations live in migrations/neo4j/V{version}__{description}.cypher
- All constraint creation is idempotent (CREATE CONSTRAINT IF NOT EXISTS)
- Index creation uses CREATE INDEX IF NOT EXISTS
- Node label additions are additive; removals require explicit deprecation migration
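An illustrative migration file following these conventions, using Neo4j 5.x syntax; the constraint and index names are placeholders:

```cypher
// V03__service_constraints.cypher (illustrative migration file)
// Idempotent by construction: safe to re-run on a partially migrated graph.
CREATE CONSTRAINT service_name_unique IF NOT EXISTS
FOR (s:Service) REQUIRE s.name IS UNIQUE;

CREATE INDEX service_team_idx IF NOT EXISTS
FOR (s:Service) ON (s.team);
```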
Deployment Architecture¶
Self-Hosted First¶
Substrate is self-hosted first. There is no SaaS option in v1. The customer owns the hardware, the data, and the deployment. This is the "Zero Data Leaves the Building" policy enforced at the infrastructure level.
Evaluation: Docker Compose¶
For evaluation and small team deployments:
```shell
docker compose up -d
```
All application services and databases start in a single Compose file. The vLLM endpoints are managed by separate systemd units that must be started independently on the DGX Spark host.
Production: Helm Umbrella Chart¶
For production Kubernetes deployment, following the GitLab Helm chart pattern:
- Single umbrella chart with sub-charts for each service
- Conditional external dependencies:
  - `postgresql.enabled=true` → bundles the Bitnami PostgreSQL sub-chart
  - `postgresql.enabled=false` → reads the connection string from `externalDatabase.host` configuration
- Same pattern applies to Neo4j, Redis, and NATS
This enables customers to use their existing managed database infrastructure (RDS, Cloud SQL, Redis Enterprise) rather than being forced to run bundled instances.
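A sketch of the corresponding `values.yaml` shape; the key names beyond `postgresql.enabled` and `externalDatabase.host` (which appear above) are plausible placeholders in the GitLab-chart style, not the chart's actual schema:

```yaml
# Illustrative values.yaml fragment for the umbrella chart.
postgresql:
  enabled: true            # false -> use externalDatabase below instead

externalDatabase:
  host: pg.internal.example.com
  port: 5432
  database: substrate
  existingSecret: substrate-pg-credentials
```

Templates then branch on the flag, e.g. `{{- if not .Values.postgresql.enabled }}` renders the external connection settings instead of the bundled sub-chart's service name.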
Air-Gap Support¶
Substrate runs fully air-gapped with no outbound network dependencies:
- License verification: Ed25519-signed offline JWT verified against pre-distributed public key. No license server call. No outbound network required.
- All AI inference: Runs on local DGX Spark. Zero calls to external LLM APIs.
- All source connectors: Pull from customer-internal systems (GitHub Enterprise, self-hosted Confluence, self-hosted Jira).
- Container images: Distributed as a signed OCI bundle for offline import via `docker load`.
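The license flow above can be sketched end to end. This uses the `cryptography` package as an illustration; the claim names, key handling, and token layout are hypothetical, not Substrate's actual license format:

```python
import base64
import json
import time
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def b64url(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

# Vendor side (done once, before shipping): sign the license claims.
signing_key = Ed25519PrivateKey.generate()
public_key = signing_key.public_key()  # pre-distributed with the install bundle

header = b64url(json.dumps({"alg": "EdDSA", "typ": "JWT"}).encode())
claims = b64url(json.dumps({"tier": "scale", "exp": int(time.time()) + 86400}).encode())
signing_input = header + b"." + claims
token = signing_input + b"." + b64url(signing_key.sign(signing_input))

# Customer side (air-gapped): verify with the local public key only.
def verify_license(token: bytes, pub) -> dict:
    head, body, sig = token.split(b".")
    pub.verify(  # raises InvalidSignature on any tampering
        base64.urlsafe_b64decode(sig + b"=" * (-len(sig) % 4)),
        head + b"." + body,
    )
    decoded = json.loads(base64.urlsafe_b64decode(body + b"=" * (-len(body) % 4)))
    if decoded["exp"] < time.time():
        raise ValueError("license expired")
    return decoded

license_claims = verify_license(token, public_key)
```

The verification path touches no network: the public key ships with the install bundle, and expiry is checked against the local clock.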
Licensing Model¶
Substrate uses the BSL/FSL model following the GitLab and Sentry pattern:
- Released under Business Source License (BSL) at launch
- 2-year change date to Apache 2.0: All code automatically relicenses to Apache 2.0 two years after each version's release date
- This prevents hyperscaler commoditization during the commercial window while ensuring long-term open access
- Enterprise features (multi-team isolation, Scale Plan) gated by license tier, not code forks
Service Communication¶
All inter-service communication uses explicitly chosen transports. There are no undocumented or ad-hoc HTTP calls between services.
NATS JetStream (Event Bus)¶
All inter-service events flow through NATS JetStream:
- Subject hierarchy: `substrate.>` — e.g., `substrate.ingestion.pr_opened`, `substrate.governance.violation_raised`, `substrate.graph.node_written`
- Delivery guarantee: At-least-once; Celery workers ack after successful processing
- Stream replay: If the Ingestion Service crashes mid-parse, the event replays on restart without silent loss
- Consumer groups: Multiple Ingestion workers can consume from the same stream with load balancing
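The subject hierarchy follows standard NATS wildcard semantics: `*` matches exactly one token, `>` matches one or more trailing tokens. A small self-contained sketch of those matching rules (not code from the project):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style subject matching: '*' is one token, '>' is the rest."""
    p, s = pattern.split("."), subject.split(".")
    for i, tok in enumerate(p):
        if tok == ">":              # '>' is always last; matches >= 1 tokens
            return len(s) >= i + 1
        if i >= len(s):
            return False
        if tok != "*" and tok != s[i]:
            return False
    return len(p) == len(s)

# A consumer bound to 'substrate.>' sees every Substrate event:
assert subject_matches("substrate.>", "substrate.ingestion.pr_opened")
assert subject_matches("substrate.governance.*", "substrate.governance.violation_raised")
assert not subject_matches("substrate.>", "billing.invoice_created")
```

This is why one durable consumer on `substrate.>` can audit-log the entire system while narrower consumers subscribe per service.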
FastAPI Gateway (HTTP Ingress)¶
- All inbound webhooks (GitHub, Jira, Confluence, Terraform) arrive at the FastAPI gateway
- HMAC SHA-256 verification on all inbound webhooks before forwarding to NATS (GitHub `X-Hub-Signature-256`, Jira shared secret, Confluence webhook token)
- OAuth2/JWT validation for all API requests; JWT verified locally against cached JWKS
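The GitHub check is standard: the header carries `sha256=` plus the hex HMAC-SHA-256 of the raw body under the shared secret. A minimal sketch with a hypothetical secret and payload:

```python
import hashlib
import hmac

def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a GitHub X-Hub-Signature-256 header against the raw request body.

    compare_digest avoids leaking match position through timing.
    """
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)

# Hypothetical webhook delivery:
secret = b"webhook-shared-secret"
body = b'{"action": "opened", "number": 1}'
good = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

assert verify_github_signature(secret, body, good)
assert not verify_github_signature(secret, body + b" ", good)  # any body change fails
```

The check must run on the raw bytes before any JSON parsing, since re-serialization can change the byte sequence and break the digest.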
Redis (Cache and Job Coordination)¶
Three distinct uses, all through the same Redis 7 instance:
- Hot subgraph cache: Frequently queried subgraph results cached with TTL; invalidated on graph update events
- Celery job queue: All async jobs dispatched via Celery with Redis as broker; AOF persistence ensures job survival through Redis restarts
- Distributed locks: The `SET NX EX` pattern prevents concurrent Celery workers from running duplicate ingestion jobs for the same source
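The lock pattern above can be illustrated with a tiny in-memory stand-in for Redis. With a real client the acquire is `r.set(name, token, nx=True, ex=ttl)`, and the owner-checked release is normally a small atomic Lua script; the class and key names here are purely illustrative:

```python
import time
import uuid

class MiniLockStore:
    """In-memory emulation of SET key value NX EX ttl plus owner-checked release."""

    def __init__(self):
        self._data = {}  # key -> (token, expires_at)

    def set_nx_ex(self, key: str, token: str, ttl: int) -> bool:
        entry = self._data.get(key)
        if entry and entry[1] > time.monotonic():
            return False                       # lock held and not yet expired
        self._data[key] = (token, time.monotonic() + ttl)
        return True

    def release(self, key: str, token: str) -> bool:
        entry = self._data.get(key)
        if entry and entry[0] == token:        # only the owner may release
            del self._data[key]
            return True
        return False

store = MiniLockStore()
token = uuid.uuid4().hex
assert store.set_nx_ex("ingest:github:repo-a", token, ttl=30)        # worker 1 wins
assert not store.set_nx_ex("ingest:github:repo-a", "other", ttl=30)  # worker 2 skips
assert store.release("ingest:github:repo-a", token)
```

The random token matters: it stops a worker whose lock expired mid-job from deleting a lock that a second worker has since acquired.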
mTLS (Inter-Service Authentication)¶
All traffic between application services on the Docker bridge network is protected by mTLS:
- Mutual TLS certificates issued by the internal Vault CA (same Ed25519 CA used for SSH certificates)
- Certificate rotation handled by Vault Agent Sidecar on each service
- This applies to all service-to-service HTTP traffic; it does not apply to database connections (separate credential rotation path)