
Ingestion Service

The Ingestion Service is the sensor array of the Substrate platform, responsible for building and continuously maintaining the Unified Multimodal Knowledge Base (UMKB). It is the construction pipeline for the platform's single source of truth — every node, edge, and embedding in the Observed Graph originates here.

Responsibility

Transform heterogeneous raw data — source code, infrastructure configuration, project planning data, runtime host state, documentation, and institutional memory — into a unified, queryable graph of typed nodes, directional edges, and vector embeddings. The Ingestion Service ensures that the Observed Graph reflects ground truth in near real-time, providing the factual substrate against which all governance, reasoning, and simulation operate.


Architecture Overview

The Ingestion Service is event-driven. Webhooks and scheduled pollers feed raw events into NATS JetStream streams. Celery workers consume from those streams, execute connector logic, and write results to:

  • Neo4j 5.x — graph nodes and edges (Observed Graph)
  • PostgreSQL 16 + pgvector — vector embeddings stored in an HNSW index, ingestion audit logs, deduplication metadata
  • Redis 7 — distributed locks (SET NX EX) per ingestion event, result cache keyed on SHA-256 content hash

Deduplication is enforced at two levels: (1) a Redis result cache keyed on SHA-256 content hash prevents redundant processing of identical payloads, and (2) a distributed Redis lock per ingestion event prevents parallel workers from processing the same event simultaneously.
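The two-level flow can be sketched as follows. This is a minimal in-memory illustration: the sets stand in for Redis, and a real worker would use redis-py (`r.set(key, 1, nx=True, ex=300)`) rather than local state.

```python
import hashlib
import json

_result_cache: set[str] = set()   # level 1: SHA-256 result cache
_locks: set[str] = set()          # level 2: per-event lock

def content_hash(payload: dict) -> str:
    """Stable SHA-256 hash of a raw event payload."""
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()

def try_process(payload: dict) -> str:
    key = content_hash(payload)
    if key in _result_cache:      # cache hit: ack and discard, no processing
        return "duplicate"
    if key in _locks:             # another worker is already on this event
        return "locked"
    _locks.add(key)               # stand-in for Redis SET NX EX
    try:
        # ... connector logic and graph writes would run here ...
        _result_cache.add(key)
        return "processed"
    finally:
        _locks.discard(key)
```

Hashing the canonically serialised payload (sorted keys) keeps the cache key stable across equivalent JSON encodings of the same event.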


Data Connectors

Connector Priority Table

| Connector | Data Ingested | MVP Priority | Trigger / Frequency |
|---|---|---|---|
| GitHub (Repos + Code) | AST-parsed code graph, PR events, commits, CODEOWNERS, releases, ADRs in markdown | Must Have | Webhook: PR open/update/merge, push |
| GitHub Projects v2 | Sprint items, custom fields, iteration data, status updates | Must Have | Webhook: projects_v2_item; GraphQL poll hourly |
| GitHub Pages | Published docs sites, staleness delta vs code commits | Must Have | Poll: every 6 hours |
| SSH Runtime Connector | Running processes, port bindings, packages, config checksums | Must Have | Scheduled: 15 min per host; on-demand |
| Terraform | InfraResource nodes, HOSTS/CONNECTS_TO edges, state diffs | Must Have | Webhook: post-apply or file watcher |
| Kubernetes | Deployments, Services, Ingress, ConfigMaps — live watch | Must Have | K8s Watch API: continuous |
| Jira | Tickets, epics, sprints, component links | Nice to Have (MVP) | Webhook: issue create/update |
| Confluence | Pages, ADRs, post-mortems, runbooks | Nice to Have (MVP) | Webhook: page save |
| Slack | Decision threads via keyword trigger | Nice to Have (v1.1) | Event API: message.channels |
| Datadog | Runtime alert signals, anomaly events, SLO state | Nice to Have (v1.1) | Webhook: monitor alerts |

Specialized Connector Details

GitHub (Repos + Code)

The GitHub connector registers as either an OAuth App or a GitHub App against the target organization and installs webhooks for pull_request, push, and check_run events (ING-01).

On each push or PR event, the connector clones or shallow-fetches the repository, then invokes the Tree-sitter + stack-graphs Rust CLI to parse every changed source file into an Abstract Syntax Tree (ING-02). The stack-graphs engine performs compiler-level name resolution, producing Module, Function, Class, and Import nodes tagged with file path and line-range metadata.

Edges extracted include (ING-03):

  • CALLS — direct function invocation between two Function nodes
  • DEPENDS_ON — module-level import or package dependency
  • IMPORTS — file-level import relationship
  • DEFINES — ownership of a Function or Class by a Module

On PR open events, the connector computes an incremental graph delta (ING-04): only files present in the PR diff are re-parsed. The resulting subgraph delta — new nodes, removed nodes, changed edges — is published to the substrate.ingestion.code.delta NATS subject for downstream consumption by the Governance and Reasoning services.
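The delta computation above can be sketched as a set diff over the node maps produced by re-parsing only the changed files. Node shapes are assumptions, and the NATS publish is shown as a plain payload dict rather than a real client call:

```python
def compute_delta(before: dict[str, dict], after: dict[str, dict]) -> dict:
    """Diff two {node_id: properties} maps from re-parsing the PR's changed files."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(
        n for n in set(before) & set(after) if before[n] != after[n]
    )
    # Payload destined for the substrate.ingestion.code.delta subject
    return {
        "subject": "substrate.ingestion.code.delta",
        "added": added,
        "removed": removed,
        "changed": changed,
    }
```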

ADRs stored as markdown files in the repository are ingested as DecisionNode entries with WHY edges linking them to the relevant service or module nodes (ING-11).

GitHub Projects v2

GitHub Classic Projects reached end-of-life in April 2025. All sprint and project planning integration uses the GraphQL API exclusively.

The connector queries the ProjectV2 object to extract:

  • SprintNode entities from iteration fields
  • ProjectItem nodes representing individual work items
  • Custom fields: text, number, date, single-select, and iteration types
  • Status update values: ON_TRACK, AT_RISK, OFF_TRACK, COMPLETE — mapped to SprintNode.health graph properties (ING-05)

Rate limits apply: 5,000 points/hour for Personal Access Tokens, 10,000 points/hour for GitHub Apps on Enterprise Cloud. Pagination is cursor-based with a maximum of 100 items per page.

In addition to webhook delivery for projects_v2_item events, a GraphQL hourly poll is scheduled as a consistency backstop to catch any missed webhook deliveries.
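A sketch of the hourly backstop poll, assuming a query shaped roughly like the one below (the exact fields fetched per item are illustrative) and a cursor-draining helper:

```python
# Hypothetical GraphQL document for the ProjectV2 backstop poll.
# Cursor-based pagination, at most 100 items per page as noted above.
PROJECT_ITEMS_QUERY = """
query($org: String!, $number: Int!, $cursor: String) {
  organization(login: $org) {
    projectV2(number: $number) {
      items(first: 100, after: $cursor) {
        pageInfo { hasNextPage endCursor }
        nodes {
          id
          fieldValues(first: 20) {
            nodes {
              ... on ProjectV2ItemFieldIterationValue { title startDate }
              ... on ProjectV2ItemFieldSingleSelectValue { name }
            }
          }
        }
      }
    }
  }
}
"""

def paginate(fetch_page):
    """Drain all pages; fetch_page(cursor) -> (nodes, has_next, end_cursor)."""
    cursor, items = None, []
    while True:
        nodes, has_next, cursor = fetch_page(cursor)
        items.extend(nodes)
        if not has_next:
            return items
```

Keeping `fetch_page` as an injected callable makes the pagination loop trivially testable without hitting the rate-limited API.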

GitHub Pages

The GitHub Pages connector calls the Pages REST API to retrieve the last build timestamp for each configured Pages site. It then compares that timestamp against the last code commit in the source branch to compute a documentation staleness signal (ING-06). This signal is attached to the corresponding documentation node as a staleness_delta_days property and surfaced by the Proactive Maintenance Service.

Polling frequency: every 6 hours.
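The staleness computation reduces to a timestamp difference; a minimal sketch, assuming both timestamps arrive as timezone-aware datetimes and clamping to zero when the docs build is newer than the last commit:

```python
from datetime import datetime, timezone

def staleness_delta_days(last_pages_build: datetime, last_commit: datetime) -> int:
    """Days the published docs lag behind the latest source-branch commit."""
    delta = last_commit - last_pages_build
    return max(0, delta.days)   # a fresher build than the commit counts as 0
```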

SSH Runtime Connector (Agentless)

No existing IDP platform implements agentless SSH-based runtime verification at this level of fidelity. The SSH Runtime Connector closes the gap between what the graph believes is deployed and what is actually running on production hosts.

Architecture: A single JSON-outputting shell script is deployed to each target host via SSH. No persistent daemon, no agent installation. Round-trips are minimised by collecting all inspection data in a single script invocation and returning a structured JSON document (ING-07).

Security model:

  • HashiCorp Vault SSH CA issues ephemeral certificates with a 5-minute TTL
  • Connections route through a ProxyJump bastion; SSH agent forwarding is never used
  • ForceCommand on the target restricts the session to the single inspection script
  • SSH multiplexing via ControlMaster=auto and ControlPersist=600s reduces execution time by more than 50% across multi-host batches
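Putting the options above together, the client invocation might be assembled like this (host, bastion alias, and script path are hypothetical):

```python
def build_ssh_command(host: str, bastion: str, script: str) -> list[str]:
    """Assemble the hardened SSH argv for one inspection run."""
    return [
        "ssh",
        "-o", "ControlMaster=auto",       # multiplex across the host batch
        "-o", "ControlPersist=600s",
        "-o", f"ProxyJump={bastion}",     # route via the bastion
        "-o", "ForwardAgent=no",          # agent forwarding never used
        host,
        script,                           # ForceCommand restricts this anyway
    ]
```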

Verification checks performed per host:

| Check | Command | Graph Comparison Target |
|---|---|---|
| Running services | systemctl list-units --type=service --state=running --output=json | Service nodes in Observed Graph |
| Port bindings | ss -tlnp --json | Declared port mappings on InfraResource nodes |
| Package drift | dpkg -l \| awk \| sha256sum | Dependency nodes + lockfile hash |
| Config integrity | sha256sum of critical config files | Baselines stored at last deploy |
| Container state | docker inspect | K8s/Docker service declarations |
| Env var presence | Hash only (no values exported) | Required env vars declared in service manifest |

The output JSON is parsed and diffed against the current Observed Graph. Any divergence is published as a RuntimeDriftEvent to NATS for the Governance Service to evaluate against the substrate/runtime-drift policy threshold (ING-07, GOV-12).

Execution schedule: every 15 minutes per host, plus on-demand trigger via API or workflow.
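The diff step can be sketched as set comparisons between the inspection JSON and the graph's expected state; the key names and event shape here are assumptions, not the actual RuntimeDriftEvent schema:

```python
def runtime_drift(expected: dict, observed: dict) -> list[dict]:
    """Compare expected (graph) vs observed (host) state; return drift events."""
    events = []
    for svc in set(expected["services"]) - set(observed["services"]):
        events.append({"type": "RuntimeDriftEvent",
                       "kind": "service_missing", "service": svc})
    for port in set(observed["ports"]) - set(expected["ports"]):
        events.append({"type": "RuntimeDriftEvent",
                       "kind": "undeclared_port", "port": port})
    return events   # each event would be published to NATS for GOV-12 evaluation
```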

Terraform

The Terraform connector ingests state files either via a post-apply webhook from Terraform Cloud/Enterprise or by watching a configured state file path (ING-08).

From each state file, the connector extracts:

  • InfraResource nodes for every managed resource (EC2 instances, RDS clusters, S3 buckets, load balancers, etc.)
  • HOSTS edges connecting infrastructure resources to the services they run
  • CONNECTS_TO edges representing declared network connectivity between resources

On each ingestion, the new state is diffed against the prior state stored in Neo4j. Drift — resources added, removed, or with changed properties outside a known deployment window — is flagged as a TerraformDriftEvent.
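Node extraction from the state document might look like the sketch below. The `resources`/`instances` nesting follows Terraform state format version 4; the output node shape is an assumption:

```python
import json

def extract_infra_resources(state_json: str) -> list[dict]:
    """Flatten a Terraform state document into InfraResource node dicts."""
    state = json.loads(state_json)
    nodes = []
    for res in state.get("resources", []):
        for inst in res.get("instances", []):
            nodes.append({
                "label": "InfraResource",
                "address": f'{res["type"]}.{res["name"]}',
                "provider": res.get("provider", ""),
                "attributes": inst.get("attributes", {}),
            })
    return nodes
```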

Kubernetes

The Kubernetes connector establishes a Kubernetes Watch API connection (using the watch=true query parameter on the relevant resource endpoints) to receive real-time change notifications for (ING-09):

  • Deployment objects → ServiceNode with replica count, image tag, resource limits
  • Service objects → InfraResource with port mappings and selector labels
  • Ingress objects → routing rules as ROUTES_TO edges
  • ConfigMap objects → configuration nodes with CONFIGURES edges to dependent services

The Watch connection is long-lived and reconnects with exponential backoff on disconnection. The Kubernetes connector maintains the live cluster state in the Observed Graph continuously, not on a schedule.
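The reconnect schedule can be sketched as capped exponential backoff with full jitter; the base delay and cap below are assumptions, not configured values:

```python
import random

def backoff_delays(base: float = 1.0, cap: float = 60.0, attempts: int = 8):
    """Yield capped exponential reconnect delays with full jitter."""
    for attempt in range(attempts):
        # Jitter avoids many workers reconnecting in lockstep after an outage
        yield random.uniform(0, min(cap, base * 2 ** attempt))
```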

Jira (Nice to Have — MVP)

Jira tickets are ingested as IntentAssertion nodes and linked to the service nodes they describe via entity resolution (ING-17). Epics produce EpicNode entries; sprints produce SprintNode entries. Component links map to service ownership edges. Ingestion is triggered by issue_created and issue_updated webhooks.

Confluence (Nice to Have — MVP)

Confluence pages are ingested as Policy or Documentation nodes depending on their space and label classification (ING-18). The connector detects orphaned documentation (doc node with no matching live service node) and undocumented services (service node with no DOCUMENTS edge). Ingestion is triggered by page_created and page_updated webhooks.

ADRs and post-mortems stored in Confluence are parsed and cross-linked with their GitHub markdown counterparts through entity resolution to avoid duplicate DecisionNode entries.

Slack (Nice to Have — v1.1)

Slack decision threads matching configured keyword triggers or channel subscriptions are ingested as IntentAssertion nodes (ING-20). The connector uses the Slack Event API (message.channels) and applies a Dense extract-lora classifier to identify messages containing architectural decisions, risk acknowledgements, or scope commitments before creating graph entries.


Cross-Cutting Concerns

Entity Resolution (ING-10)

Every ingestion pipeline passes candidate node identifiers through the Dense resolve-lora adapter before writing to Neo4j. This canonicalises variant representations of the same entity — for example, "Service A", "srv-a", "service-a", and "ServiceA" — into a single canonical node with ALIAS edges recording the source-system identifiers.

Entity resolution runs as a synchronous step within the Celery task prior to graph writes. If the resolver confidence is below 0.6, the candidate node is written to the verification queue (see Proactive Maintenance Service) rather than directly to the graph.
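The routing logic around the threshold can be sketched as below; `resolve()` here is a trivial stand-in for the Dense resolve-lora adapter, returning a toy canonical form and a fixed confidence:

```python
import re

def resolve(raw: str) -> tuple[str, float]:
    """Toy canonicaliser: lower-case, collapse separators; fixed confidence."""
    canonical = re.sub(r"[\s_-]+", "-", raw.strip().lower())
    return canonical, 0.9

def route(raw: str, threshold: float = 0.6) -> dict:
    """Send high-confidence resolutions to the graph, the rest to review."""
    canonical, confidence = resolve(raw)
    target = "graph" if confidence >= threshold else "verification_queue"
    return {"canonical": canonical, "alias": raw, "target": target}
```

In production the ALIAS edge records the source-system identifier (`raw`) against the canonical node, so later ingestions of the same variant resolve to the existing node.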

Institutional Memory Ingestion

ADRs (ING-11): Architectural Decision Records are ingested from both GitHub markdown files and linked Confluence pages. Each produces a DecisionNode with:

  • title, status (Accepted/Superseded/Deprecated), date, context, decision, consequences properties
  • WHY edge to the service or component the decision governs
  • SUPERSEDES edge to any prior ADR it replaces

Post-mortems (ING-12): Post-incident review documents from Confluence and GitHub are parsed to extract FailurePattern nodes. Each FailurePattern carries:

  • incident_date, severity, affected_services, root_cause_summary properties
  • CAUSED_BY edges to the contributing code or infrastructure nodes
  • AFFECTED edges to all service nodes impacted

CVE / Advisory Feed (ING-13)

The GitHub Advisory Database and CVE feed are polled every 15 minutes. Each advisory is classified for relevance to the current dependency graph using the Dense extract-lora adapter. If the advisory references a package that appears in any DEPENDS_ON edge, the affected dependency node is tagged with a cve_ids array property and a CVE_AFFECTS edge is created to a VulnerabilityNode. A CveAlertEvent is published to NATS for Governance Service evaluation.
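The dependency-matching step reduces to an intersection between advisory packages and DEPENDS_ON edges; the advisory and edge shapes below are illustrative assumptions:

```python
def affected_dependencies(advisory: dict, depends_on_edges: list[dict]) -> list[str]:
    """Return dependency node ids whose package appears in the advisory."""
    pkgs = set(advisory.get("affected_packages", []))
    return sorted(
        e["target"] for e in depends_on_edges if e.get("package") in pkgs
    )
```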

Embeddings and Deduplication (ING-14, ING-15, ING-16)

Every node written to the graph also receives a bge-m3 vector embedding (1024-dimensional) generated via the embedding service on DGX Spark port 8003. Embeddings are stored in the pgvector HNSW index in PostgreSQL with cosine distance metric.

Before inserting any new node, a cosine similarity query is run against the HNSW index. Nodes with similarity above 0.95 to an existing node are flagged for deduplication review rather than inserted as new entries.
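The gate can be sketched as plain cosine similarity against the nearest existing embedding; in production pgvector answers this via an HNSW index query rather than a Python loop:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def insert_decision(candidate: list[float], nearest: list[float],
                    threshold: float = 0.95) -> str:
    """Flag near-duplicates for review instead of inserting them."""
    if cosine_similarity(candidate, nearest) > threshold:
        return "dedup_review"
    return "insert"
```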

Deduplication of ingestion events is enforced via a Redis result cache keyed on the SHA-256 hash of the raw event payload (ING-15). If a cache hit is found, the event is acknowledged and discarded without processing.

Each ingestion event also acquires a distributed Redis lock using SET NX EX 300 keyed on the event's content hash (ING-16). This prevents two Celery workers from processing the same event concurrently during a transient cache miss.


Functional Requirements

| ID | Requirement | Priority |
|---|---|---|
| ING-01 | Connect to GitHub via OAuth App or GitHub App; register webhooks for PR, push, check_run events | Must Have |
| ING-02 | Clone repositories; parse AST via Tree-sitter + stack-graphs Rust CLI; extract Module/Function/Class/Import nodes with file+line metadata | Must Have |
| ING-03 | Extract CALLS, DEPENDS_ON, IMPORTS, DEFINES edges from stack-graph with compiler-level name resolution precision | Must Have |
| ING-04 | On PR open: incremental graph delta — only changed files re-parsed and diffed; result published to NATS | Must Have |
| ING-05 | Ingest GitHub Projects v2 via GraphQL API; extract SprintNode, ProjectItem nodes; map status updates to graph properties | Must Have |
| ING-06 | Ingest GitHub Pages via Pages REST API; compare last build date to last code commit; extract documentation staleness signal | Must Have |
| ING-07 | SSH Runtime Connector: Vault-signed ephemeral cert, ProxyJump connection, inspection script execution, JSON output parse, diff against graph | Must Have |
| ING-08 | Parse Terraform state files; extract InfraResource nodes with HOSTS, CONNECTS_TO edges; detect drift from prior state | Must Have |
| ING-09 | Watch Kubernetes API; extract Deployment, Service, Ingress, ConfigMap nodes; maintain sync with live cluster state | Must Have |
| ING-10 | Entity resolution: canonicalise "Service A" / "srv-a" / "service-a" across all sources using Dense resolve-lora | Must Have |
| ING-11 | Ingest ADRs from GitHub markdown files and linked Confluence pages; extract DecisionNode with WHY edges; capture rationale as institutional memory | Must Have |
| ING-12 | Ingest post-mortems from Confluence/GitHub; extract FailurePattern nodes with CAUSED_BY and AFFECTED edges | Must Have |
| ING-13 | Poll GitHub Advisory/CVE feed every 15 minutes; classify relevance using Dense extract-lora; identify affected dependency nodes | Must Have |
| ING-14 | Generate bge-m3 embeddings for all nodes; store in pgvector HNSW index; run cosine deduplication check pre-insert | Must Have |
| ING-15 | Deduplicate ingestion events via Redis result cache keyed on SHA-256 content hash | Must Have |
| ING-16 | Distributed lock per ingestion event via Redis SET NX EX to prevent parallel duplicate processing | Must Have |
| ING-17 | Ingest Jira tickets as IntentAssertion nodes linked to service nodes via entity resolution | Nice to Have (MVP) |
| ING-18 | Ingest Confluence pages as Policy/Documentation nodes; detect orphaned docs and undocumented services | Nice to Have (MVP) |
| ING-19 | Ingest PR review comments as design rationale MemoryNodes linked to code nodes | Nice to Have (v1.1) |
| ING-20 | Ingest Slack decision threads via keyword/channel trigger as IntentAssertion nodes | Nice to Have (v1.1) |

Node and Edge Taxonomy

The Ingestion Service is the sole authoritative writer for the following node types in the Observed Graph:

| Node Type | Source Connectors | Key Properties |
|---|---|---|
| ServiceNode | GitHub, K8s, Terraform, SSH | name, language, repo_url, criticality (PageRank) |
| FunctionNode | GitHub (AST) | name, file_path, line_start, line_end, signature |
| ModuleNode | GitHub (AST) | name, file_path, language |
| ClassNode | GitHub (AST) | name, file_path, line_start |
| InfraResourceNode | Terraform, K8s, SSH | provider, resource_type, region, host |
| SprintNode | GitHub Projects v2, Jira | title, start_date, end_date, health |
| ProjectItemNode | GitHub Projects v2, Jira | title, status, assignee, sprint_id |
| DecisionNode | GitHub (ADR), Confluence | title, status, date, rationale |
| FailurePatternNode | GitHub, Confluence | incident_date, severity, root_cause_summary |
| VulnerabilityNode | GitHub Advisory, CVE feed | cve_id, severity, affected_packages |
| DocumentationNode | GitHub Pages, Confluence | url, last_build_date, staleness_delta_days |
| IntentAssertionNode | Jira, Slack | source, text, linked_service, confidence |

Infrastructure Dependencies

| Component | Role in Ingestion Service |
|---|---|
| NATS JetStream | Event bus: ingestion events published to substrate.ingestion.> subjects; at-least-once delivery guarantee |
| Celery | Distributed task queue consuming from NATS; handles retries with exponential backoff |
| Redis 7 | Distributed locks (SET NX EX) + SHA-256 result cache for deduplication |
| Neo4j 5.x | Primary graph store; all nodes and edges written here |
| PostgreSQL 16 + pgvector | HNSW vector index for bge-m3 embeddings; ingestion audit log |
| DGX Spark port 8001 | Dense 70B + resolve-lora: entity resolution, relevance classification |
| DGX Spark port 8003 | bge-m3: vector embedding generation for all ingested nodes |
| HashiCorp Vault | SSH CA for ephemeral certificate issuance (SSH Runtime Connector) |