May 7, 2026 · Anton Grishko

One graph, every source — what kuberly-graph sees now

A few months ago we shipped a knowledge graph for the IaC repo. It now indexes six sources — Terraform code and state, live Kubernetes, ArgoCD, CUE, and the docs. Here's what that unlocks.

TL;DR — kuberly-graph started as an index of the Terragrunt repo. It now spans Terraform code and state, live Kubernetes, ArgoCD, CUE, and docs — all readable by AI agents over MCP. The same handful of tools (owners, consumers, depends_on, docs_for, path, drift) work across all six layers.

The recap

We argued earlier, in "Knowledge graphs are the missing piece for AI in your infra," that infrastructure is a graph, not a document. We then shipped kuberly-graph: an MCP-readable index of the Terragrunt repo, with blast_radius, consumers, and path queries. Agents could finally answer "what does this change touch" with a real transitive walk instead of a vector lookup.
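If you skipped the earlier post, the difference between a transitive walk and a lookup fits in a few lines. This is a minimal sketch, not kuberly-graph's implementation: the edge list is made up, and blast_radius here is assumed to be a plain breadth-first walk over reversed depends_on edges.

```python
from collections import defaultdict, deque

# Hypothetical depends_on edges as (dependent, dependency). Not customer data.
DEPENDS_ON = [
    ("eks", "vpc"),
    ("rds", "vpc"),
    ("app", "eks"),
    ("app", "rds"),
    ("dns", "app"),
]


def blast_radius(changed, edges):
    """Return every node that depends on `changed`, directly or transitively."""
    # Reverse the edges so we can walk from a dependency to its dependents.
    dependents = defaultdict(set)
    for dependent, dependency in edges:
        dependents[dependency].add(dependent)

    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in dependents[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(sorted(blast_radius("vpc", DEPENDS_ON)))
```

A similarity search over module text can surface eks and rds, which mention vpc directly. It has no principled way to reach dns, which only depends on vpc through app. The walk gets there by construction.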

That was one source. The graph now indexes six:

Terraform / OpenTofu code — modules, variables, outputs, dependency declarations
Terraform state — what's actually deployed, with the parameters it was deployed with
Live Kubernetes — Deployments, Services, ConfigMaps, ExternalSecrets, ownership chains, namespace boundaries
ArgoCD — Applications, ApplicationSets, sync status, target clusters, source repos
CUE — package imports, value references, schema constraints
Docs — design docs, runbooks, postmortems, with edges back to the resources they describe

Six sources. One graph. One MCP endpoint. Same handful of tools the agent already knew. For the storage shape (we use Memgraph and Cypher), see "Teaching an Agent to Think in Graphs."

What one actually looks like

Numbers from one production customer install (anonymized — names omitted, structure is real):

Total                 1,339 nodes      4,186 edges
Environments          2 (dev, prod)
Modules               60
Components            43
Applications          4 (2 dev · 2 prod)
Docs indexed          31

Nodes by source layer
─────────────────────
Live Kubernetes       864    (39 namespaces, 67 Deployments,
                              110 Services, 207 ConfigMaps,
                              279 Secrets, 160 ServiceAccounts)
Terraform state       324    (top types: kubectl_manifest 81,
                              helm_release 23, aws_iam_role 14)
Terraform code         89    (modules, resources, variables, outputs)
Docs                   31    (26 runbooks/design docs · 5 OpenSpec)
ArgoCD rendered        21
CUE schemas             5
CI/CD workflows         5

Top edge relations
──────────────────
depends_on          3,128    (75% of all edges)
contains              350
reads_configmap        99
selects                97
uses_sa                92
provides               60
configures_module      59
mentions               59    (docs → resources)
reads_secret           59
owns                   53

That's a mid-sized stack. Bigger customers run 3–5x these numbers.

The shape matters more than the size. K8s is by far the biggest source layer (864 nodes, about 65% of the graph) because live cluster state has the most resources. Terraform state is second. Docs and CUE are small but punch above their weight: a 31-node doc layer produces 59 mentions edges that link postmortems and runbooks back to the exact resources they're about. That layer is what makes docs_for(resource) work.
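The percentages are easy to recheck. The snippet below only redoes the arithmetic on the counts printed in the summary; it assumes nothing beyond those numbers.

```python
# Node and edge counts copied from the install summary above.
NODES_BY_LAYER = {
    "Live Kubernetes": 864,
    "Terraform state": 324,
    "Terraform code": 89,
    "Docs": 31,
    "ArgoCD rendered": 21,
    "CUE schemas": 5,
    "CI/CD workflows": 5,
}
TOTAL_EDGES = 4186
DEPENDS_ON_EDGES = 3128
MENTIONS_EDGES = 59

total_nodes = sum(NODES_BY_LAYER.values())
k8s_share = NODES_BY_LAYER["Live Kubernetes"] / total_nodes
depends_on_share = DEPENDS_ON_EDGES / TOTAL_EDGES
mentions_per_doc = MENTIONS_EDGES / NODES_BY_LAYER["Docs"]

print(total_nodes)                 # 1339, matches the reported total
print(f"{k8s_share:.1%}")          # 64.5%
print(f"{depends_on_share:.1%}")   # 74.7%
print(f"{mentions_per_doc:.1f}")   # 1.9
```

The layer counts sum to the reported 1,339 nodes, and the doc layer averages just under two mentions edges per document.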

Why six beat one

The single-source graph already beat vector RAG for "what does this change touch," but it was describing intent, not reality. The repo says vpc consumes shared-vpc.json. State says vpc is at version 1.4.2 with enable_nat_gateway = true. Live Kubernetes says ten Deployments restarted in the last hour. ArgoCD says one app is OutOfSync. A postmortem from February says "we tried this and broke webhook signing keys."

Each is a different question. Stitching them is the point. Agents stop reasoning about what should be true and start reasoning about what is true.

A walked-through query

This is a composite of a real query we ran yesterday. The user asked: "I want to bump the Strapi image. What's the impact?"

1. find(kind=workload, name~"strapi")
   → Deployment kuberly-web-cms (namespace: kuberly)

2. owners(Deployment kuberly-web-cms)
   → ArgoCD App kuberly-web (env: prod, sync: Synced, health: Healthy)
   → ExternalSecret kuberly-web-secrets (last refresh: 1h ago, OK)

3. parents(ArgoCD App kuberly-web)
   → ApplicationSet kuberly-web-envs (3 clusters: prod, staging, dev)

4. consumers(Service kuberly-web-cms)
   → Deployment kuberly-web-frontend (reads STRAPI_URL)
   → Ingress kuberly-web (host: cms.kuberly.com)

5. docs_for(Deployment kuberly-web-cms)
   → runbook: "Strapi rollouts must drain admin sessions first"
   → postmortem 2026-02-14: "image bump broke webhook signing keys"

6. drift(state="kuberly-web-cms", live="kuberly-web-cms")
   → no drift; image tag in state matches Deployment

The agent's answer: "A Strapi image bump propagates to three envs via the ApplicationSet. The frontend reads the URL via STRAPI_URL — expect a brief 5xx during rollout. The runbook says drain admin sessions first. The February postmortem says verify webhook signing keys are valid in the new image. ExternalSecret is healthy. No state drift."

Six tool calls. Six structured queries. Every node ID is quoted so a human can click through and verify.
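To make the shape of those calls concrete, here is a toy version over an in-memory edge list. It is a sketch under assumptions, not the real tool surface: the edges are transcribed from the walk-through, "owns" and "mentions" are relation names from the install summary, and "parent_of" and "consumes" are hypothetical names picked for illustration. Each tool is modeled as "give me the inbound edges of one relation."

```python
# Edges transcribed from the walk-through above as (source, relation, target).
# "parent_of" and "consumes" are hypothetical relation names for this sketch.
EDGES = [
    ("ArgoCD App kuberly-web", "owns", "Deployment kuberly-web-cms"),
    ("ExternalSecret kuberly-web-secrets", "owns", "Deployment kuberly-web-cms"),
    ("ApplicationSet kuberly-web-envs", "parent_of", "ArgoCD App kuberly-web"),
    ("Deployment kuberly-web-frontend", "consumes", "Service kuberly-web-cms"),
    ("Ingress kuberly-web", "consumes", "Service kuberly-web-cms"),
    ("runbook: Strapi rollouts", "mentions", "Deployment kuberly-web-cms"),
    ("postmortem 2026-02-14", "mentions", "Deployment kuberly-web-cms"),
]


def inbound(node, relation):
    """Sources of every `relation` edge that points at `node`."""
    return [src for src, rel, dst in EDGES if rel == relation and dst == node]


def owners(node):
    return inbound(node, "owns")


def parents(node):
    return inbound(node, "parent_of")


def consumers(node):
    return inbound(node, "consumes")


def docs_for(node):
    return inbound(node, "mentions")


def impact(deployment, service):
    """Bundle the structured answers an agent would cite in its summary."""
    report = {
        "owners": owners(deployment),
        "consumers": consumers(service),
        "docs": docs_for(deployment),
    }
    report["parents"] = [p for o in report["owners"] for p in parents(o)]
    return report


report = impact("Deployment kuberly-web-cms", "Service kuberly-web-cms")
for key, values in report.items():
    print(key, values)
```

Every value in the report is a node ID, which is what lets a reviewer trace each sentence of the agent's answer back to an edge.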

What's wired and how often

Source              Refresh
──────              ───────
Terraform code      on push
Terraform state     on apply
K8s live            ~5s informer
ArgoCD              ~30s poll
CUE                 on push
Docs                on commit

Edges that span sources (e.g. ArgoCD App → Deployment → Terraform module that deploys it) are computed at query time, not pre-materialized. Each source is small enough that the join cost is in the low tens of milliseconds. We initially tried materializing every cross-source edge into Neo4j and it was a maintenance disaster — invalidations everywhere, drift between materialized and live, hours of debugging per release. The query-time approach is dumber and faster.
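A rough sketch of the query-time idea, with everything in it assumed for illustration: each source keeps its own small adjacency map, and a cross-source path is found by asking every source for neighbors during the walk. The IDs, edges, and function names are hypothetical and say nothing about how the real stores are laid out.

```python
from collections import deque

# One adjacency map per source, each refreshed on its own schedule.
# All IDs and edges are hypothetical.
ARGOCD = {"argocd:app/web": ["k8s:deployment/web"]}
K8S = {"k8s:deployment/web": ["tfstate:helm_release.web"]}
TF_STATE = {"tfstate:helm_release.web": ["tfcode:module.web"]}
SOURCES = [ARGOCD, K8S, TF_STATE]


def neighbors(node):
    """Union of outgoing edges across all sources, resolved at query time."""
    out = []
    for source in SOURCES:
        out.extend(source.get(node, []))
    return out


def path(start, goal):
    """Breadth-first search; returns the list of node IDs or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            hops = []
            while node is not None:
                hops.append(node)
                node = parent[node]
            return hops[::-1]
        for nxt in neighbors(node):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


print(path("argocd:app/web", "tfcode:module.web"))
```

Because nothing cross-source is stored, a refresh of one source cannot leave a stale joined edge behind. If the K8s map drops the Deployment's edge, the next call to path simply stops finding the chain.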

What we're not doing

No graph DSL. The agent has a fixed set of tools: find, owners, parents, depends_on, consumers, docs_for, path, drift, diff. We tried exposing a Cypher-style surface and the agent generated queries that were either too narrow or too broad. A few well-shaped tools beat one expressive one.
No mutations. The graph is read-only from the agent's perspective. Changes happen via PRs the autopilot opens against the IaC repo. It's the same trust model kuberly-graph always had; see "DevOps on autopilot."
No public hosting. The graph runs in your VPC with the same IAM scope as kuberly-monitor. Data does not leave your environment.
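One way to picture the first two constraints is a closed allowlist in front of the graph. The sketch below is an assumption about shape, not the actual MCP server: the tool names come from the list above, while make_dispatcher and the stub handler are hypothetical.

```python
from typing import Callable, Dict

# Tool names from the list above. Anything else is rejected.
TOOL_NAMES = frozenset({
    "find", "owners", "parents", "depends_on", "consumers",
    "docs_for", "path", "drift", "diff",
})


def make_dispatcher(handlers: Dict[str, Callable[..., object]]):
    """Build a call(tool, **kwargs) function limited to the fixed tool set."""
    extra = set(handlers) - TOOL_NAMES
    if extra:
        raise ValueError(f"handlers outside the fixed tool set: {sorted(extra)}")

    def call(tool: str, **kwargs):
        if tool not in handlers:
            raise PermissionError(f"tool not exposed: {tool}")
        return handlers[tool](**kwargs)

    return call


# Stub handler standing in for a real graph lookup.
call = make_dispatcher({"owners": lambda node: [f"owner-of:{node}"]})
print(call("owners", node="Deployment web"))
```

With this shape there is no code path for a free-form query or a write: a request for a tool named "cypher" or "delete_node" fails before it reaches the graph.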

What it unlocks for the autopilot

PRs ship with impact pre-computed. When the autopilot opens a PR, the graph queries are already inlined in the body. Reviewers don't have to ask "what does this touch."
Cross-source drift becomes a tool, not a panic. Repo says one thing, live cluster says another, state says a third — the graph spots it. We catch a handful per month that previously nobody noticed.
Postmortems become active context. docs_for(resource) runs before every proposed change. If a February postmortem said "we tried this and it broke X," the agent flags it in the PR. This was the surprise — postmortems used to be dead text. They're now load-bearing. In the install above, 59 mentions edges quietly connect 31 docs to the resources they describe.
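The cross-source drift check reduces to comparing what each source reports for the same field. A minimal sketch, assuming each source can be boiled down to a single hashable value per field; the function signature and the image tags are invented for illustration and are not the real drift tool.

```python
def drift(field, views):
    """Compare one field across sources.

    `views` maps a source name to the value that source reports.
    Returns None when every source agrees, otherwise the disagreement.
    """
    if len(set(views.values())) <= 1:
        return None
    return {"field": field, "values": dict(views)}


# Hypothetical image tags as seen by three sources.
agree = drift("image_tag", {"repo": "2.3.0", "state": "2.3.0", "live": "2.3.0"})
differ = drift("image_tag", {"repo": "2.3.0", "state": "2.3.0", "live": "2.2.9"})

print(agree)
print(differ)
```

Returning the per-source values rather than a boolean matters for the PR body: the reviewer sees which source is the odd one out.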

What's available

If you're a Kuberly customer: this is on. Open Cursor, Claude Code, Copilot, or OpenCode in your IaC repo and the MCP server registers automatically. The agent has six-source graph access. No setup.

If you're not: the design is repeatable. Pick your sources. Write a thin extractor per source. Normalize into nodes and edges with stable IDs. Expose a small set of tools over MCP. The hard part isn't the storage — it's deciding which tools to expose. Start with owners, consumers, depends_on, path, find, and docs_for. You'll add more later, but those six cover most of the "what does this touch" surface.
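As a starting point, a thin extractor can be little more than a loop that turns already-parsed records into nodes and edges keyed by deterministic IDs. The sketch below assumes Kubernetes manifests loaded as plain dicts and derives "owns" edges from metadata.ownerReferences. The ID format, the Node and Edge types, and the sample manifests are all hypothetical, and it assumes owners live in the same namespace as the objects they own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    id: str
    kind: str
    source: str


@dataclass(frozen=True)
class Edge:
    src: str
    relation: str
    dst: str


def stable_id(source, kind, namespace, name):
    """Same inputs always give the same ID, so re-extraction is idempotent."""
    return f"{source}:{kind}/{namespace}/{name}".lower()


def extract_k8s(manifests):
    """Thin extractor over already-parsed Kubernetes manifests (plain dicts)."""
    nodes, edges = set(), set()
    for manifest in manifests:
        meta = manifest["metadata"]
        namespace = meta.get("namespace", "default")
        node_id = stable_id("k8s", manifest["kind"], namespace, meta["name"])
        nodes.add(Node(node_id, manifest["kind"], "k8s"))
        # Assumes namespaced owners; cluster-scoped owners need extra handling.
        for ref in meta.get("ownerReferences", []):
            owner_id = stable_id("k8s", ref["kind"], namespace, ref["name"])
            edges.add(Edge(owner_id, "owns", node_id))
    return nodes, edges


# Hypothetical manifests for illustration.
MANIFESTS = [
    {"kind": "Deployment", "metadata": {"name": "web", "namespace": "shop"}},
    {
        "kind": "ReplicaSet",
        "metadata": {
            "name": "web-7d9f",
            "namespace": "shop",
            "ownerReferences": [{"kind": "Deployment", "name": "web"}],
        },
    },
]

nodes, edges = extract_k8s(MANIFESTS)
print(sorted(n.id for n in nodes))
print(edges)
```

Stable IDs are what let a second source point at the same node later. An ArgoCD or Terraform state extractor only has to produce the same string for the same Deployment.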

The graph stops being a feature when it becomes the substrate. That's where we are now.