Monitoring MCP servers — OOMKilled pod through Prometheus, Loki, Kuberly MCP and HolmesGPT to an agent.

May 3, 2026 · Anton Grishko

Monitoring MCP servers — what HolmesGPT got right, what's still missing

MCP turns Loki, Prometheus, and Grafana into first-class tools for AI agents. HolmesGPT was first; we built our own. Here's what worked, what didn't, and the design we settled on.

TL;DR: The Model Context Protocol lets AI agents call Loki, Prometheus, and Kubernetes events as structured tools. HolmesGPT pioneered the pattern. We shipped a thinner MCP server that returns "raw data plus the query string that produced it", so any engineer can re-run anything the agent did.

What MCP changes for monitoring

The Model Context Protocol turns observability backends into first-class tools for AI agents. Instead of an LLM looking at a screenshot of a Grafana panel, the agent calls a tool, gets structured Loki/Prometheus/Tempo data back, and reasons over it.

That's a real shift. A dashboard is a fixed query baked at design time. An MCP-backed agent gets to write a query against your live data at the moment of the question. The difference shows up immediately during incidents — when the question you actually want answered is never the one a dashboard was built for. For the broader architecture, see MCP for DevOps.

We've been running monitoring MCP servers in production for a few months. HolmesGPT was the first ecosystem we integrated. Our own server is what we ended up shipping. Here's what we learned.

What HolmesGPT got right

Robusta's HolmesGPT was the first widely deployed pattern for "give an LLM tools to investigate a cluster, not just summaries." The good ideas:

Alert-shaped investigations. When PagerDuty fires, the LLM gets the alert payload and a toolbox. It runs kubectl get events, hits Prometheus, checks Loki, and produces a written diagnosis. We borrowed this prompt-shape directly.
Opinionated runbooks. HolmesGPT bundles a library of "if you see X, check Y" investigation steps. These are basically prompts that bias the LLM toward the queries an experienced SRE would run. Worth the borrow.
Tool answers cite their queries. Every tool call returns both the data and the query string that produced it. That's a prerequisite for the human-in-the-loop step we'll get to below.

If you're standing up an LLM-driven on-call assistant from scratch, HolmesGPT is a perfectly reasonable place to start. We did.

Why we built our own

Three reasons HolmesGPT didn't fit our shape:

It's tied to a specific LLM and API surface. HolmesGPT's loop assumes a particular LLM provider and a particular tool-call format. We needed something that worked equally well as a tool inside Cursor, Claude Code, Copilot, and our own dashboard chat. The MCP standard exists exactly for this; HolmesGPT predates it.
It's customer-shaped, not multi-tenant. Each customer's monitoring stack lives in their own AWS account behind their own Istio gateway. HolmesGPT runs in a cluster and queries that cluster. Our autopilot needs to query N clusters from one agent process, with per-tenant credentials. That's a different topology — and one we solve with the fleet pattern in When one agent isn't enough.
It does too much. HolmesGPT bundles the investigation loop, the prompts, the runbooks, and the tools. We wanted just the tools — let each MCP client (Cursor, our dashboard, Claude Code) bring its own loop and prompts.
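
The multi-tenant point above (one agent process, N clusters, per-tenant credentials) can be sketched in a few lines of Python. This is only an illustration of the topology, not our actual implementation: `TenantStack`, `TenantRegistry`, the field names, and the bearer-token auth scheme are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantStack:
    """Where one customer's monitoring stack lives (all fields hypothetical)."""
    loki_url: str
    prometheus_url: str
    token_env_var: str  # name of the env var holding this tenant's credential


class UnknownTenant(KeyError):
    """Raised when a tool call names a tenant that is not registered."""


class TenantRegistry:
    def __init__(self, stacks: dict[str, TenantStack]) -> None:
        self._stacks = dict(stacks)

    def resolve(self, tenant_id: str) -> TenantStack:
        # Fail closed: never fall back to a default stack for an unknown tenant.
        try:
            return self._stacks[tenant_id]
        except KeyError:
            raise UnknownTenant(tenant_id) from None

    def auth_headers(self, tenant_id: str, env: dict[str, str]) -> dict[str, str]:
        # Bearer auth is an assumption; swap in whatever the tenant's gateway expects.
        stack = self.resolve(tenant_id)
        return {"Authorization": f"Bearer {env[stack.token_env_var]}"}


registry = TenantRegistry({
    "acme": TenantStack("https://loki.acme.example", "https://prom.acme.example", "ACME_TOKEN"),
})
print(registry.resolve("acme").prometheus_url)
```

The property worth keeping from the sketch is that every tool call resolves its tenant explicitly and an unknown tenant is an error, so one customer's query can never land on another customer's stack by default.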

So we wrote a thin MCP server. It exposes Loki, Prometheus, Grafana, and pod events as tools, scoped to one customer's stack. It does not own the LLM loop. It does not own the prompts. It returns raw data plus the query string that produced it.

The MCP server, in three tools

loki_logquery(label_selector, query, since)
  → returns: [{timestamp, line, labels}], plus the raw LogQL string

prom_range_query(promql, start, end, step)
  → returns: [{metric, values}], plus the PromQL string

k8s_pod_events(namespace, name, since)
  → returns: [{type, reason, message, timestamp, count}]

That's it. No "investigate this alert" tool. No "summarize the deploy" tool. The agent composes those workflows. The MCP server is plumbing.

The "raw query plus result" contract is the part that matters. If the agent's diagnosis is wrong, the human can re-run the exact LogQL or PromQL it used and see for themselves. That's how an LLM-generated diagnosis earns trust — by showing its work.
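
To make the contract concrete, here is a minimal Python sketch of what a tool like `loki_logquery` could look like if it does nothing beyond composing the LogQL, running it, and echoing the query back. The response keys (`query`, `since`, `results`), the injectable `fetch` callable, and the fake fetcher are assumptions for the example; a real server would call Loki and register the function with an MCP SDK.

```python
from typing import Callable

# A fetcher takes a LogQL string and a lookback window and returns log entries.
Fetcher = Callable[[str, str], list[dict]]


def loki_logquery(label_selector: str, query: str, since: str, fetch: Fetcher) -> dict:
    """Compose the LogQL, run it, and return the data together with the query."""
    logql = f"{label_selector} {query}".strip()
    entries = fetch(logql, since)
    return {
        "query": logql,      # the exact string a human can re-run
        "since": since,
        "results": entries,  # raw entries, no summarization
    }


def fake_fetch(logql: str, since: str) -> list[dict]:
    # Stand-in for a real Loki call; returns one made-up entry.
    return [{
        "timestamp": "2026-05-03T10:00:00Z",
        "line": "json decode error",
        "labels": {"pod": "api-7d4c-x9qz"},
    }]


result = loki_logquery('{pod="api-7d4c-x9qz"}', '|~ "(?i)error|oom|panic"', "10m", fake_fetch)
print(result["query"])  # {pod="api-7d4c-x9qz"} |~ "(?i)error|oom|panic"
```

Because the tool never rewrites or summarizes the query, the string in the response is the same one the backend executed.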

A worked-out incident

A composite distilled from a few real incidents we've debugged. Pod OOMKilled in production. The autopilot, prompted by the alert, ran:

1.  k8s_pod_events(namespace="payments", name="api-7d4c-x9qz", since="10m")
    → reason=OOMKilled, killed 3 minutes ago, restarted twice

2.  prom_range_query(
      promql="container_memory_working_set_bytes{pod='api-7d4c-x9qz'}",
      start="-30m", end="now", step="30s"
    )
    → memory steady at 800Mi for 25 min, climbed to 2Gi in last 4 min, killed

3.  loki_logquery(
      label_selector='{pod="api-7d4c-x9qz"}',
      query='|~ "(?i)error|oom|panic"',
      since="10m"
    )
    → 1,400 lines of "json decode error" in last 4 minutes, all referencing
      the same upstream batch endpoint

4.  prom_range_query(
      promql="rate(http_requests_total{handler='/batch-import'}[1m])",
      start="-30m", end="now", step="30s"
    )
    → request rate jumped 40x at the start of the memory climb

5.  k8s_pod_events(namespace="kube-system", name="karpenter-controller", since="30m")
    → 1 NodeClaim consolidation, unrelated, 18 minutes ago

The agent's diagnosis: "A 40x traffic spike on /batch-import correlates with the memory climb. The handler likely accumulates per-request state. OOMKill is symptom; rate limit is fix."

Total wall-clock: 3 seconds. If the human disagrees, they can re-run every query the agent ran and form their own conclusion — that's the design contract.
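
As a rough illustration of what "re-run every query" can mean in practice, the sketch below turns step 2 of the trace into a URL for Prometheus's standard `/api/v1/query_range` HTTP endpoint. The helper names, the base URL, and the handling of relative offsets like `-30m` are assumptions for the example; the fixed `now` value is arbitrary and only there to make the output reproducible.

```python
import re
from urllib.parse import urlencode

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def relative_to_epoch(offset: str, now: float) -> float:
    """Turn a relative offset such as "-30m" into Unix seconds."""
    match = re.fullmatch(r"-(\d+)([smh])", offset)
    if match is None:
        raise ValueError(f"unsupported offset: {offset!r}")
    amount, unit = match.groups()
    return now - int(amount) * _UNIT_SECONDS[unit]


def replay_url(base_url: str, promql: str, start: str, step: str, now: float) -> str:
    """Build a query_range URL that re-runs the agent's PromQL over the same window."""
    params = {
        "query": promql,  # passed through untouched, exactly as the agent reported it
        "start": relative_to_epoch(start, now),
        "end": now,
        "step": step,
    }
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


url = replay_url(
    "http://prometheus.example:9090",  # hypothetical address
    "container_memory_working_set_bytes{pod='api-7d4c-x9qz'}",
    start="-30m",
    step="30s",
    now=1_777_800_000.0,
)
print(url)
```

Pinning `start` and `end` to absolute timestamps matters for replays: a relative window re-run ten minutes later would show different data from what the agent saw.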

What this enables that dashboards don't

Alerts open with a diagnosis attached. Not a runbook the on-call has to follow — a candidate diagnosis they confirm or reject.
PRs include monitoring evidence. "This change reduces tail latency" gets the actual rate(...) query and the result inlined into the PR description.
Engineers stop having to learn LogQL. They ask in English; the agent writes the query. The query is right there in the response if they want to refine it.

Where this is heading

The interesting next step is closing the loop: an alert that, instead of paging a human, opens a PR with the diagnosis, the queries, and a proposed fix. The human reviews the PR, not the alert.

We're shipping a small slice of this for our own internal stack. It's not ready for customers yet — the failure modes of "agent autonomously closes incidents" are still the failure modes that get you fired. But the read-only version (alert → diagnosis with citations, posted to Slack) has been net-positive for months. For the cost-control patterns that make this affordable, see Cheap agents — four token-saving moves.
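
For a sense of what the read-only version involves, here is a minimal Python sketch that renders a diagnosis and the queries behind it as a Slack incoming-webhook body (a JSON object with a `text` field). The function name, the message layout, and the example strings are assumptions; actually posting the payload to a webhook URL is left out.

```python
import json


def slack_payload(alert: str, diagnosis: str, cited_queries: list[tuple[str, str]]) -> str:
    """Render a candidate diagnosis plus every query it relied on as a webhook body."""
    lines = [
        f"*Alert:* {alert}",
        f"*Candidate diagnosis:* {diagnosis}",
        "*Queries run:*",
    ]
    for tool, query in cited_queries:
        # Each citation names the tool and carries the query string verbatim.
        lines.append(f"- {tool}: `{query}`")
    return json.dumps({"text": "\n".join(lines)})


payload = slack_payload(
    alert="OOMKilled api-7d4c-x9qz",
    diagnosis="40x traffic spike on /batch-import correlates with the memory climb",
    cited_queries=[
        ("prom_range_query", "rate(http_requests_total{handler='/batch-import'}[1m])"),
        ("loki_logquery", '{pod="api-7d4c-x9qz"} |~ "(?i)error|oom|panic"'),
    ],
)
print(json.loads(payload)["text"])
```

The message only reports; nothing in it acts on the cluster, which is what keeps this slice on the safe side of the failure modes described above.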

What you can run today

If you're comfortable building your own loop:

Start with Robusta's HolmesGPT for inspiration on the prompts and runbooks
Write a thin MCP server that exposes your Loki, Prometheus, and Kubernetes pod events as the three tools above
Hook it into Claude Desktop or Cursor or whatever client you already use
Demand "diagnoses that cite their queries" from every agent that talks to your monitoring

If you're a Kuberly customer: this is already on by default.