Watching your Azure bill with AI: SRE Agent + FinOps hubs

Beyond routing cost alerts to an AI agent — wiring Azure SRE Agent into FinOps hubs data via MCP and KQL, building a FinOps subagent, and automating cost remediation with guardrails.

Most cost conversations in Azure start the same way: someone opens the invoice at the end of the month and asks “why is this number 30% higher than last month?”. By then the money is already spent. The fix is to treat cost like any other operational signal — monitored continuously, investigated quickly, acted on early.

Microsoft now ships two pieces that map onto this: Azure SRE Agent and FinOps hubs. Most write-ups (and demos — this one included) treat them as two separate tools that happen to share the word “cost”. The interesting part is what happens when you wire them into one system: hubs as the cost data plane, the agent as the investigation-and-action plane.

The two pieces, briefly

Azure SRE Agent is an AI agent for operations work. It receives incidents from Azure Monitor, PagerDuty or ServiceNow, investigates across logs, metrics and deployment history, remembers past incidents, and acts — in Review mode (a human approves every action) or Autonomous mode (it mitigates on its own, within the permissions you scoped). Two features matter for what follows: you can build custom subagents specialized for a domain, and incident response plans route specific incident classes to a specific subagent with a defined autonomy level.

FinOps hubs (part of the open-source Microsoft FinOps toolkit) turn raw Cost Management exports into an analytics-grade platform: cost data normalized to the FOCUS specification, stored in Azure Data Explorer or Microsoft Fabric, and queryable with KQL in seconds across subscriptions, billing accounts, even tenants, together with price sheets and reservation data. The estimated cost of running a hub starts around $120/month, plus roughly $10 per $1M of monitored spend.
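To make that estimate concrete, here is a back-of-the-envelope sketch. It assumes the cost scales linearly with monitored spend, which is my simplification and not something the toolkit promises; the function name and defaults are made up for illustration.

```python
def estimate_hub_monthly_cost(monitored_spend_usd: float,
                              base_usd: float = 120.0,
                              per_million_usd: float = 10.0) -> float:
    """Rough hub cost: a fixed base plus a rate per $1M of monitored spend.

    Assumption: linear scaling. Real costs depend on cluster size and retention.
    """
    if monitored_spend_usd < 0:
        raise ValueError("monitored spend cannot be negative")
    return base_usd + per_million_usd * (monitored_spend_usd / 1_000_000)


if __name__ == "__main__":
    for spend in (500_000, 5_000_000, 25_000_000):
        cost = estimate_hub_monthly_cost(spend)
        print(f"${spend:,} monitored -> about ${cost:,.0f}/month")
```

Treat the result as an order-of-magnitude check before deploying, not a quote.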

Why mixing them is the actual product

Here’s the thing about AI agents: an agent’s reasoning is only as good as the data you let it touch. Ask an agent “why did costs spike?” when all it can reach is the portal’s Cost analysis views, and you’ll get a shallow answer — those views were built for humans clicking through filters, not for a model running analysis.

FinOps hubs change that, and the integration surface already exists on both sides:

  • hubs ship an MCP server that understands FinOps and connects to your cost data;
  • SRE Agent consumes MCP connectors and Kusto sources as part of its investigations, right next to Azure Monitor and Application Insights.

Connect the two and the agent stops guessing. It can run real KQL against FOCUS-normalized data — group the delta by service, join against reservation coverage, compare unit costs across weeks — and then do the one thing hubs can’t do alone: correlate cost with operational context (deployments, traffic, incidents) and take action.

The flow looks like this:

Cost anomaly or budget alert (Azure Monitor) → response plan routes it to a FinOps subagent → the subagent queries hubs (KQL via MCP): which resources drove the delta, since when → correlates with deployment history and metrics: traffic-driven or leak? → proposes a fix (Review) or executes it (Autonomous) → writes the resolution to memory, so the next spike of the same shape is triaged in seconds.

That last step is quietly the most valuable one. Runbooks go stale; the agent’s memory compounds.

Five scenarios beyond “route the alert”

1. Spike triage with a verdict, not a chart. The agent’s job isn’t to tell you costs went up — the alert already did. Its job is to separate legitimate growth (a scale-out backed by traffic metrics and a planned release) from waste (orphaned disks, a runaway Log Analytics ingestion, a dev environment nobody turned off). Each verdict cites evidence: the KQL result from hubs, the metric, the deployment that correlates. A query it might run:

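// Which services drove the cost delta between the last two full 7-day windows?
// Today is excluded so both windows cover exactly 7 complete days.
Costs
| where ChargePeriodStart >= startofday(ago(14d)) and ChargePeriodStart < startofday(now())
// sumif keeps the output columns fixed; pivot's schema depends on the data it sees
| summarize
    LastWeek = sumif(EffectiveCost, ChargePeriodStart < startofday(ago(7d))),
    ThisWeek = sumif(EffectiveCost, ChargePeriodStart >= startofday(ago(7d)))
    by ServiceName
| extend Delta = ThisWeek - LastWeek
| top 10 by Delta desc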
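Once the cost delta is known, the verdict in scenario 1 comes down to comparing it with traffic and release history. The sketch below is one way that decision could look; the `SpikeEvidence` fields, the 10-point tolerance and the verdict labels are all assumptions of mine, not anything SRE Agent exposes.

```python
from dataclasses import dataclass


@dataclass
class SpikeEvidence:
    cost_change_pct: float     # week-over-week cost change, from the hub query
    traffic_change_pct: float  # week-over-week request change, from metrics
    planned_release: bool      # a known deployment falls inside the window


def triage_verdict(ev: SpikeEvidence, tolerance_pct: float = 10.0) -> str:
    """Return 'legitimate-growth', 'needs-review' or 'likely-waste'."""
    # Cost that rises roughly in step with traffic looks like growth, not a leak.
    tracks_traffic = ev.cost_change_pct <= ev.traffic_change_pct + tolerance_pct
    if tracks_traffic and ev.planned_release:
        return "legitimate-growth"
    if tracks_traffic or ev.planned_release:
        return "needs-review"  # only one signal explains the spike
    return "likely-waste"


if __name__ == "__main__":
    print(triage_verdict(SpikeEvidence(30.0, 28.0, True)))
    print(triage_verdict(SpikeEvidence(30.0, 0.0, False)))
```

A real agent would attach the query result, the metric and the deployment record to whichever label it returns.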

2. The idle-resource reaper. Hubs know what every resource costs; Azure Monitor knows its utilization. The agent joins the two: resources burning money at flat-zero utilization get deallocated autonomously in dev/test subscriptions and turned into a review proposal in production. This is the classic FinOps recommendation list — except something actually executes it.
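The join-and-decide step of the reaper can be sketched in a few lines. Everything here is hypothetical: the `Resource` shape, the environment labels, and the thresholds stand in for whatever your hub query and metrics actually return.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    resource_id: str
    environment: str     # assumed to come from a tag: "dev", "test" or "prod"
    weekly_cost: float   # cost side, from the hub
    peak_cpu_pct: float  # utilization side, peak over the same window


def plan_idle_actions(resources, min_cost=1.0, idle_cpu_pct=1.0):
    """Pick idle resources worth acting on, most expensive first."""
    plan = []
    for r in resources:
        # Skip anything too cheap to matter or showing real utilization.
        if r.weekly_cost < min_cost or r.peak_cpu_pct > idle_cpu_pct:
            continue
        # Autonomy only outside production; production becomes a proposal.
        action = "deallocate" if r.environment in ("dev", "test") else "propose-review"
        plan.append((r.resource_id, action, r.weekly_cost))
    return sorted(plan, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    fleet = [
        Resource("vm-dev-1", "dev", 40.0, 0.0),
        Resource("vm-prod-1", "prod", 90.0, 0.5),
        Resource("vm-prod-busy", "prod", 200.0, 63.0),
    ]
    for item in plan_idle_actions(fleet):
        print(item)
```

Using peak rather than average CPU is a deliberately conservative choice: a VM that wakes up once a week for a batch job should not be reaped.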

3. Commitment coverage watchdog. Hubs ingest reservation details and recommendations alongside cost. After workloads shift — a migration lands, a service gets decommissioned — the agent reviews coverage drift and proposes reservation or savings plan adjustments with the savings math attached, instead of waiting for the quarterly “why is our RI utilization 60%?” meeting.
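The "savings math" behind a coverage-drift finding is simple arithmetic. A minimal sketch, assuming you can pull used and reserved hours plus an hourly rate from the hub's reservation data; the function name, the dictionary keys and the 80% threshold are illustrative choices, not toolkit names.

```python
def reservation_report(used_hours: float, reserved_hours: float,
                       hourly_rate: float, min_utilization: float = 0.8) -> dict:
    """Summarize how much of a commitment was consumed and what the rest cost."""
    if reserved_hours <= 0:
        raise ValueError("reserved_hours must be positive")
    utilization = used_hours / reserved_hours
    # Reserved hours nobody consumed are still paid for.
    unused_cost = max(reserved_hours - used_hours, 0.0) * hourly_rate
    return {
        "utilization": utilization,
        "unused_cost": unused_cost,
        "drifted": utilization < min_utilization,
    }


if __name__ == "__main__":
    # A 30-day month (720 hours) with only 432 hours consumed.
    print(reservation_report(used_hours=432, reserved_hours=720, hourly_rate=0.5))
```

The unused cost is the number to put in front of whoever owns the commitment; it turns a utilization percentage into money.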

4. Cost regression gate. After each release, the agent compares cost-per-unit (per request, per tenant, per job — whatever your unit economics are) before and after, using hubs for the cost side and Application Insights for the traffic side. A 2× cost-per-request regression gets treated exactly like a failed performance test: ticket, owner, rollback conversation. Cost becomes part of the definition of done.
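The gate itself is a ratio check on unit cost. Here is a minimal sketch under the assumption that you already have cost (from the hub) and unit counts (from Application Insights) for a window before and after the release; the function names and the 1.5x default threshold are mine.

```python
def cost_per_unit(cost: float, units: float) -> float:
    if units <= 0:
        raise ValueError("need at least one unit of traffic to compute unit cost")
    return cost / units


def regression_gate(before_cost: float, before_units: float,
                    after_cost: float, after_units: float,
                    max_ratio: float = 1.5) -> dict:
    """Compare unit cost across a release and fail if it grew past max_ratio."""
    before = cost_per_unit(before_cost, before_units)
    after = cost_per_unit(after_cost, after_units)
    if before <= 0:
        raise ValueError("baseline unit cost must be positive")
    ratio = after / before
    return {"before": before, "after": after, "ratio": ratio, "passed": ratio <= max_ratio}


if __name__ == "__main__":
    # Traffic grew 20%, but cost grew 140%: unit cost doubled.
    print(regression_gate(100.0, 1_000_000, 240.0, 1_200_000))
```

Normalizing by units is the whole point: a release that raises total cost by 20% alongside 20% more traffic passes, while the example above fails.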

5. Allocation hygiene. Unallocated and untagged spend is the silent killer of every chargeback model. The agent hunts down the untagged resources behind the “unallocated” bucket, identifies owners from resource group patterns or deployment history, and opens the ServiceNow ticket — or proposes the IaC fix directly. Tag debt stops compounding because chasing it no longer costs a human afternoon.
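Finding what sits behind the "unallocated" bucket is mostly a filter on required tags. This sketch assumes cost rows arrive as dictionaries with `resource_id`, `cost` and `tags` keys and that `owner` and `costcenter` are the required tags; both are placeholders for your own tagging policy.

```python
def unallocated_spend(rows, required_tags=("owner", "costcenter")):
    """Return (per-resource untagged spend, share of total spend unallocated)."""
    missing = {}
    total = 0.0
    for row in rows:
        total += row["cost"]
        # A tag counts as missing when it is absent or empty.
        absent = [tag for tag in required_tags if not row["tags"].get(tag)]
        if absent:
            entry = missing.setdefault(row["resource_id"], {"cost": 0.0, "missing": absent})
            entry["cost"] += row["cost"]
    unallocated = sum(entry["cost"] for entry in missing.values())
    share = unallocated / total if total else 0.0
    return missing, share


if __name__ == "__main__":
    rows = [
        {"resource_id": "a", "cost": 30.0, "tags": {"owner": "team-x", "costcenter": "42"}},
        {"resource_id": "b", "cost": 20.0, "tags": {"owner": "team-y"}},
        {"resource_id": "c", "cost": 50.0, "tags": {}},
    ]
    offenders, share = unallocated_spend(rows)
    print(offenders, f"{share:.0%} unallocated")
```

The per-resource output is what would feed the ticket; the share is the number worth tracking over time.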

Guardrails — the part you don’t skip

An agent that can stop VMs is a privileged identity, full stop. I’d treat its access exactly like a PIM-managed admin account:

  • Least privilege by scope: Reader and Monitoring Reader broadly; mutation rights only where autonomy is actually intended (dev/test resource groups), nothing standing in production.
  • Review mode first, autonomy per scenario — not globally. Graduate individual, well-understood actions (“deallocate dev VMs idle 48h+”) after you’ve watched the agent propose them correctly for a few weeks. The autonomous-mode acknowledgment dialog exists for a reason.
  • Audit everything: agent actions land in the activity log like anyone else’s; route its production changes through the same change management as humans.
  • Cost data is sensitive. Price sheets and negotiated discounts are commercial information, so scope access to the hub’s data with the same least-privilege discipline you apply to the agent’s Azure permissions.
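"Autonomy per scenario, not globally" can be expressed as an explicit allowlist where everything else falls back to review. The sketch below is purely illustrative: SRE Agent configures this through response plans, and the `GraduatedAction` type and action names here are invented to show the shape of the policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraduatedAction:
    action: str
    environment: str
    min_idle_hours: int


# Only actions you have watched the agent propose correctly belong here.
GRADUATED = (GraduatedAction("deallocate-vm", "dev", 48),)


def execution_mode(action: str, environment: str, idle_hours: float,
                   graduated=GRADUATED) -> str:
    """Return 'autonomous' only for an explicitly graduated scenario."""
    for rule in graduated:
        same_scenario = (rule.action, rule.environment) == (action, environment)
        if same_scenario and idle_hours >= rule.min_idle_hours:
            return "autonomous"
    return "review"  # default: a human approves


if __name__ == "__main__":
    print(execution_mode("deallocate-vm", "dev", 72))
    print(execution_mode("deallocate-vm", "prod", 500))
```

The design choice that matters is the default: anything not listed lands in review, so forgetting a rule costs a human click rather than a production outage.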

Getting started

  1. Deploy a FinOps hub and configure Cost Management exports — the All costs (FOCUS) + prices template, parquet + snappy, per billing scope. EA billing accounts or MCA billing profiles give the broadest datasets.
  2. Connect Azure SRE Agent to Azure Monitor as the incident platform; define budget and anomaly alerts on the scopes that hurt the most.
  3. Build a FinOps subagent and connect it to the hub’s data (MCP server / Kusto). Seed the knowledge base with your FinOps runbooks: subscription owners, tagging policy, budget thresholds, escalation paths.
  4. Create a response plan routing cost alerts to that subagent — in Review mode.
  5. Iterate toward autonomy one scenario at a time, starting where the blast radius is small and the waste is obvious.

The FinOps principle behind all of this: cost is an engineering signal, not an accounting artifact. Hubs give that signal a proper data plane; the agent finally gives it an action plane. Wire them together, and your invoice stops being a monthly surprise — it becomes just another system that pages someone (or something) when it misbehaves.


Sources & further reading: Azure SRE Agent docs, SRE Agent connectors, custom subagents, FinOps hubs overview, FinOps toolkit, video: Monitor your Azure Costs with Azure SRE Agent.