Azure Monitor: End‑to‑End Observability in Azure

Azure Monitor article banner

Anyone working with the cloud knows that monitoring is not a luxury – it’s a core pillar of every modern architecture. Azure Monitor is Microsoft’s unified observability platform for infra to code, collecting metrics, logs, and traces across Azure, on-prem via Azure Arc, and even other clouds.

You can visualize with Workbooks and dashboards, query using Kusto Query Language (KQL), and act in real time with Alerts and Action Groups that trigger notifications or automation when thresholds are crossed.

Azure Monitor’s two data pillars: Metrics for near real-time numeric time-series, and Logs for rich, schema-flexible events explored with KQL.

It integrates with Application Insights and Container Insights, giving you a connected view from infrastructure health to application performance – all in one place.

Table of Contents

  1. Architecture
  2. Metrics vs Logs
  3. Application Insights & Tracing
  4. AKS, Prometheus & OpenTelemetry
  5. Alerts & Action Groups
  6. Useful KQL Queries
  7. Cost, Retention & Data Collection
  8. Enterprise Governance
  9. Official docs

1 · Architecture

Azure Monitor follows a clear end-to-end pipeline: CollectStoreAnalyzeAct. Each stage is modular, so you can connect the data sources you need, retain it in the right storage layer, explore it visually or programmatically, and then trigger automated actions based on your findings.

Azure Monitor
A comprehensive approach to cloud-native governance.
Action Group configuration in Azure Monitor
Alerting and automation sit on top of collected signals, closing the loop between detection and remediation.

2 · Metrics vs Logs

Metrics are lightweight numerical time-series that are optimized for speed and scale. They are perfect for visualizing trends in dashboards and for triggering alerts within seconds. Common examples include CPU usage, memory consumption, request rates, and response times. Logs, on the other hand, are detailed, schema-flexible records that capture rich contextual data — such as requests, exceptions, and audit events — and can be explored using Kusto Query Language (KQL) for deep analysis and historical insights.

Real-world example: when ticket sales open for a major concert in Yarkon Park, request rate and CPU metrics will spike. Autoscale rules can react instantly to add capacity and maintain performance. Meanwhile, request and dependency logs provide the granular view you need to pinpoint slow endpoints, failed API calls, or database bottlenecks — so you can optimize the hot paths without risking production stability.

Azure Monitor overview
Azure Logs
Azure Monitor overview
Azure Metrics

3 · Application Insights & Tracing

Application Insights is Azure Monitor’s Application Performance Monitoring (APM) capability. It tracks key telemetry such as requests, dependencies, exceptions, availability, and user flows. With distributed tracing, you can follow a single transaction end-to-end across multiple services and correlate it with infrastructure metrics for complete context.

4 · AKS, Prometheus & OpenTelemetry

When running workloads on Azure Kubernetes Service (AKS), visibility is critical. Container Insights integrates natively with Azure Monitor to provide real-time health and performance data for nodes, pods, and controllers. This includes CPU, memory, network usage, and container restarts, all viewable in Azure Workbooks or Grafana.

For more advanced, cloud-native observability, Azure offers a managed Prometheus service. It automatically scrapes Kubernetes endpoints for metrics and stores them in a highly-available backend, without requiring you to manage your own Prometheus server. This makes it easier to integrate with Azure Managed Grafana for rich dashboards.

Logs from AKS are ingested into a Log Analytics Workspace via the Azure Monitor Agent, where they can be queried with KQL for troubleshooting and performance tuning.

To standardize observability across polyglot microservices, OpenTelemetry exporters can be deployed in your services. This unifies metrics, logs, and distributed traces — regardless of the language or framework — enabling full end-to-end correlation from front-end requests down to container-level telemetry.

Azure Monitor overview
Azure Monitor provides a comprehensive solution for collecting, analyzing, and acting on telemetry data.

5 · Alerts & Action Groups

In Azure Monitor, alerts continuously evaluate telemetry against defined conditions. When a condition is met, it triggers an Action Group — a reusable set of notification and remediation steps. Action Groups can send emails, push notifications, Teams messages, open ITSM tickets, call webhooks, or run automated workflows via Azure Logic Apps, Functions, or Automation Runbooks.

  1. Create a Log Alert by writing a KQL query that captures the desired signal and define its evaluation frequency.
  2. Set thresholds — for example, trigger if more than 5 critical exceptions occur within 5 minutes.
  3. Attach an Action Group to notify the on-call team and optionally run automated remediation scripts.
// Example KQL for a log alert - top exceptions in the last 5 minutes
exceptions
| where severityLevel >= 3
| where timestamp > ago(5m)
| summarize count() by type
| order by count_ desc

By combining alerts with automation, you can move from reactive firefighting to proactive, self-healing infrastructure. For example, a high CPU alert on a Kubernetes node could automatically trigger a scale-out action or pod rescheduling without human intervention.

6 · Useful KQL Queries

One of the biggest advantages of Azure Monitor is the ability to run Kusto Query Language (KQL) queries directly against your telemetry. With KQL you can turn raw signals into actionable insights in seconds. Here are some practical examples:

// High CPU VMs in last 1 hour - find hotspots quickly
InsightsMetrics
| where Name == "Percentage CPU" and TimeGenerated > ago(1h)
| summarize avg(Val) by bin(TimeGenerated, 5m), Computer
| where avg_Val > 80
| order by avg_Val desc

// 4xx and 5xx errors by app - identify failing services
requests
| where timestamp > ago(1d)
| summarize count() by tostring(resultCode), cloud_RoleName
| where resultCode startswith "4" or resultCode startswith "5"
| order by count_ desc

// Slow dependencies over 1s - spot backend bottlenecks
dependencies
| where timestamp > ago(1h)
| where duration > 1s
| summarize avg(duration), percentiles(duration, 50, 95, 99) by target
| top 20 by percentile_duration_99 desc

These queries can be pinned to a Workbook, used in scheduled alert rules, or even embedded into automation workflows so issues are detected and addressed without human intervention.

7 · Cost, Retention & Data Collection

Observability is only effective if it is sustainable. Managing ingestion, storage, and analysis costs is crucial:

Azure Automation Runbook worker
Automate remediation with Logic Apps, Functions, or Runbooks to reduce manual effort.

8 · Enterprise Governance

In large environments, governance ensures consistency, security, and compliance across subscriptions:

This structured approach ensures your observability implementation is not only powerful, but also maintainable and compliant over time.

Cloud Native Governance
A comprehensive approach to cloud-native governance.

9 · Official docs