## Overview
Monitoring is a core capability of Azure Machine Learning that lets you track model performance, resource utilization, and data drift in near real time. It integrates with Azure Monitor and Log Analytics to provide a unified view of your ML workloads.
## Metrics
Azure ML automatically emits a set of built‑in metrics. You can also publish custom metrics (see the sketch after the table).
| Metric | Description |
|---|---|
| ml.compute.cpu_percent | CPU utilization of the compute target. |
| ml.compute.memory_percent | Memory utilization of the compute target, as a percentage. |
| ml.model.inference_latency | Latency of model inference calls. |
| ml.model.accuracy | Real‑time model accuracy (if logged). |
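Custom metrics are typically logged from your training or scoring code with MLflow, which Azure ML uses for tracking. A minimal sketch; the metric name and value are illustrative:

```python
import mlflow

# Inside an Azure ML job, the MLflow tracking URI is preconfigured;
# log_metric attaches the value to the active run.
with mlflow.start_run():
    mlflow.log_metric("ml.model.accuracy", 0.93)  # illustrative value
```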
## Alerts
Configure alert rules in Azure Monitor to be notified when metrics breach thresholds you define.
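Alert rules can also be created programmatically. A minimal sketch, assuming the azure-mgmt-monitor package and reusing the latency metric from the table above; the workspace resource ID, threshold, and rule name are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    MetricAlertResource,
    MetricAlertSingleResourceMultipleMetricCriteria,
    MetricCriteria,
)

monitor_client = MonitorManagementClient(DefaultAzureCredential(), subscription_id)

alert = MetricAlertResource(
    location="global",  # metric alert rules always use the "global" location
    description="Notify when average inference latency breaches the threshold.",
    severity=2,
    enabled=True,
    scopes=[workspace_resource_id],  # placeholder: ARM resource ID of the workspace
    evaluation_frequency="PT1M",     # evaluate every minute
    window_size="PT5M",              # over a five-minute window
    criteria=MetricAlertSingleResourceMultipleMetricCriteria(
        all_of=[
            MetricCriteria(
                name="high-latency",
                metric_name="ml.model.inference_latency",
                operator="GreaterThan",
                threshold=500,       # placeholder threshold
                time_aggregation="Average",
            )
        ]
    ),
)
monitor_client.metric_alerts.create_or_update(resource_group, "high-latency-alert", alert)
```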
## Log Streaming
Stream logs from training runs and deployments, and route them to a Log Analytics workspace through the workspace's diagnostic settings.
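As a sketch, the v2 SDK can stream job logs and fetch deployment logs directly; the run, endpoint, and deployment names below are placeholders:

```python
# Stream a training job's logs to stdout while (or after) it runs.
ml_client.jobs.stream("my-run-id")

# Fetch the most recent console log lines from an online deployment.
print(ml_client.online_deployments.get_logs(
    name="blue", endpoint_name="my-endpoint", lines=50))
```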
## API Reference
Use the Azure ML Python SDK to retrieve monitoring data programmatically. In the v2 SDK, metrics logged to a run are retrieved through MLflow tracking:
```python
import mlflow
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)

# Point MLflow at the workspace's tracking server (requires the azureml-mlflow package).
mlflow.set_tracking_uri(ml_client.workspaces.get(ml_client.workspace_name).mlflow_tracking_uri)

# Fetch every metric logged to the run.
for name, value in mlflow.get_run("my-run-id").data.metrics.items():
    print(f"{name}: {value}")
```
## Sample Code
Deploy a model with monitoring enabled. Below is a minimal sketch using a managed online endpoint, where `app_insights_enabled` turns on request and latency telemetry; the endpoint, deployment, and path names are placeholders.
```python
from azure.ai.ml.entities import (
    CodeConfiguration,
    ManagedOnlineDeployment,
    ManagedOnlineEndpoint,
    Model,
)

# The endpoint provides the stable scoring URI; deployments sit behind it.
endpoint = ManagedOnlineEndpoint(name="my-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-endpoint",
    model=Model(name="my-model", path="./model"),
    code_configuration=CodeConfiguration(code="./src", scoring_script="score.py"),
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:1",
    instance_type="Standard_DS3_v2",
    instance_count=1,
    app_insights_enabled=True,  # emit request, latency, and console telemetry
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```
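Once the deployment is live, sending a request generates the telemetry surfaced in Application Insights. A quick usage sketch; the request file is a placeholder:

```python
response = ml_client.online_endpoints.invoke(
    endpoint_name="my-endpoint",
    deployment_name="blue",
    request_file="./sample-request.json",  # placeholder request payload
)
print(response)
```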
## Best Practices
- Enable built‑in metrics from the start of your experiment.
- Set up alerting for critical thresholds such as latency or CPU usage.
- Ingest custom business KPIs to correlate model performance with business outcomes.
- Leverage Log Analytics workspaces to retain logs for compliance; retained logs can also be queried programmatically, as sketched after this list.
- Periodically review drift metrics and retrain models as needed.
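A sketch of querying retained logs, assuming the azure-monitor-query package; the workspace GUID and the KQL query are placeholders:

```python
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

logs_client = LogsQueryClient(DefaultAzureCredential())

# Run a KQL query against the Log Analytics workspace over the last day.
result = logs_client.query_workspace(
    workspace_id="<log-analytics-workspace-guid>",
    query="AmlOnlineEndpointConsoleLog | take 10",  # illustrative table and query
    timespan=timedelta(days=1),
)
for table in result.tables:
    for row in table.rows:
        print(row)
```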