Monitoring Azure AI Machine Learning
This section provides detailed information on how to monitor your Azure AI Machine Learning resources and deployed models. Effective monitoring is crucial for understanding performance, detecting issues, and ensuring the reliability of your machine learning solutions.
Key Monitoring Areas
- Monitoring Data
- Model Performance Monitoring
- Infrastructure and Resource Metrics
- Logging and Auditing
- Alerting and Notifications
Monitoring Data
Understand how to collect and analyze data related to your model's predictions and inputs. This includes data drift detection and monitoring for data quality issues.
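As a rough illustration of what a data quality check measures, the sketch below computes per-column null rates and an out-of-range rate over a batch of logged inputs. The column names, the sample batch, and the valid range for `age` are hypothetical; this is not the service's own implementation.

```python
from typing import Any, Dict, List


def null_rates(rows: List[Dict[str, Any]], columns: List[str]) -> Dict[str, float]:
    """Return the fraction of rows in which each column is missing or None."""
    if not rows:
        return {column: 0.0 for column in columns}
    rates = {}
    for column in columns:
        missing = sum(1 for row in rows if row.get(column) is None)
        rates[column] = missing / len(rows)
    return rates


def out_of_range_rate(rows: List[Dict[str, Any]], column: str, low: float, high: float) -> float:
    """Return the fraction of non-null values in `column` that fall outside [low, high]."""
    values = [row[column] for row in rows if row.get(column) is not None]
    if not values:
        return 0.0
    return sum(1 for value in values if value < low or value > high) / len(values)


# Hypothetical batch of logged model inputs.
batch = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 61000.0},
    {"age": 212, "income": None},
    {"age": 45, "income": 48000.0},
]
quality_null_rates = null_rates(batch, ["age", "income"])
age_out_of_range = out_of_range_rate(batch, "age", low=0, high=120)
```

Tracking these rates per batch and comparing them with the rates seen in the training data is one simple way to surface quality regressions.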
Data Drift Detection
Azure AI Machine Learning provides capabilities to detect drift in your data. Data drift occurs when the statistical properties of the incoming data change over time compared to the training data. This can significantly impact model performance.
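# Example: setting up data drift monitoring with the Python SDK v2 (azure-ai-ml).
# Entity names follow the SDK's model monitoring classes; verify them against your installed version.
from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import MonitorDatasetContext
from azure.ai.ml.entities import (
    DataDriftSignal, MonitorDefinition, MonitorSchedule, ProductionData,
    RecurrenceTrigger, ReferenceData, ServerlessSparkCompute,
)

# Replace each ... with your credential and workspace details.
ml_client = MLClient(credential=..., subscription_id=..., resource_group_name=..., workspace_name=...)

# Define your training (baseline) and production (target) data inputs.
# uri_folder inputs usually also need a pre_processing_component; that step is omitted here.
training_data = ReferenceData(
    input_data=Input(type="uri_folder", path="azureml://datastores/workspaceblobstore/paths/training_data"),
    data_context=MonitorDatasetContext.TRAINING,
    data_column_names={"target_column": "target"},
)
production_data = ProductionData(
    input_data=Input(type="uri_folder", path="azureml://datastores/workspaceblobstore/paths/production_data"),
    data_context=MonitorDatasetContext.MODEL_INPUTS,
)

# Configure the data drift signal (SDK default metrics and thresholds)
data_drift_signal = DataDriftSignal(production_data=production_data, reference_data=training_data)

# Create the data drift monitor and schedule it to run daily
data_drift_monitor = MonitorSchedule(
    name="my_data_drift_monitor",
    display_name="MyDataDriftMonitor",
    description="Monitor for data drift in production data",
    trigger=RecurrenceTrigger(frequency="day", interval=1),  # Check daily
    create_monitor=MonitorDefinition(
        # Compute size and Spark runtime are placeholder assumptions.
        compute=ServerlessSparkCompute(instance_type="standard_e4s_v3", runtime_version="3.3"),
        monitoring_signals={"data_drift": data_drift_signal},
    ),
)
ml_client.schedules.begin_create_or_update(data_drift_monitor).result()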
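To make "a change in statistical properties" concrete, the following sketch computes the Jensen-Shannon distance between a baseline histogram and a production histogram of one numerical feature. It is a standalone illustration: the bin edges, sample values, and the 0.1 threshold are assumptions, and the monitoring service may bin and threshold differently.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Bin values into len(edges) - 1 buckets and return relative frequencies.

    Values outside [edges[0], edges[-1]] are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for value in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= value < edges[i + 1] or (is_last and value == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [count / total for count in counts] if total else [0.0] * len(counts)


def jensen_shannon_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon distance (log base 2): 0 for identical, 1 for disjoint distributions."""
    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0 because b is the mixture.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    divergence = 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)
    return math.sqrt(max(0.0, divergence))


# Hypothetical feature values from training and from production traffic.
edges = [0, 10, 20, 30, 40]
training_values = [5, 12, 15, 18, 22, 25, 28, 35]
production_values = [22, 25, 28, 31, 33, 35, 38, 39]

baseline = histogram(training_values, edges)
target = histogram(production_values, edges)
drift = jensen_shannon_distance(baseline, target)
drift_detected = drift > 0.1  # assumed threshold for illustration
```

A distance near 0 means the production distribution still resembles the training distribution; values approaching 1 mean the two barely overlap.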
Model Performance Monitoring
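Track the accuracy, precision, recall, and other relevant metrics of your deployed models over time. Because these metrics compare predictions with actual outcomes, they can be computed only once ground truth labels become available. Tracking them helps you identify performance degradation.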
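A minimal sketch of how such metrics might be computed once ground truth arrives: predictions are joined to labels by request ID, and accuracy, precision, and recall are calculated over the matched pairs. The request IDs, labels, and the choice of `1` as the positive class are hypothetical.

```python
from typing import Dict, Hashable, Sequence


def classification_metrics(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable], positive: Hashable = 1
) -> Dict[str, float]:
    """Compute accuracy, precision, and recall for a binary classifier."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labelled prediction is required")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical logged predictions and later-arriving ground truth, keyed by request ID.
predictions = {"req-1": 1, "req-2": 0, "req-3": 1, "req-4": 1, "req-5": 0}
ground_truth = {"req-1": 1, "req-2": 0, "req-3": 0, "req-4": 1}  # req-5 is not labelled yet

matched_ids = sorted(predictions.keys() & ground_truth.keys())
metrics = classification_metrics(
    [ground_truth[i] for i in matched_ids],
    [predictions[i] for i in matched_ids],
)
```

Only requests that have a label are scored, so the metrics lag behind live traffic by however long labels take to arrive.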
Metrics Collection
When deploying a model to an endpoint, you can enable the collection of inference logs and metrics. These metrics are often sent to Azure Application Insights for detailed analysis.
| Metric | Description |
|---|---|
| Inference Latency | Time taken to process an inference request. |
| Inference Throughput | Number of inference requests processed per unit of time. |
| Error Rate | Percentage of requests that resulted in an error. |
| Prediction Distribution | Distribution of predicted labels or values. |
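The metrics in the table can all be derived from per-request records. The sketch below shows one way to do that in plain Python; the `RequestRecord` fields and the sample data are hypothetical, and latency percentiles use the nearest-rank method.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class RequestRecord:
    timestamp_s: float  # arrival time in seconds
    latency_ms: float   # time taken to serve the request
    status_code: int    # HTTP status returned to the caller
    prediction: str     # predicted label


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records: List[RequestRecord]) -> Dict[str, object]:
    """Summarize latency, throughput, error rate, and prediction distribution."""
    latencies = [r.latency_ms for r in records]
    timestamps = [r.timestamp_s for r in records]
    # Approximate the observation window by the span between first and last request.
    window_s = (max(timestamps) - min(timestamps)) or 1.0
    errors = sum(1 for r in records if r.status_code >= 400)
    label_counts = Counter(r.prediction for r in records)
    return {
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
        "throughput_rps": len(records) / window_s,
        "error_rate_pct": 100 * errors / len(records),
        "prediction_distribution": {k: v / len(records) for k, v in label_counts.items()},
    }


# Hypothetical request log.
records = [
    RequestRecord(0.0, 120.0, 200, "approve"),
    RequestRecord(2.0, 80.0, 200, "approve"),
    RequestRecord(4.0, 100.0, 200, "deny"),
    RequestRecord(6.0, 400.0, 500, "approve"),
    RequestRecord(10.0, 90.0, 200, "deny"),
]
summary = summarize(records)
```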
Infrastructure and Resource Metrics
Monitor the underlying infrastructure supporting your Azure AI Machine Learning workloads. This includes compute usage, network traffic, and storage utilization.
Azure Monitor Integration
Azure AI Machine Learning integrates with Azure Monitor, allowing you to view and analyze metrics related to your workspace, compute clusters, and endpoints. Key metrics include:
- CPU and Memory Utilization
- Network In/Out
- Disk IOPS
- Queue Lengths (for batch endpoints)
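Metrics like those listed above are queried per Azure resource, so a first step is to build the resource ID of the workspace or deployment. The sketch below builds those IDs and then, as an assumption, queries averaged values with the `azure-monitor-query` 1.x `MetricsQueryClient`; the helper names are hypothetical, and the metric name is left to the caller because available names vary by resource type.

```python
from typing import List, Tuple


def workspace_resource_id(subscription_id: str, resource_group: str, workspace_name: str) -> str:
    """Build the Azure resource ID of a Machine Learning workspace."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.MachineLearningServices/workspaces/{workspace_name}"
    )


def online_deployment_resource_id(
    subscription_id: str, resource_group: str, workspace_name: str,
    endpoint_name: str, deployment_name: str,
) -> str:
    """Build the Azure resource ID of an online deployment under a workspace."""
    workspace_id = workspace_resource_id(subscription_id, resource_group, workspace_name)
    return f"{workspace_id}/onlineEndpoints/{endpoint_name}/deployments/{deployment_name}"


def query_average_metric(resource_id: str, metric_name: str, hours: int = 1) -> List[Tuple[object, float]]:
    """Return (timestamp, average) pairs for one metric over the last `hours` hours.

    Assumes azure-identity and azure-monitor-query 1.x are installed; imports are
    deferred so the ID helpers above work without them.
    """
    from datetime import timedelta

    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import MetricAggregationType, MetricsQueryClient

    client = MetricsQueryClient(DefaultAzureCredential())
    response = client.query_resource(
        resource_id,
        metric_names=[metric_name],
        timespan=timedelta(hours=hours),
        granularity=timedelta(minutes=5),
        aggregations=[MetricAggregationType.AVERAGE],
    )
    return [
        (point.timestamp, point.average)
        for metric in response.metrics
        for series in metric.timeseries
        for point in series.data
    ]


# Placeholder identifiers for illustration only.
deployment_id = online_deployment_resource_id("sub", "rg", "ws", "ep", "blue")
```

With real identifiers, `query_average_metric(deployment_id, "<metric-name>")` would return the five-minute averages for the chosen metric.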
Logging and Auditing
Understand how to access and interpret logs generated by your Azure AI Machine Learning services. This is essential for debugging and troubleshooting.
Inference Logs
Logs from deployed endpoints capture details about incoming requests, model predictions, and any errors encountered during inference. You can route these logs to Azure Application Insights or Log Analytics.
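One way to make inference logs easy to query is to emit a single structured JSON line per request from the scoring script. The sketch below follows the `run(raw_data)` entry point convention of a scoring script; the `predict` function is a hypothetical stand-in for a real model, and whether these lines reach Application Insights or Log Analytics depends on how the deployment is configured.

```python
import json
import logging
import time
import uuid
from typing import List

logger = logging.getLogger("inference")


def predict(features: List[List[float]]) -> List[int]:
    """Hypothetical stand-in for a real model: label a row 1 if its values sum above 1.0."""
    return [1 if sum(row) > 1.0 else 0 for row in features]


def run(raw_data: str) -> str:
    """Score a JSON request and log one structured record describing it."""
    request_id = str(uuid.uuid4())
    started = time.perf_counter()
    record = {"request_id": request_id, "status": "ok"}
    try:
        features = json.loads(raw_data)["data"]
        predictions = predict(features)
        record.update(rows=len(features), predictions=predictions)
        response = {"request_id": request_id, "predictions": predictions}
    except (KeyError, TypeError, ValueError) as exc:
        # Malformed JSON, a missing "data" key, or non-numeric rows end up here.
        record.update(status="error", error=repr(exc))
        response = {"request_id": request_id, "error": "invalid request"}
    record["latency_ms"] = round((time.perf_counter() - started) * 1000, 3)
    logger.info(json.dumps(record))
    return json.dumps(response)


ok_response = json.loads(run('{"data": [[0.2, 0.3], [0.9, 0.8]]}'))
bad_response = json.loads(run("not json"))
```

Including the same `request_id` in the response and the log line lets you correlate a caller's report with the server-side record.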
Audit Logs
The Azure Activity Log provides a record of operations performed on your Azure resources, including your Azure AI Machine Learning workspace. This helps you track who did what, when, and to which resource.
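As an illustration of the "who did what, when" question, the sketch below filters a list of exported activity log events down to those that touch Machine Learning resources. The event dictionaries are hypothetical and use simplified field names (`eventTimestamp`, `caller`, `operationName`, `resourceId`) modelled on activity log events; a real export may name or nest these fields differently.

```python
from typing import Dict, List, Tuple

ML_PROVIDER = "microsoft.machinelearningservices"


def workspace_operations(events: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (timestamp, caller, operation) for events on Machine Learning resources, oldest first."""
    matches = [e for e in events if ML_PROVIDER in e.get("resourceId", "").lower()]
    # ISO 8601 timestamps in the same format sort correctly as strings.
    matches.sort(key=lambda e: e["eventTimestamp"])
    return [(e["eventTimestamp"], e["caller"], e["operationName"]) for e in matches]


# Hypothetical exported events.
events = [
    {
        "eventTimestamp": "2024-05-02T09:30:00Z",
        "caller": "bob@example.com",
        "operationName": "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/delete",
        "resourceId": "/subscriptions/000/resourceGroups/rg/providers/Microsoft.MachineLearningServices/workspaces/ws/onlineEndpoints/ep",
    },
    {
        "eventTimestamp": "2024-05-01T14:00:00Z",
        "caller": "alice@example.com",
        "operationName": "Microsoft.MachineLearningServices/workspaces/write",
        "resourceId": "/subscriptions/000/resourceGroups/rg/providers/Microsoft.MachineLearningServices/workspaces/ws",
    },
    {
        "eventTimestamp": "2024-05-01T15:00:00Z",
        "caller": "alice@example.com",
        "operationName": "Microsoft.Storage/storageAccounts/write",
        "resourceId": "/subscriptions/000/resourceGroups/rg/providers/Microsoft.Storage/storageAccounts/sa",
    },
]
audit_trail = workspace_operations(events)
```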
Alerting and Notifications
Set up alerts so that you are notified when critical issues occur or performance thresholds are crossed. This enables timely intervention.
Creating Alerts
Alerts can be configured through Azure Monitor based on specific metrics, log queries, or activity log events. Each alert rule can be linked to an action group that sends notifications via email, SMS, or webhooks.
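The logic behind a metric-based alert can be sketched as a threshold applied to an aggregate over a recent window. The code below is a simplified local model of that idea, not Azure Monitor's implementation; the `ThresholdRule` class, the rule values, and the payload shape are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence


@dataclass
class ThresholdRule:
    name: str
    threshold: float
    window: int            # number of most recent samples to average
    operator: str = "gt"   # "gt" fires above the threshold, "lt" fires below it


def evaluate(rule: ThresholdRule, samples: Sequence[float]) -> bool:
    """Return True if the average of the last `window` samples breaches the threshold."""
    if len(samples) < rule.window:
        return False  # not enough data to evaluate the rule yet
    average = mean(samples[-rule.window:])
    return average > rule.threshold if rule.operator == "gt" else average < rule.threshold


def build_notification(rule: ThresholdRule, samples: Sequence[float]) -> Dict[str, object]:
    """Build a payload that a webhook or email sender could deliver."""
    return {
        "rule": rule.name,
        "threshold": rule.threshold,
        "observed_average": mean(samples[-rule.window:]),
    }


# Hypothetical rule: alert when average p95 latency over the last 3 samples exceeds 500 ms.
latency_rule = ThresholdRule(name="high-p95-latency", threshold=500.0, window=3)
breaching = [200, 300, 600, 700, 650]
healthy = [200, 300, 600, 200, 250]

fired = evaluate(latency_rule, breaching)
payload = build_notification(latency_rule, breaching) if fired else None
```

Averaging over a window rather than reacting to a single sample reduces noise from one-off spikes, at the cost of a slightly slower reaction.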