This guide provides common troubleshooting steps for issues encountered with Azure Virtual Machines (VMs). We'll cover scenarios ranging from connectivity problems to performance degradation and boot failures.
Connectivity Issues
Problems connecting to your Azure VM can stem from network configurations, firewall rules, or the VM's own operating system.
1. Cannot RDP/SSH into the VM
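Check Network Security Group (NSG) rules: Ensure that inbound rules allow RDP (TCP port 3389 for Windows) or SSH (TCP port 22 for Linux) from your IP address or an allowed subnet. NSGs can be attached to both the subnet and the network interface, and rules are evaluated in priority order (lowest number first), so confirm that no higher-priority deny rule overrides your allow rule.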
Verify Public IP Address: Confirm that your VM has a public IP address assigned and that it's the one you're attempting to connect to.
Check Azure Firewall: If traffic to the VM passes through Azure Firewall, ensure that a firewall rule allows the relevant port and that the traffic is forwarded to the VM.
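VM Agent Status: Check that the Azure VM Agent is running (the Windows Guest Agent on Windows, `waagent` on Linux). RDP and SSH themselves do not depend on it, but management operations such as extensions, Run Command, and password reset do.
Boot Diagnostics: Review the boot diagnostics screenshot and serial log for OS-level errors that prevent network services from starting.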
Tip: Use the Connection Troubleshoot tool in the Azure portal for VMs. It can automatically diagnose and report network connectivity issues.
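To reason about why an NSG blocks or allows a connection, it can help to model how rules are evaluated: sorted by priority, with the first match deciding the outcome. The sketch below is a simplified illustration, not the Azure implementation. The `Rule` class, its fields, and the sample rules are hypothetical, and real NSG rules also match on protocol, destination, service tags, and port ranges.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Simplified stand-in for an inbound NSG rule (hypothetical model)."""
    name: str
    priority: int          # lower number is evaluated first
    source: str            # CIDR prefix, or "*" for any source
    port: Optional[int]    # destination port, or None for any port
    allow: bool


def evaluate_inbound(rules, source_ip, port):
    """Return (allowed, rule_name) for the first rule that matches."""
    addr = ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        source_ok = rule.source == "*" or addr in ip_network(rule.source)
        port_ok = rule.port is None or rule.port == port
        if source_ok and port_ok:
            return rule.allow, rule.name
    # Real NSGs end with a default deny-all inbound rule,
    # so traffic that matches nothing is treated as denied here.
    return False, "implicit-deny"


# Hypothetical rule set using documentation-only IP ranges.
rules = [
    Rule("allow-ssh-office", 300, "203.0.113.0/24", 22, True),
    Rule("deny-all", 4096, "*", None, False),
]

print(evaluate_inbound(rules, "203.0.113.10", 22))  # (True, 'allow-ssh-office')
print(evaluate_inbound(rules, "198.51.100.7", 22))  # (False, 'deny-all')
```

If a connection is denied unexpectedly, the rule name returned by a model like this points at which rule to inspect first. The portal's effective security rules view and IP Flow Verify give the authoritative answer.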
2. Application Unreachable
Application Service Status: Verify that the application service is running on the VM.
Firewall on VM: Check Windows Firewall or the Linux host firewall (for example iptables, firewalld, or ufw) on the VM itself to ensure the application's port is open.
Load Balancer/Application Gateway: If applicable, check the health probes and backend pool configurations.
Azure Load Balancer Rules: Ensure load balancing rules are correctly configured to forward traffic to the VM.
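A quick way to narrow down where an application becomes unreachable is to test the TCP port from several vantage points: on the VM itself, from another VM in the same virtual network, and from outside. The following is a minimal sketch using only Python's standard library. The function name `is_port_open` is an illustrative choice, and the demo targets a throwaway local listener so that it runs without a real VM.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Demo against a temporary local listener instead of a real VM.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))      # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

print(is_port_open("127.0.0.1", demo_port))   # listener is accepting connections
listener.close()
print(is_port_open("127.0.0.1", demo_port))   # nothing is listening any more
```

If the check succeeds on the VM against localhost but fails from outside, the service is running and the block is likely in the host firewall, NSG, or load balancer path. If it fails locally too, look at the application service first.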
Performance Issues
Slow VM performance can be due to resource exhaustion, inefficient configurations, or application-specific problems.
1. High CPU Utilization
Task Manager/htop: Identify the processes consuming the most CPU.
Resource Monitoring: Use Azure Monitor to track CPU usage over time and identify spikes.
Check for Runaway Processes: A misbehaving application or an infinite loop can cause sustained high CPU.
Consider VM Size: If the workload consistently exceeds the VM's capacity, consider resizing to a larger VM SKU.
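When deciding between a runaway process and an undersized VM, it helps to separate brief spikes from sustained load. The sketch below assumes you have already exported CPU percentage samples at a fixed interval (for example from Azure Monitor). The function name, the 90% threshold, and the run length of five samples are illustrative assumptions, not recommended values.

```python
def sustained_high_cpu(samples, threshold=90.0, min_run=5):
    """Find runs where CPU stayed at or above threshold.

    Returns a list of (start_index, end_index) pairs, end inclusive,
    for runs lasting at least min_run consecutive samples.
    """
    runs = []
    start = None
    for i, value in enumerate(samples):
        if value >= threshold:
            if start is None:
                start = i          # a new run begins
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None           # run ended (or never started)
    # Handle a run that continues to the end of the data.
    if start is not None and len(samples) - start >= min_run:
        runs.append((start, len(samples) - 1))
    return runs


# Hypothetical samples: one short spike (index 2) and one sustained period.
cpu = [35, 40, 97, 38, 92, 95, 99, 96, 94, 93, 41, 37]
print(sustained_high_cpu(cpu))  # [(4, 9)]
```

A single short spike is ignored, while the six-sample run is reported. Frequent sustained runs under normal load suggest that resizing is worth considering. Isolated ones point more toward a specific process.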
2. Slow Disk I/O
Disk Performance Metrics: Monitor disk read and write operations per second (IOPS) and throughput in Azure Monitor, and compare them with the limits of both the disk and the VM size.
Disk Type: Ensure you're using appropriate disk types (e.g., Premium SSDs for I/O intensive workloads).
Application I/O Patterns: Analyze how your applications are accessing disks. Some applications are not optimized for cloud storage.
Disk Caching: Review the host caching setting (None, ReadOnly, or ReadWrite) for each managed disk and confirm that it suits the workload.
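One way to judge whether slow I/O comes from hitting provisioned limits is to compare observed peaks with those limits. The sketch below is a rough heuristic under stated assumptions. The function name, the 90% warning cutoff, and the example numbers are all hypothetical. Look up the actual IOPS and throughput limits for your disk tier and VM size in the Azure documentation.

```python
def disk_saturation(observed_iops, observed_mbps, limit_iops, limit_mbps, warn_at=0.9):
    """Compare observed peaks with a disk's provisioned limits.

    Returns utilization ratios and a flag that is True when either ratio
    reaches warn_at (0.9 by default, an arbitrary illustrative cutoff).
    """
    iops_ratio = observed_iops / limit_iops
    mbps_ratio = observed_mbps / limit_mbps
    return {
        "iops_utilization": round(iops_ratio, 2),
        "throughput_utilization": round(mbps_ratio, 2),
        "likely_throttled": max(iops_ratio, mbps_ratio) >= warn_at,
    }


# Hypothetical example: peaks of 4900 IOPS and 120 MB/s against
# assumed limits of 5000 IOPS and 200 MB/s.
print(disk_saturation(4900, 120, 5000, 200))
# {'iops_utilization': 0.98, 'throughput_utilization': 0.6, 'likely_throttled': True}
```

The same comparison is worth running against the VM size's own caps, since a VM can throttle I/O even when each individual disk is below its limit.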
3. Memory Leaks
Task Manager/top: Monitor memory usage and identify processes that are steadily increasing their memory footprint.
Application Logs: Check application logs for errors that might indicate memory allocation issues.
Restarting Services: Restarting the affected application service frees the leaked memory temporarily, but the leak returns until the underlying bug is fixed.
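A "steadily increasing" footprint can be made a little more concrete by fitting a trend line to periodic memory samples for one process. The sketch below is an illustrative heuristic, not a leak detector. The function names, the 1 MB-per-sample slope threshold, and the 80% "mostly rising" rule are assumptions chosen for the example.

```python
def growth_per_sample(values):
    """Least-squares slope of values against their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(values, min_slope_mb=1.0):
    """True if memory trends upward and most consecutive samples do not drop."""
    slope = growth_per_sample(values)
    non_decreasing = sum(b >= a for a, b in zip(values, values[1:]))
    return slope >= min_slope_mb and non_decreasing >= 0.8 * (len(values) - 1)


# Hypothetical resident memory samples in MB, taken at a fixed interval.
leaking = [512, 530, 549, 571, 590, 612]
bursty = [512, 640, 505, 630, 510, 520]

print(looks_like_leak(leaking))  # True
print(looks_like_leak(bursty))   # False
```

Requiring both a positive slope and mostly non-decreasing samples helps distinguish gradual growth from a workload that allocates and releases memory in bursts. Caches that warm up can still look like leaks over short windows, so longer observation periods give a more reliable signal.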
Boot Issues
When a VM fails to start or gets stuck during the boot process, boot diagnostics are your primary tool.
1. VM Fails to Boot
Boot Diagnostics: Start with the screenshot and serial log in Boot Diagnostics. These often show the exact error message or the point of failure.
OS Disk Health: Ensure the OS disk is not corrupted or unhealthy. You might need to attach it to another VM for inspection.
BCD Store (Windows): For Windows VMs, a damaged or misconfigured Boot Configuration Data (BCD) store can prevent booting. You can inspect and repair it with the `bcdedit` tool, typically after attaching the OS disk to a repair VM.
Kernel Panics (Linux): Linux VMs might experience kernel panics. The console output will usually indicate this.
Important: If you suspect OS disk corruption, avoid modifying the disk directly without a backup or a clear understanding of the process. It is often safer to restore the VM from a backup or to create a new VM from a known-good snapshot, if one exists.
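Once the serial log has been downloaded from Boot Diagnostics, a simple pattern scan can surface the lines worth reading first. The sketch below is a minimal illustration. The pattern list is a small, assumed sample of common Linux boot failure messages and is far from exhaustive, and the log text in the demo is synthetic rather than captured from a real VM.

```python
import re

# Illustrative patterns only; real messages vary by distribution and kernel version.
PATTERNS = {
    "kernel_panic": re.compile(r"Kernel panic - not syncing", re.IGNORECASE),
    "root_fs_not_found": re.compile(r"Unable to mount root fs", re.IGNORECASE),
    "failed_dependency": re.compile(r"Dependency failed for", re.IGNORECASE),
    "emergency_mode": re.compile(r"emergency mode", re.IGNORECASE),
}


def scan_serial_log(text):
    """Return (line_number, label, line) for each line matching a pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.strip()))
    return hits


# Synthetic two-line sample for demonstration purposes.
sample = "\n".join([
    "booting kernel",
    "Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)",
])

for number, label, line in scan_serial_log(sample):
    print(number, label)
```

A single line can match more than one label, as the sample shows. The labels are only a starting point for looking up the exact message in the official documentation.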
2. Stuck During Boot
Review Boot Sequence: Analyze the boot diagnostics to see what phase the VM is getting stuck in. Is it before the OS loads, during OS loading, or after login prompts?
Windows Services: For Windows, a problematic service that starts automatically at boot can cause the VM to hang. You might need to boot into Safe Mode or disable the service offline, for example by attaching the OS disk to a repair VM.
Linux Services/Mount Points: In Linux, faulty `/etc/fstab` entries (e.g., an entry for a disk that no longer exists) or failing systemd services can cause boot delays or hangs. Adding the `nofail` option to non-critical mounts lets the boot continue when a disk is missing.
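The `/etc/fstab` checks described above can be partly automated once you have a copy of the file, for example from an OS disk attached to a repair VM. The sketch below is a heuristic under stated assumptions. The function name and the skip list of essential mount points are illustrative choices, and the sample file contents are hypothetical.

```python
ESSENTIAL_MOUNTS = {"/", "/boot", "/boot/efi"}  # assumed to be required for boot


def risky_fstab_entries(fstab_text):
    """Return (line_number, mount_point, reason) for entries that may hang boot."""
    findings = []
    for number, raw in enumerate(fstab_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        if len(fields) < 4:
            findings.append((number, None, "fewer than 4 fields"))
            continue
        device, mount_point, fs_type, options = fields[:4]
        if mount_point in ESSENTIAL_MOUNTS or fs_type == "swap":
            continue
        if "nofail" not in options.split(","):
            findings.append((number, mount_point, "missing nofail"))
        if device.startswith("/dev/sd"):
            findings.append((number, mount_point, "device name may change; prefer UUID"))
    return findings


# Hypothetical fstab contents.
sample_fstab = """\
# sample entries
UUID=aaaa-1111 / ext4 defaults 0 1
/dev/sdc1 /data ext4 defaults 0 2
UUID=bbbb-2222 /logs xfs defaults,nofail 0 2
"""

for finding in risky_fstab_entries(sample_fstab):
    print(finding)
```

In this sample only the `/data` entry on line 3 is flagged, once for the missing `nofail` option and once for using a device name, which can change between boots on Azure. Treat the findings as prompts for review rather than as definitive errors.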
General Troubleshooting Tools and Techniques
Azure Monitor: Essential for understanding performance trends, resource utilization, and setting up alerts.
Azure Network Watcher: Provides tools like Connection Troubleshoot, IP Flow Verify, and Packet Capture to diagnose network issues.
VM Boot Diagnostics: Crucial for any VM that won't boot or is inaccessible.
Run Command: Allows you to execute scripts on your VM remotely, useful for tasks like checking service status or collecting logs.
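Serial Console: Offers text-based access to the VM's serial console independent of the VM's network state, which is useful for troubleshooting boot issues or connectivity problems when other methods fail. It requires boot diagnostics to be enabled on the VM.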
Managed Disk Snapshots: Always take snapshots of your OS and data disks before performing significant troubleshooting steps that involve modifying disk configurations.
Note: For complex or persistent issues, consulting the official Microsoft Azure documentation for specific error codes or scenarios is highly recommended.