Azure Virtual Machine Troubleshooting Guide
Common VM Boot Issues
Virtual machines can fail to boot for several reasons, including OS corruption, bootloader problems, or hardware configuration errors.
Issue: VM Stuck in Boot Loop
If your VM is repeatedly restarting during the boot process, it often indicates an operating system issue.
Check the boot diagnostics screenshots in the Azure portal to identify the exact stage where the boot process fails. The screenshot may show an error message or a blue screen.
For Windows VMs, attach the OS disk to a troubleshooting VM and use command-line tools such as bootrec or bcdedit to repair the boot records and the BCD store.
bootrec /fixmbr
bootrec /fixboot
bootrec /rebuildbcd
Use the System File Checker (SFC) utility in offline mode to scan the attached disk for corrupted system files and repair them. Replace F: with the drive letter assigned to the attached OS disk.
sfc /scannow /offbootdir=F:\ /offwindir=F:\Windows
For Linux VMs, run fsck on the OS disk's file systems or check the boot loader configuration (e.g., GRUB).
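When the OS disk is attached to a troubleshooting VM, it typically mounts under a drive letter other than C:, so offline commands need to point at that letter. The following is a minimal sketch of a helper that builds the offline SFC command line for a given drive letter. The function name build_offline_sfc_command is hypothetical and only formats a string; it does not run anything.

```python
def build_offline_sfc_command(drive_letter: str) -> str:
    """Return the offline sfc command line for a disk mounted at drive_letter.

    Hypothetical helper for illustration: it only formats the command string.
    """
    # Accept inputs such as "F", "f:" or "F:\" and normalize to one letter.
    letter = drive_letter.strip().rstrip(":\\").upper()
    if len(letter) != 1 or not (letter.isascii() and letter.isalpha()):
        raise ValueError(f"expected a single drive letter, got {drive_letter!r}")
    root = f"{letter}:\\"
    # Both offline switches must point at the attached disk, not the
    # troubleshooting VM's own system drive.
    return f"sfc /scannow /offbootdir={root} /offwindir={root}Windows"


print(build_offline_sfc_command("F"))
# sfc /scannow /offbootdir=F:\ /offwindir=F:\Windows
```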
Connectivity Problems
Troubleshooting VM connectivity involves checking network configurations, security rules, and service availability.
Issue: Cannot Connect via RDP/SSH
This is a frequent issue stemming from incorrect network security group (NSG) rules, firewall settings, or the RDP/SSH service not running.
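Before working through the checks, it can help to confirm from your client machine how the connection attempt fails. The sketch below, using only the Python standard library, probes a TCP port and reports whether the connection opened, was refused, or timed out. The function name probe_tcp is an assumption for illustration. As a rough guide, a timeout often points to traffic being dropped along the way (for example by an NSG or firewall rule), while a refusal often means the host was reached but nothing is listening on that port.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connection and return 'open', 'refused', 'timeout' or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but no service accepted the connection.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer within the timeout; packets may be dropped in transit.
        return "timeout"
    except OSError:
        # Name resolution failures, unreachable networks, and similar errors.
        return "error"
```

For example, probe_tcp("203.0.113.10", 3389) checks RDP and probe_tcp("203.0.113.10", 22) checks SSH, where 203.0.113.10 is a placeholder address to replace with your VM's public IP.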
Ensure that inbound NSG rules allow traffic on port 3389 (RDP) or 22 (SSH) from your IP address or network range to the VM's network interface.
Verify that the operating system's firewall (Windows Firewall or iptables/firewalld on Linux) is configured to allow RDP/SSH connections.
Use Azure Network Watcher, specifically its IP flow verify and connection troubleshoot features, to determine whether an NSG rule or another network issue is blocking the connection.
If possible, access the VM via the serial console or by attaching the disk to another VM to check if the RDP/SSH service is running and healthy.
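NSG rules are evaluated in priority order, lowest number first, and evaluation stops at the first rule that matches. The sketch below is a simplified model of that behavior, not the Azure API: the InboundRule class and evaluate_inbound function are hypothetical, and the model ignores protocols, port ranges, service tags, destinations, and the default allow rules, falling back to a deny when nothing matches. It shows how a higher-priority deny rule can shadow an allow rule that otherwise looks correct.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class InboundRule:
    name: str
    priority: int  # lower number is evaluated first
    access: str    # "Allow" or "Deny"
    port: str      # a single port such as "3389", or "*" for any port
    source: str    # a CIDR prefix such as "198.51.100.0/24", or "*" for any


def rule_matches(rule: InboundRule, source_ip: str, port: int) -> bool:
    port_ok = rule.port == "*" or rule.port == str(port)
    source_ok = rule.source == "*" or ip_address(source_ip) in ip_network(rule.source)
    return port_ok and source_ok


def evaluate_inbound(rules, source_ip: str, port: int):
    """Return (access, rule_name) for the first matching rule in priority order."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule_matches(rule, source_ip, port):
            return rule.access, rule.name
    # Simplification: treat "no match" as the default inbound deny.
    return "Deny", "default deny"


rules = [
    InboundRule("allow-rdp-office", 300, "Allow", "3389", "198.51.100.0/24"),
    InboundRule("deny-all-rdp", 200, "Deny", "3389", "*"),
]
print(evaluate_inbound(rules, "198.51.100.7", 3389))
# ('Deny', 'deny-all-rdp')
```

In this example the allow rule never takes effect because the deny rule has the lower priority number; giving the allow rule a priority below 200 would let the office range through.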
Performance Degradation
Slow VM performance can be caused by resource contention, disk I/O bottlenecks, or network latency.
Issue: High CPU Usage
Investigate processes consuming excessive CPU resources.
Use Azure Monitor to track CPU utilization over time. Identify peak usage periods and correlate them with specific events.
Connect to the VM and use Task Manager (Windows) or top/htop (Linux) to identify resource-hungry processes. Consider whether these are expected workloads.
Run antivirus and anti-malware scans. Malicious software can consume significant CPU resources.
If the workload legitimately requires more CPU, consider resizing the VM to a size with more vCPUs or scaling out by adding more VMs.
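To separate brief spikes from sustained load when reviewing CPU utilization over time, one option is to look for runs of consecutive high samples. The sketch below assumes you already have a list of CPU percentages at a fixed interval (for example, exported from your monitoring tool). The function name sustained_high_cpu, the 80 percent threshold, and the minimum run length are illustrative assumptions, not recommended values.

```python
def sustained_high_cpu(samples, threshold=80.0, min_run=3):
    """Return (start, end) index pairs for runs of at least min_run
    consecutive samples at or above threshold. The end index is inclusive."""
    runs = []
    start = None
    for i, value in enumerate(samples):
        if value >= threshold:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    # Close a run that continues to the end of the series.
    if start is not None and len(samples) - start >= min_run:
        runs.append((start, len(samples) - 1))
    return runs


cpu = [35, 42, 91, 95, 97, 40, 88, 30, 85, 90, 92, 96]
print(sustained_high_cpu(cpu))
# [(2, 4), (8, 11)]
```

The single sample of 88 at index 6 is ignored because it does not persist, while the two longer runs are candidates to correlate with deployments, scheduled jobs, or traffic peaks.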
Issue: Slow Disk Performance
Disk I/O can be a bottleneck, especially for I/O-intensive applications.
Check Azure Monitor for disk read/write operations per second (IOPS) and throughput. Compare these against the limits of your VM size and disk type (e.g., Standard HDD, Standard SSD, Premium SSD).
Use performance monitoring tools within the OS to pinpoint applications or services causing excessive disk activity.
Consider upgrading to Premium SSD or moving to a managed disk with a higher performance tier. For very high IOPS requirements, consider Ultra Disk.
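Because disk performance is capped by both the VM size and the disk itself, the lower of the two limits is the one that matters when comparing observed metrics. The sketch below illustrates that comparison for a single disk. The Limits class, the function names, the 90 percent warning level, and all numbers are illustrative assumptions; look up the published limits for your actual VM size and disk type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    iops: float
    throughput_mb_s: float


def effective_limits(vm: Limits, disk: Limits) -> Limits:
    """The usable ceiling is the lower of the VM-level and disk-level caps."""
    return Limits(
        iops=min(vm.iops, disk.iops),
        throughput_mb_s=min(vm.throughput_mb_s, disk.throughput_mb_s),
    )


def bottlenecks(observed: Limits, vm: Limits, disk: Limits, warn_at: float = 0.9):
    """Return the metric names whose observed value is at or above
    warn_at (a fraction) of the effective limit."""
    limit = effective_limits(vm, disk)
    hot = []
    if observed.iops / limit.iops >= warn_at:
        hot.append("iops")
    if observed.throughput_mb_s / limit.throughput_mb_s >= warn_at:
        hot.append("throughput")
    return hot


# Placeholder values for illustration only.
vm_limits = Limits(iops=6400, throughput_mb_s=96)
disk_limits = Limits(iops=5000, throughput_mb_s=200)
observed_peak = Limits(iops=4800, throughput_mb_s=60)

print(effective_limits(vm_limits, disk_limits))
# Limits(iops=5000, throughput_mb_s=96)
print(bottlenecks(observed_peak, vm_limits, disk_limits))
# ['iops']
```

In this example the disk caps IOPS while the VM caps throughput, so a faster disk would relieve the IOPS pressure but would not raise the throughput ceiling without also resizing the VM.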