Azure Virtual Machine Troubleshooting Guide

Common VM Boot Issues

Virtual machines can fail to boot because of OS corruption, bootloader problems, or hardware configuration errors.

Issue: VM Stuck in Boot Loop

A VM that restarts repeatedly during the boot process often has an operating system issue.

Step 1: Access Boot Diagnostics.

Check the boot diagnostics screenshot in the Azure portal to identify the stage at which the boot process fails. The screenshot may show an error message or a blue screen.

Step 2: Repair Boot Configuration Data (BCD).

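For Windows VMs, attach the OS disk to a troubleshooting VM and use command-line tools such as bcdedit (with its /store option pointed at the attached disk's BCD file) to repair the BCD store. The bootrec commands below are normally run from the Windows Recovery Environment.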

bootrec /fixmbr
bootrec /fixboot
bootrec /rebuildbcd

Step 3: Check for Corrupted System Files.

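Use the System File Checker (SFC) utility to scan for and repair corrupted system files on the attached disk. Point the offline options at the drive letter assigned to the attached OS disk (F: in this example), not at the troubleshooting VM's own C: drive.

sfc /scannow /offbootdir=F:\ /offwindir=F:\Windows

Tip: For Linux VMs, run fsck against the attached (unmounted) OS disk, or mount the disk and `chroot` into it to check the boot loader configuration (e.g., GRUB).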

Connectivity Problems

Troubleshooting VM connectivity involves checking network configurations, security rules, and service availability.

Issue: Cannot Connect via RDP/SSH

This common issue usually stems from incorrect network security group (NSG) rules, firewall settings, or an RDP/SSH service that is not running.

Step 1: Verify NSG Rules.

Ensure that inbound NSG rules allow traffic on port 3389 (RDP) or 22 (SSH) from your IP address or network range to the VM's network interface.
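NSG rules are evaluated in priority order (lowest number first), and the first matching rule decides the outcome. The following Python sketch models that behavior so you can reason about why a connection is allowed or denied. It is a simplified illustration, not the Azure implementation: the Rule class, the function names, and the sample rules are hypothetical, and real NSG rules also match on protocol, destination, and service tags.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass
class Rule:
    """Simplified, hypothetical stand-in for an inbound NSG rule."""
    name: str
    priority: int   # lower number = evaluated first
    source: str     # CIDR prefix, or "*" for any source
    port: str       # single port, "low-high" range, or "*"
    access: str     # "Allow" or "Deny"


def source_matches(rule_source: str, client_ip: str) -> bool:
    if rule_source == "*":
        return True
    return ip_address(client_ip) in ip_network(rule_source, strict=False)


def port_matches(rule_port: str, port: int) -> bool:
    if rule_port == "*":
        return True
    if "-" in rule_port:
        low, high = rule_port.split("-")
        return int(low) <= port <= int(high)
    return int(rule_port) == port


def evaluate_inbound(rules, client_ip, port):
    """Return (access, rule name) for the first matching rule by priority."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if source_matches(rule.source, client_ip) and port_matches(rule.port, port):
            return rule.access, rule.name
    return "Deny", "no matching rule"


# Sample rules for illustration only; the addresses are documentation ranges.
rules = [
    Rule("AllowSshFromOffice", 300, "203.0.113.0/24", "22", "Allow"),
    Rule("DenyAllInbound", 65500, "*", "*", "Deny"),
]
print(evaluate_inbound(rules, "203.0.113.10", 22))
print(evaluate_inbound(rules, "198.51.100.7", 22))
```

In this sample, a client inside 203.0.113.0/24 matches the allow rule first, while any other source falls through to the catch-all deny rule.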

Step 2: Check VM Firewall.

Verify that the operating system's firewall (Windows Firewall or iptables/firewalld on Linux) is configured to allow RDP/SSH connections.

Step 3: Use Network Watcher.

Azure's Network Watcher tool, specifically the IP flow verify and connection troubleshoot features, can quickly diagnose NSG and connectivity issues.
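As a quick client-side complement to Network Watcher, you can test whether a TCP connection to the RDP or SSH port succeeds from your own machine. The sketch below uses only the Python standard library; the function name tcp_port_open is an illustrative choice, not part of any Azure tool.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, tcp_port_open("<vm-public-ip>", 22) returning True suggests that the network path and the listening service are both working. A False result does not by itself say whether an NSG rule, the OS firewall, or a stopped service is responsible, so continue with the other steps.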

Step 4: Confirm RDP/SSH Service Status.

If possible, access the VM via the serial console or by attaching the disk to another VM to check if the RDP/SSH service is running and healthy.

Performance Degradation

Slow VM performance can be caused by resource contention, disk I/O bottlenecks, or network latency.

Issue: High CPU Usage

Investigate processes consuming excessive CPU resources.

Step 1: Monitor CPU Metrics.

Use Azure Monitor to track CPU utilization over time. Identify peak usage periods and correlate them with specific events.
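Once CPU samples have been exported from your monitoring tool, a short script can pick out sustained peaks rather than momentary spikes. The following Python sketch assumes the data is available as (timestamp, percent) pairs; the function name, the 80% threshold, the minimum run length, and the sample values are all illustrative assumptions.

```python
from datetime import datetime, timedelta


def sustained_peaks(samples, threshold=80.0, min_samples=3):
    """Return (start, end, max_cpu) for runs of at least `min_samples`
    consecutive samples at or above `threshold` percent CPU."""
    peaks, run = [], []
    for ts, cpu in samples:
        if cpu >= threshold:
            run.append((ts, cpu))
            continue
        if len(run) >= min_samples:
            peaks.append((run[0][0], run[-1][0], max(c for _, c in run)))
        run = []
    if len(run) >= min_samples:  # close a run that reaches the end of the data
        peaks.append((run[0][0], run[-1][0], max(c for _, c in run)))
    return peaks


# Made-up one-minute samples for illustration only.
start = datetime(2024, 1, 1, 9, 0)
cpu_values = [20, 35, 85, 92, 97, 88, 40, 30, 91, 25]
samples = [(start + timedelta(minutes=i), v) for i, v in enumerate(cpu_values)]

for begin, end, peak in sustained_peaks(samples):
    print(f"{begin:%H:%M}-{end:%H:%M} peak {peak}%")
```

The start and end times of each reported window can then be compared against deployments, scheduled jobs, or traffic changes that occurred in the same period.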

Step 2: Analyze Processes.

Connect to the VM and use Task Manager (Windows) or top/htop (Linux) to identify resource-hungry processes. Consider whether these are expected workloads.
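On Linux, the same information can be captured non-interactively, for example by saving the output of ps -eo pid,pcpu,comm and ranking it afterwards. The Python sketch below parses text in that format; the function name and the sample text are illustrative, and the sample is not output from a real VM. Note that the %CPU column reported by ps is typically averaged over the lifetime of the process, so it can differ from the instantaneous figures shown by top.

```python
def top_cpu_processes(ps_output: str, limit: int = 3):
    """Parse `ps -eo pid,pcpu,comm` text and return the busiest processes."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        pid, pcpu, command = line.split(None, 2)
        rows.append((int(pid), float(pcpu), command.strip()))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]


# Illustrative sample, not real output from any VM.
sample = """\
  PID %CPU COMMAND
    1  0.0 systemd
  812 87.5 java
  944  3.2 sshd
 1203 42.1 python3
"""

for pid, pcpu, command in top_cpu_processes(sample, limit=2):
    print(pid, pcpu, command)
```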

Step 3: Check for Malware.

Run antivirus and anti-malware scans. Malicious software can consume significant CPU resources.

Step 4: Scale Up or Out.

If the workload legitimately requires more CPU, consider resizing the VM to a size with more vCPUs (scaling up) or adding more VMs (scaling out).

Issue: Slow Disk Performance

Disk I/O can be a bottleneck, especially for I/O-intensive applications.

Step 1: Monitor Disk Metrics.

Check Azure Monitor for disk read/write operations per second (IOPS) and throughput. Compare these against the limits of your VM size and disk type (e.g., Standard HDD, Standard SSD, Premium SSD).
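Because both the VM size and the disk type impose limits, the effective ceiling is the lower of the two for each metric. The following Python sketch shows that comparison. The function name, the 90% warning ratio, and all of the numbers are placeholders for illustration; look up the actual limits for your VM size and disk in the Azure documentation.

```python
def disk_headroom(observed_iops, observed_mbps, disk_limit, vm_limit, warn_at=0.9):
    """Compare observed disk load with the tighter of the disk and VM limits.

    Each limit is an (iops, mbps) tuple. Returns the effective limits, the
    utilization ratios, and whether either ratio reaches `warn_at`.
    """
    effective_iops = min(disk_limit[0], vm_limit[0])
    effective_mbps = min(disk_limit[1], vm_limit[1])
    iops_util = observed_iops / effective_iops
    mbps_util = observed_mbps / effective_mbps
    return {
        "effective_iops": effective_iops,
        "effective_mbps": effective_mbps,
        "iops_util": iops_util,
        "mbps_util": mbps_util,
        "near_limit": max(iops_util, mbps_util) >= warn_at,
    }


# Placeholder figures: (IOPS, MB/s) for the disk and for the VM size.
result = disk_headroom(
    observed_iops=4800,
    observed_mbps=60,
    disk_limit=(5000, 200),
    vm_limit=(6400, 96),
)
print(result)
```

In this example the disk caps IOPS while the VM size caps throughput, which is why both sets of limits need to be checked before deciding what to upgrade.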

Step 2: Identify High-IO Processes.

Use performance monitoring tools within the OS to pinpoint applications or services causing excessive disk activity.

Step 3: Optimize Disk Configuration.

Consider upgrading to Premium SSDs or using Azure Managed Disks with higher performance tiers. For very high IOPS needs, consider Ultra Disks.