Virtual Machine Boot Failures
When an Azure Virtual Machine fails to boot, the cause is typically an operating system issue, a boot configuration problem, or a disk error. The following steps help you diagnose and resolve these failures.
Common Causes:
- Corrupted operating system files.
- Incorrect boot order or bootloader configuration.
- Disk I/O errors or full OS disk.
- Unintended changes to system files.
Troubleshooting Steps:
- Accessing Boot Diagnostics:
Azure provides Boot Diagnostics, which captures a screenshot of the VM's console output and the serial log. This is the first place to check for error messages.
Navigate to your VM in the Azure portal, then under "Support + troubleshooting", select "Boot diagnostics".
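You can also enable boot diagnostics and pull the serial log from the Azure CLI. A minimal sketch, with placeholder resource group and VM names:

```bash
# Enable boot diagnostics (recent CLI versions default to a managed storage account;
# older versions also require --storage <storage-account-uri>).
az vm boot-diagnostics enable --resource-group myResourceGroup --name myVM

# Download the serial log captured by boot diagnostics.
az vm boot-diagnostics get-boot-log --resource-group myResourceGroup --name myVM
```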
- Using the Serial Console:
The Serial Console allows you to interact with your VM directly, just like you would with a physical machine's console. You can use it to access command prompts or PowerShell to investigate boot issues.
Ensure the serial console is enabled for your VM. You can then connect via the Azure portal under "Support + troubleshooting" > "Serial console".
Tip:
For Windows, you can use the Serial Console to enable or disable a driver or service that might be preventing boot. For Linux, you can edit boot parameters or access shell prompts.
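The Serial Console can also be opened from the Azure CLI, assuming the serial-console extension is installed and boot diagnostics are enabled; names below are placeholders.

```bash
# Add the Serial Console extension (one-time step).
az extension add --name serial-console

# Open an interactive serial session to the VM.
az serial-console connect --resource-group myResourceGroup --name myVM
```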
- Attaching the OS Disk to Another VM:
If direct console access doesn't reveal the issue, you can detach the OS disk from the problematic VM and attach it as a data disk to a healthy VM. This lets you inspect the file system and check event logs (Windows) or system logs (Linux).
Steps:
- Stop (deallocate) the affected VM.
- Detach the OS disk.
- Create or use an existing troubleshooting VM.
- Attach the detached OS disk as a data disk to the troubleshooting VM.
- Access the disk and perform diagnostics.
- Detach the disk from the troubleshooting VM and reattach it as the OS disk on the original VM.
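One way to automate this cycle is the vm-repair Azure CLI extension, which creates the rescue VM and attaches a copy of the broken VM's OS disk for you. A sketch assuming that extension, with placeholder names and credentials:

```bash
# Install the repair extension (one-time step).
az extension add --name vm-repair

# Create a rescue VM with a copy of the broken VM's OS disk attached as a data disk.
az vm repair create --resource-group myResourceGroup --name brokenVM \
  --repair-username rescueadmin --repair-password 'ChangeMe123!' --verbose

# ...connect to the rescue VM, inspect logs, and fix the disk...

# Swap the repaired disk back onto the original VM and remove the rescue resources.
az vm repair restore --resource-group myResourceGroup --name brokenVM --verbose
```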
Network Connectivity Issues
Problems reaching your Azure VM or services hosted on it can stem from network security rules, virtual network configurations, or DNS issues.
Common Causes:
- Misconfigured Network Security Groups (NSGs).
- Incorrect subnet routing or firewall rules.
- DNS resolution failures.
- Issues with load balancers or application gateways.
- On-premises network connectivity problems (VPN/ExpressRoute).
Troubleshooting Steps:
- Verify Network Security Groups (NSGs):
NSGs control inbound and outbound traffic to Azure resources. Ensure that the necessary ports are open for your application or RDP/SSH access.
Check NSGs associated with the VM's network interface and the subnet it resides in. Use the Network Watcher's Connection Troubleshoot tool for automated checks.
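To see the combined effect of NIC-level and subnet-level NSGs, you can list the effective security rules on the VM's network interface. The NIC name below is a placeholder, and the VM must be running:

```bash
# Show the aggregated NSG rules actually applied to the NIC.
az network nic list-effective-nsg --resource-group myResourceGroup --name myVMNic
```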
- Use IP Flow Verify:
This Network Watcher feature helps determine if traffic is allowed or denied to or from a VM based on NSG rules.
In the Azure portal, open Network Watcher and select "IP flow verify". Choose your VM, then specify the source and destination IP addresses, ports, and protocol to test.
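The same check is available from the Azure CLI; the addresses and ports below are examples for testing inbound RDP.

```bash
# Ask Network Watcher whether inbound TCP traffic to port 3389 would be allowed,
# and which NSG rule makes that decision.
az network watcher test-ip-flow --resource-group myResourceGroup --vm myVM \
  --direction Inbound --protocol TCP \
  --local 10.0.0.4:3389 --remote 203.0.113.10:60000
```

The output names the matching NSG rule, which tells you exactly what to change.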
- Use Network Watcher's Packet Capture:
For deeper investigation, packet capture can help you see the actual network traffic arriving at your VM and any responses.
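A minimal capture session from the CLI might look like the following; the storage account name is a placeholder, and the Network Watcher Agent VM extension must be installed on the VM.

```bash
# Start a capture; the .cap file is written to the storage account.
az network watcher packet-capture create --resource-group myResourceGroup \
  --vm myVM --name myCapture --storage-account mydiagstorage

# Stop the capture once you have reproduced the problem (location = the VM's region).
az network watcher packet-capture stop --name myCapture --location eastus
```

Download the resulting .cap file from the storage account and open it in a tool such as Wireshark.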
- Check DNS Resolution:
Ensure your VM can resolve public and private DNS names. You can test this from within the VM using tools like nslookup or dig. If you are using Azure DNS, verify your DNS zone configuration.
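From inside the VM, a few quick queries usually isolate DNS problems; the names below are placeholders.

```bash
# Public name resolution through the VM's configured resolver.
nslookup www.example.com

# A private name, e.g. one hosted in an Azure Private DNS zone.
dig myapp.internal.example.com

# Query the Azure-provided resolver directly (168.63.129.16 is Azure's virtual DNS IP)
# to separate resolver problems from zone configuration problems.
dig @168.63.129.16 www.example.com
```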
- Examine Route Tables:
Verify that traffic has a valid route to its destination. Check the route table associated with the VM's subnet.
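The effective routes on the VM's NIC show which route (system, custom, or BGP) wins for each prefix; the NIC name is a placeholder and the VM must be running.

```bash
az network nic show-effective-route-table --resource-group myResourceGroup \
  --name myVMNic --output table
```

Look for prefixes whose next hop type is "None" (traffic is dropped) or an unexpected virtual appliance.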
Important:
Remember to test connectivity from the source that is experiencing the issue. For example, if an on-premises client cannot connect, test from an on-premises machine.
Disk Performance Problems
Slow disk I/O can significantly impact application performance. This section covers common causes and how to diagnose them.
Common Causes:
- Under-provisioned disk SKU (e.g., Standard HDD vs. Premium SSD).
- Disk throttling due to exceeding IOPS or throughput limits.
- High disk queue length.
- Resource contention on the VM itself.
- Inefficient application I/O patterns.
Troubleshooting Steps:
- Monitor Disk Metrics:
Azure Monitor provides key disk performance metrics:
- IOPS Read/Write: the number of read/write operations per second.
- Throughput (Bytes/sec) Read/Write: the data transfer rate.
- Disk Queue Length: the number of I/O operations waiting to be processed. A consistently high queue length indicates the disk cannot keep up.
Compare these metrics against the performance targets of your chosen disk SKU.
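These counters can be queried with the Azure CLI as well as viewed in the portal. A sketch, assuming the standard VM platform metric names and a placeholder resource ID:

```bash
# Per-minute data disk IOPS and queue depth over the default time window.
az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/myResourceGroup/providers/Microsoft.Compute/virtualMachines/myVM" \
  --metric "Data Disk Read Operations/Sec" "Data Disk Write Operations/Sec" "Data Disk Queue Depth" \
  --interval PT1M --output table
```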
- Check Disk Throttling:
Azure storage has limits on IOPS and throughput. If these limits are exceeded, the disk will be throttled, resulting in reduced performance.
Look for metrics like Transactions and Bandwidth (Bytes/sec) and compare them to the limits of your disk type.
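If your VM exposes the consumed-percentage platform metrics, they indicate throttling directly: values pinned near 100% mean the disk is running at its provisioned limit. A sketch, again with a placeholder resource ID:

```bash
az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/myResourceGroup/providers/Microsoft.Compute/virtualMachines/myVM" \
  --metric "Data Disk IOPS Consumed Percentage" "Data Disk Bandwidth Consumed Percentage" \
  --interval PT1M --output table
```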
- Analyze Queue Length:
A sustained high disk queue length (e.g., > 2 for extended periods) suggests the disk is a bottleneck. Consider upgrading to a higher-performance disk SKU (e.g., from Standard HDD to Premium SSD or Ultra Disk).
- VM Size Considerations:
The VM size itself can influence disk performance, as it dictates the maximum throughput and IOPS that can be pushed to attached disks.
Ensure your VM size supports the performance requirements of your disks.
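The per-size limits are published as VM SKU capabilities. One way to read them from the CLI, assuming the capability names used here (the size and region are examples):

```bash
# Uncached disk IOPS and throughput caps for a given VM size.
az vm list-skus --location eastus --size Standard_D8s_v5 \
  --query "[0].capabilities[?name=='UncachedDiskIOPS' || name=='UncachedDiskBytesPerSecond']" \
  --output table
```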
- Application I/O Analysis:
Use tools like Resource Monitor (Windows) or iotop (Linux) from within the VM to identify which processes are generating the most disk I/O. This can help optimize application behavior.
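On Linux, for example, a couple of commands narrow down the noisy processes (both tools may need to be installed first):

```bash
# Show only processes currently doing I/O, in batch mode, for three samples.
sudo iotop -o -b -n 3

# Per-process disk statistics every 5 seconds, three times (sysstat package).
pidstat -d 5 3
```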
Application-Specific Errors
Troubleshooting errors within your applications running on Azure VMs requires understanding how to access and analyze application logs.
Common Causes:
- Application configuration errors.
- Dependency failures (databases, other services).
- Code bugs.
- Resource exhaustion (CPU, memory).
- Incorrect permissions.
Troubleshooting Steps:
- Review Application Logs:
This is the most crucial step. Locate and examine your application's error logs. These are typically found in directories like /var/log/ (Linux) or C:\inetpub\logs\LogFiles\W3SVC1\ (IIS on Windows), or in custom log locations defined by your application.
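On Linux, a few shell commands are usually enough to surface the relevant entries; /var/log/myapp/app.log is a placeholder for your application's actual log file.

```bash
# Last 100 lines of the system log.
sudo tail -n 100 /var/log/syslog

# Recent errors from the application's own log.
sudo grep -iE "error|exception|fatal" /var/log/myapp/app.log | tail -n 50

# Follow the log live while reproducing the failure.
sudo tail -f /var/log/myapp/app.log
```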
- Examine System Event Logs (Windows):
Windows Event Viewer (Application, System, Security logs) often contains valuable information about application crashes or underlying system issues.
- Check System Logs (Linux):
Use commands like journalctl (for systemd-based systems), dmesg, or files in /var/log/ (e.g., syslog, messages) to diagnose system-level problems affecting your application.
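For example (myapp.service is a placeholder unit name):

```bash
# Errors and worse from the current boot.
journalctl -p err -b

# Recent entries for a specific service.
journalctl -u myapp.service --since "1 hour ago"

# Kernel-level warnings and errors (OOM kills, disk faults, etc.).
dmesg --level=err,warn | tail -n 50
```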
- Monitor VM Resource Utilization:
Check Azure Monitor metrics for CPU, memory, and network usage of your VM. High utilization can cause applications to become unresponsive or error out.
Use Task Manager (Windows) or top/htop (Linux) from within the VM for real-time process analysis.
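A quick in-VM snapshot on Linux, plus the platform-side view of the same VM (the resource ID is a placeholder):

```bash
# Busiest processes, memory, and disk space inside the VM.
top -b -n 1 | head -n 20
free -m
df -h

# CPU and available memory as reported by Azure Monitor.
az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/myResourceGroup/providers/Microsoft.Compute/virtualMachines/myVM" \
  --metric "Percentage CPU" "Available Memory Bytes" --interval PT5M --output table
```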
- Verify Dependencies:
Ensure that any databases, APIs, or other services your application relies on are accessible and functioning correctly. Test connectivity from the VM to these dependencies.
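Simple reachability checks from the VM often settle this quickly; the hostnames and ports below are placeholders for your real dependencies.

```bash
# Can the VM open a TCP connection to the database port?
nc -zv mydb.example.com 5432

# Does the dependent API answer, and with what status code?
curl -sS -o /dev/null -w "%{http_code}\n" https://api.example.com/health

# Rule out DNS as the cause.
nslookup mydb.example.com
```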
Resource Limits and Quotas
Exceeding Azure quotas or resource limits can prevent new deployments or cause existing resources to behave unexpectedly.
Common Causes:
- Exceeding subscription or region-specific quotas (e.g., vCPUs, storage accounts).
- Hitting service limits (e.g., maximum number of disks per VM).
- Insufficient capacity in a region.
Troubleshooting Steps:
- Check Quotas in the Azure Portal:
Navigate to your subscription in the Azure portal and select "Usage + quotas". This view shows your current consumption and limits for various resources across regions.
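The same information is available from the Azure CLI, which is convenient for scripting regular checks; the region below is an example.

```bash
# Compute usage versus limits (vCPU families, total cores, etc.) in one region.
az vm list-usage --location eastus --output table

# Networking usage versus limits in the same region.
az network list-usages --location eastus --output table
```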
- Request Quota Increases:
If you are approaching or have exceeded a quota, you can submit a request to increase it. The process typically involves creating a support request.
Note:
Quota increases are subject to Azure's review and approval process and may take some time.
- Understand Service Limits:
Familiarize yourself with Azure's documented service limits for Virtual Machines, Storage, Networking, etc. These are often tied to specific resource types or configurations.
- Monitor Resource Usage:
Proactively monitor your resource consumption using Azure Monitor and Cost Management to anticipate potential quota issues.