Troubleshooting Azure VM Boot Issues with Boot Diagnostics

Boot Diagnostics is an Azure feature for diagnosing virtual machine (VM) boot problems. It captures the console output and screenshots of your VM during the boot process, so you can see why a VM fails to start correctly even when you cannot connect to it.

Why Use Boot Diagnostics?

How Boot Diagnostics Works

When Boot Diagnostics is enabled for a VM, Azure takes two types of captures:

  1. Console Output: This captures the serial console logs generated by the VM's operating system during the boot sequence. It's text-based and provides detailed information about services starting, errors encountered, and system events.
  2. Screenshot: This is a snapshot of the VM's display output, similar to what you would see on a physical monitor connected to a server. It's particularly useful for identifying visual error messages or graphical issues during the boot process.

Enabling Boot Diagnostics

Boot Diagnostics can be enabled during VM creation or for an existing VM via the Azure portal.

For New VMs:

During the VM creation wizard, navigate to the "Monitoring" tab. Under the "Boot diagnostics" section, select one of the enable options. If you choose a custom storage account rather than a managed one, specify the storage account where the boot logs and screenshots will be stored.

For Existing VMs:

  1. Navigate to your VM in the Azure portal.
  2. Under the "Support + troubleshooting" section in the left-hand menu, select "Boot diagnostics".
  3. If not already enabled, click the "On" button.
  4. You will be prompted to select or create a storage account. It's recommended to use a general-purpose v2 storage account.
  5. Click "Save".

Note: When you use a custom storage account, Boot Diagnostics stores the captured data there. The account must be a standard (not premium) storage account in the same region and subscription as the VM. With a managed storage account, Azure provisions and manages the storage for you.
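
The same setting can also be scripted with the Azure CLI. The following is a minimal sketch, not an official procedure: it only builds the argument list for `az vm boot-diagnostics enable`. The names `enable_command`, `my-rg`, and `my-vm` are placeholders introduced here, and the claim that omitting `--storage` selects a managed storage account is an assumption about recent CLI versions that you should verify against your installed version.

```python
def enable_command(resource_group, vm_name, storage=None):
    """Build the argv for `az vm boot-diagnostics enable`.

    Assumption: when `storage` is None the --storage flag is omitted, which
    recent Azure CLI versions are understood to treat as "use a managed
    storage account". Pass a storage account name or blob URI to use a
    custom account instead.
    """
    argv = [
        "az", "vm", "boot-diagnostics", "enable",
        "--resource-group", resource_group,
        "--name", vm_name,
    ]
    if storage:
        argv += ["--storage", storage]
    return argv


# Placeholder names; nothing is executed here, the command is only printed.
print(" ".join(enable_command("my-rg", "my-vm")))
```

To apply the change, the returned list can be passed to `subprocess.run(argv, check=True)` on a machine where the `az` CLI is installed and logged in.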

Accessing and Interpreting Boot Diagnostics Data

Once enabled, you can access the boot logs and screenshots from the "Boot diagnostics" blade of your VM.
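
If you prefer to pull the log from a script rather than the portal, the Azure CLI exposes `az vm boot-diagnostics get-boot-log`. The sketch below is one possible wrapper, assuming the `az` CLI is on the PATH and already authenticated. The function names and the injectable `run` parameter are choices made for this example so that the wrapper can be exercised without calling Azure.

```python
import subprocess


def boot_log_command(resource_group, vm_name):
    """Build the argv for `az vm boot-diagnostics get-boot-log`."""
    return [
        "az", "vm", "boot-diagnostics", "get-boot-log",
        "--resource-group", resource_group,
        "--name", vm_name,
    ]


def fetch_boot_log(resource_group, vm_name, run=subprocess.run):
    """Run the CLI and return the console log text.

    `run` defaults to subprocess.run; because check=True is passed, a
    non-zero exit status from the CLI raises CalledProcessError.
    """
    result = run(
        boot_log_command(resource_group, vm_name),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A call such as `fetch_boot_log("my-rg", "my-vm")` (placeholder names) would return the log as a string that can be saved or searched.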

Console Output:

The console output is presented as a text log. Look for error messages, failed services, and failed driver loads, as in this excerpt:

[   0.000000] Linux version 5.15.0-1050-azure (buildd@lcy02-amd64-104) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0, ...) #55-Ubuntu SMP Mon Feb 20 14:41:25 UTC 2023
...
[  15.456789] ERROR: Failed to load module 'my_custom_driver'. Error code: -19
[  16.012345] systemd[1]: Failed to start NetworkManager.service - Network Manager.
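
Long console logs are easier to review if the suspicious lines are filtered out first. The following sketch shows one simple way to do that with a keyword search. The keyword list in `ERROR_PATTERN` is a hypothetical starting point rather than an exhaustive set, and `SAMPLE_LOG` is modeled on the excerpt above.

```python
import re

# Hypothetical keyword list; tune it for your own images and services.
ERROR_PATTERN = re.compile(r"\b(error|failed|panic|timed out)\b", re.IGNORECASE)


def find_suspect_lines(log_text):
    """Return (line_number, line) pairs for lines matching ERROR_PATTERN."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ERROR_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


SAMPLE_LOG = """\
[   0.000000] Linux version 5.15.0-1050-azure
[  15.456789] ERROR: Failed to load module 'my_custom_driver'. Error code: -19
[  16.012345] systemd[1]: Failed to start NetworkManager.service - Network Manager.
"""

for number, line in find_suspect_lines(SAMPLE_LOG):
    print(f"{number}: {line}")
```

A keyword filter like this only narrows the search. The lines just before each hit usually need to be read as well, because the first reported error is often the root cause of the ones that follow.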
            

Screenshot:

The screenshot provides a visual confirmation of the boot process. It can help identify on-screen error messages, such as a BSOD, and the point at which the boot process stalls.

If the screenshot shows a black screen, or the boot appears to hang at a specific point, the problem is likely occurring at that stage of the boot process.

Important: If Boot Diagnostics is not enabled, you will not be able to retrieve this data for a VM that has already failed to boot. You must enable it proactively.

Common Boot Failure Scenarios and How Boot Diagnostics Helps

Scenario 1: Operating System Corruption

Symptom: VM fails to boot and shows a generic error message or a blue screen of death (BSOD).

Boot Diagnostics: The screenshot might show a BSOD with an error code, or the console log might indicate corrupted system files or failed driver loads.

Scenario 2: Incorrect Disk Configuration

Symptom: VM fails to find a bootable drive.

Boot Diagnostics: The console output might contain messages like "No bootable device found" or errors related to disk enumeration.

Scenario 3: Network Driver Issues

Symptom: VM boots but becomes unreachable, or network services fail to start.

Boot Diagnostics: Console logs might show errors starting network-related services or driver initialization failures.

Scenario 4: Unintended Configuration Changes

Symptom: VM boots but exhibits unexpected behavior or crashes shortly after. This could be due to recent updates or configuration modifications.

Boot Diagnostics: Reviewing the logs from the time of the failure can help identify which service or process started failing after the changes.
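
As a rough first pass, the scenarios above can be matched against a console log by looking for the kinds of messages each one mentions. The sketch below is illustrative only: the `SCENARIO_SIGNATURES` table is a hypothetical mapping built from the example messages in this article, real logs vary by operating system and image, and Scenario 4 has no fixed signature, so it is not included.

```python
# Hypothetical signature table based on messages quoted in this article.
# Substrings are matched case-insensitively against the whole log.
SCENARIO_SIGNATURES = {
    "operating system corruption": ["corrupt", "failed to load module"],
    "incorrect disk configuration": ["no bootable device"],
    "network driver issues": ["failed to start networkmanager"],
}


def suggest_scenarios(log_text):
    """Return the scenarios whose signatures appear in the log, in table order."""
    lowered = log_text.lower()
    return [
        scenario
        for scenario, needles in SCENARIO_SIGNATURES.items()
        if any(needle in lowered for needle in needles)
    ]


print(suggest_scenarios("No bootable device found"))
```

A match is only a hint about where to look next, not a diagnosis. A log with no matches can still describe a failed boot.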

Troubleshooting Steps with Boot Diagnostics

  1. Enable Boot Diagnostics: Ensure it's enabled for the VM.
  2. Capture Data: Trigger a VM reboot to get fresh diagnostic information.
  3. Review Screenshot: Look for any visual errors or stalls.
  4. Analyze Console Output: Search for error keywords and specific error codes.
  5. Correlate with Recent Changes: Compare the boot log with any recent configuration changes, updates, or deployments.
  6. Seek Resolution: Based on the findings, use the appropriate Azure tools or OS-level troubleshooting steps to resolve the issue. This might involve attaching the OS disk to another VM for repair, using the Azure Serial Console for interactive troubleshooting, or reverting changes.

Tip: For persistent boot issues, consider using the Azure Serial Console for interactive troubleshooting. It allows you to access the VM's command line directly, even if the OS is not fully functional, and can be used in conjunction with boot diagnostic findings.

Best Practices