Introduction to DirectML Performance
DirectML is a high-performance, hardware-accelerated machine learning API for Windows. Achieving optimal performance with DirectML involves understanding its core principles and applying best practices in your application development. This documentation provides insights and techniques to maximize the speed and efficiency of your DirectML workloads.
This section covers foundational concepts and outlines the key areas we'll explore to unlock the full potential of DirectML on Windows.
Optimizing Operator Execution
Operators are the building blocks of your machine learning models. Efficient execution of these operators is paramount for overall performance.
Operator Fusion
DirectML can fuse multiple operators into a single kernel execution, reducing overhead and improving data locality. Identify opportunities to chain compatible operators.
Performance Tip:
- Prefer sequential operations that can be fused by the DirectML driver.
- Use models that are optimized for fusion, often found in model conversion tools.
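As a concrete illustration, several DirectML operator descriptions expose a FusedActivation field. Below is a minimal sketch, with the tensor descriptions elided, that folds a ReLU into a convolution so both run as a single kernel; dmlDevice is an assumed IDMLDevice*.

#include <DirectML.h>
#include <wrl/client.h>

// Sketch: build a convolution with a ReLU folded in via FusedActivation.
// A fused activation's InputTensor/OutputTensor must remain null.
Microsoft::WRL::ComPtr<IDMLOperator> CreateFusedConvRelu(IDMLDevice* dmlDevice)
{
    DML_ACTIVATION_RELU_OPERATOR_DESC reluDesc = {};
    DML_OPERATOR_DESC fusedActivation = { DML_OPERATOR_ACTIVATION_RELU, &reluDesc };

    DML_CONVOLUTION_OPERATOR_DESC convDesc = {};
    // ... fill in InputTensor, FilterTensor, OutputTensor, Strides, padding, GroupCount ...
    convDesc.FusedActivation = &fusedActivation; // conv + ReLU execute as one kernel

    DML_OPERATOR_DESC opDesc = { DML_OPERATOR_CONVOLUTION, &convDesc };
    Microsoft::WRL::ComPtr<IDMLOperator> convRelu;
    dmlDevice->CreateOperator(&opDesc, IID_PPV_ARGS(&convRelu));
    return convRelu;
}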
Tensor Layouts
The arrangement of data within tensors (e.g., NCHW vs. NHWC) can significantly impact performance depending on the underlying hardware. Experiment with different layouts to find the most efficient one for your specific model and hardware configuration.
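In DirectML, layout is expressed through explicit strides on DML_BUFFER_TENSOR_DESC: the logical dimension order stays NCHW while the strides select the physical arrangement. A minimal sketch for a FLOAT32 tensor, using illustrative sizes:

#include <DirectML.h>

// Sketch: one logical NCHW tensor, two physical layouts selected by strides.
void DescribeTensorLayouts()
{
    const UINT N = 1, C = 3, H = 224, W = 224;
    UINT sizes[4]       = { N, C, H, W };          // logical order is always NCHW
    UINT nchwStrides[4] = { C*H*W, H*W, W, 1 };    // packed, channels-first
    UINT nhwcStrides[4] = { H*W*C, 1, W*C, C };    // channels-last

    DML_BUFFER_TENSOR_DESC tensorDesc = {};
    tensorDesc.DataType = DML_TENSOR_DATA_TYPE_FLOAT32;
    tensorDesc.DimensionCount = 4;
    tensorDesc.Sizes = sizes;
    tensorDesc.Strides = nhwcStrides; // swap in nchwStrides to compare performance

    // Total bytes = (flat index of the last element + 1) * element size.
    UINT64 last = 0;
    for (int i = 0; i < 4; ++i)
        last += UINT64(sizes[i] - 1) * tensorDesc.Strides[i];
    tensorDesc.TotalTensorSizeInBytes = (last + 1) * sizeof(float);
}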
Batch Size Tuning
The optimal batch size is a critical tuning parameter. Larger batch sizes can improve throughput by better utilizing hardware parallelism, but excessively large batches can lead to memory pressure and diminishing returns. Measure performance with varying batch sizes to find the sweet spot.
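One practical way to find the sweet spot is a simple sweep. In the sketch below, RunBatch is a hypothetical helper standing in for your own dispatch-plus-fence-wait; only the timing harness is shown.

#include <chrono>
#include <cstdio>

// Hypothetical helper: executes one batch of `batchSize` samples on the GPU
// and blocks until completion. Replace with your own dispatch + fence wait.
void RunBatch(int batchSize);

void SweepBatchSizes()
{
    const int batchSizes[] = { 1, 2, 4, 8, 16, 32, 64 };
    for (int batchSize : batchSizes)
    {
        RunBatch(batchSize); // warm-up run, excluded from timing
        const int iterations = 50;
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i)
            RunBatch(batchSize);
        std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
        double samplesPerSecond = batchSize * iterations / elapsed.count();
        std::printf("batch %2d: %.1f samples/sec\n", batchSize, samplesPerSecond);
    }
}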
Memory Management
Efficient memory management is crucial for both performance and stability. Incorrect memory handling can lead to bottlenecks, increased latency, and out-of-memory errors.
Resource Lifetimes
Carefully manage the lifetimes of DirectML resources (the D3D12 buffers that back tensors, binding tables, descriptor heaps, and so on). Reusing resources where possible can significantly reduce allocation overhead, as the pool sketch below illustrates.
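A minimal sketch of one reuse strategy, assuming DirectML tensors backed by default-heap D3D12 buffers: a small pool keyed by size so repeated dispatches recycle allocations rather than creating new ones.

#include <d3d12.h>
#include <wrl/client.h>
#include <map>
#include <vector>

using Microsoft::WRL::ComPtr;

// Sketch: recycle default-heap buffers by size instead of allocating per dispatch.
class BufferPool
{
    std::map<UINT64, std::vector<ComPtr<ID3D12Resource>>> free_;
public:
    ComPtr<ID3D12Resource> Acquire(ID3D12Device* device, UINT64 size)
    {
        auto& bucket = free_[size];
        if (!bucket.empty())                              // reuse before allocating
        {
            ComPtr<ID3D12Resource> buffer = bucket.back();
            bucket.pop_back();
            return buffer;
        }
        D3D12_HEAP_PROPERTIES heap = { D3D12_HEAP_TYPE_DEFAULT };
        D3D12_RESOURCE_DESC desc = {};
        desc.Dimension = D3D12_RESOURCE_DIMENSION_BUFFER;
        desc.Width = size;
        desc.Height = 1;
        desc.DepthOrArraySize = 1;
        desc.MipLevels = 1;
        desc.SampleDesc.Count = 1;
        desc.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;
        desc.Flags = D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS; // DirectML tensors need UAV access
        ComPtr<ID3D12Resource> buffer;
        device->CreateCommittedResource(&heap, D3D12_HEAP_FLAG_NONE, &desc,
            D3D12_RESOURCE_STATE_UNORDERED_ACCESS, nullptr, IID_PPV_ARGS(&buffer));
        return buffer;
    }
    void Release(ComPtr<ID3D12Resource> buffer)           // return to the pool for reuse
    {
        free_[buffer->GetDesc().Width].push_back(buffer);
    }
};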
Persistent Resources
For frequently used resources, consider using persistent resources. These resources can remain allocated across multiple command lists, avoiding repeated allocation and deallocation.
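Concretely, a compiled operator reports its requirement through IDMLCompiledOperator::GetBindingProperties, and the buffer is bound with IDMLBindingTable::BindPersistentResource. A sketch, assuming the operator, binding table, and buffer are created elsewhere:

#include <DirectML.h>

// Sketch: query the persistent resource requirement once, then bind the same
// buffer on every execution. persistentBuffer is assumed to be a default-heap
// buffer of at least PersistentResourceSize bytes, kept alive across command lists.
void BindPersistent(IDMLCompiledOperator* compiledOp,
                    IDMLBindingTable* bindingTable,
                    ID3D12Resource* persistentBuffer)
{
    DML_BINDING_PROPERTIES props = compiledOp->GetBindingProperties();
    if (props.PersistentResourceSize > 0)
    {
        DML_BUFFER_BINDING bufferBinding = { persistentBuffer, 0, props.PersistentResourceSize };
        DML_BINDING_DESC bindingDesc = { DML_BINDING_TYPE_BUFFER, &bufferBinding };
        bindingTable->BindPersistentResource(&bindingDesc);
    }
}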
Buffer Padding and Alignment
Ensure that your buffers are correctly padded and aligned according to DirectML specifications. Incorrect alignment can lead to suboptimal memory access patterns and performance degradation.
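A sketch of the usual helper, assuming DML_MINIMUM_BUFFER_TENSOR_ALIGNMENT from DirectML.h as the required offset alignment; NextTensorOffset and its previousTensorEnd parameter are illustrative names.

#include <DirectML.h>

// Round a byte offset or size up to a power-of-two alignment.
inline UINT64 AlignUp(UINT64 value, UINT64 alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

// Example: place the next tensor in a shared buffer at a legal offset.
UINT64 NextTensorOffset(UINT64 previousTensorEnd)
{
    return AlignUp(previousTensorEnd, DML_MINIMUM_BUFFER_TENSOR_ALIGNMENT);
}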
Resource Binding
How you bind resources to your DirectML operators affects how efficiently data is accessed. Understanding binding is key to avoiding data transfer bottlenecks.
Persistent Resource Binding
When using persistent resources, ensure they are bound correctly to the appropriate operator inputs and outputs. This avoids the overhead of re-binding resources for each operator or operator graph.
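One way to keep that overhead down is to reuse a single binding table, retargeting it with IDMLBindingTable::Reset instead of creating a fresh table per dispatch. A sketch, assuming a descriptor heap sized for the largest operator:

#include <DirectML.h>

// Sketch: retarget one binding table at the next compiled operator.
// After Reset, all bindings (including the persistent resource) must be re-bound.
void RetargetBindingTable(IDMLBindingTable* bindingTable,
                          IDMLCompiledOperator* nextOp,
                          D3D12_CPU_DESCRIPTOR_HANDLE cpuHandle,
                          D3D12_GPU_DESCRIPTOR_HANDLE gpuHandle,
                          UINT descriptorCount)
{
    DML_BINDING_TABLE_DESC tableDesc = {};
    tableDesc.Dispatchable = nextOp;
    tableDesc.CPUDescriptorHandle = cpuHandle;
    tableDesc.GPUDescriptorHandle = gpuHandle;
    tableDesc.SizeInDescriptors = descriptorCount;
    bindingTable->Reset(&tableDesc);
}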
Input and Output Binding Order
Bindings passed to BindInputs and BindOutputs are matched to an operator's tensors by position, so the arrays must follow the operator's declared input and output order. Consult the DirectML documentation for operator-specific binding requirements.
API Example: Binding Resources
Note that binding goes through the IDMLBindingTable interface rather than the operator itself:
IDMLBindingTable::BindInputs(UINT bindingCount, _In_reads_opt_(bindingCount) const DML_BINDING_DESC* bindings)
IDMLBindingTable::BindOutputs(UINT bindingCount, _In_reads_opt_(bindingCount) const DML_BINDING_DESC* bindings)
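A usage sketch, assuming the binding table was created for the compiled operator and that the buffer names and byte sizes are placeholders matching the operator's tensor descriptions:

// Sketch: bind one input tensor and one output tensor on the binding table.
void BindIO(IDMLBindingTable* bindingTable,
            ID3D12Resource* inputBuffer, UINT64 inputBytes,
            ID3D12Resource* outputBuffer, UINT64 outputBytes)
{
    DML_BUFFER_BINDING inBuf = { inputBuffer, 0, inputBytes };
    DML_BINDING_DESC input = { DML_BINDING_TYPE_BUFFER, &inBuf };
    bindingTable->BindInputs(1, &input);

    DML_BUFFER_BINDING outBuf = { outputBuffer, 0, outputBytes };
    DML_BINDING_DESC output = { DML_BINDING_TYPE_BUFFER, &outBuf };
    bindingTable->BindOutputs(1, &output);
}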
Performance Analysis Tools
Leveraging the right tools is essential for identifying performance bottlenecks and validating optimizations.
Windows Performance Analyzer (WPA)
WPA provides deep insights into system-level performance, including CPU usage, GPU activity, and memory operations. Use it to profile your DirectML application and pinpoint areas for improvement.
PIX on Windows
PIX is an invaluable debugging and performance analysis tool for DirectX applications. It allows you to capture detailed GPU captures, analyze command lists, and inspect resource states, helping you understand GPU-bound issues.
Important Note:
Regularly profile your application on target hardware to ensure that optimizations translate effectively. What works on one device might behave differently on another.
Common Pitfalls to Avoid
Be aware of common mistakes that can lead to suboptimal DirectML performance.
- Excessive CPU-GPU Synchronization: Avoid frequent calls to ID3D12CommandQueue::Signal or ID3D12CommandQueue::Wait without proper batching (see the sketch after this list).
- Unnecessary Data Transfers: Minimize host-to-device and device-to-host data copies. Keep data on the GPU as much as possible.
- Fragmented Memory Allocations: Large numbers of small, short-lived allocations can lead to memory fragmentation and increased overhead.
- Ignoring Hardware Specifics: Different GPUs have different strengths and weaknesses. Consider the target hardware when optimizing.
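For the synchronization pitfall in particular, the sketch below batches a frame's worth of dispatches behind a single signal and a single wait; the queue, fence, and event are assumed to be created at startup.

#include <d3d12.h>
#include <windows.h>

// Sketch: submit many dispatches in one go, then sync once.
void SubmitAndWait(ID3D12CommandQueue* queue, ID3D12GraphicsCommandList* commandList,
                   ID3D12Fence* fence, HANDLE fenceEvent, UINT64& fenceValue)
{
    ID3D12CommandList* lists[] = { commandList };
    queue->ExecuteCommandLists(1, lists);          // many dispatches, one submission

    queue->Signal(fence, ++fenceValue);            // one signal for the whole batch
    if (fence->GetCompletedValue() < fenceValue)   // block only if the GPU is behind
    {
        fence->SetEventOnCompletion(fenceValue, fenceEvent);
        WaitForSingleObject(fenceEvent, INFINITE);
    }
}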
Advanced Techniques for Further Optimization
For maximum performance, explore these advanced techniques:
- Custom Kernels: In specific scenarios, writing custom DirectML kernels might offer superior performance for unique operations.
- Asynchronous Operations: Utilize asynchronous compute queues where supported to overlap computation with other tasks (see the queue-creation sketch after this list).
- Shader Model Optimization: For custom kernels, adhere to best practices for shader code optimization.
- Profiling Guided Tuning: Continuously profile and tune based on data from WPA and PIX.
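As a starting point for asynchronous operations, here is a minimal sketch of creating the compute queue itself; device is an assumed ID3D12Device*.

#include <d3d12.h>
#include <wrl/client.h>

// Sketch: a dedicated compute queue so DirectML work can overlap work
// submitted to the direct (graphics) queue.
Microsoft::WRL::ComPtr<ID3D12CommandQueue> CreateComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC queueDesc = {};
    queueDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE; // async compute, not DIRECT
    Microsoft::WRL::ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&queueDesc, IID_PPV_ARGS(&computeQueue));
    return computeQueue;
}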