Introduction to DirectML Performance
DirectML is a high-performance, hardware-accelerated machine learning API for Windows. Achieving optimal performance with DirectML involves understanding its core principles and applying best practices in your application development. This documentation provides insights and techniques to maximize the speed and efficiency of your DirectML workloads.
This section covers foundational concepts and outlines the key areas we'll explore to unlock the full potential of DirectML on Windows.
Optimizing Operator Execution
Operators are the building blocks of your machine learning models. Efficient execution of these operators is paramount for overall performance.
Operator Fusion
DirectML can fuse multiple operators into a single kernel execution, reducing overhead and improving data locality. Identify opportunities to chain compatible operators.
Performance Tip:
- Prefer sequential operations that can be fused by the DirectML driver.
- Use models that are optimized for fusion, often found in model conversion tools.
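As a concrete illustration, several DirectML operator descriptions expose a FusedActivation field. Below is a minimal sketch, with the tensor descriptions elided, that folds a ReLU into a convolution so both run as a single kernel; dmlDevice is an assumed IDMLDevice*.

#include <DirectML.h>
#include <wrl/client.h>

// Sketch: build a convolution with a ReLU folded in via FusedActivation.
// A fused activation's InputTensor/OutputTensor must remain null.
Microsoft::WRL::ComPtr<IDMLOperator> CreateFusedConvRelu(IDMLDevice* dmlDevice)
{
    DML_ACTIVATION_RELU_OPERATOR_DESC reluDesc = {};
    DML_OPERATOR_DESC fusedActivation = { DML_OPERATOR_ACTIVATION_RELU, &reluDesc };

    DML_CONVOLUTION_OPERATOR_DESC convDesc = {};
    // ... fill in InputTensor, FilterTensor, OutputTensor, Strides, padding, GroupCount ...
    convDesc.FusedActivation = &fusedActivation; // conv + ReLU execute as one kernel

    DML_OPERATOR_DESC opDesc = { DML_OPERATOR_CONVOLUTION, &convDesc };
    Microsoft::WRL::ComPtr<IDMLOperator> convRelu;
    dmlDevice->CreateOperator(&opDesc, IID_PPV_ARGS(&convRelu));
    return convRelu;
}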
Tensor Layouts
The arrangement of data within tensors (e.g., NCHW vs. NHWC) can significantly impact performance depending on the underlying hardware. Experiment with different layouts to find the most efficient one for your specific model and hardware configuration.
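In DirectML, layout is expressed through explicit strides on DML_BUFFER_TENSOR_DESC: the logical dimension order stays NCHW while the strides select the physical arrangement. A minimal sketch for a FLOAT32 tensor, using illustrative sizes:

#include <DirectML.h>

// Sketch: one logical NCHW tensor, two physical layouts selected by strides.
void DescribeTensorLayouts()
{
    const UINT N = 1, C = 3, H = 224, W = 224;
    UINT sizes[4]       = { N, C, H, W };          // logical order is always NCHW
    UINT nchwStrides[4] = { C*H*W, H*W, W, 1 };    // packed, channels-first
    UINT nhwcStrides[4] = { H*W*C, 1, W*C, C };    // channels-last

    DML_BUFFER_TENSOR_DESC tensorDesc = {};
    tensorDesc.DataType = DML_TENSOR_DATA_TYPE_FLOAT32;
    tensorDesc.DimensionCount = 4;
    tensorDesc.Sizes = sizes;
    tensorDesc.Strides = nhwcStrides; // swap in nchwStrides to compare performance

    // Total bytes = (flat index of the last element + 1) * element size.
    UINT64 last = 0;
    for (int i = 0; i < 4; ++i)
        last += UINT64(sizes[i] - 1) * tensorDesc.Strides[i];
    tensorDesc.TotalTensorSizeInBytes = (last + 1) * sizeof(float);
}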
Batch Size Tuning
The optimal batch size is a critical tuning parameter. Larger batch sizes can improve throughput by better utilizing hardware parallelism, but excessively large batches can lead to memory pressure and diminishing returns. Measure performance with varying batch sizes to find the sweet spot.
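One practical way to find the sweet spot is a simple sweep. In the sketch below, RunBatch is a hypothetical helper standing in for your own dispatch-plus-fence-wait; only the timing harness is shown.

#include <chrono>
#include <cstdio>

// Hypothetical helper: executes one batch of `batchSize` samples on the GPU
// and blocks until completion. Replace with your own dispatch + fence wait.
void RunBatch(int batchSize);

void SweepBatchSizes()
{
    const int batchSizes[] = { 1, 2, 4, 8, 16, 32, 64 };
    for (int batchSize : batchSizes)
    {
        RunBatch(batchSize); // warm-up run, excluded from timing
        const int iterations = 50;
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i)
            RunBatch(batchSize);
        std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
        double samplesPerSecond = batchSize * iterations / elapsed.count();
        std::printf("batch %2d: %.1f samples/sec\n", batchSize, samplesPerSecond);
    }
}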
Memory Management
Efficient memory management is crucial for both performance and stability. Incorrect memory handling can lead to bottlenecks, increased latency, and out-of-memory errors.
Resource Lifetimes
Carefully manage the lifetimes of DirectML resources (the D3D12 buffers that back tensors, binding tables, descriptor heaps, and so on). Reusing resources where possible can significantly reduce allocation overhead, as the pool sketch below illustrates.
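A minimal sketch of one reuse strategy, assuming DirectML tensors backed by default-heap D3D12 buffers: a small pool keyed by size so repeated dispatches recycle allocations rather than creating new ones.

#include <d3d12.h>
#include <wrl/client.h>
#include <map>
#include <vector>

using Microsoft::WRL::ComPtr;

// Sketch: recycle default-heap buffers by size instead of allocating per dispatch.
class BufferPool
{
    std::map<UINT64, std::vector<ComPtr<ID3D12Resource>>> free_;
public:
    ComPtr<ID3D12Resource> Acquire(ID3D12Device* device, UINT64 size)
    {
        auto& bucket = free_[size];
        if (!bucket.empty())                              // reuse before allocating
        {
            ComPtr<ID3D12Resource> buffer = bucket.back();
            bucket.pop_back();
            return buffer;
        }
        D3D12_HEAP_PROPERTIES heap = { D3D12_HEAP_TYPE_DEFAULT };
        D3D12_RESOURCE_DESC desc = {};
        desc.Dimension = D3D12_RESOURCE_DIMENSION_BUFFER;
        desc.Width = size;
        desc.Height = 1;
        desc.DepthOrArraySize = 1;
        desc.MipLevels = 1;
        desc.SampleDesc.Count = 1;
        desc.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;
        desc.Flags = D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS; // DirectML tensors need UAV access
        ComPtr<ID3D12Resource> buffer;
        device->CreateCommittedResource(&heap, D3D12_HEAP_FLAG_NONE, &desc,
            D3D12_RESOURCE_STATE_UNORDERED_ACCESS, nullptr, IID_PPV_ARGS(&buffer));
        return buffer;
    }
    void Release(ComPtr<ID3D12Resource> buffer)           // return to the pool for reuse
    {
        free_[buffer->GetDesc().Width].push_back(buffer);
    }
};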
Persistent Resources
For frequently used resources, consider using persistent resources. These resources can remain allocated across multiple command lists, avoiding repeated allocation and deallocation.
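Concretely, a compiled operator reports its requirement through IDMLCompiledOperator::GetBindingProperties, and the buffer is bound with IDMLBindingTable::BindPersistentResource. A sketch, assuming the operator, binding table, and buffer are created elsewhere:

#include <DirectML.h>

// Sketch: query the persistent resource requirement once, then bind the same
// buffer on every execution. persistentBuffer is assumed to be a default-heap
// buffer of at least PersistentResourceSize bytes, kept alive across command lists.
void BindPersistent(IDMLCompiledOperator* compiledOp,
                    IDMLBindingTable* bindingTable,
                    ID3D12Resource* persistentBuffer)
{
    DML_BINDING_PROPERTIES props = compiledOp->GetBindingProperties();
    if (props.PersistentResourceSize > 0)
    {
        DML_BUFFER_BINDING bufferBinding = { persistentBuffer, 0, props.PersistentResourceSize };
        DML_BINDING_DESC bindingDesc = { DML_BINDING_TYPE_BUFFER, &bufferBinding };
        bindingTable->BindPersistentResource(&bindingDesc);
    }
}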
Buffer Padding and Alignment
Ensure that your buffers are correctly padded and aligned according to DirectML specifications. Incorrect alignment can lead to suboptimal memory access patterns and performance degradation.
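A sketch of the usual helper, assuming DML_MINIMUM_BUFFER_TENSOR_ALIGNMENT from DirectML.h as the required offset alignment; NextTensorOffset and its previousTensorEnd parameter are illustrative names.

#include <DirectML.h>

// Round a byte offset or size up to a power-of-two alignment.
inline UINT64 AlignUp(UINT64 value, UINT64 alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

// Example: place the next tensor in a shared buffer at a legal offset.
UINT64 NextTensorOffset(UINT64 previousTensorEnd)
{
    return AlignUp(previousTensorEnd, DML_MINIMUM_BUFFER_TENSOR_ALIGNMENT);
}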
Resource Binding
How you bind resources to your DirectML operators affects how efficiently data is accessed. Understanding binding is key to avoiding data transfer bottlenecks.
Persistent Resource Binding
When using persistent resources, ensure they are bound correctly to the appropriate operator inputs and outputs. This avoids the overhead of re-binding resources for each operator or operator graph.
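One way to keep that overhead down is to reuse a single binding table, retargeting it with IDMLBindingTable::Reset instead of creating a fresh table per dispatch. A sketch, assuming a descriptor heap sized for the largest operator:

#include <DirectML.h>

// Sketch: retarget one binding table at the next compiled operator.
// After Reset, all bindings (including the persistent resource) must be re-bound.
void RetargetBindingTable(IDMLBindingTable* bindingTable,
                          IDMLCompiledOperator* nextOp,
                          D3D12_CPU_DESCRIPTOR_HANDLE cpuHandle,
                          D3D12_GPU_DESCRIPTOR_HANDLE gpuHandle,
                          UINT descriptorCount)
{
    DML_BINDING_TABLE_DESC tableDesc = {};
    tableDesc.Dispatchable = nextOp;
    tableDesc.CPUDescriptorHandle = cpuHandle;
    tableDesc.GPUDescriptorHandle = gpuHandle;
    tableDesc.SizeInDescriptors = descriptorCount;
    bindingTable->Reset(&tableDesc);
}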
Input and Output Binding Order
Bindings passed to BindInputs and BindOutputs are matched to an operator's tensors by position, so the arrays must follow the operator's declared input and output order. Consult the DirectML documentation for operator-specific binding requirements.
API Example: Binding Resources
Note that binding goes through the IDMLBindingTable interface rather than the operator itself:
IDMLBindingTable::BindInputs(UINT bindingCount, _In_reads_opt_(bindingCount) const DML_BINDING_DESC* bindings)
IDMLBindingTable::BindOutputs(UINT bindingCount, _In_reads_opt_(bindingCount) const DML_BINDING_DESC* bindings)
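A usage sketch, assuming the binding table was created for the compiled operator and that the buffer names and byte sizes are placeholders matching the operator's tensor descriptions:

// Sketch: bind one input tensor and one output tensor on the binding table.
void BindIO(IDMLBindingTable* bindingTable,
            ID3D12Resource* inputBuffer, UINT64 inputBytes,
            ID3D12Resource* outputBuffer, UINT64 outputBytes)
{
    DML_BUFFER_BINDING inBuf = { inputBuffer, 0, inputBytes };
    DML_BINDING_DESC input = { DML_BINDING_TYPE_BUFFER, &inBuf };
    bindingTable->BindInputs(1, &input);

    DML_BUFFER_BINDING outBuf = { outputBuffer, 0, outputBytes };
    DML_BINDING_DESC output = { DML_BINDING_TYPE_BUFFER, &outBuf };
    bindingTable->BindOutputs(1, &output);
}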
Performance Analysis Tools
Leveraging the right tools is essential for identifying performance bottlenecks and validating optimizations.
Windows Performance Analyzer (WPA)
WPA provides deep insights into system-level performance, including CPU usage, GPU activity, and memory operations. Use it to profile your DirectML application and pinpoint areas for improvement.
PIX on Windows
PIX is an invaluable debugging and performance analysis tool for DirectX applications. It allows you to capture detailed GPU captures, analyze command lists, and inspect resource states, helping you understand GPU-bound issues.
Important Note:
Regularly profile your application on target hardware to ensure that optimizations translate effectively. What works on one device might behave differently on another.
Common Pitfalls to Avoid
Be aware of common mistakes that can lead to suboptimal DirectML performance.
- Excessive CPU-GPU Synchronization: Avoid frequent calls to ID3D12CommandQueue::Signal or ID3D12CommandQueue::Wait without proper batching (see the sketch after this list).
- Unnecessary Data Transfers: Minimize host-to-device and device-to-host data copies. Keep data on the GPU as much as possible.
- Fragmented Memory Allocations: Large numbers of small, short-lived allocations can lead to memory fragmentation and increased overhead.
- Ignoring Hardware Specifics: Different GPUs have different strengths and weaknesses. Consider the target hardware when optimizing.
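For the synchronization pitfall in particular, the sketch below batches a frame's worth of dispatches behind a single signal and a single wait; the queue, fence, and event are assumed to be created at startup.

#include <d3d12.h>
#include <windows.h>

// Sketch: submit many dispatches in one go, then sync once.
void SubmitAndWait(ID3D12CommandQueue* queue, ID3D12GraphicsCommandList* commandList,
                   ID3D12Fence* fence, HANDLE fenceEvent, UINT64& fenceValue)
{
    ID3D12CommandList* lists[] = { commandList };
    queue->ExecuteCommandLists(1, lists);          // many dispatches, one submission

    queue->Signal(fence, ++fenceValue);            // one signal for the whole batch
    if (fence->GetCompletedValue() < fenceValue)   // block only if the GPU is behind
    {
        fence->SetEventOnCompletion(fenceValue, fenceEvent);
        WaitForSingleObject(fenceEvent, INFINITE);
    }
}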
Advanced Techniques for Further Optimization
For maximum performance, explore these advanced techniques:
- Custom Kernels: In specific scenarios, writing custom DirectML kernels might offer superior performance for unique operations.
- Asynchronous Operations: Utilize asynchronous compute queues where supported to overlap computation with other tasks (see the queue-creation sketch after this list).
- Shader Model Optimization: For custom kernels, adhere to best practices for shader code optimization.
- Profiling Guided Tuning: Continuously profile and tune based on data from WPA and PIX.
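As a starting point for asynchronous operations, here is a minimal sketch of creating the compute queue itself; device is an assumed ID3D12Device*.

#include <d3d12.h>
#include <wrl/client.h>

// Sketch: a dedicated compute queue so DirectML work can overlap work
// submitted to the direct (graphics) queue.
Microsoft::WRL::ComPtr<ID3D12CommandQueue> CreateComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC queueDesc = {};
    queueDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE; // async compute, not DIRECT
    Microsoft::WRL::ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&queueDesc, IID_PPV_ARGS(&computeQueue));
    return computeQueue;
}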