Overview
DirectML (Direct Machine Learning) provides a high‑performance, low‑level API for machine learning workloads on Windows. It integrates tightly with DirectX 12, allowing GPU‑accelerated execution of neural network graphs with minimal overhead.
Key Architectural Concepts
Graph Execution Model
DirectML represents a neural network as a graph of operators (nodes) connected by tensors. The runtime compiles the graph into a sequence of GPU commands and executes them as a single dispatch.
// Pseudocode only – the actual entry point is IDMLDevice1::CompileGraph,
// which takes a DML_GRAPH_DESC of operator nodes and tensor edges.
graph = CreateGraph();
graph->AddOperator(Convolution, inputs, outputs, params);
graph->AddOperator(ReLU, ...);
graph->Compile();
graph->Execute(commandList);
Layer Abstractions
- Tensor: Typed multi‑dimensional array with layout hints.
- Operator: Core computational unit (e.g., Conv, MatMul, Softmax).
- Descriptor: Describes specific operator parameters and tensor formats.
DirectX 12 Integration
DirectML leverages DirectX 12’s resource management and command queues. Resources created for DirectML can be shared with other DirectX pipelines, enabling hybrid graphics‑ML workloads.
// Example: binding a buffer produced by a render or compute pass as a
// DirectML tensor input. DirectML tensors are plain ID3D12Resource buffers,
// so the same resource can be bound directly (binding-table setup elided).
ID3D12Resource* sharedBuffer = ...;
DML_BUFFER_BINDING binding = { sharedBuffer, 0, sharedBuffer->GetDesc().Width };
DML_BINDING_DESC bindingDesc = { DML_BINDING_TYPE_BUFFER, &binding };
bindingTable->BindInputs(1, &bindingDesc);
Execution Pipeline
- Resource Allocation – Allocate GPU buffers or textures for tensors.
- Graph Construction – Define operators and their connections.
- Compilation – DirectML compiles the graph into optimized GPU commands.
- Dispatch – Submit the compiled command list to a DirectX 12 command queue.
- Synchronization – Use fences or events to synchronize with other GPU work.
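The synchronization step above is typically implemented with a D3D12 fence. A minimal sketch, assuming an already-created device and queue and omitting error handling:

```cpp
#include <windows.h>
#include <d3d12.h>

// Block the CPU until the queue has finished all work submitted so far.
void WaitForQueueIdle(ID3D12Device* device, ID3D12CommandQueue* queue)
{
    ID3D12Fence* fence = nullptr;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // Ask the queue to raise the fence to 1 once prior work completes.
    const UINT64 fenceValue = 1;
    queue->Signal(fence, fenceValue);

    if (fence->GetCompletedValue() < fenceValue)
    {
        // Sleep on an event instead of spinning on GetCompletedValue().
        HANDLE event = CreateEvent(nullptr, FALSE, FALSE, nullptr);
        fence->SetEventOnCompletion(fenceValue, event);
        WaitForSingleObject(event, INFINITE);
        CloseHandle(event);
    }

    fence->Release();
}
```

In a real pipeline the fence value is usually monotonically incremented per submission rather than recreated, so other GPU work can wait on intermediate values.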
Performance Considerations
- Prefer immutable tensors for weights to enable caching.
- Batch inputs to maximize GPU utilization.
- Align tensor strides to 128‑byte boundaries for optimal memory access.
- Leverage ID3D12GraphicsCommandList::SetPipelineState reuse across multiple graphs.
Sample Code
Below is a minimal example that creates a convolution operator and dispatches it on the GPU; resource allocation, binding-table setup, and operator initialization are elided for brevity.
#include <DirectML.h>
#include <d3d12.h>
// Initialize the D3D12 device and command queue (creation helpers elided)
ID3D12Device* device = CreateD3D12Device();
ID3D12CommandQueue* queue = CreateD3D12CommandQueue(device);
// Create DirectML device
IDMLDevice* dmlDevice = nullptr;
DMLCreateDevice(device, DML_CREATE_DEVICE_FLAG_NONE, IID_PPV_ARGS(&dmlDevice));
// Describe the tensors. DirectML takes plain structs rather than
// descriptor objects.
DML_BUFFER_TENSOR_DESC inputBufferDesc = { /* DataType, Sizes, TotalTensorSizeInBytes, ... */ };
DML_BUFFER_TENSOR_DESC weightBufferDesc = { ... };
DML_BUFFER_TENSOR_DESC outputBufferDesc = { ... };
DML_TENSOR_DESC inputDesc  = { DML_TENSOR_TYPE_BUFFER, &inputBufferDesc };
DML_TENSOR_DESC weightDesc = { DML_TENSOR_TYPE_BUFFER, &weightBufferDesc };
DML_TENSOR_DESC outputDesc = { DML_TENSOR_TYPE_BUFFER, &outputBufferDesc };
// Build and compile the convolution operator
IDMLCompiledOperator* convOp = nullptr;
{
    UINT strides[]   = { 1, 1 };
    UINT dilations[] = { 1, 1 };
    UINT padding[]   = { 0, 0 };

    DML_CONVOLUTION_OPERATOR_DESC convDesc = {};
    convDesc.InputTensor = &inputDesc;
    convDesc.FilterTensor = &weightDesc;
    convDesc.BiasTensor = nullptr;   // optional
    convDesc.OutputTensor = &outputDesc;
    convDesc.Mode = DML_CONVOLUTION_MODE_CROSS_CORRELATION;
    convDesc.Direction = DML_CONVOLUTION_DIRECTION_FORWARD;
    convDesc.DimensionCount = 2;     // number of spatial dimensions
    convDesc.Strides = strides;
    convDesc.Dilations = dilations;
    convDesc.StartPadding = padding;
    convDesc.EndPadding = padding;
    convDesc.OutputPadding = padding;
    convDesc.GroupCount = 1;
    convDesc.FusedActivation = nullptr;

    DML_OPERATOR_DESC opDesc = { DML_OPERATOR_CONVOLUTION, &convDesc };
    IDMLOperator* op = nullptr;
    dmlDevice->CreateOperator(&opDesc, IID_PPV_ARGS(&op));
    dmlDevice->CompileOperator(op, DML_EXECUTION_FLAG_NONE, IID_PPV_ARGS(&convOp));
    op->Release();
}
// Record and submit the dispatch. Execution goes through an IDMLCommandRecorder;
// the binding table, operator initialization, and command-list helper are elided.
IDMLCommandRecorder* recorder = nullptr;
dmlDevice->CreateCommandRecorder(IID_PPV_ARGS(&recorder));
ID3D12GraphicsCommandList* cmdList = CreateD3D12CommandList(device);
recorder->RecordDispatch(cmdList, convOp, bindingTable);
cmdList->Close();
ID3D12CommandList* lists[] = { cmdList };
queue->ExecuteCommandLists(1, lists);
queue->Signal(fence, ++fenceValue); // fence created earlier with ID3D12Device::CreateFence