DirectML Architecture

Overview

DirectML (Direct Machine Learning) provides a high‑performance, low‑level API for machine learning workloads on Windows. It integrates tightly with DirectX 12, allowing GPU‑accelerated execution of neural network graphs with minimal overhead.

Key Architectural Concepts

Graph Execution Model

DirectML represents a neural network as a graph of operators (nodes) connected by tensors. The runtime compiles the graph into a sequence of GPU commands and executes them as a single dispatch.

// Schematic sketch: a graph of operator nodes and tensor edges is described
// with DML_GRAPH_DESC and compiled into a single IDMLCompiledOperator via
// IDMLDevice1::CompileGraph, then recorded as one dispatch.
DML_GRAPH_DESC graphDesc = {...};   // operator nodes + input/output edges
IDMLCompiledOperator* graph = nullptr;
dmlDevice1->CompileGraph(&graphDesc, DML_EXECUTION_FLAG_NONE, IID_PPV_ARGS(&graph));
commandRecorder->RecordDispatch(commandList, graph, bindingTable);

Layer Abstractions

DirectX 12 Integration

DirectML builds on DirectX 12’s resource management and command queues. Tensor data lives in ordinary ID3D12Resource buffers, so the same resources can be shared with other DirectX 12 work on the device, enabling hybrid graphics‑ML workloads.

// Example: binding an existing D3D12 buffer resource as a tensor input.
// (DirectML tensors are backed by buffers, not textures; texture data
// must first be copied into a buffer.)
ID3D12Resource* sharedBuffer = ...;   // buffer also used by other GPU work
DML_BUFFER_BINDING bufferBinding = { sharedBuffer, 0, sharedBuffer->GetDesc().Width };
DML_BINDING_DESC bindingDesc = { DML_BINDING_TYPE_BUFFER, &bufferBinding };
bindingTable->BindInputs(1, &bindingDesc);

Execution Pipeline

  1. Resource Allocation – Allocate GPU buffers or textures for tensors.
  2. Graph Construction – Define operators and their connections.
  3. Compilation – DirectML compiles the graph into optimized GPU commands.
  4. Dispatch – Submit the compiled command list to a DirectX 12 command queue.
  5. Synchronization – Use fences or events to synchronize with other GPU work.
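The synchronization step above can be sketched with a standard D3D12 fence. This is a fragment, not a complete program: `device` and `queue` are assumed to exist from earlier setup.

```cpp
// Create a fence once during setup.
ID3D12Fence* fence = nullptr;
device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
UINT64 fenceValue = 0;

// After submitting the DirectML command list, signal the fence on the queue.
queue->Signal(fence, ++fenceValue);

// Block the CPU until the GPU has reached the signaled value.
if (fence->GetCompletedValue() < fenceValue)
{
    HANDLE event = CreateEvent(nullptr, FALSE, FALSE, nullptr);
    fence->SetEventOnCompletion(fenceValue, event);
    WaitForSingleObject(event, INFINITE);
    CloseHandle(event);
}
```

For GPU‑to‑GPU synchronization (e.g. a render pass consuming the ML output), `ID3D12CommandQueue::Wait` on the same fence avoids stalling the CPU.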

Performance Considerations

Sample Code

Below is a minimal sketch that creates and compiles a convolution operator and records its dispatch on the GPU. Operator initialization and resource binding (via IDMLBindingTable) are elided for brevity.

#include <DirectML.h>
#include <d3d12.h>

// Initialize the D3D12 device and command queue (helpers elided).
ID3D12Device* device = CreateD3D12Device();
ID3D12CommandQueue* queue = CreateCommandQueue(device);

// Create the DirectML device on top of the D3D12 device.
IDMLDevice* dmlDevice = nullptr;
DMLCreateDevice(device, DML_CREATE_DEVICE_FLAG_NONE, IID_PPV_ARGS(&dmlDevice));

// Describe the tensors. DirectML tensors are D3D12 buffers, described by a
// DML_BUFFER_TENSOR_DESC wrapped in a DML_TENSOR_DESC.
UINT inputSizes[4]  = {1, 1, 32, 32};   // NCHW
UINT filterSizes[4] = {1, 1, 3, 3};
UINT outputSizes[4] = {1, 1, 30, 30};

DML_BUFFER_TENSOR_DESC inputBuffer  = { DML_TENSOR_DATA_TYPE_FLOAT32,
    DML_TENSOR_FLAG_NONE, 4, inputSizes,  nullptr, sizeof(float) * 32 * 32, 0 };
DML_BUFFER_TENSOR_DESC filterBuffer = { DML_TENSOR_DATA_TYPE_FLOAT32,
    DML_TENSOR_FLAG_NONE, 4, filterSizes, nullptr, sizeof(float) * 3 * 3, 0 };
DML_BUFFER_TENSOR_DESC outputBuffer = { DML_TENSOR_DATA_TYPE_FLOAT32,
    DML_TENSOR_FLAG_NONE, 4, outputSizes, nullptr, sizeof(float) * 30 * 30, 0 };

DML_TENSOR_DESC inputDesc  = { DML_TENSOR_TYPE_BUFFER, &inputBuffer };
DML_TENSOR_DESC filterDesc = { DML_TENSOR_TYPE_BUFFER, &filterBuffer };
DML_TENSOR_DESC outputDesc = { DML_TENSOR_TYPE_BUFFER, &outputBuffer };

// Describe and compile the convolution operator.
UINT strides[]       = {1, 1};
UINT dilations[]     = {1, 1};
UINT padding[]       = {0, 0};
UINT outputPadding[] = {0, 0};

DML_CONVOLUTION_OPERATOR_DESC convDesc = {};
convDesc.InputTensor     = &inputDesc;
convDesc.FilterTensor    = &filterDesc;
convDesc.BiasTensor      = nullptr;
convDesc.OutputTensor    = &outputDesc;
convDesc.Mode            = DML_CONVOLUTION_MODE_CROSS_CORRELATION;
convDesc.Direction       = DML_CONVOLUTION_DIRECTION_FORWARD;
convDesc.DimensionCount  = 2;              // spatial dimensions
convDesc.Strides         = strides;
convDesc.Dilations       = dilations;
convDesc.StartPadding    = padding;
convDesc.EndPadding      = padding;
convDesc.OutputPadding   = outputPadding;
convDesc.GroupCount      = 1;
convDesc.FusedActivation = nullptr;        // no fused activation

DML_OPERATOR_DESC opDesc = { DML_OPERATOR_CONVOLUTION, &convDesc };
IDMLOperator* op = nullptr;
dmlDevice->CreateOperator(&opDesc, IID_PPV_ARGS(&op));

IDMLCompiledOperator* convOp = nullptr;
dmlDevice->CompileOperator(op, DML_EXECUTION_FLAG_NONE, IID_PPV_ARGS(&convOp));

// Record the dispatch. Operator initialization and binding the input, filter,
// and output buffers through bindingTable (an IDMLBindingTable) are elided.
ID3D12GraphicsCommandList* cmdList = CreateCommandList(device);   // helper elided
IDMLCommandRecorder* recorder = nullptr;
dmlDevice->CreateCommandRecorder(IID_PPV_ARGS(&recorder));
recorder->RecordDispatch(cmdList, convOp, bindingTable);

// Submit and signal (fence created earlier via device->CreateFence).
cmdList->Close();
ID3D12CommandList* lists[] = { cmdList };
queue->ExecuteCommandLists(1, lists);
queue->Signal(fence, ++fenceValue);

Further Reading