DirectML Architecture

Overview

DirectML (Direct Machine Learning) provides a high‑performance, low‑level API for machine learning workloads on Windows. It integrates tightly with DirectX 12, allowing GPU‑accelerated execution of neural network graphs with minimal overhead.

Key Architectural Concepts

Graph Execution Model

DirectML represents a neural network as a graph of operators (nodes) connected by tensors. The runtime compiles the graph into a sequence of GPU commands and executes them as a single dispatch.

// Schematic sketch: a graph of operator nodes and tensor edges is described
// with DML_GRAPH_DESC and compiled into a single IDMLCompiledOperator via
// IDMLDevice1::CompileGraph, then recorded as one dispatch.
DML_GRAPH_DESC graphDesc = {...};   // operator nodes + input/output edges
IDMLCompiledOperator* graph = nullptr;
dmlDevice1->CompileGraph(&graphDesc, DML_EXECUTION_FLAG_NONE, IID_PPV_ARGS(&graph));
commandRecorder->RecordDispatch(commandList, graph, bindingTable);

Layer Abstractions

DirectX 12 Integration

DirectML builds on DirectX 12’s resource management and command queues. Tensor data lives in ordinary ID3D12Resource buffers, so the same resources can be shared with other DirectX 12 work on the device, enabling hybrid graphics‑ML workloads.

// Example: binding an existing D3D12 buffer resource as a tensor input.
// (DirectML tensors are backed by buffers, not textures; texture data
// must first be copied into a buffer.)
ID3D12Resource* sharedBuffer = ...;   // buffer also used by other GPU work
DML_BUFFER_BINDING bufferBinding = { sharedBuffer, 0, sharedBuffer->GetDesc().Width };
DML_BINDING_DESC bindingDesc = { DML_BINDING_TYPE_BUFFER, &bufferBinding };
bindingTable->BindInputs(1, &bindingDesc);

Execution Pipeline

  1. Resource Allocation – Allocate GPU buffers or textures for tensors.
  2. Graph Construction – Define operators and their connections.
  3. Compilation – DirectML compiles the graph into optimized GPU commands.
  4. Dispatch – Submit the compiled command list to a DirectX 12 command queue.
  5. Synchronization – Use fences or events to synchronize with other GPU work.
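The synchronization step above can be sketched with a standard D3D12 fence. This is a fragment, not a complete program: `device` and `queue` are assumed to exist from earlier setup.

```cpp
// Create a fence once during setup.
ID3D12Fence* fence = nullptr;
device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
UINT64 fenceValue = 0;

// After submitting the DirectML command list, signal the fence on the queue.
queue->Signal(fence, ++fenceValue);

// Block the CPU until the GPU has reached the signaled value.
if (fence->GetCompletedValue() < fenceValue)
{
    HANDLE event = CreateEvent(nullptr, FALSE, FALSE, nullptr);
    fence->SetEventOnCompletion(fenceValue, event);
    WaitForSingleObject(event, INFINITE);
    CloseHandle(event);
}
```

For GPU‑to‑GPU synchronization (e.g. a render pass consuming the ML output), `ID3D12CommandQueue::Wait` on the same fence avoids stalling the CPU.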

Performance Considerations

Sample Code

Below is a minimal sketch that creates and compiles a convolution operator and records its dispatch on the GPU. Operator initialization and resource binding (via IDMLBindingTable) are elided for brevity.

#include <DirectML.h>
#include <d3d12.h>

// Initialize the D3D12 device and command queue (helpers elided).
ID3D12Device* device = CreateD3D12Device();
ID3D12CommandQueue* queue = CreateCommandQueue(device);

// Create the DirectML device on top of the D3D12 device.
IDMLDevice* dmlDevice = nullptr;
DMLCreateDevice(device, DML_CREATE_DEVICE_FLAG_NONE, IID_PPV_ARGS(&dmlDevice));

// Describe the tensors. DirectML tensors are D3D12 buffers, described by a
// DML_BUFFER_TENSOR_DESC wrapped in a DML_TENSOR_DESC.
UINT inputSizes[4]  = {1, 1, 32, 32};   // NCHW
UINT filterSizes[4] = {1, 1, 3, 3};
UINT outputSizes[4] = {1, 1, 30, 30};

DML_BUFFER_TENSOR_DESC inputBuffer  = { DML_TENSOR_DATA_TYPE_FLOAT32,
    DML_TENSOR_FLAG_NONE, 4, inputSizes,  nullptr, sizeof(float) * 32 * 32, 0 };
DML_BUFFER_TENSOR_DESC filterBuffer = { DML_TENSOR_DATA_TYPE_FLOAT32,
    DML_TENSOR_FLAG_NONE, 4, filterSizes, nullptr, sizeof(float) * 3 * 3, 0 };
DML_BUFFER_TENSOR_DESC outputBuffer = { DML_TENSOR_DATA_TYPE_FLOAT32,
    DML_TENSOR_FLAG_NONE, 4, outputSizes, nullptr, sizeof(float) * 30 * 30, 0 };

DML_TENSOR_DESC inputDesc  = { DML_TENSOR_TYPE_BUFFER, &inputBuffer };
DML_TENSOR_DESC filterDesc = { DML_TENSOR_TYPE_BUFFER, &filterBuffer };
DML_TENSOR_DESC outputDesc = { DML_TENSOR_TYPE_BUFFER, &outputBuffer };

// Describe and compile the convolution operator.
UINT strides[]       = {1, 1};
UINT dilations[]     = {1, 1};
UINT padding[]       = {0, 0};
UINT outputPadding[] = {0, 0};

DML_CONVOLUTION_OPERATOR_DESC convDesc = {};
convDesc.InputTensor     = &inputDesc;
convDesc.FilterTensor    = &filterDesc;
convDesc.BiasTensor      = nullptr;
convDesc.OutputTensor    = &outputDesc;
convDesc.Mode            = DML_CONVOLUTION_MODE_CROSS_CORRELATION;
convDesc.Direction       = DML_CONVOLUTION_DIRECTION_FORWARD;
convDesc.DimensionCount  = 2;              // spatial dimensions
convDesc.Strides         = strides;
convDesc.Dilations       = dilations;
convDesc.StartPadding    = padding;
convDesc.EndPadding      = padding;
convDesc.OutputPadding   = outputPadding;
convDesc.GroupCount      = 1;
convDesc.FusedActivation = nullptr;        // no fused activation

DML_OPERATOR_DESC opDesc = { DML_OPERATOR_CONVOLUTION, &convDesc };
IDMLOperator* op = nullptr;
dmlDevice->CreateOperator(&opDesc, IID_PPV_ARGS(&op));

IDMLCompiledOperator* convOp = nullptr;
dmlDevice->CompileOperator(op, DML_EXECUTION_FLAG_NONE, IID_PPV_ARGS(&convOp));

// Record the dispatch. Operator initialization and binding the input, filter,
// and output buffers through bindingTable (an IDMLBindingTable) are elided.
ID3D12GraphicsCommandList* cmdList = CreateCommandList(device);   // helper elided
IDMLCommandRecorder* recorder = nullptr;
dmlDevice->CreateCommandRecorder(IID_PPV_ARGS(&recorder));
recorder->RecordDispatch(cmdList, convOp, bindingTable);

// Submit and signal (fence created earlier via device->CreateFence).
cmdList->Close();
ID3D12CommandList* lists[] = { cmdList };
queue->ExecuteCommandLists(1, lists);
queue->Signal(fence, ++fenceValue);

Further Reading