Compute Shader Stages
Overview
Compute shaders in DirectCompute offer a highly flexible and programmable way to perform general-purpose computations on the GPU. Unlike traditional graphics shaders, compute shaders are not tied to the graphics pipeline's fixed stages. They operate on arbitrary data and can be used for a wide range of tasks, from physics simulations and image processing to data analysis and machine learning.
The compute shader model introduces a new set of concepts and programming paradigms centered around threads, thread groups, and shared memory. This section delves into the fundamental stages and components that define the execution of a compute shader.
Shader Stages
While not a fixed pipeline in the same way as graphics rendering, compute shaders do have distinct conceptual stages that define their execution flow:
- Shader Input: The compute shader receives input data through various mechanisms, including constant buffers, structured buffers, typed buffers, textures bound as shader resource views (SRVs), and unordered access views (UAVs).
- Shader Execution: The core logic of the compute shader is executed by individual threads. Each thread processes a portion of the input data and performs computations.
- Thread Group Synchronization: Threads within a single thread group can synchronize their execution and share data using thread group shared memory. This is crucial for algorithms that require coordination among threads in a group.
- Shader Output: Results are typically written to buffers or textures bound as unordered access views (UAVs), which permit random-access writes from the shader.
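As a sketch of these stages working together, the following minimal HLSL kernel (resource names and the neighbor computation are illustrative) reads from an SRV, cooperates through thread group shared memory with a barrier, and writes its result to a UAV:

```hlsl
StructuredBuffer<float>   inputBuffer  : register(t0); // SRV input
RWStructuredBuffer<float> outputBuffer : register(u0); // UAV output

groupshared float tile[64]; // thread group shared memory

[numthreads(64, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
{
    // Shader input: each thread loads one element into shared memory
    tile[gtid.x] = inputBuffer[dtid.x];

    // Thread group synchronization: wait until the whole group has loaded
    GroupMemoryBarrierWithGroupSync();

    // Shader execution: combine with a neighbor's value (wrap within the group)
    float neighbor = tile[(gtid.x + 1) % 64];

    // Shader output: write the result out through the UAV
    outputBuffer[dtid.x] = tile[gtid.x] + neighbor;
}
```

Without the barrier, a thread could read `tile[(gtid.x + 1) % 64]` before its neighbor has written it, which is why the synchronization stage sits between the load and the use.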
Execution Model
DirectCompute's execution model is based on a hierarchical structure:
- Threads: The smallest unit of execution. Each thread runs a copy of the compute shader code.
- Thread Groups: A collection of threads that can cooperate and synchronize. Threads within a group can share data via groupshared memory and synchronize their execution using GroupMemoryBarrierWithGroupSync().
- Dispatch: The process of launching compute shaders. The application specifies the dimensions of a 3D grid of thread groups (dispatch width, height, depth) and the dimensions of each thread group (group size X, Y, Z, declared with the [numthreads] attribute).
The total number of threads launched is (dispatch width * group size X) * (dispatch height * group size Y) * (dispatch depth * group size Z).
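For a concrete instance of this arithmetic, consider a hypothetical dispatch of 16 × 16 × 1 thread groups with 8 × 8 × 1 threads per group:

```hlsl
// A Dispatch(16, 16, 1) call against this kernel launches:
//   (16 * 8) * (16 * 8) * (1 * 1) = 128 * 128 * 1 = 16,384 threads
[numthreads(8, 8, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // dtid.x ranges over 0..127, dtid.y over 0..127, dtid.z is always 0
}
```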
Key Concepts
- SV_DispatchThreadID: A system-generated semantic that provides a unique, global 3D ID for each thread across all dispatched thread groups.
- SV_GroupID: A system-generated semantic that identifies the thread group to which a thread belongs within the dispatch grid.
- SV_GroupThreadID: A system-generated semantic that provides a unique 3D ID for a thread within its own thread group.
- SV_GroupIndex: A system-generated semantic that provides the flattened, scalar index of a thread within its own thread group.
- Tessellation semantics such as SV_TessFactor, SV_InsideTessFactor, and SV_DomainLocation belong to the hull and domain shader stages and are not used in compute shaders.
- Thread Group Shared Memory: A fast, on-chip memory region accessible by all threads within a single thread group, declared with the groupshared keyword.
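The thread-identification semantics above can all be declared as inputs to the same entry point. A minimal sketch (the relationships in the comments follow from the [numthreads] declaration):

```hlsl
[numthreads(8, 8, 1)]
void CSMain(
    uint3 groupID          : SV_GroupID,          // which group in the dispatch grid
    uint3 groupThreadID    : SV_GroupThreadID,    // 3D position within the group
    uint  groupIndex       : SV_GroupIndex,       // flattened within-group index
    uint3 dispatchThreadID : SV_DispatchThreadID) // global 3D position
{
    // For any thread, these IDs are related by:
    //   dispatchThreadID = groupID * uint3(8, 8, 1) + groupThreadID
    //   groupIndex = groupThreadID.z * 8 * 8
    //              + groupThreadID.y * 8
    //              + groupThreadID.x
}
```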
Example Compute Shader Snippet (HLSL)
This snippet demonstrates basic thread identification and writing to an output buffer.
RWStructuredBuffer<float4> outputBuffer : register(u0);
// Define the thread group size
[numthreads(8, 8, 1)]
void CSMain(
uint3 dispatchThreadID : SV_DispatchThreadID
)
{
// Flatten the 2D thread ID into a linear buffer index.
// 1024 here must equal the total thread count along X,
// i.e. dispatch width (in groups) * group size X.
uint index = dispatchThreadID.x + dispatchThreadID.y * 1024;
// Write some data to the output buffer
outputBuffer[index] = float4(
dispatchThreadID.x / 1024.0f,
dispatchThreadID.y / 1024.0f,
0.0f,
1.0f
);
}