Compute Shader Stages

Overview

Compute shaders in DirectCompute offer a highly flexible and programmable way to perform general-purpose computations on the GPU. Unlike traditional graphics shaders, compute shaders are not tied to the graphics pipeline's fixed stages. They operate on arbitrary data and can be used for a wide range of tasks, from physics simulations and image processing to data analysis and machine learning.

The compute shader model introduces a new set of concepts and programming paradigms centered around threads, thread groups, and shared memory. This section delves into the fundamental stages and components that define the execution of a compute shader.

Shader Stages

While not a fixed pipeline in the same way as graphics rendering, compute shaders do have distinct conceptual stages that define their execution flow:

  • Shader Input: The compute shader receives input data through several mechanisms: constant buffers, structured buffers, typed buffers, textures bound as shader resource views (SRVs), and unordered access views (UAVs).
  • Shader Execution: The core logic of the compute shader is executed by individual threads. Each thread processes a portion of the input data and performs computations.
  • Thread Group Synchronization: Threads within a single thread group can synchronize their execution and share data using thread group shared memory. This is crucial for algorithms that require coordination among threads in a group.
  • Shader Output: Results are typically written through unordered access views (UAVs) to buffers or textures, allowing the shader to store and manipulate data efficiently.
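The input and output mechanisms above might be declared in HLSL as follows (the register slots, names, and struct contents are illustrative, not prescribed by this section):

```hlsl
// Constant buffer: small, read-only parameters (illustrative names).
cbuffer SimulationParams : register(b0)
{
    float deltaTime;
    uint  elementCount;
};

// Structured buffer bound as an SRV: read-only input data.
StructuredBuffer<float4> inputPositions : register(t0);

// Texture bound as an SRV: read-only image input.
Texture2D<float4> inputTexture : register(t1);

// Structured buffer bound as a UAV: read/write output.
RWStructuredBuffer<float4> outputPositions : register(u0);
```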

Execution Model

DirectCompute's execution model is based on a hierarchical structure:

  • Threads: The smallest unit of execution. Each thread runs a copy of the compute shader code.
  • Thread Groups: A collection of threads that can cooperate and synchronize. Threads within a group can share data via thread group shared memory (declared with the groupshared keyword) and synchronize their execution using GroupMemoryBarrierWithGroupSync.
  • Dispatch: The process of launching compute shaders. The application specifies the dimensions of a 3D grid of thread groups (dispatch width, height, depth) and the dimensions of each thread group (group size X, Y, Z).

The total number of threads launched is (dispatch width * group size X) * (dispatch height * group size Y) * (dispatch depth * group size Z).
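As a concrete sketch of this arithmetic (the dispatch dimensions here are chosen purely for illustration):

```hlsl
// Group size: 8 x 8 x 1 = 64 threads per group.
[numthreads(8, 8, 1)]
void CSMain(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    // With a host-side call of context->Dispatch(16, 16, 1):
    //   total threads = (16 * 8) * (16 * 8) * (1 * 1)
    //                 = 128 * 128 * 1
    //                 = 16384
}
```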

Key Concepts

  • SV_DispatchThreadID: A system-generated semantic that provides a unique, global ID for each thread across all dispatched thread groups.
  • SV_GroupID: A system-generated semantic that identifies the specific thread group to which a thread belongs.
  • SV_GroupThreadID: A system-generated semantic that provides a unique ID for a thread within its own thread group.
  • SV_GroupIndex: A system-generated semantic that provides the flattened (one-dimensional) index of a thread within its thread group.
  • SV_TessFactor, SV_InsideTessFactor, SV_DomainLocation: Tessellation-stage semantics that belong to the graphics pipeline and are not used in compute shaders.
  • Thread Group Shared Memory: A fast, on-chip memory region accessible by all threads within a single thread group, declared in HLSL with the groupshared storage class.
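The interplay of groupshared memory, barriers, and the thread ID semantics can be sketched with a per-group sum reduction (the buffer bindings and the group size of 64 are illustrative assumptions):

```hlsl
StructuredBuffer<float> inputData : register(t0);
RWStructuredBuffer<float> partialSums : register(u0);

// One shared array per thread group; 64 matches the group size below.
groupshared float sharedData[64];

[numthreads(64, 1, 1)]
void CSReduce(uint3 dtid : SV_DispatchThreadID,
              uint3 gtid : SV_GroupThreadID,
              uint3 gid  : SV_GroupID)
{
    // Each thread loads one element into shared memory.
    sharedData[gtid.x] = inputData[dtid.x];
    GroupMemoryBarrierWithGroupSync();

    // Tree reduction: halve the number of active threads each step.
    for (uint stride = 32; stride > 0; stride >>= 1)
    {
        if (gtid.x < stride)
            sharedData[gtid.x] += sharedData[gtid.x + stride];
        GroupMemoryBarrierWithGroupSync();
    }

    // Thread 0 writes this group's partial sum.
    if (gtid.x == 0)
        partialSums[gid.x] = sharedData[0];
}
```

The barrier after each step ensures every thread sees the partial sums written by its neighbors before reading them in the next iteration.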

Example Compute Shader Snippet (HLSL)

This snippet demonstrates basic thread identification and writing to an output buffer.


RWStructuredBuffer<float4> outputBuffer : register(u0);

// Define the thread group size
[numthreads(8, 8, 1)]
void CSMain(
    uint3 dispatchThreadID : SV_DispatchThreadID
)
{
    // Flatten the 2D thread ID into a 1D buffer index.
    // 1024 is assumed to be the total dispatch width in threads
    // (e.g. 128 thread groups in X with this 8-thread-wide group size).
    uint index = dispatchThreadID.x + dispatchThreadID.y * 1024;

    // Write some data to the output buffer
    outputBuffer[index] = float4(
        dispatchThreadID.x / 1024.0f,
        dispatchThreadID.y / 1024.0f,
        0.0f,
        1.0f
    );
}