Compute Shader Patterns

Compute shaders offer a powerful and flexible way to leverage the GPU for general-purpose computation. They go beyond traditional graphics rendering pipelines, enabling complex algorithms and massive parallelization. This guide explores common and effective patterns for using compute shaders in DirectX.

1. Parallel Data Transformation

A fundamental use case for compute shaders is applying a transformation to each element of a large dataset in parallel. This is ideal for tasks like particle simulations, image filtering, and physics calculations.

Example: Per-particle position updates.

2. Reductions

Compute shaders can efficiently perform reduction operations (e.g., sum, minimum, maximum) across large datasets. This typically involves a multi-pass approach where intermediate results are progressively combined.

Example: Calculating the sum of all elements in a large array.

3. Atomic Operations

When multiple threads need to safely access and modify shared memory, atomic operations are crucial. They ensure that operations are performed indivisibly, preventing race conditions.

Example: Thread-safe counting or updating shared resources.

4. Wave Operations

Modern GPUs support "wave" or "warp" operations, allowing threads within a wave to communicate and perform operations collectively. This is useful for more complex parallel patterns.

Example: Shuffle operations, ballot queries.
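
As a sketch of how wave intrinsics can replace shared-memory traffic, the following kernel sums a value across each wave with `WaveActiveSum` and lets only the first active lane write the result. It assumes Shader Model 6.0 or later; the resource names are illustrative.

StructuredBuffer<float> g_Values : register(t0);
RWStructuredBuffer<float> g_WaveSums : register(u0);

[numthreads(64, 1, 1)]
void CS_WaveSum(uint3 dispatchThreadID : SV_DispatchThreadID) {
    float v = g_Values[dispatchThreadID.x];

    // Every lane in the wave receives the sum of v across the wave,
    // with no groupshared memory or barriers required.
    float waveTotal = WaveActiveSum(v);

    // Only the first active lane writes the wave's result.
    if (WaveIsFirstLane()) {
        g_WaveSums[dispatchThreadID.x / WaveGetLaneCount()] = waveTotal;
    }
}

Because the wave size varies by hardware (commonly 32 or 64), portable code should query it with `WaveGetLaneCount()` rather than hardcode it.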

5. Image Processing and Filtering

Compute shaders excel at per-pixel operations, making them perfect for image manipulation tasks like blurring, sharpening, color correction, and applying complex filters.

Example: Applying a Gaussian blur to a texture.
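
A Gaussian blur is usually implemented as two separable passes (horizontal, then vertical). The sketch below shows a single horizontal pass with hardcoded 5-tap weights; the resource names and kernel weights are illustrative.

Texture2D<float4> g_Source : register(t0);
RWTexture2D<float4> g_Dest : register(u0);

// 5-tap Gaussian weights (sigma ~ 1), normalized to sum to ~1
static const float g_Weights[5] = { 0.0614, 0.2448, 0.3877, 0.2448, 0.0614 };

[numthreads(8, 8, 1)]
void CS_BlurHorizontal(uint3 id : SV_DispatchThreadID) {
    uint width, height;
    g_Source.GetDimensions(width, height);

    float4 sum = 0;
    [unroll]
    for (int i = -2; i <= 2; ++i) {
        // Clamp to the texture edge to avoid reading out of bounds
        int x = clamp(int(id.x) + i, 0, int(width) - 1);
        sum += g_Source[uint2(x, id.y)] * g_Weights[i + 2];
    }
    g_Dest[id.xy] = sum;
}

A matching vertical pass over the intermediate texture completes the blur; splitting the filter this way reduces the taps per pixel from k² to 2k.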

6. Simulation and Physics

From fluid dynamics to rigid body simulations, compute shaders can simulate complex physical interactions by updating object states in parallel based on interaction rules.

Example: N-body simulations, particle-based fluid simulations.
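
As a sketch of the N-body case, the kernel below computes brute-force O(N²) gravity with a softening term to avoid division by zero, and double-buffers the bodies so every thread reads a consistent previous state. The parameter and resource names are illustrative.

struct Body {
    float3 position;
    float3 velocity;
    float  mass;
};

cbuffer NBodyParams : register(b0) {
    float g_DeltaTime;
    uint  g_BodyCount;
};

StructuredBuffer<Body>   g_BodiesIn  : register(t0);
RWStructuredBuffer<Body> g_BodiesOut : register(u0);

[numthreads(64, 1, 1)]
void CS_NBodyStep(uint3 id : SV_DispatchThreadID) {
    if (id.x >= g_BodyCount) return;

    Body self = g_BodiesIn[id.x];
    float3 accel = 0;

    // Accumulate gravitational pull from every body
    for (uint j = 0; j < g_BodyCount; ++j) {
        float3 d = g_BodiesIn[j].position - self.position;
        float distSq = dot(d, d) + 1e-4f; // softening avoids divide-by-zero
        accel += g_BodiesIn[j].mass * d / (distSq * sqrt(distSq));
    }

    // Semi-implicit Euler integration; read one buffer, write the other
    self.velocity += accel * g_DeltaTime;
    self.position += self.velocity * g_DeltaTime;
    g_BodiesOut[id.x] = self;
}

Reading from one buffer and writing to another avoids the read-after-write hazards that would occur if threads updated positions in place while neighbors were still reading them.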

Parallel Data Transformation

This pattern involves dispatching a compute shader over a grid of threads, where each thread processes a distinct piece of data. Input data typically lives in structured buffers bound through SRVs (Shader Resource Views) or UAVs (Unordered Access Views), and results are written through a UAV, either to a separate output buffer or in place.

// Example HLSL structure for input/output data
struct Particle {
    float2 position;
    float2 velocity;
};

cbuffer SimParams : register(b0) {
    float g_deltaTime;
    uint  g_ParticleCount;
};

RWStructuredBuffer<Particle> g_Particles : register(u0);

[numthreads(64, 1, 1)]
void CS_UpdateParticles(uint3 dispatchThreadID : SV_DispatchThreadID) {
    // Ensure we don't go out of bounds (the dispatch is rounded up
    // to a multiple of the thread-group size)
    if (dispatchThreadID.x >= g_ParticleCount) {
        return;
    }

    Particle p = g_Particles[dispatchThreadID.x];

    // Apply physics simulation (e.g., update position based on velocity)
    p.position += p.velocity * g_deltaTime;

    // Store the updated particle
    g_Particles[dispatchThreadID.x] = p;
}

Reductions

Reductions are often implemented in multiple passes. The first pass may process data in larger chunks, writing intermediate results. Subsequent passes combine these intermediate results until a single final value is obtained. Shared memory is often utilized to accelerate intermediate combinations within a thread group.

// Simplified example of a parallel sum reduction. Each thread group
// reduces 256 elements in shared memory and writes one partial sum;
// a subsequent pass (or an atomic add) combines the partial sums.
StructuredBuffer<float> g_InputData : register(t0);
RWStructuredBuffer<float> g_PartialSums : register(u0);

groupshared float s_Data[256];

[numthreads(256, 1, 1)]
void CS_ReduceSum(uint3 dispatchThreadID : SV_DispatchThreadID,
                  uint3 groupThreadID : SV_GroupThreadID,
                  uint3 groupID : SV_GroupID) {
    // Each thread loads one element into shared memory
    s_Data[groupThreadID.x] = g_InputData[dispatchThreadID.x];
    GroupMemoryBarrierWithGroupSync(); // Wait for all loads to complete

    // Tree reduction: halve the number of active threads each step.
    // Note that unsynchronized writes to a single shared value
    // (e.g., sharedSum += x) would be a race condition.
    for (uint stride = 128; stride > 0; stride >>= 1) {
        if (groupThreadID.x < stride) {
            s_Data[groupThreadID.x] += s_Data[groupThreadID.x + stride];
        }
        GroupMemoryBarrierWithGroupSync();
    }

    // The first thread in the group writes the group's total sum
    if (groupThreadID.x == 0) {
        g_PartialSums[groupID.x] = s_Data[0];
    }
}

Atomic Operations

DirectX provides atomic functions like `InterlockedAdd`, `InterlockedExchange`, `InterlockedCompareExchange`, etc. These are vital for scenarios where multiple threads might try to modify the same memory location concurrently.

RWBuffer<uint> g_Counter : register(u0);

[numthreads(32, 1, 1)]
void CS_IncrementCounter(uint3 dispatchThreadID : SV_DispatchThreadID) {
    // Safely increment the counter
    InterlockedAdd(g_Counter[0], 1);
}
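
A common application of `InterlockedAdd` is stream compaction: each thread that produces an output reserves a unique slot by atomically bumping a shared counter. The three-argument form returns the counter's value before the add, which serves as the reserved index. The resource names below are illustrative.

StructuredBuffer<float>   g_Input  : register(t0);
RWStructuredBuffer<float> g_Output : register(u0);
RWBuffer<uint>            g_Count  : register(u1);

[numthreads(64, 1, 1)]
void CS_CompactPositive(uint3 dispatchThreadID : SV_DispatchThreadID) {
    float value = g_Input[dispatchThreadID.x];

    // Keep only positive values; each surviving thread atomically
    // reserves the next free slot in the output buffer.
    if (value > 0.0f) {
        uint slot;
        InterlockedAdd(g_Count[0], 1, slot); // slot = value before the add
        g_Output[slot] = value;
    }
}

Note that this does not preserve the input order of the surviving elements; if order matters, a prefix-sum-based compaction is the usual alternative.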