Compute Shaders: Basic Usage

Compute shaders offer a powerful way to leverage the parallel processing capabilities of the GPU for general-purpose computation, not just graphics rendering. This guide introduces the fundamental concepts and steps involved in using compute shaders.

What are Compute Shaders?

Traditionally, shaders process vertices and pixels for rendering graphics. Compute shaders, by contrast, are designed for arbitrary computation. They run on the GPU and operate on many data elements in parallel, which makes them well suited to data-parallel workloads.

Key Concepts

Thread Groups

Compute shaders execute in units called thread groups. A thread group is a collection of threads that can cooperate and share data through group-shared memory (declared with the groupshared keyword in HLSL). All threads in a group execute the same shader code, but each can operate on a different data element.

Threads

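Each thread within a thread group runs the compute shader program on its own data. A thread is identified by three indices, which HLSL exposes as system-value semantics: its global index across the whole dispatch (SV_DispatchThreadID), its local index within its group (SV_GroupThreadID), and the index of the group it belongs to (SV_GroupID).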
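How the global, local, and group indices relate can be sketched on the CPU. The C++ below is an illustration only, not GPU code: the UInt3 struct and both function names are hypothetical, chosen to mirror HLSL's uint3 and the semantics SV_DispatchThreadID, SV_GroupID, SV_GroupThreadID, and SV_GroupIndex.

```cpp
#include <cassert>
#include <cstdint>

// Three-component unsigned vector, mirroring HLSL's uint3 (hypothetical helper).
struct UInt3 {
    std::uint32_t x, y, z;
};

// Global thread index (SV_DispatchThreadID), computed from the group index
// (SV_GroupID), the group size given to [numthreads], and the index of the
// thread inside its group (SV_GroupThreadID).
UInt3 DispatchThreadId(UInt3 groupId, UInt3 groupSize, UInt3 groupThreadId) {
    return UInt3{groupId.x * groupSize.x + groupThreadId.x,
                 groupId.y * groupSize.y + groupThreadId.y,
                 groupId.z * groupSize.z + groupThreadId.z};
}

// Flattened index of a thread inside its group (SV_GroupIndex).
std::uint32_t GroupIndex(UInt3 groupSize, UInt3 groupThreadId) {
    return groupThreadId.z * groupSize.x * groupSize.y +
           groupThreadId.y * groupSize.x +
           groupThreadId.x;
}
```

With a group size of (64, 1, 1), thread 5 of group 2 has the global index 2 * 64 + 5 = 133 in X, so it is the thread that would process element 133 of a one-dimensional buffer.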

Unordered Access Views (UAVs)

Compute shaders write their results to memory through Unordered Access Views (UAVs). A UAV allows both read and write access to a resource such as a buffer or texture, which is what lets a shader modify data on the GPU. Inputs that are only read are usually bound through Shader Resource Views (SRVs) instead.

Steps to Using Compute Shaders

1. Create and Compile the Compute Shader

Compute shaders are written in High-Level Shading Language (HLSL) and compiled into shader bytecode. The entry point is a function marked with the [numthreads(x, y, z)] attribute, which sets the number of threads in each thread group along each dimension.

Example HLSL Compute Shader (Basic Vector Addition)


// A simple compute shader for vector addition: C = A + B

// Number of threads in each thread group along X.
// The number of groups launched is chosen separately, by the Dispatch call.
#define THREAD_GROUP_SIZE 64

// Resources are created in C++ or C# and bound before dispatch.
// The inputs are only read, so they are bound as SRVs (t registers).
// The output is written, so it is bound as a UAV (u register).
StructuredBuffer<float> InputBufferA : register(t0);
StructuredBuffer<float> InputBufferB : register(t1);
RWStructuredBuffer<float> OutputBufferC : register(u0);

// Constant buffer holding the total number of elements.
// The name "Params" is illustrative.
cbuffer Params : register(b0)
{
    uint DATA_SIZE;
};

[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void CSMain(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    // dispatchThreadID.x is the global index of the current thread.
    // The last group extends past the data when DATA_SIZE is not a
    // multiple of THREAD_GROUP_SIZE, so out-of-range threads do nothing.
    if (dispatchThreadID.x < DATA_SIZE)
    {
        // Read from the input buffers (A and B)
        float a = InputBufferA[dispatchThreadID.x];
        float b = InputBufferB[dispatchThreadID.x];

        // Perform the computation and write to the output buffer (C)
        OutputBufferC[dispatchThreadID.x] = a + b;
    }
}
            

2. Create GPU Resources

You'll need to create GPU resources (buffers or textures) that the compute shader will read from and write to. These resources must be created with the appropriate flags to allow Shader Resource Views (SRV) for reading and Unordered Access Views (UAV) for writing.

In C++ (using Direct3D 11):


// Example: create a structured buffer of floats for input data.
// Assumes 'device' (ID3D11Device*), 'dataArray' (const float*) and
// 'numElements' (UINT) already exist. Check every HRESULT in real code.
D3D11_BUFFER_DESC bufferDesc = {};
bufferDesc.Usage = D3D11_USAGE_DEFAULT;
bufferDesc.ByteWidth = sizeof(float) * numElements;
bufferDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
bufferDesc.CPUAccessFlags = 0;
bufferDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
bufferDesc.StructureByteStride = sizeof(float);

// The pitch fields are ignored for buffers and stay zero.
D3D11_SUBRESOURCE_DATA initialData = {};
initialData.pSysMem = dataArray;

ID3D11Buffer* inputBuffer = nullptr;
HRESULT hr = device->CreateBuffer(&bufferDesc, &initialData, &inputBuffer);

// Create an SRV for reading the buffer.
// Structured buffer views use DXGI_FORMAT_UNKNOWN.
D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
srvDesc.Format = DXGI_FORMAT_UNKNOWN;
srvDesc.ViewDimension = D3D11_SRV_DIMENSION_BUFFER;
srvDesc.Buffer.FirstElement = 0; // Shares a union with ElementOffset
srvDesc.Buffer.NumElements = numElements;

ID3D11ShaderResourceView* inputSRV = nullptr;
hr = device->CreateShaderResourceView(inputBuffer, &srvDesc, &inputSRV);

// Create a UAV, needed only if a shader will also write to this buffer.
D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
uavDesc.Format = DXGI_FORMAT_UNKNOWN;
uavDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
uavDesc.Buffer.FirstElement = 0;
uavDesc.Buffer.NumElements = numElements;
uavDesc.Buffer.Flags = 0;

ID3D11UnorderedAccessView* inputUAV = nullptr;
hr = device->CreateUnorderedAccessView(inputBuffer, &uavDesc, &inputUAV);
            

Create the other input and output buffers the same way. An output buffer must include the unordered access bind flag and needs a UAV.

3. Bind Resources and Dispatch

Before dispatching the compute shader, you need to bind the compiled shader, the SRVs for reading, and the UAVs for writing to the compute stage of the pipeline. You also set any shader constants (like DATA_SIZE), which reach the shader through a constant buffer.
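One detail worth sketching is the CPU-side layout of the data behind DATA_SIZE. In Direct3D 11, the byte width of a constant buffer must be a multiple of 16, so a single 32-bit value needs padding. The struct and function names below are hypothetical, and the sketch covers only the memory layout, not the buffer creation or binding calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical CPU-side mirror of the constant data read by the shader.
struct ComputeParams {
    std::uint32_t dataSize;    // Maps to DATA_SIZE in the shader.
    std::uint32_t padding[3];  // Pads the struct to 16 bytes.
};

// Direct3D 11 rejects constant buffers whose size is not a multiple of 16.
static_assert(sizeof(ComputeParams) % 16 == 0,
              "constant buffer size must be a multiple of 16 bytes");

// Builds the parameter block for a dispatch over elementCount elements.
ComputeParams MakeComputeParams(std::uint32_t elementCount) {
    ComputeParams params = {};
    params.dataSize = elementCount;
    return params;
}
```

A struct like this would typically be uploaded into a buffer created with the constant buffer bind flag, using sizeof(ComputeParams) as the byte width, and then bound with CSSetConstantBuffers.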

The dispatch call specifies how many thread groups to launch. This is calculated based on the total amount of work and the thread group size defined in the shader.
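The group count calculation can be sketched as a small helper. This is an illustration under stated assumptions: the function and constant names are hypothetical, and the 65535 limit is the Direct3D 11 maximum number of thread groups per dispatch dimension. The helper rounds up with a division and a remainder test, which gives the same result as the common (n + size - 1) / size form but cannot overflow for element counts near the 32-bit limit.

```cpp
#include <cassert>
#include <cstdint>

// Direct3D 11 allows at most 65535 thread groups per dispatch dimension.
const std::uint32_t kMaxGroupsPerDimension = 65535;

// Number of thread groups needed to cover elementCount items when each
// group processes groupSize items. Rounds up so no element is left out.
std::uint32_t GroupCount(std::uint32_t elementCount, std::uint32_t groupSize) {
    assert(groupSize > 0);
    const std::uint32_t groups =
        elementCount / groupSize + (elementCount % groupSize != 0 ? 1u : 0u);
    assert(groups <= kMaxGroupsPerDimension);
    return groups;
}

// Threads in the last group that fall outside the data. The shader's
// bounds check must make these threads do nothing.
std::uint32_t IdleThreads(std::uint32_t elementCount, std::uint32_t groupSize) {
    return GroupCount(elementCount, groupSize) * groupSize - elementCount;
}
```

For 1024 elements and 64 threads per group this gives exactly 16 groups. For 1000 elements it still gives 16 groups, and 24 threads in the last group fall outside the data, which is why the shader compares its global index against DATA_SIZE before touching the buffers.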


// Assuming 'computeShaderStateObject' is your compiled compute shader
// and 'computeContext' is your device context (e.g., ID3D11DeviceContext)

// Bind the compute shader
computeContext->CSSetShader(computeShaderStateObject, nullptr, 0);

// Bind input SRVs
ID3D11ShaderResourceView* srvs[] = { inputSRV_A, inputSRV_B };
computeContext->CSSetShaderResources(0, 2, srvs); // Bind to slot 0 and 1

// Bind output UAVs
ID3D11UnorderedAccessView* uavs[] = { outputUAV_C };
UINT initialCounts[] = { 0 }; // For counter-based UAVs, not used here
computeContext->CSSetUnorderedAccessViews(0, 1, uavs, initialCounts); // Bind to slot 0

// Set uniform parameters (e.g., DATA_SIZE)
// computeContext->CSSetConstantBuffers(...) or similar methods

// Calculate the number of thread groups, rounding up so every element is covered
const UINT numElements = 1024; // Example total data elements
const UINT threadGroupSizeX = 64; // Must match THREAD_GROUP_SIZE in the shader
const UINT numGroupsX = (numElements + threadGroupSizeX - 1) / threadGroupSizeX;

// Dispatch the compute shader
computeContext->Dispatch(numGroupsX, 1, 1);

// After dispatch, you may need to:
// 1. Unbind the SRVs and UAVs by binding null views. A resource cannot be
//    bound as a UAV and as an SRV at the same time, so the output must be
//    unbound before a later pass can read it.
// 2. Optionally, copy results back to the CPU for processing or display.
// 3. Or, use the output buffer as input for a pixel shader or another compute pass.
            
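Important: The GPU executes commands asynchronously, so the results are not available the moment Dispatch returns. To read them on the CPU in Direct3D 11, copy the output buffer (the underlying resource, not the UAV) into a staging buffer created with staging usage and CPU read access, then map the staging buffer. By default the map call waits until the GPU has finished the copy. To check for completion without blocking, use an event query in Direct3D 11 or a fence in Direct3D 12.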

Example Output

Input Buffer A: [1.0, 2.0, 3.0, 4.0, ...]
Input Buffer B: [5.0, 6.0, 7.0, 8.0, ...]

Output Buffer C (after compute shader): [6.0, 8.0, 10.0, 12.0, ...]
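
A CPU reference is a convenient way to check values like these. The sketch below emulates the dispatch with two nested loops, one over thread groups and one over the threads in each group, and applies the same bounds check as the shader. The function and constant names are hypothetical, and the loops run sequentially, whereas the GPU runs the threads in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors THREAD_GROUP_SIZE in the shader.
const std::uint32_t kThreadGroupSize = 64;

// CPU emulation of Dispatch(numGroupsX, 1, 1) running the vector-add shader.
std::vector<float> EmulateVectorAdd(const std::vector<float>& a,
                                    const std::vector<float>& b) {
    assert(a.size() == b.size());
    const std::uint32_t dataSize = static_cast<std::uint32_t>(a.size());
    const std::uint32_t numGroupsX =
        (dataSize + kThreadGroupSize - 1) / kThreadGroupSize;
    std::vector<float> c(a.size(), 0.0f);

    for (std::uint32_t group = 0; group < numGroupsX; ++group) {
        for (std::uint32_t thread = 0; thread < kThreadGroupSize; ++thread) {
            // Equivalent of SV_DispatchThreadID.x for this thread.
            const std::uint32_t dispatchThreadId = group * kThreadGroupSize + thread;
            if (dispatchThreadId < dataSize) {  // Same bounds check as the shader.
                c[dispatchThreadId] = a[dispatchThreadId] + b[dispatchThreadId];
            }
        }
    }
    return c;
}
```

Comparing the buffer read back from the GPU against the vector returned by this function, element by element, is a simple way to validate a first compute shader.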

Next Steps

This basic example demonstrates the core workflow. For more advanced scenarios, refer to the Performance Considerations and Use Cases sections.