Poly Coding

Compute shaders in Unity

0. Introduction

1. Compute shaders

2. Basic Compute shaders in Unity

3. Compute buffers in Unity

Note: All code discussed in this tutorial can be downloaded at the end of each corresponding section.

0. Introduction

When developing games, we are sometimes forced to do a lot of calculations, be it in pathfinding, procedural terrain generation, simulations, and so on. If we were to do those calculations sequentially on the CPU, they could take a long time and possibly freeze our game for a few seconds. It would be handy if we could offload some of these calculations to the GPU, where we can do them in parallel.

1. Compute shaders

Compute shaders are programs that allow us to execute arbitrary code on the GPU. When executing a compute shader, the GPU creates workgroups. Workgroups can be thought of as smaller computers that run in parallel and are completely independent of each other.

Workgroups are not guaranteed to be executed in any particular order.

A single workgroup

Workgroups run in 3D space (“compute space”), and are as such logically divided into 3 dimensions.

Compute space consisting of 4x4x4 workgroups

Workgroups are in turn divided into threads, which again run in 3D space. These threads are responsible for actually executing our code. Threads are again independent of one another. In other words, every calculation done within a thread must be self-contained.

1.1 Example

Let’s say that we would like to create a 6x6 2D grid of squares with our compute shader. To do this, we decide that the best approach is to create workgroups containing 3x3x1 threads, so every workgroup takes care of 9 squares in total. (Remember, threads are laid out in 3 dimensions, but we can still make them 1D or 2D by simply using a dimension of size 1.) This means that, to cover the entire grid, we should use 4 workgroups, 2 along the width and 2 along the height, like so:

Compute space example

Every circle is a thread and its border a workgroup.
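
As a rough illustration of this partitioning, here is a sketch in plain C#. It is not part of the tutorial's project, and the class and method names are made up for this example. Along one axis, the workgroup and the local thread that handle a given square can be found with an integer division and a remainder:

```csharp
// Hypothetical helper for the 6x6 example: 3 threads per workgroup per axis.
public static class GridPartition
{
    public const int ThreadsPerAxis = 3;

    // Returns which workgroup covers a cell along one axis, and which
    // thread inside that workgroup handles it.
    public static (int group, int local) Locate(int cell)
    {
        return (cell / ThreadsPerAxis, cell % ThreadsPerAxis);
    }
}
```

For instance, the square at x = 4 is handled by workgroup 1, thread 1 along the x axis. With cells 0 to 5 the group index only takes the values 0 and 1, which gives the 2 workgroups per axis used in the example.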

2. Using compute shaders in Unity

2.1 Compute shaders

When creating a compute shader in Unity (Create > Shader > Compute Shader) we are presented with the following template:

#pragma kernel CSMain

RWTexture2D<float4> Result;

[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    Result[id.xy] = float4(id.x & id.y, (id.x & 15)/15.0, (id.y & 15)/15.0, 0.0);
}

Let’s go over each of the components.

#pragma kernel CSMain

This line indicates the function (kernel) in our compute shader that will be run when we execute the compute shader. Think of this as letting the compute shader know what its entry point is, kind of like an Awake/Start function in Unity. In this case our kernel is called “CSMain”, but we are not limited to a single kernel per shader. More on that later.

RWTexture2D<float4> Result;

Just as in C#, in compute shaders we can have variables and objects (structs). In this case we have a RWTexture2D<float4> array. The RW part indicates that this array is used for both reading and writing.

[numthreads(8,8,1)]

As explained above, workgroups are partitioned in a 3D space. Each workgroup executes a number of threads (also laid out in 3D space). The numthreads attribute tells the compute shader how many threads a workgroup contains in each dimension. In this case our workgroup consists of 8x8x1 threads.

void CSMain (uint3 id : SV_DispatchThreadID)

This is the function that will be executed per thread. Every thread receives an unsigned 3-component integer (uint3) id as a parameter, which indicates which thread we are currently running. For example, if we are running the very first thread, we would have (0, 0, 0) as id. The second would be (1, 0, 0), and so on.
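
In HLSL, SV_DispatchThreadID is the workgroup's index multiplied by numthreads, plus the thread's index inside its workgroup, per dimension. A small plain C# sketch of that arithmetic follows. The ThreadIds class is hypothetical and only mirrors what the GPU computes for us:

```csharp
public static class ThreadIds
{
    // Mirrors SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID.
    public static (uint x, uint y, uint z) DispatchThreadId(
        (uint x, uint y, uint z) groupId,
        (uint x, uint y, uint z) groupThreadId,
        (uint x, uint y, uint z) numThreads)
    {
        return (groupId.x * numThreads.x + groupThreadId.x,
                groupId.y * numThreads.y + groupThreadId.y,
                groupId.z * numThreads.z + groupThreadId.z);
    }
}
```

With numthreads(8,8,1), thread (0, 0, 0) of workgroup (1, 0, 0) gets the id (8, 0, 0).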

Result[id.xy] = float4(id.x & id.y, (id.x & 15)/15.0, (id.y & 15)/15.0, 0.0);

Finally, in the CSMain function we actually put our code and calculations. Here we simply write some float4 values to the Result array based on our id.

2.2 Executing a compute shader

Compute shaders are executed from a C# script, so let’s create a new Unity project, then create a new C# script called “TextureGenerator” (Create > C# Script) and a compute shader called “TextureCompute” (Create > Shader > Compute Shader).

To start off, we have to create a reference to our template compute shader in our TextureGenerator script.

public class TextureGenerator : MonoBehaviour
{

    public ComputeShader TextureShader;

}

Now, attach the TextureGenerator.cs to the main camera in the hierarchy, and assign the compute shader.

Main Camera

Additionally, to display the texture that will be filled by the compute shader, we have to create a RenderTexture variable. Initialize this variable in the OnRenderImage function.

public class TextureGenerator : MonoBehaviour
{
    public ComputeShader TextureShader;
    private RenderTexture _rTexture;

    private void OnRenderImage(RenderTexture source, RenderTexture destination) {
        if (_rTexture == null) { 
            _rTexture = new RenderTexture( Screen.width, Screen.height, 0, RenderTextureFormat.ARGBFloat, RenderTextureReadWrite.Linear);
            _rTexture.enableRandomWrite = true;
            _rTexture.Create();
        }
    }    
}

The OnRenderImage(source, destination) function allows us to write an image to the camera after the camera is done rendering.

Next, we have to make sure that our TextureShader has all its parameters set. Let’s start by setting the RWTexture2D Result array in the compute shader. To do this we have to use the .SetTexture method.

private void OnRenderImage(RenderTexture source, RenderTexture destination) {
    ...
    TextureShader.SetTexture(0, "Result", _rTexture);
}

In this line we tell the shader to bind _rTexture to the Result parameter of the first kernel, indicated by the 0 passed as the first argument.

As mentioned above, compute shaders can have multiple kernels. These kernels can be accessed by an ID, which is determined by the order in which the #pragma kernel lines are written.

#pragma kernel CSMain
#pragma kernel Random

Kernels example

In the example above, if we intend to use the “Random” entry point, we would use kernel = 1, since it is defined as the second kernel.
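
Put differently, a kernel's ID is simply its position in the list of #pragma kernel lines. The following plain C# sketch only imitates that ordering with an array. The KernelOrder class is hypothetical and is not how Unity implements the lookup:

```csharp
using System;

public static class KernelOrder
{
    // Kernel names in the order of their #pragma kernel lines.
    private static readonly string[] Kernels = { "CSMain", "Random" };

    // Returns the position of the kernel in the list, or -1 if no kernel has that name.
    public static int IndexOf(string name)
    {
        return Array.IndexOf(Kernels, name);
    }
}
```

Here “CSMain” maps to 0 and “Random” maps to 1, matching the order in which they are declared.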

A more readable way, however, is to ask the compute shader for the ID that belongs to a given kernel name, like so:

int kernel = TextureShader.FindKernel("CSMain");

For now, let us use the latter method, in which we use FindKernel.

int kernel = TextureShader.FindKernel("CSMain");
TextureShader.SetTexture(kernel, "Result", _rTexture);

One last thing we have to decide before executing the TextureShader is: how many workgroups should we start? We want to fill up the entire screen with a texture, so let’s use that as a starting point.

private void OnRenderImage(RenderTexture source, RenderTexture destination) {
    ...
    int workgroupsX = Screen.width;
    int workgroupsY = Screen.height;
    // No workgroupsZ is needed, as we intend to fill a 2D texture.
}

This would create a workgroup for every pixel of our screen. Remember that in our compute shader we have by default set numthreads (threads per workgroup) to 8x8x1. This means that with a screen size of 1920 by 907, we would currently create (1920 * 8 * 907 * 8 =) 111452160 threads, 64 times more than the number of pixels we need to cover.

Correct this by dividing both workgroup counts by 8, the number of threads we use per workgroup in that dimension. Additionally, the number of workgroups must be an integer, so round the result up to the nearest integer.

private void OnRenderImage(RenderTexture source, RenderTexture destination) {
    ...
    // Formula = Ceil(Screen dimension size / threads per group)
    int workgroupsX = Mathf.CeilToInt(Screen.width / 8.0f);
    int workgroupsY = Mathf.CeilToInt(Screen.height / 8.0f);
}
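
To double-check the numbers, here is a small sketch in plain C# that computes the workgroup count with integer arithmetic and the total number of threads a 2D dispatch launches. The DispatchSize class is hypothetical and not part of the project:

```csharp
public static class DispatchSize
{
    // Integer equivalent of Ceil(size / threadsPerGroup) for positive values.
    public static int GroupCount(int size, int threadsPerGroup)
    {
        return (size + threadsPerGroup - 1) / threadsPerGroup;
    }

    // Total number of threads launched by a 2D dispatch.
    public static long ThreadsLaunched(int groupsX, int groupsY, int threadsX, int threadsY)
    {
        return (long)groupsX * threadsX * groupsY * threadsY;
    }
}
```

For a 1920 by 907 screen this gives 240 by 114 workgroups and 1751040 threads, against 111452160 threads when dispatching one workgroup per pixel. The last row of workgroups reaches up to row 912, slightly past the 907 rows of the screen.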

Finally, dispatch the shader with the right kernel and the workgroup counts found above, and display our texture.

private void OnRenderImage(RenderTexture source, RenderTexture destination) {
    ...
    TextureShader.Dispatch(kernel, workgroupsX, workgroupsY, 1);
    Graphics.Blit(_rTexture, destination);
}

When starting the game we will now see the following image:

Render Result

Download: unity package with code

3. Compute buffers

While these results are generated very fast, when generating an image like this there is no real need for a compute shader. This can be done way faster with a regular unlit shader.

Download: unlit shader

The real power of compute shaders lies in being able to write arbitrary data to the GPU, do calculations on it, and return the results. To write data to the GPU, we need a compute buffer to act as a middleman. The usual pattern is this: we put some data in the compute buffer from a C# script, and the compute shader on the GPU reads this data and does calculations on it. Once the calculations are done, the new data is written to the buffer, where it is finally read back by our C# script.
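
In its most compact form, that round trip might look like the sketch below. It makes several assumptions that are not taken from the tutorial's project: the assigned compute shader is assumed to have a kernel named “CSMain” declared with [numthreads(64,1,1)] and a RWStructuredBuffer<float> named “_Data”, and the BufferRoundTrip class itself is hypothetical.

```csharp
using UnityEngine;

// Sketch only: assumes a kernel "CSMain" with [numthreads(64,1,1)]
// and a RWStructuredBuffer<float> called "_Data" in the assigned shader.
public class BufferRoundTrip : MonoBehaviour
{
    public ComputeShader Compute;

    private void Start()
    {
        float[] data = new float[256];

        // 1. Reserve GPU memory and upload the data.
        ComputeBuffer buffer = new ComputeBuffer(data.Length, sizeof(float));
        buffer.SetData(data);

        // 2. Bind the buffer and run the kernel on the GPU.
        int kernel = Compute.FindKernel("CSMain");
        Compute.SetBuffer(kernel, "_Data", buffer);
        Compute.Dispatch(kernel, Mathf.CeilToInt(data.Length / 64.0f), 1, 1);

        // 3. Read the results back into the C# array and free the buffer.
        buffer.GetData(data);
        buffer.Release();
    }
}
```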

3.1 Compute buffers Example

Let’s look at an example. To keep it short, I will be using a simplified version of an example from this YouTube video by Game Dev Guide.

The goal of the example is to create a grid of cubes that simply update their y position every second. To calculate the y position we do an expensive computation: we calculate a random number for a number of repetitions (e.g. 2000). In other words, per cube we will calculate a random y coordinate 2000 times. For a grid consisting of 100x100 cubes this means we will do 20,000,000 calculations every second.

3.1.1 Implementation

To start, let’s create a new empty compute shader and call it “CubesCompute”. This shader needs a few variables. First of all, we want it to contain a RWStructuredBuffer (where RW stands for read-write) holding the y positions of all of our cubes. This structured buffer essentially acts as an array of floats.

We also want to keep track of the size of our grid (_CubesPerAxis), the number of repetitions and finally the time, to make sure we get a new y position every time we execute the shader.

#pragma kernel CSMain

RWStructuredBuffer<float> _Positions;

int _CubesPerAxis;
int _Repetitions;
float _Time;

Once again we use 8x8x1 as the number of threads within our workgroups. However, since the number of threads per workgroup is fixed and we can only dispatch whole workgroups, a grid of arbitrary size means we will create either exactly enough threads for our grid or too many.

For example, a grid size of 10 does not fit nicely: we need 2 workgroups per axis, which gives 16 threads per axis, 6 more than we have cubes.

If our grid has a size that is not divisible by 8, we will have too many threads running. This is fine as long as we make sure to check whether the id is in fact within the bounds of our grid.

[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    // Check if id is in the grid
    if (id.x >= (uint)_CubesPerAxis || id.y >= (uint)_CubesPerAxis)
    {
        return;
    }

    for (int i = 0; i < _Repetitions; i++)
    {
        // Assign random value
        _Positions[indexFromId(id)] = rand(float2(id.x * _Time, id.y * _Time));
    }
}
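
To get a feel for how many threads the bounds check discards, the following plain C# sketch simulates the dispatch on the CPU and counts the threads that fall inside the grid. The BoundsCheck class and its names are hypothetical:

```csharp
public static class BoundsCheck
{
    public const int ThreadsPerAxis = 8; // matches numthreads(8,8,1)

    // Returns (threads launched, threads whose id lies inside the grid).
    public static (int launched, int active) Count(int cubesPerAxis)
    {
        int groups = (cubesPerAxis + ThreadsPerAxis - 1) / ThreadsPerAxis;
        int threadsPerAxis = groups * ThreadsPerAxis;
        int active = 0;
        for (int y = 0; y < threadsPerAxis; y++)
        {
            for (int x = 0; x < threadsPerAxis; x++)
            {
                // Inverse of the early return in the kernel: keep ids inside the grid.
                if (x < cubesPerAxis && y < cubesPerAxis)
                {
                    active++;
                }
            }
        }
        return (threadsPerAxis * threadsPerAxis, active);
    }
}
```

For a grid size of 10 this launches 256 threads, of which 100 do actual work. For a grid size of 80 all 6400 threads are used.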

Let’s also add 2 helper functions: rand(float2) and indexFromId(uint3). rand() gives us a pseudo-random value between -1 and 1 given a uv/xy position. indexFromId() returns the 1D index of a cube given a 3D id. This is needed since _Positions is a 1D array.

Make sure to add these helper functions above your CSMain function, otherwise the HLSL code will not compile.

float rand(in float2 uv)
{
    return (frac(sin(dot(uv, float2(12.9898, 78.233) * 2.0)) * 43758.5453)) * 2 - 1;
}

int indexFromId(uint3 id)
{
    return id.x + _CubesPerAxis * (id.y + _CubesPerAxis * id.z);
}
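
The same flattening can be mirrored in plain C#, which is handy when we want to know which grid coordinate a given element of the positions array belongs to. This is a sketch with hypothetical names (GridIndex, Flatten, Unflatten), not code from the tutorial's project:

```csharp
public static class GridIndex
{
    // Same formula as indexFromId in the shader.
    public static int Flatten(int x, int y, int z, int cubesPerAxis)
    {
        return x + cubesPerAxis * (y + cubesPerAxis * z);
    }

    // Inverse of Flatten: recovers the 3D coordinate from a 1D index.
    public static (int x, int y, int z) Unflatten(int index, int cubesPerAxis)
    {
        int x = index % cubesPerAxis;
        int y = (index / cubesPerAxis) % cubesPerAxis;
        int z = index / (cubesPerAxis * cubesPerAxis);
        return (x, y, z);
    }
}
```

With 10 cubes per axis, the id (3, 2, 0) maps to index 23, and index 23 maps back to (3, 2, 0).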

To execute this shader, let’s create a new C# script, call it “CubeGrid” and assign it to a new object in the hierarchy. Initialize the script with the following variables.

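using System.Collections; // IEnumerator, needed for the coroutine used later on
using UnityEngine;

public class CubeGrid : MonoBehaviour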
{
    public Transform CubePrefab;
    public ComputeShader CubeShader;

    private ComputeBuffer _cubesPositionBuffer;

    // Grid size
    public int CubesPerAxis = 80;
    public int Repetitions = 1000;

    // Cube objects
    private Transform[] _cubes;

    // Array containing all y positions of cubes.
    // Will be put on the compute buffer
    private float[] _cubesPositions; 

    public bool UseGPU;
}

In Unity, create a CubePrefab by right clicking in the hierarchy > 3D object > Cube. Remove the Box Collider and drag the cube into your assets folder. Delete the cube from the hierarchy and assign the prefab from the assets folder to your script in the hierarchy. Don’t forget to assign the compute shader in the inspector.

Render Result

Once this is done, back in the C# script, we have to create our compute buffer. A compute buffer needs to know the maximum number of elements it can contain and the size in bytes (also called “stride”) of a single element.

Since we want to constantly update the y positions of our cubes, each of which is represented by a float (in our RWStructuredBuffer<float> _Positions), we will use the size of a float as stride and CubesPerAxis * CubesPerAxis, the total size of our grid, as the maximum number of elements.

private void Awake() {
    _cubesPositionBuffer = new ComputeBuffer(CubesPerAxis * CubesPerAxis, sizeof(float));
}

Can we also use custom types in a compute shader? Yes, compute buffers also allow us to use custom structs, like so:

struct Triangle {
    public Vector3 v0, v1, v2;
}

In this case we would tell our compute buffer that a single stride is

sizeof(float) * 3 * 3

because a single triangle contains 3 Vector3's, each of which consists of 3 floats (xyz). To match this triangle struct in HLSL we would use the following:

struct Triangle {
    float3 a, b, c;
};

RWStructuredBuffer<Triangle> _MyTriangles;

---
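
One way to sanity-check a stride is to let the runtime measure the struct. The sketch below is plain C# and uses a stand-in Float3 struct in place of Unity's Vector3, an assumption made so that the example compiles outside Unity. Marshal.SizeOf then reports the size of the struct in bytes:

```csharp
using System.Runtime.InteropServices;

// Stand-in for UnityEngine.Vector3: three consecutive floats.
[StructLayout(LayoutKind.Sequential)]
public struct Float3
{
    public float x, y, z;
}

[StructLayout(LayoutKind.Sequential)]
public struct Triangle
{
    public Float3 v0, v1, v2;
}

public static class StrideCheck
{
    // Size in bytes of one Triangle, to be used as the compute buffer's stride.
    public static int TriangleStride()
    {
        return Marshal.SizeOf<Triangle>();
    }
}
```

The result equals sizeof(float) * 3 * 3, so measuring the struct and computing the stride by hand agree.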

Going back to our example case, telling our compute buffer the max size and stride effectively reserves space for an array in GPU memory in which every element has a size of a float.

When the game ends we also manually have to release the buffer.

private void OnDestroy() {
    _cubesPositionBuffer.Release();
}

Let’s now create the grid by looping through the x,z coordinates and instantiating a cube at every position.

private void Start() {
    CreateGrid();
}

void CreateGrid() {
    _cubes = new Transform[CubesPerAxis * CubesPerAxis];
    _cubesPositions = new float[CubesPerAxis * CubesPerAxis];
    for (int x = 0, i = 0; x < CubesPerAxis; x++) {
        for (int z = 0; z < CubesPerAxis; z++, i++) {
            _cubes[i] = Instantiate(CubePrefab, transform);
            _cubes[i].transform.position = new Vector3(x, 0, z);
        }
    }
}

If we now go back to Unity, make sure all parameters are set in the inspector and run the game, we should see a grid of cubes.

To constantly update our grid, we start a coroutine at the end of creating the grid. This coroutine is responsible for looping through the cubes and updating their y coordinates.

void CreateGrid() {
    ...
    StartCoroutine(UpdateCubeGrid());
}
IEnumerator UpdateCubeGrid() {
    while (true) {
        if (UseGPU) {
            UpdatePositionsGPU();
        }
        else {
            UpdatePositionsCPU();
        }

        for (int i = 0; i < _cubes.Length; i++) {
            _cubes[i].localPosition = new Vector3(_cubes[i].localPosition.x, _cubesPositions[i], _cubes[i].localPosition.z);
        }
        yield return new WaitForSeconds(1);
    }
}

When we use the CPU, we call the UpdatePositionsCPU() function, which sets every cube’s y position to a random value (range -1 to 1) once per repetition.

void UpdatePositionsCPU() {
    for (int i = 0; i < _cubes.Length; i++) {
        for (int j = 0; j < Repetitions; j++) {
            _cubesPositions[i] = Random.Range(-1f, 1f);
        }
    }
}

Our UpdatePositionsGPU() function sets parameters on our compute shader and dispatches it.

void UpdatePositionsGPU() {
    CubeShader.SetBuffer(0, "_Positions", _cubesPositionBuffer);

    CubeShader.SetInt("_CubesPerAxis", CubesPerAxis);
    CubeShader.SetInt("_Repetitions", Repetitions);
    CubeShader.SetFloat("_Time", Time.deltaTime);

    int workgroups = Mathf.CeilToInt(CubesPerAxis / 8.0f);
    CubeShader.Dispatch(0, workgroups, workgroups, 1);

    _cubesPositionBuffer.GetData(_cubesPositions);
}

Going over the code above, we start by assigning the positions buffer to the first (and only) kernel of the compute shader, “CSMain”, linking it to our _Positions array. We then once again set some variables and compute the number of workgroups.

We dispatch the shader so that the compute shader is executed. Finally, with the GetData method on the buffer, we take whatever data is in our _Positions array on the buffer after the compute shader has run and store it in the _cubesPositions array in our C# script.

This should be all. Going into Unity, you will see that when starting the game, depending on your hardware and the number of repetitions, the GPU version will have an easier time staying at a stable framerate than the CPU version. In my case, with 80 cubes per axis and 3000 repetitions, the GPU version stays somewhere around 85 fps at all times, while the CPU version regularly plunges to 35 fps.

3.1.2 Benchmark

Below you can find a benchmark which runs the UpdatePositionsGPU/UpdatePositionsCPU methods 10 times and averages the time used to execute. Results will vary based on hardware.

Specs:

  • CPU - i7-8700
  • GPU - RTX 2060

Execution mode | Time
CPU | 316.5 ms
GPU | 0.1 ms
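
A benchmark of this kind can be sketched with System.Diagnostics.Stopwatch. The helper below is an assumption about how such a measurement might look, not the profiler that ships with the project. The Benchmark class and its method name are made up:

```csharp
using System;
using System.Diagnostics;

public static class Benchmark
{
    // Runs the action the given number of times and returns the
    // average wall-clock time per run in milliseconds.
    public static double AverageMilliseconds(Action action, int runs = 10)
    {
        if (runs <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(runs));
        }

        Stopwatch stopwatch = new Stopwatch();
        double total = 0;
        for (int i = 0; i < runs; i++)
        {
            stopwatch.Restart();
            action();
            stopwatch.Stop();
            total += stopwatch.Elapsed.TotalMilliseconds;
        }
        return total / runs;
    }
}
```

Calling Benchmark.AverageMilliseconds(UpdatePositionsGPU) from inside CubeGrid would time the GPU path. Because UpdatePositionsGPU ends with GetData, which waits for the results, the measurement includes the time the GPU needs to finish its work.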

An additional practical example can be found in the marching cubes guides.

Download: complete project (updated with a grid profiler)