### Poly Coding

0. Introduction

2. Basic Compute shaders in Unity

3. Compute buffers in Unity

Note: All code discussed in this tutorial can be downloaded at the end of each corresponding section.

## 0. Introduction

When developing games, we are sometimes forced to do a lot of calculations, be it in pathfinding, procedural terrain generation, simulations, etc. If we were to do those calculations on the CPU in sequential order, it could take a long time and possibly freeze our game for a few seconds. It would be handy if we could offload some of these calculations to the GPU, where we can do computations in parallel.

Compute shaders are programs that allow us to execute arbitrary code on the GPU. When executing a compute shader, the GPU creates workgroups. Workgroups can be thought of as smaller computers that run in parallel and are therefore completely independent of each other.

Workgroups are not guaranteed to be executed in a particular order.

[Figure: a single workgroup]

Workgroups run in 3D space (“compute space”), and are as such logically divided into 3 dimensions.

[Figure: compute space consisting of 4x4x4 workgroups]

Workgroups are in turn divided into threads, which again run in 3D space. These threads are responsible for actually executing our code. Like workgroups, threads are independent of one another; in other words, every calculation done within a thread must be self-contained.

### 2.1 Example

Let’s say that we would like to create a 6x6 2D grid of squares with our compute shader. To do this, we decide that the best approach is to create workgroups containing 3x3x1 threads, so every workgroup takes care of 9 squares in total (remember, threads are laid out in 3 dimensions, but this does not mean you can’t make them 1D or 2D by simply using a dimension with size 1). This means that, to cover the entire grid, we should use 4 workgroups, 2 along the width and 2 along the height.

[Figure: every circle is a thread and its border a workgroup]
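The arithmetic behind this example can be checked on the CPU. The following is a minimal C# sketch, not shader code; the class and variable names are made up for illustration. It computes how many workgroups are needed per axis, and which workgroup and local thread would cover one particular square.

```csharp
using System;

public static class GridExampleSketch
{
    public static void Main()
    {
        const int gridSize = 6;        // the grid is 6x6 squares
        const int threadsPerAxis = 3;  // each workgroup holds 3x3x1 threads

        // Workgroups needed along one axis, rounded up.
        int groupsPerAxis = (gridSize + threadsPerAxis - 1) / threadsPerAxis;
        Console.WriteLine($"workgroups: {groupsPerAxis} x {groupsPerAxis}"); // workgroups: 2 x 2

        // The square at (4, 1) falls in workgroup (1, 0), local thread (1, 1).
        int x = 4, y = 1;
        Console.WriteLine($"group ({x / threadsPerAxis}, {y / threadsPerAxis}), " +
                          $"thread ({x % threadsPerAxis}, {y % threadsPerAxis})");
    }
}
```

Integer division gives the workgroup a square belongs to, and the remainder gives the thread inside that workgroup.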

## 2. Using compute shaders in Unity

When creating a compute shader in Unity (Create > Shader > Compute Shader), we are presented with a default template.

Let’s go over each of the components.

The #pragma kernel CSMain line at the top of the template indicates the function (kernel) in our compute shader that will be run when we execute the compute shader. Think of this as telling the compute shader what its entry point is, somewhat like an Awake/Start function in Unity. In this case our kernel is called “CSMain”, but we are not limited to a single kernel per shader; more on that later.

Just as in C#, compute shaders can have variables and objects (structs). In this case we have a RWTexture2D<float4> named Result, which we can index like a 2D array. The RW part indicates that it is used for both reading and writing.

As explained above, workgroups are partitioned in a 3D space, and each workgroup executes a number of threads (also arranged in 3D). The numthreads attribute tells the compute shader how many threads a workgroup contains in each dimension. In this case our workgroup consists of 8x8x1 threads.

CSMain is the function that will be executed per thread. Every thread receives a uint3 id as parameter that indicates which thread is currently running. For example, the very first thread has (0, 0, 0) as id, the next one along x has (1, 0, 0), and so on.

Finally, the body of the CSMain function is where we put our code and calculations. Here we simply write some float4 values to Result based on our id.

### 2.2 Executing a compute shader

Compute shaders are executed from a C# script, so let’s create a new Unity project, a new C# script called “TextureGenerator” (Create > C# Script), and a compute shader called “TextureCompute” (Create > Shader > Compute Shader).

To start off, we have to create a reference to our template compute shader in our TextureGenerator script.

Now, attach TextureGenerator.cs to the main camera in the hierarchy, and assign the compute shader. Additionally, to display the texture that will be filled by the compute shader, we have to create a RenderTexture variable. Initialize this variable in the OnRenderImage function.

The OnRenderImage(source, destination) function allows us to write an image to the screen after the camera has finished rendering.

Next, we have to make sure that our TextureShader has all its parameters set. Let’s start by setting the RWTexture2D Result array in the compute shader. To do this we have to use the .SetTexture method.

With TextureShader.SetTexture(0, “Result”, _rTexture) we tell the shader that, for the first kernel (indicated by 0 as the first argument), the Result parameter within our compute shader should be set to _rTexture.
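The steps so far could be sketched as follows. This is a minimal illustration rather than the tutorial's original listing: the names TextureShader, _rTexture and Result come from the text, while the texture size and the depth value of 0 are assumptions.

```csharp
using UnityEngine;

public class TextureGenerator : MonoBehaviour
{
    // Assigned in the inspector.
    public ComputeShader TextureShader;

    private RenderTexture _rTexture;

    private void OnRenderImage(RenderTexture source, RenderTexture destination)
    {
        if (_rTexture == null)
        {
            // Screen-sized texture without a depth buffer (assumed settings).
            _rTexture = new RenderTexture(Screen.width, Screen.height, 0);
            // Needed so that a compute shader is allowed to write into the texture.
            _rTexture.enableRandomWrite = true;
            _rTexture.Create();
        }

        // Bind the texture to the "Result" variable of the first kernel (index 0).
        TextureShader.SetTexture(0, "Result", _rTexture);

        // Nothing is dispatched in this sketch, so pass the camera image through unchanged.
        Graphics.Blit(source, destination);
    }
}
```

The enableRandomWrite flag has to be set before Create() is called; without it the RWTexture2D in the shader cannot be written to.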

As mentioned above, compute shaders can have multiple kernels. These kernels can be accessed by an ID, which is determined by the order in which the #pragma kernel lines are written.

For example, if a shader declares two kernels and “Random” is the second #pragma kernel line, we would use kernel = 1 to run the “Random” entry point, since IDs start at 0.

A more readable way, however, is to ask the compute shader for the ID that belongs to a given kernel name, using the FindKernel method.

For now, let us use the last method in which we use FindKernel.
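A minimal sketch of the lookup by name, assuming the template kernel name CSMain and the Result variable from the text; the helper class and method names are hypothetical.

```csharp
using UnityEngine;

public static class KernelLookupSketch
{
    // Looks up the kernel by name and binds the output texture to it.
    public static int BindResult(ComputeShader shader, RenderTexture target)
    {
        // Equivalent to hard-coding 0 while CSMain is the first #pragma kernel line,
        // but keeps working if kernels are added or reordered.
        int kernel = shader.FindKernel("CSMain");
        shader.SetTexture(kernel, "Result", target);
        return kernel;
    }
}
```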

One last thing we have to decide before executing the TextureShader is how many workgroups we should start. We want to fill up the entire screen with a texture, so let’s use the screen width and height as the number of workgroups as a starting point.

This would create a workgroup for every pixel on our screen. Remember that our compute shader by default assigns numthreads (threads per workgroup) as 8x8x1. This means that with a screen size of 1920 by 907 we would currently create (1920 * 8 * 907 * 8 =) 111,452,160 threads, 64 times more than we need.

Correct this by dividing the number of workgroups in both dimensions by 8, the number of threads we use per workgroup in that dimension. Additionally, the number of workgroups must be an integer, so round the result up (ceil) to the nearest integer.
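As a worked example of this correction, here is a small plain C# sketch (no Unity required; the class and method names are made up) using the 1920 by 907 screen from above.

```csharp
using System;

public static class WorkgroupCountSketch
{
    // Number of workgroups needed to cover `pixels` with `threadsPerGroup` threads each.
    public static int GroupsFor(int pixels, int threadsPerGroup)
    {
        return (int)Math.Ceiling(pixels / (double)threadsPerGroup);
    }

    public static void Main()
    {
        int groupsX = GroupsFor(1920, 8); // 240
        int groupsY = GroupsFor(907, 8);  // 114, since 907 / 8 = 113.375
        Console.WriteLine($"{groupsX} x {groupsY} workgroups");

        // Threads actually launched: 1920 x 912, a few rows more than the 907 needed.
        Console.WriteLine($"{groupsX * 8} x {groupsY * 8} threads");
    }
}
```

Inside a Unity script, Mathf.CeilToInt(Screen.width / 8f) performs the same rounding.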

Finally, dispatch the shader with the right kernel and the above found workgroup sizes, and display our texture.
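The dispatch step might look roughly like this. It is a sketch under the assumption that the kernel uses 8x8x1 threads per workgroup and that the texture has already been bound to the kernel; the helper name is hypothetical.

```csharp
using UnityEngine;

public static class DispatchSketch
{
    public static void Run(ComputeShader shader, int kernel,
                           RenderTexture target, RenderTexture destination)
    {
        // One workgroup per 8x8 block of pixels, rounded up.
        int groupsX = Mathf.CeilToInt(target.width / 8f);
        int groupsY = Mathf.CeilToInt(target.height / 8f);

        // Run the kernel; the third dimension is unused, so a single layer of workgroups suffices.
        shader.Dispatch(kernel, groupsX, groupsY, 1);

        // Copy the generated texture to the screen.
        Graphics.Blit(target, destination);
    }
}
```

A method like this would be called from OnRenderImage, with destination being the second argument Unity passes in.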

When starting the game we will now see the following image:

[Image: output of the template compute shader]

## 3. Compute buffers

While these results are generated very fast, when generating an image like this there is no real need for a compute shader. This can be done way faster with a regular unlit shader.

The real power of compute shaders lies in being able to write arbitrary data to the GPU, do calculations on it, and return the results. To write data to the GPU, we need a compute buffer to act as a middle man. This is usually the pattern: we put some data in the compute buffer from a C# script, the compute shader on the GPU reads this data and does calculations on it, the results are written back to the buffer, and in the end the buffer is read once more by our C# script.

### 3.1 Compute buffers Example

Let’s look at an example. To keep it short, I will be using a simplified version of an example from this YouTube video by Game Dev Guide.

The goal of the example is to create a grid of cubes whose y positions are updated every second. To calculate the y position we do an expensive computation: calculate a random number for a number of repetitions (e.g. 2000). In other words, per cube we will calculate a random y coordinate 2000 times. For a grid consisting of 100x100 cubes this means we will do 100 * 100 * 2000 = 20,000,000 calculations every second.

### 3.1.1 Implementation

To start, let’s create a new empty Compute Shader, call it “CubesCompute”. This shader needs a few variables. We first of all want it to contain a RWStructuredBuffer, where RW stands for read write, containing all the y positions of all of our cubes. This structured buffer essentially acts as an array of floats.

We also want to keep track of the size of our grid (_CubesPerAxis), the number of repetitions, and finally the time, to make sure we get a new y position every time we execute the shader.

Once again we use 8x8x1 as the number of threads within our workgroups. However, because the number of threads per workgroup is fixed when the shader is compiled, an arbitrary grid size means we will create either exactly enough threads for our grid or too many.

For example if we have a grid size of 10, this would not fit nicely inside our workgroups and threads.

If our grid has a size that is not divisible by 8, we will have too many threads running. This is fine as long as we make sure to check if the id is in fact in the bounds of our grid.
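To make the overshoot concrete, the following plain C# sketch emulates on the CPU which thread ids would pass such a bounds check for a grid size of 10. It is an illustration only; the real check lives in the HLSL kernel, and the names here are made up.

```csharp
using System;

public static class BoundsCheckSketch
{
    public static void Main()
    {
        const int cubesPerAxis = 10;   // grid size from the example above
        const int threadsPerAxis = 8;  // 8x8x1 threads per workgroup

        int groupsPerAxis = (cubesPerAxis + threadsPerAxis - 1) / threadsPerAxis; // 2
        int launchedPerAxis = groupsPerAxis * threadsPerAxis;                     // 16

        int inBounds = 0, skipped = 0;
        for (int y = 0; y < launchedPerAxis; y++)
        {
            for (int x = 0; x < launchedPerAxis; x++)
            {
                // The same test a kernel would apply to id.x and id.y before touching the buffer.
                if (x >= cubesPerAxis || y >= cubesPerAxis) { skipped++; continue; }
                inBounds++;
            }
        }

        Console.WriteLine($"{inBounds} threads do work, {skipped} return early");
        // 100 threads do work, 156 return early
    }
}
```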

Let’s also add 2 helper functions: rand(float2) and indexFromId(uint3). rand() gives us a random value between -1 and 1 given a uv/xy position. indexFromId() returns the 1D index of the cube given a 3D id. This is needed since _Positions is a 1D array.

Make sure to add these helper functions above your CSMain function, otherwise the HLSL code will not compile.

To execute this shader, let’s create a new C# script, call it “CubeGrid” and assign it to a new object in the hierarchy. Initialize the script with the following variables.

In Unity, create a CubePrefab by right-clicking in the hierarchy > 3D object > Cube. Remove the Box Collider and drag the cube into your assets folder. Delete the cube from the hierarchy and assign the prefab from the assets folder to your script in the hierarchy. Don’t forget to assign the compute shader in the inspector.

Once this is done, back in the C# script, we have to create our compute buffer. A compute buffer needs to know the maximum number of elements it can contain and the size in bytes (also called “stride”) of a single element.

Since we want to constantly update the y positions of our cubes, each of which is represented by a float (in our RWStructuredBuffer<float> _Positions), we use the size of a float as stride and CubesPerAxis * CubesPerAxis, the total size of our grid, as the maximum number of elements.

Can we also use custom types in compute shaders? Yes, compute buffers also allow us to use custom structs, for example a triangle struct made up of 3 Vector3 values. In this case we would tell our compute buffer that a single stride is 3 * 3 * sizeof(float) = 36 bytes, because a single triangle contains 3 Vector3s, each of which consists of 3 floats (xyz). On the HLSL side, the struct has to be matched by an equivalent struct declaration.
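On the C# side, such a struct and its stride could be sketched like this. The field names and the helper class are hypothetical, chosen only for illustration.

```csharp
using UnityEngine;

// Hypothetical struct; the field names are assumptions.
public struct Triangle
{
    public Vector3 a;
    public Vector3 b;
    public Vector3 c;
}

public static class TriangleBufferSketch
{
    // 3 vectors x 3 floats x 4 bytes = 36 bytes per triangle.
    public const int Stride = sizeof(float) * 3 * 3;

    public static ComputeBuffer Create(int maxTriangles)
    {
        return new ComputeBuffer(maxTriangles, Stride);
    }
}
```

The stride has to match the memory layout of the shader-side struct exactly, otherwise elements are read at the wrong offsets.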
---

Going back to our example case, telling our compute buffer the max size and stride effectively reserves space for an array in GPU memory in which every element has a size of a float.

When the game ends we also manually have to release the buffer.
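Creating and releasing the buffer could look like the sketch below. CubesPerAxis and _cubesPositions are named in the text; the buffer field name, the class name, and the choice of Awake and OnDestroy are assumptions.

```csharp
using UnityEngine;

public class CubeGridBufferSketch : MonoBehaviour
{
    public int CubesPerAxis = 80;

    private ComputeBuffer _cubesPositionBuffer; // hypothetical field name
    private float[] _cubesPositions;

    private void Awake()
    {
        int count = CubesPerAxis * CubesPerAxis;

        // One float (4 bytes) per cube: count elements, stride of sizeof(float).
        _cubesPositionBuffer = new ComputeBuffer(count, sizeof(float));
        _cubesPositions = new float[count];
    }

    private void OnDestroy()
    {
        // GPU memory is not garbage collected, so release it explicitly.
        _cubesPositionBuffer?.Release();
    }
}
```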

Let’s now create the grid by looping through the x,z coordinates and instantiating a cube at every position.
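A possible sketch of the grid creation follows. The index layout x + z * cubesPerAxis is an assumption; whatever layout is used, the index function in the shader has to produce the same ordering.

```csharp
using UnityEngine;

public static class CubeGridFactorySketch
{
    public static GameObject[] CreateGrid(GameObject cubePrefab, int cubesPerAxis)
    {
        var cubes = new GameObject[cubesPerAxis * cubesPerAxis];

        for (int x = 0; x < cubesPerAxis; x++)
        {
            for (int z = 0; z < cubesPerAxis; z++)
            {
                // Assumed layout: flatten (x, z) into a 1D index.
                cubes[x + z * cubesPerAxis] = Object.Instantiate(
                    cubePrefab, new Vector3(x, 0f, z), Quaternion.identity);
            }
        }

        return cubes;
    }
}
```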

If we were to go back to Unity now, make sure all parameters are set in the hierarchy, and run the game, we should see a grid of cubes.

To constantly update our grid, we start a coroutine at the end of creating the grid. This coroutine is responsible for looping through the cubes and updating their y coordinates.
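Such a coroutine might be structured as below. The UseGPU toggle, the coroutine name, and the one-second wait (taken from the stated goal of updating every second) are assumptions; the two update methods are left as stubs here.

```csharp
using System.Collections;
using UnityEngine;

public class CubeGridLoopSketch : MonoBehaviour
{
    public bool UseGPU; // hypothetical toggle between the two code paths

    private void Start()
    {
        StartCoroutine(UpdateCubeGrid());
    }

    private IEnumerator UpdateCubeGrid()
    {
        while (true)
        {
            if (UseGPU) UpdatePositionsGPU();
            else UpdatePositionsCPU();

            // Wait a second before computing the next set of positions.
            yield return new WaitForSeconds(1f);
        }
    }

    private void UpdatePositionsCPU() { /* stub: CPU path */ }
    private void UpdatePositionsGPU() { /* stub: GPU path */ }
}
```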

In case we use our CPU, we want to call the UpdatePositionsCPU() function, which sets every cube’s y position to a random value (range -1 to 1) every repetition.
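A CPU version along these lines could be sketched as follows; the static helper form and parameter names are assumptions made to keep the example self-contained.

```csharp
using UnityEngine;

public static class CpuUpdateSketch
{
    public static void UpdatePositionsCPU(GameObject[] cubes, int repetitions)
    {
        foreach (GameObject cube in cubes)
        {
            // Deliberately wasteful: the position is overwritten on every repetition,
            // so only the last random value remains visible.
            for (int i = 0; i < repetitions; i++)
            {
                Vector3 p = cube.transform.position;
                cube.transform.position = new Vector3(p.x, Random.Range(-1f, 1f), p.z);
            }
        }
    }
}
```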

Our UpdatePositionsGPU() function sets parameters on our compute shader and dispatches it.

Going over UpdatePositionsGPU(): we start by assigning the positions buffer to the compute shader in the first (and only) kernel, “CSMain”, which links this buffer to our _Positions array. We then once again set some variables and calculate the number of workgroups.

We dispatch the compute shader and finally, with the GetData method on the buffer, take whatever data is in our _Positions array on the buffer after the compute shader has executed and store it in the _cubesPositions array in our C# script.
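Put together, the GPU path described here might look like the sketch below. _Positions and _CubesPerAxis are named in the text; the shader property names _Repetitions and _Time, as well as the helper signature, are assumptions and need to match whatever the compute shader actually declares.

```csharp
using UnityEngine;

public static class GpuUpdateSketch
{
    public static void UpdatePositionsGPU(
        ComputeShader shader, ComputeBuffer positionsBuffer,
        float[] cubesPositions, GameObject[] cubes,
        int cubesPerAxis, int repetitions)
    {
        int kernel = shader.FindKernel("CSMain");

        // Link the buffer to the _Positions array and set the remaining parameters.
        shader.SetBuffer(kernel, "_Positions", positionsBuffer);
        shader.SetInt("_CubesPerAxis", cubesPerAxis);
        shader.SetInt("_Repetitions", repetitions); // assumed property name
        shader.SetFloat("_Time", Time.time);        // assumed property name

        // 8x8x1 threads per workgroup, rounded up to cover the whole grid.
        int groups = Mathf.CeilToInt(cubesPerAxis / 8f);
        shader.Dispatch(kernel, groups, groups, 1);

        // Waits for the GPU to finish, then copies the results back to the CPU.
        positionsBuffer.GetData(cubesPositions);

        // Assumes cubes and cubesPositions share the same index layout.
        for (int i = 0; i < cubes.Length; i++)
        {
            Vector3 p = cubes[i].transform.position;
            cubes[i].transform.position = new Vector3(p.x, cubesPositions[i], p.z);
        }
    }
}
```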

This should be all. Going into Unity, you will see that when starting the game, depending on your hardware and the number of repetitions, the GPU version has an easier time maintaining a stable framerate than the CPU version. In my case, with 80 cubes per axis and 3000 repetitions, the GPU version stays around 85 fps at all times, while the CPU version regularly drops to 35 fps.

### 3.1.2 Benchmark

Below you can find a benchmark which runs the UpdatePositionsGPU/UpdatePositionsCPU methods 10 times and averages the time used to execute. Results will vary based on hardware.

Specs:

• CPU - i7-8700
• GPU - RTX 2060

| Execution Mode | Time |
| --- | --- |
| CPU | 316.5ms |
| GPU | 0.1ms |
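A timing helper along the following lines could be used to reproduce such a measurement. It is a sketch and not the benchmark code behind the numbers above; the class and method names are hypothetical.

```csharp
using System;
using System.Diagnostics;

public static class BenchmarkSketch
{
    // Runs `action` the given number of times and returns the average duration in milliseconds.
    public static double AverageMilliseconds(Action action, int runs = 10)
    {
        var stopwatch = new Stopwatch();

        for (int i = 0; i < runs; i++)
        {
            // Start/Stop accumulate, so Elapsed holds the total over all runs.
            stopwatch.Start();
            action();
            stopwatch.Stop();
        }

        return stopwatch.Elapsed.TotalMilliseconds / runs;
    }
}
```

It would be called with the method to measure, for example AverageMilliseconds(UpdatePositionsCPU) for a parameterless version of that method.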

An additional practical example can be found in the marching cubes guides.