Task shader driver implementation on AMD HW

Previously, I gave you an introduction to mesh/task shaders and wrote up some details about how mesh shaders are implemented in the driver. But I left out the important details of how task shaders (aka. amplification shaders) work in the driver. In this post, I aim to give you some details about how task shaders work under the hood. Like before, this is based on my experience implementing task shaders in RADV and all details are already public information.

Refresher about the task shader API

The task shader (aka. amplification shader in D3D12) is a new stage that runs in workgroups similar to compute shaders. Each task shader workgroup has two jobs: determine how many mesh shader workgroups it should launch (the dispatch size), and optionally create a “payload” (up to 16 KiB of data of your choice) which is passed to those mesh shaders.

Additionally, the API allows task shaders to perform atomic operations on the payload variables.

Typical use of task shaders can be: cluster culling, LOD selection, geometry amplification.
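
To make the two jobs of a task workgroup concrete, here is a minimal Python sketch of cluster culling. It is a toy model of the API contract, not shader or driver code: the names `Cluster` and `task_workgroup` are hypothetical, and culling against a single plane is a simplification I chose for brevity.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    center: tuple  # (x, y, z) of the cluster's bounding sphere (assumed layout)
    radius: float


def task_workgroup(clusters, plane_normal, plane_d):
    """Model one task workgroup that culls clusters against one plane.

    Returns (mesh_dispatch_size, payload), where the payload lists the
    indices of the clusters that survived culling.
    """
    payload = []
    for index, cluster in enumerate(clusters):
        # Signed distance from the sphere centre to the plane.
        dist = sum(n * p for n, p in zip(plane_normal, cluster.center)) + plane_d
        # Keep the cluster if any part of the sphere is on the positive side.
        if dist >= -cluster.radius:
            payload.append(index)
    # Job 1: the dispatch size. Job 2: the payload handed to mesh shaders.
    return len(payload), payload
```

Each launched mesh shader workgroup would then look up its own cluster index in the payload.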

Expectations on task shaders

Before we get into any HW specific details, there are a few things we should unpack first. Based on the API programming model, let’s think about some expectations on a good driver implementation.

Storing the output task payload. There must exist some kind of buffer where the task payload is stored, and the size of this buffer will obviously be a limiting factor on how many task shader workgroups can run in parallel. Therefore, the implementation must ensure that only as many task workgroups run as there is space in this buffer. Preferably this would be a ring buffer whose entries get reused between different task shader workgroups.
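
As a rough illustration of why the buffer size limits parallelism, here is a small Python sketch. The function names and the example sizes in the usage note are assumptions for illustration only; real drivers choose their own ring size and entry layout.

```python
def max_task_workgroups_in_flight(ring_bytes, entry_bytes):
    """How many task workgroups can have a payload entry at the same time."""
    if entry_bytes <= 0:
        raise ValueError("entry_bytes must be positive")
    return ring_bytes // entry_bytes


def ring_slot(workgroup_index, num_slots):
    """Ring buffer reuse: workgroup N gets slot N modulo the slot count."""
    return workgroup_index % num_slots
```

For example, a hypothetical 1 MiB ring with 16 KiB entries has room for 64 task workgroups in flight, and workgroup 65 would reuse slot 1 once it is free again.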

Analogy with the tessellator. The above requirements are pretty similar to what tessellation can already do. So a natural conclusion is that we may be able to implement task shaders by abusing the tessellator. However, this introduces a potential bottleneck on fixed-function hardware, which we would prefer to avoid.

Analogy with a compute pre-pass. Another similar thing that comes to mind is a compute pre-pass. Many games already do something like this: some pre-processing in a compute dispatch that is executed before a draw call. Of course, the application has to insert a barrier between the dispatch and the draw, which means the draw can’t start before every invocation in the dispatch is finished. In reality, not every graphics shader invocation depends on the results of all compute invocations, but there is no way to express a more fine-grained dependency. For task shaders, it is preferable to avoid this barrier and allow task and mesh shader invocations to overlap.
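
The benefit of avoiding the barrier can be sketched with a toy timeline in Python. This is a deliberately simplified model, assuming each stage processes its workgroups one after another and that the durations are made-up units; real GPUs run many workgroups in parallel, so treat it as an illustration of the dependency structure only.

```python
def finish_time_with_barrier(task_times, mesh_times):
    """Compute pre-pass style: no mesh work starts until all tasks are done."""
    return sum(task_times) + sum(mesh_times)


def finish_time_with_overlap(task_times, mesh_times):
    """Fine-grained dependency: mesh work i only waits for task i."""
    task_done = 0
    mesh_done = 0
    for task_time, mesh_time in zip(task_times, mesh_times):
        task_done += task_time
        # Mesh work i starts once task i and the previous mesh work are done.
        mesh_done = max(mesh_done, task_done) + mesh_time
    return mesh_done
```

With three task workgroups taking 1 unit each and three mesh dispatches taking 2 units each, the barrier version finishes at 9 units and the overlapping version at 7.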

Task shaders on AMD HW

What I discuss here is based on information that is already publicly available in open source drivers. If you are already familiar with how AMD’s own PAL-based drivers work, you won’t find any surprises here.

First things first. Under the hood, task shaders are compiled to a plain old compute shader. The task payload is located in VRAM. The shader code that stores the mesh dispatch size and payload is compiled to memory writes which store these in VRAM ring buffers. Even though they are compute shaders as far as the AMD HW is concerned, task shaders do not work like a compute pre-pass. Instead, task shaders are dispatched on an async compute queue while at the same time the mesh shader work is executed on the graphics queue in parallel.

The task+mesh dispatch packets are different from a regular compute dispatch. The compute and graphics queue firmwares work together in parallel:

Compute queue launches up to as many task workgroups as it has space available in the ring buffer.
Graphics queue waits until a task workgroup is finished and can launch mesh shader workgroups immediately. Execution of mesh dispatches from a finished task workgroup can therefore overlap with other task workgroups.
When a mesh dispatch from a task workgroup is finished, its slot in the ring buffer can be reused and a new task workgroup can be launched.
When the ring buffer is full, the compute queue waits until a mesh dispatch is finished, before launching the next task workgroup.
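
The four steps above can be sketched as a toy simulation in Python. This is not how the firmware is written; it is a sequential model I made up to show the slot reuse, assuming each task workgroup finishes instantly and the graphics queue handles one mesh dispatch per step. The function name `simulate` and the event tuples are hypothetical.

```python
from collections import deque


def simulate(num_task_workgroups, ring_slots):
    """Return (event_log, peak_slots_in_use) for a toy task+mesh dispatch."""
    if ring_slots < 1:
        raise ValueError("the ring buffer needs at least one slot")
    free_slots = deque(range(ring_slots))
    finished_tasks = deque()  # (workgroup, slot) pairs awaiting the graphics queue
    next_task = 0
    log = []
    peak = 0
    while next_task < num_task_workgroups or finished_tasks:
        # Compute queue: launch task workgroups while the ring has free slots.
        while next_task < num_task_workgroups and free_slots:
            slot = free_slots.popleft()
            log.append(("task", next_task, slot))
            finished_tasks.append((next_task, slot))
            next_task += 1
        peak = max(peak, ring_slots - len(free_slots))
        # Graphics queue: run the mesh dispatch of one finished task workgroup,
        # then release its ring buffer slot for reuse.
        workgroup, slot = finished_tasks.popleft()
        log.append(("mesh", workgroup, slot))
        free_slots.append(slot)
    return log, peak
```

Running `simulate(5, 2)` shows that task workgroup 2 is only launched after the mesh dispatch of workgroup 0 has released slot 0, and that no more than two slots are ever in use.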

You can find the exact details in the PAL source code, or in the RADV merge requests.

Side note, getting some implementation details wrong can easily cause a deadlock on the GPU. It is great fun to debug these.

The relevant details here are: most of the hard work is implemented in the firmware (good news, because that means I don’t have to implement it); task shaders are executed on an async compute queue; and the driver now has to submit compute and graphics work in parallel.

Keep in mind that the API hides this detail and pretends that the mesh shading pipeline is just another graphics pipeline that the application can submit to a graphics queue. So, once again we have a mismatch between the API programming model and what the HW actually does.

Squeezing a hidden compute pipeline in your graphics

In order to use this beautiful scheme provided by the firmware, the driver needs to do two things:

Create a compute pipeline from the task shader.
Submit the task shader work on the async compute queue while at the same time also submitting the mesh and pixel shader work on the graphics queue.

We already had good support for compute pipelines in RADV (as much as the API needs), but internally in the driver we’ve never had this kind of close cooperation between graphics and compute.

When you use a draw call in a command buffer with a pipeline that has a task shader, RADV must create a hidden, internal compute command buffer. This internal compute command buffer contains the task shader dispatch packet, while the graphics command buffer contains the packet that dispatches the mesh shaders. We must also ensure correct synchronization between these two command buffers according to application barriers: because of the API mismatch, it must work as if the internal compute cmdbuf was part of the graphics cmdbuf. We also need to emit the same descriptors and push constants, etc. When the application submits to the graphics queue, this new, internal compute command buffer is then submitted to the async compute queue.
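
Here is a rough Python model of that idea. None of these class, method or packet names come from RADV; they are hypothetical stand-ins, and mirroring every barrier into the internal command buffer is a simplification of the real synchronization.

```python
class CmdBuf:
    def __init__(self, queue_type):
        self.queue_type = queue_type
        self.packets = []


class GraphicsCmdBuf(CmdBuf):
    def __init__(self):
        super().__init__("graphics")
        self.internal_compute = None  # hidden cmdbuf, created on first use
        self.push_constants = b""

    def set_push_constants(self, data):
        self.push_constants = bytes(data)

    def draw_mesh_tasks(self, x, y, z, has_task_shader):
        if not has_task_shader:
            self.packets.append(("dispatch_mesh", x, y, z))
            return
        if self.internal_compute is None:
            self.internal_compute = CmdBuf("compute")
        # The same state has to be emitted on both command buffers.
        self.internal_compute.packets.append(("push_constants", self.push_constants))
        self.internal_compute.packets.append(("dispatch_task", x, y, z))
        self.packets.append(("push_constants", self.push_constants))
        self.packets.append(("dispatch_mesh_from_task_ring",))

    def barrier(self):
        # Simplification: mirror the barrier so both cmdbufs behave as one.
        self.packets.append(("barrier",))
        if self.internal_compute is not None:
            self.internal_compute.packets.append(("barrier",))
```

The point of the lazy creation is that command buffers which never use a task shader pay nothing for the feature.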

Thus far, this sounds pretty logical and easy.

The actual hard work is to make it possible for the driver to submit work to different queues at the same time. RADV’s queue code was written assuming that there is a 1:1 mapping between radv_queue objects and HW queues. To make task shaders work we must now break this assumption.

So, of course I had to do some crazy refactor to enable this. At the time of writing the AMDGPU Linux kernel driver doesn’t support “gang submit” yet, so I use scheduled dependencies instead. This has the drawback of submitting to the two queues sequentially rather than doing everything in the same submit.

Conclusion, perf considerations

Let’s turn the above wall of text into some performance considerations that you can actually use when you write your next mesh shading application.

Because task shaders are executed on a different HW queue, there is some overhead. Don’t use task shaders for small draws or other cases when this overhead may be more than what you gain from them.
For the same reason, barriers may require the driver to emit some commands that stall the async compute queue. Be mindful of your barriers (e.g. top of pipe) and only use them when your task shader actually depends on some previous graphics work.
Because task payload is written to VRAM by the task shader, and has to be read from VRAM by the mesh shader, there is some latency. Only use as much payload memory as you need. Try to compact the memory use by packing your data etc.
When you have a lot of geometry data, it is beneficial to implement cluster culling in your task shader. After you’ve done this, it may or may not be worth it to implement per-triangle culling in your mesh shader.
Don’t try to reimplement the classic vertex processing pipeline or emulate fixed-function HW with task+mesh shaders. Instead, come up with simpler ways that work better for your app.
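
To illustrate the advice about packing payload data, here is a small Python sketch that squeezes three fields into one 32-bit word. The field layout (20-bit cluster index, 4-bit LOD, 8-bit flags) is an assumption I picked for the example; choose whatever fits your own data.

```python
import struct


def pack_entry(cluster_index, lod, flags):
    """Pack one payload entry into a single 32-bit word (assumed layout)."""
    if not (0 <= cluster_index < (1 << 20)):
        raise ValueError("cluster_index does not fit in 20 bits")
    if not (0 <= lod < (1 << 4)):
        raise ValueError("lod does not fit in 4 bits")
    if not (0 <= flags < (1 << 8)):
        raise ValueError("flags do not fit in 8 bits")
    return cluster_index | (lod << 20) | (flags << 24)


def unpack_entry(word):
    """Inverse of pack_entry: returns (cluster_index, lod, flags)."""
    return word & 0xFFFFF, (word >> 20) & 0xF, (word >> 24) & 0xFF


def payload_sizes(entries):
    """Return (unpacked_bytes, packed_bytes) for a list of entries."""
    flat = [value for entry in entries for value in entry]
    unpacked = struct.pack("<%dI" % len(flat), *flat)
    packed = struct.pack("<%dI" % len(entries), *[pack_entry(*e) for e in entries])
    return len(unpacked), len(packed)
```

For 32 entries this shrinks the payload from 384 bytes to 128 bytes, which means less data written to and read back from VRAM.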

NVIDIA also has some perf recommendations here which mostly apply to other HW as well, except for the recommended number of vertices and primitives per meshlet, because the sweet spot for that can differ between GPU architectures.

Stay tuned

It has been officially confirmed that a Vulkan cross-vendor mesh shading extension is coming soon. Update: it’s here!

~~While I can’t give you any details about the new extension,~~ I think it won’t be a surprise to anyone that it ~~may have been~~ was the motivation for my work on mesh and task shaders.

Once the new extension goes public, I will post some thoughts about it and a comparison to the vendor-specific NV_mesh_shader extension.

Note, 2022-09-03: updated with a link to the new vendor-neutral Vulkan mesh shader extension.

The blog doesn't have comments, but feel free to reach out to me on IRC (Venemo on OFTC) or Discord (sunrise_sky) to discuss.

Timur's blog