AMD RDNA3 mesh shading with RADV

This is a long-awaited update to the previous mesh shading related posts. RDNA3 brings many interesting improvements to the hardware which simplify how mesh shaders work.

Reminder: main limitation of mesh shading on RDNA2

RDNA2 already supported mesh and task shaders, but mesh shaders had a big caveat regarding how outputs work: each shader invocation could only really write up to 1 vertex and 1 primitive, which meant that the shader compiler had to work around that to implement the programming model of the mesh shading API.

On RDNA2 the shader compiler had to:

Make sure to launch (potentially) more invocations than the API shader needed, to accommodate a larger number of output vertices / primitives.
Save all outputs to memory (shared memory or VRAM) and reload them at the end, unless they were only indexed by the local invocation index.
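
To illustrate the first workaround: since each hardware invocation could export at most 1 vertex and 1 primitive, the hardware workgroup had to be at least as large as the biggest of the API workgroup size and the declared maximum output counts. Here is a minimal sketch of that sizing rule. The function and parameter names are hypothetical and not taken from RADV, and a real driver may additionally round the result up, for example to a multiple of the wave size.

```python
def rdna2_hw_workgroup_size(api_workgroup_size: int,
                            max_vertices: int,
                            max_primitives: int) -> int:
    """Illustrative only: smallest HW workgroup that gives every output
    vertex and every output primitive its own invocation."""
    return max(api_workgroup_size, max_vertices, max_primitives)


# A shader that declares 32 API invocations but up to 64 vertices and
# 126 primitives needs 126 HW invocations under this rule.
hw_size = rdna2_hw_workgroup_size(32, 64, 126)

# Invocations that exist only to export outputs, not to run API shader code.
extra = hw_size - 32
```

In this example most of the launched invocations do no API-level work, which is the overhead the rest of this post is about.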

Shader output changes on RDNA3

RDNA3 changes how shader outputs work on pre-rasterization stages (including VS, TES, GS, MS).

Attribute ring

Previous architectures had a special buffer called the parameter cache, where the pre-rasterization stage stored generic output attributes for fragment shaders (pixel shaders) to read.

The parameter cache was removed from RDNA3 in favour of the attribute ring, which is basically a buffer in VRAM. Shaders must now store their outputs to this buffer; after rasterization, the HW reads the attributes from the attribute ring and stores them to the LDS space of fragment shaders.

When I first heard about the attribute ring I didn’t understand how this is an improvement over the previous design (VRAM bandwidth is considered a bottleneck in many cases), but then I realized that this is meant to work together with the Infinity Cache that these new chips have. In the ideal access pattern, each attribute store would overwrite a full cache line so the shader won’t actually touch VRAM.

For mesh shaders, this has two consequences:

Any invocation can now truly write generic attributes of any other invocation without restrictions, because each such write is just a memory store.
The shader compiler now has to worry about memory access patterns.
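
To see why the access pattern matters, compare two ways of laying out attributes in a buffer. The sketch below is a toy model and not the actual attribute ring layout: the 16-byte attribute size, the 128-byte cache line, the 32-lane wave and both layout functions are assumptions picked for illustration. It counts how many cache lines one wave-wide store of a single attribute touches, and whether each touched line is fully overwritten.

```python
ATTR_BYTES = 16         # assumed: one vec4 of 32-bit floats
CACHE_LINE_BYTES = 128  # assumed cache line size
WAVE_SIZE = 32          # assumed number of lanes per wave
NUM_ATTRS = 4           # assumed number of generic attributes per vertex


def attribute_major_offset(attr: int, vertex: int, num_vertices: int) -> int:
    """Hypothetical layout: all vertices of one attribute are contiguous."""
    return (attr * num_vertices + vertex) * ATTR_BYTES


def vertex_major_offset(attr: int, vertex: int, num_attrs: int) -> int:
    """Hypothetical layout: all attributes of one vertex are contiguous."""
    return (vertex * num_attrs + attr) * ATTR_BYTES


def lines_touched(offsets):
    """Map each touched cache line index to the number of bytes written in it."""
    written = {}
    for off in offsets:
        line = off // CACHE_LINE_BYTES
        written[line] = written.get(line, 0) + ATTR_BYTES
    return written


def fully_overwritten(written) -> bool:
    """True if every touched cache line was written in its entirety."""
    return all(n == CACHE_LINE_BYTES for n in written.values())


# One wave stores attribute 0 for vertices 0..31, one vertex per lane.
soa = lines_touched(attribute_major_offset(0, v, WAVE_SIZE)
                    for v in range(WAVE_SIZE))
aos = lines_touched(vertex_major_offset(0, v, NUM_ATTRS)
                    for v in range(WAVE_SIZE))
```

In this model the attribute-major store covers 4 cache lines completely, while the vertex-major store spreads the same 512 bytes over 16 partially written lines.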

RADV already supports the attribute ring for VS, TES and GS so we have some experience with how it works and only needed to apply that to mesh shaders.

Row exports

For non-generic output attributes (such as position, clip/cull distances, etc.) we still need to use exp instructions just like the old hardware. However, these now have a new mode called row export which allows each lane to write not only its own outputs but also others in the same row.

Basic RDNA3 mesh shading: legacy fast launch mode

The legacy fast launch mode is essentially the same thing as RDNA2 had, so in this mode mesh shaders can be compiled with the same structure and the compiler only needs to be adjusted to use the attribute ring.

The drawback of this mode is that it still has the same issue with workgroup size as RDNA2 had. So this is just useful for helping driver developers port their code to the new architecture but it doesn’t allow us to fully utilize the new capabilities of the hardware.

The initial MS implementation in RADV used this mode.

New fast launch mode

In this mode, the number of HW shader invocations is determined similarly to how a compute shader would work, and there is no need for it to match the number of output vertices and primitives.
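
One way to picture the consequence: once the invocation count no longer has to cover the output counts, a single invocation may need to handle several vertices or primitives. The sketch below assumes a simple strided assignment. The post's sources do not describe how the compiler actually distributes outputs, so treat the function as hypothetical.

```python
def outputs_for_invocation(invocation: int,
                           workgroup_size: int,
                           num_outputs: int) -> list:
    """Hypothetical strided assignment: invocation i handles outputs
    i, i + workgroup_size, i + 2 * workgroup_size, ..."""
    return list(range(invocation, num_outputs, workgroup_size))


# 32 invocations (sized like a compute workgroup) handling 80 output vertices.
assignment = [outputs_for_invocation(i, 32, 80) for i in range(32)]

# Flatten to check that every vertex is handled by exactly one invocation.
covered = sorted(v for per_invocation in assignment for v in per_invocation)
```

Here the first 16 invocations handle three vertices each and the remaining 16 handle two, so no extra invocations need to be launched just for output.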

Thanks to Rhys for working on this and enabling the new mode on RDNA3.

Based on what we can glean from the open source work published so far (in particular, the published register files), we think RDNA4 will only support this new mode.

What took you so long?

I’ve wanted to write about this for some time, but somehow forgot that I have a blog… Sorry!

References

As always, what I discuss here is based on open source driver code including Mesa (RadeonSI and RADV) and AMD’s reference driver code.

RadeonSI and RADV already had code and comments that explain the attribute ring.
RDNA3 shader ISA and LLPC mesh/task code do a good job of explaining row exports and hint at the new fast launch mode.
PAL’s reference implementation explains how the draw packets work.

The blog doesn't have comments, but feel free to reach out to me on IRC (Venemo on OFTC) or Discord (sunrise_sky) to discuss.

Timur's blog