What is NGG and shader culling on AMD RDNA GPUs?

NGG (Next Generation Geometry) is the technology that is responsible for any vertex and geometry processing in AMD RDNA GPUs. I decided to do a write-up about my experience implementing it in RADV, which is the Vulkan driver used by many Linux systems, including the Steam Deck. I will also talk about shader culling on RDNA GPUs.

Old stuff: the legacy geometry pipeline on GCN

Let’s start by briefly going over how the old GCN geometry pipeline works, so that we can compare old and new.

GCN GPUs have 5 programmable hardware shader stages for vertex/geometry processing: LS, HS, ES, GS, VS. These HW stages don’t exactly map to the software stages that are advertised in the API. Instead, it is the responsibility of the driver to select which HW stages need to be used for a given pipeline. We have a table if you want details.

The rasterizer can only consume the output from HW VS, so the last SW stage in your pipeline must be compiled to HW VS. This is trivial for VS and TES (yes, tess eval shaders are “VS” to the hardware), but GS is complicated. GS outputs are written to memory. The driver has no choice but to compile a separate shader that runs as HW VS, and copies this memory to the rasterizer. We call this the “GS copy shader” in Mesa. (It is not part of the API pipeline but required to make GS work.)

Notes:

All of these HW stages (except GS, of course) use a model in which 1 shader invocation (SIMD lane, or simply thread in D3D jargon) corresponds to 1 output vertex. These stages (except GS) are also not aware of output primitives.
Vega introduced “merged shaders” (LS+HS, and ES+GS) but did not fundamentally change the above model.

New stuff: NGG pipeline on RDNA

The next-generation geometry pipeline vastly simplifies how the hardware works. (At the expense of some increased driver complexity.)

There are now only 2 HW shader stages for vertex/geometry processing:

Surface shader which is a pre-tessellation stage and is equivalent to what LS + HS was in the old HW.
Primitive shader which can feed the rasterizer and replaces all of the old ES + GS + VS stages.

The surface shader is not too interesting, it runs the merged SW VS + TCS when tessellation is enabled. I’m not aware of any changes to how this works compared to old HW.

The interesting part, and subject of much discussion on the internet is the primitive shader. In some hardware documentation and register header files the new stage is referred to as simply “GS”, because AMD essentially took what the GS stage could already do and added the ability for it to directly feed the rasterizer using exp (export) instructions. Don’t confuse this with SW GS. It supports a superset of the functionality that you can do in software VS, TES, GS and MS. In very loose terms you can think about it as a “mesh-like” stage which all these software stages can be compiled to.

Compared to the old HW VS, a primitive shader has these new features:

Compute-like: they are running in workgroups, and have full support for features such as workgroup ID, subgroup count, local invocation index, etc.
Aware of both input primitives and vertices: there are registers which contain information about the input primitive topology and the overall number of vertices/primitives (similar to GS).
They have to export not only vertex output attributes (positions and parameters), but also the primitive topology, ie. which primitive (eg. triangle) contains which vertices and in what order. Instead of processing vertices in a fixed topology, it is up to the shader to create as many vertices and primitives as the application wants.
Each shader invocation can create up to 1 vertex and up to 1 primitive.
Before outputting any vertex or primitive, a workgroup has to tell how many it will output, using s_sendmsg(gs_alloc_req) which ensures that the necessary amount of space in the parameter cache is allocated for them.
On RDNA2, per-primitive output params are also supported.

How is shader compilation different?

Software VS and TES:
Compared to the legacy pipeline, the compiled shaders must now not only export vertex output attributes for a fixed number of vertices, but instead they create vertices/primitives (after declaring how many they will create). This is trivial to implement because all they have to do is read the registers that contain the input primitive topology and then export the exact same topology.

Software GS:
As noted above, each NGG shader invocation can only create up to 1 vertex + up to 1 primitive. This mismatches the programming model of SW GS and makes it difficult to implement. In a nutshell, for SW GS the hardware launches a large enough workgroup to fit every possible output vertex. This results in poor HW utilization (most of those threads just sit there doing nothing while the GS threads do the work), but there is not much we can do about that.

Mesh shaders:
The new pipeline enables us to support mesh shaders, which was simply impossible on the legacy pipeline, due to how the programming model entirely mismatches anything the old hardware could do.

How does any of this make my games go faster?

We did some benchmarks when we switched RADV and ACO to use the new pipeline. We found no significant perf changes. At all. Considering all the hype we heard about NGG at the hardware launch, I was quite surprised.

However, after I set the hype aside, it was quite self-explanatory. When we switched to NGG, we still compiled our shaders mostly the same way as before, so even though we used the new geometry pipeline, we didn’t do anything to take advantage of its new capabilities.

The actual perf improvement came after I also implemented shader-based culling.

What is NGG shader culling?

The NGG pipeline makes it possible for shaders to know about input primitives and create an arbitrary topology of output primitives. Even though the API does not make this information available to application shaders, it is possible for driver developers to make their compiler aware of it and add some crazy code that can get rid of primitives when it knows that those will never be actually visible. The technique is known as “shader culling”, or “NGG culling”.

This can improve performance in games that have a lot of triangles, because we only calculate the output positions of every vertex before we decide which triangle to remove. We then also remove unused vertices.

The benefits are:

Reduced bottleneck from the fixed-function HW that traditionally does culling.
Improved bandwidth use, because we can avoid loading some inputs for vertices we delete.
Improved shader HW utilization because we can avoid computing additional vertex attributes for deleted vertices.
More efficient PC (parameter cache) use as we don’t need to reserve output space for deleted vertices and primitives.

If there is interest, I may write a blog post about the implementation details later.

Caveats of shader culling

Due to how all of this reduces certain bottlenecks, its effectiveness very highly depends on whether you actually had a bottleneck in the first place. How many primitives it can remove of course depends on the application. Therefore the exact percentage of performance gain (or loss) also depends on the application.

If an application didn’t have any of the aforementioned bottlenecks or already mitigates them in its own way, then all of this new code may just add unnecessary overhead and actually slightly reduce performance rather than improve it.

Other than that, there is some concern that a shader-based implementation may be less accurate than the fixed-function HW.

It may leave some triangles which should have been removed. This is not really an issue as these will be removed by fixed-func HW anyway.
The bigger problem is that it may delete triangles which should have been kept. This can manifest itself by missing objects from a scene or flickering, etc.

Our results with shader culling on RDNA2

Shader culling seems to be most efficient in benchmarks and apps that output a lot of triangles without having any in-app solution for dealing with the bottlenecks. It is also very effective on games that suffer from overtessellation (ie. create a lot of very small triangles which are not visible).

An extreme example is the instancing demo by Sascha Willems which gets a massive boost
Basemark sees 10%+ performance improvement
Doom Eternal gets a 3%-5% boost (depending on the GPU and scene)
The Witcher 3 also benefits (likely thanks to its overuse of tessellation)
In less demanding games, the difference is negligible, around 1%

While shader culling can also work on RDNA1, we don’t enable it by default because we haven’t yet found a game that noticably benefits from it. On RDNA1, it seems that the old and new pipelines have similar performance.

Notes about hardware support

Vega had something similar, but I haven’t heard of any drivers that ever used this. Based on public info I cound find, it’s not even worth looking into.
Navi 10 and 12 lack some features such as per-primitive outputs which makes it impossible to implement mesh shaders on these GPUs. We don’t use NGG on Navi 14 (RX 5500 series) because it doesn’t work.
Navi 21 and newer have the best support. They have all necessary features for mesh shaders. We enabled shader culling by default on these GPUs because they show a measurable benefit.
Van Gogh (the GPU in the Steam Deck) has the same feature set as Navi 2x. It also shows benefits from shader culling, but to a smaller extent.

Closing thoughts

The main takeaway from this post is that NGG is not a performance silver bullet that magically makes all your games faster. Instead, it is an enabler of new features. It lets the driver implement new techniques such as shader culling and new programming models like mesh shaders.

The blog doesn't have comments, but feel free to reach out to me on IRC (Venemo on OFTC) or Discord (sunrise_sky) to discuss.

Timur's blog