Mesh and task shaders (amplification shaders in D3D jargon) are a new way to produce geometry in 3D applications. First proposed by NVIDIA in 2018 and initially available in the "Turing" series of GPUs, they are now supported on RDNA2 GPUs as well. On the API side, they are part of the D3D12 API and are also available as vendor-specific extensions to Vulkan and OpenGL. In this post I'm going to talk about what mesh shaders are, and in part 2 I'm going to talk about how they are implemented on the driver side.

Problems with the old geometry pipeline

The problem with the traditional vertex processing pipeline is that it was designed around several fixed-function hardware units in the GPU and offers very little flexibility for the user to customize it. The main issues with the traditional pipeline are:

  • Vertex buffers and vertex shader inputs are annoying (especially from the driver’s perspective) and input assembly may be a bottleneck on some HW in some cases.
  • The user has no control over how the input vertices and primitives are arranged, so the vertex shader may uselessly run for primitives that are invisible (e.g. occluded, backfacing, etc.), meaning that compute resources are wasted on things that don’t actually produce any pixels.
  • Geometry amplification depends on fixed function tessellation HW and offers poor customizability.
  • The programming model gives the user very little control over input and output primitives. Geometry shaders have a horrible programming model that results in low HW occupancy and limited topologies.

The mesh shading pipeline is a graphics pipeline that addresses these issues by completely replacing the entire traditional vertex processing pipeline with two new stages: the task and mesh shader.

What is a mesh shader?

A mesh shader is a compute-like stage which allows the application to fully customize its inputs and outputs, including output primitives.

  • Mesh shader vs. Vertex shader: A mesh shader is responsible for creating its output vertices and primitives. In comparison, a vertex shader can only load a fixed number of vertices, do some processing on them, and has no awareness of primitives.
  • Mesh shader vs. Geometry shader: As opposed to a geometry shader which can only use a fixed output topology, a mesh shader is free to define whatever topology it wants. You can think about it as if the mesh shader produces an indexed triangle list.

What does it mean that the mesh shader is compute-like?

You can use all the sweet good stuff that compute shaders can already do but vertex shaders can’t, for example: use shared memory, run in workgroups, rely on the workgroup ID, subgroup ID, etc.

The API allows any mesh shader invocation to write to any vertex or primitive. The invocations in each mesh shader workgroup are meant to co-operatively produce a small set of output vertices and primitives (this is sometimes called a “meshlet”). All workgroups together create the full output geometry (the “mesh”).
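As a concrete illustration, here is a minimal mesh shader in Vulkan GLSL (assuming the GL_EXT_mesh_shader extension) whose single-invocation workgroup emits one triangle; the built-ins and limits come from that extension:

```glsl
#version 460
#extension GL_EXT_mesh_shader : require

layout(local_size_x = 1) in;
layout(triangles, max_vertices = 3, max_primitives = 1) out;

void main()
{
    // Tell the implementation how many vertices and primitives
    // this workgroup actually emits.
    SetMeshOutputsEXT(3, 1);

    // Write the output vertices; any invocation may write any vertex.
    gl_MeshVerticesEXT[0].gl_Position = vec4(-0.5, -0.5, 0.0, 1.0);
    gl_MeshVerticesEXT[1].gl_Position = vec4( 0.5, -0.5, 0.0, 1.0);
    gl_MeshVerticesEXT[2].gl_Position = vec4( 0.0,  0.5, 0.0, 1.0);

    // The "index buffer": one triangle referencing the vertices above.
    gl_PrimitiveTriangleIndicesEXT[0] = uvec3(0, 1, 2);
}
```

In a real shader the workgroup would have many invocations that fill the output arrays co-operatively, as described above.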

What does the mesh shader do?

First, it needs to figure out how many vertices and primitives it wants to create, then it can write these to its output arrays. How it does this is entirely up to the application developer. There are, however, some performance recommendations which you should follow to make it go fast. I’m going to talk about these more in Part 2.

The input assembler step is entirely eliminated, which means that the application is now in full control of how (or if at all) the input vertex data is fetched, meaning that you can save bandwidth on things that don’t need to be loaded, etc. There is no “input” in the traditional sense, but you can rely on things like push constants, UBO, SSBO etc.

For example, a mesh shader could perform per-triangle culling in such a manner that it wouldn’t need to load data for primitives that are culled, therefore saving bandwidth.
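A sketch of that idea in GL_EXT_mesh_shader GLSL, assuming a hypothetical meshlet layout of 64 clip-space positions and up to 64 index triples per meshlet (a real shader would transform the positions first and could also skip loading other attributes of fully culled vertices):

```glsl
#version 460
#extension GL_EXT_mesh_shader : require

layout(local_size_x = 64) in;
layout(triangles, max_vertices = 64, max_primitives = 64) out;

// No input assembler: the shader fetches its own data from SSBOs.
layout(binding = 0) readonly buffer Positions { vec4 positions[]; };
layout(binding = 1) readonly buffer Indices   { uint indices[]; };

shared uint sh_prim_count;

void main()
{
    if (gl_LocalInvocationIndex == 0)
        sh_prim_count = 0;
    barrier();

    // Each invocation tests one triangle of this workgroup's meshlet.
    uint base = (gl_WorkGroupID.x * 64 + gl_LocalInvocationIndex) * 3;
    uvec3 tri = uvec3(indices[base], indices[base + 1], indices[base + 2]);
    uint vtx_base = gl_WorkGroupID.x * 64;

    vec4 a = positions[vtx_base + tri.x];
    vec4 b = positions[vtx_base + tri.y];
    vec4 c = positions[vtx_base + tri.z];

    // Simplified backface test: 2D signed area after perspective divide.
    vec2 pa = a.xy / a.w, pb = b.xy / b.w, pc = c.xy / c.w;
    bool visible =
        (pb.x - pa.x) * (pc.y - pa.y) - (pb.y - pa.y) * (pc.x - pa.x) > 0.0;

    // Compact the surviving triangles into consecutive output slots.
    uint slot = visible ? atomicAdd(sh_prim_count, 1) : 0;
    barrier();

    // After the barrier the primitive count is uniform across the workgroup.
    SetMeshOutputsEXT(64, sh_prim_count);

    gl_MeshVerticesEXT[gl_LocalInvocationIndex].gl_Position =
        positions[vtx_base + gl_LocalInvocationIndex];
    if (visible)
        gl_PrimitiveTriangleIndicesEXT[slot] = tri;
}
```

Note that this sketch still emits all 64 vertices and only compacts the primitives; compacting the vertex set as well is possible but more involved.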

What is a task shader aka. amplification shader?

The task shader is an optional stage which operates like a compute shader. Each task shader workgroup has two main purposes:

  • Decide how many mesh shader workgroups need to be launched.
  • Create an optional “task payload” which can be passed to mesh shaders.
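Both purposes show up directly in the shader. Here is a minimal task shader sketch in GL_EXT_mesh_shader GLSL; the payload struct and the number 32 are arbitrary choices for illustration:

```glsl
#version 460
#extension GL_EXT_mesh_shader : require

layout(local_size_x = 1) in;

// The payload is visible to every mesh shader workgroup
// launched by this task shader workgroup.
struct Payload {
    uint firstMeshlet;
};
taskPayloadSharedEXT Payload payload;

void main()
{
    // Hypothetical: each task workgroup is responsible for 32 meshlets.
    payload.firstMeshlet = gl_WorkGroupID.x * 32;

    // Launch 32 mesh shader workgroups -- this is the "amplification".
    EmitMeshTasksEXT(32, 1, 1);
}
```

The mesh shaders launched by this workgroup can then read `payload.firstMeshlet` to find their data.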

The “geometry amplification” is achieved by choosing to launch more (or fewer) mesh shader workgroups. As opposed to the fixed-function tessellator in the traditional pipeline, it is now entirely up to the application how to create the vertices.

While you could re-implement the old fixed-function tessellation with mesh shading, this may not actually be necessary and your application may work fine with some other simpler algorithm.

Another interesting use case for task shaders is per-meshlet culling, meaning that a task shader is a good place to decide which meshlets you actually want to render and eliminate entire mesh shader workgroups which would otherwise operate on invisible primitives.
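A sketch of per-meshlet culling in a task shader, assuming hypothetical per-meshlet bounding spheres in an SSBO and, for brevity, a single frustum plane in a push constant (a real shader would test all six planes):

```glsl
#version 460
#extension GL_EXT_mesh_shader : require

layout(local_size_x = 32) in;

// Hypothetical bounding spheres: xyz = center, w = radius.
layout(binding = 0) readonly buffer Bounds { vec4 bounds[]; };
layout(push_constant) uniform Push { vec4 frustumPlane; };

struct Payload {
    uint meshletIndices[32];
};
taskPayloadSharedEXT Payload payload;

shared uint sh_count;

void main()
{
    if (gl_LocalInvocationIndex == 0)
        sh_count = 0;
    barrier();

    // Each invocation tests one meshlet's bounding sphere against the plane.
    uint meshlet = gl_WorkGroupID.x * 32 + gl_LocalInvocationIndex;
    vec4 s = bounds[meshlet];
    bool visible = dot(frustumPlane.xyz, s.xyz) + frustumPlane.w > -s.w;

    // Compact the indices of surviving meshlets into the payload.
    if (visible) {
        uint slot = atomicAdd(sh_count, 1);
        payload.meshletIndices[slot] = meshlet;
    }
    barrier();

    // Launch mesh workgroups only for the meshlets that survived culling.
    EmitMeshTasksEXT(sh_count, 1, 1);
}
```

Each launched mesh shader workgroup would then look up its meshlet index as `payload.meshletIndices[gl_WorkGroupID.x]`.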

  • Task shader vs. Tessellation. Tessellation relies on fixed-function hardware and makes users do complicated shader I/O gymnastics with tessellation control shaders. The task shader is straightforward and only as complicated as you want it to be.
  • Task shader vs. Geometry shader. Geometry shaders operate on input primitives directly and replace them with “strip” primitives. Task shaders don’t have to get directly involved with the geometry output; they just specify how many mesh shader workgroups to launch and let the mesh shader deal with the nitty-gritty details.

Usefulness

For now, I’ll just discuss a few basic use cases.

Meshlets. If your application loads input vertex data from somewhere, it is recommended that you subdivide that data (the “mesh”) into smaller chunks called “meshlets”. Then you can write your shaders such that each mesh shader workgroup processes a single meshlet.
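One possible (hypothetical) way to describe such meshlets to the mesh shader is a small descriptor struct in an SSBO, for example:

```glsl
// One possible meshlet descriptor layout; the field names and the
// surrounding buffers are illustrative, not part of any API.
struct Meshlet {
    uint vertexOffset;    // first entry in the meshlet vertex remap table
    uint triangleOffset;  // first index triple of this meshlet
    uint vertexCount;     // must be <= the mesh shader's max_vertices
    uint triangleCount;   // must be <= the mesh shader's max_primitives
};
layout(binding = 2) readonly buffer Meshlets { Meshlet meshlets[]; };
```

Each mesh shader workgroup then reads `meshlets[gl_WorkGroupID.x]` (or an index from the task payload) and processes just that chunk.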

Procedural geometry. Your application can generate all its vertices and primitives based on a mathematical formula that is implemented in the shader. In this case, you don’t need to load any inputs, just implement your formula as if you were writing a compute shader, then store the results into the mesh shader output arrays.
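For example, here is a sketch of a mesh shader that emits a flat grid procedurally, one quad (two triangles) per workgroup, with no vertex inputs at all; the `gridSize` push constant is a made-up parameter, and the application would dispatch `gridSize * gridSize` workgroups:

```glsl
#version 460
#extension GL_EXT_mesh_shader : require

layout(local_size_x = 4) in;
layout(triangles, max_vertices = 4, max_primitives = 2) out;

// Hypothetical push constant: number of quads along one side of the grid.
layout(push_constant) uniform Push { uint gridSize; };

void main()
{
    SetMeshOutputsEXT(4, 2);

    // Each of the 4 invocations computes one corner of this quad,
    // mapping the grid onto [-1, 1] in clip space.
    uvec2 cell   = uvec2(gl_WorkGroupID.x % gridSize,
                         gl_WorkGroupID.x / gridSize);
    uvec2 corner = uvec2(gl_LocalInvocationIndex & 1u,
                         gl_LocalInvocationIndex >> 1u);
    vec2 pos = (vec2(cell + corner) / float(gridSize)) * 2.0 - 1.0;
    gl_MeshVerticesEXT[gl_LocalInvocationIndex].gl_Position =
        vec4(pos, 0.0, 1.0);

    // Two triangles forming the quad; two invocations write one each.
    if (gl_LocalInvocationIndex < 2)
        gl_PrimitiveTriangleIndicesEXT[gl_LocalInvocationIndex] =
            gl_LocalInvocationIndex == 0 ? uvec3(0, 1, 2) : uvec3(2, 1, 3);
}
```

Any formula works in place of the flat grid: displace the positions by a noise function, evaluate a parametric surface, and so on.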

Replacing compute pre-passes. Many modern games use a compute pre-pass: they launch some compute shaders that do some pre-processing on the geometry before the graphics work. With mesh shading, these may no longer be necessary. The compute work can be made part of either the task or mesh shader, which removes the overhead of the additional submission.

Note that mesh shader workgroups may be launched as soon as the corresponding task shader workgroup is finished, so mesh shader execution (of the already finished tasks) may overlap with task shader execution, removing the need for extra synchronization on the application side.

Conclusion

Thus far, I’ve sold you on how awesome and flexible mesh shading is, so it’s time to ask the million dollar question.

Is mesh shading for you?

The answer, as always, is: It depends.

Yes, mesh and task shaders do give you a lot of opportunities to implement things just the way you like them without the stupid hardware getting in your way, but as with any low-level tools, this also means that you get a lot of possibilities for shooting yourself in the foot.

The traditional vertex processing pipeline has been around for so long that on most hardware it’s extremely well optimized because the drivers do a lot of optimization work for you. Therefore, just because an app uses mesh shaders doesn’t automatically mean that it’s going to be faster or better in any way. It’s only worth it if you are willing to do it well.

That being said, perhaps the easiest way to start experimenting with mesh shaders is to rewrite parts of your application that used to use geometry shaders. Geometry shaders are so horribly inefficient that it’ll be difficult to write a worse mesh shader.

How is mesh shading implemented under the hood?

Stay tuned for Part 2 if you are curious about that! In Part 2, I’m going to talk about how mesh and task shaders are implemented in a driver. This will shed some light on how these shaders work internally and why certain things perform really badly.
