<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.2.2">Jekyll</generator><link href="https://timur.hu/feed.xml" rel="self" type="application/atom+xml" /><link href="https://timur.hu/" rel="alternate" type="text/html" /><updated>2022-09-03T23:28:34+02:00</updated><id>https://timur.hu/feed.xml</id><title type="html">Timur’s blog</title><subtitle>Developer and electrical engineer who prefers quality over quantity. Contributor to the Linux open source graphics stack and other cool stuff.</subtitle><author><name>Timur Kristóf</name></author><entry><title type="html">Mesh shaders arrive on your Linux computers</title><link href="https://timur.hu/blog/2022/mesh-shaders-arrive-on-linux" rel="alternate" type="text/html" title="Mesh shaders arrive on your Linux computers" /><published>2022-09-03T07:17:44+02:00</published><updated>2022-09-03T07:17:44+02:00</updated><id>https://timur.hu/blog/2022/mesh-shaders-arrive-on-linux</id><content type="html" xml:base="https://timur.hu/blog/2022/mesh-shaders-arrive-on-linux"><![CDATA[<p>September 1 was a big day!
The <a href="https://github.com/KhronosGroup/Vulkan-Docs/commit/135da3a538263ef0d194cab25e2bb091119bdc42">official cross-vendor Vulkan mesh shading extension</a> that I teased a while ago has
<a href="https://www.khronos.org/blog/mesh-shading-for-vulkan">now been officially released</a>.
This is a significant moment for me because I’ve spent considerable time writing
<a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18367">the RADV implementation</a> and
collaborated with some excellent people to help shape this extension at Khronos.</p>

<h2 id="how-it-started">How it started</h2>

<p>We first started talking about mesh shaders in Mesa about two years (maybe more?) ago.
At the time the only publicly available Vulkan mesh shading API was the vendor-specific
<code class="language-plaintext highlighter-rouge">NV_mesh_shader</code> made by NVidia.</p>

<ul>
  <li>At the time, nobody quite understood what mesh shaders were supposed to be
and how they could work on the HW. I initially anticipated that they could be
made to work even on RDNA1, but this turned out to be false due to some HW limitations.</li>
  <li>It was unclear what was needed to get good performance.</li>
  <li>The NVidia extension was a good start, but it was full of things that don’t make
any real sense on any other HW.</li>
  <li>Nobody had a clue how we would implement task shaders.</li>
</ul>

<p>I was working together with <a href="https://twitter.com/pixeljetstream">Christoph Kubisch</a> (from NVidia) who helped us understand what
this topic is all about.
<a href="https://gitlab.freedesktop.org/cmarcelo">Caio Oliveira</a> and <a href="https://gitlab.freedesktop.org/mslusarz">Marcin Slusarz</a> (from Intel, working on ANV)
joined the adventure too.</p>

<h2 id="how-we-made-it-work">How we made it work</h2>

<p>We made the decision to start working on some preliminary support for
the NV extension to get some experience with the topic and learn how it’s supposed to work.
Once we were satisfied with that, we made the jump to the EXT.</p>

<h3 id="nir-and-spir-v">NIR and SPIR-V</h3>

<p>These are <a href="https://www.jlekstrand.net/jason/blog/2022/01/in-defense-of-nir/">the common Mesa pieces that all drivers can use</a>.
The front-end and middleware code for mesh/task shader support is here.
Caio created the initial pieces and I expanded on that as I found more and more
things that needed adjustment in NIR, added a new storage class for task payloads etc.
Marcin also chimed in with fixes and improvements.</p>

<p>AMD and Intel hardware work differently, so most of the hard
work couldn’t be shared and needed to be implemented in the backends.
However, some of the commonalities are implemented in eg. <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16720"><code class="language-plaintext highlighter-rouge">nir_lower_task_shader</code></a>
for the work that needs to happen in both RADV and ANV.
There were dozens of merge requests that added support for various features,
cleaned up old features to make them not crash on mesh shaders, etc.
The latest is <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18366">this MR</a>
which adds all the remaining puzzle pieces for the EXT.</p>

<h3 id="lowering-the-shaders-in-the-backend">Lowering the shaders in the backend</h3>

<p>Because mesh shaders use NGG, the heavy lifting is done in <code class="language-plaintext highlighter-rouge">ac_nir_lower_ngg</code>
which is already responsible for other NGG shaders (VS, TES, GS).
The lowering basically “translates” the API shader to work like the HW expects.
Without going into much detail, it essentially wraps the application shader and replaces
some pieces of it to make them understandable by the HW.</p>

<p>There is now also an <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14929"><code class="language-plaintext highlighter-rouge">ac_nir_lower_taskmesh_io_to_mem</code></a>
for translating task payload I/O and <code class="language-plaintext highlighter-rouge">EmitMeshTasksEXT</code> to something the HW can understand.</p>
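<p>To give a rough idea of what such a lowering does, here is a hypothetical Python sketch (not the actual NIR pass; names and layout are made up): task payload variable accesses are rewritten into address calculations plus plain memory reads/writes into a per-workgroup slot of a VRAM ring buffer.</p>

```python
# Hypothetical sketch of lowering task payload I/O to memory accesses.
# Names and layout are illustrative, not the actual RADV/NIR implementation.

RING_ENTRIES = 4            # payload slots in the ring buffer (made-up size)
PAYLOAD_STRIDE = 16384      # max 16K payload per task workgroup

# Simulated VRAM ring buffer holding one payload slot per in-flight workgroup.
vram = bytearray(RING_ENTRIES * PAYLOAD_STRIDE)

def payload_address(workgroup_serial, offset):
    # Each workgroup gets a ring slot; slots are reused modulo the ring size.
    slot = workgroup_serial % RING_ENTRIES
    return slot * PAYLOAD_STRIDE + offset

def store_payload(workgroup_serial, offset, data: bytes):
    # What a task shader's payload store is lowered to: a VRAM write.
    addr = payload_address(workgroup_serial, offset)
    vram[addr:addr + len(data)] = data

def load_payload(workgroup_serial, offset, size):
    # What a mesh shader's payload load is lowered to: a VRAM read.
    addr = payload_address(workgroup_serial, offset)
    return bytes(vram[addr:addr + size])

store_payload(5, 8, b"meshlets")
```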

<h3 id="mesh-shading-draw-calls-in-radv">Mesh shading draw calls in RADV</h3>

<p>Previously I was mostly working on the compiler stack (NIR and ACO) and had
little experience with the hardcore driver code,
such as how draws/dispatches work and how the driver submits cmd buffers
to the GPU. As I had near-zero knowledge of these areas, I had to learn about them,
mostly by just reading the code.</p>

<p>So, I split the RADV work in two parts:</p>

<ol>
  <li><a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13580">Mesh shader only pipelines.</a>
These required only moderate changes to
<code class="language-plaintext highlighter-rouge">radv_cmd_buffer</code> to add some new draw calls, and minor work in <code class="language-plaintext highlighter-rouge">radv_pipeline</code>
to figure out per-primitive output attributes and get the register programming right.</li>
  <li><a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16531">Task shader support.</a>
Due to how task shaders work on AMD HW, these required
very severe refactoring in <code class="language-plaintext highlighter-rouge">radv_device</code> because we now have to submit to multiple
queues at once. This is actually not finished yet because “gang submit” is still missing.
Additionally, <code class="language-plaintext highlighter-rouge">radv_cmd_buffer</code> needed heavy changes to support an internal compute cmdbuf.</li>
</ol>

<p>Naturally, during this work I also hit several RADV bugs in pre-existing use cases which
nobody had noticed yet. Some of these were issues with secondary command buffers and
conditional rendering on compute-only queues. There was also a nasty firmware bug, and other
exciting stuff.</p>

<p>Let’s just say that my GPU loved to <strong>hang</strong> out with me.</p>

<h3 id="mesh-shading-on-intel">Mesh shading on Intel</h3>

<p>The Intel ANV compiler backend and driver implementation were done by
Caio and Marcin and I just want to take this opportunity to say that
I really enjoyed working together with them.</p>

<h3 id="the-guy-who-wrote-the-most-mesh-shaders-on-earth">The guy who wrote the most mesh shaders on Earth</h3>

<p>All of the above would have been impossible if I hadn’t had some solid test cases
which I could throw at my implementation. Ricardo Garcia developed the
CTS (Vulkan Conformance Test Suite) testcases for both NV_mesh_shader and EXT_mesh_shader.
During that work, Ricardo wrote <strong>several thousand</strong> mesh and task shaders for us to test with.
Ricardo if you’re reading this, <strong>THANK YOU</strong>!!!</p>

<h2 id="conclusion">Conclusion</h2>

<p>Implementing <code class="language-plaintext highlighter-rouge">VK_EXT_mesh_shader</code> gave me a very good learning experience and helped me
get an understanding of parts of the driver that I had never looked at before.</p>

<h3 id="what-happens-to-nv_mesh_shader-now">What happens to NV_mesh_shader now?</h3>

<p>We never wanted to officially support it; it was just a stepping stone to help us
start working on the EXT. The NV extension support will be removed soon.</p>

<h3 id="when-is-it-coming-to-my-steam-deck--linux-computer">When is it coming to my Steam Deck / Linux computer?</h3>

<p>The RADV and ANV support will be included once the system is updated to Mesa 22.3,
though we may be convinced to bring it to the Deck sooner
if somebody finds a game that uses mesh shaders.</p>

<p>For NVidia proprietary driver users, EXT_mesh_shader is already included in the latest beta drivers.</p>

<h3 id="waiting-for-the-gang">Waiting for the gang…</h3>

<p>We marked mesh/task shader support “experimental” in RADV because it has one main caveat
that we are unable to solve without kernel support.
Because we need to submit to two different queues at the same time, this can deadlock
your GPU if you are running two (or more) processes which use task shaders simultaneously.
To properly solve it we need the <strong>“gang submit”</strong> feature in the kernel which prevents such deadlocks.</p>

<p>Unfortunately “gang submit” is not upstream yet. Cross your fingers and let’s hope it’ll be included in Linux 6.1.</p>

<p>Until then, you can use the <code class="language-plaintext highlighter-rouge">RADV_PERFTEST=ext_ms</code> environment variable
to play your favourite mesh shader games!</p>]]></content><author><name>Timur Kristóf</name></author><category term="graphics" /><category term="mesh" /><category term="freedesktop" /><category term="work" /><summary type="html"><![CDATA[September 1 was a big day! The official cross-vendor Vulkan mesh shading extension that I teased a while ago, has now been officially released. This is a significant moment for me because I’ve spent considerable time making the RADV implementation and collaborated with some excellent people to help shape this extension in Khronos.]]></summary></entry><entry><title type="html">What is NGG and shader culling on AMD RDNA GPUs?</title><link href="https://timur.hu/blog/2022/what-is-ngg" rel="alternate" type="text/html" title="What is NGG and shader culling on AMD RDNA GPUs?" /><published>2022-06-27T07:17:44+02:00</published><updated>2022-06-27T07:17:44+02:00</updated><id>https://timur.hu/blog/2022/what-is-ngg</id><content type="html" xml:base="https://timur.hu/blog/2022/what-is-ngg"><![CDATA[<p>NGG (Next Generation Geometry) is the technology that is responsible for
any vertex and geometry processing in AMD RDNA GPUs.
I decided to do a write-up about my experience implementing it
in RADV, which is the Vulkan driver used by many Linux systems, including the Steam Deck.
I will also talk about shader culling on RDNA GPUs.</p>

<h2 id="old-stuff-the-legacy-geometry-pipeline-on-gcn">Old stuff: the legacy geometry pipeline on GCN</h2>

<p>Let’s start by briefly going over how the old GCN geometry pipeline works,
so that we can compare old and new.</p>

<p>GCN GPUs have 5 programmable hardware shader stages for vertex/geometry processing: LS, HS, ES, GS, VS.
These HW stages don’t exactly map to the software stages that are advertised in the API.
Instead, it is the responsibility of the driver to select which HW stages need to be
used for a given pipeline. <a href="https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/amd/compiler/README.md#which-software-stage-runs-on-which-hardware-stage">We have a table</a> if you want details.</p>

<p>The rasterizer can only consume the output from HW VS, so the last SW stage in your pipeline
must be compiled to HW VS. This is trivial for VS and TES (yes, tess eval shaders are “VS” to
the hardware), but GS is complicated.
GS outputs are written to memory. The driver has no choice but to compile a separate shader that runs as HW VS
and copies these outputs from memory to the rasterizer.
We call this the “GS copy shader” in Mesa. (It is not part of the API pipeline but required to make GS work.)</p>
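<p>As a toy illustration of that flow (purely a sketch, not driver code): the HW GS stage writes its outputs to memory, and the separate GS copy shader, running as HW VS, reads that memory back and feeds the rasterizer.</p>

```python
# Toy model of the legacy GCN GS flow: HW GS writes outputs to memory,
# and a separate "GS copy shader" (running as HW VS) reads them back.
# Purely illustrative; the real flow uses GS ring buffers in VRAM.

gs_ring = []  # stands in for the VRAM ring buffer that GS writes to

def hw_gs(input_vertices):
    # A toy SW GS that emits one output vertex per input vertex;
    # outputs go to memory, never directly to the rasterizer.
    for v in input_vertices:
        gs_ring.append({"position": v})

def gs_copy_shader():
    # Runs as HW VS: copies the memory contents out to the rasterizer.
    return [out["position"] for out in gs_ring]

hw_gs([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
rasterizer_input = gs_copy_shader()
```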

<p>Notes:</p>

<ul>
  <li>All of these HW stages (except GS, of course) use a model in which
1 shader invocation (SIMD lane, or simply thread in D3D jargon) corresponds to 1 output vertex.
These stages (except GS) are also not aware of output primitives.</li>
  <li>Vega introduced “merged shaders” (LS+HS, and ES+GS) but did not fundamentally change the above model.</li>
</ul>

<h2 id="new-stuff-ngg-pipeline-on-rdna">New stuff: NGG pipeline on RDNA</h2>

<p>The next-generation geometry pipeline vastly simplifies how the hardware works.
(At the expense of some increased driver complexity.)</p>

<p>There are now only 2 HW shader stages for vertex/geometry processing:</p>

<ul>
  <li><em>Surface shader</em> which is a pre-tessellation stage and is equivalent to what LS + HS was in the old HW.</li>
  <li><em>Primitive shader</em> which can feed the rasterizer and replaces all of the old ES + GS + VS stages.</li>
</ul>

<p>The surface shader is not too interesting: it runs the merged SW VS + TCS when tessellation is enabled.
I’m not aware of any changes to how this works compared to old HW.</p>

<p>The interesting part, and the subject of much discussion on the internet, is the primitive shader.
In some hardware documentation and register header files the new stage is referred to as simply “GS”,
because AMD essentially took what the GS stage could already do and added the ability for it to
directly feed the rasterizer using <code class="language-plaintext highlighter-rouge">exp</code> (export) instructions.
Don’t confuse this with SW GS.
It supports a superset of the functionality that you can do in software VS, TES, GS and MS.
In very loose terms you can think about it as a “mesh-like” stage which all these software stages can be compiled to.</p>

<p>Compared to the old HW VS, a primitive shader has these new features:</p>

<ul>
  <li>Compute-like: they are running in workgroups, and have full support for features
such as workgroup ID, subgroup count, local invocation index, etc.</li>
  <li>Aware of both input primitives and vertices:
there are registers which contain information about the input
primitive topology and the overall number of vertices/primitives (similar to GS).</li>
  <li>They have to export not only vertex output attributes (positions and parameters),
but also the <em>primitive topology</em>, ie. which primitive (eg. triangle) contains which vertices
and in what order.
Instead of processing vertices in a fixed topology, it is up to the shader to create
as many vertices and primitives as the application wants.</li>
  <li>Each shader invocation can create up to 1 vertex and up to 1 primitive.</li>
  <li>Before outputting any vertex or primitive, a workgroup has to tell how many it will output,
using <code class="language-plaintext highlighter-rouge">s_sendmsg(gs_alloc_req)</code> which ensures that the necessary amount of space
in the parameter cache is allocated for them.</li>
  <li>On RDNA2, per-primitive output params are also supported.</li>
</ul>
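<p>To make the list above more concrete, here is a toy Python model (purely illustrative, not real shader or driver code) of the export contract: the workgroup declares its output counts up front, standing in for <code class="language-plaintext highlighter-rouge">s_sendmsg(gs_alloc_req)</code>, and then each invocation exports at most one vertex and at most one primitive.</p>

```python
# Toy model of the NGG primitive shader export contract. Illustrative only;
# the real shader uses s_sendmsg(gs_alloc_req) and exp instructions.

class PrimitiveShaderWorkgroup:
    def __init__(self, num_vertices, num_primitives):
        # gs_alloc_req: declare output counts up front so the HW can
        # reserve parameter cache space before any exports happen.
        self.alloc = (num_vertices, num_primitives)
        self.vertices = []
        self.primitives = []

    def invocation(self, tid, vertex=None, primitive=None):
        # Each invocation may export up to 1 vertex and up to 1 primitive.
        if vertex is not None:
            self.vertices.append(vertex)
        if primitive is not None:
            self.primitives.append(primitive)

# A strip-like toy topology: 4 vertices, 2 triangles.
wg = PrimitiveShaderWorkgroup(num_vertices=4, num_primitives=2)
positions = [(0, 0), (1, 0), (0, 1), (1, 1)]
for tid in range(4):
    prim = (tid, tid + 1, tid + 2) if tid < 2 else None
    wg.invocation(tid, vertex=positions[tid], primitive=prim)
```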

<h3 id="how-is-shader-compilation-different">How is shader compilation different?</h3>

<p>Software VS and TES:<br />
Compared to the legacy pipeline, the compiled shaders no longer just export vertex output attributes
for a fixed number of vertices; instead, they
create vertices/primitives (after declaring how many they will create).
This is trivial to implement because all they have to do is read the registers that contain the input primitive
topology and then export the exact same topology.</p>

<p>Software GS:<br />
As noted above, each NGG shader invocation can only create up to 1 vertex + up to 1 primitive.
This mismatches the programming model of SW GS and makes it difficult to implement.
In a nutshell, for SW GS the hardware launches a large enough workgroup to fit every possible
output vertex. This results in poor HW utilization (most of those threads just sit there doing nothing
while the GS threads do the work), but there is not much we can do about that.</p>
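<p>A rough way to picture the utilization problem (the numbers here are made up for illustration): the workgroup must be sized for the worst-case output, so lanes corresponding to vertices the GS never emits simply idle.</p>

```python
# Toy utilization estimate for SW GS on NGG. The workgroup has to be large
# enough for every *possible* output vertex, so when the GS emits fewer
# than its declared maximum, the extra lanes sit idle. Numbers are made up.

def ngg_gs_lanes(gs_invocations, max_vertices_per_invocation):
    # One lane per potential output vertex.
    return gs_invocations * max_vertices_per_invocation

def utilization(gs_invocations, max_vertices_per_invocation, emitted_vertices):
    launched = ngg_gs_lanes(gs_invocations, max_vertices_per_invocation)
    return emitted_vertices / launched

# E.g. a GS declaring max_vertices = 64 but typically emitting only 6:
lanes = ngg_gs_lanes(gs_invocations=32, max_vertices_per_invocation=64)
util = utilization(32, 64, emitted_vertices=32 * 6)
```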

<p>Mesh shaders:<br />
The new pipeline enables us to support mesh shaders, which was simply impossible on the legacy pipeline,
due to how the programming model entirely mismatches anything the old hardware could do.</p>

<h3 id="how-does-any-of-this-make-my-games-go-faster">How does any of this make my games go faster?</h3>

<p>We did some benchmarks when we switched RADV and ACO to use the new pipeline.
We found no significant perf changes. At all.
Considering all the hype we heard about NGG at the hardware launch, I was quite surprised.</p>

<p>However, once I set the hype aside, the reason became quite clear.
When we switched to NGG, we still compiled our shaders mostly the same way as before, so even though
we used the new geometry pipeline, we didn’t do anything to take advantage of its new capabilities.</p>

<p>The actual perf improvement came after I also implemented shader-based culling.</p>

<h3 id="what-is-ngg-shader-culling">What is NGG shader culling?</h3>

<p>The NGG pipeline makes it possible for shaders to know about input primitives and
create an arbitrary topology of output primitives.
Even though the API does not make this information available to application shaders, it is possible
for driver developers to make their compiler aware of it and add some crazy code that
can get rid of primitives when it knows that those will never be actually
visible. The technique is known as “shader culling”, or “NGG culling”.</p>

<p>This can improve performance in games that have a lot of triangles, because
we calculate only the output positions of each vertex before deciding
which triangles to remove. We then also remove the vertices that become unused.</p>
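<p>For a flavor of what such culling code does (a simplified sketch, not the actual RADV lowering), here is a back-face test using only the already-computed positions, followed by dropping the vertices that no surviving triangle references:</p>

```python
# Simplified sketch of shader culling: back-face cull triangles using only
# their (already computed) 2D positions, then drop unreferenced vertices.
# The real implementation runs inside the NGG shader and also performs
# small-primitive culling; this is just the core idea.

def signed_area(a, b, c):
    # Twice the signed area of triangle abc; the sign gives the winding.
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def cull(positions, triangles):
    # Keep only counter-clockwise (front-facing) triangles.
    kept = [t for t in triangles
            if signed_area(*(positions[i] for i in t)) > 0]
    # Compact away vertices that no surviving triangle references.
    used = sorted({i for t in kept for i in t})
    remap = {old: new for new, old in enumerate(used)}
    new_positions = [positions[i] for i in used]
    new_triangles = [tuple(remap[i] for i in t) for t in kept]
    return new_positions, new_triangles

pos = [(0, 0), (1, 0), (0, 1), (2, 2)]
tris = [(0, 1, 2),   # counter-clockwise: kept
        (0, 2, 1)]   # clockwise: back-facing, culled
new_pos, new_tris = cull(pos, tris)
```

Note that vertex 3 is removed as well, because no surviving triangle uses it.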

<p>The benefits are:</p>

<ul>
  <li>Reduced bottleneck from the fixed-function HW that traditionally does culling.</li>
  <li>Improved bandwidth use, because we can avoid loading some inputs for vertices we delete.</li>
  <li>Improved shader HW utilization because we can avoid computing additional vertex attributes for deleted vertices.</li>
  <li>More efficient PC (parameter cache) use as we don’t need to reserve output space for deleted vertices and primitives.</li>
</ul>

<p>If there is interest, I may write a blog post about the implementation details later.</p>

<h4 id="caveats-of-shader-culling">Caveats of shader culling</h4>

<p>Because all of this reduces certain bottlenecks, its effectiveness depends heavily on
whether you actually had such a bottleneck in the first place.
How many primitives it can remove of course depends on the application, so
the exact percentage of performance gain (or loss) also depends on the application.</p>

<p>If an application didn’t have any of the aforementioned bottlenecks or already mitigates them in its
own way, then all of this new code may just add unnecessary overhead and actually slightly
reduce performance rather than improve it.</p>

<p>Other than that, there is some concern that a shader-based implementation may be less accurate
than the fixed-function HW.</p>

<ul>
  <li>It may leave some triangles which should have been removed.
This is not really an issue as these will be removed by fixed-func HW anyway.</li>
  <li>The bigger problem is that it may delete triangles which should have been kept.
This can manifest as objects missing from a scene, flickering, etc.</li>
</ul>

<h4 id="our-results-with-shader-culling-on-rdna2">Our results with shader culling on RDNA2</h4>

<p>Shader culling seems to be most efficient in benchmarks and apps that output a lot of
triangles without having any in-app solution for dealing with the bottlenecks.
It is also very effective on games that suffer from overtessellation (ie.
they create a lot of very small triangles which are not visible).

<ul>
  <li>An extreme example is the instancing demo by Sascha Willems which gets a massive boost</li>
  <li>Basemark sees 10%+ performance improvement</li>
  <li>Doom Eternal gets a 3%-5% boost (depending on the GPU and scene)</li>
  <li>The Witcher 3 also benefits (likely thanks to its overuse of tessellation)</li>
  <li>In less demanding games, the difference is negligible, around 1%</li>
</ul>

<p>While shader culling can also work on RDNA1, we don’t enable it by default
because we haven’t yet found a game that noticeably benefits from it.
On RDNA1, it seems that the old and new pipelines have similar performance.</p>

<h4 id="notes-about-hardware-support">Notes about hardware support</h4>

<ul>
  <li>Vega had something similar, but I haven’t heard of any drivers that ever used this.
Based on the public info I could find, it’s not even worth looking into.</li>
  <li>Navi 10 and 12 lack some features such as per-primitive outputs which makes it
impossible to implement mesh shaders on these GPUs.
We don’t use NGG on Navi 14 (RX 5500 series) because it doesn’t work.</li>
  <li>Navi 21 and newer have the best support.
They have all necessary features for mesh shaders.
We enabled shader culling by default on these GPUs because they show a measurable benefit.</li>
  <li>Van Gogh (the GPU in the Steam Deck) has the same feature set as Navi 2x.
It also shows benefits from shader culling, but to a smaller extent.</li>
</ul>

<h2 id="closing-thoughts">Closing thoughts</h2>

<p>The main takeaway from this post is that NGG is not a performance silver bullet that
magically makes all your games faster. Instead, it is an enabler of new features.
It lets the driver implement new techniques such as shader culling and new programming models like mesh shaders.</p>]]></content><author><name>Timur Kristóf</name></author><category term="graphics" /><category term="mesh" /><category term="freedesktop" /><category term="work" /><summary type="html"><![CDATA[NGG (Next Generation Geometry) is the technology that is responsible for any vertex and geometry processing in AMD RDNA GPUs. I decided to do a write-up about my experience implementing it in RADV, which is the Vulkan driver used by many Linux systems, including the Steam Deck. I will also talk about shader culling on RDNA GPUs.]]></summary></entry><entry><title type="html">Task shader driver implementation on AMD HW</title><link href="https://timur.hu/blog/2022/how-task-shaders-are-implemented" rel="alternate" type="text/html" title="Task shader driver implementation on AMD HW" /><published>2022-05-21T17:16:56+02:00</published><updated>2022-05-21T17:16:56+02:00</updated><id>https://timur.hu/blog/2022/how-task-shaders-are-implemented</id><content type="html" xml:base="https://timur.hu/blog/2022/how-task-shaders-are-implemented"><![CDATA[<p>Previously, I gave you an <a href="/blog/2022/mesh-and-task-shaders">introduction to mesh/task shaders</a> and wrote up some
details about <a href="/blog/2022/how-mesh-shaders-are-implemented">how mesh shaders are implemented in the driver</a>.
But I left out the important details of how task shaders (aka. amplification shaders) work in
the driver. In this post, I aim to give you some details about how task shaders
work under the hood. Like before, this is based on my experience implementing task shaders in RADV
and all details are already public information.</p>

<h2 id="refresher-about-the-task-shader-api">Refresher about the task shader API</h2>

<p>The task shader (aka. amplification shader in D3D12) is a new stage that runs in workgroups similar to compute shaders.
Each task shader workgroup has two jobs: determine how many mesh shader workgroups
it should launch (dispatch size), and optionally create a “payload” (up to 16K of data of your choice)
which is passed to mesh shaders.</p>

<p>Additionally, the API allows task shaders to perform atomic operations on the
payload variables.</p>

<p>Typical use of task shaders can be: cluster culling, LOD selection, geometry amplification.</p>
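<p>As a toy illustration of this programming model (plain Python, no Vulkan; all names are made up): a task “workgroup” tests a set of meshlets for visibility, writes the survivors into its payload, and returns the mesh dispatch size, standing in for <code class="language-plaintext highlighter-rouge">EmitMeshTasksEXT</code>.</p>

```python
# Toy model of a task shader doing cluster culling: decide the mesh shader
# dispatch size and fill a payload (here: indices of visible meshlets).
# Stands in for EmitMeshTasksEXT + a task payload; all names are made up.

def task_workgroup(meshlet_bounds, camera_min, camera_max):
    visible = []
    for idx, (lo, hi) in enumerate(meshlet_bounds):
        # Trivial 1D interval visibility test as a stand-in for real
        # frustum / occlusion culling of each meshlet.
        if hi >= camera_min and lo <= camera_max:
            visible.append(idx)
    payload = {"meshlet_indices": visible}   # up to 16K of data in the API
    dispatch_size = len(visible)             # mesh workgroups to launch
    return dispatch_size, payload

size, payload = task_workgroup(
    meshlet_bounds=[(0, 1), (5, 6), (10, 11), (2, 3)],
    camera_min=0, camera_max=4)
```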

<h2 id="expectations-on-task-shaders">Expectations on task shaders</h2>

<p>Before we get into any HW specific details, there are a few things we should unpack first.
Based on the API programming model, let’s think about some expectations on a good
driver implementation.</p>

<p><strong>Storing the output task payload.</strong>
There must exist some kind of buffer where the task payload is stored, and the size of this
buffer will obviously be a limiting factor on how many task shader workgroups can run in parallel.
Therefore, the implementation must ensure that only as many task workgroups run as there is space in this buffer.
Preferably this would be a ring buffer whose entries get reused between different task shader workgroups.</p>

<p><strong>Analogy with the tessellator.</strong>
The above requirements are pretty similar to what tessellation can already do.
So a natural conclusion may be that we could implement task shaders by
abusing the tessellator. However, this would introduce a potential bottleneck on
fixed-function hardware, which we would prefer to avoid.</p>

<p><strong>Analogy with a compute pre-pass.</strong>
Another similar thing that comes to mind is a compute pre-pass.
Many games already do something like this: some pre-processing in a compute dispatch that is
executed before a draw call. Of course, the application has to insert a barrier between the
dispatch and the draw, which means the draw can’t start before every invocation in the dispatch
is finished. In reality, not every graphics shader invocation depends on the results of
all compute invocations, but there is no way to express a more fine-grained dependency.
For task shaders, it is preferable to avoid this barrier and allow task and mesh
shader invocations to overlap.</p>

<h2 id="task-shaders-on-amd-hw">Task shaders on AMD HW</h2>

<p>What I discuss here is based on information that is already publicly available
in open source drivers. If you are already familiar with how AMD’s own PAL-based drivers
work, you won’t find any surprises here.</p>

<p>First things first. Under the hood, task shaders are compiled to a plain old compute shader.
The task payload is located in VRAM.
The shader code that stores the mesh dispatch size and payload is compiled to
memory writes which store these in VRAM ring buffers.
Even though they are compute shaders as far as the AMD HW is concerned, task shaders
do not work like a compute pre-pass.
Instead, task shaders are dispatched on an async compute queue while
at the same time the mesh shader work is executed on the graphics queue
in parallel.</p>

<p>The task+mesh dispatch packets are different from a regular compute dispatch.
The compute and graphics queue firmwares work together in parallel:</p>

<ul>
  <li>Compute queue launches up to as many task workgroups as it has space available in the ring buffer.</li>
  <li>Graphics queue waits until a task workgroup is finished and can launch
mesh shader workgroups immediately.
Execution of mesh dispatches from a finished task workgroup
can therefore overlap with other task workgroups.</li>
  <li>When a mesh dispatch from a task workgroup is finished,
its slot in the ring buffer can be reused and a new task workgroup can be
launched.</li>
  <li>When the ring buffer is full, the compute queue waits until a mesh dispatch
is finished, before launching the next task workgroup.</li>
</ul>
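<p>The cooperation described above can be modeled with a tiny simulation (illustrative only; the real logic lives in the queue firmwares): the compute queue launches task workgroups while ring slots are free, and each finished mesh dispatch frees its slot for the next one.</p>

```python
# Tiny simulation of the task/mesh ring buffer scheduling described above.
# The real logic is implemented in firmware; this just models the invariant
# that at most ring_size task workgroups are in flight at any time.

from collections import deque

def run_taskmesh(num_task_workgroups, ring_size):
    in_flight = deque()      # task workgroups whose mesh work isn't done yet
    max_in_flight = 0
    launched = finished = 0
    while finished < num_task_workgroups:
        # Compute queue: launch while there is space in the ring buffer.
        while launched < num_task_workgroups and len(in_flight) < ring_size:
            in_flight.append(launched)
            launched += 1
        max_in_flight = max(max_in_flight, len(in_flight))
        # Graphics queue: a mesh dispatch finishes, freeing its ring slot.
        in_flight.popleft()
        finished += 1
    return finished, max_in_flight

done, peak = run_taskmesh(num_task_workgroups=10, ring_size=4)
```

Running it shows the compute queue never gets more than <code class="language-plaintext highlighter-rouge">ring_size</code> task workgroups ahead of the graphics queue.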

<p>You can find the concrete details in the PAL source code, or in RADV merge requests.</p>

<p>Side note, getting some implementation details wrong can easily cause a
deadlock on the GPU. It is great fun to debug these.</p>

<p>The relevant details here are that
most of the hard work is implemented in the firmware (good news, because that
means I don’t have to implement it), and that
<strong>task shaders are executed on an async compute queue</strong>
and that the driver now has to
<strong>submit compute and graphics work in parallel</strong>.</p>

<p>Keep in mind that the API hides this detail and pretends that the mesh shading pipeline
is just another graphics pipeline that the application can submit to a graphics queue.
So, once again we have a <strong>mismatch between the API programming model and what the
HW actually does</strong>.</p>

<h2 id="squeezing-a-hidden-compute-pipeline-in-your-graphics">Squeezing a hidden compute pipeline in your graphics</h2>

<p>In order to use this beautiful scheme provided by the firmware, the driver needs
to do two things:</p>

<ul>
  <li>Create a compute pipeline from the task shader.</li>
  <li>Submit the task shader work on the async compute queue while at the same time
also submit the mesh and pixel shader work on the graphics queue.</li>
</ul>

<p>We already had good support for compute pipelines in RADV (as much as the API needs),
but internally in the driver we’ve never had this kind of close cooperation between
graphics and compute.</p>

<p>When you use a draw call in a command buffer with a pipeline that has a task shader,
RADV must create a hidden, internal compute command buffer. This internal compute command buffer
contains the task shader dispatch packet, while the graphics command buffer contains the packet that
dispatches the mesh shaders.
We must also ensure correct synchronization between these two command buffers
according to application barriers ― because of the API mismatch it must work
as if the internal compute cmdbuf was part of the graphics cmdbuf.
We also need to emit the same descriptors and push constants, etc.
When the application submits the graphics queue, this new,
internal compute command buffer is then submitted to the async compute queue.</p>
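<p>A rough sketch of that bookkeeping (hypothetical, not RADV’s actual data structures): recording a mesh draw that uses a task shader appends the task dispatch to a hidden compute command buffer, and submit then sends the pair to their respective HW queues.</p>

```python
# Hypothetical sketch of the "hidden internal compute cmdbuf" idea.
# Not RADV's real data structures; just the shape of the bookkeeping.

class GraphicsCmdBuffer:
    def __init__(self):
        self.packets = []                    # graphics queue packets
        self.internal_compute = None         # created lazily on first task draw

    def draw_mesh_tasks(self, x, y, z):
        if self.internal_compute is None:
            self.internal_compute = []       # the hidden compute cmdbuf
        # Task dispatch goes to the compute cmdbuf, mesh dispatch to graphics.
        self.internal_compute.append(("dispatch_task", x, y, z))
        self.packets.append(("dispatch_mesh",))

def queue_submit(cmdbuf):
    # On submit, the hidden compute cmdbuf goes to the async compute queue
    # alongside the application's graphics cmdbuf.
    submissions = {"gfx": cmdbuf.packets}
    if cmdbuf.internal_compute is not None:
        submissions["async_compute"] = cmdbuf.internal_compute
    return submissions

cb = GraphicsCmdBuffer()
cb.draw_mesh_tasks(8, 1, 1)
subs = queue_submit(cb)
```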

<p>Thus far, this sounds pretty logical and easy.</p>

<p>The actual hard work is to make it possible for the driver to submit work
to different queues at the same time. RADV’s queue code was written
assuming that there is a 1:1 mapping between <code class="language-plaintext highlighter-rouge">radv_queue</code> objects and
HW queues. To make task shaders work we must now break this assumption.</p>

<p>So, of course I had to do some crazy refactor to enable this.
At the time of writing the AMDGPU Linux kernel driver doesn’t support
<em>“gang submit”</em> yet, so I use scheduled dependencies instead.
This has the drawback of submitting to the two queues sequentially
rather than doing everything in the same submit.</p>

<h2 id="conclusion-perf-considerations">Conclusion, perf considerations</h2>

<p>Let’s turn the above wall of text into some performance considerations that
you can actually use when you write your next mesh shading application.</p>

<ol>
  <li>Because task shaders are executed on a different HW queue,
there is some overhead.
Don’t use task shaders for small draws or other cases when this
overhead may be more than what you gain from them.</li>
  <li>For the same reason, barriers may require the driver to emit
some commands that stall the async compute queue.
Be mindful of your barriers (eg. top of pipe, etc) and only use these
when your task shader actually depends on some previous graphics work.</li>
  <li>Because task payload is written to VRAM by the task shader,
and has to be read from VRAM by the mesh shader, there is some
latency.
Only use as much payload memory as you need.
Try to compact the memory use by packing your data etc.</li>
  <li>When you have a lot of geometry data, it is beneficial to
implement cluster culling in your task shader.
After you’ve done this, it may or may not be worth it
to implement per-triangle culling in your mesh shader.</li>
  <li>Don’t try to reimplement the classic vertex processing
pipeline or emulate fixed-function HW with task+mesh shaders.
Instead, come up with simpler ways that work better for your app.</li>
</ol>

<p><a href="https://developer.nvidia.com/blog/advanced-api-performance-mesh-shaders/">NVidia also has some perf recommendations here</a>
most of which apply to other HW as well, except for the recommended number of vertices and primitives per meshlet, because
the sweet spot for that can differ between GPU architectures.</p>

<h3 id="stay-tuned">Stay tuned</h3>

<p><a href="https://github.com/KhronosGroup/Vulkan-Docs/issues/1423#issuecomment-1098534021">It has been officially confirmed</a>
that a Vulkan cross-vendor mesh shading extension is coming soon.
Update: <a href="https://www.khronos.org/blog/mesh-shading-for-vulkan">it’s here!</a></p>

<p><del>While I can’t give you any details about the new extension,</del> I think it won’t be a surprise
to anyone that it <del>may have been</del> was the motivation for my work on mesh and task shaders.</p>

<p>Once the new extension goes public, I will post some thoughts about it and a comparison to
the vendor-specific <code class="language-plaintext highlighter-rouge">NV_mesh_shader</code> extension.</p>

<p><em>Note, 2022-09-03: updated with a link to the new vendor-neutral Vulkan mesh shader extension.</em></p>]]></content><author><name>Timur Kristóf</name></author><category term="graphics" /><category term="mesh" /><category term="freedesktop" /><category term="work" /><summary type="html"><![CDATA[Previously, I gave you an introduction to mesh/task shaders and wrote up some details about how mesh shaders are implemented in the driver. But I left out the important details of how task shaders (aka. amplification shaders) work in the driver. In this post, I aim to give you some details about how task shaders work under the hood. Like before, this is based on my experience implementing task shaders in RADV and all details are already public information.]]></summary></entry><entry><title type="html">How mesh shaders are implemented in an AMD driver</title><link href="https://timur.hu/blog/2022/how-mesh-shaders-are-implemented" rel="alternate" type="text/html" title="How mesh shaders are implemented in an AMD driver" /><published>2022-05-12T14:00:56+02:00</published><updated>2022-05-12T14:00:56+02:00</updated><id>https://timur.hu/blog/2022/how-mesh-shaders-are-implemented</id><content type="html" xml:base="https://timur.hu/blog/2022/how-mesh-shaders-are-implemented"><![CDATA[<p>In <a href="/blog/2022/mesh-and-task-shaders">the previous post</a> I gave a brief introduction on what mesh
and task shaders are from the perspective of application developers. Now it’s time to
dive deeper and talk about how mesh shaders are implemented in a Vulkan driver on AMD HW.
Note that everything I discuss here is based on my experience and understanding as
I was working on mesh shader support in RADV and is already available as public information in
open source driver code.
The goal of this blog post is to elaborate on how mesh shaders are implemented
on the NGG hardware in AMD RDNA2 GPUs, and to show how these details
affect shader performance. Hopefully, this helps the reader better understand how the
concepts in the API are translated to the HW and what pitfalls to avoid to get good perf.</p>

<h2 id="short-intro-to-ngg">Short intro to NGG</h2>

<p>NGG (Next Generation Geometry) is the technology responsible for
all vertex and geometry processing in RDNA GPUs (with some <em>caveats</em>).
NGG shaders are also known as “primitive shaders”, and the main innovations of NGG are:</p>

<ul>
  <li>Shaders are aware of not only vertices, but also primitives
(this is why they are called primitive shaders).</li>
  <li>The output topology is entirely up to the shader, meaning that it can create
output vertices and primitives with an arbitrary topology regardless of its input.</li>
  <li>On RDNA2 and newer, per-primitive output attributes are also supported.</li>
</ul>

<p>This flexibility allows the driver to implement every vertex/geometry processing stage using NGG.
Vertex, tess eval and geometry shaders can all be compiled to NGG “primitive shaders”.
The only major limiting factor is that each thread (SIMD lane) can only output
up to 1 vertex and 1 primitive (with <em>caveats</em>).</p>

<p>The driver is also capable of extending the application shaders with sweet stuff such as
per-triangle culling, but this is not the main focus of this blog post.
I also won’t cover the <em>caveats</em> here, but I may write more about NGG in the future.</p>

<h2 id="mapping-the-mesh-shader-api-to-ngg">Mapping the mesh shader API to NGG</h2>

<p>The draw commands as executed on the GPU only understand a number of input vertices
but the mesh shader API draw calls specify a number of workgroups instead.
To make it work, we configure the shader such that the number of input
vertices per workgroup is 1, and the output is set to what you passed into the API.
This way, the FW can figure out how many workgroups it really needs to launch.</p>

<p>The driver has to accommodate the HW limitation above, so we must ensure that in the compiled shader,
each thread only outputs up to 1 vertex and 1 primitive.
Reminder: the API programming model allows any shader invocation to write any vertex and/or primitive.
So, there is a fundamental <strong>mismatch</strong> between the programming model and what the HW can do.</p>

<p>This raises a few interesting questions.</p>

<p><strong>How do we allow any thread to write any vertex/primitive?</strong>
The driver allocates some LDS (shared memory) space, and writes all mesh shader outputs there.
At the very end of the shader, each thread reads the attributes of the vertex and primitive that matches the thread ID
and outputs that single vertex and primitive. This roundtrip to the LDS can be omitted if an output is only written
by the thread with matching thread ID.</p>

<p><strong>What if the MS workgroup size is less than the max number of output vertices or primitives?</strong>
Each HW thread can create up to 1 vertex and 1 primitive.
The driver has to set the real workgroup size accordingly:<br />
<code class="language-plaintext highlighter-rouge">hw workgroup size = max(api workgroup size, max vertex count, max primitive count)</code> <br />
The result is that the HW will get a workgroup that has some threads (invocations) that
execute the code you wrote (the “API shader”), and then some that won’t do anything
but wait until the very end to output their up to 1 vertex and 1 primitive.
It can result in poor occupancy (low HW utilization = bad performance).</p>
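<p>As a hypothetical example of how that formula plays out (the numbers are made up for illustration):</p>

```glsl
// Hypothetical mesh shader interface declarations:
layout(local_size_x = 32) in;                                // API workgroup size: 32
layout(triangles, max_vertices = 64, max_primitives = 64) out;

// hw workgroup size = max(32, 64, 64) = 64 threads.
// Half of each HW workgroup never runs the API shader; those
// threads only wake up at the very end to emit their single
// vertex and primitive.
```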

<p><strong>What if the shader also has barriers in it?</strong>
This is now turning into a headache.
The driver has to ensure that the threads that “do nothing” also execute an equal amount
of barriers as those that run your API shader. If the HW workgroup has the same number of
waves as the API shader, this is trivial. Otherwise, we have to emit some extra code
that keeps the extra waves running in a loop executing barriers. This is the worst.</p>

<p><strong>What if the API shader also uses shared memory, or not all outputs fit in the LDS?</strong>
The D3D12 spec requires the driver to have at least 28K shared memory (LDS) available to the shader.
However, graphics shaders can only access up to 32K LDS. How do we make this work, considering the
above fact that the driver has to write mesh shader outputs to LDS?
This is getting really ugly now, but in that case, the driver is forced to write MS outputs to VRAM
instead of LDS.</p>

<p><strong>How do you deal with the compute-like stuff, e.g. workgroup ID, subgroup ID, etc.?</strong>
Fortunately, most of these were already available to the shader, just not exposed
in the traditional VS, TES, GS programming model. The only pain point is the workgroup ID
which needs trickery. I already mentioned above that the HW is tricked into thinking that
each MS workgroup has 1 input vertex. So we can just use the same register that contains
the vertex ID for getting the workgroup ID.</p>

<h2 id="conclusion-performance-considerations">Conclusion, performance considerations</h2>

<p>The above implementation details can be turned into performance recommendations.</p>

<p><strong>Specify a MS workgroup size that matches the maximum amount of vertices and primitives.</strong>
Also, distribute the work among the full workgroup rather than leaving some
threads doing nothing. If you do this, you ensure that the hardware is optimally utilized.
<em>This is the most important recommendation here today.</em></p>

<p><strong>Try to only write to the mesh output array indices from the corresponding thread.</strong>
If you do this, you hit an optimal code path in the driver, so it won’t have to
write those outputs to LDS and read them back at the end.</p>
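<p>For example, a small GLSL sketch of this fast path (the <code class="language-plaintext highlighter-rouge">positions</code> array is a made-up stand-in for however your shader computes its vertices):</p>

```glsl
uint i = gl_LocalInvocationIndex;

// Good: each invocation writes only the output index that matches
// its own thread ID, so the driver can keep this out of LDS.
gl_MeshVerticesEXT[i].gl_Position = positions[i];

// Bad: writing another thread's output index forces the driver
// to spill outputs to LDS and read them back at the end:
// gl_MeshVerticesEXT[(i + 1u) % 64u].gl_Position = positions[i];
```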

<p><strong>Use shared memory, but not excessively.</strong>
Implementing any nice algorithm in your mesh shader will likely need you to share
data between threads. Don’t be afraid to use shared memory, but prefer to use
subgroup functionality instead when possible.</p>
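<p>For example, for a simple reduction (a hypothetical fragment; <code class="language-plaintext highlighter-rouge">perThreadValue</code> stands in for whatever each invocation computes, and it assumes subgroup arithmetic support):</p>

```glsl
#extension GL_KHR_shader_subgroup_arithmetic : require

// Prefer the subgroup op: it needs no LDS space, no LDS
// bandwidth and no workgroup barrier...
uint waveTotal = subgroupAdd(perThreadValue);

// ...over a manual shared-memory reduction along the lines of:
// shared uint sum; ... atomicAdd(sum, perThreadValue); barrier(); ...
```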

<h3 id="what-if-you-dont-want-do-any-of-the-above">What if you don’t want to do any of the above?</h3>

<p>That is perfectly fine. Don’t use mesh shaders then.</p>

<p>The main takeaway about mesh shading is that it’s a very low level tool.
The driver can implement the full programming model, but it
can’t hold your hand as well as it could for traditional
vertex processing.
You may have to implement things (e.g. vertex inputs, culling, etc.)
that previously the driver would do for you. Essentially, if you write a mesh shader you
are trying to beat the driver at its own game.</p>

<h3 id="wait-arent-we-forgetting-something">Wait, aren’t we forgetting something?</h3>

<p>I think this post is already dense enough with technical detail.
Brace yourself for <a href="/blog/2022/how-task-shaders-are-implemented">the next post</a>,
where I’m going to blow your mind even more and talk about
how task shaders are implemented.</p>]]></content><author><name>Timur Kristóf</name></author><category term="graphics" /><category term="mesh" /><category term="freedesktop" /><category term="work" /><summary type="html"><![CDATA[In the previous post I gave a brief introduction on what mesh and task shaders are from the perspective of application developers. Now it’s time to dive deeper and talk about how mesh shaders are implemented in a Vulkan driver on AMD HW. Note that everything I discuss here is based on my experience and understanding as I was working on mesh shader support in RADV and is already available as public information in open source driver code. The goal of this blog post is to elaborate on how mesh shaders are implemented on the NGG hardware in AMD RDNA2 GPUs, and to show how these details affect shader performance. Hopefully, this helps the reader better understand how the concepts in the API are translated to the HW and what pitfalls to avoid to get good perf.]]></summary></entry><entry><title type="html">Welcome to my blog!</title><link href="https://timur.hu/blog/2022/welcome-to-my-blog" rel="alternate" type="text/html" title="Welcome to my blog!" /><published>2022-05-09T14:32:56+02:00</published><updated>2022-05-09T14:32:56+02:00</updated><id>https://timur.hu/blog/2022/welcome-to-my-blog</id><content type="html" xml:base="https://timur.hu/blog/2022/welcome-to-my-blog"><![CDATA[<p>I’ve wanted to do this for a long time, but somehow never got around to it.
The main inspiration for this blog are the blogs of my colleagues <a href="https://www.supergoodcode.com/">Mike</a>,
<a href="https://basnieuwenhuizen.nl/">Bas</a> and <a href="https://mupuf.org/">Martin</a> who are all
writing exciting stuff about their work.</p>]]></content><author><name>Timur Kristóf</name></author><summary type="html"><![CDATA[I’ve wanted to do this for a long time, but somehow never got around to it. The main inspiration for this blog are the blogs of my colleagues Mike, Bas and Martin who are all writing exciting stuff about their work.]]></summary></entry><entry><title type="html">Mesh and task shaders intro and basics</title><link href="https://timur.hu/blog/2022/mesh-and-task-shaders" rel="alternate" type="text/html" title="Mesh and task shaders intro and basics" /><published>2022-05-09T14:32:56+02:00</published><updated>2022-05-09T14:32:56+02:00</updated><id>https://timur.hu/blog/2022/mesh-and-task-shaders</id><content type="html" xml:base="https://timur.hu/blog/2022/mesh-and-task-shaders"><![CDATA[<p>Mesh and task shaders (amplification shaders in D3D jargon) are a new way to process geometry in 3D applications.
First proposed by NVidia in 2018 and <a href="https://developer.nvidia.com/blog/introduction-turing-mesh-shaders/">initially available in the “Turing” series</a>, they are now supported on RDNA2 GPUs and are
part of the D3D12 API. There is also an extension in Vulkan (and a vendor-specific one in OpenGL). This post is about
what mesh shading is, and next time I’m going to talk about how <a href="/blog/2022/how-mesh-shaders-are-implemented">mesh</a>/<a href="/blog/2022/how-task-shaders-are-implemented">task</a> shaders are implemented on the driver side.</p>

<h2 id="problems-with-the-old-geometry-pipeline">Problems with the old geometry pipeline</h2>

<p>The problem with the traditional vertex processing pipeline is that it is mainly designed assuming several
fixed-function hardware units in the GPU and offers very little flexibility for the user to customize it.
The main issues with the traditional pipeline are:</p>

<ul>
  <li>Vertex buffers and vertex shader inputs are annoying (especially from the driver’s perspective)
and input assembly may be a bottleneck on some HW in some cases.</li>
  <li>The user has no control over how the input vertices and primitives are arranged, so the vertex
shader may uselessly run for primitives that are invisible (e.g. occluded, backfacing, etc.),
meaning that compute resources are wasted on things that don’t actually produce any pixels.</li>
  <li>Geometry amplification depends on fixed function tessellation HW and offers poor customizability.
(Or GS, which is even worse.)</li>
  <li>The programming model offers very little control over input and output primitives.
Geometry shaders have a horrible programming model that results in low HW occupancy and limited
topologies.</li>
</ul>

<p>The mesh shading pipeline is a graphics pipeline that addresses these issues by
replacing the entire traditional vertex processing pipeline with two new stages: the task and mesh shader.</p>

<h2 id="what-is-a-mesh-shader">What is a mesh shader?</h2>

<p>The mesh shader is a compute-like stage which allows the application to fully customize its inputs and outputs,
including output primitives and their topology.</p>

<ul>
  <li><strong>Mesh shader vs. Vertex shader</strong>: A mesh shader is responsible for <strong>creating</strong> its
output vertices and primitives. In comparison, a vertex shader can only load a fixed number
of vertices and do some processing on them, and it has no awareness of primitives.</li>
  <li><strong>Mesh shader vs. Geometry shader</strong>: As opposed to a geometry shader which can only use a fixed “strip” output topology,
a mesh shader is free to define whatever topology it wants. You can think about it
as if the mesh shader produces a small indexed triangle list.</li>
</ul>

<h3 id="what-does-it-mean-that-the-mesh-shader-is-compute-like">What does it mean that the mesh shader is compute-like?</h3>

<p>You can use all the sweet stuff that compute shaders can already do but vertex shaders couldn’t, for example:
use shared memory, run in workgroups, rely on workgroup ID, subgroup ID, etc.</p>

<p>The API allows any mesh shader invocation to write to any vertex or primitive.
The invocations in each mesh shader workgroup are meant to co-operatively produce a small set of output vertices and primitives (this is sometimes called a “meshlet”). All workgroups together create the full output geometry (the “mesh”).</p>

<h3 id="what-does-the-mesh-shader-do">What does the mesh shader do?</h3>

<p>First, it needs to figure out how many vertices and primitives it wants to create and
pass those numbers to <code class="language-plaintext highlighter-rouge">SetMeshOutputsEXT</code> (<code class="language-plaintext highlighter-rouge">SetMeshOutputCounts</code> in D3D12),
then it can write to its output arrays. <strong>How</strong> it does this, is entirely up to the application
developer. Though, there are some performance recommendations which you should follow to make it go fast.
I’m going to talk about these more in <a href="/blog/2022/how-mesh-shaders-are-implemented">the next post</a>.</p>
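<p>As a minimal sketch of this flow, here is a mesh shader that emits a single hardcoded triangle (using the <code class="language-plaintext highlighter-rouge">GL_EXT_mesh_shader</code> GLSL syntax; the coordinates are made up):</p>

```glsl
#version 460
#extension GL_EXT_mesh_shader : require

layout(local_size_x = 1) in;
layout(triangles, max_vertices = 3, max_primitives = 1) out;

void main()
{
    // First, declare how many vertices and primitives this workgroup emits.
    SetMeshOutputsEXT(3, 1);

    // Then fill the output arrays however you like.
    gl_MeshVerticesEXT[0].gl_Position = vec4(-0.5, -0.5, 0.0, 1.0);
    gl_MeshVerticesEXT[1].gl_Position = vec4( 0.5, -0.5, 0.0, 1.0);
    gl_MeshVerticesEXT[2].gl_Position = vec4( 0.0,  0.5, 0.0, 1.0);

    // The "small indexed triangle list": one triangle from vertices 0, 1, 2.
    gl_PrimitiveTriangleIndicesEXT[0] = uvec3(0, 1, 2);
}
```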

<p>The input assembler step is entirely eliminated, which means that the application is now in full control
of how (or if at all) the input vertex data is fetched, meaning that you can save bandwidth on things that
don’t need to be loaded, etc. There is no “input” in the traditional sense, but you can rely on things
like push constants, UBO, SSBO etc.</p>

<p>For example, a mesh shader could perform per-triangle culling in such a manner that it wouldn’t need to
load data for primitives that are culled, therefore saving bandwidth.</p>

<h2 id="what-is-a-task-shader-aka-amplification-shader">What is a task shader aka. amplification shader?</h2>

<p>The task shader is an optional stage which precedes the mesh shader (and is also compute-like).
Each task shader workgroup has two main purposes:</p>

<ul>
  <li>Decide how many mesh shader workgroups need to be launched.</li>
  <li>Create an optional “task payload” which can be passed to mesh shaders.</li>
</ul>

<p>“Geometry amplification” is achieved by choosing to launch more (or fewer) mesh shader workgroups.
As opposed to the fixed-function tessellator in the traditional pipeline, it is now entirely up to
the application how to create the vertices.</p>

<p>While you could re-implement the old fixed-function tessellation with mesh shading,
this may not actually be necessary and your application may work better with a simpler custom algorithm.</p>

<p>Another interesting use case for task shaders is per-meshlet culling, meaning that a task shader
is a good place to decide which meshlets you actually want to render and eliminate entire mesh shader
workgroups which would otherwise operate on invisible primitives.</p>
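<p>Putting the two purposes together, here is a hypothetical GLSL sketch of per-meshlet culling in a task shader. The <code class="language-plaintext highlighter-rouge">meshletVisible</code> helper and the payload layout are made up, and it assumes the workgroup maps to a single subgroup (e.g. Wave32):</p>

```glsl
#version 460
#extension GL_EXT_mesh_shader : require
#extension GL_KHR_shader_subgroup_ballot : require

layout(local_size_x = 32) in;

// Hypothetical payload: indices of the meshlets that survived culling.
struct Payload {
    uint meshletIndices[32];
};
taskPayloadSharedEXT Payload payload;

// Placeholder for an application-specific visibility test,
// e.g. frustum or normal-cone culling against meshlet bounds.
bool meshletVisible(uint index)
{
    return true;
}

void main()
{
    uint meshletIndex = gl_GlobalInvocationID.x;
    bool visible = meshletVisible(meshletIndex);

    // Compact the surviving meshlet indices to the front of the payload.
    uvec4 ballot = subgroupBallot(visible);
    if (visible) {
        payload.meshletIndices[subgroupBallotExclusiveBitCount(ballot)] = meshletIndex;
    }

    // Launch exactly one mesh shader workgroup per surviving meshlet;
    // invisible meshlets cost no mesh shader work at all.
    EmitMeshTasksEXT(subgroupBallotBitCount(ballot), 1, 1);
}
```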

<ul>
  <li><strong>Task shader vs. Tessellation.</strong>
Tessellation relies on fixed-function hardware and makes
users do complicated shader I/O gymnastics with tess control shaders.
The task shader is straightforward and only as complicated as you want it to be.</li>
  <li><strong>Task shader vs. Geometry shader.</strong>
Geometry shaders operate on input primitives directly and replace them with “strip”
primitives.
Task shaders don’t get directly involved with the geometry output; they just
let you specify how many mesh shader workgroups to launch and let the mesh shader deal with
the nitty-gritty details.</li>
</ul>

<h2 id="usefulness">Usefulness</h2>

<p>For now, I’ll just discuss a few basic use cases.</p>

<p><strong>Meshlets.</strong>
If your application loads input vertex data from somewhere, it is recommended that you
subdivide that data (the “mesh”) into smaller chunks called “meshlets”, preferably
at asset creation time.
Then you can write your shaders such that
each mesh shader workgroup processes a single meshlet.</p>

<p><strong>Procedural geometry.</strong>
Your application can generate all its vertices and primitives
based on a mathematical formula that is implemented in the shader. In this case, you don’t need to load
any inputs, just implement your formula as if you were writing a compute shader, then store the results
into the mesh shader output arrays.</p>

<p><strong>Replacing compute pre-passes.</strong>
Many modern games use a compute pre-pass. They launch some compute shaders that do some pre-processing on
the geometry before the graphics work. These are no longer necessary. The compute work can be made part of
either the task or mesh shader, which removes the overhead of the additional dispatch and barrier.</p>

<p>Note that mesh shader workgroups may be launched as soon as the corresponding task shader workgroup is
finished, so mesh shader execution (of the already finished tasks) may overlap with task shader execution,
removing the need for extra synchronization on the application side.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Thus far, I’ve sold you on how awesome and flexible mesh shading is, so it’s time to ask
the million dollar question.</p>

<h3 id="is-mesh-shading-for-you">Is mesh shading for you?</h3>

<p>The answer, as always, is: <strong>It depends.</strong></p>

<p>Yes, mesh and task shaders do give you a lot of opportunities to implement things just the way
you like them without the stupid hardware getting in your way,
but as with any low-level tool, <em>this also means that you get a lot of possibilities
for shooting yourself in the foot</em>.</p>

<p>The traditional vertex processing pipeline has been around for so long that on most hardware
it’s extremely well optimized because the drivers do a lot of optimization work for you.
Therefore, just because an app uses mesh shaders doesn’t automatically mean that it’s going to
be faster or better in any way. It’s only worth it if you are willing to do it <strong>well</strong>.</p>

<p>That being said, perhaps the easiest way to start experimenting with mesh shaders is
to rewrite parts of your application that used to use geometry shaders.
Geometry shaders are so horribly inefficient that it’ll be difficult to write a worse mesh shader.</p>

<h3 id="how-is-mesh-shading-implemented-under-the-hood">How is mesh shading implemented under the hood?</h3>

<p>Stay tuned for the next posts if you are curious about that!
I’m going to talk about how <a href="/blog/2022/how-mesh-shaders-are-implemented">mesh</a> and <a href="/blog/2022/how-task-shaders-are-implemented">task</a> shaders are implemented in a driver.
This will shed some light on how these shaders work internally and why certain things perform really badly.</p>

<h3 id="references">References</h3>

<ul>
  <li><a href="https://developer.nvidia.com/blog/introduction-turing-mesh-shaders/">Introduction to Turing Mesh Shaders</a></li>
  <li><a href="https://developer.nvidia.com/blog/using-mesh-shaders-for-professional-graphics/">Using Mesh Shaders for Professional Graphics</a></li>
  <li><a href="https://developer.nvidia.com/blog/advanced-api-performance-mesh-shaders/">Advanced API Performance: Mesh Shaders</a></li>
  <li><a href="https://microsoft.github.io/DirectX-Specs/d3d/MeshShader.html">DirectX 12 mesh shader API</a></li>
  <li><a href="https://www.khronos.org/blog/mesh-shading-for-vulkan">Khronos blog: Mesh shading for Vulkan</a></li>
</ul>

<p><em>Note, 2022-09-03: updated with a link to the new vendor-neutral Vulkan mesh shader extension.</em></p>]]></content><author><name>Timur Kristóf</name></author><category term="graphics" /><category term="mesh" /><category term="freedesktop" /><category term="work" /><summary type="html"><![CDATA[Mesh and task shaders (amplification shaders in D3D jargon) are a new way to process geometry in 3D applications. First proposed by NVidia in 2018 and initially available in the “Turing” series, they are now supported on RDNA2 GPUs and are part of the D3D12 API. There is also an extension in Vulkan (and a vendor-specific one in OpenGL). This post is about what mesh shading is and next time I’m going to talk about how mesh/task shaders are implemented on the driver side.]]></summary></entry></feed>