Rabbit holes in shader linking
In the previous post I gave an introduction to shader linking. Mike has already blogged about this topic a while ago, focusing mostly on Zink. Now it's time for me to share some of my adventures with it too, focusing on how we improved it in RADV and the various rabbit holes this work has led me to.
Motivation for improving shader linking in RADV
In Mesa, we mainly represent shaders in NIR (Mesa's intermediate representation for shaders), and that is where link-time optimizations happen.
The big news is that Marek Olšák wrote a new pass called `nir_opt_varyings`, which is an all-in-one solution intended to replace all of the previous linking passes. It can do all of the following:
- Remove unused outputs and undefined inputs (including "system values")
- Code motion between stages
- Compaction of I/O space
Authors of various drivers are now rushing to take advantage of this new code, and we couldn't miss the opportunity to start using `nir_opt_varyings` in RADV too, so that's what I've been working on for the past several weeks. I started by adding a call to `nir_opt_varyings` and went from there.
Enabling the new linking optimization in RADV
The naive reader might think using the new pass is as simple as going to `radv_link_shaders` and calling `nir_opt_varyings` there. But it can never be that easy, can it?
The issue is that shader linking can't be dealt with in isolation. We also need to get our hands dirty with all the details of how shader I/O works on a very low level.
I/O variable dereference vs. lowered explicit I/O
The first problem is that RADV's current linking solution, `radv_link_shaders`, works with shaders whose I/O variables are still intact, meaning that I/O is still represented as variable dereferences. However, `nir_opt_varyings` expects to work with lowered explicit I/O. In fact, all of RADV's linking code is based on I/O variables and dereferences, so much so that refactoring all of it would be too much work (and such a refactor would probably come with its own set of problems and rabbit holes). So the solution here is to add a new linking step that runs after `nir_lower_io` and call the new pass there.
After writing the above, I quickly discovered that some tests crash, others fail, and most applications render incorrectly. So I set out on a journey to solve all that.
The rabbit holes
Shader I/O information collection
Like every driver, RADV needs to collect certain information about every shader in order to program the GPU's registers correctly before a draw. This information includes the number of inputs / outputs (and which slots are used), in order to determine how much LDS needs to be allocated (in the case of tessellation shaders), how many FS inputs are needed, etc.
This is done by `radv_nir_shader_info_pass`, which also operated on I/O variables rather than on information from I/O instructions. However, after `nir_lower_io`, the explicit I/O instructions may no longer be in sync with the original I/O variables. This wasn't a problem before, because we didn't do any optimizations on explicit I/O, so we could rely on the I/O variable information being accurate. However, in order to use an optimization based on explicit I/O, the RADV shader info pass had to be refactored to collect its information from explicit I/O intrinsics; otherwise we wouldn't have up-to-date information after running `nir_opt_varyings`, resulting in wrong register programming.
- Part 1 of refactoring RADV to use IO semantics, covering everything but FS
- Part 2 of refactoring RADV to use IO semantics, covering FS and deleting IO variables afterwards
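To illustrate the idea (with hypothetical, heavily simplified data structures, not the actual NIR or RADV API), collecting slot usage from explicit I/O intrinsics boils down to walking the instructions and accumulating a bitmask:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for a lowered
 * store_output intrinsic: only the field an info pass would read. */
struct io_intrinsic {
   unsigned location; /* I/O slot index (0..63) */
};

/* Gather which output slots a shader writes by walking its explicit
 * I/O intrinsics, instead of trusting a (possibly stale) list of
 * I/O variables. */
static uint64_t
gather_output_mask(const struct io_intrinsic *instrs, unsigned count)
{
   uint64_t mask = 0;
   for (unsigned i = 0; i < count; i++)
      mask |= UINT64_C(1) << instrs[i].location;
   return mask;
}

/* Number of distinct slots used, e.g. for sizing LDS or FS inputs. */
static unsigned
count_slots(uint64_t mask)
{
   return (unsigned)__builtin_popcountll(mask);
}
```

The real pass of course walks NIR instructions and collects much more state; the point is only that the source of truth becomes the instructions themselves.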
Dealing with driver locations
Driver location assignment is how the driver decides which input and output goes to which “slot” or “address”. RADV did this after linking, but still based on I/O variables, so the mechanism needed to be re-thought.
It also came to my attention that there are some plans to deprecate the concept of driver locations in NIR in favour of so-called I/O semantics. So I had to do my refactor with this in mind; I spent some effort on removing our uses of driver locations in order to make the new code somewhat future-proof.
For most stage combinations, `nir_recompute_io_bases` can be used as a stopgap to simply reassign the driver locations based on the assumption that a shader will only write the outputs that the next stage reads. However, this is somewhat difficult to achieve for tessellation shaders because of their unique "brain puzzle": the TCS can read its own outputs, so the compiler can't simply remove TCS outputs when the TES doesn't read them.
- Removed driver locations for trivial cases
- Refactored tessellation I/O to be independent of driver locations (based on all the other work listed below on tessellation shader I/O)
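As a rough sketch of the idea behind reassigning driver locations (not the actual `nir_recompute_io_bases` implementation; the helper name here is made up), compaction amounts to mapping each used slot to the number of used slots below it:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of driver location reassignment: given the mask of slots
 * that are actually used (i.e. written by the producer and read by
 * the consumer), map each used slot to a consecutive index.
 * Returns the new driver location of `slot`, or -1 if it is unused. */
static int
remap_driver_location(uint64_t used_slots_mask, unsigned slot)
{
   if (!(used_slots_mask & (UINT64_C(1) << slot)))
      return -1;
   /* The new location is the number of used slots below this one. */
   return __builtin_popcountll(used_slots_mask &
                               ((UINT64_C(1) << slot) - 1));
}
```

This is exactly the assumption that breaks down for tessellation: a TCS output the TES never reads may still be used by the TCS itself, so it can't simply drop out of the mask.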
Tessellation control shader epilogs
Due to the unique brain puzzle that is tessellation shader I/O, shader linking between TCS and TES was implemented all over the place; even our backend compiler, ACO, depended on some TCS linking information, which made any kind of refactor difficult.
At the time, `VK_EXT_shader_object` was new, and our implementation used so-called shader epilogs to deal with the dynamic states of the TCS (as did RadeonSI for OpenGL), which is what ACO needed the linking information for.
After a discussion with the team, we decided that TCS epilogs had to go; not only because of my shader linking effort, but also to make the code base saner and more maintainable.
- I changed the code to pass tess factors in registers
- Then I entirely deleted TCS epilogs from RADV
- Finally I also deleted them from RadeonSI, allowing it to be ultimately removed from ACO as well
This effort made our code lighter by about 1200 LOC.
Tessellation shader I/O
On AMD hardware, TCS outputs are implemented using LDS (when the TCS reads them) and VRAM (when the TES reads them), which means that the driver has two different ways to store these variables depending on their use. However, since the code was based on `driver_location`, and there can only be one driver location, we effectively used the same location for both LDS and VRAM, which was suboptimal.
With the TCS epilog out of the way, `ac_nir_lower_tess_io_to_mem` is now free to choose the LDS layout, because the drivers no longer need to generate a TCS epilog that would have to make assumptions about the memory layout.
- Changed the LDS location to be independent of the VRAM location, making it more efficient
- Also removed an extra dword of unused LDS when VS outputs don’t need it
- Refactored the SGPR arg that contains the dynamic state
- With all of that, I found an inefficiency in an ACO optimization, so that had to be fixed too
- Fix of an unintended regression
- In the meantime, Samuel has found an opportunity to share some code between RADV and RadeonSI
- After all the above was merged and tessellation I/O was independent from driver locations, I also made another MR to undo a small stats regression by using a more optimal VRAM layout as well.
- Fixing another regression
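To make the LDS-vs-VRAM distinction concrete, here is a toy model (not the real `ac_nir_lower_tess_io_to_mem` logic, and assuming one vec4, i.e. 16 bytes, per slot): each output slot gets independent offsets in each storage, compacted against the mask of slots that actually need that kind of storage:

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_SIZE 16 /* assume one vec4 per slot, in bytes */

/* Byte offset of `slot` within a storage (LDS or VRAM), compacted
 * against the mask of slots that need that storage. Returns -1 if
 * this storage isn't needed for the slot at all. */
static int
slot_offset(uint64_t needed_mask, unsigned slot)
{
   if (!(needed_mask & (UINT64_C(1) << slot)))
      return -1;
   return __builtin_popcountll(needed_mask &
                               ((UINT64_C(1) << slot) - 1)) * SLOT_SIZE;
}
```

With one shared driver location, both layouts had to waste space on slots only the other storage needed; with two independent masks, a TCS-read-only output consumes LDS but no VRAM, and vice versa.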
Packed 16-bit I/O
One of the innovations of nir_opt_varyings
is that it can pack two 16-bit inputs and
outputs together into a single 32-bit slot in order to save I/O space.
However, unfortunately RADV didn’t really work at all with packed 16-bit I/O.
Practically, this meant that every test case using 16-bit I/O failed.
I considered to disable 16-bit packing in nir_opt_varyings
but eventually
I decided to just implement it in RADV properly instead.
- Packed 16-bit mesh shader outputs
- Packed 16-bit FS inputs
- Packed 16-bit pre-rasterization outputs
- Packed 16-bit tessellation and GS I/O
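The packing itself is just bit manipulation; here is a minimal sketch of what sharing one 32-bit slot between two 16-bit varyings looks like (the helper names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Two 16-bit varyings share one 32-bit slot: one lives in the low
 * half, the other in the high half. The producer packs them, and the
 * consumer has to extract the correct half. */
static uint32_t
pack_2x16(uint16_t lo, uint16_t hi)
{
   return (uint32_t)lo | ((uint32_t)hi << 16);
}

static uint16_t
unpack_lo(uint32_t packed)
{
   return (uint16_t)(packed & 0xffff);
}

static uint16_t
unpack_hi(uint32_t packed)
{
   return (uint16_t)(packed >> 16);
}
```

The hard part in the driver isn't the bit shuffling itself, but making every stage's I/O lowering and register programming agree on which half of which slot each 16-bit value occupies.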
Repetitiveness in AMD common code
While writing patches to handle packed 16-bit I/O, we’ve taken note of how repetitive the code was; basically the same thing was implemented several times with subtle differences. Of course, this had to be dealt with.
Mesh shading per-primitive I/O
Mesh shading pipelines have so-called per-primitive I/O, which needs special handling. For example, it is wrong to pack per-primitive and per-vertex inputs or outputs into the same slot. Because OpenGL doesn't have per-primitive I/O, this was left unsolved and needed to be fixed in `nir_opt_varyings` before RADV could use it.
- `nir_recompute_io_bases` needed to learn about per-primitive I/O
- `nir_opt_varyings` itself needed to learn about per-primitive I/O too
Updating load/store alignments
Using nir_opt_varyings
had a slight regression in shader instruction counts
due to inter-stage code motion.
Essentially, it moved some instructions into the previous stage, which prevented
nir_opt_load_store_vectorize
to deduce the alignment of some memory instructions,
resulting in worse vectorization.
The solution was to add a new pass based on the code already written for
nir_opt_load_store_vectorize
that would just update the alignments of each
memory access called nir_opt_load_store_update_alignments
, and run that pass before
nir_opt_varyings
, thereby preserving the aligment info before it is lost.
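For intuition, the alignment that can be deduced for an access is bounded by its base alignment and by the largest power of two dividing its constant byte offset. A minimal standalone sketch of that calculation (not the actual pass, which works on NIR intrinsics):

```c
#include <assert.h>
#include <stdint.h>

/* Deduce the guaranteed alignment of an access at base + const_offset,
 * where the base pointer is known to be base_align-aligned
 * (base_align must be a power of two). */
static uint32_t
deduce_align(uint32_t base_align, uint32_t const_offset)
{
   if (const_offset == 0)
      return base_align;
   /* Largest power of two dividing the offset: isolate lowest set bit. */
   uint32_t off_align = const_offset & (0u - const_offset);
   return off_align < base_align ? off_align : base_align;
}
```

Once an instruction moves to another stage, the constant offset (and thus this deduction) may no longer be visible, which is why recording the result up front helps the vectorizer.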
Promoting FS inputs to FLAT
`FLAT` fragment shader inputs are generally better because they require no interpolation and allow packing 32-bit and 16-bit inputs together, so `nir_opt_varyings` takes special care to promote interpolated inputs to `FLAT` when possible.
However, there was a regression caused by this mechanism unintentionally creating more inputs than there were before.
Dealing with `mediump` I/O
The `mediump` qualifier effectively means that the application allows the driver to use either 16-bit or 32-bit precision for a variable, whichever it deems more optimal for a specific shader.
It turned out that RADV didn't deal with `mediump` I/O at all. This was fine, because it just meant such I/O got treated as normal 32-bit I/O, but it became a problem when it turned out that `nir_opt_varyings` is unaware of `mediump` and mixed it up with other inputs, which confused the vectorizer.
Side note: I am not quite done with `mediump` yet. In the future I plan to lower it to 16-bit precision, but only when I can make sure it doesn't result in more inputs.
Explicitly interpolated (per-vertex) FS inputs
Vulkan has some functionality that allow applications to do custom FS input interpolation
in shader code, which from the driver perspective means that each FS invocation needs to
access the output of each vertex. (This requires special register programming from RADV
on the FS inputs.) This is something that also needed to be added to nir_opt_varyings
before RADV could use it.
Scalarization and re-vectorization of I/O components
The `nir_opt_varyings` pass only works on so-called scalarized I/O, meaning that it can only deal with instructions that write a single output component or read a single input component. Fortunately, there is already a handy `nir_lower_io_to_scalar` pass which we can use.
The downside is that scalarized shader I/O is sub-optimal (in all stages on AMD HW except VS -> FS), because the I/O instructions are really memory accesses, which are simply more optimal when more components are accessed by the same instruction.
This is solved in two ways:
- Rhys has enhanced the excellent `nir_opt_load_store_vectorize` pass to better deal with lowered shader I/O, meaning that it can now better vectorize the memory access instructions that are generated by scalarized I/O.
- Marek has developed a new `nir_opt_vectorize_io` pass which can re-vectorize the I/O intrinsics (before they are lowered to memory access).
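As a toy illustration of re-vectorization (not the actual `nir_opt_vectorize_io` algorithm): given a mask of which components of a slot are written, the goal is to replace runs of scalar accesses with the widest contiguous vector accesses:

```c
#include <assert.h>
#include <stdint.h>

/* Split a 4-bit component write mask (bit i = component i is written)
 * into the widest contiguous vector stores. For example, mask 0b0111
 * becomes a single vec3 store instead of three scalar stores.
 * Returns the number of stores and fills widths[] with their sizes. */
static unsigned
revectorize(uint8_t comp_mask, unsigned widths[4])
{
   unsigned n = 0;
   for (unsigned c = 0; c < 4;) {
      if (!(comp_mask & (1u << c))) {
         c++;
         continue;
      }
      /* Count the run of consecutively written components. */
      unsigned w = 0;
      while (c + w < 4 && (comp_mask & (1u << (c + w))))
         w++;
      widths[n++] = w;
      c += w;
   }
   return n;
}
```

On AMD hardware these accesses are really memory operations, so fewer, wider accesses directly translate to fewer instructions.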
Shader I/O stats
One of the main questions with any kind of optimization is how to measure its effects objectively. This is a solved problem: we have shader stats for this, which contain information about various aspects of a shader, such as number of instructions, register use, etc. Except there were no stats about I/O, so these needed to be added.
These stats are useful to prove that all of this work actually improved things. Furthermore, they turned out useful for finding bugs in existing code as well.
Removing the old linking passes
Considering that `nir_opt_varyings` is supposed to be an all-in-one linking solution, a naive person like me would assume that once the driver has been refactored to use it, we can simply stop using all of the old linking passes. But…
It turns out that getting rid of any of the other passes seems to cause regressions in shader stats such as instruction count (which we don't want). Why is this?
Due to the order in which we call various NIR optimizations, it seems that we can't effectively take advantage of new optimization opportunities after `nir_opt_varyings`. This means that we either have to re-run all expensive optimizations once more after the new linking step, or we will have to reorder our optimizations in order to be able to remove the old linking passes.
Conclusion; future work
While we haven’t yet found the exit from all the rabbit holes we fell into, we made really good progress and I feel that all of our I/O code ended up better after this effort. Some work on RADV shader I/O (as of June 2024) remains:
- Completely remove the old linking passes
- Stop using driver locations (in the few places where they are still used)
- Lower `mediump` to 16-bit when beneficial
Thanks
I owe a big thank you to Marek for developing `nir_opt_varyings` in the first place and for helping me adopt it every step of the way.
Also thanks to Samuel, Rhys and Georg for the good conversations we had and for reviewing my patches.
Comments
The blog doesn't have comments, but feel free to reach out to me on IRC (Venemo on OFTC) or Discord (sunrise_sky) to discuss.