Rabbit holes in shader linking
In the previous post I gave an introduction to shader linking. Mike has already blogged about this topic a while ago, focusing mostly on Zink. Now it's time for me to share some of my adventures with it too, focusing on how we improved it in RADV and the various rabbit holes this work has led me to.
Motivation for improving shader linking in RADV
In Mesa, we mainly represent shaders in NIR (Mesa's intermediate representation for shaders), and that is where link-time optimizations happen.
The big news is that Marek Olšák wrote a new pass called `nir_opt_varyings`, which is an all-in-one solution intended to replace all of the previous linking passes. It can do all of the following:
- Remove unused outputs and undefined inputs (including "system values")
- Code motion between stages
- Compaction of I/O space
Authors of various drivers are now rushing to take advantage of this new code, and we couldn't miss the opportunity to start using `nir_opt_varyings` in RADV too, so that's what I've been working on for the past several weeks. I started by adding a call to `nir_opt_varyings` and went from there.
Enabling the new linking optimization in RADV
The naive reader might think using the new pass is as simple as going to `radv_link_shaders` and calling `nir_opt_varyings` there. But it can never be that easy, can it?
The issue is that shader linking can't be dealt with in isolation. We also need to get our hands dirty with all the details of how shader I/O works on a very low level.
I/O variable dereference vs. lowered explicit I/O
The first problem is that RADV's current linking solution, `radv_link_shaders`, works with shaders whose I/O variables are still intact, meaning that I/O is still represented as variable dereferences. However, `nir_opt_varyings` expects to work with lowered explicit I/O. In fact, all of RADV's linking code is based on I/O variables and dereferences, so much so that refactoring all of it would be too much work (and such a refactor would probably come with its own set of problems and rabbit holes). So the solution here is to add a new linking step that runs after `nir_lower_io` and call the new pass there.
After writing the above, I quickly discovered that some tests crash, others fail, and most applications render incorrectly. So I set out on a journey to solve all that.
The rabbit holes
Shader I/O information collection
Like every driver, RADV needs to collect certain information about every shader in order to program the GPU's registers correctly before a draw. This information includes the number of inputs / outputs (and which slots are used), in order to determine how much LDS needs to be allocated (in the case of tessellation shaders), how many FS inputs are needed, etc.
This is done by `radv_nir_shader_info_pass`, which also operated on I/O variables rather than on information from I/O instructions. However, after `nir_lower_io`, the explicit I/O instructions may no longer be in sync with the original I/O variables. This wasn't a problem before, because we didn't do any optimizations on explicit I/O, so we could rely on the I/O variable information being accurate. However, in order to use an optimization based on explicit I/O, the RADV shader info pass had to be refactored to collect its information from explicit I/O intrinsics; otherwise we wouldn't have up-to-date information after running `nir_opt_varyings`, resulting in wrong register programming.
- Part 1 of refactoring RADV to use IO semantics, covering everything but FS
- Part 2 of refactoring RADV to use IO semantics, covering FS and deleting IO variables afterwards
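To illustrate the idea (with hypothetical, heavily simplified data structures, not the actual NIR or RADV API), collecting slot usage from explicit I/O intrinsics boils down to walking the instructions and accumulating a bitmask:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for a lowered
 * store_output intrinsic: only the field an info pass would read. */
struct io_intrinsic {
   unsigned location; /* I/O slot index (0..63) */
};

/* Gather which output slots a shader writes by walking its explicit
 * I/O intrinsics, instead of trusting a (possibly stale) list of
 * I/O variables. */
static uint64_t
gather_output_mask(const struct io_intrinsic *instrs, unsigned count)
{
   uint64_t mask = 0;
   for (unsigned i = 0; i < count; i++)
      mask |= UINT64_C(1) << instrs[i].location;
   return mask;
}

/* Number of distinct slots used, e.g. for sizing LDS or FS inputs. */
static unsigned
count_slots(uint64_t mask)
{
   return (unsigned)__builtin_popcountll(mask);
}
```

The real pass of course walks NIR instructions and collects much more state; the point is only that the source of truth becomes the instructions themselves.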
Dealing with driver locations
Driver location assignment is how the driver decides which input and output goes to which “slot” or “address”. RADV did this after linking, but still based on I/O variables, so the mechanism needed to be re-thought.
It also came to my attention that there are some plans to deprecate the concept of driver locations in NIR in favour of so-called I/O semantics. So I had to do my refactor with this in mind; I spent some effort on removing our uses of driver locations in order to make the new code somewhat future-proof.
For most stage combinations, `nir_recompute_io_bases` can be used as a stopgap to simply reassign the driver locations based on the assumption that a shader will only write the outputs that the next stage reads. However, this is somewhat difficult to achieve for tessellation shaders because of their unique "brain puzzle": the TCS can read its own outputs, so the compiler can't simply remove TCS outputs when the TES doesn't read them.
- Removed driver locations for trivial cases
- Refactored tessellation I/O to be independent of driver locations (based on all the other work listed below on tessellation shader I/O)
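As a rough sketch of the idea behind reassigning driver locations (not the actual `nir_recompute_io_bases` implementation; the helper name here is made up), compaction amounts to mapping each used slot to the number of used slots below it:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of driver location reassignment: given the mask of slots
 * that are actually used (i.e. written by the producer and read by
 * the consumer), map each used slot to a consecutive index.
 * Returns the new driver location of `slot`, or -1 if it is unused. */
static int
remap_driver_location(uint64_t used_slots_mask, unsigned slot)
{
   if (!(used_slots_mask & (UINT64_C(1) << slot)))
      return -1;
   /* The new location is the number of used slots below this one. */
   return __builtin_popcountll(used_slots_mask &
                               ((UINT64_C(1) << slot) - 1));
}
```

This is exactly the assumption that breaks down for tessellation: a TCS output the TES never reads may still be used by the TCS itself, so it can't simply drop out of the mask.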
Tessellation control shader epilogs
Due to the unique brain puzzle that is tessellation shader I/O, shader linking between TCS and TES was implemented all over the place; even our backend compiler, ACO, depended on some TCS linking information, which made any kind of refactor difficult.
At the time, `VK_EXT_shader_object` was new, and our implementation used so-called shader epilogs to deal with the dynamic states of the TCS (as did RadeonSI for OpenGL), which is what ACO needed the linking information for.
After a discussion with the team, we decided that TCS epilogs had to go; not only because of my shader linking effort, but also to make the code base saner and more maintainable.
- I changed the code to pass tess factors in registers
- Then I entirely deleted TCS epilogs from RADV
- Finally I also deleted them from RadeonSI, allowing it to be ultimately removed from ACO as well
This effort made our code lighter by about 1200 LOC.
Tessellation shader I/O
On AMD hardware, TCS outputs are implemented using LDS (when the TCS reads them) and VRAM (when the TES reads them), which means that the driver has two different ways to store these variables depending on their use. However, since the code was based on `driver_location`, and there can only be one driver location, we effectively used the same location for both LDS and VRAM, which was suboptimal.
With the TCS epilog out of the way, `ac_nir_lower_tess_io_to_mem` is now free to choose the LDS layout, because the drivers no longer need to generate a TCS epilog that would have to make assumptions about the memory layout.
- Changed the LDS location to be independent of the VRAM location, making it more efficient
- Also removed an extra dword of unused LDS when VS outputs don’t need it
- Refactored the SGPR arg that contains the dynamic state
- With all of that, I found an inefficiency in an ACO optimization, so that had to be fixed too
- Fix of an unintended regression
- In the meantime, Samuel has found an opportunity to share some code between RADV and RadeonSI
- After all the above was merged and tessellation I/O was independent from driver locations, I also made another MR to undo a small stats regression by using a more optimal VRAM layout as well.
- Fixing another regression
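To make the LDS-vs-VRAM distinction concrete, here is a toy model (not the real `ac_nir_lower_tess_io_to_mem` logic, and assuming one vec4, i.e. 16 bytes, per slot): each output slot gets independent offsets in each storage, compacted against the mask of slots that actually need that kind of storage:

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_SIZE 16 /* assume one vec4 per slot, in bytes */

/* Byte offset of `slot` within a storage (LDS or VRAM), compacted
 * against the mask of slots that need that storage. Returns -1 if
 * this storage isn't needed for the slot at all. */
static int
slot_offset(uint64_t needed_mask, unsigned slot)
{
   if (!(needed_mask & (UINT64_C(1) << slot)))
      return -1;
   return __builtin_popcountll(needed_mask &
                               ((UINT64_C(1) << slot) - 1)) * SLOT_SIZE;
}
```

With one shared driver location, both layouts had to waste space on slots only the other storage needed; with two independent masks, a TCS-read-only output consumes LDS but no VRAM, and vice versa.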
Packed 16-bit I/O
One of the innovations of nir_opt_varyings
is that it can pack two 16-bit inputs and
outputs together into a single 32-bit slot in order to save I/O space.
However, unfortunately RADV didn’t really work at all with packed 16-bit I/O.
Practically, this meant that every test case using 16-bit I/O failed.
I considered to disable 16-bit packing in nir_opt_varyings
but eventually
I decided to just implement it in RADV properly instead.
- Packed 16-bit mesh shader outputs
- Packed 16-bit FS inputs
- Packed 16-bit pre-rasterization outputs
- Packed 16-bit tessellation and GS I/O
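The packing itself is just bit manipulation; here is a minimal sketch of what sharing one 32-bit slot between two 16-bit varyings looks like (the helper names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Two 16-bit varyings share one 32-bit slot: one lives in the low
 * half, the other in the high half. The producer packs them, and the
 * consumer has to extract the correct half. */
static uint32_t
pack_2x16(uint16_t lo, uint16_t hi)
{
   return (uint32_t)lo | ((uint32_t)hi << 16);
}

static uint16_t
unpack_lo(uint32_t packed)
{
   return (uint16_t)(packed & 0xffff);
}

static uint16_t
unpack_hi(uint32_t packed)
{
   return (uint16_t)(packed >> 16);
}
```

The hard part in the driver isn't the bit shuffling itself, but making every stage's I/O lowering and register programming agree on which half of which slot each 16-bit value occupies.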
Repetitiveness in AMD common code
While writing patches to handle packed 16-bit I/O, we’ve taken note of how repetitive the code was; basically the same thing was implemented several times with subtle differences. Of course, this had to be dealt with.
Mesh shading per-primitive I/O
Mesh shading pipelines have so-called per-primitive I/O, which needs special handling. For example, it is wrong to pack per-primitive and per-vertex inputs or outputs into the same slot. Because OpenGL doesn't have per-primitive I/O, this was left unsolved and needed to be fixed in `nir_opt_varyings` before RADV could use it.
- `nir_recompute_io_bases` needed to learn about per-primitive I/O
- `nir_opt_varyings` itself needed to learn about per-primitive I/O too
Updating load/store alignments
Using nir_opt_varyings
had a slight regression in shader instruction counts
due to inter-stage code motion.
Essentially, it moved some instructions into the previous stage, which prevented
nir_opt_load_store_vectorize
to deduce the alignment of some memory instructions,
resulting in worse vectorization.
The solution was to add a new pass based on the code already written for
nir_opt_load_store_vectorize
that would just update the alignments of each
memory access called nir_opt_load_store_update_alignments
, and run that pass before
nir_opt_varyings
, thereby preserving the aligment info before it is lost.
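For intuition, the alignment that can be deduced for an access is bounded by its base alignment and by the largest power of two dividing its constant byte offset. A minimal standalone sketch of that calculation (not the actual pass, which works on NIR intrinsics):

```c
#include <assert.h>
#include <stdint.h>

/* Deduce the guaranteed alignment of an access at base + const_offset,
 * where the base pointer is known to be base_align-aligned
 * (base_align must be a power of two). */
static uint32_t
deduce_align(uint32_t base_align, uint32_t const_offset)
{
   if (const_offset == 0)
      return base_align;
   /* Largest power of two dividing the offset: isolate lowest set bit. */
   uint32_t off_align = const_offset & (0u - const_offset);
   return off_align < base_align ? off_align : base_align;
}
```

Once an instruction moves to another stage, the constant offset (and thus this deduction) may no longer be visible, which is why recording the result up front helps the vectorizer.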
Promoting FS inputs to FLAT
`FLAT` fragment shader inputs are generally better because they require no interpolation and allow packing 32-bit and 16-bit inputs together, so `nir_opt_varyings` takes special care to promote interpolated inputs to `FLAT` when possible.
However, there was a regression caused by this mechanism unintentionally creating more inputs than there were before.
Dealing with `mediump` I/O
The `mediump` qualifier effectively means that the application allows the driver to use either 16-bit or 32-bit precision for a variable, whichever it deems more optimal for a specific shader.
It turned out that RADV didn't deal with `mediump` I/O at all. This was fine, because it just meant such I/O got treated as normal 32-bit I/O, but it became a problem when it turned out that `nir_opt_varyings` is unaware of `mediump` and mixed it up with other inputs, which confused the vectorizer.
Side note: I am not quite done with `mediump` yet. In the future I plan to lower it to 16-bit precision, but only when I can make sure it doesn't result in more inputs.
Explicitly interpolated (per-vertex) FS inputs
Vulkan has some functionality that allow applications to do custom FS input interpolation
in shader code, which from the driver perspective means that each FS invocation needs to
access the output of each vertex. (This requires special register programming from RADV
on the FS inputs.) This is something that also needed to be added to nir_opt_varyings
before RADV could use it.
Scalarization and re-vectorization of I/O components
The `nir_opt_varyings` pass only works on so-called scalarized I/O, meaning that it can only deal with instructions that write a single output component or read a single input component. Fortunately, there is already a handy `nir_lower_io_to_scalar` pass which we can use.
The downside is that scalarized shader I/O is sub-optimal (in all stages on AMD HW except VS -> FS), because the I/O instructions are really memory accesses, which are simply more optimal when more components are accessed by the same instruction.
This is solved in two ways:
- Rhys has enhanced the excellent `nir_opt_load_store_vectorize` pass to better deal with lowered shader I/O, meaning that it can now better vectorize the memory access instructions that are generated by scalarized I/O.
- Marek has developed a new `nir_opt_vectorize_io` pass which can re-vectorize the I/O intrinsics (before they are lowered to memory access).
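As a toy illustration of re-vectorization (not the actual `nir_opt_vectorize_io` algorithm): given a mask of which components of a slot are written, the goal is to replace runs of scalar accesses with the widest contiguous vector accesses:

```c
#include <assert.h>
#include <stdint.h>

/* Split a 4-bit component write mask (bit i = component i is written)
 * into the widest contiguous vector stores. For example, mask 0b0111
 * becomes a single vec3 store instead of three scalar stores.
 * Returns the number of stores and fills widths[] with their sizes. */
static unsigned
revectorize(uint8_t comp_mask, unsigned widths[4])
{
   unsigned n = 0;
   for (unsigned c = 0; c < 4;) {
      if (!(comp_mask & (1u << c))) {
         c++;
         continue;
      }
      /* Count the run of consecutively written components. */
      unsigned w = 0;
      while (c + w < 4 && (comp_mask & (1u << (c + w))))
         w++;
      widths[n++] = w;
      c += w;
   }
   return n;
}
```

On AMD hardware these accesses are really memory operations, so fewer, wider accesses directly translate to fewer instructions.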
Shader I/O stats
One of the main questions with any kind of optimization is how to measure its effects objectively. This is a solved problem: we have shader stats for this, which contain information about various aspects of a shader, such as number of instructions, register use, etc. Except there were no stats about I/O, so these needed to be added.
These stats are useful to prove that all of this work actually improved things. Furthermore, they turned out useful for finding bugs in existing code as well.
Removing the old linking passes
Considering that `nir_opt_varyings` is supposed to be an all-in-one linking solution, a naive person like me would assume that once the driver has been refactored to use it, we can simply stop using all of the old linking passes. But…
It turns out that getting rid of any of the other passes seems to cause regressions in shader stats such as instruction count (which we don't want). Why is this?
Due to the order in which we call various NIR optimizations, it seems that we can't effectively take advantage of new optimization opportunities after `nir_opt_varyings`. This means that we either have to re-run all expensive optimizations once more after the new linking step, or we will have to reorder our optimizations in order to be able to remove the old linking passes.
Conclusion; future work
While we haven’t yet found the exit from all the rabbit holes we fell into, we made really good progress and I feel that all of our I/O code ended up better after this effort. Some work on RADV shader I/O (as of June 2024) remains:
- Completely remove the old linking passes
- Stop using driver locations (in the few places where they are still used)
- Lower `mediump` to 16-bit when beneficial
Thanks
I owe a big thank you to Marek for developing `nir_opt_varyings` in the first place and for helping me adopt it every step of the way.
Also thanks to Samuel, Rhys and Georg for the good conversations we had and for reviewing my patches.
Comments
The blog doesn't have comments, but feel free to reach out to me on IRC (Venemo on OFTC) or Discord (sunrise_sky) to discuss.