Shader linking is one of the more complicated topics in graphics driver development. It is both a never-ending effort in the pursuit of performance and a black hole in which driver developers disappear. In this post, I intend to give an introduction to what shader linking is and why it’s worth spending our time on it.

Let’s start with the basics.

Shaders are smallish programs that run on your GPU and are necessary for any graphical application in order to draw things on your screen. A typical game can have thousands, or even hundreds of thousands, of shaders. Because every GPU has its own instruction set, with major differences between vendors and generations, it is generally the responsibility of the graphics driver to compile shaders in the way that is most optimal for your GPU, in order to make games run fast.

One of the ways of making them faster is linking. Many times, the driver knows exactly which shaders are going to be used together, which gives it the opportunity to optimize based on the assumption that those shaders are only ever used with each other (and never with any others).

In Vulkan, there are now three ways for an application to create graphics shaders, and all of them make it possible to utilize linking:

  1. Graphics pipelines, which contain a set of shaders and a bunch of graphics state baked together. We can always link the shaders of a graphics pipeline, because it is impossible to change them after the pipeline has been created.
  2. Graphics pipeline libraries, which essentially split a pipeline into several parts. The API allows the application to create full pipelines by linking pipeline libraries together.
  3. Shader objects, a newer extension that deals with shaders individually, but still allows the application to ask the driver to link them together (see the sketch below).
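
To make the last option more concrete, here is a minimal sketch of how an application could ask the driver to link a VS and FS together using VK_EXT_shader_object. The SPIR-V blobs and their sizes are placeholders, and error handling is omitted:

    /* Create a linked VS + FS pair with VK_EXT_shader_object.
     * Setting VK_SHADER_CREATE_LINK_STAGE_BIT_EXT on both stages
     * (created in the same call) allows the driver to optimize
     * them as a pair. */
    VkShaderCreateInfoEXT infos[2] = {
       {
          .sType = VK_STRUCTURE_TYPE_SHADER_CREATE_INFO_EXT,
          .flags = VK_SHADER_CREATE_LINK_STAGE_BIT_EXT,
          .stage = VK_SHADER_STAGE_VERTEX_BIT,
          .nextStage = VK_SHADER_STAGE_FRAGMENT_BIT,
          .codeType = VK_SHADER_CODE_TYPE_SPIRV_EXT,
          .codeSize = vs_spirv_size, /* placeholder */
          .pCode = vs_spirv,         /* placeholder */
          .pName = "main",
       },
       {
          .sType = VK_STRUCTURE_TYPE_SHADER_CREATE_INFO_EXT,
          .flags = VK_SHADER_CREATE_LINK_STAGE_BIT_EXT,
          .stage = VK_SHADER_STAGE_FRAGMENT_BIT,
          .codeType = VK_SHADER_CODE_TYPE_SPIRV_EXT,
          .codeSize = fs_spirv_size, /* placeholder */
          .pCode = fs_spirv,         /* placeholder */
          .pName = "main",
       },
    };

    VkShaderEXT shaders[2];
    vkCreateShadersEXT(device, 2, infos, NULL, shaders);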

In Mesa, we mainly represent shaders in NIR (Mesa’s intermediate representation for shaders), and that is where link-time optimizations happen.

Why is shader linking beneficial to performance?

Shader linking allows the compiler stack to make assumptions about a shader by looking at another shader that it is used together with. Let’s take a look at what optimizations are possible.

Deletion of I/O operations

The compiler may look at the outputs of a shader and the inputs of the next stage, and delete unnecessary I/O. For example, when you have a pipeline with VS (vertex shader) and FS (fragment shader):

  • The compiler can safely remove outputs from the VS that are not read by the FS. As a result, any shader code which was used to calculate the value of those outputs will now become dead code and may be removed as well.
  • The compiler can replace FS inputs that aren’t written by the VS with an “undefined” value (or zero) which algebraic optimizations can take advantage of.
  • The compiler may detect that some VS outputs are duplicates and may remove the superfluous outputs from the VS, and also merge their corresponding inputs in the FS.
  • When a VS output is a constant value, the compiler can delete the VS output and replace the corresponding FS input load instructions with the constant value. This enables further algebraic optimizations in the FS.

As a result, both the VS and FS will have fewer I/O instructions and better-optimized algebraic instructions. The same ideas are applicable to basically any two consecutive shader stages.

This first group of optimizations is the easiest to implement and has long been supported by NIR: nir_remove_dead_variables, nir_remove_unused_varyings and nir_link_opt_varyings have all existed for many years.
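
To illustrate, here is a rough sketch of what a driver’s link step could look like for a VS/FS pair using these passes. This is loosely modeled on how Mesa drivers call the NIR linking helpers; the exact signatures may differ between Mesa versions:

    #include "nir.h"

    /* Hypothetical link step for a producer (VS) / consumer (FS) pair,
     * loosely modeled on how Mesa drivers use the NIR linking helpers. */
    static void
    link_vs_fs(nir_shader *vs, nir_shader *fs)
    {
       /* Delete VS outputs that the FS never reads, and FS inputs
        * that the VS never writes. */
       nir_remove_unused_varyings(vs, fs);

       /* Propagate constant and duplicate VS outputs into the FS. */
       nir_link_opt_varyings(vs, fs);

       /* Clean up the I/O variables that are now dead in both shaders. */
       nir_remove_dead_variables(vs, nir_var_shader_out, NULL);
       nir_remove_dead_variables(fs, nir_var_shader_in, NULL);
    }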

Compaction of I/O space

Shader linking also lets the compiler “compact” the I/O space, by reordering I/O variables in both shaders so that they use the least amount of space possible.

For example, the two shaders may have “gaps” between the I/O slots that they use, and the compiler can then be smart and rearrange the I/O variables so that there are as few gaps as possible.

As a result, less I/O space will be used. The exact benefit of this optimization depends highly on the hardware architecture and on which stages are involved. But generally speaking, using less space can mean less memory use (which can translate into better occupancy or higher throughput), or simply that the next stage launches faster, uses fewer registers, or needs fewer fixed-function HW resources, etc.

NIR has also long supported I/O compaction in nir_compact_varyings, however its implementation was far from perfect: handling indirect indexing was its main challenge, and it couldn’t pack 16-bit varyings into 32-bit slots.
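
In the hypothetical link step sketched above, compaction would be one extra call after the deletion passes (again, the exact signature may vary between Mesa versions):

    /* Reorder the remaining varyings so that they occupy as few
     * I/O slots as possible. The last parameter tells the pass to
     * assume smooth interpolation for unqualified varyings. */
    nir_compact_varyings(vs, fs, true /* default_to_smooth_interp */);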

Code motion between shader stages

Also known as inter-stage code motion, this is a complex optimization that has two main goals:

  • Create more opportunities for the aforementioned optimizations. For example, when you have two outputs such that the value of one can be trivially calculated from the other, it may be beneficial to do that calculation in the next stage instead, which then enables us to remove the extra output.
  • Move code to earlier stages, based on the assumption that earlier stages have fewer invocations than later ones, which means the same instructions need to be executed fewer times, making the pipeline faster overall. Most of the time, we can safely assume that there are fewer VS invocations than FS invocations, so it’s generally a win to move instructions from the FS to the VS. The same is unquestionably beneficial when there is geometry amplification, for example when moving code from the TES to the TCS.

This concept is all-new in Mesa: it didn’t exist until Marek wrote nir_opt_varyings recently.
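
To give a rough idea of how the new pass is meant to be used: the driver calls it on each pair of adjacent stages, and the pass reports which of the two shaders it changed, so the driver knows which ones need to be re-optimized afterwards. Here is a sketch based on my understanding of the interface at the time of writing; the parameter values are made up, optimize_shader is a hypothetical driver helper, and the signature may change:

    /* Run the all-in-one varying optimization on a producer/consumer
     * pair. "spirv" tells the pass whether the shaders came from
     * SPIR-V; the two limits bound how much uniform/UBO space the
     * pass may use when moving values between stages. */
    nir_opt_varyings_progress progress =
       nir_opt_varyings(vs, fs, false /* spirv */,
                        16 /* max_uniform_components (made up) */,
                        4 /* max_ubos_per_stage (made up) */);

    if (progress & nir_progress_producer)
       optimize_shader(vs); /* hypothetical driver helper */
    if (progress & nir_progress_consumer)
       optimize_shader(fs);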

Why is all of that necessary?

At this point, you might ask yourself: why is all of this necessary? In other words, why do shaders actually need these optimizations? Why don’t app developers write shaders that are already optimal?

The answer might surprise you.

Many times, the same shader is reused in different pipelines, in which case the application developer needs to write it in a way that keeps it interchangeable. This is simply good practice from the perspective of the application developer, because it reduces the number of shaders they need to maintain.

Sometimes, applications effectively generate different shaders from the same source using ifdefs, specialization constants, etc.
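
As an example of the latter, Vulkan’s specialization constants let the application compile different variants of the same SPIR-V module at pipeline creation time. A minimal sketch, where the constant ID and value are placeholders:

    /* Give specialization constant 0 a pipeline-specific value. */
    const uint32_t use_fancy_lighting = 1; /* placeholder */

    const VkSpecializationMapEntry entry = {
       .constantID = 0,
       .offset = 0,
       .size = sizeof(use_fancy_lighting),
    };

    const VkSpecializationInfo spec_info = {
       .mapEntryCount = 1,
       .pMapEntries = &entry,
       .dataSize = sizeof(use_fancy_lighting),
       .pData = &use_fancy_lighting,
    };

    /* Each pipeline passes its own spec_info through
     * VkPipelineShaderStageCreateInfo::pSpecializationInfo,
     * yielding a differently specialized shader from the
     * same SPIR-V module. */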

Even though the same source shader was written to be usable with multiple other shaders, the driver can treat it as a different shader in each pipeline, and in each pipeline the shader will be linked to the other shaders of that specific pipeline.

What’s new in Mesa shader linking?

The big news is that Marek Olšák wrote a new pass called nir_opt_varyings, which is an all-in-one solution for all of the optimizations above, and now authors of various drivers are rushing to take advantage of this new code. I’ll share my experience of using it in RADV in the next blog post.