NV40 Technology explained
Part 2: Get more out of the Hardware
September 20, 2004 / by aths / Page 4 of 7
Higher utilization
In the last part we learned how Nvidia managed to significantly increase the pixel shader speed of a single pipeline compared to NV35 without spending more transistors: The pipeline comes with fewer calculation units, but a single pipeline can now use far more of these units in a single clock. This requires very sophisticated crossbar logic. The realization of the NV40 pipeline is an unheard-of engineering feat. Of course, it did not come out of nowhere; such hardware was only possible with practical experience.
Honestly, what is the NV40 pipeline, really? It looks as if the NV30 shader core had been cut in the middle. The TEX unit is attached to the first part (now the first shader unit), while the other part (the second shader unit) gets an additional MUL. And while NV30 has two TEX units per pipe, NV40 comes with a single dedicated texture unit.
While NV30 already supports a faster RSQ_PP (usable for a partial-precision NRM), NV40 now accelerates a complete NRM_PP. Normalization is often used to generate good-looking bump-mapping effects. The greatest advantage over NV30's pipeline is, of course, the implementation of co-issue and dual-issue. The additional FPU we saw in NV35 is no longer needed to reach the performance we want.
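To keep this layout in mind for the rest of the article, here is a tiny sketch that restates the pipeline description above as data. The operation lists merely repeat what is mentioned in the text; they are not a complete or verified list of what each unit can do.

```python
# Rough sketch (not an exact hardware description) of one NV40 pixel pipeline
# as described above: two shader units plus a single dedicated TEX unit.
PIPELINE = {
    "shader unit 1": {"MUL", "RCP", "texture address"},  # feeds the TEX unit
    "shader unit 2": {"MAD", "DP4", "MUL", "ADD"},       # MAD/dot product, extra MUL, co-issue
    "TEX unit":      {"texture fetch"},
}

for unit, ops in PIPELINE.items():
    print(f"{unit}: {sorted(ops)}")
```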
More details about NV40's capabilities
In PS_3_0, there is still no "DIV" instruction; a division has to be done "by hand" using RCP (reciprocal) and MUL (since a/x = 1/x * a = RCP(x) MUL a). The NV40 pipeline can perform a single-clock DIV anyway: "Unit 1" can execute both the RCP and the MUL in one clock cycle. The optimizer has to recognize the intended "DIV" when it encounters such an RCP/MUL pair, and pack it into a single internal slot.
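To illustrate what such a pattern match could look like, here is a minimal Python sketch of a peephole pass that fuses an RCP with a dependent MUL into one internal DIV slot. The Instr format and the fuse_div helper are invented for this illustration; the real driver optimizer certainly works on a different representation and checks more conditions (for example, that the RCP result is not needed again later).

```python
# Hypothetical peephole pass: fuse "RCP t, x" followed by "MUL d, a, t"
# into one internal slot, since Unit 1 can do both in a single clock.
from dataclasses import dataclass

@dataclass
class Instr:
    op: str        # mnemonic, e.g. "RCP", "MUL"
    dest: str      # destination register
    srcs: tuple    # source registers

def fuse_div(instrs):
    """Replace RCP+MUL pairs that form a division with a fused DIV slot."""
    out, i = [], 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        fused = False
        if cur.op == "RCP" and nxt is not None and nxt.op == "MUL":
            others = [s for s in nxt.srcs if s != cur.dest]
            # exactly one MUL operand is the reciprocal -> this is a division
            if len(others) == 1:
                out.append(Instr("DIV", nxt.dest, (others[0], cur.srcs[0])))
                i += 2
                fused = True
        if not fused:
            out.append(cur)
            i += 1
    return out

# a / x expressed as RCP + MUL, then re-fused into one slot:
shader = [Instr("RCP", "r1", ("x",)), Instr("MUL", "r0", ("a", "r1"))]
print(fuse_div(shader))   # -> [Instr(op='DIV', dest='r0', srcs=('a', 'x'))]
```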
The same instruction slot can be packed even further, because Unit 2 offers a MAD or dot-product operation. A full vector4 dot product (DP4) can be moved into the same slot, or alternatively another scalar MUL plus an independent scalar ADD, since Unit 2 supports co-issue as well. Obviously, the optimizer's task is not an easy one. Of course, even NV40 has its limits.
Full-precision normalization (via DP, RSQ and MUL) costs several clocks, since RSQ alone takes two cycles. Nevertheless, while the DP is being performed, Unit 1 is free to do a multiply or start a texture operation, and can calculate an RCP in the same cycle.
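As a reminder of what the full-precision path actually computes, the following sketch spells out NRM as the DP3/RSQ/MUL sequence, i.e. v * rsq(dot(v, v)). This is plain arithmetic for illustration, not NV40 code; the unit assignments in the comments follow our reading of the text.

```python
import math

def nrm(v):
    """Full-precision normalization written as the DP3 / RSQ / MUL
    sequence a shader would use: v * rsq(dot(v, v))."""
    d = sum(c * c for c in v)             # DP3 (vector work, e.g. Unit 2)
    inv_len = 1.0 / math.sqrt(d)          # RSQ (scalar; two cycles per the text)
    return tuple(c * inv_len for c in v)  # MUL (can overlap with other work)

print(nrm((3.0, 0.0, 4.0)))   # -> (0.6, 0.0, 0.8)
```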
Admittedly, in some cases the data flow alone will stall at least parts of a shading unit, which has to be considered in the optimization process as well.
The compiler
To take advantage of both co-issue and dual-issue, one needs a good shader compiler that reorders instructions so as to load the pipeline as much as possible. Part of this job is done by DirectX's shader compiler. In contrast to DirectX 8 pixel shaders, co-issue cannot be expressed directly in the shader code for Pixel Shader 2.0 and above. No assembly pixel shader is actually executed verbatim; it is first translated into native chip instructions. During this translation into hardware instructions (which is done by the driver), another optimization stage is required to take advantage of such capabilities.
While the NV40 has good crossbar logic to arrange the data flow, the calculation units themselves are still integrated in a specific, hardwired order. Building a pipeline able to rearrange its units would lead to a bunch of new problems: higher latencies, a higher transistor count, and in the end it would also be harder to optimize for.
NV40 is still a GPU without real control logic that could optimize the instruction order on its own. Since the chip is internally driven by very long instruction words (VLIW), the effort to gain the best performance is, as usual, a task for software, which saves many transistors in the GPU. Maybe there is an additional limit on how many instructions can be executed per clock, because the data paths are too narrow or the crossbar is not flexible enough.
In addition, as far as we know, each cycle can deliver at most two temporary registers or constants plus a single interpolated value. Constants enter the pipeline directly with the instruction stream. The pipeline can use some internal temporary registers to cache data efficiently. Making good use of this is also a task for the driver's shader optimizer.
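If these limits are indeed "two temporaries or constants plus one interpolated value per clock", a scheduler could gate instruction packing with a check like the following sketch. The register naming convention (r/c/v prefixes) and the exact limit values are assumptions based on the paragraph above, not confirmed specifications.

```python
# Hypothetical check of the per-clock operand budget described above.
MAX_TEMP_OR_CONST = 2
MAX_INTERPOLATED = 1

def register_class(reg):
    """Classify a register by its name prefix (r = temp, c = constant,
    v = interpolated input) -- a naming convention assumed for this sketch."""
    return {"r": "temp_or_const", "c": "temp_or_const", "v": "interpolated"}[reg[0]]

def fits_in_one_cycle(source_operands):
    """Can all source operands of one packed slot be delivered in one clock?"""
    reads = [register_class(s) for s in source_operands]
    return (reads.count("temp_or_const") <= MAX_TEMP_OR_CONST
            and reads.count("interpolated") <= MAX_INTERPOLATED)

print(fits_in_one_cycle(["r0", "c3", "v0"]))        # True
print(fits_in_one_cycle(["r0", "r1", "c3", "v0"]))  # False: three temp/const reads
```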
More details about the optimization
Reordering shader instructions without changing the mathematical result must be done very quickly. We don't have detailed information about what the optimizer is really capable of, so we offer some educated guesses instead.
We believe the optimizer splits a shader into parts, recognizes dependency chains and reorders the instructions in several optimizer passes. Some optimization steps may already be active during the translation from pixel shader code to internal instructions; perhaps the original pixel shader code even goes through optimizations before the translation starts.
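A heavily simplified version of such a dependency analysis could look like this: each instruction depends on the most recent writer of each of its source registers, which yields the chains that any reordering pass must respect. The tuple format is a toy representation chosen for this sketch, not the driver's internal form.

```python
# Toy dependency analysis: an instruction depends on the last instruction
# that wrote one of its source registers (read-after-write only, for brevity).
def dependencies(instrs):
    """instrs: list of (op, dest, sources) tuples.
    Returns {instruction index: set of indices it depends on}."""
    last_writer = {}   # register name -> index of the last instruction writing it
    deps = {}
    for i, (op, dest, srcs) in enumerate(instrs):
        deps[i] = {last_writer[s] for s in srcs if s in last_writer}
        last_writer[dest] = i
    return deps

shader = [
    ("DP3", "r0", ("v0", "v0")),   # r0 = dot(v0, v0)
    ("RSQ", "r1", ("r0",)),        # r1 = 1 / sqrt(r0)
    ("MUL", "r2", ("v0", "r1")),   # r2 = v0 * r1  (the NRM chain)
    ("MUL", "r3", ("v1", "c0")),   # independent work that can be interleaved
]
print(dependencies(shader))   # {0: set(), 1: {0}, 2: {1}, 3: set()}
```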
Co-issue and dual-issue can only be used efficiently if the instructions arrive in a "co- and dual-issue friendly" order. To ensure such a usable execution order, the driver has to check which instructions can be executed within a single clock cycle.
To fully utilize the pipeline, the optimizer also has to perform register reallocation. Furthermore, it should recognize unused parts of the result register (for example the A component) and check whether such unused register components can be used to execute another operation in the same clock cycle.
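To make the idea of reusing unused result components concrete, here is a small sketch that checks whether an extra scalar operation could be steered into a channel the main instruction leaves untouched. The R/G/B/A write-mask model and the pairing rule are simplifications for illustration only.

```python
# Hypothetical helper: find a free output channel for an extra scalar op.
CHANNELS = ("R", "G", "B", "A")

def free_channels(write_mask):
    """write_mask: the channels the main instruction writes, e.g. {"R","G","B"}."""
    return [c for c in CHANNELS if c not in write_mask]

def can_pack_scalar(main_write_mask, scalar_channel):
    """True if the scalar op's result channel is unused by the main op,
    so both could retire in the same clock (co-issue, simplified)."""
    return scalar_channel in free_channels(main_write_mask)

print(free_channels({"R", "G", "B"}))         # ['A'] -- the A component is free
print(can_pack_scalar({"R", "G", "B"}, "A"))  # True
```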
We don't know for sure whether an NV40 shader unit can execute two 2-channel MUL instructions that each use the same channels (let's say R and G) at the same time, but we think it can pull this off. Some instruction sequences can be expressed with fewer instructions, but in some cases this only works within a given range of allowed values, and the optimizer has to check for that. Many texture formats, for example, limit the possible range of values. Take a MUL_SAT followed by an ADD_SAT: this can sometimes be executed as a single MAD_SAT instruction, and if so, the optimizer should recognize it.
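As a concrete instance of such a range-dependent rewrite: saturate(saturate(a*b) + c) equals saturate(a*b + c) whenever a*b is already guaranteed to lie in [0, 1], for example because a and b were read from unsigned texture formats. The sketch below encodes only that single rule; real value-range tracking would be far more general, and the register names are made up for illustration.

```python
# Range-checked fusion of MUL_SAT + ADD_SAT into MAD_SAT.
# known_unit_range marks registers whose values are known to lie in [0, 1]
# (e.g. they were read from an unsigned-normalized texture format).

def can_fuse_mad_sat(mul_srcs, known_unit_range):
    """The inner saturate is a no-op -- and the fusion therefore safe --
    if the product of the MUL operands already lies in [0, 1]."""
    return all(src in known_unit_range for src in mul_srcs)

known_unit_range = {"r0", "r1"}   # e.g. sampled from 8-bit unsigned textures

# MUL_SAT r2, r0, r1  followed by  ADD_SAT r3, r2, r4
print(can_fuse_mad_sat(("r0", "r1"), known_unit_range))  # True -> emit MAD_SAT r3, r0, r1, r4
print(can_fuse_mad_sat(("r0", "r5"), known_unit_range))  # False -> keep both instructions
```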
NV40 (as well as NV30) supports arbitrary swizzles. This means that component values (R, G, B and A channels) of a register can be freely reordered.
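In code, a swizzle is nothing more than picking source components by index, as this small illustration shows (the .xyzw notation mirrors the usual shader convention; this is not NV40 syntax).

```python
# Applying a swizzle such as ".wzyx" or ".xxxy" to a 4-component register.
COMPONENT_INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}   # x/y/z/w alias R/G/B/A

def swizzle(register, pattern):
    """Return the register's components in the order given by the pattern."""
    return tuple(register[COMPONENT_INDEX[c]] for c in pattern)

r0 = (1.0, 2.0, 3.0, 4.0)
print(swizzle(r0, "wzyx"))   # (4.0, 3.0, 2.0, 1.0) -- fully reversed
print(swizzle(r0, "xxxy"))   # (1.0, 1.0, 1.0, 2.0) -- components may also repeat
```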