NV40 Technology explained
Part 1: Secrets of the Pixelshading Unit
August 19, 2004 / by aths / Page 2 of 7
Counting shader units
We don't regard Radeon's Mini-FPUs as "real" shading units, because they cannot perform every instruction. Inconsistently, though, we count two shader units for NV40's pipes, although both of those units are limited as well. At first glance, we have a kind of all-purpose shader core plus an additional FPU that we already know from NV35. Let's take a closer look.
Overview of NV40's pipeline.
The loopback around the block of all these units is not shown here. Unit 1 apparently consists of two units, a math unit and a texture unit.
Math Unit 1 cannot provide the simple ADD operation. ADD is handled by Unit 2, along with an additional MUL (two MULs and an ADD per pipeline have been provided since the first Riva TNT). As for the special functions, Unit 2 performs some of them, because Unit 1 cannot calculate every special function. The two units complement each other to form a single, complete all-purpose arithmetic unit. MUL is the only operation both units can compute. These two MULs per pipeline are needed to execute some other instructions within a single cycle. In the end, we have a single all-purpose shader unit, but one that can do as much work as two ordinary units under certain circumstances. Nvidia calls this "dual issue". In addition, there is "co-issue", which means using separate color and alpha combiners. NV40 can do even more.
The traditional co-issued pipeline can work as a single vector4 unit, or as separate vector3 and scalar units. NV40's pipelines can also work in a vector2 + vector2 configuration. With dual issue and co-issue combined, an NV40 pipe can execute up to four instructions, even though it has only a single all-purpose arithmetic unit. For explanatory reasons, we will continue to talk about Unit 1 and Unit 2.
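The counting above can be written down as a small sketch. It only computes the theoretical upper bound and ignores which operations each unit actually supports. The names `TRADITIONAL_SPLITS`, `NV40_SPLITS` and `max_instructions_per_clock` are our own and purely illustrative.

```python
# Ways the four channels of one unit can be divided by co-issue,
# as listed in the text: vector4, vector3 + scalar, and (NV40 only) vector2 + vector2.
TRADITIONAL_SPLITS = [(4,), (3, 1)]
NV40_SPLITS = [(4,), (3, 1), (2, 2)]


def max_instructions_per_clock(splits, dual_issue_units):
    """Upper bound on instructions per clock for one pixel pipe.

    splits: allowed channel splits for a single unit (co-issue).
    dual_issue_units: how many units can issue in the same clock (dual issue).
    """
    # The most instructions one unit can co-issue is the longest split.
    most_co_issued = max(len(split) for split in splits)
    return most_co_issued * dual_issue_units


# NV40: two co-issued instructions on each of the two units.
print(max_instructions_per_clock(NV40_SPLITS, dual_issue_units=2))  # 4
```

This is a peak figure only. As soon as an instruction needs an operation that just one of the two units provides, the bound cannot be reached.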
We have to clarify the difference between shader units and the actual calculation units. Pixelshader 2.0 and above require additional instructions in hardware which cannot be executed with additions and multiplications alone. Therefore, any general-purpose shader unit must have built-in units for the required "special function" calculations.
Simply put, NV30 can execute just one FP instruction per clock, while R420 can execute up to two (with a "vertical split"). NV40 can now execute up to four instructions per clock (with both a "vertical" and a "horizontal" split). However, since the "two" NV40 shader units are each limited in the instructions they support (as explained above), the average utilization will be lower in real-world situations. There are further limitations that prevent it from reaching full throughput.
Quad-based rendering
We have to state that the pipeline we discussed is not actually a pipeline. The real pipeline exists once per quad, where a quad is a block of four pixels. For example, NV30 and NV35 each come with just a single quad pipeline, capable of rendering four pixels at the same time. This SIMD technique is very similar to the MMX, 3DNow!, or SSE extensions found in CPUs. It saves control and dataflow logic. Moreover, certain calculations can only be carried out on quads rather than on single pixels, for example the LOD (level of detail) for texture sampling. GeForce 6800 GT and Ultra have four independent quad pipelines; each quad pipeline has its own single pixel processor that renders four pixels at the same time.
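To illustrate why the LOD needs a quad rather than a single pixel: the usual textbook approach derives it from the differences in texture coordinates between neighboring pixels. The sketch below assumes that textbook approach. We do not know the exact formula NV40 uses, and the function name `quad_lod` and its parameters are our own.

```python
import math


def quad_lod(u, v, tex_width, tex_height):
    """Estimate one mipmap LOD for a 2x2 quad (textbook formula, an assumption).

    u, v: 2x2 nested lists of normalized texture coordinates, indexed [row][col].
    tex_width, tex_height: size of the base texture level in texels.
    """
    # Finite differences between neighboring pixels, scaled to texels.
    # A single pixel has no neighbor, so these cannot be formed without the quad.
    dudx = (u[0][1] - u[0][0]) * tex_width
    dvdx = (v[0][1] - v[0][0]) * tex_height
    dudy = (u[1][0] - u[0][0]) * tex_width
    dvdy = (v[1][0] - v[0][0]) * tex_height

    # Texels covered per pixel step, taking the larger of the two screen directions.
    rho = max(math.hypot(dudx, dvdx), math.hypot(dudy, dvdy))
    if rho <= 1.0:
        return 0.0  # magnification or 1:1 mapping: use the base level
    return math.log2(rho)


# A 256x256 texture where each screen pixel steps over two texels.
step = 2 / 256
u = [[0.0, step], [0.0, step]]
v = [[0.0, 0.0], [step, step]]
print(quad_lod(u, v, 256, 256))  # 1.0
```

All four pixels of the quad share the resulting LOD, which fits the statement that such values are computed per quad.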
"Execution in a single clock" does not mean that the result is provided in the next cycle. Any shading unit has internal crossbars and calculation units, delivering new results per clock – after the time to load it. NV40's pipelines has probably a depth of about 256 stages. This means, a full quad pipeline has 256 quads in flight at the same time. As the GeForce 6800 GT/Ultra has four quad pipelines, and a quad consists of four pixels, 4096 pixels are moving through the chip at the same time, to hide the pipeline latencies. Our numbers should be considered a guess.
More details about Shader Unit 1
Since DirectX 9, interpolated vertex colors and interpolated (plus perspective-corrected) texture coordinates can be freely used as input for any instruction. Those values are calculated only when needed, but then for the entire quad. This is true for any instruction. Let's go down to the single-pixel level and take a closer look at Shader Unit 1:
Shader Unit 1
From top to bottom: two different sources can be used as inputs for the pixel pipeline (the rasterizer or the pipeline loopback). A crossbar transmits the required values to the appropriate interpolators in Unit 1. We don't know how many interpolators are implemented in the hardware (Shader Model 3.0 logically offers 10 interpolators, instead of 8 in version 2.0 / 2.X). Shader Unit 1 has an SFU built in (yellow) and four multiply channels (shown in blue-green). A dedicated unit for texture operations follows (orange).
Actually, the special functions RCP and RSQ are two different units in the hardware, but to keep it simple, we abstract them into a single unit we call SFU#1. The whole shader unit can execute up to two instructions per clock, either SFU+MUL3 ("3" stands for up to three components) or SFU+TEX. If only MUL is needed, up to two independent MULs can be executed, but any clock cycle is still limited to four data channels at most, meaning that a MUL2 and a MUL3 cannot be computed together in a single cycle. Since some data paths are shared, any unit can also just hand over the data, which means this unit is effectively blocked for this cycle. The result of the special functions RCP and RSQ can be used as input for any of the four multiply channels.
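The pairing rules just described can be expressed as a small check. This is a simplified model built on our own assumptions: the SFU is counted as one scalar channel, and a TEX is treated as occupying the multiply channels so that it pairs only with the SFU. The function `unit1_can_issue` is hypothetical and not part of any Nvidia tool.

```python
MAX_CHANNELS = 4      # data channels available per clock, per the text
MAX_INSTRUCTIONS = 2  # Unit 1 issues at most two instructions per clock


def unit1_can_issue(sfu=False, tex=False, mul_widths=()):
    """Return True if the given mix fits into one clock of Shader Unit 1.

    sfu: one special function (RCP or RSQ) is requested.
    tex: one TEX operation is requested.
    mul_widths: component count of each requested MUL, e.g. (2, 2).
    """
    n_instructions = int(sfu) + int(tex) + len(mul_widths)
    if not 1 <= n_instructions <= MAX_INSTRUCTIONS:
        return False
    if any(not 1 <= width <= MAX_CHANNELS for width in mul_widths):
        return False
    if tex:
        # Assumption: TEX uses the MUL channels, so no separate MUL alongside it.
        return not mul_widths
    # Assumption: the SFU result takes one of the four channels.
    channels = (1 if sfu else 0) + sum(mul_widths)
    return channels <= MAX_CHANNELS


print(unit1_can_issue(sfu=True, mul_widths=(3,)))  # True:  SFU + MUL3
print(unit1_can_issue(sfu=True, tex=True))         # True:  SFU + TEX
print(unit1_can_issue(mul_widths=(2, 2)))          # True:  MUL2 + MUL2
print(unit1_can_issue(mul_widths=(2, 3)))          # False: five channels needed
```

The model deliberately leaves out the case where a unit merely hands data through and is blocked for that cycle.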
The TEX unit needs texture coordinates as input. Because this unit does not have access to the input registers, Shader Unit 1 has to channel them through. The input for the TEX operations goes through the MUL channels. It is possible to modify the input with the MUL operation right before the TEX operation in the same clock. In most cases, at least the scalar special function can still be used while a TEX operation is performed.
Subsequently, TEX calculates the LOD (level of detail). Both the coordinates and the LOD are then transmitted to the TMU located near the memory controller (which is not included in our picture). The TMU performs the actual sampling of the texture and returns the sample to an input register for Unit 2. Because any texture sampling comes with a latency of several clocks, the pipeline executes instructions for other quads in the meantime.