Inside nVidia NV40

April 14, 2004 / by aths / page 2 of 6

A single NV40-pipe in detail

Our picture of the pipe is simplified to avoid getting lost in too much detail. We are also looking at DirectX9-shader power only.

A single NV30-pipe can perform two texture operations or one arithmetic instruction (of any kind). All this work is done in the shader core. Some special functions need more clock cycles. A texture operation does not provide the final texture sample instantly: if you filter better than bilinear, the pipe has to use both of its TMUs, and in some cases even internal loops are required. Because the actual texture sampling usually takes longer to execute than the calculations for the texture operation, we count the power of texture sampling in our article and arrive at 2T+1M.

A single NV35-pipe can do all of the above plus take advantage of its new FPU. This unit provides one or two additional arithmetic instructions per clock, depending on the actual shader code. This improvement of arithmetic power via the FPU is limited to MAD-operations. MAD is short for multiply-add. Multiply (MUL), ADD, and multiply-add are very common operations. Roughly speaking, an NV35-pipe is about 2T+2M. With its 4 pipes, the chip reaches about 8T+8M in total.

A single NV40-pipe can do the following: in Shader Unit 1 (formerly "shader core") either a texture operation, a MUL, or a special function. Shader Unit 2 (formerly "FPU") can supply any MAD-operation. This includes Dot3, which is important for Dot3-lighting and bumpmapping. Again, Doom3 makes much use of Dot3 bumpmapping. It's quite probable that Shader Unit 2 is also able to calculate some additional special functions, but we couldn't finish the investigation in time. We should keep in mind that we discuss the logical pipe here. The actual hardware may work in a completely different way than we think.

Each NV40 pipe comes with two different shader units.

"MUL": Multiply
"SFU": Special functions (like RCP and RSQ, that is 1/x, and 1/√ x)
"MAD": Multiply-Add, "Dot": Any Dot-product (up to Dot4.)
"Texture" means a texture operation. Such an operation initiates the filtering of a texture sample. (Certain calculations are needed before filtering can be done.)

The whole truth is more complicated, though. What's more, we don't know each and every detail at this time. We expect the last secrets to be revealed in technical papers for developers. Anyway, let's discuss our simplified scheme first, then have a look at the more complete picture.

Each pipe has its own shader units. The beauty is that any single unit can perform up to two independent instructions per clock. This is done by splitting the unit. A full pixel vector consists of 4 single values (red, green and blue for color, and alpha for transparency or other special effects). Any unit can split into 3:1 (we should mention that ATI can do this too, since Radeon 9500) or into 2:2 (this is an NV40 exclusive). This makes it possible to bundle some of the DP2-instructions. Also, because some calculations alter the texture sample position (which is often a vector2 value), the 2:2-split really comes in handy.
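To make the splitting rule concrete, here is a minimal Python sketch of a co-issue check. It is a toy model only, not nVidia's or ATI's actual logic: the function name can_co_issue, the write-mask strings and the two split tables are our own illustrative assumptions, derived from nothing more than the 3:1 and 2:2 splits described above. The model merely counts components; real hardware may impose further restrictions (for example, on which components form the "1" part).

```python
# Toy model of the per-unit split described in the text. All names and the
# data layout are illustrative assumptions, not a documented interface.

# Ways to divide the four components between two instructions (wide, narrow).
NV40_SPLITS = {(3, 1), (2, 2)}  # 3:1 and 2:2, as described in the article
R300_SPLITS = {(3, 1)}          # 3:1 only


def can_co_issue(mask_a, mask_b, allowed_splits):
    """Return True if two instructions could share one unit in one clock.

    mask_a and mask_b are write masks such as "xyz" or "w". The two
    instructions must write disjoint components, and their component
    counts must fit into one of the allowed splits.
    """
    a, b = set(mask_a), set(mask_b)
    if not a or not b:
        return False                      # an empty mask writes nothing
    if not (a | b) <= set("xyzw"):
        return False                      # unknown component name
    if a & b:
        return False                      # both write the same component
    big, small = max(len(a), len(b)), min(len(a), len(b))
    return any(big <= wide and small <= narrow
               for wide, narrow in allowed_splits)


# A vec3 color operation plus a scalar alpha operation fits both schemes.
print(can_co_issue("xyz", "w", NV40_SPLITS), can_co_issue("xyz", "w", R300_SPLITS))
# Two vec2 operations only fit where a 2:2 split exists.
print(can_co_issue("xy", "zw", NV40_SPLITS), can_co_issue("xy", "zw", R300_SPLITS))
```

In this model, two vector2 calculations (such as two texture-coordinate adjustments) pair up under the 2:2 split, but not under a 3:1-only scheme.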

To raise the chances of performing two operations in one shader unit, the driver optimizes the shader code. The shader instructions are reordered without changing the mathematical expression, so that instructions can be "combined" (this is called "dual issue") as often as possible. No chip handles the assembler shader directly; each works on a translated version. The translated result can be optimized, too. So, any NV40 pipe can perform up to four arithmetic instructions per clock with its two shader units. Because we have 16 pipes in total, the overall chip performance is 16T+64M at maximum. Since "dual issue" can only sometimes be used, the actual performance will be lower.
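The peak figures quoted so far are simple multiplications, which the following short Python sketch spells out. The helper names peak_per_clock and as_label are our own, chosen for illustration; the inputs are the per-pipe numbers from this article. Note that these are upper bounds: in our simplified model, Shader Unit 1 performs either a texture operation or arithmetic, so the T and M maxima are counted separately.

```python
def peak_per_clock(pipes, tex_per_pipe, arith_per_pipe):
    """Return (texture ops, arithmetic ops) per clock for the whole chip."""
    return pipes * tex_per_pipe, pipes * arith_per_pipe


def as_label(tex, arith):
    """Format a pair of per-clock counts in the article's xT+yM notation."""
    return f"{tex}T+{arith}M"


# NV35: 4 pipes, roughly 2T+2M per pipe.
nv35 = peak_per_clock(pipes=4, tex_per_pipe=2, arith_per_pipe=2)

# NV40: 16 pipes, one texture operation per pipe, and two shader units
# that can each dual-issue two arithmetic instructions (2 * 2 = 4).
nv40 = peak_per_clock(pipes=16, tex_per_pipe=1, arith_per_pipe=2 * 2)

print(as_label(*nv35))  # 8T+8M
print(as_label(*nv40))  # 16T+64M
```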

R300 splits its shader units too, but in a more restricted way, and every R300-pipe comes with just one shader unit. However, the R300 shader units can do all the math, and there is also an additional unit for texture operations only, while NV40's units #1 and #2 are each limited to specific operations. In the end, NV40-pipes should be more powerful than R300-pipes.

We assume that nVidia's current driver release does not yet expose the full potential of "dual issue" optimizations. For the technical enthusiasts, we now offer a closer look at the pipe. If you have already heard enough, you are encouraged to skip ahead to the next chapter.

A more detailed picture of the units in an NV40-pipe.

On the left, we added the "Mini-ALUs". A Mini-ALU is also an FP32-unit, but limited to some simple scale- and bias-operations. For DirectX8-shaders, those "modifiers" integrated in the pipe are important for "full speed". With a good shader optimizer, such a Mini-ALU will also provide certain DirectX9 pixel shader operations "for free". (Since DX8-performance is mostly fillrate-bound, NV40 with its 16 TMUs delivers about twice as much power here as the previous generation with its 8 TMUs.)
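For readers unfamiliar with these "modifiers", here is a small Python sketch of what scale-and-bias means. The modifier names are those of the DirectX8 pixel shader language (_bias, _bx2, _x2, _x4, _d2 and the 1-x complement). Which of them a Mini-ALU actually handles is not known to us, so treat the table as an illustration of the arithmetic rather than a description of the hardware; the helper scale_bias and the MODIFIERS table are our own names.

```python
def scale_bias(x, scale=1.0, bias=0.0):
    """Generic scale-and-bias operation: scale * x + bias."""
    return scale * x + bias


# DirectX8 pixel shader modifiers, each expressed as one scale-and-bias step.
# That a Mini-ALU covers exactly this set is an assumption for illustration.
MODIFIERS = {
    "_bias": lambda x: scale_bias(x, 1.0, -0.5),  # x - 0.5
    "_bx2": lambda x: scale_bias(x, 2.0, -1.0),   # 2 * (x - 0.5)
    "_x2": lambda x: scale_bias(x, 2.0),          # 2 * x
    "_x4": lambda x: scale_bias(x, 4.0),          # 4 * x
    "_d2": lambda x: scale_bias(x, 0.5),          # x / 2
    "1-x": lambda x: scale_bias(x, -1.0, 1.0),    # complement
}

# Typical use: a normal map stores components in [0, 1]; _bx2 expands
# them back to the signed range [-1, 1].
stored = (0.0, 0.5, 1.0)
print([MODIFIERS["_bx2"](c) for c in stored])  # [-1.0, 0.0, 1.0]
```

Since every entry is a single multiply followed by an add, a small dedicated unit is enough, and the main shader units stay free for other work.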

Speaking of "free": FP16-NRM often comes for free. NRM is an instruction (actually a macro) for normalizing a vector (that is, setting the length of the vector to 1). This is the one and only NV40-instruction where FP16 is faster than FP32. Even FX12 (a fixed-point format for DirectX8-shaders) is no longer faster than FP32. On NV30 and NV35, FX12-calculations are way quicker than the same calculations with floating point precision.

Normalization calculations are often used while rendering bumpmaps. Previously, there were two ways to normalize: either use a precalculated cubemap, or use a Dot operation followed by RSQ and MUL. It is a good thing that NRM_PP ("PP" stands for "partial precision", i.e. FP16) is now this fast. It would be better, though, if a full precision NRM were this fast, too.
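The Dot, RSQ, MUL sequence can be written out in a few lines of Python. This is only a sketch of the arithmetic that the NRM macro stands for; the function names dp3, rsq, mul and normalize are ours, and the code says nothing about how the hardware schedules these steps.

```python
import math


def dp3(a, b):
    """Three-component dot product (Dot3)."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def rsq(x):
    """Reciprocal square root, 1/sqrt(x). Fails for x == 0."""
    return 1.0 / math.sqrt(x)


def mul(v, s):
    """Multiply every component of vector v by the scalar s."""
    return tuple(c * s for c in v)


def normalize(v):
    """Scale v to length 1 using the three-step sequence from the text."""
    length_squared = dp3(v, v)        # step 1: Dot
    inv_length = rsq(length_squared)  # step 2: RSQ
    return mul(v, inv_length)         # step 3: MUL


n = normalize((3.0, 0.0, 4.0))
print(n)          # approximately (0.6, 0.0, 0.8)
print(dp3(n, n))  # approximately 1.0
```

Three dependent instructions for every normalized vector add up quickly in a bumpmapping shader, which is why a "free" NRM is attractive.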

Regarding "real world performance", NV40 is still faster with FP16. This smaller FP-format needs less bandwidth and less space in the register file, so the file can hold more registers at the same time.
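The space argument is easy to check with a little arithmetic, sketched here in Python. The register file size used below is a made-up number for illustration only, since we do not know the real one; Python's IEEE half-precision format ("e" in the struct module) stands in for FP16.

```python
import struct

# Size of one four-component register in each format.
FP32_VEC4_BYTES = struct.calcsize("<4f")  # 4 components * 4 bytes = 16
FP16_VEC4_BYTES = struct.calcsize("<4e")  # 4 components * 2 bytes = 8

# Hypothetical register file budget, NOT a real NV40 figure.
HYPOTHETICAL_FILE_BYTES = 256


def registers_that_fit(file_bytes, bytes_per_register):
    """Number of whole registers that fit into the given space."""
    return file_bytes // bytes_per_register


def to_fp16(x):
    """Round a Python float to half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


print(registers_that_fit(HYPOTHETICAL_FILE_BYTES, FP32_VEC4_BYTES))  # 16
print(registers_that_fit(HYPOTHETICAL_FILE_BYTES, FP16_VEC4_BYTES))  # 32

# The price is precision: 0.5 survives the round trip, 0.1 does not.
print(to_fp16(0.5) == 0.5, to_fp16(0.1) == 0.1)  # True False
```

Whatever the real size is, halving the bytes per register doubles the number of registers that fit.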

NV35-pipes can perform at least two Dot-operations per clock. NV40 is limited to one, because Shader Unit 1 can do a MUL, but not an ADD. nVidia claims their shader analysis showed that in most cases two ADDs or DOTs weren't needed, so they did not implement this in NV40.

After all, NV40 is less "static" than you may think. If Shader Unit 1 performs a texture operation, this usually blocks the arithmetic part. But sometimes, it can still execute at least a scalar arithmetic instruction during the same clock cycle. Also, it looks like the TMU includes logic for scaling, which can offload the shader units. Again, please do not take our picture as a view of what the real hardware looks like. Maybe there are in fact not two shader units, but a single big one doing all the work. We don't know; we only try to explain what a single pipe can do per clock.

Even though a single NV40-pipe has less raw power than an NV35-pipe, nVidia managed to greatly increase the efficiency. The big picture: with its large count of pipes, the GPU is very fast indeed. nVidia promised that there will be no hand-optimized shader replacement for benchmarks. Simply put, such "optimizations" are no longer necessary.