NV40 Technology explained
Part 2: Get more out of the Hardware
September 20, 2004 / by aths / Page 5 of 7
Which one is more efficient?
In previous articles, we considered NV40 less efficient than R420 because, despite having twice as many shading units per pipe, it cannot deliver twice R420's performance. Now, knowing more details, NV40 actually looks more efficient than R420: the two shading units are really nothing more than a model to explain the dependencies between the calculation sub-units of one full shader unit (logically, R420 also has a single full shader unit per pipe).
Of course, this is still not the whole truth. NV40's texture format conversion unit, attached to the TMU, can provide some scaling operations "for free" (in the same hidden stage as the format conversion), since scaling by a power of two is very easy to implement for floating point. Such additional logic can offload the pixel shader's main shading units. It is, however, not possible to use this specialized extra logic every clock, which leads to some inefficiency. As far as we know, NV40's TMU can apply the _bx2 modifier, but cannot provide any of the scalings from _d8 to _x8.
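To illustrate what the _bx2 modifier does, here is a minimal HLSL sketch (the sampler name is purely illustrative): the scale and bias from [0,1] to [-1,1] is exactly the kind of power-of-two scaling the format conversion stage can fold into the texture fetch.

sampler2D normalMap;          // illustrative sampler name

float4 main(float2 uv : TEXCOORD0) : COLOR
{
    // "_bx2" semantics: expand the fetched value from [0,1] to [-1,1].
    // Multiplying by 2 only changes the exponent of a floating point number,
    // so the format conversion stage can supply it at practically no cost.
    float3 n = tex2D(normalMap, uv).rgb * 2.0f - 1.0f;
    return float4(n, 1.0f);
}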
Incidentally, the shader code itself is delivered via TMU memory accesses. This raises the question of whether the texture cache can buffer pixel shader code as well. There are indeed a number of open questions regarding NV40's texture cache, but they are not related to the topic of this article series.
What is no longer possible is to perform two DP3 instructions in a single clock. From the GeForce256 onward, every register combiner could do that, and even the GeForce 5800 managed it with its extra unit for fixed-point calculations. The GeForce 6800 breaks with this tradition. Register combiner operations are also no longer faster than the same calculations in floating point (since every internal shader calculation is now done with floating point units). As before, some DirectX 8 texture operations still take more than one cycle. This is no real drawback, because NV40 is fast enough for DirectX 7 and 8 anyway (already about twice as fast as NV35).
Roughly speaking, NV40 handles calculations in all formats (FP32, FP16 and FX12) at the same execution speed, since the same units (working at FP32 precision) are used for all of them. The space the hardware provides for temporary registers is still limited, however. Because such a register can hold either one FP32 value or two FP16 values, rendering with FP32 can be slower if the temporary register file fills up, forcing the pipeline to wait until some pixels are finished and their register space is freed for new pixels. Internal bandwidth can also become an issue under some circumstances, so anything for which FP16 delivers enough resolution should still be performed in this smaller format.
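In HLSL terms this simply means preferring the half type wherever FP16 resolution is sufficient; a minimal sketch (the sampler names are illustrative):

sampler2D baseMap;            // illustrative sampler names
sampler2D detailMap;

float4 main(float2 uv : TEXCOORD0) : COLOR
{
    // Color work rarely needs more than FP16: declaring these as half lets
    // the hardware pack two values into one FP32 temporary register,
    // easing the register file pressure described above.
    half4 base   = tex2D(baseMap, uv);
    half4 detail = tex2D(detailMap, uv * 8.0f);

    return base * detail * 2.0f;    // classic "modulate 2x" blend
}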
DirectX 7 multitexturing setups and DirectX 8 pixel shaders get FP16 for operations on colors and FP32 for texture operations. Fixed-point calculation units are no longer offered (FX12 values can be converted losslessly to FP16, since FP16's mantissa provides enough bits to represent any FX12 value exactly).
For FP16, additional hardware is provided that enables the pipeline to issue an extra partial precision NRM instruction every clock (the operation itself has a latency of two cycles). Since no one needs an NRM_PP every clock, this may appear to be overkill. On the other hand, bump mapping should always use normalized vectors, and FP16 delivers excellent resolution compared to the common FX8 format.
FP16 normal maps can provide much finer detail than common normal maps or 3Dc compressed normal maps. Unnormalized FX8 normal maps can also offer a finer resolution of angles, but 3Dc compression works with normalized normal maps only. In addition, 3Dc puts an extra burden on the pixel shader. Instead of 3Dc, normal maps can also be compressed with DXT5. Thanks to the high efficiency of its NRM_PP, NV40 offers high-quality bump mapping at very good performance. Having this "free" NRM_PP (_PP stands for partial precision, i.e. FP16) is therefore an advantage regardless of whether the solution is transistor efficient or not (under Shader Model 3.0, even FP24 counts as partial precision only).
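To make the DXT5 route a bit more concrete, here is a hedged HLSL sketch assuming one common channel layout (X stored in alpha, Y in green; the layout and all names are our assumption for this example, not something prescribed by DXT5 or NV40). It also shows the kind of partial precision normalize that the dedicated NRM_PP unit can take off the main shading units.

sampler2D dxt5NormalMap;      // assumed layout: X in alpha, Y in green

float4 main(float2 uv : TEXCOORD0, float3 l : TEXCOORD1) : COLOR
{
    // Fetch the two stored components and expand them from [0,1] to [-1,1].
    half2 nxy = tex2D(dxt5NormalMap, uv).ag * 2.0f - 1.0f;
    // Reconstruct Z from the unit-length constraint x^2 + y^2 + z^2 = 1.
    half  nz  = sqrt(saturate(1.0f - dot(nxy, nxy)));
    half3 n   = half3(nxy, nz);

    // Partial precision normalize of the interpolated light vector: on NV40
    // this can be handled by the dedicated NRM_PP unit.
    half3 ln  = normalize((half3)l);

    half  diff = saturate(dot(n, ln));  // simple diffuse term
    return float4(diff, diff, diff, 1.0f);
}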
The same unit dedicated to NRM_PP can alternatively perform a fog blend, which again offloads the main shader unit. Other hardware units can also take over some work the pixel shader would otherwise have to do. Expect more about these features in part three of our article series.
Comparison result
The conclusion is: With NV40, Nvidia provides both the crossbar logic for very efficient use of a single pipe and pipes equipped with enough calculation units and helpful mini-FPU stages to deliver the power future games will require. Even though an NV40 pipeline comes with less raw power than an NV35 pipeline, in real-world situations the new design can perform up to twice as much work per pipeline! R300/R420's power per pipeline and clock is easily outperformed as well, provided the optimizer works well enough.
However, we also have to state that NV40's pipeline is a compromise between flexibility and "what can be done within the transistor budget". The dense instruction "packing" is carefully optimized for specific instruction sequences that perform frequently used operations such as DIV (RCP, MUL), SQRT (RSQ, RCP), NRM (DP3, RSQ, MUL) and others. To save transistors, the crossbar logic and the number of read ports are limited; in some cases the chip needs extra cycles to arrange the data.
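The following HLSL lines sketch how those macros break down into the sequences named above (the surrounding shader is purely illustrative):

float4 main(float3 v : TEXCOORD0, float2 xy : TEXCOORD1) : COLOR
{
    float  q = xy.x / xy.y;            // DIV  compiles to RCP + MUL
    float  s = sqrt(abs(xy.x));        // SQRT compiles to RSQ + RCP
    float3 n = v * rsqrt(dot(v, v));   // NRM  compiles to DP3 + RSQ + MUL
    return float4(n * s, q);
}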
NV40 relies on a good optimizer to achieve its fast shader execution. Luckily, the chosen shader compiler profile has no impact on performance. Pixel shader code optimized for a Radeon 9700 or X800 would, executed "as is", slow NV40 down, but the optimizer transforms it into a shader better suited to NV40, performance-wise. We conclude: it is not enough to develop fast hardware; high performance pixel shader execution also calls for excellent software engineering.
What can we expect in the future?
Obviously, future shader implementations will provide more flexibility than NV40. The big question is: will the next chips offer more pipelines, or more units per pipeline? We think an additional MAD unit with co-issue capability (like Shader Unit 2, but without the SFUs), resulting in three shading units per pipe, would hit the "sweet spot" for future pixel shader demands.
On the other hand, it is far easier to take existing quad pipelines and simply add more of them to the chip design. A deeper pipeline featuring three shader units would also be harder to optimize for than NV40's current architecture. So we don't dare speculate too heavily about future designs just yet.
ASM is bad
Low level "hand optimizing" for a particular chip is a very bad idea. A poor compiler result should lead to improvements in the compiler, not to a return to assembly programming. Let's assume we have vertex shader 1.1 with no SINCOS provided. The good way is to write a high level shader. The programming language can provide a SINCOS anyway, breaking up the instruction into some operations approximating SINCOS.
Vertex shader 2.0 has such an instruction, but only as a macro. Today's vertex shader implementations, however, offer the instruction "natively" and much faster, using internal lookup tables rather than the usual approximation through a sequence of other operations.
Given only the assembly code, it is hard to detect that a particular sequence of instructions was meant to approximate SINCOS. Even though the new hardware can execute the old assembly code as well, it runs slower than a freshly compiled high-level shader would.
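To illustrate the point, here is a minimal HLSL sketch of a simple "wave" vertex shader (the constant names are made up): compiled for vs_1_1, the sin() call is expanded by the compiler into a short approximation sequence, while a vs_2_0 or newer target can map it to the SINCOS macro or to whatever faster path the hardware offers.

float4x4 worldViewProj;    // illustrative constants
float    time;
float    amplitude;

float4 main(float4 pos : POSITION) : POSITION
{
    // Written at the high level, the compiler decides how sin() is realized:
    // a polynomial approximation on vs_1_1, SINCOS or a native path later on.
    pos.y += amplitude * sin(pos.x * 4.0f + time);
    return mul(pos, worldViewProj);
}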
Nvidia will further refine the pipeline based on NV40, that's for sure. Some limitations may fall; other dependencies may arise, leading to lower performance when an old hand optimization is applied. Developers should not try to save a few clock cycles using assembly, because the portability of the code is far more important. Every pipeline has its weaknesses and bottlenecks, leading to low performance in some cases; of course, a developer should be aware of such issues and may try to work around them.
Now that we have HLSL (and Cg), assembly should not be an option for any DirectX 9 shader. Nvidia is well known for strong developer support, helping to improve performance without resorting to stone-age techniques. Speaking of DirectX 9 shaders, let's have a look at Shader Model 3.0 in the final part of this article series.