NV40 Technology explained
Part 1: Secrets of the Pixelshading Unit
August 19, 2004 / by aths / Page 1 of 7
Foreword
The past 10 years of developing dedicated 3D acceleration hardware for the consumer market brought us milestones like Voodoo Graphics, GeForce3, and Radeon 9700. Still, computer-generated realtime graphics look very "synthetic": Surfaces appear too smooth, the geometry resolution is too coarse, and liquids still do not look natural enough. But we also saw dramatic progress toward the ultimate goal: rendering photo-realistic graphics at HDTV resolutions in realtime.
The NV40 is the most advanced GPU available. We invite you to take a look at some of its aspects. To cover them properly, our article is split into three parts. Today, we will see how NV40's pipelines work – and why Nvidia designed them this way. Part two will explain how to take full advantage of the pipelines. Part three will discuss Shader Model 3.0 and other features. Let's begin with the road leading to NV40.
How past determines future
Responsible for the "big decisions" regarding Nvidia's GPUs is David Kirk. His most remarkable character trait is a pragmatic view of reality: What can be done with a limited number of transistors, in a given span of development time? He sets the priorities and ensures that today's hardware paves the way for future features. Since both developers and customers are asking for a lot, he also has to decide what cannot be delivered in a particular generation.
David Kirk joined the team when Nvidia was still a little start-up company. His first task was to stop the development of NV2 and to decide that NV3 (Riva 128) would use plain triangles as the basic shape. Compared to Bézier quads, triangles were a less flexible, but proven technique. NV3 was built to meet the requirements of OEMs, meaning the chip and its hardware configuration had to be cheap. So Riva 128 did not support 2x AGP and came with only 4 MB of RAM.
The first effort showing Nvidia trying to push the envelope was NV4, aka Riva TNT, and its successor NV5, Riva TNT2. Customers were satisfied with 32 MB RAM, 4x AGP, and 32-bit rendering, so NV5 was quite a success. The original NV4 already came with quite a powerful combiner, featuring two MULs and an ADD per pipe and clock. Last but not least, it also offered two pipes instead of just one.
NV10 (GeForce 256) came with a "programmable" register combiner capable of performing Dot3 bump mapping (Doom 3 makes heavy use of this technique to provide per-pixel lighting) or rendering nice mirror effects with masked environment mapping. Other graphics chips at that time were limited to one texture per stage, while any GeForce can access at least two textures per stage.
NV20 (GeForce3) finally came with dependent reads. This technology allows the rendering of environmental bump mapping, which was invented by Bitboys, licensed to Matrox, included in DirectX since DirectX6, and also offered by ATI's Radeon at that time. While 3dfx's Rampage was in some ways more advanced than GeForce3, Rampage never made its way to the customer. And even though ATI's Radeon and Radeon 8500 offered much more advanced pixel combiners in their day, a better card from Nvidia was never far away.
Decisions
At any given time, production engineering limits the number of transistors on a chip. As both ATI and Nvidia employ excellent developers, we should expect similar performance and feature sets. Still, any chip in existence is a trade-off: NV30 (GeForceFX 5800) has very good shader features indeed, but limited DirectX9 performance. NV30's CineFX engine was designed primarily for developers. NV35 (GeForceFX 5900) comes with much more pixel shader speed, but still could not keep up with ATI's masterpiece, the Radeon 9700. We know this R300 is a compromise, too: It offers only low filter precision, and its shaders are limited in many ways.
NV30 is mostly an NV25 (GeForce4 Ti) with additional DirectX9 units. NV35 blurs the line between "old" and "new" shader units, saving transistors but compromising DirectX8 pixel shader performance. This is balanced out with increased clock speed and other improvements. The DirectX9 pixel shader speedup in NV35 comes from an additional FPU able to execute some often-needed operations (while the shader core can still execute any instruction). But even though NV35's CineFX2 delivered significantly more power than NV30's CineFX1, it was even harder to optimize for.
The big challenge for NV40's CineFX3 engine was to significantly boost pixel shader performance while keeping the transistor count as low as possible. ATI's way of using lower precision and a more limited feature set was not an option, since consumers would not accept fewer capabilities than NV30 offered. More power by means of more units (as we have seen in NV35) would result in an unmanageable chip size. Thus, Nvidia was forced to redesign the pixel shader pipeline. This new pipe is what we investigate now.
How to fuel the pipeline
NV40 has a much leaner pipeline than its direct predecessor. To balance this out, the chip not only comes with more pipelines, but each one can also be used more efficiently. Nvidia claims that the new pipeline offers two shading units. This was already shown in the following figure, taken from an older article.
NV40's two shader units per pipe, each limited in its functions.
Whether this figure is right or wrong depends on how much detail you'd like to have considered. First, we don't count any mini-FPU (or "mini-ALU"), because every common architecture comes with such helper units, as did DirectX7- and DirectX8-class hardware. These mini-units handle modifiers like input or output scaling and/or biasing "for free".
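To make this concrete, here is a minimal C sketch of what such "free" modifiers do. The function names are ours, and the two modifiers shown are just common examples, not an exhaustive list for any particular chip.

    #include <stdio.h>

    /* Illustrative input/output modifiers of a combiner's mini-FPU.
       Names and selection are ours, for illustration only. */

    /* Input modifier: expand a [0,1] texel channel to [-1,1], as needed
       for Dot3 bump mapping ("_bx2" in DirectX8 shader terms). */
    static float expand(float x) { return 2.0f * x - 1.0f; }

    /* Output modifier: scale the combiner result by 2 ("_x2"). */
    static float scale_x2(float x) { return 2.0f * x; }

    int main(void) {
        float texel = 0.75f;        /* one channel of a normal-map texel */
        float e = expand(texel);    /* applied "for free" on input */
        printf("%f -> %f -> %f\n", texel, e, scale_x2(e));
        return 0;
    }

The point is that neither function costs the shader core an instruction slot: the mini-units apply these scale and bias operations alongside the main calculation.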
Traditionally, any combiner (beginning with the first Voodoo Graphics chipset) is separated into a vector3 color combiner and a scalar alpha combiner. This is because alpha operations, for transparency and other effects, often differ from the required color calculations. NV30 and NV35, however, have only vector4 units for floating-point calculations. In contrast, Radeon's DirectX9 pixel shader can execute a vector3 operation and another scalar operation in the same cycle. Nevertheless, we still count such a pair of units as one, because some instructions use all four data channels at the same time.
Traditional Co-Issue.
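As a minimal sketch of this 3+1 split (the types and function names here are ours, purely for illustration): a vector3 operation processes the color channels while an independent scalar operation handles alpha in the same cycle.

    #include <stdio.h>

    /* Illustrative co-issue: one "cycle" performs a vector3 color
       operation and an independent scalar alpha operation. */
    typedef struct { float r, g, b, a; } vec4;

    static vec4 co_issue(vec4 in, float light, float opacity) {
        vec4 out;
        out.r = in.r * light;   /* vector3 part: MUL on the color channels */
        out.g = in.g * light;
        out.b = in.b * light;
        out.a = in.a * opacity; /* scalar part: a different alpha operation */
        return out;
    }

    int main(void) {
        vec4 c = { 0.8f, 0.6f, 0.4f, 1.0f };
        vec4 r = co_issue(c, 0.5f, 0.25f);
        printf("%f %f %f %f\n", r.r, r.g, r.b, r.a);
        return 0;
    }

A pure vector4 unit, by contrast, would have to spend a second cycle whenever color and alpha need different operations.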
How do we actually shade a pixel? To apply textures, we need a texture unit. Furthermore, all light values for a given pixel are added up, then the final light value is multiplied with the texture color. Therefore, we need ADD and MUL. To determine the light intensity, the dot product DP3 (or "dot3") comes into play, which is built from ADD and MUL. To interpolate the normal vectors (provided per vertex only) over the entire triangle, we need linear interpolation.
As this is a very frequently used operation, any 3D chip provides dedicated hardware for it. But any pixel shader also offers linear interpolation (LRP) as an arithmetic instruction, again using ADD and MUL units. To re-normalize vectors (to avoid blocky artifacts), we need an appropriate instruction, too; this NRM in turn asks for some additional functions like square root and reciprocal.
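Put together, a minimal C sketch of this per-pixel lighting recipe might look as follows; the vec3 type and helper names are ours, and the sample values are made up for illustration.

    #include <math.h>
    #include <stdio.h>

    typedef struct { float x, y, z; } vec3;

    /* DP3: three MULs summed up by ADDs. */
    static float dp3(vec3 a, vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

    /* LRP: linear interpolation, again built from ADD and MUL. */
    static vec3 lrp(vec3 a, vec3 b, float t) {
        vec3 r = { a.x + t*(b.x - a.x),
                   a.y + t*(b.y - a.y),
                   a.z + t*(b.z - a.z) };
        return r;
    }

    /* NRM: re-normalization, requiring square root and reciprocal. */
    static vec3 nrm(vec3 v) {
        float inv = 1.0f / sqrtf(dp3(v, v));
        vec3 r = { v.x * inv, v.y * inv, v.z * inv };
        return r;
    }

    int main(void) {
        vec3 n0    = { 0.0f, 0.0f, 1.0f };  /* normal at one vertex */
        vec3 n1    = { 0.7f, 0.0f, 0.7f };  /* normal at another vertex */
        vec3 light = { 0.0f, 0.0f, 1.0f };  /* light direction, unit length */
        vec3 texel = { 0.8f, 0.6f, 0.4f };  /* texture color of the pixel */

        /* Interpolate the per-vertex normals halfway along the edge,
           then re-normalize to avoid blocky artifacts. */
        vec3 n = nrm(lrp(n0, n1, 0.5f));

        /* DP3 gives the light intensity; MUL applies it to the texel. */
        float i = fmaxf(dp3(n, light), 0.0f);
        printf("lit texel: %f %f %f\n", texel.x*i, texel.y*i, texel.z*i);
        return 0;
    }

Everything here reduces to ADD, MUL, and the extra square root and reciprocal that NRM demands, which is exactly the set of operations the pipeline's units must provide.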
With these facts in mind, we can start to break the pipe down into its units.