Inside nVidia NV40
April 14, 2004 / by aths / page 3 of 6
Why quad-based rendering?
A block of 2x2 pixels is commonly called a "quad". Quad-based rendering is done mostly for two reasons: First, to save some logic and therefore transistors. Play Quake3 at a low resolution with mipmap colors enabled and you will easily see that the texture LOD (level of detail) is calculated per quad, not per pixel. Second, certain LOD calculations (especially in conjunction with environmental bump mapping) rely on having the whole quad instead of a single pixel.
Sounds good so far, so why do we need to care about it? Because of efficiency issues! At triangle borders, the GPU still works on full quads, even if only one pixel of the quad belongs to the triangle being computed. In that case, three pipes do no useful work. Also, Early-Z occlusion (a technique that prevents invisible pixels from being calculated) can only discard full quads: a quad is rejected only if all four of its pixels are over-drawn, so some hidden pixels still slip through. This waste of fillrate is a general problem of traditional "immediate mode" architectures. NV40's Early-Z occlusion can discard up to 16 quads per clock - that is, up to 64 pixels. Game engines using a Z-first rendering pass can be greatly accelerated by NV40's improved Early-Z occlusion.
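To make the fillrate waste at triangle borders tangible, here is a toy sketch (our own simplification, not NV40's actual rasterizer): a tiny edge-function rasterizer that counts how many pixels a quad-based GPU has to process versus how many actually lie inside a long, thin triangle.

```python
def inside(px, py, tri):
    # Edge-function test: the point is inside if it lies on the same
    # side of all three edges (counter-clockwise triangle assumed).
    (x0, y0), (x1, y1), (x2, y2) = tri
    e0 = (x1 - x0) * (py - y0) - (y1 - y0) * (px - x0)
    e1 = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
    e2 = (x0 - x2) * (py - y2) - (y0 - y2) * (px - x2)
    return e0 >= 0 and e1 >= 0 and e2 >= 0

def quad_waste(tri, width, height):
    covered = processed = 0
    for qy in range(0, height, 2):            # step over 2x2 quads
        for qx in range(0, width, 2):
            hits = sum(inside(x + 0.5, y + 0.5, tri)
                       for y in (qy, qy + 1) for x in (qx, qx + 1))
            if hits:                          # the GPU works on the full quad
                processed += 4
                covered += hits
    return covered, processed

# A long, thin sliver triangle touches many quads but covers few pixels:
covered, processed = quad_waste([(0, 0), (31, 0), (0, 2)], 32, 8)
print(covered, processed)  # far more pixels processed than actually covered
```

Running this, the sliver covers 31 pixels but the quad pipeline processes 48 - roughly a third of the fillrate is spent on pixels outside the triangle.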
Pixelshader 3.0 poses an even bigger efficiency challenge to quad-based rendering if the shader uses branches: the calculations may differ for each of the four pixels making up the quad. In such cases, the GPU has to loop the quad four times through the pipe. We will go deeper into the issues with Pixelshader 3.0 later in this article.
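A simple cost model (again our own simplification, not NV40's actual scheduler) shows why divergent branches hurt on a quad-based design: a quad whose four pixels agree takes one pass through the pipe, while a divergent quad has to loop once per pixel.

```python
def quad_passes(branch_taken):
    """Pipe passes needed for one quad.

    branch_taken: four booleans, one per pixel of the 2x2 quad.
    If all four pixels take the same branch, a single pass suffices;
    if they diverge, the quad loops through the pipe four times.
    """
    return 1 if len(set(branch_taken)) == 1 else 4

print(quad_passes([True, True, True, True]))    # uniform quad: 1 pass
print(quad_passes([True, False, True, True]))   # divergent quad: 4 passes
```

So a single "odd pixel out" quadruples the cost of its quad - which is why branch coherence matters so much for Pixelshader 3.0 performance.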
NV40 has 16 ROPs. Sounds like one ROP per pipe. But nVidia implemented a "fragment crossbar", which ensures the ROPs are used where they are needed most. With the "advanced" ROPs, up to 32 "zixels" (Z-only pixels) can be rendered per clock cycle. Again, forget about "pipeline counts": by that logic, NV40 could be considered a 32-pipeline design (which would be bogus, of course). With 4x antialiasing we have up to 64 subpixels per clock, forcing the ROPs to loop.
The memory controller is the backbone of performance - something very often underestimated. We don't have detailed information at this time, but it looks like nVidia improved the "Lightspeed Memory Architecture" once again. For example, color compression is much improved: NV40 can compress framebuffer tiles even while antialiasing is inactive. It also looks like the cache sizes were beefed up. Such efforts are of course important to maximize performance while maintaining an affordable memory interface. By the way, NV40 is able to address up to 2 GiB of RAM.
Inside nVidia's main headquarters: Keith Setho, the keeper of the performance testing labs (on the left).
How serious are NV40's bandwidth-limitations?
At first glance, NV40 looks heavily bandwidth-bound. If you look closer, it really depends on the situation. Old games suffer from low bandwidth relative to the fillrate, but old games run fast enough anyway. So why should we care about bandwidth limitations there? You can turn it all on - 4x antialiasing, 16x anisotropic filtering, the highest resolution your monitor can provide - and NV40 still delivers extremely high framerates in old titles.
Long pixelshaders that make heavy use of arithmetic calculations are shader-bound, though. Hence, considering NV40 was made for future games, we think there is no need to worry about NV40's bandwidth limitations today.
Floating point texture filtering
NV40 introduces floating point texture filtering. Previous GeForces use 10-bit fixpoint internal filter precision and an 8-bit fixpoint output value (including the sign bit, the output value is 9 bits long). NV40 delivers true 16-bit floating point internal filtering with FP16 output! The output can also be converted to FX12 or FP32.
At first glance, FP16 looks like a great waste: 50% of the representable values are darker than black, and about 25% are brighter than white. So only the remaining 25% or so of all representable FP16 values seem "usable". Let's treat this as a waste (while it is, in fact, not) because we apparently "lose" two bits. That still leaves 14 bits for "real use" - way more than 8.
But FP formats work differently from fixpoint. FP16 means 11 bits of precision, but over a long range. Compared to the previous 8-bit output, we are at least 3 bits better, and therefore have 2^3 = 8 times the resolution.
Let's assume we are using 8-bit fixpoint, and only the last bit is set. This corresponds to a value of about 0.004. At such low values the resolution is very poor - all we can do is turn this one bit off or on, so we can represent either 0.0 or 0.004, but nothing in between. With FP16, we still have the full 11-bit precision even at such low values! You can go far lower still and keep the full precision. Below about 0.00006, though, we begin to lose precision. (This is true for FP16 with "denorm support". NV40 fully supports denorms for FP16. Without denorm support, FP16 values smaller than 0.00006 are instantly cut to zero. Denorm support is not required by the API specifications and costs many transistors. Hence we are quite happy nVidia didn't simply meet the specs, but pushed the envelope on this point.)
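You can verify this behaviour yourself: Python's struct module can round-trip values through the IEEE 754 half-precision format, which is the same FP16 layout discussed here (1 sign bit, 5 exponent bits, 10 mantissa bits).

```python
import struct

def to_fp16_and_back(x):
    """Round-trip a Python float through IEEE half precision (FP16)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# 8-bit fixpoint near zero: the smallest step is 1/255 ~ 0.0039, so nothing
# between 0.0 and 0.0039 exists. FP16 keeps its full 11-bit mantissa
# precision down to the smallest normal number, 2^-14 ~ 0.000061:
print(to_fp16_and_back(0.0001))      # survives, rounded to 11-bit precision
print(2.0 ** -14)                    # ~6.1e-05, smallest normal FP16 value

# Below that, denormals lose mantissa bits gradually instead of snapping
# to zero (NV40 supports these; hardware without denorm support would
# flush them straight to 0.0):
print(to_fp16_and_back(2.0 ** -24))  # smallest positive FP16 denormal
print(to_fp16_and_back(2.0 ** -25))  # 0.0 -- below even the denormal range
```

Note that 0.0001 - a value an 8-bit fixpoint format cannot distinguish from zero at all - comes back from FP16 with only a tiny rounding error.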
The beauty of floating point is that the resolution sits in the range where it's needed. Broadly speaking, all our human senses are "logarithmic", which lets us experience a wide range of energy levels. At low volumes we easily hear faint random noise, while at higher levels most of that noise is "masked". The same effect applies to lighting: in a dark environment we can see very slight differences, while in a bright environment we can only distinguish larger differences in brightness.
FP16 delivers resolution not only at low values. Let's talk about overbright lighting for a minute. Quake3 uses 1 bit for this technique, which allows a range of up to 2.0. With only 8 bits of precision, and one of them used to represent overbright values, 7 bits of precision remain. FP, of course, delivers its full bit-width of precision over any representable range.
At values above 1024, FP16 no longer allows fractional positions (.x), because all the precision is used up by the digits before the point; at values above 2048, odd numbers are no longer representable. Considering 2^11 = 2048, we still have our 11 bits of precision. With FP16 we can go up to about 65000 (65504, to be exact) and still maintain 11-bit precision! (While we cannot represent every integer in this range, we keep 11 bits of resolution nonetheless.) If you would like to read more about floating point, we can offer a full-length article (at this time, in German only). All in all, both "brighter than white" and "darker than black" are not a waste, as mentioned before.
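The shrinking granularity at the top of the range is easy to demonstrate with the same half-precision round-trip trick as before:

```python
import struct

def fp16(x):
    """Round x to the nearest representable IEEE FP16 value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Between 1024 and 2048 the FP16 step size is exactly 1.0 - no fractions:
print(fp16(1025.5))   # the fractional part is rounded away

# Between 2048 and 4096 the step grows to 2.0 - odd integers disappear:
print(fp16(2049.0))   # snaps to a neighbouring even integer

# The largest finite FP16 value:
print(fp16(65504.0))  # 65504.0
```

In every bracket the step size is the value divided by roughly 2^11 - exactly the constant 11-bit relative precision described above.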
All the advantages of FP16 are now available in the texture mapping units (TMUs). We consider this a real leap forward: it makes FP16 textures really useful, because they deliver more precision and greater dynamic range, while NV40's TMUs can perform bilinear, trilinear, and anisotropic filtering on FP16 data! Texture lookups used to implement mathematical functions also benefit greatly from FP16's higher resolution. Both material shaders and especially textures with lighting information are better off using FP16 input than FX8 values. Image-based lighting content really profits from the higher dynamic range of FP16.
NV40 has a two-level texture cache system. The L2 cache is chip-wide and can be accessed by every quad-pipeline; we estimate its size to be 8 KiB. Each quad-pipeline also has its own L1 cache. While the L2 cache can store compressed textures, decompression happens when the data is transferred into the L1 cache. Of course such big caches raise the transistor count, but they help reduce the impact of onboard memory bandwidth limitations.
With the acquisition of 3dfx's core assets, nVidia also added some valuable patents to its portfolio.