NV40 Technology explained
Part 3: Shader Model 3 and the future
May 12, 2005 / by aths / Page 7 of 7
Importance of Textures
With vertex and pixel shaders, it is possible to use "material shaders" to create realistic-looking surface materials. But even with the new math power NV40 delivers, this is quite expensive compared to simply using a texture. Textures need much, much more storage space than the input data for material shaders, but memory is in most cases not the problem. As an interesting aside: NV40 supports up to 2048 MB, which will of course never be fully utilized.
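To make the trade-off tangible, here is a minimal C++ sketch (plain CPU code, not real shader code; the wood-like pattern and the helper names are invented for illustration): the "material shader" path pays arithmetic for every pixel, while the texture path replaces that math with a single pre-computed fetch, at the cost of storage.

    #include <cmath>
    #include <cstdio>

    struct Vec3 { float x, y, z; };

    // Procedural "material shader": every pixel pays for the arithmetic.
    // (The wood-like pattern is only an invented example.)
    Vec3 proceduralWood(float u, float v)
    {
        float rings = std::sin((u * u + v * v) * 40.0f);   // concentric rings
        float grain = 0.5f + 0.5f * rings;
        return Vec3{0.55f * grain + 0.25f, 0.35f * grain + 0.15f, 0.10f};
    }

    // Texture path: the same pattern baked into a bitmap beforehand.
    // One fetch replaces all of the arithmetic above, at the cost of storage.
    Vec3 sampleTexture(const Vec3* texels, int width, int height, float u, float v)
    {
        int x = static_cast<int>(u * (width - 1));
        int y = static_cast<int>(v * (height - 1));
        return texels[y * width + x];                       // nearest fetch, for brevity
    }

    int main()
    {
        Vec3 baked[2 * 2] = { proceduralWood(0.0f, 0.0f), proceduralWood(1.0f, 0.0f),
                              proceduralWood(0.0f, 1.0f), proceduralWood(1.0f, 1.0f) };
        Vec3 computed = proceduralWood(0.3f, 0.7f);             // math per pixel
        Vec3 fetched  = sampleTexture(baked, 2, 2, 0.3f, 0.7f); // lookup instead
        std::printf("computed %.2f, fetched %.2f\n", computed.x, fetched.x);
    }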
For the foreseeable future, games will rely on textures plus special effects like bump and offset mapping. Because gamers demand higher resolutions and high degrees of anisotropy, texturing power will become more important. Even though arithmetic power becomes even more important for rendering these special effects at reasonable performance, conventional texturing will stay with us for a long time.
That is why we still have traditional TMUs as dedicated units. How filtering is done in hardware and what changed in NV40 is meant to be the topic of another article. One thing is clear: all the high-precision calculation units are in vain without superb (anisotropic) filtering. Therefore we are not happy with the anisotropic filtering NV40 offers today. This chip clearly deserves better filtering algorithms; we consider this the main weakness of NV40.
Even though NV40's texture filtering in high-quality mode is still a bit better than the Radeon's, S3's DeltaChrome proves how 16x AF can look: extremely good in any direction. In some directions, both the Radeon and the GeForce 6800 produce quite blurry textures even with 16x AF enabled. For material shaders, this means NV40 delivers imprecise input in some cases. With FP32 precision we obviously also need the best possible filtering of the input data.
NV40 does offer nearly perfect isotropic texture filtering of textbook quality, though. While this is good (actually even better than Microsoft's reference rasterizer), it is not enough for such a chip to deliver uncompromised quality only with isotropic filtering.
Specialization vs. general purpose and the future
SM3 is intended to take load off both GPU and CPU. NV40 has additional specialized hardware to push this even further. Filtering textures stored in a floating-point format was previously a task for the pixel shader; NV40 has native filter support for FP16, and FP16 is often enough for both applications and games in the foreseeable future. To filter FP16 data, the TMU needs two cycles for a basic bilinear sample (instead of one). Compared to filtering "by hand" in the pixel shader, this is incredibly cheap (meaning very, very fast).
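To illustrate why native FP16 filtering is such a bargain, here is roughly what filtering "by hand" amounts to, sketched in plain C++ with simplified, invented helpers rather than real shader code: four point fetches plus three linear interpolations per bilinear sample, work the shader has to execute itself whenever the TMU cannot do it.

    #include <cmath>
    #include <cstdio>

    struct Texel { float r, g, b, a; };   // FP16 data promoted to float for the math

    // Point fetch with clamp addressing (a simplified, invented helper).
    Texel fetchTexel(const Texel* tex, int w, int h, int x, int y)
    {
        x = x < 0 ? 0 : (x >= w ? w - 1 : x);
        y = y < 0 ? 0 : (y >= h ? h - 1 : y);
        return tex[y * w + x];
    }

    Texel lerp(const Texel& a, const Texel& b, float t)
    {
        return { a.r + (b.r - a.r) * t, a.g + (b.g - a.g) * t,
                 a.b + (b.b - a.b) * t, a.a + (b.a - a.a) * t };
    }

    // Manual bilinear filter: four fetches and three lerps per sample,
    // the work a shader spends when the TMU cannot filter FP16 itself.
    Texel bilinear(const Texel* tex, int w, int h, float u, float v)
    {
        float x = u * w - 0.5f, y = v * h - 0.5f;
        int x0 = static_cast<int>(std::floor(x));
        int y0 = static_cast<int>(std::floor(y));
        float fx = x - x0, fy = y - y0;

        Texel t00 = fetchTexel(tex, w, h, x0,     y0);
        Texel t10 = fetchTexel(tex, w, h, x0 + 1, y0);
        Texel t01 = fetchTexel(tex, w, h, x0,     y0 + 1);
        Texel t11 = fetchTexel(tex, w, h, x0 + 1, y0 + 1);

        return lerp(lerp(t00, t10, fx), lerp(t01, t11, fx), fy);
    }

    int main()
    {
        Texel tex[2 * 2] = { {0, 0, 0, 1}, {1, 0, 0, 1}, {0, 1, 0, 1}, {1, 1, 0, 1} };
        Texel s = bilinear(tex, 2, 2, 0.5f, 0.5f);   // blends all four texels
        std::printf("filtered: %.2f %.2f %.2f\n", s.r, s.g, s.b);
    }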
NV40 was intended to offer dedicated tone mapping hardware, too, but now comes without a functional tone mapping unit. That also means a 64-bit (4-channel FP16) framebuffer cannot be provided. But NV40 still offers dedicated FP16 alpha blending for FP16 render targets, which remain possible. In the end, NV40 (and NV43) provide two of the three types of "dedicated HDR units".
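Which operator the non-functional tone mapping unit would have implemented is not specified here; as a stand-in, the following C++ sketch shows the kind of per-pixel mapping (the well-known Reinhard curve is used purely as an example) that has to run in the pixel shader instead when no dedicated unit exists.

    #include <cmath>
    #include <cstdio>

    // Tone mapping compresses an HDR (e.g. FP16) value into the displayable
    // 0..1 range. Which operator NV40's unit would have used is not documented
    // here; the simple Reinhard curve below is only a stand-in.
    float toneMapReinhard(float hdr)
    {
        return hdr / (1.0f + hdr);                    // maps [0, inf) into [0, 1)
    }

    // Without fixed-function support, this mapping (plus gamma) must be applied
    // to every pixel in the shader before the 8-bit framebuffer write.
    float toDisplay(float hdr, float gamma = 2.2f)
    {
        return std::pow(toneMapReinhard(hdr), 1.0f / gamma);
    }

    int main()
    {
        std::printf("HDR 4.0 -> display %.3f\n", toDisplay(4.0f));
    }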
On the plus side, NV40 supports "denorms" for FP16. This allows the chip to handle very small values, whereas FP16 without denorm support flushes them to zero.
The OpenEXR definition of the "half" format is bit-compatible with this FP16 format and also supports denorms. Since the mathematical nature of zero is very different from any non-zero number, denorm support is desirable. In fact, any GPU offering FP16 for real use should support denorms for FP16, too. Without FP16 denorms, the developer may be forced to use full precision, which is still slower on NV40 due to register space and bandwidth limitations.
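As a sketch of what denorm support actually buys, the following C++ snippet decodes an FP16 "half" bit pattern: without denorms, everything below the smallest normal value of 2^-14 (about 6.1e-5) collapses to zero, while with denorms the format degrades gracefully down to about 6.0e-8.

    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    // Decode an FP16 "half" bit pattern (1 sign, 5 exponent, 10 mantissa bits)
    // into a float, including the denormal case.
    float halfToFloat(uint16_t h)
    {
        int sign = (h >> 15) & 0x1;
        int exp  = (h >> 10) & 0x1F;
        int man  =  h        & 0x3FF;
        float s  = sign ? -1.0f : 1.0f;

        if (exp == 0)                  // zero or denormal: man * 2^-24
            return s * std::ldexp(static_cast<float>(man), -24);
        if (exp == 31)                 // infinity or NaN
            return man ? NAN : s * INFINITY;
        return s * std::ldexp(1.0f + man / 1024.0f, exp - 15);   // normal value
    }

    int main()
    {
        // A chip that flushes denormals to zero loses this entire bottom range.
        std::printf("smallest normal FP16 value  : %g\n", halfToFloat(0x0400)); // 2^-14
        std::printf("smallest denormal FP16 value: %g\n", halfToFloat(0x0001)); // 2^-24
    }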
The DirectX API does not require this, though, so Nvidia might drop FP16 denorm support in future hardware. As long as FP16 is intended to be used in earnest, denorm support should be there, because it widens the field of application. As long as FP16 can be faster than FP32, it is better to spend transistors on denorms and let the developer use the smaller format. Hardware in the distant future should be powerful enough to ignore any _PP flag and handle everything with at least FP32 precision.
NV30 got a little speedup from FP16, NV35 gets significantly faster, and NV40 can gain an even greater boost from FP16 usage. On the one hand, since overall performance increased with every generation, NV35 is a better FP32 chip than NV30, and NV40 easily outperforms NV35. On the other hand, every new generation is even faster with partial precision, which is an advantage too, since _PP is applicable often enough, while the hunger for more rendering performance can hardly be satisfied with raw performance upgrades alone. Special tools can force Far Cry and Half-Life 2 to use _PP whenever possible, and the graphics still look good.
Traditionally, a GPU got its performance from fixed-function but massively parallel units. The future is flexible shader hardware. NV40's SM3 implementation provides programmability on a new level (as the spec requires), and additional fixed-function units help boost performance. The NV40 design also heeded the needs of the Doom 3 engine, which did increase the transistor count, but delivers outstanding performance there.
Nevertheless, NV40 should still be considered a traditional architecture, but with many improvements and new features, most of them intended simply to speed up existing rendering techniques. Therefore, more effects are possible in realtime applications now. Since any CPU can render any effect imaginable (as long as it can be expressed as a formula), the question is not "does this GPU provide new effects?" but "which effects are now possible in realtime?"
NV40's SM3 implementation was built around an improved CineFX pipeline. In contrast to CineFX, ATI's Smartshader is too far away to be "beefed up" all the way to SM3. We would rate R420's shader profile 2_B as "Pixel Shader 2.1", while GeForce FX's profile 2_A can be considered "2.7" (these numbers are not official in any way). Of course, since ATI has to design a new architecture anyway, their implementation might turn out to be much better than Nvidia's. In any case, NV40 and NV43, both with SM3 support, are available now.
NV40 offers 4x sparse multisampling. While this is good, 8x would be better. Of course, 8x demands a lot of memory and bandwidth, but at least older games would come to shine while performance remains good enough. Since Nvidia does not seem convinced of the advantages of so-called gamma-correct downfiltering, we don't expect dedicated hardware for it in the foreseeable future. We do expect 8x AA with NV50. While a RAMDAC gamma ramp of 10 bits is state of the art, a future chip with an optional 64-bit framebuffer should offer more. 10 bits provide about 1,000 levels of brightness. The human eye can differentiate some millions of levels of brightness, so we think the next graphics generation should offer at least a 12-bit gamma ramp (Windows supports up to 16 bits).
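For reference, here is a tiny C++ sketch of the level counts behind that argument, assuming a plain count of 2^n addressable brightness levels per bit depth:

    #include <cstdio>

    // Number of distinct brightness levels a gamma ramp with the given
    // bit depth can address (simply 2^bits).
    unsigned levels(unsigned bits) { return 1u << bits; }

    int main()
    {
        std::printf("10-bit ramp: %u levels\n", levels(10));   // 1024
        std::printf("12-bit ramp: %u levels\n", levels(12));   // 4096
        std::printf("16-bit ramp: %u levels\n", levels(16));   // 65536
    }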
Conclusion
What did you buy your graphics card for? Presumably still to display Windows applications and occasionally accelerate 3D games. Since GPU performance increases faster than CPU performance, and GPUs gain more flexibility with every new generation, the field of application is not limited to these common uses. It is still debated whether a game engine will ever run its physics on the GPU, since such a stream processor is not designed for tasks of that kind.
A GPU is an integrated circuit able to perform special streaming-data calculations very fast. Today this is sometimes used to accelerate video decoding and encoding, but NV40's Video Processor (VP) seems not to be fully functional, and it also requires extra software you have to pay for. But take the pixel shader: NV40 Ultra is clocked at only 400 MHz, yet can do up to 128 floating-point multiplications per clock and access much more bandwidth than any available personal computer. Add the power of the vertex shader, and you can accelerate virtually anything, provided you are able to program your GPU using SM3.
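Putting those figures together, and taking the stated peak of 128 floating-point multiplications per clock at 400 MHz at face value, the arithmetic works out as in this small C++ sketch:

    #include <cstdio>

    int main()
    {
        const double clockHz      = 400e6;   // NV40 Ultra clock, as stated above
        const double mulsPerClock = 128.0;   // stated peak FP multiplications per clock

        // Peak multiply throughput: 400e6 * 128 = 5.12e10, i.e. 51.2 billion per second.
        std::printf("peak FP multiplies: %.1f billion per second\n",
                    clockHz * mulsPerClock / 1e9);
    }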
So NV40 is a chip right at the point of transition: on the one hand it accelerates existing techniques, on the other it opens the door to new fields of application.
Acknowledgements
This article would not have been possible without the outstanding help of a LOT of people. In particular I want to thank Demirug (author of DXTweaker) and Damien Triolet (editor at Hardware.fr) for providing valuable knowledge and ideas. I also want to give kudos to zeckensack (maker of the world's best Glide Wrapper) and Xmas (author of the famous OpenGL Filter Testing Application) for their criticism. I thank Nvidia and especially David Kirk for providing invaluable information. My thanks also go to Leonidas, editor-in-chief and webmaster of 3dcenter. Last but not least, I want to mention nggalai's (editor-in-chief of some very good websites like Marketingblubber) and again zeckensack's strong support regarding the English language.
This article represents the author's knowledge at the time it was released. If you encounter any errors or want to comment, feel free to post in our forum or mail the author.