NV40 Technology explained
Part 3: Shader Model 3 and the future
May 12, 2005 / by aths
What is NV40 good for?
Since we have talked so much about the pipeline, the question arises: what can we actually do with it? Is NV40's architecture really ready for efficient use of Shader Model 3? And what is Shader Model 3 in the first place? To make sense of it, we need to go back four years, to when the first pixel shader hardware came out.
The pixel shader in general solves a problem on the mind of every developer: compatibility. Earlier graphics chips had fixed-function combiners able to render certain effects. Since any effect consists of certain operations, and developers demanded more flexibility, newer graphics chips allowed custom configurations of the combiners; this can already be done with a DirectX7 multi-texturing setup. The problem was that any such combiner function is optional, so there was no guarantee that a given effect could be rendered on every piece of DirectX7 hardware. This was solved with DirectX8 and the first generation of pixel shaders.
Pixel shader 1.1 ("InfiniteFX" on GeForce3) is actually nothing but a new way to set up DirectX7 multi-texturing, plus the requirement to support certain combiner operations. But any DirectX8-compliant hardware can render any valid pixel shader 1.1 code, which gave developers a reliable base for creating their effects. Pixel shader 1.3 is an overhauled 1.1, with some new instructions and a wider read port limit.
Such pixel shaders allow freely combining sampled textures, but not sampling at arbitrary coordinates. The real next shader generation, pixel shader 1.4 (Smartshader), has fewer and simpler texture operations, because the texture coordinates can now be modified in the arithmetic part of the shader; textures can thus also be sampled after some arithmetic instructions have been executed, which allows new effects. Considerably beefed-up versions were marketed as Smartshader 2.0 and 2.1, where the former is pixel shader 2.0 and the latter pixel shader 2.x (the 2_B compiler profile).
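To illustrate the difference, here is a hedged sketch of a dependent texture read – the kind of operation pixel shader 1.4 introduced – written in HLSL syntax for readability rather than in 1.4 assembly; the sampler and constant names are purely illustrative:

    // A value sampled from one texture is modified arithmetically and then
    // used as coordinates for a second, dependent texture sample.
    sampler2D BaseMap : register(s0);
    sampler2D EnvMap  : register(s1);
    float2    DistortionScale;     // illustrative constant

    float4 main(float2 uv : TEXCOORD0) : COLOR
    {
        // arithmetic on a sampled value produces new texture coordinates ...
        float2 offset = (tex2D(BaseMap, uv).rg - 0.5) * DistortionScale;
        // ... which feed a second texture fetch
        return tex2D(EnvMap, uv + offset);
    }

In pixel shader 1.1 to 1.3, texture coordinates cannot be computed freely in the arithmetic part of the shader; only a few fixed texture-addressing instructions exist.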
The real next generation was CineFX. Compared to ATI's 2.0 implementation it allows many more instructions and has fewer limitations (for both pixel and vertex shaders). Over Smartshader 2, CineFX allows additional effects, better image quality, and certain effects with fewer instructions. Sadly for Nvidia, the GeForceFX's real-world pixel shader performance, measured in instructions per clock, is much lower than the Radeon's with Smartshader 2. R420 doubled the performance per clock. NV40 can execute even more (or more complex) instructions per clock than R420, with the full CineFX feature set on top, while R420 is still limited to its only barely improved Smartshader 2. This, not SM3, is NV40's main advantage today.
In addition, both the pixel and the vertex shader are now able to perform dynamic branches (the original CineFX can do this only in the vertex shader), and the new vertex shader can fetch textures (without texture filtering, though). A texture access causes some latency which, when used sparingly, can often be "hidden" with other independent instructions placed right after the fetch. Reasonable dynamic displacement mapping, however, asks for texture filtering right in the vertex shader. Feature-wise, Nvidia's SM3 implementation is an improved CineFX engine. While not revolutionary, it is nice to have – the full CineFX feature set is included.
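As a hedged sketch of what vertex texture fetch enables, here is a minimal vs_3_0 displacement-mapping vertex shader in HLSL; the names (HeightMap, WorldViewProj, DisplacementScale) and the exact displacement formula are illustrative assumptions, not taken from the article:

    sampler2D HeightMap : register(s0);   // vertex texture; NV40 samples it unfiltered
    float4x4  WorldViewProj;              // illustrative transform constant
    float     DisplacementScale;          // illustrative scale constant

    struct VS_IN  { float4 pos : POSITION; float3 nrm : NORMAL; float2 uv : TEXCOORD0; };
    struct VS_OUT { float4 pos : POSITION; float2 uv : TEXCOORD0; };

    VS_OUT main(VS_IN input)
    {
        VS_OUT output;
        // tex2Dlod is used because the vertex shader has no automatic LOD;
        // the latency of the fetch is best hidden by independent ALU work after it.
        float height = tex2Dlod(HeightMap, float4(input.uv, 0, 0)).r;
        float4 displaced = input.pos + float4(input.nrm * height * DisplacementScale, 0);
        output.pos = mul(displaced, WorldViewProj);
        output.uv  = input.uv;
        return output;
    }

Because NV40 only point-samples vertex textures, the height map either has to be smooth enough to tolerate unfiltered lookups or the filtering has to be emulated with several fetches – which is exactly why filtered vertex textures would make dynamic displacement mapping much more practical.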
But the NV40 pixel shader architecture still has one weakness regarding performance: the space for temporary registers. According to "GPU Gems 2", the pipeline can hold only up to four FP32 registers per pixel before it is full – or eight FP16 registers, or two FP32 and four FP16; you get the idea. To offer more registers, the pipeline cannot be loaded with as many pixels, which cuts performance. Nvidia's shader optimizer therefore has to work very well to reach the best possible performance.
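A hedged illustration of how a developer can relieve this pressure: where the precision suffices, declaring temporaries as half (FP16) instead of float (FP32) halves their register footprint. The shader below is purely illustrative:

    sampler2D DiffuseMap : register(s0);

    float4 main(float2 uv : TEXCOORD0, float3 lightDir : TEXCOORD1) : COLOR
    {
        // FP16 temporaries take half the register space of FP32 ones on NV40
        half3 albedo = (half3)tex2D(DiffuseMap, uv).rgb;
        half3 n      = half3(0, 0, 1);                    // placeholder normal
        half  ndotl  = saturate(dot(n, (half3)normalize(lightDir)));
        return float4(albedo * ndotl, 1.0);
    }

Whether the values actually stay in FP16 registers is up to Nvidia's shader optimizer, but the partial-precision hints generated from half give it the freedom to arrange that.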
NV40's implementation of SM3
The DirectX API limits the SM 3.0 instruction count, but NV40's hardware doesn't set any limit on program length – of course, memory is always a finite resource. The more than 32,000 instructions possible with DirectX 9.0c should be enough anyway. ATI's R420 allows 512 pixel shader instructions with the version 2.x profile 2_B, but due to the lack of arbitrary swizzles, a 2_B shader often needs somewhat more instructions than a 3_0 (or 2_A) shader.
Will shaders always be written one instruction at a time, in assembly? No – the goal is to use a general, C-like high-level language once the time is ripe (Cg and HLSL are the first steps here).
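As a small, hedged example of the difference in abstraction level, the following HLSL function expresses a simple diffuse term that would otherwise be written instruction by instruction in shader assembly; the file and entry-point names in the compile line are illustrative:

    // compile with e.g.:  fxc /T ps_3_0 /E main lighting.hlsl
    float4 main(float3 n : TEXCOORD0, float3 l : TEXCOORD1, float4 color : COLOR0) : COLOR
    {
        return color * saturate(dot(normalize(n), normalize(l)));
    }

The compiler takes care of instruction selection, register allocation and the target profile, which is precisely the step away from hand-written combiner setups and assembly towards "general C".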
This sounds good, but we have to keep the restrictions of a GPU in mind: for example, GPUs still don't fulfil the full IEEE 754 requirements for the floating-point format, while any common CPU offers full IEEE compliance. Since SM3 now requires the ability to branch dynamically, this shader model could be labelled the first "truly programmable" one. We assume that Nvidia had already targeted NV30 at SM3, but that Microsoft insisted, besides FP32 (which NV30 already offered), on dynamic branching in the pixel shader as well. Our theory is that the original plans for NV40 did not include dynamic flow control, and that this ability was added after the actual design had already begun.
With NV40, some instructions are better not used inside branches, because the LOD for textures, and the interpolators in general, are still calculated per quad, a block of four pixels. With branching, a given texture instruction may end up being applied to only a single pixel of a quad (another hint that this feature might not have been intended in the first stage of development).
The pixel shader branching units merely change the instruction pointer depending on a flag stored in a branching register. The actual comparison that decides which branch has to be taken is a task of the shader unit. In the end, every branching control instruction (if, else, endif) takes at least two cycles each, sometimes more.
The smallest possible branch control structure (if, endif) will consume at least 4 clocks (in most cases 5 clocks). If such a branch can prevent 10 - 12 (or more) instructions from being executed, the performance increases. Maybe. Maybe not.
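The following hedged HLSL sketch shows the trade-off; the structure of the shader is an illustrative assumption, and whether the compiler emits a real dynamic branch for the if depends on the profile and the compiler:

    float4 main(float2 uv : TEXCOORD0, float3 n : TEXCOORD1, float3 l : TEXCOORD2) : COLOR
    {
        float ndotl = dot(normalize(n), normalize(l));
        float4 result = float4(0, 0, 0, 1);

        // The if/endif pair alone costs several clocks on NV40, so it only pays off
        // if the skipped block is clearly longer than that.
        if (ndotl > 0)
        {
            // ... a dozen or more lighting instructions would go here ...
            result.rgb = ndotl.xxx;
        }
        return result;
    }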
As already mentioned, what we normally call a pipeline is actually only part of the real pipeline. The GeForce 6800 GT/Ultra in fact have four pipelines, each able to render four pixels. Now it is possible that some pixels take one branch while other pixels in the same quad take the other; the pipeline then has to calculate both branches. Moreover, the decision whether to execute both branches or only one cannot even be made for each quad individually.
For some quads, both branches are thus executed even though only one is needed. This may hurt performance even more than the branching issues common to all SIMD architectures. We know from several sources that the NV40 demonstration "Nalu" got only a very small speedup from branching, even though all of its pixel shading is done by a single shader that uses dynamic flow control to render either the skin, the clothes, or translucent clothes plus the skin.
As one can easily see, branches in the pixel shader are expensive on the GeForce 6 series; only if they prevent a good many instructions from being executed do you get a speed boost. NV40 also features early-out, meaning the calculation can be ended at any point in the shader code (using texkill). This is in contrast to earlier Nvidia and current ATI hardware, where the shader is always executed in its entirety regardless, so early-out should also lead to more performance. Yet we have not seen an example which shows that NV40 really benefits from early-out – maybe a problem of too-large quad batches again.
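A hedged sketch of the early-out idea: in HLSL, clip() compiles to the texkill instruction, so placing it before the expensive part of a shader marks the point at which NV40 may stop working on a rejected pixel. The sampler name and threshold are illustrative:

    sampler2D AlphaMask : register(s0);

    float4 main(float2 uv : TEXCOORD0) : COLOR
    {
        float mask = tex2D(AlphaMask, uv).a;
        clip(mask - 0.5);               // texkill: reject the pixel below the threshold
        // ... expensive shading that rejected pixels never need ...
        return float4(mask.xxx, 1.0);
    }

As noted above, though, measurable gains from this have yet to be demonstrated on NV40, possibly because the quad batches are too large.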
Since pixel shader 3.0 offers more special floating-point registers (interpolators), a given effect can sometimes be rendered in a single pass where a 2.x shader would need multiple passes, which leads to higher performance and better image quality.
In the end, a possible performance gain is reason enough to fit the smaller cards with SM3 as well. While the most exciting 3.0 features are still quite slow on NV40's architecture, they can be used optionally. Interestingly, NV43 (GeForce 6600), with its narrower architecture, can execute dynamic branching more efficiently, leading to greater performance gains. So dynamic branching does not look like an unneeded feature even on NV40: it allows easier programming, shaders in "the future's style", and very complex shaders without worries about the instruction limit.
Nvidia didn't forget to fit the GeForce 6800 with all the other nice stuff needed for extremely high performance, and everything is realized in a chip of manageable die size. Also, the CineFX engine now runs at "full speed", where "full speed" means faster clock-for-clock execution than R420.
Shader Model 3 means much more than just pixel and vertex shader version 3.0. Have a look at Microsoft's list of what any SM3-compatible GPU offers.
But the full SM3 implementation is only part of a bunch of new features in NV40. We will now have a look at some others.