CineFX (NV30) Inside

August 31, 2003 / by Demirug / page 5 of 7


   Performance Considerations  (cont.)

After looking at the raw performance, let's consider another aspect of performance. Theoretical figures are great for printing on the retail box, but in the end it's real-world performance that counts. That is of course hard to extract from an architectural design paper, but chip designers face the same task.

Let us look at how prone the NV30 design is to wasting performance. Its universal FPU can execute any operation, so the ratio of texture ops to arithmetic ops has no influence on performance waste. However, to perform two texture ops at once, they have to occur as a pair. If they don't, half of the possible texture throughput is wasted.

Comparing NV30 to R300 again, we see that the ATi chip is much more dependent on the ratio of texture ops to arithmetic ops, because it uses separate units for each. Any deviation from the ideal 1:1 ratio wastes performance. Whether texture instructions come in pairs, however, is irrelevant to R300.

The following diagram shows how the texture to arithmetic ratio affects the sum total of instructions per second. The three lines show NV30 with paired texture instructions and without, and R300:


R300 reaches its maximum of 5200 MI/s at a ratio of 1:1, dropping at both extremes (only texture/arithmetic ops, respectively) to 2600 MI/s. Without paired texops, NV30 yields a constant 2000 MI/s, far below R300 even in the best case.

Paired texture instructions show the chip in a better light: it reaches a peak of 4000 MI/s for a shader consisting only of texture ops. But this value drops steadily as the share of arithmetic ops grows, down to a mere 2000 MI/s for a purely procedural shader. At a ratio of four or more texture ops per arithmetic op, NV30 can exploit that small weakness in R300's design. But that is only a minor win, considering that such shaders are very rare.
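As a sanity check, the three curves above can be reproduced with a minimal issue-clock model. The base rates are simply the article's single-issue throughputs (2000 MI/s for NV30, 2600 MI/s for R300); the model is a sketch that only counts issue clocks and ignores everything else (latencies, register pressure, bandwidth):

```python
import math

def r300_mips(tex, arith, base=2600.0):
    # R300: separate texture and arithmetic units issue in parallel,
    # so the busier unit determines the number of clocks needed
    clocks = max(tex, arith)
    return base * (tex + arith) / clocks

def nv30_mips(tex, arith, paired=True, base=2000.0):
    # NV30: one universal FPU; two texture ops can share a clock,
    # but only if they occur as a pair
    tex_clocks = math.ceil(tex / 2) if paired else tex
    clocks = arith + tex_clocks
    return base * (tex + arith) / clocks
```

This reproduces the diagram's endpoints: `r300_mips(1, 1)` gives 5200 MI/s, `nv30_mips(2, 0)` gives 4000 MI/s, and at a 4:1 ratio `nv30_mips(4, 1)` (about 3333 MI/s) edges past `r300_mips(4, 1)` (3250 MI/s), matching the crossover described above.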

Another weakness of R300 shows up with dependent reads, i.e. texture sampling operations whose texture coordinates depend on the result of a previous texture read. Combined environmental bump mapping is one effect that requires them. Here it can happen that, despite an ideal ratio of operations, R300 only executes 2600 MI/s. NV30 doesn't have this problem and can close the gap in such situations.

In conclusion, NV30 is less prone to wasting performance, but R300 has enough raw performance that wasting a bit doesn't hurt much. What NV30 clearly lacks is the power of its combiners, which it cannot use in pure floating-point 2.0 shaders.

And don't forget R300's ability to use its FPUs to execute a vec3 and a scalar operation simultaneously in certain cases. We left this out of the diagrams to keep them simple. The Shader Core of NV30 might theoretically be capable of the same thing, as shown in part 3.

If we try to explain the actually measured performance of NV30 with the characteristics described so far, we notice that something is missing from the picture: something else keeps NV30 from reaching its full potential. One particular recommendation from nVidia for the use of pixel shaders is to use as few temporary registers as possible. While analyzing the Gatekeeper function, we noticed that the number of quads in the pipeline depends directly on the number of temp registers: the fewer temp registers are used, the more quads fit into memory.
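The relationship can be illustrated with a toy model, assuming a fixed pool of register slots that all quads circulating in the pipeline must share. The actual capacity of NV30's register storage is not public, so the pool size of 64 slots below is purely illustrative:

```python
def quads_in_flight(temp_registers, slot_pool=64):
    # hypothetical fixed pool of register slots shared by all quads;
    # each quad occupies one slot per temp register its shader uses
    if temp_registers < 1:
        raise ValueError("a shader uses at least one temp register")
    return slot_pool // temp_registers
```

Under this assumption, halving the temp register count from 4 to 2 doubles the number of quads in flight from 16 to 32, which is exactly the lever nVidia's recommendation pulls.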

nVidia's recommendation aims at keeping as many quads as possible in the pipeline. Why is this so important? We found three central reasons:

  • Before a quad can take another pass through the entire pipeline, an empty quad must be sent down the pipe for technical reasons. This is of course detrimental to the usable performance, but the impact shrinks the fewer empty quads are necessary, and that can be achieved by increasing the number of quads in the pipeline.

  • Because of the length of the pipeline and the latencies of texture sampling, it is possible that the pipeline fills up before the first quad reaches its end. In this case the Gatekeeper has to wait as long as it takes that quad to reach the end, and every clock cycle that passes is wasted performance. An increased number of quads in the pipeline lowers the risk of such stalls.

  • The textures being sampled can change with every pass through the pipeline. With few quads in flight, only a few samples are read from the same texture in a row, so the cache hit rate drops and more memory bandwidth is required.
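The first point above lends itself to a small efficiency estimate. If one empty quad must follow each batch of n quads on every pass through the pipe (an assumption of this sketch; the real cost per pass may differ), the usable fraction of pipeline slots is n/(n+1):

```python
def loop_efficiency(quads):
    # illustrative assumption: one empty quad per batch per pass,
    # so of every (quads + 1) slots entering the pipe, only
    # `quads` slots carry real work
    return quads / (quads + 1)
```

With only 4 quads in flight, 80% of the slots do useful work; with 32 quads, it is about 97%, showing why packing more quads into the pipeline pays off.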

A pixel shader 2.0 written according to ATi's recommendations, or generated by the current DX9 SDK HLSL compiler, unfortunately triggers all of these problems: such shaders use many temp registers and have their texture instructions packed in a block. Shaders of version 1.4 follow the same principles.

nVidia will have to tackle these problems, the sooner the better. Microsoft is working on an SDK update that will include a new version of the compiler, able to produce more "NV30-friendly" code with fewer temp registers and paired texture ops. But if developers use PS 1.4 or ship only a single PS 2.0 shader, this compiler won't help nVidia much; they will have to solve the problem in the driver.
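To see what "paired texture ops" buys such a compiler, here is a toy issue counter for an NV30-style pipe in which two adjacent texture instructions share a clock. The "tex"/"alu" tokens are an invented stand-in for real shader instructions, used only to count clocks:

```python
def issue_clocks(instrs):
    # instrs: list of "tex" / "alu" tokens; two adjacent "tex" ops
    # dual-issue in one clock, everything else takes a full clock
    clocks = 0
    i = 0
    while i < len(instrs):
        if instrs[i] == "tex" and i + 1 < len(instrs) and instrs[i + 1] == "tex":
            i += 2  # a texture pair shares one clock
        else:
            i += 1  # a lone tex op or an alu op takes a full clock
        clocks += 1
    return clocks
```

An interleaved sequence `["tex", "alu", "tex", "alu"]` costs 4 clocks, while the reordered `["tex", "tex", "alu", "alu"]` costs only 3, because the two texture ops now form a pair; this reordering is the kind of transformation an NV30-aware compiler or driver could perform.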






