CineFX (NV30) Inside
August 31, 2003 / by Demirug / page 6 of 7
CineFX II
With their new NV35, nVidia changed the name of their shader technology to CineFX II. The name implies that the changes were evolutionary in their nature. The biggest problem nVidia had to solve was the low performance when using PS 1.4 and above. Because there is no known patent covering CineFX II and it is improbable that there will be one, we'll leave the path of facts now and head for the trail of speculations.
nVidia states in several places that floating point performance has doubled compared to the predecessor. The "NVIDIA GEFORCE FX 5900 PRODUCT OVERVIEW" also claims that the chip is able to execute 12 instructions per clock. This leads to the conclusion that nVidia added a second FPU that can perform four instructions per clock. Together with the eight texture instructions, this adds up to the claimed 12 operations.
The additional FPU was most probably placed at the combiner stage. The five million additional transistors do not suffice for an FPU of this complexity. nVidia had to remove something to fit the FPU on the chip. Because the new FPU is able to handle the tasks of the integer ALUs, we can assume that those units were removed. Tests with NV35 show only minimal performance losses in PS1.1 to 1.3, so the FPU can perform almost all operations at the same rate as the integer ALUs do.
In some exceptional cases this doesn't hold true and the FPU needs more cycles for the same task. Then there is the question of which data formats the new FPU can handle. We can be sure that the FPU is able to handle fp32 (s23e8) numbers. The mantissa of 23 bits requires the FPU to have 23 bit adders and multipliers. Extending those to 24 bits and allowing them to split yields two 12 bit adders and multipliers, which is exactly what is necessary to replace the integer ALUs of NV30. This seems more logical than replacing only one of the integer ALUs because it needs far fewer transistors in total.
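To illustrate what "allowing them to split" means, here is a minimal sketch of a 24 bit adder that can optionally break its carry chain in the middle. This is our own illustration of the idea, not a description of nVidia's actual circuit; the function name `add_24bit` and the `split` flag are made up for this example.

```python
MASK24 = (1 << 24) - 1
MASK12 = (1 << 12) - 1

def add_24bit(a: int, b: int, split: bool = False) -> int:
    """Add two 24 bit values.

    With split=False the adder behaves as one 24 bit unit.
    With split=True the carry from bit 11 into bit 12 is cut, so the
    same hardware acts as two independent 12 bit adders (hypothetical model).
    """
    if not split:
        return (a + b) & MASK24
    # Lower lane: any carry out of bit 11 is discarded.
    lo = ((a & MASK12) + (b & MASK12)) & MASK12
    # Upper lane: adds on its own, unaffected by the lower lane.
    hi = (((a >> 12) & MASK12) + ((b >> 12) & MASK12)) & MASK12
    return (hi << 12) | lo
```

In unsplit mode, adding 1 to 0x000FFF carries into the upper half and gives 0x001000. In split mode the same inputs give 0x000000, because the lower 12 bit lane wraps around without disturbing the upper lane.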
The question whether the FPU can be split into two fp16 units will be left unanswered. Tests have not shown conclusively whether performance increases can be attributed to higher computing power or to the smaller register usage footprint. The marketing department surely would have mentioned it if fp16 allowed for 16 instructions per clock.
Our approach also explains why performance in PS1.1 to 1.3 applications suffers a bit. The integer ALUs in NV30 are arranged in series, which means they're immune to mutual dependencies. The model described above however resembles two parallel units, so it isn't possible to execute two dependent instructions in one pass.
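The difference between serial and parallel units can be sketched with a small pass counter. This is a deliberately simplified model of our speculation above: it looks only at read-after-write dependencies between neighbouring instructions, and the names `Op`, `passes_serial` and `passes_parallel` are invented for this example.

```python
from typing import List, Tuple

# One instruction: (destination register, source registers)
Op = Tuple[str, Tuple[str, ...]]

def passes_serial(program: List[Op]) -> int:
    """Two ALUs in series: the second can consume the first one's result,
    so any two consecutive instructions fit into one pass."""
    return (len(program) + 1) // 2

def passes_parallel(program: List[Op]) -> int:
    """Two ALUs in parallel: two consecutive instructions share a pass
    only if the second does not read the first one's destination."""
    passes = 0
    i = 0
    while i < len(program):
        passes += 1
        if i + 1 < len(program):
            dest, _ = program[i]
            _, next_srcs = program[i + 1]
            if dest not in next_srcs:
                i += 2  # independent pair, both issue together
                continue
        i += 1  # dependent (or last) instruction issues alone
    return passes

# r0 = t0 * v0 ; r0 = r0 + c0  -> second instruction depends on the first
dependent = [("r0", ("t0", "v0")), ("r0", ("r0", "c0"))]
# r0 = t0 * v0 ; r1 = t1 * c0  -> no dependency
independent = [("r0", ("t0", "v0")), ("r1", ("t1", "c0"))]
```

Under this model the dependent pair needs one pass on the serial arrangement but two on the parallel one, while the independent pair needs one pass on both.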
Those CineFX II changes naturally have an influence on raw performance and dependencies. The additional FPU increases the maximum number of instructions performed per clock to either 8 texture instructions and 4 arithmetic instructions, or 8 arithmetic instructions. This is still below R350 which can do 8 texture and 8 arithmetic instructions, just like its predecessor R300.
Let's compare those chips: nVidia lowered the core clock in NV35, while ATi raised the clock of R350 compared to R300. R350 reaches 3040 MTO/s + 3040 MAO/s = 6080 MI/s. NV35 can do either a maximum of 3600 MTO/s + 1800 MAO/s = 5400 MI/s or a minimum of 0 MTO/s + 3600 MAO/s = 3600 MI/s. The clock rate increase secures ATi the prize for best raw performance. Had nVidia not lowered the core clock, ATi would have scored only a very close victory, leading by just 1.3%. But with NV35's actual clock rates, it's a respectable 12.6% lead for ATi.
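The arithmetic behind these figures can be retraced in a few lines. The clock rates are not quoted on this page: 380 MHz for R350 and 450 MHz for NV35 follow from dividing the figures above by eight, and the 500 MHz for an NV35 at NV30's clock is the value that reproduces the 1.3% lead. Treat all three constants as assumptions of this sketch.

```python
# Assumed core clocks in MHz (inferred from the throughput figures, see text).
R350_MHZ = 380
NV35_MHZ = 450
NV30_MHZ = 500  # hypothetical NV35 running at the old NV30 clock

def raw_mi_per_s(tex_per_clock: int, arith_per_clock: int, mhz: int) -> int:
    """Million instructions per second for a given per-clock instruction mix."""
    return (tex_per_clock + arith_per_clock) * mhz

r350_peak = raw_mi_per_s(8, 8, R350_MHZ)           # 8 texture + 8 arithmetic
nv35_peak = raw_mi_per_s(8, 4, NV35_MHZ)           # 8 texture + 4 arithmetic
nv35_floor = raw_mi_per_s(0, 8, NV35_MHZ)          # arithmetic only
nv35_at_old_clock = raw_mi_per_s(8, 4, NV30_MHZ)   # what-if scenario

def lead_percent(winner: int, loser: int) -> float:
    """Relative lead of winner over loser in percent."""
    return (winner / loser - 1.0) * 100.0

actual_lead = lead_percent(r350_peak, nv35_peak)
hypothetical_lead = lead_percent(r350_peak, nv35_at_old_clock)
```

Rounded to one decimal place, `actual_lead` and `hypothetical_lead` come out at the 12.6% and 1.3% mentioned above.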
When comparing worst case situations, NV30 was able to perform slightly better than R300 in a few, practically irrelevant situations. Let's look at how the situation has changed. Because of the additional FPU in NV35, different requirements apply to utilize its raw performance to the fullest. Besides the known problem that texture accesses have to occur in pairs, another condition has to be fulfilled now. The pair of texture operations must be followed by an arithmetic operation, or else the new FPU sits idle.
This isn't a big problem because texture values usually undergo some combining before leaving the pixel shader. But the PS 1.4 model and the PS 2.0 model as recommended by ATi begin with reading many textures before starting with the calculations, which is problematic for nVidia's architecture. In PS2.0 this can be solved with a new compiler, but with PS 1.4 this structure is fixed. nVidia has to do some work in the driver.
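What such a compiler or driver would have to do can be sketched as a small greedy scheduler. It tries to emit the pattern "two texture operations, then one arithmetic operation" while never moving an instruction across one it depends on. This is our own toy model, not nVidia's actual method; `Instr`, `depends` and `reorder` are names invented here, and the real hardware has further constraints (registers, pairing rules) that are ignored.

```python
from typing import List, NamedTuple, Tuple

class Instr(NamedTuple):
    kind: str               # "tex" or "alu"
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read

def depends(later: Instr, earlier: Instr) -> bool:
    """True if 'later' must stay behind 'earlier' to keep the result unchanged."""
    return (earlier.dest in later.srcs      # read after write
            or later.dest in earlier.srcs   # write after read
            or later.dest == earlier.dest)  # write after write

def reorder(program: List[Instr]) -> List[Instr]:
    pattern = ("tex", "tex", "alu")
    remaining = list(range(len(program)))
    out: List[Instr] = []
    while remaining:
        wanted = pattern[len(out) % 3]
        # An instruction is ready if nothing it depends on is still waiting.
        ready = [i for i in remaining
                 if not any(depends(program[i], program[j])
                            for j in remaining if j < i)]
        # Prefer the wanted kind; otherwise fall back to the oldest ready one.
        pick = next((i for i in ready if program[i].kind == wanted), ready[0])
        remaining.remove(pick)
        out.append(program[pick])
    return out

# Shader in the "read all textures first" style.
ati_style = [
    Instr("tex", "t0", ("v0",)),
    Instr("tex", "t1", ("v1",)),
    Instr("tex", "t2", ("v2",)),
    Instr("tex", "t3", ("v3",)),
    Instr("alu", "r0", ("t0", "t1")),
    Instr("alu", "r1", ("t2", "t3")),
    Instr("alu", "r0", ("r0", "r1")),
]
reordered = reorder(ati_style)
```

For this example the four leading texture reads are split up so that the first multiplication lands directly behind the first texture pair, and the order of dependent instructions is preserved. The oldest waiting instruction is always ready, so the loop cannot stall.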
R350 shows a similar curve to R300, just shifted up a bit because of the higher clock: a maximum of 6080 MI/s at a 1:1 ratio and a minimum of 3040 MI/s when there are only instructions of one kind.
NV35 shows a different behavior than NV30. In the ideal case of paired texture instructions followed by an arithmetic instruction, it can reach a maximum of 5400 MI/s. The second line shows a shader without paired texture operations. The last curve shows how NV35 behaves when using PS1.4 and PS2.0 in the form preferred by ATi. Because this means either 8 texture or 8 arithmetic instructions per clock, we get a constant 3600 MI/s. At a 1:6 ratio and below, NV35 is able to beat R350.
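The end points of these curves can be reproduced with a simple idealized model that treats a shader as a ratio of texture to arithmetic instructions. It is a sketch under our own assumptions: clocks of 380 MHz (R350) and 450 MHz (NV35) inferred from the figures above, perfect utilization, and no dependent reads. It is not guaranteed to match every point of the measured or plotted curves.

```python
from fractions import Fraction

R350_MHZ = 380  # assumed, inferred from 3040 MTO/s at 8 per clock
NV35_MHZ = 450  # assumed, inferred from 3600 MTO/s at 8 per clock

def r350(tex: int, arith: int) -> Fraction:
    """8 texture and 8 arithmetic ops per clock in parallel;
    the more frequent kind determines the number of clocks."""
    clocks = Fraction(max(tex, arith), 8)
    return (tex + arith) / clocks * R350_MHZ

def nv35_ideal(tex: int, arith: int) -> Fraction:
    """Paired texture ops followed by arithmetic: each texture clock
    (8 tex) carries 4 arithmetic ops along; the rest run at 8 per clock."""
    tex_clocks = Fraction(tex, 8)
    riding_along = Fraction(tex, 2)
    leftover = max(Fraction(0), arith - riding_along)
    clocks = tex_clocks + leftover / 8
    return (tex + arith) / clocks * NV35_MHZ

def nv35_texture_first(tex: int, arith: int) -> Fraction:
    """All texture reads up front: every clock does either 8 texture
    or 8 arithmetic ops, never both."""
    clocks = Fraction(tex + arith, 8)
    return (tex + arith) / clocks * NV35_MHZ
```

Each function expects at least one instruction and returns MI/s. In this model `r350(1, 1)` gives 6080, `nv35_ideal(2, 1)` gives 5400 and `nv35_texture_first` gives 3600 for any mix. R350 still leads the texture-first case at 1:5 with 3648 MI/s but falls below 3600 at 1:6, in line with the crossover described above.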
With shaders optimized for both architectures, NV35 does a much better job than its predecessor did. NV35 beats R350 outside the range of 2:1 to 1:3. But in between, ATi dominates and even R300 is able to beat NV35 here. If we consider the bigger performance hit of R350 when doing dependent reads, we can conclude that NV35 and R350 are competitors of equal weight if both get fed with optimized shader code.
But nVidia can't expect an application to always deliver such code. At this point we can only preach that nVidia has to put instruction reordering capabilities into their drivers, but without changing the function of the code. The method the current driver uses is nothing more than a stop-gap solution that is acceptable as long as the number of applications containing unfavorable shader code is small. But with a growing number of titles using shader technology, nVidia can't expect GeForceFX users to wait for new drivers to enjoy higher performance through replacement shaders.