CineFX (NV30) Inside

August 31, 2003 / by Demirug / page 7 of 7


   The future of CineFX and CineFX II

The previous chapter was only partly based on facts; now we leave the realm of facts completely and delve deep into the troublesome sea of speculation.

First we'll have a glance at the near future. On the hardware side, NV36 and NV38 can be expected, but there are plenty of other places where speculation on those can be found. However, we expect all upcoming NV3x series chips to be equipped with a CineFX II pixel engine.

On the software side, the highly anticipated "miracle driver" Detonator 50 is about to be published. Equipped with some more technical background on CineFX and CineFX II now, we dare to speculate what this driver might be doing. But first an explanation of what the 4x.xx series drivers seem to be doing.

With shaders up to PS1.3, the driver mostly uses the same procedures as on GF3 and GF4. Shaders above that are treated differently: if a shader belongs to a group of selected programs nVidia considers important, the driver already contains an optimized version which is used instead. All other shaders are handled as-is.
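The apparent dispatch logic can be sketched roughly like this. Everything here is invented for illustration (the function names, the replacement table, the hashing): it only models the three paths described above, not any actual driver code.

```python
# Illustrative sketch of the 4x.xx driver's apparent shader handling:
# PS <= 1.3 takes the proven GF3/GF4 path, a few known PS2.0 shaders are
# swapped for hand-optimized versions, everything else runs as-is.
OPTIMIZED_REPLACEMENTS = {
    "a1b2c3": "hand_optimized_variant",  # keyed by a hash of the bytecode
}

def shader_key(bytecode):
    return bytecode  # placeholder: a real driver would hash the bytes

def select_shader(version, bytecode):
    if version <= (1, 3):
        return ("legacy_path", bytecode)       # GF3/GF4 procedures
    replacement = OPTIMIZED_REPLACEMENTS.get(shader_key(bytecode))
    if replacement is not None:
        return ("replaced", replacement)       # nVidia-supplied version
    return ("as_is", bytecode)                 # no optimization applied
```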

That's the first possibility for nVidia to improve utilization of the limited shading power. Experiments with a new shader compiler showed that most shaders can be converted into a more favorable form. Unlike the old shader compiler, which only supported the ATi shader model, the new compiler uses fewer temporary registers and produces an instruction order that fits CineFX best.
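Why fewer temporary registers matter on CineFX was covered earlier in this article: more live temps mean fewer quads in flight. A minimal sketch of how a compiler can shrink temp usage is a greedy renamer that recycles a register as soon as its value is dead. This is our own toy model (assuming each temp is written once, SSA-style, and all sources are earlier temps), not the actual compiler.

```python
def remap_registers(instructions):
    """Greedily rename temps, reusing a register once its value is dead.
    instructions: list of (dst, [srcs]); returns (rewritten, registers used)."""
    # record the last instruction that reads each temp
    last_use = {}
    for i, (dst, srcs) in enumerate(instructions):
        for s in srcs:
            last_use[s] = i
    free, mapping, next_reg, out = [], {}, 0, []
    for i, (dst, srcs) in enumerate(instructions):
        new_srcs = [mapping[s] for s in srcs]
        # release registers whose last use is this very instruction
        for s in set(srcs):
            if last_use[s] == i:
                free.append(mapping[s])
        mapping[dst] = free.pop() if free else next_reg
        if not free and mapping[dst] == next_reg:
            next_reg += 1
        out.append((mapping[dst], new_srcs))
    return out, next_reg
```

On a four-temp sequence like `t2 = t0 op t1; t3 = op t2`, this collapses the shader from four temporaries to two, which on CineFX directly translates into more quads kept in flight.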

CineFX II has been considered, too, so every pair of texture instructions is followed by an arithmetic instruction if possible. The CineFX I pipeline also seems to gain performance from this. The loopback controller of the shader core seems to be able to perform several shader instructions in a row without depending on a "big loopback" over the whole pipeline. Because this loopback is punished with an extra cycle per quad group, a reduction of these big loopbacks is desirable.
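The tex-tex-arith interleaving described above can be sketched as a trivial scheduler. This is a deliberately naive model: it assumes all instructions are independent, whereas a real compiler must of course respect data dependencies before reordering anything.

```python
def interleave(instructions):
    """Reorder so each pair of texture ops is followed by one arithmetic op,
    matching the issue pattern CineFX II appears to favor.
    Assumes the instructions carry no data hazards (toy model only)."""
    tex = [i for i in instructions if i.startswith("tex")]
    arith = [i for i in instructions if not i.startswith("tex")]
    out = []
    while tex or arith:
        out.extend(tex[:2])      # up to two texture ops per slot
        del tex[:2]
        if arith:
            out.append(arith.pop(0))
    return out
```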

Additionally, this new compiler is able to use the extended features of CineFX compared to SmartShader II to realize the same effect with fewer instructions. Tests with an early version of this compiler showed that these measures alone can lead to an up to 40% higher frame rate in certain cases, without modifications in the driver. Using this compiler is up to the software developers, though, and nVidia shouldn't always expect cooperation. But a great deal of these techniques could also be implemented in the driver and applied to shaders already compiled for the ATi model.


If we dive even deeper, we find another performance source the driver might tap under some circumstances. nVidia designed CineFX as a combination of VLIW (Very Long Instruction Word) and SIMD (Single Instruction Multiple Data). In one step, a number of operations are performed on multiple data elements (4 pixels) using several calculation units. This approach is close to ideal for a PS2.x graphics chip without dynamic branching.
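As a toy model of this combination (all names invented for illustration), one pipeline step executes a single VLIW word whose operations each run on a whole quad at once:

```python
def execute_word(word, quad_registers):
    """One pipeline step in a VLIW+SIMD model.
    word: list of (op, dst, src_a, src_b) - the VLIW part, several ops per step.
    quad_registers: dict name -> list of 4 values - the SIMD part, one quad."""
    for op, dst, a, b in word:
        if op == "mul":
            quad_registers[dst] = [x * y for x, y in zip(quad_registers[a], quad_registers[b])]
        elif op == "add":
            quad_registers[dst] = [x + y for x, y in zip(quad_registers[a], quad_registers[b])]
    return quad_registers
```

One decoded word thus drives several units, and each unit touches four pixels, which is exactly why the control logic and the instruction sequencer can stay so small.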

The SIMD part saves transistors because it's not necessary to duplicate the control logic for every single data packet that goes through the pipeline. And the VLIW part makes it unnecessary to have an instruction sequencer. Considering that an instruction sequencer often occupies more than half of the transistors in an execution pipeline, it is understandable that they didn't want to take this part of a modern CPU over to a GPU.

Surely the advantages of an instruction sequencer would be a good thing for a GPU, too. The driver developers would be thankful for some saved sleepless nights. But if that means doing without half the shader performance, the decision should be a simple one.

But there are not only advantages. SIMD wastes performance if not all data within a block can be used. At the edge of a triangle it can happen that only one of four pixels can really be used, which means a loss of 75% of computing power. A VLIW architecture confronts the designer with two problems. First, an optimizing compiler that determines the ideal utilization for each unit beforehand is essential. Reordering of instructions like in a CPU isn't possible in a pure VLIW architecture. Second, "very long" of course has an influence on program length.
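The 75% figure follows directly from the quad granularity; a one-line helper (hypothetical, just to make the arithmetic explicit) shows where it comes from:

```python
def quad_waste(covered_pixels):
    """Fraction of SIMD computing power lost for a quad where only
    covered_pixels of its four pixels lie inside the triangle."""
    return 1.0 - covered_pixels / 4.0
```

With one covered pixel the waste is 0.75, i.e. the 75% quoted above; a fully covered quad wastes nothing.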

This might be the reason why nVidia stores pixel shader code in the graphics card's local memory. The wide data bus allows programs to be loaded much faster. Another interesting detail: loading pixel shader programs utilizes the existing connection from pixel pipeline to memory via the TMUs. nVidia seemingly didn't want to use up transistors on another unit for the relatively rare procedure of switching shaders. Maybe even the texture cache is used for shaders, too.

This thorough explanation has a reason, of course. A driver can't use the SIMD aspect to improve performance, but chances are better with the VLIW part. VLIW, as mentioned above, needs a good compiler, something the 4x.xx Detonators seem to lack completely. It seems as if a pixel shader instruction simply gets transformed into one or multiple instruction words. The only exceptions are texture operations, which can be combined into one instruction word.
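The apparently naive packing of the 4x.xx drivers can be modeled in a few lines. This is pure speculation cast into code: one word per instruction, with only runs of texture ops (up to the pair CineFX II can issue per step) sharing a word.

```python
def naive_pack(instructions):
    """Speculative model of 4x.xx packing: one VLIW word per instruction,
    merging only up to two consecutive texture ops into a shared word."""
    words = []
    for inst in instructions:
        can_merge = (inst.startswith("tex")
                     and words
                     and all(w.startswith("tex") for w in words[-1])
                     and len(words[-1]) < 2)
        if can_merge:
            words[-1].append(inst)   # second tex op joins the current word
        else:
            words.append([inst])     # everything else opens a new word
    return words
```

A better compiler would instead try to fill the arithmetic slots of those words as well, which is exactly the untapped VLIW potential discussed here.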

Pure scalar operations leave the biggest part of processing power unused. We don't want to raise too high hopes here, since we don't have any really reliable information on how flexible the VLIW aspect of the CineFX pipeline really is. Maybe the suggestion of an nVidia employee that there are still many ways to improve shader performance without using lower precision aimed at exactly this point. With that thought we'll finish the driver topic and have an outlook on the things that are further away.


   CineFX and NV40

nVidia already announced that NV40 would have a totally new architecture, but we don't believe everything IHVs tell the public. Therefore, our last topic is: "Does CineFX have a future in NV4x?"

To answer this question it is important to know which targets nVidia intends to hit with NV40. Despite the rather limited availability of information regarding this topic, it is highly probable that they want to reach PS3.0 compliance. CineFX is essentially lacking the ability to conditionally influence the execution flow of a program.

The patent mentions the possibility of single quads leaving the pipeline at any time and being replaced by others. In this context it is also mentioned that for different quads, different programs or different parts of the same program can be active simultaneously. What seems to be missing is a facility to change the program counter from within the pipeline.

This would enable dynamic branching, which is the basis for the PS3.0 main features: loops, conditionals and procedure calls. Because under these circumstances it cannot be guaranteed that all pixels of a quad take the same path through program code, nVidia will probably drop the SIMD approach and use an independent pipeline for each pixel.

The technical feasibility is already there, but the marketing term "CineFX" is unlikely to be used in combination with NV40. But the underlying technology has a good chance to make it to the next round.


We are now at the end of our trip through the world of CineFX and what remains is a list of people we would like to thank: You, for reading this article up to the end. Of course all those that contributed to this article in one way or another. They're too many to be listed separately. nVidia, for developing CineFX so we could write an article about it. ATi for the R300 design that forced nVidia to present CineFX II. And, last but not least: Xmas for this translation.





