CineFX (NV30) Inside

August 31, 2003 / by Demirug / page 4 of 7


   Shader Backend

The backend unit synchronizes the different data paths so that all information belonging to one quad leaves the unit together again, as one unit.


CineFX Shader Backend

After synchronization a few more calculations take place (interpolation of a vertex color, fog, pixel depth) and the texture samples are converted to one of the internal formats (fp32, fp16, fx12), if necessary. Supplemented with values from the register file, this data is passed to the combiners.


   Combiners

As mentioned above, the patent doesn't contain any specific information here. Instead, it says that any known technology can be used for the combiners. The only necessity is that the results from the combiners can either be written to the register file or sent on to the ROPs (raster operators). If the pixels need another pass through the pipeline, the shader core gets notified that there is a quad waiting at the end of the pipeline.


   Gatekeeper and Loopback Control

Though the gatekeeper sits at the start of the pipeline, it is described only now because its function is best understood once you know the rest of the pipeline. The gatekeeper's decisions directly influence the choice of data source in the first stage of the shader core. If a shader is simple enough not to need a loopback through the whole pipeline, the gatekeeper just sits idle and passes on any quads coming from the rasterizer.

However, if loopbacks are necessary, the gatekeeper decides when a quad coming from the rasterizer may enter the pipeline. At first it lets as many quads into the pipeline as possible. From then on it gives clearance to the quads waiting at the end of the pipeline, then checks whether there is still space for additional quads in the pipeline. If so, it passes on quads from the rasterizer until the pipe is completely filled.
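
To make this admission logic a bit more tangible, here is a minimal sketch in Python. The queue names and per-quad bookkeeping are purely illustrative; the patent does not describe the actual hardware structures in these terms.

    # Simplified model of the gatekeeper's admission decision. All names are
    # illustrative; the real hardware works with signals, not Python lists.
    def gatekeeper_step(resident_quads, capacity, loopback_waiting, rasterizer_queue):
        admitted = []

        # Quads waiting at the end of the pipeline get clearance first. They
        # already occupy register file space, so looping back does not change
        # the occupancy of the pipeline.
        admitted.extend(loopback_waiting)
        loopback_waiting.clear()

        # If there is still room, new quads from the rasterizer are let in
        # until the pipeline is completely filled.
        while rasterizer_queue and len(resident_quads) < capacity:
            quad = rasterizer_queue.pop(0)
            resident_quads.append(quad)
            admitted.append(quad)

        return admitted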

After executing the whole shader program, quads leave the pipeline towards the ROPs, leaving space for new quads. We were talking about a filled pipeline, but exactly how many quads fit into it? nVidia chose a flexible approach here. The maximum number of quads present in the pipeline at any one time is the quotient of the size of the register file and the amount of memory required for one quad. The memory required per quad depends on the number and format of the temporary registers used in the shader.
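
The capacity formula itself is simple. The sketch below uses made-up numbers, since the patent does not state the actual size of NV30's register file:

    def max_quads_in_flight(register_file_bytes, temp_registers, bytes_per_register):
        # Memory per quad = 4 pixels * registers per pixel * bytes per register.
        bytes_per_quad = 4 * temp_registers * bytes_per_register
        # Maximum number of quads = register file size / memory per quad.
        return register_file_bytes // bytes_per_quad

    # Hypothetical example: a shader using 4 fp32 temporaries (16 bytes each)
    # needs 4 * 4 * 16 = 256 bytes per quad. Switching those temporaries to
    # fp16 halves the footprint and doubles the number of quads in flight.
    print(max_quads_in_flight(8192, 4, 16))  # -> 32 quads (assumed 8 KiB register file)
    print(max_quads_in_flight(8192, 4, 8))   # -> 64 quads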


   Performance Considerations

Everything used to be so much simpler in the past. You just looked at how many pipelines a chip had and multiplied that figure by the number of TMUs per pipeline and the chip's clock speed. The result is the texel fillrate, a good performance indicator. Starting with the previous generation of chips (NV2x, R(V)2x0), this rule of thumb began to lose importance. Since the pixel shader units of those chips were used more as improved multitexturing units than anything else, texel fillrate is still very important for them.
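
The old rule of thumb in one line, using the reference clocks of the shipping cards purely as an example:

    def texel_fillrate_mtexel(pipelines, tmus_per_pipeline, clock_mhz):
        # Texel fillrate = pipelines * TMUs per pipeline * clock.
        return pipelines * tmus_per_pipeline * clock_mhz

    print(texel_fillrate_mtexel(4, 2, 500))  # NV30 / GeForceFX 5800 Ultra -> 4000 MTexel/s
    print(texel_fillrate_mtexel(8, 1, 325))  # R300 / Radeon 9700 Pro      -> 2600 MTexel/s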

But in a modern shader pipeline, other performance figures take the leading role. GPUs are getting close to CPUs regarding programmability, so it makes sense to apply the same performance metric for both: instructions per second, measured in MIPS. A shader pipeline executes instructions just like a CPU. Besides floating point and integer calculations, there is a third category of instructions, texture operations.

The number of pixels per second that can be rendered with a certain shader is the quotient of the available processing power and the work required per pixel. When processing pixel shader 2.0 programs, NV30 can only use the Shader Core; the Combiners don't offer the required precision for this task.
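
As a formula this is trivial, but it is the metric the rest of this page revolves around. The instruction count below is a made-up placeholder for whatever a concrete shader needs:

    def pixel_rate(ops_per_second, ops_per_pixel):
        # Pixels per second = available processing power / work needed per pixel.
        return ops_per_second / ops_per_pixel

    # Hypothetical shader needing 20 arithmetic instructions per pixel on a core
    # delivering 2000 million arithmetic ops per second:
    print(pixel_rate(2000e6, 20))  # -> 100 million pixels per second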

Because the Shader Core operates on quads consisting of four pixels, we get four micro-instructions per clock. The term micro-instruction is used because the Shader Core cannot finish every operation in exactly one cycle: on the one hand, there are instructions that need more than one pass through the core; on the other hand, two texture operations can be performed in parallel. So we get either 8 texture instructions or 4 arithmetic operations per clock. At this point it is important to note that a texture instruction is not identical to sampling and filtering a texture. The instruction merely generates the command that is passed to a TMU.
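
Put into numbers, and assuming the 500 MHz clock of the shipping NV30 card:

    NV30_CLOCK_MHZ = 500  # GeForceFX 5800 Ultra reference clock

    # One quad (four pixels) per clock; each pixel issues either two texture
    # micro-instructions or one arithmetic micro-instruction.
    texture_ops_per_clock = 4 * 2     # 8 texture instructions per clock
    arithmetic_ops_per_clock = 4 * 1  # 4 arithmetic instructions per clock

    print(texture_ops_per_clock * NV30_CLOCK_MHZ)     # 4000 MTO/s (texture only)
    print(arithmetic_ops_per_clock * NV30_CLOCK_MHZ)  # 2000 MAO/s (arithmetic only)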

If we compare NV30 to its competitor from ATi on a per-clock basis, the NV30 receives a thorough beating from R300. ATi comes out ahead with 8 texture ops and 8 arithmetic ops in parallel. The higher core clock improves the situation a bit for NV30, but it still only manages 2600 MTO/s + 700 MAO/s or 2000 MTO/s + 1000 MAO/s.

The Shader Core allows texture ops to be exchanged for arithmetic ops at a 2:1 ratio (see graph below). The R300 as R9700Pro reaches 2600 MTO/s + 2600 MAO/s without needing this balancing option. nVidia can only dream of this raw performance, or try to overclock an NV30 to 1 GHz. Not only the award for best performance per clock, but also the one for best overall raw performance, goes to ATi's R300.
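
The two mixed figures quoted for NV30 follow directly from this 2:1 exchange, as this small check shows (again assuming the 500 MHz reference clock):

    NV30_CLOCK_MHZ = 500
    TEXTURE_SLOTS_PER_CLOCK = 8  # each arithmetic op costs two of these slots

    def mixed_rates(arithmetic_ops_per_clock):
        # 2:1 exchange: texture throughput shrinks by two for every arithmetic op.
        texture_ops_per_clock = TEXTURE_SLOTS_PER_CLOCK - 2 * arithmetic_ops_per_clock
        return (round(texture_ops_per_clock * NV30_CLOCK_MHZ),
                round(arithmetic_ops_per_clock * NV30_CLOCK_MHZ))

    print(mixed_rates(1.4))  # -> (2600, 700)   MTO/s + MAO/s
    print(mixed_rates(2.0))  # -> (2000, 1000)  MTO/s + MAO/s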





