CineFX (NV30) Inside

August 31, 2003 / by Demirug / page 2 of 7

Shader Core

The Shader Core is a programmable processor that handles several different tasks. It is organized as a sequence of stages, which are described in order below.

CineFX Shader Core

1) Selection of the data source

The Shader Core can acquire data from three different sources:

The Shader Core itself. Certain complex operations cannot be executed in one pass. In such cases a quad doesn't leave the Shader Core after passing through it but immediately starts another pass. The table below shows a list of instructions and the number of passes required to execute them.
The end of the pipeline. If a quad needs more than one pass through the whole pipeline, it will start out again in the Shader Core.
The rasterizer. It is the data source of choice if neither of the above cases occurs.
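The choice between the three sources can be pictured as a simple priority decision. The following Python sketch is only an illustration: the names `Source` and `select_source` and the two flags are hypothetical, and the relative priority of the first two cases is an assumption, since the text only states that the rasterizer is used when neither of them applies.

```python
from enum import Enum


class Source(Enum):
    SHADER_CORE = "shader core (immediate extra pass)"
    PIPELINE_END = "end of pipeline (another full pass)"
    RASTERIZER = "rasterizer (fresh quad)"


def select_source(core_repass_pending: bool, pipeline_loopback_pending: bool) -> Source:
    """Pick where the next quad entering the Shader Core comes from."""
    # Assumption: a quad that must loop inside the core is served first.
    if core_repass_pending:
        return Source.SHADER_CORE
    # Assumption: quads returning from the end of the pipeline come next.
    if pipeline_loopback_pending:
        return Source.PIPELINE_END
    # Stated in the text: the rasterizer is used if neither case occurs.
    return Source.RASTERIZER
```

Only the last branch is taken directly from the description; the ordering of the first two branches could just as well be reversed in real hardware.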

2) Selection of interpolants

In this stage six floating point scalar values are selected for interpolation. The chip has to choose these six interpolants out of the individual scalars (s, t, r, q) of the eight possible texture coordinate sets, the pixel depth (Z value), and 1/W, which is necessary for perspective correction. Alternatively, it can grab scalars from a FIFO buffer that provides storage for four complete coordinate sets as well as the respective pixel depths and 1/W values. This stage also decides which coordinates are stored in the FIFO buffer for later access.
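To get a feel for the size of this selection problem, the candidate pool can simply be enumerated. This is a minimal sketch with made-up names (`CANDIDATES`, `select_interpolants`); how the hardware actually encodes the selection is not covered here, the FIFO path is left out, and allowing fewer than six values is an assumption.

```python
COMPONENTS = ("s", "t", "r", "q")

# Eight texture coordinate sets with four scalars each, plus Z and 1/W.
CANDIDATES = [f"tex{unit}.{c}" for unit in range(8) for c in COMPONENTS] + ["z", "1/w"]

MAX_INTERPOLANTS = 6  # budget of scalars per pass, as stated in the text


def select_interpolants(wanted):
    """Validate a list of scalar names against the pool and the budget of six."""
    wanted = list(wanted)
    unknown = [name for name in wanted if name not in CANDIDATES]
    if unknown:
        raise ValueError(f"not an interpolant candidate: {unknown}")
    if len(wanted) > MAX_INTERPOLANTS:
        raise ValueError("at most six scalars can be interpolated per pass")
    return wanted
```

With eight sets of four scalars plus Z and 1/W, the pool holds 34 candidates. A single 2D texture lookup with perspective correction and depth would use four of the six slots, for example `select_interpolants(["tex0.s", "tex0.t", "z", "1/w"])`.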

3) Interpolation

The selected values are now interpolated. In the process the quad is split into its four individual pixels. All further steps in the pipeline are performed on these four pixels in parallel.
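Why 1/W is among the interpolants becomes clearer with a small example. The sketch below uses the textbook formula for perspective-correct interpolation over a triangle; it is not taken from the patent, and the function names as well as the use of barycentric weights are assumptions made for illustration.

```python
def perspective_correct(weights, values, inv_w):
    """Interpolate one scalar for one pixel.

    weights: screen-space barycentric weights of the pixel (three numbers)
    values:  the scalar at the three triangle vertices
    inv_w:   1/W at the three triangle vertices
    """
    # Interpolate value/W and 1/W linearly in screen space, then divide.
    numerator = sum(k * v * iw for k, v, iw in zip(weights, values, inv_w))
    denominator = sum(k * iw for k, iw in zip(weights, inv_w))
    return numerator / denominator


def interpolate_quad(quad_weights, values, inv_w):
    """Split a quad into its four pixels and interpolate each one."""
    return [perspective_correct(w, values, inv_w) for w in quad_weights]
```

If all three vertices share the same 1/W, the result collapses to plain linear interpolation; otherwise the vertex closer to the viewer (larger 1/W) pulls the result towards its own value.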

4) First multiplexer/crossbar

Besides these six interpolated values, another 12 values (as three registers with four components each) from a previous pass can be used as input. A multiplexer controls the flow of information from this pool of data to the next stage.
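A crossbar of this kind can be modelled as an index lookup into a pool of 18 scalars. The following is a rough sketch under that assumption; the function name `crossbar` and the idea of describing the routing as a list of pool indices are illustrative, not something the patent specifies.

```python
def crossbar(interpolated, registers, routing):
    """Route scalars from the input pool to the next stage.

    interpolated: the six interpolated scalars
    registers:    three registers with four components each
    routing:      one pool index per output slot (hypothetical encoding)
    """
    if len(interpolated) != 6:
        raise ValueError("expected six interpolated scalars")
    if len(registers) != 3 or any(len(reg) != 4 for reg in registers):
        raise ValueError("expected three registers with four components each")
    # Pool layout (assumed): indices 0-5 interpolants, 6-17 register components.
    pool = list(interpolated) + [c for reg in registers for c in reg]
    return [pool[i] for i in routing]
```

Because every output slot picks its source independently, the same input scalar may be routed to several outputs at once, which is what makes swizzles and replication cheap.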

5) First arithmetic unit

This stage can compute reciprocals (rcp) and reciprocal square roots (rsq). The architecture covered in this patent can perform two rcp operations and one rsq operation for each of the four pixels.
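In terms of throughput, that amounts to eight reciprocals and four reciprocal square roots per quad. A minimal functional model, with the hypothetical name `first_arithmetic_unit` and ordinary double precision floats standing in for whatever number format the hardware uses, might look like this:

```python
import math


def first_arithmetic_unit(quad):
    """Apply two rcp and one rsq per pixel.

    quad: four (a, b, c) tuples, one per pixel.
    Returns (1/a, 1/b, 1/sqrt(c)) for each pixel.
    """
    if len(quad) != 4:
        raise ValueError("a quad consists of four pixels")
    return [(1.0 / a, 1.0 / b, 1.0 / math.sqrt(c)) for a, b, c in quad]
```

The sketch ignores special cases such as zero or negative inputs, which real hardware has to define in some way.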

6) Second multiplexer/crossbar

The results of the previous stage are reordered to form the correct input for the next stage.

7a) Multiplication

This stage can perform four scalar multiplications for each of the four pixels.

8a) Third multiplexer/crossbar

Again the results of the previous step are ordered in the way required for the next stage.

9a) Addition

A configurable addition unit allows three different operations:

four additions of two scalars
two additions of three scalars
one addition of four scalars
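Taken together, the multiplier and the configurable adder are enough to build the usual vector operations. The sketch below models the three adder modes and shows how a four-component dot product and a multiply-add could be composed from them. The mode names, the function names and the mapping to these two operations are assumptions for illustration; the text only lists the three addition configurations.

```python
def multiply_stage(a, b):
    """Four independent scalar multiplications for one pixel."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("the multiplier takes four operand pairs")
    return [x * y for x, y in zip(a, b)]


# Hypothetical mode name -> (number of sums, scalars per sum)
ADD_MODES = {"4x2": (4, 2), "2x3": (2, 3), "1x4": (1, 4)}


def add_stage(operands, mode):
    """Sum consecutive groups of operands according to the selected mode."""
    sums, width = ADD_MODES[mode]
    if len(operands) != sums * width:
        raise ValueError(f"mode {mode} needs {sums * width} operands")
    return [sum(operands[i * width:(i + 1) * width]) for i in range(sums)]


def dot4(a, b):
    """Four products followed by one addition of four scalars."""
    return add_stage(multiply_stage(a, b), "1x4")[0]


def multiply_add(a, b, c):
    """Four products, each added to one extra scalar (four additions of two)."""
    products = multiply_stage(a, b)
    # The crossbar in between would pair each product with its addend.
    interleaved = [v for pair in zip(products, c) for v in pair]
    return add_stage(interleaved, "4x2")
```

For example, `dot4([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])` evaluates to 70.0. The "two additions of three scalars" mode would fit two three-component sums in the same way.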

7b/8b/9b) Complex calculations

In parallel with the multiplication/addition path, the architecture can perform a log or exp operation using a lookup table.
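How a lookup table can produce an exponential is sketched below for the base 2 case. Everything specific here is an assumption: the table size, the linear interpolation between entries and the names `EXP2_TABLE` and `exp2_lut` are chosen for illustration, as the text says nothing about how the table is organized.

```python
import math

TABLE_BITS = 6                 # assumed table resolution
TABLE_SIZE = 1 << TABLE_BITS   # 64 intervals over [0, 1)

# 2**f for f = 0, 1/64, ..., 1 (one extra entry for interpolation).
EXP2_TABLE = [2.0 ** (i / TABLE_SIZE) for i in range(TABLE_SIZE + 1)]


def exp2_lut(x):
    """Approximate 2**x with a table lookup plus linear interpolation."""
    whole = math.floor(x)          # integer part goes into the exponent
    frac = x - whole               # fractional part indexes the table
    pos = frac * TABLE_SIZE
    index = min(int(pos), TABLE_SIZE - 1)
    t = pos - index
    low, high = EXP2_TABLE[index], EXP2_TABLE[index + 1]
    mantissa = low + t * (high - low)
    return math.ldexp(mantissa, whole)   # mantissa * 2**whole
```

The split into an integer and a fractional part is what keeps the table small: the integer part only adjusts the floating point exponent, so the table needs to cover a single octave. A logarithm can be handled the same way in reverse, by looking up the mantissa and adding the exponent.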

10) Last multiplexer/crossbar

The values computed in the aforementioned stages can now be stored into two texture coordinate sets with three components each, and one 4-component register.