# NV40 Technology explained

Part 1: Secrets of the Pixelshading Unit

*August 19, 2004 / by aths / Page 3 of 7*

**More details about Shader Unit 2**

It is still unclear whether Unit 2 can use any input that Unit 1 can. We think it can, but doing so may block some functionality of Unit 1, too. Such subtle dependencies might be uncovered with intensive testing; however, future generations will have different characteristics anyway. Let's have a look at some rough characteristics:

*Shader Unit 2*

The crossbar can arrange up to four values for up to five channels, leaving at least one channel free. Four single multiply channels and four add channels are available, plus another unit able to perform specific additional mathematical functions. The ADD units are cascaded and have extra "wiring" that is only hinted at in the figure. This provides more flexibility and a single-cycle dot product.
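To make the idea of a dot product built from multiply channels and cascaded adders more concrete, here is a minimal functional sketch. It only models the arithmetic; the function name `dp4` and the exact order in which the adders are chained are our assumptions, since the real wiring is not known.

```python
def dp4(a, b):
    """Four-component dot product: parallel multiplies, then an adder cascade.

    Purely a functional model (assumed structure), not the real NV40 wiring.
    """
    # Stage 1: four multiply channels, one per component (R, G, B, A).
    products = [x * y for x, y in zip(a, b)]
    # Stage 2: cascaded adders; each adder takes the previous partial sum
    # and the next product as its two inputs.
    total = products[0]
    for p in products[1:]:
        total = total + p
    return total


print(dp4((1.0, 2.0, 3.0, 4.0), (0.5, 0.5, 0.5, 0.5)))  # prints 5.0
```

In hardware, both stages would complete within the same clock; the loop here merely spells out the chain of additions.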

Unit 2 can also handle up to two independent instructions per clock. As usual, any calculation unit can also just forward the data. If no SFU is used, the MAD logic can perform up to two instructions from this list: MUL, ADD, MAD, DP, or any other instruction based on these operations.

In any clock, a maximum of four register components (R, G, B, A) can be produced; however, an SFU operation and a MUL4 are not possible in the same clock. Please consider our "wires" as data paths, not as single channels: the MUL and ADD units each have two inputs, but we show only one "data path".
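The issue rules described so far can be written down as a small check. This is only a rough sketch of the constraints stated in the text (at most two instructions, at most four result components, no SFU operation together with a MUL4); the `Instr` type, the `SFU_OPS` set and the function name are hypothetical, and real hardware certainly has further restrictions that are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    op: str          # e.g. "MUL", "ADD", "MAD", "DP", "LOG", "EXP", "SINCOS"
    components: int  # how many of R, G, B, A the instruction writes (1..4)


# Special functions that the article's table assigns to Unit 2.
SFU_OPS = {"LOG", "EXP", "SINCOS"}


def fits_in_one_clock(instrs):
    """Return True if the instructions respect the rules stated in the text."""
    # Rule 1: up to two independent instructions per clock.
    if len(instrs) > 2:
        return False
    # Rule 2: at most four register components can be produced per clock.
    if sum(i.components for i in instrs) > 4:
        return False
    # Rule 3: an SFU operation and a MUL4 cannot share a clock. The article
    # states this separately; in this simplified model it is already implied
    # by the four-component budget.
    uses_sfu = any(i.op in SFU_OPS for i in instrs)
    has_mul4 = any(i.op == "MUL" and i.components == 4 for i in instrs)
    return not (uses_sfu and has_mul4)


print(fits_in_one_clock([Instr("MUL", 3), Instr("ADD", 1)]))  # 3+1 split
print(fits_in_one_clock([Instr("MUL", 4), Instr("EXP", 1)]))  # SFU + MUL4
```

The first call models a 3+1 split and passes; the second combines a MUL4 with an SFU operation and is rejected.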

The branching unit, which modifies the instruction pointer, sits between Unit 1 and Unit 2, and a helper stage for scaling (and biasing, we assume) follows Unit 2. Additionally, an extra unit works in parallel to both shader units; it can do one NRM_PP per clock (with a latency of two cycles, we think). Whether this unit can be used in a given clock cycle depends on the input source (temp, color, texture, ...) because the number of inputs is limited for each type.

Special functions for which we know that NV40 provides dedicated hardware:

| Name | Unit used | Cycles | Notes |
|---|---|---|---|
| RCP | #1 | 1 | 1/x |
| RSQ\* | #1 | 2 | 1/sqrt(x). In these two clocks, an RCP for the same channel is free, so SQRT takes 2 cycles, too. |
| LOG | #2 | 1 | log2(x) |
| EXP | #2 | 1 | 2^x |
| SINCOS | #2 | 2 (3) | sin(x) and cos(x); 2 cycles for one component, 3 cycles for two components |

\* It looks as if RSQ blocks Unit 2, too. Other SFU operations probably block the additional units as well. Such issues may be resolved in newer driver versions.
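The remark that SQRT costs the same two cycles as RSQ follows from chaining the two dedicated functions: the square root is the reciprocal of the reciprocal square root. A minimal sketch, with Python stand-ins for the hardware functions (the helper names are ours, and x is assumed to be greater than zero):

```python
import math


def rcp(x):
    """Stand-in for the RCP special function: 1/x."""
    return 1.0 / x


def rsq(x):
    """Stand-in for the RSQ special function: 1/sqrt(x)."""
    return 1.0 / math.sqrt(x)


def sqrt_via_rsq(x):
    """SQRT composed from RSQ followed by the 'free' RCP (x > 0 assumed)."""
    return rcp(rsq(x))


print(math.isclose(sqrt_via_rsq(9.0), 3.0))  # prints True
```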

Special functions executed by macros:

| Name | Unit used | Cycles | Notes |
|---|---|---|---|
| LRP | #2 | 2 | Linear interpolation (LRP is a macro for w * a + (1 - w) * b) |
| POW | #2 | 3 | x^y (POW is a macro for exp(y * log(x))) |
| NRM | #2, #1 | 2 | x * 1/\|x\| for a vector. Each component is multiplied by the reciprocal of the square root of the vector's dot product with itself. |
| NRM_PP | - | 1 | The result is in partial precision (FP16). |
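The macro expansions listed in the table can be spelled out as follows. This is a sketch of the arithmetic only; the function names are ours, and we assume that POW uses the base-2 LOG and EXP from the special function table (the result is the same for any matching base). POW in this form requires x > 0, and NRM requires a non-zero vector.

```python
import math


def lrp(w, a, b):
    """LRP: linear interpolation, w*a + (1-w)*b."""
    return w * a + (1.0 - w) * b


def pow_macro(x, y):
    """POW: exp(y * log(x)), assuming base-2 EXP and LOG (x > 0)."""
    return 2.0 ** (y * math.log2(x))


def nrm(v):
    """NRM: scale each component by 1/sqrt(dot(v, v))."""
    dot = sum(c * c for c in v)        # dot product of the vector with itself
    inv_len = 1.0 / math.sqrt(dot)     # RSQ of that dot product
    return tuple(c * inv_len for c in v)


print(lrp(0.25, 8.0, 4.0))   # prints 5.0
print(pow_macro(2.0, 3.0))
print(nrm((3.0, 0.0, 4.0)))
```

The cycle counts in the table line up with these expansions: POW needs a LOG, a MUL and an EXP, while NRM needs a dot product, an RSQ and a final multiply.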

In the next part, we will have a look at how to optimize for this pipeline, and at which hardware served as the archetype for NV40.