CineFX (NV30) Inside
August 31, 2003 / by Demirug / page 1 of 7
This translation was made possible by our community (mainly Xmas), who wanted me to make this article's information available to more than just the German-speaking audience. We apologize for any shortcomings in our English.
Note: Should the first three pages of the article be too technical and too complicated for your liking, still be sure to have a look at page 4 (bottom) and following. You will find an easy to understand performance analysis of the CineFX architecture and a comparison to ATi's R3x0 part.
Ever since the release of NV30 with its new CineFX pixel engine, people have been wondering about its internal structure. The developer of the chip, nVidia, has been reluctant to answer questions about internal details in the past, seeking refuge in nice-sounding phrases carrying only small bits of information.
Even the architecture-specific OpenGL API extensions nVidia presented for programming the CineFX engine could not shed light on those details. Therefore many people tried to lift the veil of mystery, equipped with coarse information and several theoretical benchmarks run on the actual hardware. The author of this article has been among those people, too. But in all these efforts, one source of information has been almost completely ignored. It is easy to understand why no one looked there: in the past, virtually no information on brand-new chips could be found there.
Patent offices! In fact, the world patent with the registration number WO02103638 was published on December 27, 2002, carrying the title "PROGRAMMABLE PIXEL SHADING ARCHITECTURE". Officially, of course, this patent doesn't relate to NV30 or CineFX. But that is not unusual, because nVidia has never linked its patent texts to specific chips or marketing names. Still, since there are enough indications that this patent covers CineFX and NV30, we will treat it as such.
In the following pages we're going to analyze this patent. We're especially interested in answering the well-known questions about NV30:
Why does NV30 perform so poorly when executing 1.4 or 2.0 pixel shaders which have been developed on ATi hardware?
How can NV30 take advantage of shaders optimized for it?
Where does the chip still have hidden performance potential that could be revealed by the driver?
Since the patent doesn't give an exact description of all technical details, and since we cannot be 100% sure that the NV30 design matches the patent in every respect, the conclusions shown here are not guaranteed to be absolutely correct.
Let's start our hunt for answers with an overview of the complete CineFX pipeline, which can be found as figure 4 in the patent document. For the sake of clarity we don't use the original drawings from the patent: sadly, the accessible scans are of poor quality and partly hard to read.
A look at this image shows that the complexity of the architecture has generally been underestimated a bit. There are far more units and loopbacks than earlier speculation had suggested. Looking very closely, one can see another surprise: the rasterizer doesn't deliver pixels but so-called "quads" to the pipeline.
According to nVidia, such a quad consists of four pixels which form a 2*2 grid. What we see in this image is the whole pipeline as it exists in the chip, exactly once. In the patent, four pixels per quad are only given as an example. This comes as no surprise, given that a different number of pixels per quad doesn't change the basic architecture.
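To make the quad idea more tangible, here is a minimal sketch of how covered pixels could be grouped into 2*2 quads. The function names, the lane numbering and the coverage mask are our own assumptions for illustration; neither the patent nor nVidia specifies them in this form.

```python
def quad_of(x, y):
    """Return (origin, lane) of the 2x2 quad containing pixel (x, y).

    origin is the top-left pixel of the quad; lane is 0..3 in row-major
    order inside the quad (an assumed numbering, purely illustrative).
    """
    origin = (x & ~1, y & ~1)        # clear the lowest bit of each coordinate
    lane = (y & 1) * 2 + (x & 1)     # position of the pixel inside its quad
    return origin, lane


def group_into_quads(pixels):
    """Group covered pixels into quads, each with a 4-entry coverage mask."""
    quads = {}
    for x, y in pixels:
        origin, lane = quad_of(x, y)
        mask = quads.setdefault(origin, [False] * 4)
        mask[lane] = True            # mark this pixel of the quad as covered
    return quads


if __name__ == "__main__":
    # Three covered pixels in one row end up in two separate quads.
    print(group_into_quads([(0, 0), (1, 0), (2, 0)]))
```

In this model, three covered pixels along a triangle edge still occupy two quads, so eight pixel slots would be sent down the pipeline for three useful pixels.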
As can be seen, the whole pipeline consists of several functional units and FIFO (First In First Out) buffers. The FIFO buffers are there to make sure the necessary data is in the right place at the right time. The size of such a buffer can be of great relevance to performance. A buffer that is too small can lead to pipeline stalls, while an oversized buffer wastes transistors that could be put to good use in other parts of the chip. A balanced solution has to be found.
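The trade-off between buffer size and stalls can be illustrated with a toy simulation. This is only a sketch under invented assumptions: the producer rate (one item every second cycle), the consumer's blocked phase (4 out of every 16 cycles) and the function name `count_stalls` are ours and do not come from the patent.

```python
from collections import deque


def count_stalls(depth, cycles=64):
    """Count cycles in which the producer cannot push into a FIFO of `depth`.

    Assumed toy model: the producer creates one item on every even cycle;
    the consumer is blocked during the first 4 of every 16 cycles and
    otherwise removes one item per cycle.
    """
    fifo = deque()
    pending = 0      # items the producer still has to hand over
    stalls = 0
    for c in range(cycles):
        consumer_ready = (c % 16) >= 4
        if consumer_ready and fifo:
            fifo.popleft()           # consumer takes the oldest entry
        if c % 2 == 0:
            pending += 1             # producer has a new item
        if pending:
            if len(fifo) < depth:
                fifo.append(c)       # room available: hand the item over
                pending -= 1
            else:
                stalls += 1          # FIFO full: producer has to wait
    return stalls


if __name__ == "__main__":
    for depth in (1, 2, 8):
        print(depth, count_stalls(depth))
```

In this particular model a single-entry FIFO makes the producer wait whenever the consumer is blocked, two entries are enough to hide the blocked phase, and eight entries bring no further benefit, which is the balance the text describes.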
Now let's have a glance at the functional units of the pixel pipeline before we start a detailed analysis of the functionality.
Triangle Unit: This part calculates values that are constant across a triangle. Essentially these are the factors that the pipeline uses to interpolate colors, texture coordinates, etc. later on.
Shader Core: A complex FPU that has a special relevance when it comes to producing texture coordinates.
Texture: This unit performs all tasks related to sampling of textures. This includes calculating the correct positions inside a texture as well as merging and filtering of single texel values.
Shader back end: A mostly hard-wired unit that is responsible for interpolation of vertex colors and several preparation tasks for the following stage.
Combiners: This unit has been included in nVidia chips for a long time, though it has undergone several modifications since its introduction. With regard to its implementation in the CineFX pipeline, nVidia has been remarkably silent. We'll present the solution we consider most likely further down in this article.
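As an illustration of what the Triangle Unit in the list above does, the following sketch computes per-triangle factors once and then reuses them for every pixel. It assumes simple linear (screen-space) interpolation without perspective correction; the function names and the plane-equation form are our own choice for illustration, not taken from the patent.

```python
def triangle_setup(v0, v1, v2):
    """Compute (A, B, C) so that attr(x, y) = A*x + B*y + C.

    Each vertex is (x, y, attr). The result is constant across the
    triangle, so it only has to be calculated once per triangle.
    """
    (x0, y0, a0), (x1, y1, a1), (x2, y2, a2) = v0, v1, v2
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    if det == 0:
        raise ValueError("degenerate triangle")
    A = ((a1 - a0) * (y2 - y0) - (a2 - a0) * (y1 - y0)) / det
    B = ((a2 - a0) * (x1 - x0) - (a1 - a0) * (x2 - x0)) / det
    C = a0 - A * x0 - B * y0
    return A, B, C


def interpolate(factors, x, y):
    """Evaluate the attribute at pixel (x, y) using the per-triangle factors."""
    A, B, C = factors
    return A * x + B * y + C


if __name__ == "__main__":
    factors = triangle_setup((0, 0, 0.0), (4, 0, 8.0), (0, 4, 4.0))
    print(factors, interpolate(factors, 1, 1))
```

The point of the split is that the expensive division happens once per triangle, while each pixel only needs two multiplications and two additions per interpolated value.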