Inside nVidia NV40
April 14, 2004 / by aths / page 1 of 6
Let us jump back in time to 2002. It's almost summer, and nVidia's GeForce4 Ti receives the highest honours from the press. While there obviously are some flaws (e.g. poor DVD quality compared to its competitors, low anisotropic performance in conjunction with bilinear filtering), the GF4 provides the best overall performance, the best performance with antialiasing enabled, very good texture quality with anisotropic filtering, and excellent compatibility for gamers. At this point, the development of NV40 starts: a completely new video card generation takes nVidia about 20 to 22 months to develop (while a refresh chip can be done in 8 to 9 months).
Back to the present. nVidia didn't manage to repeat the GF4's success with its successors, the FX series. While the 5800 can safely be labeled "a disaster", the follow-up chips also do not fully meet the requirements of the enthusiast gamer market.
ATI's R300 hit nVidia hard. The majority of our readers consider ATI the king of the hill. We cannot guess how this may change with today's launch of NV40, but we can provide benchmarks, screenshots, and a slew of in-depth details to help you make up your own mind. We will talk about Pixel and Vertex Shaders 3.0 (Shader Model 3.0), the new antialiasing and anisotropic filtering algorithms, bandwidth limitations, and lots of other stuff you may not expect. Let's start with the pipeline configuration.
26 February, Munich, nVidia German headquarters: We receive the first detailed briefing on NV40 from David Kirk.
16 - Does this number really matter?
In short: We have to stop simply counting "pipelines." For DirectX9 pixel shaders, an R300 pipe can filter one bilinear texture sample and additionally perform up to two arithmetic instructions per clock. Let's abbreviate this to 1T+2M (1 texture sample, 2 math). An NV35 pipe can filter two bilinear texture samples and perform up to three arithmetic instructions (in most cases only up to two), which corresponds to 2T+3M in a best-case scenario and normally to 2T+2M. As you can see, even today it makes little sense to count only the number of pipelines without considering the power of a single pipe.
Before we have a closer look at NV40's pipes, let's have a quick overview: A single NV40 pipe can filter only one bilinear texture sample per clock, but performs up to four arithmetic instructions! In our terms: 1T+4M. This is a major shift from texture sampling to arithmetic power. But it makes sense: While nearly all DirectX8 pixel shaders are fillrate-bound (due to the limitations of DX8 hardware, long arithmetic sequences are not possible anyway), DirectX9 hardware provides both the precision and the general-purpose instruction set to allow for long shader calculations. NV40 is designed for such future games and applications.
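To make the xT+yM shorthand concrete, here is a minimal Python sketch that tabulates the per-pipe figures quoted above and the ratio of arithmetic to texture work each design favours. The `Pipe` class and the helper names are hypothetical, introduced only for illustration; the numbers are peak per-clock values from the text, not measured throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipe:
    name: str
    tex: int   # bilinear texture samples per clock (the "T")
    math: int  # peak arithmetic instructions per clock (the "M")

    def label(self) -> str:
        return f"{self.tex}T+{self.math}M"


# Per-pipe peak figures as quoted in the article.
PIPES = [
    Pipe("R300", 1, 2),
    Pipe("NV35 (typical)", 2, 2),
    Pipe("NV35 (best case)", 2, 3),
    Pipe("NV40", 1, 4),
]


def math_per_texture(pipe: Pipe) -> float:
    """Arithmetic instructions available per texture sample."""
    return pipe.math / pipe.tex


def chip_peak(pipe: Pipe, pipe_count: int) -> tuple:
    """Peak (texture samples, arithmetic instructions) per clock for a chip."""
    return pipe.tex * pipe_count, pipe.math * pipe_count


if __name__ == "__main__":
    for p in PIPES:
        print(f"{p.name:18s} {p.label():6s} math per texture: {math_per_texture(p):.1f}")
```

Calling `chip_peak` with the NV40 entry and its 16 pipes gives the theoretical per-clock ceiling for the whole chip. Real shaders will land below that, since these are best-case figures.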
In fact, older games are completely bandwidth-bound on NV40 (aka GeForce 6800 Ultra). To bring NV40's advantages to bear, you have to run modern shaders. We will come back to those bandwidth issues later.
Some time ago, nVidia claimed that NV30 had eight pipelines. This count rests on a neat trick in the raster operation units (ROPs). Since NV30, a single ROP works as if it were two when it only has to process Z-only passes. Of course this does not double the actual number of pipes. But it is a clever way to increase ROP power where it's needed most: Such an improved ROP is not as good as two "normal" ones, but it can raise the Z-only fillrate by a factor of 2. This is quite handy for titles such as the upcoming Doom3.
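The arithmetic behind the ROP trick can be sketched in a few lines of Python. The function name and its parameters are hypothetical, and the 500 MHz clock in the usage note is only an illustrative value, not a figure from this article. The four ROPs correspond to four pixels per clock.

```python
def peak_fillrate(rops: int, clock_mhz: float, z_only: bool = False,
                  z_factor: int = 2) -> float:
    """Peak values written per second, in millions.

    With z_only=True every ROP counts as z_factor ROPs, which models
    the doubled Z-only rate described in the text.
    """
    per_clock = rops * (z_factor if z_only else 1)
    return per_clock * clock_mhz


if __name__ == "__main__":
    clock = 500.0  # illustrative clock in MHz (assumption)
    print("colour + Z:", peak_fillrate(4, clock), "Mpixels/s")
    print("Z only:    ", peak_fillrate(4, clock, z_only=True), "Mzixels/s")
```

With these inputs the Z-only rate comes out at twice the colour rate, while the number of pixels that can be rendered per clock stays at four.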
Luciano Alibrandi (Europe PR, products) and David Kirk (Chief Scientist).
What is a pipe - and what is not?
A Z-only value has no color (it is not black, but without any color at all), so a "zixel" is not a pixel. That's why we consider the old talk about an 8-pipe NV30 to be FUD. By our definition, a unit that cannot render a pixel is not a "common pipe." nVidia's attempts to redefine the number of pipelines were motivated only by the fact that the Radeon 9700 simply had more "common pipes."
To make things even more complicated, we have to state that a common pipe isn't a real pipeline. NV30's pixel part, for example, consists of only a single pipeline. This single pipeline can render four pixels at the same time. But it is common usage to speak of four pipes if four pixels are rendered per clock. What we have to be concerned about with these four pixels (arranged in a 2x2 block called a "quad") is the efficiency loss at triangle boundaries: the full quad is rendered, even if only one pixel of the quad is part of the triangle.
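The quad overhead can be illustrated with a small Python sketch. It tests pixel centers against a triangle and counts how many pixels are actually covered versus how many get processed once every touched 2x2 quad is rendered in full. The function names are hypothetical, and the simple inside test is an assumption for illustration; it does not model the fill rules or any other detail of real hardware.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area test: which side of edge a->b the point p lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def quad_cost(tri, width, height):
    """Return (covered_pixels, processed_pixels) for one triangle.

    A pixel is covered if its center lies inside the triangle.
    Every 2x2 quad containing a covered pixel is processed in full.
    """
    (ax, ay), (bx, by), (cx, cy) = tri
    covered = 0
    quads = set()
    for y in range(height):
        for x in range(width):
            px, py = x + 0.5, y + 0.5  # sample at the pixel center
            w0 = edge(ax, ay, bx, by, px, py)
            w1 = edge(bx, by, cx, cy, px, py)
            w2 = edge(cx, cy, ax, ay, px, py)
            # Accept either vertex winding order.
            inside = (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
                     (w0 <= 0 and w1 <= 0 and w2 <= 0)
            if inside:
                covered += 1
                quads.add((x // 2, y // 2))
    return covered, 4 * len(quads)


if __name__ == "__main__":
    tiny = ((0.4, 0.4), (0.8, 0.4), (0.4, 0.8))   # covers a single pixel
    large = ((0.0, 0.0), (8.0, 0.0), (0.0, 8.0))  # covers the whole 4x4 grid
    for name, tri in (("tiny", tiny), ("large", large)):
        covered, processed = quad_cost(tri, 4, 4)
        print(f"{name}: {covered} covered, {processed} processed, "
              f"efficiency {covered / processed:.0%}")
```

The tiny triangle shows the worst case from the text: one visible pixel still costs a full quad. Large triangles lose efficiency only along their edges, so the smaller the triangles in a scene, the more the quad granularity hurts.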
But, contrary to some persistent rumours, NV40 actually does come with a full 16 "common pipes", i.e. it is a design with four "quad pipelines."
To recapitulate, we need to know how the chip performs, not how many "pipes" it is made of. We will now look at an NV40 pipe in detail and then discuss how all this quad stuff influences real performance (that is, what most people care about).