Inside ATI's R420
May 4, 2004 / by aths / Page 3 of 4
The pipeline
Our pipeline diagram applies to R300, R360 and R420 alike. Because two different architectures (NV vs. ATI) are difficult to compare, we had to simplify the images somewhat. The tables expose more detail, and the written text more still. Let's have a look at the diagrams now:
Left: R420's pipe. Right: NV40's pipe. Both are somewhat simplified.
At first glance, NV40's pipe looks much "bigger" (and therefore better) than R420's, but the image does not tell the whole story. Let's compare the details:
Feature | R300/R350/R360/R420 | NV40 |
Texture-Ops per Pipe | 1 | 1 |
Separate Tex-Unit | Yes | No |
Split | 4:0, 3:1 | 4:0, 3:1, 2:2 |
All-purpose-FPU? | Yes | No |
Number of FPUs | 1 | 2 |
R420 has its own separate texture logic, while in the NV40 pipe any texture operation blocks at least part of a shader unit. NV40 counters this very successfully with an additional FPU. (Or shader unit; we use both terms synonymously.)
Because NV40 can also "bundle" two vector2 instructions (instead of Vec3 + Scalar only), it looks like the NV40 pipe should be more efficient. But we have to keep in mind that every NV40 shader unit is limited to its own specific set of operations. (As an exception, a multiply can be done in both units.) R420's FPU, on the other hand, operates as a general-purpose shader unit. It is of course easier to rearrange the shader code (the mathematical expression must not be changed!) if any FPU can do anything. R420's shader unit does not in fact "split": it consists of two independent units, a vector3 and a scalar FPU, which can be combined for vector4. This raises the chances of keeping the units fully loaded compared to a splittable FPU that is limited to some instructions. We will detail this topic in an upcoming article.
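The split modes from the table can be read as a simple pairing rule. The following is a minimal sketch, not a model of the real hardware: it only checks whether two instructions, described by their component counts, fit into one of the listed splits. The names `SPLITS` and `can_co_issue` are hypothetical, and the sketch deliberately ignores the fact that each NV40 shader unit supports only its own set of operations.

```python
# Split modes taken from the comparison table in the article.
# Each tuple is (components in the first slot, components in the second slot).
SPLITS = {
    "R420": [(4, 0), (3, 1)],
    "NV40": [(4, 0), (3, 1), (2, 2)],
}


def can_co_issue(chip, width_a, width_b=0):
    """Return True if two instructions of the given widths fit one split.

    Assumption: only component counts matter. Per-unit instruction
    restrictions (important on NV40) are not modelled here.
    """
    wide, narrow = max(width_a, width_b), min(width_a, width_b)
    return any(wide <= first and narrow <= second
               for first, second in SPLITS[chip])


for chip in SPLITS:
    print(chip, "vec3 + scalar:", can_co_issue(chip, 3, 1))
    print(chip, "vec2 + vec2:", can_co_issue(chip, 2, 2))
```

In this simplified view, the only pairing NV40 gains over R420 is vec2 + vec2. Whether that translates into real throughput depends on the per-unit restrictions the sketch leaves out.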
So the Radeon pipe is very efficient, while Nvidia comes with a more brute-force solution. GeForce FX and GeForce 6 Series, on the other hand, offer faster calculation of some special functions than ATI's parts. In practice, two restricted (not general-purpose) shader FPUs like those found in NV40 easily outweigh a more flexible but single FPU. For an average "real world" shader with long arithmetic parts, NV40 should be faster when compared clock-for-clock. ATI has to counter with higher clock speeds, and in fact it does. If we have the chance, we will try to run very long arithmetic shaders to compare the actual calculation speed of both architectures. It looks like the shader compiler for NV40 is still not fully optimized, so we will have to wait a while before we can make a fair comparison.
Let's look at some more details:
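The clock-for-clock argument boils down to simple arithmetic: throughput is work per clock times clock speed, so a chip that does less per clock breaks even once its clock is higher by the same factor. A minimal sketch of that relationship follows. The function name is hypothetical and the input numbers are placeholders chosen for illustration, not measurements of either chip.

```python
def break_even_clock_ratio(ops_per_clock_a, ops_per_clock_b):
    """Clock ratio (chip B / chip A) at which B matches A's throughput.

    Throughput is modelled as ops_per_clock * clock, nothing more.
    """
    if ops_per_clock_b <= 0:
        raise ValueError("ops_per_clock_b must be positive")
    return ops_per_clock_a / ops_per_clock_b


# Placeholder inputs, NOT measurements: suppose chip A sustains 1.5
# arithmetic operations per pipe per clock on some shader, chip B 1.0.
ratio = break_even_clock_ratio(1.5, 1.0)
print(f"Chip B needs {ratio:.2f}x the clock of chip A to break even")
```

Real per-clock figures depend on the shader and on how well the compiler schedules it, which is why measured results are needed before drawing conclusions.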
Feature | R300/R350/R360/R420 | NV40 |
TMUs per Pipe | 1 | 1 |
Dedicated scale-logic for the TMU | No | Yes |
Mini-FPUs for arithmetic | Yes | Yes (one for each shader unit) |
The NV40 pipe is clearly built to do as much as possible per clock (although some units cannot always be kept busy), while the R420 pipe (and of course the R300 pipe) was designed to avoid as many idle clocks as possible. Given these facts, you'll have to make up your own mind about which you prefer.