Inside ATI's R420
May 4, 2004 / by aths / Page 1 of 4
In short, R420 (Radeon X800) is a speed upgrade of R360 (Radeon 9800), while R360 was itself a speed upgrade of R300 (Radeon 9700). For R360, ATI increased the clock rates (and also included an Overdrive option), but R420 got 50% more vertex and 100% more pixelshader power per cycle, and it is, once again, clocked higher than its predecessor. R420 also comes with some nice technical improvements. We will reveal all of them in a moment.
To understand the workings of R420, we first need to have another look at R300. This chip was ATI's third shot at reaching for the top. R100, ATI's original Radeon, was very advanced, feature-wise, compared to any other product at the time. For example, while Nvidia's NSR (Shading Rasterizer) was considered state of the art in the year 2000, ATI's "pixel tapestry" engine was far more flexible and powerful. R100 already delivered dependent reads and thus very handy bumpmapping techniques such as EMBM, while GeForce2 only provided Dot3-BM (which R100 handled, too).
The second shot was R200 (Radeon 8500). Looking at the features, this card easily beat the mighty GeForce3 as well. This Radeon offered more precision and far more flexible combiners, and it also delivered impressive performance. The only drawbacks were antialiasing (slow because it used supersampling, and stuck with an inefficient grid due to a bug in the Smoothvision subsystem) and anisotropic filtering (still limited to bilinear, only covering angles in the vicinity of 90°, and still suffering from LOD calculation issues which led to texture shimmering under some circumstances).
Almost all of these disadvantages were removed with R300. This chip offers 24-bit floating point calculations throughout the pixel pipeline, while its competitor at the time still handled only 9-bit fixed point. (Both GeForce3/4 Ti and Radeon 9700 perform certain pixel calculations in FP32.) Radeon 9700 was also the first card with a 256-bit wide DDR memory interface plus an efficient crossbar controller. Beside raw power, Radeon 9700 was the first card to feature DirectX9 acceleration, provided improved anisotropic texture quality compared to Radeon 8500 (reduced angle dependency, trilinear AF), and superb antialiasing which is without match even today. With this chip and the product refreshes based on R360/RV360, the Canadians easily became number one among 3D enthusiasts focusing on games, worldwide.
In the past, ATI delivered not only the best feature set but also the performance to actually use it. Another big GPU developer tends to deliver features first, performance later. While this still holds true for the latter even today, ATI changed their objectives somewhat. R300's impact was only in part self-made: ATI also profited from Nvidia's weaknesses at the time. Undoubtedly, R300 is an excellent chip, but its creation was only possible by accepting compromises. No IHV can conjure up a magic GPU; every engineer has to decide what has to be sacrificed to strengthen other features.
ATI's compromise was precision. As this issue is quite unnoticeable in almost any current game and application, we can consider it a "theoretical issue". Anyway, ATI's basic technology used in R420 is about two years old. That's a long time in this graphics business. We will discuss this topic later and look at the new features first. Then we'll have a closer look at the pipeline.
3Dc

For obvious reasons we are quite fond of the name, but we also like the feature itself :) 3Dc is a technique for compressing normal maps. Normal maps contain the information for bumpmapping and are commonly stored in textures. Because normal maps can be quite large, it is a good idea to compress them. Previously used texture compression algorithms work with two reference colors and interpolate between them. Applied to normal vectors, this leads to blocky artefacts, because the interpolated vectors end up shorter than those in the uncompressed map.
In some ways, 3Dc works similarly to S3TC. The texture is split up into tiles of 4x4 texels. For every such tile, two reference colors (or normals, in this case) are saved. The other colors (or, again, normals) are interpolated. Some compression formats use 4 values in total (2 reference + 2 interpolated), others 8 values (2 reference + 6 interpolated). The latter method needs more memory, since 3 bits instead of 2 are needed per texel, but it also results in higher quality. 3Dc uses 3 bits for every saved texel (or, again, normal; for the compression algorithm it's just "data").
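The decoding side of such a scheme can be sketched in a few lines. This is a minimal Python illustration of decoding one channel of a 4x4 tile, assuming the 8-value variant described above (2 reference values plus 6 interpolated); the function name and the exact bit layout are ours, not ATI's:

```python
def decode_3dc_channel(ref0, ref1, indices):
    """Decode one channel of a 4x4 tile: two 8-bit reference
    values plus a 3-bit palette index per texel.
    Illustrative sketch, not ATI's exact bit layout."""
    # Build the 8-entry palette: the two references, followed by
    # six evenly spaced values interpolated between them.
    palette = [ref0, ref1] + [
        ((7 - i) * ref0 + i * ref1) // 7 for i in range(1, 7)
    ]
    # Each 3-bit index simply selects a palette entry.
    return [palette[i] for i in indices]
```

A tile's worth of one channel is thus 2 x 8 bits of references plus 16 x 3 bits of indices, i.e. 64 bits; with two channels per normal, that gives the 16 bytes per tile discussed below.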
Normal maps don't necessarily have to store all three components for the X, Y and Z axes. Usually, any normal has a normalized length of 1.0. With this length implied, we can discard one axis value, because the unit-length constraint still gives us enough information to reconstruct the full vector. Both the reconstruction of the Z value (which is not stored in 3Dc) and the re-normalization can be done in the pixelshader. Naturally, this costs some performance.
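The reconstruction step amounts to solving the unit-length equation for Z. A small Python sketch of what the pixelshader has to do (assuming a tangent-space normal map, where Z points out of the surface and is therefore non-negative):

```python
import math

def reconstruct_normal(x, y):
    """Rebuild a unit normal from its stored X and Y components.

    x, y are in [-1, 1]; Z is assumed non-negative, which holds
    for tangent-space normal maps."""
    # Unit length implies x^2 + y^2 + z^2 = 1, so Z follows
    # from the two stored components (clamped against rounding).
    z = math.sqrt(max(0.0, 1.0 - x * x - y * y))
    # Re-normalize to compensate for interpolation and
    # quantization errors introduced by the compression.
    length = math.sqrt(x * x + y * y + z * z)
    return (x / length, y / length, z / length)
```

On real hardware this costs a couple of shader instructions per pixel, which is the performance price mentioned above.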
3Dc needs to be supported by the developers. The chip can only decompress such textures; the content has to be provided already compressed. ATI offers such tools free of charge. So it is no big deal to implement it even for existing games: the application / engine has to determine hardware support for 3Dc, then utilize a special FourCC texture format, and somebody has to compress all existing normal maps. Due to the lower memory consumption, this comes in handy for cards with "only" 128 MB. In addition, lower storage space means lower bandwidth requirements, which can also result in a nice performance boost.
We are still trying to figure out the exact advantages of 3Dc over DXT5 with renormalization in the pixelshader, apart from the obvious memory savings. ATI's claimed compression ratio of 4:1 also looks a bit strange to us. A tile with 4 x 4 = 16 values compressed with 3Dc takes 16 bytes. Since a common normal map does not necessarily have to store alpha, these 16 values take only 48 bytes uncompressed, so the actual ratio is 3:1. The trick of reconstructing Z is possible independently of 3Dc, which brings an uncompressed two-channel map down to 32 bytes and the effective compression to only 2:1. DirectX already exposes two-channel texture formats usable for storing normal maps.
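The arithmetic behind these ratios is easy to check. A quick back-of-the-envelope calculation in Python, using the tile sizes discussed above (the 16-byte figure follows from two channels of two 8-bit references plus sixteen 3-bit indices each):

```python
TEXELS = 4 * 4  # one tile

# One 3Dc tile: 2 channels x (2 x 8-bit refs + 16 x 3-bit indices)
bytes_3dc = 2 * (2 * 8 + TEXELS * 3) // 8   # = 16 bytes

bytes_xyz8 = TEXELS * 3   # uncompressed X/Y/Z, 8 bits each: 48 bytes
bytes_xy8 = TEXELS * 2    # two channels only, Z reconstructed: 32 bytes

ratio_vs_xyz = bytes_xyz8 / bytes_3dc   # 3:1, not the claimed 4:1
ratio_vs_xy = bytes_xy8 / bytes_3dc     # 2:1 against a two-channel map
```

The 4:1 claim only works out if you compare against a four-channel (alpha-carrying) 64-byte tile that a normal map doesn't actually need.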
"Temporal" Antialiasing

The term "temporal" may not be correct, strictly speaking, but ATI uses it anyway as a marketing term for R420's alternating-grid antialiasing technology. Every odd frame gets one subpixel mask, while every even frame gets another. Due to the afterglow inherent to every monitor, these two masks effectively combine into a better "virtual subpixel mask", resulting in a higher edge equivalent resolution (EER).
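The mechanism itself is simple frame-parity switching between two sample patterns. A minimal Python sketch with two hypothetical 2x masks (the positions are made up for illustration; the real driver uses its own patterns on the chip's internal grid):

```python
# Two hypothetical 2x sample patterns in pixel-local coordinates.
MASK_A = ((0.25, 0.75), (0.75, 0.25))
MASK_B = ((0.25, 0.25), (0.75, 0.75))

def sample_positions(frame_number):
    """Return the subpixel sample positions for a given frame:
    odd frames use one mask, even frames the other."""
    return MASK_A if frame_number % 2 else MASK_B
```

Over two consecutive frames the union of both masks covers four distinct positions, which is where the "virtual" 4x quality from 2x samples is supposed to come from.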
Honestly, we doubt this is a good idea. Let's have a closer look at "Temporal" Antialiasing: using it at 2x AA is not advisable, because the differences between the frames will be too large. You would have to run a monitor at very high refresh rates and play the game at gigantic frame rates to make it work right. With the power you spend on making the game run this fast, it would be a better idea to use 4x (or 6x) AA instead of 2x "Temporal" AA to begin with. Theoretically, 4x with an alternating mask could deliver the same quality as an 8x sparse grid.
As far as we know, R420's internal AA grid is still 12x12, so it is not possible to take full advantage of the technology and approach such an efficient "virtual sparse" grid. 6x could be used to offer a virtual 12x mode, but 6x still eats serious amounts of performance. If the frame rate drops, the driver has to disable the temporal feature to avoid noticeable "jitter" along the edges. On the other hand, if there is (more than) enough performance left for 6x, why not use the excess power for additional effects or for running the game at a higher resolution?
In short, this technology depends on afterglow, while afterglow is actually an undesired "feature" that display manufacturers are constantly trying to minimize. Although TFT displays have long "afterglow", T-AA is not a good choice for TFT screens either, due to their non-linear color changes during the glow phase. Also, independent of display type, any polygon in motion gains no additional quality: due to its "temporal" nature, this AA technology can only raise the quality of edges standing still. Finally, to make sure that every pixel receives its alternating mask, you have to enable VSync, which often results in a big performance impact, at least if you can't enable Triple Buffering.
Of course, this chapter deals with technology only. Maybe some people like Temporal AA; that's their own decision. Technologically, though, we think it's fair to consider this "improved" AA a gimmick with no future. But as we've been known to make false judgements at times, only time will tell. We want to add, though, that "Temporal" AA is purely a driver trick, using "programmable" (changeable) multisampling subpixel positions, and should work with any R300-based card, too. (In fact, ATI announced this feature will be delivered in future drivers for R300-based cards.)