Doom 3: Why is Radeon X800 Series behind GeForce6?
July 30, 2004 / by aths
To come straight to the point: this article takes for granted that the id Software benchmarks published on HardOCP are representative. Although this is currently the only source, we see no reason to doubt its validity. Unfortunately, the results are being exploited uncritically by some for advertising purposes and dismissed by others.
But whichever way you look at it, the GeForce 6800 series is out in front by approximately 20 to 25 percent. Even more striking: normalized for clock speed, the NV40 GPU does roughly 50 percent more work per clock than the R420 GPU.
At first glance, the strong showing of the GeForce 6800 series does not fit the usual picture: wherever modern engines are used, the Radeon 9800/Pro/XT cards performed much better than the GeForceFX 5900/5950 cards. The new generation (X800 vs. 6800) paints a more balanced picture: the GeForce 6800 series can play to its strengths in some games without clearly pulling away, while in other games the top performance clearly comes from ATI. Doom 3 is now the outlier from this familiar pattern.
What is the reason?
The first idea would be that John Carmack optimized especially well for GeForce cards and has not yet exhausted the Radeons' potential. But this is objectively false. As we will see, John Carmack has done a brilliant job of optimizing for Radeon cards.
The second idea would be that features found only on modern GeForce cards, such as Ultra Shadow (from the NV35 onwards), are responsible for the performance. And finally, one could assume that the NV40's shader performance plays a role. Both, however, are only side effects. Ultra Shadow relieves the pixel units, which are clocked lower than those of the Radeon cards. As far as shader capabilities go, Doom 3 is quite undemanding for DirectX9-class hardware. Even the Radeon 8500 (DirectX 8.1) can calculate one light source per pixel in a single pass, and any DirectX9-class hardware (with reasonable OpenGL drivers) can do so all the more (since the Doom 3 engine uses OpenGL, a comparison via DirectX tech levels is always a somewhat lame one).
Multi-pass solutions reduce quality (since intermediate results are rounded to the lower frame-buffer precision) as well as performance (since additional write and read cycles to graphics memory are required). Only GeForce1 to GeForce4 Ti cards are forced into multi-pass for the lighting calculation (since Doom 3 also computes shadows, a complete frame always requires more than one pass anyway, which loads the CPU and bandwidth). The higher arithmetic pixel-shader performance of the GeForce 6800 series pays off mainly with long shaders, but you will look for such shaders in vain in Doom 3. In Doom 3, the Radeon X800 series' higher clock compensates for the GeForce 6800 series' higher per-clock shader performance.
We are deliberately keeping things a little vague, as no detailed benchmark analyses exist yet. Today we are only interested in the big picture, so please forgive us any mistakes in the details. One thing, however, is already clear: the main advantage of the GeForce 6800 series in Doom 3 is that it applies its shader performance where it is actually needed. This has nothing to do with Shader Model 3.0 in Doom 3's case.
Doom 3 renders the Z-buffer first, so that the depth information is already available before the color is rendered. For the exact shadow computation à la Doom 3 this is necessary anyway. Thanks to Early-Z, modern cards can use the finished Z-buffer to exclude many invisible pixels from rendering and thus limit the performance wasted on (unavoidable) overdraw.
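To make this concrete, here is a minimal OpenGL-style sketch of such a depth-only pre-pass. It only illustrates the principle described above; it is not id Software's code, and draw_scene_geometry() is a hypothetical placeholder:

#include <GL/gl.h>

void draw_scene_geometry(void);    /* hypothetical placeholder for the engine's draw calls */

void depth_prepass(void)
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);

    /* Fill only the Z-buffer; color writes stay switched off. */
    glEnable(GL_DEPTH_TEST);
    glDepthFunc(GL_LESS);
    glDepthMask(GL_TRUE);
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    draw_scene_geometry();

    /* All later passes test against this finished Z-buffer, so Early-Z can
       reject hidden pixels before they are ever shaded.                    */
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);
}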
The NV40 GPU, like every GeForce since the GeForce3, always tests entire quads for visibility before a quad enters (or does not enter) the pipeline. A quad is a block of 2x2 pixels. GeForce3 through GeForceFX can drop up to 4 quads per clock (in other words, 16 pixels), while the GeForce 6800 series manages 16 quads per clock, i.e. 64 pixels. That sounds like a lot, but a Radeon 9700 has been doing the same for a long time. Radeon cards use a hierarchical test: first, very large tiles (of 8x8 pixels) are tested. If such a tile cannot be dropped completely, it is divided into four tiles of 4x4 pixels each; in the end, Radeon cards also work on quads. The R420/R423 GPUs can drop up to 256 pixels per clock, which is, as far as we know, the record.
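As an illustration of the principle, the following toy model in C shows how such a hierarchical test rejects work in coarse steps. It is our own sketch, assuming a "less than" depth test and a per-tile record of the farthest depth already stored; it is not ATI's actual hardware logic:

#include <stdbool.h>
#include <stdio.h>

/* Per-tile occlusion record: the farthest depth already stored in the tile
   (convention: smaller depth = nearer, depth test = "less than").          */
typedef struct { float max_stored_depth; } Tile;

/* A whole tile can be skipped if even the nearest incoming depth is not
   nearer than the farthest value already in the Z-buffer for that tile.    */
static bool tile_rejected(Tile t, float incoming_min_depth)
{
    return incoming_min_depth >= t.max_stored_depth;
}

/* Hierarchy: test the 8x8 tile first; if it survives, test its four 4x4
   sub-tiles; whatever survives those is finally handled as 2x2 quads.      */
static int surviving_quads(Tile t8, const Tile t4[4], const Tile quads[16],
                           float incoming_min_depth)
{
    if (tile_rejected(t8, incoming_min_depth))
        return 0;                              /* 64 pixels dropped in one step */

    int alive = 0;
    for (int i = 0; i < 4; ++i) {
        if (tile_rejected(t4[i], incoming_min_depth))
            continue;                          /* 16 pixels dropped */
        for (int q = 0; q < 4; ++q)
            if (!tile_rejected(quads[i * 4 + q], incoming_min_depth))
                ++alive;                       /* this 2x2 quad must be rendered */
    }
    return alive;
}

int main(void)
{
    /* Made-up example values, just to exercise the model. */
    Tile t8 = { 0.5f };
    Tile t4[4] = { { 0.4f }, { 0.5f }, { 0.3f }, { 0.5f } };
    Tile quads[16];
    for (int i = 0; i < 16; ++i)
        quads[i].max_stored_depth = t4[i / 4].max_stored_depth;
    printf("quads left to render: %d\n", surviving_quads(t8, t4, quads, 0.45f));
    return 0;
}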
To recapitulate: on the one hand, Radeon cards have superior Early-Z occlusion performance; on the other hand, the Radeon mechanism is more fragile. Radeon cards were not designed with Doom 3 in mind. With Z-tests done the Doom 3 way, Hierarchical Z does not always work, and then pixels can only be dropped at the quad level. Performance drops further with 4x antialiasing, whereas the GeForce FX/6 can drop pixels with the same efficiency regardless of the AA mode.
In short: in every game, both cards do some "invisible" work, because during rendering some pixels are later overdrawn. Since texture changes cost a lot of time and pre-sorting objects is very CPU-intensive, Doom 3 (like any other game) renders with an overdraw factor of more than 2, meaning more than twice as many pixels are computed as are ultimately visible. Because in Doom 3 the Z-buffer is already complete before the color pass starts, most of the invisible pixels can nevertheless be dropped, provided Early-Z occlusion is supported. Whoever sorts out invisible pixels the fastest wins. The GeForce 6800 series' somewhat lower Early-Z peak rate is offset by its "stable" operation: even with an unusual Z-test à la Doom 3, the performance is maintained.
Discarding quickly is decisive
Let's look at this more closely. At the start, Doom 3 fills the Z-buffer; here the Z-test passes if the incoming value is smaller than the value already in the Z-buffer. In the first shadow pass, however, the test passes if the new value is greater than the old one. But HyperZ only works as long as the Z-test mode is not changed within the frame. Even then, each Radeon quad pipeline can still reject a quad when the test shows that nothing in the quad needs to be rendered. The 16 pipelines of the R420/R423 GPUs form four quad pipes, so a maximum of 4 quads, i.e. 16 pixels, can be rejected per clock.
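Expressed as render-state changes, the situation looks roughly like the sketch below. It is our own illustration of the description in this paragraph, not Doom 3's actual code; the helper functions are hypothetical and the exact compare functions may differ:

#include <GL/gl.h>

void draw_depth_prepass(void);     /* hypothetical helpers */
void draw_shadow_volumes(void);

void build_z_then_render_shadows(void)
{
    /* Z-fill pass: a fragment wins if its depth is SMALLER than the stored value. */
    glDepthFunc(GL_LESS);
    glDepthMask(GL_TRUE);
    draw_depth_prepass();

    /* Shadow pass: the finished Z-buffer is only read, the stencil buffer records
       the shadow volumes, and the compare direction is effectively reversed.
       Changing the Z-compare direction within one frame is exactly the case in
       which HyperZ can no longer be used.                                        */
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_FALSE);
    glDepthFunc(GL_GREATER);       /* illustrative; the exact function may differ */
    glEnable(GL_STENCIL_TEST);
    draw_shadow_volumes();
}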
A GeForce from the GeForce3 onwards manages this kind of quad rejection as well, even though GeForce cards up to and including the GeForceFX 5950 Ultra have only a single quad pipeline; on GeForce GPUs, Early-Z is performed in front of the pipelines. The GeForce 6800 series can now drop 16 quads per clock, that is, 64 pixels. Hence, at "not rendering", the NV40 GPU is four times faster under Doom 3 than the R420/R423 GPUs. As long as the Z-test method is not changed during rendering, the R420/R423 GPUs can use their HyperZ to sort out up to four 8x8 tiles (up to 256 pixels) per clock.
A Radeon X800 Pro with only three quad pipelines (i.e. 12 pixel pipelines) can drop only 12 pixels per clock in those Doom 3 passes where HierZ does not work. In addition, it does the card no favours that the R420/R423 design allows individual quad pipelines to be deactivated while the memory controller appears to be optimized for four quads per clock (for now this must be regarded as theory; only exact tests, ideally with optimized drivers, can clarify the question).
From 4x antialiasing upwards, Radeon cards suffer from the fact that Z-cull and stencil-cull are tied to the ROP logic; in concrete terms, the culling performance drops. Stencil-cull makes it possible to skip the lighting calculation wherever a quad lies in shadow. On GeForce cards this is done in front of the pixel pipelines, at least from the NV40 on and probably already from the NV35, and is therefore independent of the antialiasing mode.
The 12-pipeline GeForce 6800 is a different story: since the ROP units are not tied to the pixel pipelines in the NV40 design, this card can still operate with 16 ROPs, and that is in fact what is done. Even with 4x antialiasing enabled, the GeForce 6800 can still Z/stencil-test up to 8 pixels per clock (the Radeon X800 Pro, with its 12 pipelines, manages at most 6).
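These figures follow from simple arithmetic if one assumes, consistent with the numbers given further below, that each ROP can handle two Z/stencil operations per clock and that 4x antialiasing needs four samples per pixel: 16 ROPs x 2 / 4 = 8 pixels per clock for the GeForce 6800, and 12 x 2 / 4 = 6 pixels per clock for the Radeon X800 Pro.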
NV40: Better ROP utilization in Doom 3
The above applies to the shadow pass. During the lighting calculation, Hierarchical Z of course works, and there the Radeon hardware even drops more than the GeForce hardware does. In the color pass, however, what matters most is simply keeping invisible quads out of the pipelines; whether the dropping itself is quick or slow hardly matters, because computing the colors takes many clocks. Suppose (this number is a wild guess) it took 16 clocks to determine a color. Then fast dropping versus dropping that is up to four times slower makes for, say, 17 versus 20 clocks on average; the relative difference is small. And remember that the NV40 GPU, thanks to its more flexible pipeline and more powerful hardware instructions, gets more done per clock than the R420/R423 GPUs, so a color pass per quad needs fewer clocks.
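One way to read these example numbers: if, per visible quad, roughly one invisible quad must also be dealt with (in line with the overdraw factor of about two mentioned earlier), and rejecting it costs about one clock on the fast chip but up to four clocks on the slow one, the totals are 16 + 1 = 17 versus 16 + 4 = 20 clocks, a difference of barely 18 percent.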
In the Z and stencil tests of the shadow pass, the situation is different: the tests themselves are very quick, but a great deal has to be dropped. A chip that is slow at sorting out invisible pixels stalls the entire rendering process. Or, put the other way round: only a chip that discards the invisible quickly, thanks to special optimizations, can fully exploit its Z/stencil units.
Incidentally, Nvidia came up with something new in order to extract as much Z/stencil power as possible from a limited transistor budget. NV30 and NV35 have only one quad pipeline and can render up to four pixels per clock. So that the fill rate would not collapse completely with antialiasing enabled, these chips have two ROPs per pixel pipeline. With their 8 ROPs, these chips can perform 8 Z/stencil tests per clock (i.e. 2 quads) as long as no antialiasing is active.
A Radeon 9700 or 9800 can manage that as well, since it has 8 pipelines. Although the Radeon hardware also has 2 ROPs per pipe, the second unit can only be used with antialiasing. In theory, that would give an advantage in Z/stencil power with antialiasing; in practice, Doom 3 cannot exploit it because of the Radeon's low occlusion performance. In the color pass, computing the pixels takes so long that the extra clocks the ROPs need can be hidden completely.
The NV40 GPU has 16 ROPs, divided into groups of 4 that can be assigned to the quad pipelines as needed. Even better: on the NV40, a ROP can skip the alpha-blending operation and carry out a second Z/stencil test instead, which makes 32 Z/stencil tests per clock. With antialiasing active, an R420/R423 GPU could offer the same (16 pipelines with 2 ROPs each), but in the Doom 3 shadow pass there would be many idle clocks during the Z/stencil tests: the ROPs run dry long before new (visible) values arrive in the pipeline. The NV40 GPU, on the other hand, drops the uninteresting quads so quickly that its ROPs are utilized much better.
Conclusion
In summary: the theoretically very high culling performance of the Radeon cards is limited by several factors.
The Hierarchical Z-buffer starts with very large tiles. Especially at object edges it is unlikely that entire 8x8 tiles can be dropped, so in practical situations the actual occlusion performance is clearly below the theoretical maximum.
Rendering techniques that have been unusual until now and that require the Z-test to be changed during rendering prevent the use of Hierarchical Z, and the Radeon cards' remaining culling performance is quite low. Doom 3 uses some passes in which HierZ is therefore switched off.
With HierZ disabled, the already low culling performance drops to 50 or 33 percent with 4x or 6x antialiasing respectively.
All of this weighs on the performance of the Radeon cards. Doom 3 uses "only" bump mapping, but it uses it excessively. Every quad that can be excluded from rendering saves precious performance: the faster quads are dropped, the sooner the pipelines can devote themselves to the visible pixels. Recent GeForce cards, by contrast, are designed so that they manage this even under Doom 3 conditions without losing performance. Even the older GeForce4 Ti can use Early-Z with antialiasing enabled (unlike the still older GeForce3), which is why we assume you can still dive into the Doom 3 world with a GeForce4 Ti at good graphics quality.
Nvidia's other new "Doom 3 features" (from the GeForceFX onwards, but especially from the GeForceFX 5900 aka NV35) serve first of all to compensate for the lower clock speed. If the retail version allows the use of certain features to be switched off, detailed tests will show what individual hardware techniques actually achieve. Do not forget the influence of the CPU, which is heavily taxed by shadows à la Doom 3; anyone wanting to make their system fit for Doom 3 may well be best advised to upgrade the CPU.
Given how poorly the Radeon chips' technology actually matches Doom 3, it seems rather amazing to us how much performance John Carmack and id Software were able to extract. As far as we can judge, you will have plenty of Doom 3 fun even with a Radeon 9700, provided the rest of the system is up to it.
For the moment, however, it looks as if, besides a high-end system, a GeForce of the 6800 series is a must for Doom 3 enthusiasts who want top performance. Based on what we know today, we dare to doubt that ATI will be able to close the current gap with new drivers. The NV40 GPU even offers further Doom 3 potential that has not been tapped yet, and then there is Nvidia's legendary optimization work for selected top titles, Quake III for example.
We assume that Nvidia's lead under Doom 3 will not change. In principle this should carry over to other games based on the Doom 3 engine, although individual titles may still produce quite different results.
We would like to thank Demirug for providing detailed technical information, especially about the occlusion mechanisms, and Xmas for their technical advice.
Important update from September 20, 2004:
New insights invalidate parts of the theory in this article. We stated that the R420 is clearly slower at discarding pixels during the Z/stencil pass and that this slows down the whole rendering process. This does not seem to apply as strongly as we claimed: the NV40 may have a lower peak rate when discarding pixels, but it works at a finer granularity, so in the end there may well still be a small advantage for the NV40. There are also some situations in which the R420 cannot use its Hierarchical Z mechanism.
Still, the per-clock performance differences cannot be fully explained by that alone. Nvidia appears to have used driver optimizations similar to those that have helped win Q3A benchmarks for a long time now: the driver intercepts CPU-heavy calls and replaces them with less demanding ones. Such special optimizations probably benefit both the Q3A and the Doom 3 engine. In addition, Nvidia may have shown considerable creativity in replacing shaders; the company seems to have found a way to replace shaders without affecting image quality.
This currently leads to two conclusions. First, Doom 3 performance is not as much a result of the NV40 architecture as we previously stated. Second, ATI might therefore be able to speed up Radeon-based graphics cards as well, although Nvidia may be able to push its optimization efforts even further. As a consequence, general gaming performance can no longer be gauged from a few games standing in for all the others: improved Doom 3 performance for either ATI or Nvidia need not be the result of better utilization of the respective architecture. It could be, but it could just as well be driver-internal optimizations.
As a general remark on shader replacement: one can no longer demand that shaders be executed exactly "as they are". Built-in shader optimizations that do not change the final visual result are always legitimate; it gets more complicated with hand-tuned replacements. Still, as long as there is no proof to the contrary, we assume that Nvidia is merely exploiting performance opportunities that were overlooked during the game's development. Even so, hand optimization for a specific benchmark is something to take very seriously, because Doom 3 is one of the default benchmarks in almost every review out there.