NVIDIA explained why the performance of GeForce RTX 30-series accelerators jumped so much


NVIDIA unveiled its next generation of gaming graphics cards, based on the Ampere architecture, on September 1, but the initial presentation was light on technical detail. Now, a few days later, the company has released documentation that explains where the impressive performance advantage of the GeForce RTX 30-series cards over their predecessors comes from.

Many noticed immediately that the official specifications of the GeForce RTX 3090, GeForce RTX 3080, and GeForce RTX 3070 on the NVIDIA website list a strikingly large number of CUDA cores.

As it turns out, Ampere gaming GPUs really do deliver twice the FP32 performance of Turing, and the gain comes from a change in the architecture of the GPU's basic building block, the streaming multiprocessor (SM).

While the SM in Turing GPUs had a single datapath for floating-point operations, each Ampere SM has two, which together can perform up to 128 FMA operations per clock, compared with 64 for Turing. Half of the execution units in an Ampere SM can execute either integer (INT) or 32-bit floating-point (FP32) operations, while the other half handle FP32 only. This design saves transistor budget on the assumption that game workloads generate significantly more FP32 operations than INT operations. Turing, by contrast, had no combined units at all: its FP32 and INT units were separate.
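The lane counts above can be turned into a rough back-of-the-envelope model. The sketch below is only an illustration, not NVIDIA's scheduler: it assumes a Turing SM issues up to 64 FP32 and 64 INT operations per clock on separate units, that an Ampere SM has 64 FP32-only lanes plus 64 lanes that accept either FP32 or INT, and that nothing else (memory, scheduling, dependencies) limits throughput. The function names and the operation counts in the usage note are hypothetical.

```python
# Idealized per-SM issue model built from the lane counts quoted in the article.
# Assumption: only lane availability limits throughput.

def turing_cycles(fp32_ops: int, int_ops: int) -> float:
    """Turing SM: 64 FP32 lanes and 64 separate INT lanes run side by side."""
    return max(fp32_ops / 64, int_ops / 64)


def ampere_cycles(fp32_ops: int, int_ops: int) -> float:
    """Ampere SM: 64 FP32-only lanes plus 64 lanes that take FP32 or INT."""
    # INT work fits only on the 64 shared lanes, while the combined
    # FP32 + INT work is spread over all 128 lanes.
    return max(int_ops / 64, (fp32_ops + int_ops) / 128)


def speedup(fp32_ops: int, int_ops: int) -> float:
    """Ratio of Turing cycles to Ampere cycles for the same (non-empty) workload."""
    return turing_cycles(fp32_ops, int_ops) / ampere_cycles(fp32_ops, int_ops)
```

In this model a pure FP32 workload, such as `speedup(12800, 0)`, runs twice as fast on the Ampere SM, while a workload with as many INT as FP32 operations, such as `speedup(6400, 6400)`, sees no gain at all. That is the trade-off behind betting on FP32-heavy game workloads.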

To keep the more powerful SMs supplied with data, NVIDIA also increased the L1 cache in each SM by a third (from 96 to 128 KB) and doubled its bandwidth.

Another major improvement in Ampere is that the CUDA, RT, and Tensor Cores can now run fully in parallel. This lets the graphics engine, for example, use DLSS to upscale one frame while computing the next frame on the CUDA and RT cores, which reduces idle time in the functional units and raises overall performance.
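The benefit of overlapping the two stages can be estimated with a very simple pipeline model. This is a sketch under idealized assumptions: rendering and DLSS upscaling are treated as two independent stages that do not compete for any shared resource, and the millisecond values in the usage note are invented purely for illustration.

```python
# Idealized two-stage model: rendering (CUDA + RT cores) and upscaling (Tensor Cores).

def serial_frame_interval(render_ms: float, upscale_ms: float) -> float:
    """Stages run one after another: each frame costs the sum of both."""
    return render_ms + upscale_ms


def overlapped_frame_interval(render_ms: float, upscale_ms: float) -> float:
    """Frame N is upscaled while frame N+1 renders: the slower stage sets the pace."""
    return max(render_ms, upscale_ms)
```

With hypothetical stage times of 10 ms for rendering and 2 ms for upscaling, the serial interval is 12 ms per frame and the overlapped interval is 10 ms. Real gains depend on how much the stages contend for memory bandwidth and power, which this model ignores.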

In addition, the second-generation RT cores in Ampere can compute ray-triangle intersections twice as fast as those in Turing, and the new third-generation Tensor Cores double math throughput when working with sparse matrices.

Doubling the speed of ray-triangle intersection tests should significantly improve the performance of GeForce RTX 30-series cards in games that support ray tracing. According to NVIDIA, this was the bottleneck in the Turing architecture, whereas the speed of ray intersection tests against bounding boxes was already adequate. The balance of ray tracing performance has now been improved, and in Ampere both kinds of ray operation (against triangles and against bounding boxes) can run in parallel.
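For readers unfamiliar with the two tests, here is a minimal software sketch of each. It uses the textbook slab test for axis-aligned boxes and the Möller-Trumbore algorithm for triangles. These are standard software algorithms chosen for illustration; the source does not say how the RT cores implement the tests in hardware, and all function names here are hypothetical.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_box_hit(origin: Vec3, direction: Vec3, box_min: Vec3, box_max: Vec3) -> bool:
    """Slab test: True if the ray (t >= 0) passes through the axis-aligned box."""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this pair of planes: it must start between them.
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def ray_triangle_hit(origin: Vec3, direction: Vec3,
                     v0: Vec3, v1: Vec3, v2: Vec3,
                     eps: float = 1e-9) -> Optional[float]:
    """Moller-Trumbore: distance t along the ray to the hit point, or None."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None  # ray is parallel to the triangle's plane
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None
```

A ray tracer typically runs the cheap box test many times to discard whole groups of triangles, then runs the triangle test on the few that remain, which is why the relative speed of the two matters.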

In addition, the RT cores in Ampere gain new functionality for interpolating triangle positions. This can be used to apply motion blur to moving objects, when not all triangles in the scene stay in a fixed position.
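The idea can be sketched in a few lines. This example assumes plain linear interpolation between a triangle's position at the start and at the end of the frame, with each ray carrying a time stamp between 0 and 1. The source only says that positions are interpolated, so the linear scheme and the function name are assumptions made for illustration.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]
Triangle = Tuple[Vec3, Vec3, Vec3]


def lerp_triangle(tri_start: Triangle, tri_end: Triangle, t: float) -> Triangle:
    """Vertex positions at time t in [0, 1], linearly interpolated per coordinate."""
    return tuple(
        tuple(a + (b - a) * t for a, b in zip(v_start, v_end))
        for v_start, v_end in zip(tri_start, tri_end)
    )
```

For a triangle that moves two units along the x axis during the frame, `lerp_triangle(start, end, 0.5)` returns it shifted by one unit. Rays with different time stamps therefore hit the object at different positions, and averaging them produces the blur.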

To illustrate all of the above, NVIDIA showed a head-to-head comparison of how Turing and Ampere GPUs handle the ray tracing workload in Wolfenstein: Youngblood at 4K resolution. According to the illustration, Ampere builds a frame noticeably faster thanks to faster FP32 math, second-generation RT cores, and the parallel operation of heterogeneous GPU resources.

To back this up in practice, NVIDIA also presented additional test results for the GeForce RTX 3090, GeForce RTX 3080, and GeForce RTX 3070. According to them, the GeForce RTX 3070 is about 60% ahead of the GeForce RTX 2070 at 1440p, both in games with RTX support and in games with traditional rasterization, Borderlands 3 in particular.

The GeForce RTX 3080 is about twice as fast as the GeForce RTX 2080 at 4K resolution. In Borderlands 3 without RTX support, however, the new card's advantage is not twofold but about 80 percent.

And the top card, the GeForce RTX 3090, shows an advantage of roughly 1.5 times over the Titan RTX in NVIDIA’s own tests.

According to technical journalists, full reviews of the reference-design GeForce RTX 3080 are expected on September 14. Three days later, on September 17, publication of test data for production GeForce RTX 3080 models from the company’s partners will be allowed. Independent test results for the GeForce RTX 30 series should therefore appear online very soon.