AMD RDNA 2 beats NVIDIA Ampere: 34% lower memory latency

AMD RDNA 2 beats NVIDIA Ampere: 34% lower memory latency

The competition between the AMD RDNA 2 and the NVIDIA Ampere is very close and both sides have been hit hard. However, on the latency in accessing the cache memory, AMD RDNA 2 beats the rival quite clearly

These are the two graphic architectures that led many joys because of the performance and value for money compared to the previous generation, accompanied by so many pains due to supply problems. We are obviously talking about NVIDIA Ampere and AMD RDNA 2. Both new technologies and innovations have been introduced on both sides. On the one hand, NVIDIA has brought the second generation of accelerators to the ray tracing e i new Tensors cores that enable the use of DLSS 2.0. AMD, on the other hand, has introduced first generation ray tracing – even lower than that of NVIDIA -, but also SAM e l’Infinity Cache. This is a battle to the last fps, but today in the battle over cache memory latency between AMD RDNA 2 and NVIDIA Ampere, the former triumphs.

AMD RDNA 2 beats NVIDIA Ampere: 34% lower memory latency

Cache Latency: AMD RDNA 2 beats NVIDIA Ampere

The cache is a key part for the operation of high performance components. In fact, the biggest bottleneck inside modern PCs are RAM memories that have an access time much higher than the cycle time of a processor. To avoid making you wait too long, data with a high probability of being read in the near future are placed in very fast memories called caches. Unfortunately these memories cannot reach large capacities because they are very expensive to make and consume large areas of silicon. That’s why generally the size varies from a few tens of KB up to MB, but difficult to go further. The cache is fast enough for the CPU to run at very high frequencies. However, the small size still limits the potential performance. To solve the problem it was therefore decided to enter a cache memory hierarchy, gradually slower but with higher dimensions. On modern CPUs, there are also 3 or 4 levels of cache.

AMD RDNA 2 beats NVIDIA Ampere: 34% lower memory latency

Even the latest GPUs they were forced to insert cache layers into the architecture. In fact, the video memory is not fast enough to “feed” a large amount of cores working at ever higher frequencies. If the size of the VRAM matters up to a certain point – to avoid having to load game “pieces” from the SSD -, speed is equally critical in order to continuously supply the cores with data. In fact, the top of the NVIDIA range, given the huge amount of cores they use, mount the very fast GDDR6X to maximize the bandwidth, otherwise you would not be able to make the GPU work at 100%. AMD RDNA 2 instead has brought out Infinity Cache, an additional layer of large cache that acts as a buffer between the GDDR6 VRAM and the lower level caches and that allows you to extend the average bandwidth “perceived” by the GPU.

The battle over cache latency

The memory latency performance of AMD and NVIDIA Ampere’s RDNA 2 GPU architectures has been tested by Chips and Cheese, who are already accustomed to this type of specific test. The discovery was really interesting: AMD RDNA 2 would have an advantage of more than 30% on the latency of the cache system. The source used a custom benchmark based on pointer search in OpenCL to measure cache performance and memory latency on the latest generation GPUs based on the NVIDIA Ampere and AMD RDNA 2 architectures. In the benchmarks, AMD Radeon RX 6800 XT (RDNA GPU 2) and NVIDIA GeForce RTX 3090 (Ampere GPU) were compared in a progressive head-to-head. Memory Cache and Latency Benchmark Shows AMD’s RDNA 2 Architecture Outperforms NVIDIA’s Ampere GPUs, offering lower latency despite having to go through additional cache layers. Using the Infinity cache only adds 20ns compared to the L2 cache and despite everything we are still below the latency shown by NVIDIA. What does it mean? With the same VRAM bandwidth, AMD’s GPUs will waste fewer cycles and therefore perform more calculations at the same time.

AMD RDNA 2 beats NVIDIA Ampere: 34% lower memory latency

The stated reason is that the GA102 GPU based on NVIDIA Ampere it is simply a much larger GPU. Although the cache system is simpler, the large distances that data must travel inside the chip (the propagation speed of an electrical signal is not infinite) translate into a latency of over 100 ns (from L1 to L2 ). Instead RDNA 2 has a latency of only 66ns. Note that the AMD Navi 21 GPU is much smaller and features a 4MB L2 cache, while the NVIDIA GA102 GPU has a 6MB L2 cache. The NVIDIA A100 Ampere GPU for HPC features a huge 40MB L2 cache which certainly helps reduce data access latencies:

RDNA 2 caching is fast and there is plenty of it. Compared to Ampere, latency is low across the board. Infinity Cache only adds about 20 ns on an L2 hit and has lower latency than Ampere’s L2. Surprisingly, RDNA 2’s VRAM latency is roughly the same as Ampere’s, even though RDNA 2 is controlling two more layers of cache.

In contrast, Nvidia sticks to a more conventional GPU memory subsystem with only two levels of cache and high L2 latency. Switching from L1 to L2 takes over 100 ns. RDNA’s L2 latency is around 66ns compared to L0, even with an L1 cache between them. To get around the GA102’s huge chip, it seems like it takes a lot of cycles.

This may explain AMD’s excellent performance at lower resolutions. RDNA 2’s low-latency L2 and L3 caches can offer an advantage with smaller workloads, where the occupation is too low to hide latency. NVIDIA’s Ampere chips in comparison require more parallelism to shine.

In other words: if the cache memory can always be updated with valid data (i.e. we have a high probability of cache hit), then the performance of AMD GPUs will be higher because the cache pipeline is faster. This happens when smaller chunks of data move which are better able to “enter” the caches. While if the cache pipeline fills up, the bottleneck becomes the VRAM memory bandwidth and here NVIDIA Ampere could stay a step ahead. This benchmark is very specific and obviously its effects on real gaming performance are not easy to estimate as there are so many factors at play.. It also seems that the problem is related to the size of the chip, so in the mid-low range things could change.

However, it also highlights the different strategy embraced by NVIDIA and AMD: the first focuses on high parallelization with a high number of cores and greater proximity to the huge VRAM, the second instead maintains a serial approach with a few cores that work at high frequencies and a low latency cache architecture. That’s all from the hardware section, keep following us!