GPU Performance Comparison for Large Language Models

Gábor Bíró January 11, 2025
2 min read

The rapid development of Large Language Models (LLMs) poses new challenges for computing hardware. A crucial question for me is how different GPUs perform when running these models. In this post, I examine the performance of various GPUs through the concepts of TFLOPS (trillion floating-point operations per second) and TOPS (trillion operations per second), presenting the capabilities of each card in a table with brief explanations.

GPU performance comparison for Large Language Models (illustration: own work)

TOPS (Tera Operations Per Second) and FLOPS (Floating Point Operations Per Second) are two important metrics for characterizing GPU performance, but they describe different types of computational operations, which matters for both running and training LLMs.

TOPS (Tera Operations Per Second)

  • TOPS generally measures the performance of integer operations (INT8, INT16, INT32, etc.).
  • It is typically quoted for AI accelerators (e.g., Tensor Cores, NPUs, TPUs) because LLM inference (generating output tokens) often runs on quantized integer arithmetic, which is cheaper than floating-point math.
  • For inference, INT8 or even INT4 operations are used because they reduce computational and memory requirements without significantly degrading model quality; this is why the advertised performance of AI accelerators is often specified in TOPS. A small quantization sketch follows this list.
  • Example: A GPU might have a performance of 200 TOPS for INT8 operations, meaning it can perform 200 trillion integer operations per second.
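
To make this concrete, here is a minimal sketch of symmetric per-tensor INT8 weight quantization in PyTorch. The helper names (quantize_int8, dequantize) and the 4096×4096 weight matrix are illustrative choices, not the API of any particular quantization library:

```python
import torch

def quantize_int8(weights: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: map the FP32 range to [-127, 127]."""
    scale = weights.abs().max() / 127.0          # one scale factor for the whole tensor
    q = torch.clamp(torch.round(weights / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an FP32 approximation of the original weights."""
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)                      # a hypothetical FP32 weight matrix
q, scale = quantize_int8(w)

print(f"FP32 size: {w.numel() * 4 / 2**20:.1f} MiB")   # 64.0 MiB
print(f"INT8 size: {q.numel() * 1 / 2**20:.1f} MiB")   # 16.0 MiB
print(f"max abs error: {(w - dequantize(q, scale)).abs().max():.4f}")
```

The INT8 copy needs a quarter of the memory of the FP32 original, which is exactly why quantized inference is attractive on memory-bandwidth-bound GPUs.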

FLOPS (Floating Point Operations Per Second)

  • FLOPS measures the execution speed of floating-point operations (FP16, FP32, FP64).
  • It is crucial for LLM training because large models require FP16 or FP32 precision for accurate weight and gradient calculations.
  • Example: A modern GPU might have 20 TFLOPS (TeraFLOPS) FP32 performance, meaning it can perform 20 trillion floating-point operations per second.
  • For very large models (e.g., GPT-4 or Gemini), FP16 (half-precision floating-point) and bfloat16 (BF16) operations are also used because they are faster while still being sufficiently accurate for training; the snippet after this list shows the range difference between the two 16-bit formats.
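
The practical difference between FP16 and BF16 is range versus precision: both use 16 bits, but BF16 keeps FP32's 8 exponent bits at the cost of a shorter mantissa. A quick sketch with PyTorch's torch.finfo (the value 70,000 is just an arbitrary example that overflows FP16):

```python
import torch

# Compare the floating-point formats commonly used in LLM training.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  smallest normal={info.tiny:.3e}  eps={info.eps:.3e}")

# FP16 overflows where BF16 (same exponent range as FP32) does not:
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf  -> above FP16's ~65504 maximum
print(x.to(torch.bfloat16))  # 70144 (coarsely rounded, but finite)
```

This is why BF16 has become popular for training large models: gradients and activations can span a wide dynamic range, and BF16 avoids the overflow problems that FP16 typically needs loss scaling to work around.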

The table below summarizes the key figures for a number of current GPUs:

| GPU | Tensor/AI Cores | FP32 (TFLOPS) | FP16 (TFLOPS) | BF16 (TFLOPS) | INT8 (TOPS) | VRAM (GB) | Mem. Bandwidth (GB/s) | Power Consumption (W) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA H200 SXM | 528 | 67 | 1,979 | 1,979 | 3,958 | 141 (HBM3e) | 4,800 | 600-700 |
| NVIDIA H100 SXM | 528 | 67 | 1,979 | 1,979 | 3,958 | 80 (HBM3) | 3,350 | 350-700 |
| NVIDIA H100 PCIe | 456 | 51 | 1,513 | 1,513 | 3,026 | 80 (HBM2e) | 2,000 | 300-350 |
| NVIDIA A100 PCIe | 432 | 19.5 | 312 | 312 | 624 | 80 (HBM2e) | 1,935 | 250-400 |
| RTX 6000 ADA | 568 | 91.1 | - | - | - | 48 (GDDR6 ECC) | 960 | 300 |
| NVIDIA L40S | 568 | 91.6 | - | - | - | 48 (GDDR6 ECC) | 864 | 350 |
| RTX A6000 | 336 | 38.7 | - | - | - | 48 (GDDR6) | 768 | 300 |
| NVIDIA RTX 5090 | 680 | 104.8 | 450 | - | 900 | 32 (GDDR7) | 1,790 | 575 |
| NVIDIA RTX 4090 | 512 | 82.6 | 330 | - | 660 | 24 (GDDR6X) | 1,008 | 450 |
| NVIDIA RTX 3090 | 328 | 40 | 285 | - | - | 24 (GDDR6X) | 936 | 350 |
| NVIDIA RTX 2080 Ti | 544 | 14.2 | 108 | - | - | 11 (GDDR6) | 616 | 260 |
| AMD MI300X | - | 61 | 654? | 1,307 | 2,615 | 192 (HBM3) | 5,200 | 750 |
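
The figures above are theoretical peaks; sustained throughput on real workloads is lower. A rough way to check what a given card actually delivers is to time a large FP16 matrix multiplication, the dominant operation in LLM inference and training. Here is a benchmark sketch, assuming a CUDA-capable GPU with PyTorch installed (the measure_tflops helper and its 8192×8192 default are my own illustration):

```python
import time
import torch

def measure_tflops(n: int = 8192, dtype=torch.float16, iters: int = 20) -> float:
    """Time an n x n matrix multiplication and convert it to TFLOPS."""
    assert torch.cuda.is_available(), "requires a CUDA GPU"
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)

    for _ in range(3):                      # warm-up runs
        torch.matmul(a, b)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()                # wait for all kernels to finish
    elapsed = time.perf_counter() - start

    flops = 2 * n**3 * iters                # ~2*n^3 floating-point operations per matmul
    return flops / elapsed / 1e12

print(f"achieved FP16: {measure_tflops():.1f} TFLOPS")
```

Expect the measured value to land below the advertised peak, since the headline tensor figures assume ideal conditions and, in many cases, 2:4 structured sparsity.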