GPU Performance Comparison for Large Language Models
The rapid development of Large Language Models (LLMs) poses new challenges for computing hardware. A crucial question for me is how different GPUs perform when running these models. In this post, I examine GPU performance through two metrics: TFLOPS (trillion floating-point operations per second) and TOPS (trillion operations per second). I summarize the capabilities of individual GPU models in a table, supplemented with brief explanations.

TOPS (Tera Operations Per Second) and FLOPS (Floating Point Operations Per Second) are two important metrics for characterizing GPU performance, but they describe different kinds of computational operations, which matters both for running (inference) and for training LLMs.
TOPS (Tera Operations Per Second)
- TOPS generally measures the performance of integer operations (INT8, INT16, INT32, etc.).
- It is typically quoted for AI accelerators (e.g., Tensor Cores, NPUs, TPUs) because LLM inference (generating outputs, token by token) often runs on quantized integer arithmetic, which is more efficient than floating-point computation.
- For inference, INT8 or INT4 operations are used because they reduce compute and memory requirements without significantly degrading model quality; this is why the advertised performance of AI accelerators is often specified in TOPS (a minimal quantization sketch follows this list).
- Example: A GPU might have a performance of 200 TOPS for INT8 operations, meaning it can perform 200 trillion integer operations per second.
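To make the INT8 point concrete, here is a minimal sketch of symmetric per-tensor quantization in plain NumPy. No particular inference framework is assumed, and real deployments typically use per-channel or group-wise schemes; the idea is simply that FP32 weights are mapped to 8-bit integers, shrinking each value from 4 bytes to 1 byte.

```python
import numpy as np

# Hypothetical FP32 weight matrix, standing in for one linear layer of an LLM.
weights_fp32 = np.random.randn(4096, 4096).astype(np.float32)

# Symmetric per-tensor quantization: scale so the largest magnitude maps to 127.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# Dequantize to inspect the rounding error introduced by INT8 storage.
weights_dequant = weights_int8.astype(np.float32) * scale
max_error = np.abs(weights_fp32 - weights_dequant).max()

print(f"FP32 size: {weights_fp32.nbytes / 1e6:.1f} MB")  # ~67 MB
print(f"INT8 size: {weights_int8.nbytes / 1e6:.1f} MB")  # ~17 MB, a 4x reduction
print(f"Max rounding error: {max_error:.5f}")
```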
FLOPS (Floating Point Operations Per Second)
- FLOPS measures the execution speed of floating-point operations (FP16, FP32, FP64).
- It is crucial for LLM training because large models require FP16 or FP32 precision for accurate weight and gradient calculations.
- Example: A modern GPU might have 20 TFLOPS (TeraFLOPS) of FP32 performance, meaning it can perform 20 trillion floating-point operations per second (a worked example of what such a figure implies in practice follows this list).
- For very large models (e.g., GPT-4 or Gemini), FP16 (half-precision floating-point numbers) and bfloat16 (BF16) operations are also used because they are faster while still being sufficiently accurate for training.
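To connect a TFLOPS figure to something tangible, the sketch below estimates the theoretical lower bound on the time for a single large matrix multiplication (an M x K by K x N matmul costs roughly 2·M·K·N floating-point operations). The shapes are made up for illustration, the peak FP16 numbers are taken from the comparison table later in this post, and real kernels never reach 100% of peak.

```python
# Back-of-the-envelope estimate: time for one matmul at a given peak throughput.

def matmul_time_ms(m: int, k: int, n: int, peak_tflops: float) -> float:
    """Theoretical minimum time in milliseconds, assuming 100% of peak throughput."""
    flops = 2 * m * k * n                       # each multiply-accumulate counts as 2 ops
    return flops / (peak_tflops * 1e12) * 1e3   # seconds -> milliseconds

# Illustrative shape: feed-forward projection of a hypothetical 8192-hidden transformer,
# processing 16 sequences of 2,048 tokens in one batch.
m, k, n = 16 * 2048, 8192, 4 * 8192

# Peak FP16 tensor throughput (TFLOPS) from the comparison table below.
for name, fp16_tflops in [("NVIDIA H100 SXM", 1979), ("NVIDIA A100 PCIe", 312), ("NVIDIA RTX 4090", 330)]:
    print(f"{name:17s}: {matmul_time_ms(m, k, n, fp16_tflops):6.2f} ms theoretical minimum")
```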
| GPU | Tensor/AI Cores | FP32 (TFLOPS) | FP16 (TFLOPS) | BF16 (TFLOPS) | INT8 (TOPS) | VRAM (GB) | Mem. Bandwidth (GB/s) | Power Consumption (W) |
|---|---|---|---|---|---|---|---|---|
| NVIDIA H200 SXM | 528 | 67 | 1,979 | 1,979 | 3,958 | 141 (HBM3e) | 4,800 | 600-700 |
| NVIDIA H100 SXM | 576 | 67 | 1,979 | 1,979 | 3,958 | 80 (HBM3) | 3,350 | 350-700 |
| NVIDIA H100 PCIe | 576 | 51 | 1,513 | 1,513 | 3,026 | 80 (HBM3) | 2,000 | 350-700 |
| NVIDIA A100 PCIe | 432 | 19.5 | 312 | 312 | 624 | 80 (HBM2e) | 1,935 | 250-400 |
| NVIDIA RTX 6000 Ada | 568 | 91.1 | | | | 48 (GDDR6 ECC) | 960 | 300 |
| NVIDIA L40S | 568 | 91.6 | | | | 48 (GDDR6 ECC) | 864 | 350 |
| RTX A6000 | 336 | 38.7 | | | | 48 (GDDR6) | 768 | 250 |
| NVIDIA RTX 5090 | 680 | 104.8 | 450 | | 900 | 32 (GDDR7) | 1,790 | 575 |
| NVIDIA RTX 4090 | 512 | 82.6 | 330 | | 660 | 24 (GDDR6X) | 1,008 | 450 |
| NVIDIA RTX 3090 | 328 | 40 | 285 | | | 24 (GDDR6X) | 936 | 350 |
| NVIDIA RTX 2080 Ti | 544 | 14.2 | 108 | | | 11 (GDDR6) | 616 | 260 |
| AMD MI300X | | 61 | 654? | 1,307 | 2,615 | 192 (HBM3) | 5,200 | 750 |
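Raw throughput is only part of the picture. A quick derived metric is FP16 TFLOPS per watt; the sketch below computes it for a few rows of the table above, using the upper end of each listed power range. The inputs are the table's peak specs, not measured values.

```python
# FP16 TFLOPS per watt for selected GPUs, using figures from the table above.
# Power is the upper end of the listed range; all numbers are peak specs, not measurements.
gpus = {
    "NVIDIA H200 SXM":  {"fp16_tflops": 1979, "power_w": 700},
    "NVIDIA H100 SXM":  {"fp16_tflops": 1979, "power_w": 700},
    "NVIDIA A100 PCIe": {"fp16_tflops": 312,  "power_w": 400},
    "NVIDIA RTX 4090":  {"fp16_tflops": 330,  "power_w": 450},
}

for name, spec in sorted(gpus.items(), key=lambda kv: kv[1]["fp16_tflops"] / kv[1]["power_w"], reverse=True):
    print(f"{name:17s}: {spec['fp16_tflops'] / spec['power_w']:.2f} FP16 TFLOPS per watt")
```

Keep in mind that for LLM inference specifically, VRAM capacity and memory bandwidth (also listed in the table) are often the limiting factors rather than raw compute, which is one reason the HBM-equipped data-center cards stand apart from the consumer GPUs.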