The Most Common Numerical Formats for LLMs
Running Large Language Models (LLMs) efficiently depends heavily on how their weights are stored and processed. The chosen numerical format directly affects a model's memory requirements, computational speed, and accuracy. Over the years, FP32 has gradually been supplemented or replaced by FP16 and BF16 for training, while INT8 and even lower-bit quantized formats are increasingly common for optimizing inference.

Artificial intelligence, especially deep learning, involves a vast amount of computation. The numerical formats used in these computations (i.e., how numbers are stored and manipulated by the computer) directly influence:
- Speed: Lower-precision formats (fewer bits) enable faster computations.
- Memory Footprint: Fewer bits require less memory, which is crucial for loading and running large models.
- Power Consumption: Processing fewer bits generally requires less energy.
- Accuracy: Higher-precision formats (more bits) yield more accurate results, but often at the cost of speed, memory, and power.
The goal is to find the optimal balance between accuracy and efficiency. For LLMs, the most common numerical formats for storing weights are half-precision floating-point (FP16) and bfloat16 (BF16). For inference optimization through quantization, INT8 and even lower-bit formats are increasingly used.
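To make the memory side of this trade-off concrete, here is a minimal sketch that estimates the weight-only storage footprint of a hypothetical 7-billion-parameter model at different bit widths (the 7B size is an assumption for illustration; real deployments also need memory for activations, the KV cache, and runtime overhead).

```python
# Rough, weight-only memory estimate for a hypothetical 7B-parameter model.
# Ignores activations, KV cache, and framework overhead (illustrative assumption).

PARAMS = 7_000_000_000  # assumed parameter count

formats = {
    "FP32": 32,
    "FP16": 16,
    "BF16": 16,
    "FP8":  8,
    "INT8": 8,
    "INT4": 4,
}

for name, bits in formats.items():
    gib = PARAMS * bits / 8 / 2**30  # bits -> bytes -> GiB
    print(f"{name:>5}: {gib:6.1f} GiB")
```

At FP32 this works out to roughly 26 GiB of weights alone, versus about 13 GiB in FP16/BF16 and around 3 GiB in INT4, which is why low-bit quantization is often what makes single-GPU or on-device inference feasible.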
Key Numerical Formats in AI:
- FP32 (32-bit Floating-Point)
- Historically, this was the default format in deep learning.
- Offers high precision but demands significant memory and computational resources.
- Rarely used directly for storing the weights of modern LLMs.
- FP16 (16-bit Floating-Point / Half-Precision)
- Hardware-accelerated FP16 arithmetic arrived on Nvidia GPUs with the Pascal architecture (2016).
- Accelerated further by Nvidia Tensor Cores (introduced with the Volta architecture in 2017) and supported by other GPUs and accelerators, enabling much faster computation.
- Reduces memory usage and computational requirements compared to FP32. The reduction in precision is often acceptable for training and running deep learning models.
- Precise enough for many large models, but its narrow exponent range means very small values (e.g., tiny gradients during training) can underflow to zero.
- BF16 (bfloat16 / Brain Floating Point)
- Also 16 bits like FP16, but with a different internal structure: more bits for the exponent and fewer for the mantissa (fractional part).
- This gives BF16 the same dynamic range (the span between the smallest and largest representable magnitudes) as FP32, far wider than FP16's.
- This mitigates underflow and overflow issues, so BF16 can often be used for training without the loss-scaling workarounds FP16 needs, even though its shorter mantissa makes it less precise than FP16; the sketch after this list illustrates the difference.
- Developed at Google Brain (hence the name) and first deployed on TPUs; Nvidia added hardware support with the Ampere architecture (2020).
- FP8 (8-bit Floating-Point)
- Hardware support debuted in Nvidia's Hopper (H100) and Ada Lovelace architectures and carries over to Blackwell (B100, B200, GB200).
- Comes in two variants: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits).
- E4M3 provides higher precision, while E5M2 offers a wider dynamic range.
- Significantly accelerates both training and inference, often with acceptable precision loss compared to FP16.
- INT8 (8-bit Integer, for Quantized Models)
- Highly efficient in terms of memory and computation: weights take a quarter of the space of FP32, and dedicated integer units typically deliver roughly twice the throughput of FP16.
- Incurs some precision loss, but can be effectively managed through careful quantization techniques.
- Common on edge devices and mobile platforms, and well supported by inference stacks and dedicated AI hardware (e.g., Nvidia TensorRT, the Qualcomm AI Engine).
- Less commonly used during training, as the precision loss can be more problematic for gradient calculations.
- It's common practice to quantize models originally trained in FP32 or FP16, converting weights and activations to INT8 for inference (post-training quantization); a toy example follows after this list.
- INT4 / INT2 (Quantized, Low-Bit Formats)
- Increasingly common in recent models and deployments (e.g., 4-bit quantized Llama variants produced with methods such as GPTQ and AWQ).
- Drastically reduces memory usage and speeds up inference.
- Primarily used for inference; generally not suitable for training.
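The short PyTorch sketch below (an illustration, assuming PyTorch is installed) demonstrates two points from the list above: FP16 underflows on very small magnitudes where BF16 does not, and INT8 quantization stores weights as 8-bit integers plus a scale factor. The quantize_int8 helper is a deliberately simplified per-tensor symmetric scheme, not the algorithm any particular library uses.

```python
import torch

# 1) Dynamic range: FP16 underflows on very small magnitudes, BF16 does not.
tiny = torch.tensor([1e-8, 1e-20, 1e-30])
print("FP32:", tiny)
print("FP16:", tiny.to(torch.float16))    # all flush to 0 (underflow)
print("BF16:", tiny.to(torch.bfloat16))   # FP32-like exponent range, values survive

print("FP16 finfo:", torch.finfo(torch.float16))
print("BF16 finfo:", torch.finfo(torch.bfloat16))

# 2) Toy per-tensor symmetric INT8 quantization (illustrative helper, not a library API).
def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0            # map the largest magnitude to +/-127
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)                        # pretend these are FP32 weights
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs quantization error:", (w - w_hat).abs().max().item())
```

Real quantization pipelines use per-channel or per-group scales and calibration data to keep this round-trip error small.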
The Importance of Hardware Support for Numerical Formats
Hardware support for numerical formats is critical for GPUs (and other AI accelerators) because it fundamentally determines computational efficiency.
- Optimized Execution Units: When a GPU supports a format in hardware (e.g., FP16, BF16, FP8), it means dedicated circuits (execution units, like multiply-accumulate units) on the chip are specifically designed for that format. These circuits perform operations directly in hardware, which is orders of magnitude faster than software emulation.
- Efficient Data Movement: Hardware support optimizes not just computation but also data movement. The GPU's memory system (registers, caches, global memory) and data buses are aligned with the supported formats. This means fewer bits need to be moved, reducing memory bandwidth requirements, latency, and power consumption.
- Maximizing Parallelism: GPUs derive their power from massive parallelism. Hardware support allows more operations to be performed simultaneously on data in the supported format. For example, a GPU with native 16-bit support can perform two 16-bit operations in place of one 32-bit operation, potentially doubling throughput for those operations.
- Energy Efficiency: Dedicated circuits are not only faster but also more power-efficient. Fewer transistors need to switch to perform the same operation compared to less specialized hardware or software emulation, resulting in lower power consumption and heat generation.
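As a concrete illustration of checking and using hardware support (a sketch, assuming PyTorch on a CUDA-capable Nvidia GPU), the snippet below queries what the device reports and runs a matrix multiplication under mixed precision, preferring BF16 when the GPU supports it.

```python
import torch

if torch.cuda.is_available():
    # Query what the GPU and driver report about reduced-precision support.
    print("Device:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
    print("BF16 supported:", torch.cuda.is_bf16_supported())

    a = torch.randn(2048, 2048, device="cuda")
    b = torch.randn(2048, 2048, device="cuda")

    # Prefer BF16 where the hardware supports it (Ampere and newer), else FP16.
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    with torch.autocast(device_type="cuda", dtype=dtype):
        c = a @ b                      # matmul executes in reduced precision (typically on Tensor Cores)
    print("Result dtype under autocast:", c.dtype)
else:
    print("No CUDA device available; this sketch requires an Nvidia GPU.")
```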
Summary
For training, mixed precision with FP16 or BF16 is prevalent. For inference, many models now use INT8 or even INT4 quantization for faster execution and a lower memory footprint. The evolution of numerical formats in AI is an ongoing optimization process. Lower-precision formats enable faster, more efficient, and potentially cheaper AI systems, but the trade-offs between precision, dynamic range, and accuracy must be carefully considered. Newer hardware architectures (like Ampere, Ada Lovelace, Hopper, and Blackwell) support a growing set of efficient numerical formats, further accelerating AI development. In the future, we can expect even more specialized numerical formats tailored to AI workloads.