The Most Common Numerical Formats for LLMs
Running Large Language Models (LLMs) efficiently depends heavily on how their weights are stored and processed. The chosen numerical format directly affects a model's memory requirements, computational speed, and accuracy. Over the years, FP32 has gradually been supplemented or replaced by FP16 and BF16 for training, while INT8 and even lower-bit quantized formats are increasingly common for optimizing inference.

Artificial intelligence, especially deep learning, involves a vast amount of computation. The numerical formats used in these computations (i.e., how numbers are stored and manipulated by the computer) directly influence:
- Speed: Lower-precision formats (fewer bits) enable faster computations.
- Memory Footprint: Fewer bits require less memory, which is crucial for loading and running large models.
- Power Consumption: Processing fewer bits generally requires less energy.
- Accuracy: Higher-precision formats (more bits) yield more accurate results, but often at the cost of speed, memory, and power.
The goal is to find the optimal balance between accuracy and efficiency. For LLMs, the most common numerical formats for storing weights are half-precision floating-point (FP16) and bfloat16 (BF16). For inference optimization through quantization, INT8 and even lower-bit formats are increasingly used.
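To make the memory side of this trade-off concrete, here is a minimal sketch that estimates the weight-only storage footprint of a hypothetical 7-billion-parameter model at different bit widths (the 7B size is an assumption for illustration; real deployments also need memory for activations, the KV cache, and runtime overhead).

```python
# Rough, weight-only memory estimate for a hypothetical 7B-parameter model.
# Ignores activations, KV cache, and framework overhead (illustrative assumption).

PARAMS = 7_000_000_000  # assumed parameter count

formats = {
    "FP32": 32,
    "FP16": 16,
    "BF16": 16,
    "FP8":  8,
    "INT8": 8,
    "INT4": 4,
}

for name, bits in formats.items():
    gib = PARAMS * bits / 8 / 2**30  # bits -> bytes -> GiB
    print(f"{name:>5}: {gib:6.1f} GiB")
```

At FP32 this works out to roughly 26 GiB of weights alone, versus about 13 GiB in FP16/BF16 and around 3 GiB in INT4, which is why low-bit quantization is often what makes single-GPU or on-device inference feasible.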
Key Numerical Formats in AI:
- FP32 (32-bit Floating-Point)
- Historically, this was the default format in deep learning.
- Offers high precision but demands significant memory and computational resources.
- Rarely used directly for storing the weights of modern LLMs.
- FP16 (16-bit Floating-Point / Half-Precision)
- Hardware-accelerated FP16 arithmetic arrived on Nvidia GPUs with the Pascal architecture (2016).
- Accelerated further by Nvidia Tensor Cores (introduced with the Volta architecture in 2017) and supported by other GPUs and accelerators, enabling much faster computation.
- Reduces memory usage and computational requirements compared to FP32. The reduction in precision is often acceptable for training and running deep learning models.
- Precise enough for many large models, but its narrow exponent range means very small values (e.g., tiny gradients during training) can underflow to zero.
- BF16 (bfloat16 / Brain Floating Point)
- Also 16 bits like FP16, but with a different internal structure: more bits for the exponent and fewer for the mantissa (fractional part).
- This gives BF16 the same dynamic range (the span between the smallest and largest representable magnitudes) as FP32, far wider than FP16's.
- This mitigates underflow and overflow issues, so BF16 can often be used for training without the loss-scaling workarounds FP16 needs, even though its shorter mantissa makes it less precise than FP16; the sketch after this list illustrates the difference.
- Developed at Google Brain (hence the name) and first deployed on TPUs; Nvidia added hardware support with the Ampere architecture (2020).
- FP8 (8-bit Floating-Point)
- Hardware support debuted in Nvidia's Hopper (H100) and Ada Lovelace architectures and carries over to Blackwell (B100, B200, GB200).
- Comes in two variants: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits).
- E4M3 provides higher precision, while E5M2 offers a wider dynamic range.
- Significantly accelerates both training and inference, often with acceptable precision loss compared to FP16.
- INT8 (8-bit Integer, for Quantized Models)
- Highly efficient in terms of memory and computation: weights take a quarter of the space of FP32, and dedicated integer units typically deliver roughly twice the throughput of FP16.
- Incurs some precision loss, but can be effectively managed through careful quantization techniques.
- Common on edge devices and mobile platforms, and well supported by inference stacks and dedicated AI hardware (e.g., Nvidia TensorRT, the Qualcomm AI Engine).
- Less commonly used during training, as the precision loss can be more problematic for gradient calculations.
- It's common practice to quantize models originally trained in FP32 or FP16, converting weights and activations to INT8 for inference (post-training quantization); a toy example follows after this list.
- INT4 / INT2 (Quantized, Low-Bit Formats)
- Increasingly common in recent models and deployments (e.g., 4-bit quantized Llama variants produced with methods such as GPTQ and AWQ).
- Drastically reduces memory usage and speeds up inference.
- Primarily used for inference; generally not suitable for training.
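The short PyTorch sketch below (an illustration, assuming PyTorch is installed) demonstrates two points from the list above: FP16 underflows on very small magnitudes where BF16 does not, and INT8 quantization stores weights as 8-bit integers plus a scale factor. The quantize_int8 helper is a deliberately simplified per-tensor symmetric scheme, not the algorithm any particular library uses.

```python
import torch

# 1) Dynamic range: FP16 underflows on very small magnitudes, BF16 does not.
tiny = torch.tensor([1e-8, 1e-20, 1e-30])
print("FP32:", tiny)
print("FP16:", tiny.to(torch.float16))    # all flush to 0 (underflow)
print("BF16:", tiny.to(torch.bfloat16))   # FP32-like exponent range, values survive

print("FP16 finfo:", torch.finfo(torch.float16))
print("BF16 finfo:", torch.finfo(torch.bfloat16))

# 2) Toy per-tensor symmetric INT8 quantization (illustrative helper, not a library API).
def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0            # map the largest magnitude to +/-127
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)                        # pretend these are FP32 weights
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs quantization error:", (w - w_hat).abs().max().item())
```

Real quantization pipelines use per-channel or per-group scales and calibration data to keep this round-trip error small.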
The Importance of Hardware Support for Numerical Formats
Hardware support for numerical formats is critical for GPUs (and other AI accelerators) because it fundamentally determines computational efficiency.
- Optimized Execution Units: When a GPU supports a format in hardware (e.g., FP16, BF16, FP8), it means dedicated circuits (execution units, like multiply-accumulate units) on the chip are specifically designed for that format. These circuits perform operations directly in hardware, which is orders of magnitude faster than software emulation.
- Efficient Data Movement: Hardware support optimizes not just computation but also data movement. The GPU's memory system (registers, caches, global memory) and data buses are aligned with the supported formats. This means fewer bits need to be moved, reducing memory bandwidth requirements, latency, and power consumption.
- Maximizing Parallelism: GPUs derive their power from massive parallelism. Hardware support allows more operations to be performed simultaneously on data in the supported format. For example, a GPU with native 16-bit support can perform two 16-bit operations in place of one 32-bit operation, potentially doubling throughput for those operations.
- Energy Efficiency: Dedicated circuits are not only faster but also more power-efficient. Fewer transistors need to switch to perform the same operation compared to less specialized hardware or software emulation, resulting in lower power consumption and heat generation.
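As a concrete illustration of checking and using hardware support (a sketch, assuming PyTorch on a CUDA-capable Nvidia GPU), the snippet below queries what the device reports and runs a matrix multiplication under mixed precision, preferring BF16 when the GPU supports it.

```python
import torch

if torch.cuda.is_available():
    # Query what the GPU and driver report about reduced-precision support.
    print("Device:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
    print("BF16 supported:", torch.cuda.is_bf16_supported())

    a = torch.randn(2048, 2048, device="cuda")
    b = torch.randn(2048, 2048, device="cuda")

    # Prefer BF16 where the hardware supports it (Ampere and newer), else FP16.
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    with torch.autocast(device_type="cuda", dtype=dtype):
        c = a @ b                      # matmul executes in reduced precision (typically on Tensor Cores)
    print("Result dtype under autocast:", c.dtype)
else:
    print("No CUDA device available; this sketch requires an Nvidia GPU.")
```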
Summary
For training, mixed precision with FP16 or BF16 is prevalent. For inference, many models now use INT8 or even INT4 quantization for faster execution and a lower memory footprint. The evolution of numerical formats in AI is an ongoing optimization process. Lower-precision formats enable faster, more efficient, and potentially cheaper AI systems, but the trade-offs between precision, dynamic range, and accuracy must be carefully considered. Newer hardware architectures (like Ampere, Ada Lovelace, Hopper, and Blackwell) support a growing set of efficient numerical formats, further accelerating AI development. In the future, we can expect even more specialized numerical formats tailored to AI workloads.