LLM Model Size, Memory Requirements, and Quantization

Gábor Bíró November 12, 2024
3 min read

Large Language Models (LLMs), such as GPT-3, LLaMA, or PaLM, are neural networks of enormous size. Their size is typically characterized by the number of parameters (e.g., 7b, 14b, 72b, meaning 7 billion, 14 billion, 72 billion parameters). A parameter is essentially a weight or bias value within the network. These parameters are learned during training and collectively represent the model's "knowledge," determining how it processes information and generates outputs. Modern LLMs possess billions, sometimes even hundreds of billions, of parameters.

[Figure: LLM Model Size, Memory Requirements, and Quantization. Source: own work]

Hundreds of billions of parameters translate into substantial memory requirements:

  • Storage: The model's parameters must be stored on persistent storage, like a hard drive or SSD.
  • Loading: To run the model (perform inference), the parameters need to be loaded into the memory of the GPU (or other accelerator).
  • Computation: During model execution, the GPU needs constant access to these parameters to perform calculations.

Example:

Let's assume a model has 175 billion parameters, and each parameter is stored in FP32 (32-bit floating-point) format.

  • One FP32 number occupies 4 bytes (32 bits / 8 bits per byte).
  • 175 billion parameters * 4 bytes/parameter = 700 billion bytes = 700 GB.

Therefore, just storing the model parameters requires 700 GB of space! Loading and running the model requires at least this much VRAM (Video RAM) on the GPU. This is why high-end GPUs with large amounts of VRAM (like NVIDIA A100, H100) are necessary for running large-scale LLMs. If, instead of 4 bytes, each parameter occupied only 1 byte (as with the INT8 format), the memory requirement in gigabytes would roughly equal the number of parameters in billions. For example, a 175B parameter model using INT8 would require approximately 175 GB of VRAM.
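The same back-of-the-envelope arithmetic can be captured in a few lines of Python. This is only a sketch of the calculation above, counting weights alone (activations, the KV cache, and framework overhead come on top), with the byte sizes of the formats hard-coded:

```python
# Rough weight-memory estimate: parameter count times bytes per parameter.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("FP32", "FP16", "INT8"):
    print(f"175B parameters in {dtype}: {weight_memory_gb(175e9, dtype):,.0f} GB")
# FP32: 700 GB, FP16: 350 GB, INT8: 175 GB
```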

Quantization: Reducing Memory Requirements

Quantization is a technique aimed at reducing the model's size and memory footprint, usually by sacrificing an acceptable amount of precision. During quantization, the model's parameters (weights and sometimes activations) are converted to a lower-precision numerical format.

How does quantization work?

  1. Original Format: Models are typically trained using FP32 or FP16 (16-bit floating-point) formats.
  2. Target Format: During quantization, parameters are converted to formats like INT8 (8-bit integer), FP8, or other lower-precision types.
  3. Mapping: Quantization involves creating a mapping between the range of values in the original format (e.g., FP32) and the range of values in the target format (e.g., INT8). This mapping defines how to represent the original values using the limited range of the target format and can be linear or non-linear (a simple linear version is sketched after this list).
  4. Rounding: Based on the mapping, the original values are "rounded" to the nearest representable value in the target format.
  5. Information Loss: This rounding process inevitably leads to some loss of information, which can result in a decrease in the model's accuracy. The challenge in quantization lies in minimizing this loss of precision.
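As a minimal sketch of steps 3–5, assuming a symmetric, linear, per-tensor scheme (production quantizers add zero points, per-channel scales, and calibration), the snippet below quantizes a small weight array to INT8 and back, making the rounding error visible:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric linear per-tensor quantization: map floats to INT8 codes."""
    scale = np.abs(x).max() / 127.0               # largest magnitude maps to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the INT8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

weights = np.random.randn(5).astype(np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(weights)                                # original FP32 values
print(restored)                               # rounded reconstructions
print(np.abs(weights - restored).max())       # the information lost to rounding
```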

Example (INT8 Quantization):

  • FP32: One number occupies 4 bytes.
  • INT8: One number occupies 1 byte.

If we quantize a 175-billion-parameter model from FP32 to INT8, the model size shrinks from 700 GB to 175 GB! This is a significant saving, making it possible to run the model on smaller, less expensive GPUs (albeit often with a slight decrease in accuracy).

Quantization Methods:

  • Post-Training Quantization (PTQ): Quantization is performed after the model has been fully trained. This is the simplest method but may lead to a greater loss in accuracy.
  • Quantization-Aware Training (QAT): Quantization operations are simulated or incorporated into the training process itself. The model learns to compensate for the precision loss caused by quantization. This typically yields better accuracy than PTQ but requires more time and computational resources for training.
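To make the QAT idea concrete, here is a minimal sketch (assuming PyTorch) of the "fake quantization" trick: the forward pass uses rounded weights, while a straight-through estimator passes gradients through as if no rounding had happened. This illustrates the principle only, not a production QAT recipe:

```python
import torch

def fake_quant(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate low-precision rounding in the forward pass while letting
    gradients flow through unchanged (straight-through estimator)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max() / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    return x + (x_q - x).detach()  # forward: quantized value, backward: identity

w = torch.randn(4, 4, requires_grad=True)   # stand-in for a layer's weights
loss = fake_quant(w).sum()
loss.backward()
print(w.grad)  # gradients exist even though round() is not differentiable
```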

Summary

Quantization is an essential technique for efficiently running large-scale LLMs. It allows for a significant reduction in model size and memory requirements, making these powerful models accessible to a wider range of users and hardware. However, quantization involves a trade-off with accuracy, so selecting the appropriate quantization method and numerical format for the specific task is crucial. Hardware support (e.g., efficient INT8 operations on GPUs) is key for running quantized models quickly and effectively. The evolution of numerical formats (FP32, FP16, BF16, INT8, FP8) and their hardware support is directly linked to quantization, collectively enabling the creation and deployment of increasingly large and complex LLMs.
