Published on July 5, 2025

Running Large Transformer Models with 8-bit Precision Using Hugging Face and bitsandbytes

Scaling large language models often involves trade-offs. As these models grow in size and capability, they require more memory, computing power, and energy. For many developers and researchers, training or running them can be both technically and financially demanding. An effective strategy for reducing these costs, with little to no loss in model quality, is 8-bit matrix multiplication.

This technique is not entirely new, but its application in modern transformer models—particularly with tools like Hugging Face Transformers, Accelerate, and bitsandbytes—makes it more practical and efficient than ever. In this article, we’ll explore how 8-bit precision aids in scaling transformers, how these tools integrate seamlessly, and their practical applications.

Understanding 8-bit Matrix Multiplication

Matrix multiplication forms the backbone of transformer models, occurring constantly during training and inference in the attention and feedforward layers. Traditionally, these multiplications use 16-bit or 32-bit floating-point numbers, which offer precision but demand substantial memory bandwidth and compute. 8-bit matrix multiplication uses 8-bit integers instead, significantly reducing memory usage and speeding up the operation by working with smaller data types and exploiting hardware support for fast integer arithmetic.

One might expect this reduction in precision to negatively impact model performance. However, the technique involves maintaining higher precision for inputs and outputs (e.g., 16-bit or 32-bit), while performing the internal matrix multiplication—the most resource-intensive operation—in 8-bit. This approach, a form of quantization, has been refined to minimize accuracy loss.
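To make the idea concrete, here is a small illustrative sketch in plain NumPy (not the optimized CUDA kernels bitsandbytes actually uses) of absmax quantization: float matrices are scaled into the int8 range, the multiplication runs on integers, and the result is rescaled back to floating point.

import numpy as np

def absmax_quantize(x):
    # Map floats into the signed int8 range [-127, 127] with a per-tensor scale.
    scale = 127.0 / np.max(np.abs(x))
    return np.round(x * scale).astype(np.int8), scale

# Two small float32 matrices standing in for a weight and an activation.
A = np.random.randn(4, 8).astype(np.float32)
B = np.random.randn(8, 3).astype(np.float32)

A_q, a_scale = absmax_quantize(A)
B_q, b_scale = absmax_quantize(B)

# Multiply in integer space (accumulating in int32 to avoid overflow),
# then dequantize the result back to float.
C_int = A_q.astype(np.int32) @ B_q.astype(np.int32)
C_approx = C_int / (a_scale * b_scale)

print(np.max(np.abs(C_approx - A @ B)))  # the quantization error stays small

The real kernels refine this basic recipe (for example with per-row scaling and outlier handling), but the flow is the same: quantize, multiply in int8, dequantize.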

The bitsandbytes library, developed by Tim Dettmers, provides a highly efficient implementation of 8-bit optimizers and matrix multiplication routines. This library is widely adopted, especially for large model training and inference. Hugging Face’s ecosystem, with its Transformers and Accelerate libraries, integrates well with bitsandbytes, allowing developers to leverage these optimizations with minimal code changes.
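At the layer level, the library exposes an 8-bit linear layer that can stand in for torch.nn.Linear. The following is a rough sketch of the usage pattern, assuming a CUDA GPU is available, since the actual quantization happens when the layer is moved onto the device.

import torch
import bitsandbytes as bnb

# A regular fp16 linear layer and its 8-bit counterpart.
fp16_linear = torch.nn.Linear(1024, 1024).half()
int8_linear = bnb.nn.Linear8bitLt(1024, 1024, has_fp16_weights=False)

# Copy the trained weights, then move to the GPU;
# the weights are quantized to int8 during the transfer.
int8_linear.load_state_dict(fp16_linear.state_dict())
int8_linear = int8_linear.to("cuda")

x = torch.randn(1, 1024, dtype=torch.float16, device="cuda")
with torch.no_grad():
    y = int8_linear(x)  # the matmul runs through the 8-bit kernels

In practice you rarely build layers by hand like this; the Transformers integration described below swaps them in for you.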

Benefits of Using bitsandbytes with Hugging Face Transformers and Accelerate

With large models like LLaMA, Falcon, or GPT variants, memory often becomes a bottleneck when loading them on a single GPU or across multiple GPUs. Using bitsandbytes, you can load a model in 8-bit mode, allowing it to fit into a much smaller memory space. This is particularly beneficial during inference and training when using 8-bit optimizers.

Hugging Face Transformers provides the model classes and pre-trained weights widely used in research and industry. By default, models load in 32-bit or 16-bit precision. With a few configuration adjustments, however, you can load them in 8-bit using bitsandbytes, cutting the weights' memory footprint roughly in half compared with 16-bit and to about a quarter of the 32-bit footprint.

The Accelerate library from Hugging Face further simplifies distributed training, mixed precision setups, and device placement, automating much of the boilerplate required to get models onto the appropriate devices. It seamlessly integrates with bitsandbytes, forming a compact and efficient stack for running large-scale transformer models on consumer-grade hardware.

How 8-bit Matrix Multiplication Works Under the Hood

To understand how this works in practice, let’s delve into the core idea. Standard floating-point matrix multiplication involves two matrices with 16-bit or 32-bit numbers. These are multiplied and summed, requiring high memory bandwidth and computational power. Quantization converts these values to 8-bit integers before multiplication. The multiplication is performed in integer space, and results are then dequantized—converted back to floating-point—for further processing.

Quantization relies on scale (and sometimes offset) values to preserve the range and distribution of the original data. In bitsandbytes, quantization is applied dynamically within each layer, with scaling factors computed from the actual values rather than learned or fixed ahead of time, which helps the model maintain accuracy despite the reduced precision.

Different modes of 8-bit quantization exist: symmetric, asymmetric, static, and dynamic. bitsandbytes focuses on the techniques that suit transformers best, especially scenarios where matrix weights can be pre-quantized and stored in 8-bit form. During inference, this means loading pre-quantized weights and performing the matrix multiplication directly in 8-bit. For training, the 8-bit optimizers keep their optimizer state, such as momentum and variance terms, in compressed 8-bit form.
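To unpack the modes mentioned above: symmetric quantization uses a single scale (real zero maps to integer zero), while asymmetric quantization adds a zero point so an uncentered value range can use the full integer span. bitsandbytes' own kernels are more sophisticated, but a toy NumPy sketch captures the distinction:

import numpy as np

def quantize_symmetric(x):
    # Symmetric: one scale, real zero maps to integer zero.
    scale = 127.0 / np.max(np.abs(x))
    return np.round(x * scale).astype(np.int8), scale

def quantize_asymmetric(x):
    # Asymmetric: a scale plus a zero point, so the whole [min, max] range
    # is spread over [0, 255] even when the data is not centered on zero.
    scale = 255.0 / (x.max() - x.min())
    zero_point = np.round(-x.min() * scale)
    return np.round(x * scale + zero_point).astype(np.uint8), scale, zero_point

x = np.array([-0.5, 0.1, 0.4, 2.0], dtype=np.float32)
q_sym, s = quantize_symmetric(x)
q_asym, s2, zp = quantize_asymmetric(x)
print(q_sym, q_sym / s)            # dequantize: q / scale
print(q_asym, (q_asym - zp) / s2)  # dequantize: (q - zero_point) / scale

Static quantization fixes these parameters ahead of time from calibration data, while dynamic quantization recomputes them from the actual values at runtime.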

A notable feature of bitsandbytes is its ability to selectively quantize parts of the model. For instance, you might load attention layers in 8-bit while maintaining full precision for the output layer. This flexibility allows control over the balance between performance and resource savings.
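One way to express this selectivity in the Hugging Face integration (using the loading API shown in the next section) is the llm_int8_skip_modules option of the quantization config, which keeps the named modules in their original precision. A minimal sketch; the exact module names, such as the lm_head output layer here, depend on the model architecture.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize most linear layers to 8-bit, but leave the output head alone.
config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["lm_head"],
)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b1",
    quantization_config=config,
    device_map="auto",
)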

Applying It in Code: Loading a Model in 8-bit

To load a popular transformer model, such as LLaMA or BLOOM, in 8-bit with Hugging Face, first make sure the required libraries are installed:

pip install transformers accelerate bitsandbytes

Then, you can use the following pattern in your code:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "bigscience/bloom-1b1"

# Quantization settings handled by bitsandbytes.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",  # let Accelerate place the layers on available devices
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("What's the capital of France?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Here, BitsAndBytesConfig(load_in_8bit=True), passed through quantization_config (the current form of the older load_in_8bit=True shortcut), enables the bitsandbytes integration and loads the weights in 8-bit precision, drastically reducing memory usage. device_map="auto" hands placement to Accelerate, which distributes the model across the available devices, whether one GPU or several.

This example demonstrates the ease of adopting 8-bit inference without extensive code modifications or dealing with low-level details. It enables running billion-parameter models on a single GPU with as little as 16 GB of memory.
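If you want to verify the savings and the placement, the loaded model exposes a couple of convenient hooks:

# Approximate memory footprint of the quantized model, in GB.
print(model.get_memory_footprint() / 1e9)

# Where Accelerate placed each block of layers (GPU index, "cpu", or "disk").
print(model.hf_device_map)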

For training, the same stack applies: plug the 8-bit optimizers provided by bitsandbytes into the Trainer or into a custom training loop. The training process remains familiar, just with lower memory overhead.
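As a minimal sketch of the optimizer side (assuming a CUDA device and a model held in 16- or 32-bit precision), the 8-bit Adam variant from bitsandbytes is a near drop-in replacement for torch.optim.Adam; the dataloader and loss below are placeholders.

import bitsandbytes as bnb

# 8-bit Adam keeps its optimizer state (momentum and variance terms)
# in block-wise quantized 8-bit form, which sharply cuts optimizer memory.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=2e-5)

for batch in dataloader:  # placeholder training loop
    outputs = model(**batch)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

With the Trainer API, recent Transformers releases let you pick the 8-bit AdamW variant by setting optim="adamw_bnb_8bit" in TrainingArguments.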

Conclusion

8-bit matrix multiplication, combined with bitsandbytes, Hugging Face Transformers, and Accelerate, simplifies the efficient operation of large transformer models. It significantly reduces memory and computational demands while maintaining performance. This approach empowers developers to utilize powerful models on more modest hardware. If you’re working with transformers, exploring 8-bit precision could help you scale more effectively while keeping costs and hardware requirements low—it’s a practical and surprisingly effective strategy.