Scaling large language models often involves trade-offs. As these models grow in size and capabilities, they require more memory, computing power, and energy. For many developers and researchers, training or running these models can be both technically and financially demanding. An effective strategy to reduce these costs, without sacrificing performance, is the use of 8-bit matrix multiplication.
This technique is not entirely new, but its application in modern transformer models—particularly with tools like Hugging Face Transformers, Accelerate, and bitsandbytes—makes it more practical and efficient than ever. In this article, we’ll explore how 8-bit precision aids in scaling transformers, how these tools integrate seamlessly, and their practical applications.
Matrix multiplication forms the backbone of transformer models, occurring constantly during training and inference inside attention and feedforward layers. Traditionally, these multiplications use 16-bit or 32-bit floating-point numbers, which offer precision but demand substantial memory and compute. 8-bit matrix multiplication uses 8-bit integers instead, significantly reducing memory usage and speeding up the operation by moving less data and exploiting hardware support for fast integer arithmetic.
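To get a rough sense of the savings, consider the weights alone: a 1-billion-parameter model needs about 4 GB in 32-bit floats, about 2 GB in 16-bit, and roughly 1 GB in 8-bit, since each parameter shrinks from 4 bytes to 1. This back-of-the-envelope figure ignores activations, optimizer state, and framework overhead, but it shows why the precision of the weights dominates the memory budget.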
One might expect this reduction in precision to negatively impact model performance. However, the technique involves maintaining higher precision for inputs and outputs (e.g., 16-bit or 32-bit), while performing the internal matrix multiplication—the most resource-intensive operation—in 8-bit. This approach, a form of quantization, has been refined to minimize accuracy loss.
The bitsandbytes library, developed by Tim Dettmers, provides a highly efficient implementation of 8-bit optimizers and matrix multiplication routines. This library is widely adopted, especially for large model training and inference. Hugging Face’s ecosystem, with its Transformers and Accelerate libraries, integrates well with bitsandbytes, allowing developers to leverage these optimizations with minimal code changes.
With large models like LLaMA, Falcon, or GPT variants, memory often becomes a bottleneck when loading them on a single GPU or across multiple GPUs. Using bitsandbytes, you can load a model in 8-bit mode, allowing it to fit into a much smaller memory space. This is particularly beneficial during inference and training when using 8-bit optimizers.
Hugging Face Transformers provides model classes and pre-trained weights widely used in research and industry. Typically, models are loaded in 32-bit or 16-bit precision by default. However, with a few configuration adjustments, you can load them in 8-bit using bitsandbytes, potentially cutting memory usage by more than half.
The Accelerate library from Hugging Face further simplifies distributed training, mixed precision setups, and device placement, automating much of the boilerplate required to get models onto the appropriate devices. It seamlessly integrates with bitsandbytes, forming a compact and efficient stack for running large-scale transformer models on consumer-grade hardware.
To understand how this works in practice, let’s delve into the core idea. Standard floating-point matrix multiplication involves two matrices with 16-bit or 32-bit numbers. These are multiplied and summed, requiring high memory bandwidth and computational power. Quantization converts these values to 8-bit integers before multiplication. The multiplication is performed in integer space, and results are then dequantized—converted back to floating-point—for further processing.
Quantization involves scaling and offset values to preserve the range and distribution of the original data. In bitsandbytes, quantization is applied dynamically per layer, with scaling factors computed from the data itself (for example, from per-vector absolute maxima), which helps the model maintain accuracy despite the reduced precision.
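To make that round trip concrete, here is a minimal PyTorch sketch of symmetric (absmax) quantization around a single matrix multiplication. It is a didactic simplification, not the actual bitsandbytes kernel, which adds refinements such as per-vector scaling and outlier handling.

import torch

def quantize_absmax(x):
    # Map the largest magnitude to 127, then round to 8-bit integers.
    scale = x.abs().max() / 127.0
    return torch.clamp((x / scale).round(), -127, 127).to(torch.int8), scale

a = torch.randn(4, 8)   # stand-in for activations
b = torch.randn(8, 3)   # stand-in for weights

a_q, a_scale = quantize_absmax(a)
b_q, b_scale = quantize_absmax(b)

# Multiply in integer space (accumulated in int64 here), then dequantize.
c = (a_q.long() @ b_q.long()).float() * (a_scale * b_scale)

print((c - a @ b).abs().max())  # the remaining quantization error is small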
Different modes of 8-bit quantization exist: symmetric, asymmetric, static, and dynamic. Bitsandbytes focuses on techniques that optimize transformers, especially scenarios where matrix weights can be pre-quantized and stored in 8-bit form. During inference, this means loading pre-quantized weights and performing matrix multiplication directly in 8-bit. For training, 8-bit optimizers track gradients and weight updates in compressed form.
A notable feature of bitsandbytes is its ability to selectively quantize parts of the model. For instance, you might load attention layers in 8-bit while maintaining full precision for the output layer. This flexibility allows control over the balance between performance and resource savings.
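With recent versions of Transformers, one way to express this selectivity is through BitsAndBytesConfig, which lets you keep chosen modules in higher precision. The module name below (lm_head) is illustrative; the right names depend on the model's architecture.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize most linear layers to 8-bit, but keep the output head in full precision.
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["lm_head"]
)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b1",
    quantization_config=quant_config,
    device_map="auto"
)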
To load a popular transformer model, such as LLaMA or BLOOM, using Hugging Face in 8-bit, follow these steps. First, ensure you have the required libraries:
pip install transformers accelerate bitsandbytes
Then, you can use the following pattern in your code:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "bigscience/bloom-1b1"

# Load the model with its weights quantized to 8-bit via bitsandbytes and
# let Accelerate place the layers on the available devices automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize a prompt, move it to the model's device, and generate a completion.
inputs = tokenizer("What's the capital of France?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Here, load_in_8bit=True enables the bitsandbytes integration, loading the model weights in 8-bit precision and drastically reducing memory usage, while device_map="auto" uses Accelerate to distribute the model across the available devices, whether that is a single GPU or several.
This example demonstrates the ease of adopting 8-bit inference without extensive code modifications or dealing with low-level details. It enables running billion-parameter models on a single GPU with as little as 16 GB of memory.
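If you want to verify the savings on your own hardware, the model's reported memory footprint is an easy sanity check (the exact number will vary by model and setup):

print(f"{model.get_memory_footprint() / 1e9:.2f} GB")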
For training with 8-bit optimizers, use similar patterns with Trainer or custom training loops, referencing the 8-bit optimizers provided by bitsandbytes. The training process remains familiar but with enhanced efficiency and reduced memory overhead.
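As a minimal sketch of that pattern in a custom loop, assuming the bitsandbytes 8-bit AdamW optimizer and a dataloader and model you already have (with batches that include labels so the forward pass returns a loss):

import bitsandbytes as bnb

# Swap the standard optimizer for its 8-bit counterpart; the loop itself is unchanged.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=2e-5)

for batch in dataloader:  # your existing DataLoader
    outputs = model(**batch)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

With the Trainer API, recent versions of Transformers expose the same optimizer via the optim="adamw_bnb_8bit" setting in TrainingArguments.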
8-bit matrix multiplication, combined with bitsandbytes, Hugging Face Transformers, and Accelerate, simplifies the efficient operation of large transformer models. It significantly reduces memory and computational demands while maintaining performance. This approach empowers developers to utilize powerful models on more modest hardware. If you’re working with transformers, exploring 8-bit precision could help you scale more effectively while keeping costs and hardware requirements low—it’s a practical and surprisingly effective strategy.