Working with large language models isn’t just about architecture anymore — it’s also about where and how you train them. If you’ve ever waited hours for your model to finish a single epoch or checked your cloud bill and wondered whether deep learning is only for those with deep pockets, then this will interest you. Training Hugging Face models on PyTorch/XLA TPUs changes the game for both speed and cost. Here’s how.
TPUs, or Tensor Processing Units, are Google’s answer to the growing demand for accelerated computing. Unlike GPUs, TPUs come with a different backend called XLA (Accelerated Linear Algebra), which speaks its own dialect of optimization. PyTorch/XLA acts as the bridge, translating PyTorch operations into something TPUs understand.
Now, bring Hugging Face into the mix. These models aren’t lightweight. BERT, T5, GPT-2 — they can balloon into billions of parameters. Traditionally, such a scale meant using pricey GPU clusters and long training times. But combine Hugging Face with PyTorch/XLA on TPUs, and you’ll notice two things: speed picks up, and the bills go down.
Before the buzzwords blur the facts, let’s look at the actual performance. TPUs aren’t just faster for the sake of being faster. They’re structured differently. Think of them as high-speed conveyor belts instead of forklifts. They work best when the workload is batched and uniform, which, conveniently, is exactly what model training needs.
Take the same BERT-base model and train it on a TPU v3-8 with PyTorch/XLA, and you can wrap up in less than half the time it would have taken on a single A100 GPU, while paying a lower hourly rate than the GPU commands.
Setting this up is not plug-and-play, but it’s also not something you need a PhD for. Here’s how you get from zero to TPU-powered training, one step at a time.
You’ll want a TPU-enabled VM from Google Cloud Platform (GCP). The most common setup involves either TPU v2 or v3 with a Debian-based environment. When setting up the VM, make sure to select a PyTorch/XLA image, not a plain PyTorch one.
Alternatively, you can spin up a TPU notebook directly from Google Colab with TPU runtime, though it’s better suited for smaller experiments.
Your environment needs three essentials:
transformers for the Hugging Face models
datasets for loading and preprocessing data
torch_xla for TPU operations
Install them with:
pip install transformers datasets
pip install torch==1.13.1 torch_xla==1.13 -f https://storage.googleapis.com/libtorchxla-releases/wheels/tpuvm/torch_xla.html
Ensure versions align with TPU compatibility to avoid training crashes.
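As a quick sanity check, confirm that the installation can actually see the TPU. A minimal sketch (the exact device string varies by torch_xla version and runtime):
import torch_xla.core.xla_model as xm
# Prints an XLA device such as xla:0 or xla:1 when the TPU runtime is reachable
print(xm.xla_device())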
Pick your Hugging Face model, say bert-base-uncased, and load it as usual. The difference starts when sending the model to the device. Instead of the usual .cuda(), use .to(device), where device is xm.xla_device().
import torch_xla.core.xla_model as xm
from transformers import BertForSequenceClassification, BertTokenizer

# Grab the TPU as an XLA device instead of a CUDA device
device = xm.xla_device()

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
model.to(device)
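Inputs have to live on the same XLA device as the model before the forward pass. Here is a minimal sketch (the example sentence and max_length of 128 are arbitrary choices of mine); padding every batch to a fixed length also helps XLA avoid recompiling for each new input shape:
inputs = tokenizer(
    ["TPUs keep the conveyor belt full."],
    padding='max_length', truncation=True, max_length=128,
    return_tensors='pt'
)
# Move every tensor in the batch onto the TPU, then run a forward pass
inputs = {k: v.to(device) for k, v in inputs.items()}
outputs = model(**inputs)
print(outputs.logits.shape)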
The training loop needs PyTorch/XLA utilities to sync across TPU cores and allow efficient data sharding. Instead of iterating over a bare DataLoader, wrap it in MpDeviceLoader, and call xm.optimizer_step(optimizer) rather than the typical optimizer.step() so gradients are reduced across cores before each update.
from torch.utils.data import DataLoader
from torch_xla.distributed.parallel_loader import MpDeviceLoader

# MpDeviceLoader wraps a regular DataLoader and streams batches onto the TPU
train_loader = MpDeviceLoader(DataLoader(train_dataset, batch_size=32), device)

for batch in train_loader:
    optimizer.zero_grad()
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    # Reduces gradients across TPU cores, then applies the parameter update
    xm.optimizer_step(optimizer)
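For context, here is one way the pieces the loop assumes (train_dataset and optimizer) might be prepared. This is a sketch under assumptions of my own, not part of the original recipe: the GLUE SST-2 dataset, AdamW, and a fixed sequence length of 128.
from datasets import load_dataset
from torch.optim import AdamW

raw = load_dataset('glue', 'sst2', split='train')

def tokenize(examples):
    return tokenizer(examples['sentence'], padding='max_length',
                     truncation=True, max_length=128)

# Tokenize once up front and keep only the tensors the model expects
train_dataset = raw.map(tokenize, batched=True)
train_dataset = train_dataset.rename_column('label', 'labels')
train_dataset.set_format('torch',
                         columns=['input_ids', 'attention_mask', 'labels'])

optimizer = AdamW(model.parameters(), lr=2e-5)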
This minor restructuring unlocks all the parallelism TPU offers without needing to re-architect your model.
TPU v3-8 offers 8 cores. If you want real speed, use them all. This means wrapping your training script with xmp.spawn, which runs training in parallel across all cores.
import torch_xla.distributed.xla_multiprocessing as xmp

def train_fn(index):
    # Include steps 3 and 4 here: build the model, loaders, and loop on this core
    pass

xmp.spawn(train_fn, nprocs=8)
Each core gets its own process, training independently while syncing gradients behind the scenes. It feels like magic but runs like clockwork.
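To make that concrete, here is a hedged sketch of what train_fn could contain so each core sees its own slice of the data. The DistributedSampler wiring, batch size, and learning rate are assumptions of mine, and train_dataset is the tokenized dataset from the earlier sketch:
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
from torch.optim import AdamW
from torch.utils.data import DataLoader, DistributedSampler
from torch_xla.distributed.parallel_loader import MpDeviceLoader
from transformers import BertForSequenceClassification

def train_fn(index):
    device = xm.xla_device()  # each spawned process binds to its own TPU core
    model = BertForSequenceClassification.from_pretrained(
        'bert-base-uncased').to(device)
    optimizer = AdamW(model.parameters(), lr=2e-5)

    # Shard the dataset so the 8 cores work on disjoint slices of each epoch
    sampler = DistributedSampler(train_dataset,
                                 num_replicas=xm.xrt_world_size(),
                                 rank=xm.get_ordinal(),
                                 shuffle=True)
    loader = MpDeviceLoader(
        DataLoader(train_dataset, batch_size=32, sampler=sampler), device)

    model.train()
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        xm.optimizer_step(optimizer)  # all-reduce gradients across cores

xmp.spawn(train_fn, nprocs=8)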
This isn’t just about time. TPUs offer significant pricing efficiency. A TPU v3-8 on GCP costs less per hour than four A100 GPUs. But because of the speed advantage and better scaling, jobs finish sooner.
So while you might pay $8 per hour for a TPU and $20 for multiple GPUs, the real difference appears when you calculate the total cost per training run. Many find themselves cutting down expenses by 30–50%, especially when training models on large datasets or experimenting with multiple configurations.
Also worth noting — many TPU trials or community notebooks are either free or low-cost, making them ideal for prototyping before committing to larger projects.
Putting Hugging Face models on PyTorch/XLA with TPUs isn’t just about speed or cost — it’s about efficiency. You get the kind of performance that used to require expensive clusters, all while writing nearly the same code as before. With just a few adjustments to your training script and the right setup, you’re working smarter, not harder. And in machine learning, that’s a rare win.
So next time you’re staring at a progress bar that hasn’t moved in hours, remember that TPUs might be what gets it done faster and cheaper. I hope you found this article worth reading. Stay tuned for more helpful guides.