Ever looked at two images and thought, “These feel the same somehow,” even when they aren’t identical? That gut feeling actually has a technical side, and no, you don’t need to be a machine learning whiz to get it. Thanks to Hugging Face’s Datasets and Transformers libraries, working with image similarity is far more accessible than it sounds. Let’s break it down into something you can work with—and maybe even enjoy doing.
You’ll be surprised how much you can achieve with just a few lines of code. The hard part—model training, data wrangling, optimization—has already been done for you. What’s left is plugging the right pieces together and making them do what you need.
Image similarity isn’t just about finding identical copies. It’s about finding images that look alike to the human eye, even if the pixels don’t match one-to-one. Whether you’re building a duplicate photo cleaner, recommending similar products, or clustering visuals for a design tool, this is the stuff that helps machines “see” better.
Traditionally, image similarity relied on things like histograms or pixel comparisons. But those techniques fall short when images have different resolutions, lighting, or angles. That’s where Transformers come in—they help models compare the meaning behind an image, not just the colors.
Before you can compare anything, you need something to compare. Hugging Face makes this step surprisingly light. The datasets library allows you to pull popular image datasets with just a few lines of code—no messy downloading or extracting.
from datasets import load_dataset
dataset = load_dataset("beans", split="train")
Yep, that’s it. This loads a small dataset of bean plant images, which is perfect if you’re just testing things out. Need something bigger? There are hundreds of image datasets available, and you can swap one line to pick another. It really is that flexible.
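For instance, swapping in a larger dataset is a one-line change. This sketch assumes the food101 dataset on the Hub, which also exposes its pictures under an "image" column (column names do vary across datasets, so check yours):

# Swap the dataset name to pull something else from the Hub.
dataset = load_dataset("food101", split="train")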
One caveat: the datasets library usually decodes image columns (including the one in beans) straight into PIL.Image objects, but some datasets store raw arrays instead. If yours does, convert the images first so any model from transformers can handle them.
from PIL import Image
import numpy as np

def convert(example):
    # Rebuild the image as a PIL object so transformers models can consume it.
    example["image"] = Image.fromarray(np.array(example["image"]))
    return example

dataset = dataset.map(convert)
Done. Now you’ve got real images ready to process.
When people hear “Transformers,” they often think of text. But over the past few years, vision transformers (ViTs) have caught up. These models split an image into a grid of patches and treat that patch sequence the way language models treat tokens, which lets them pick up patterns across the whole image rather than just local pixel values.
To get started, pick a pre-trained model and a feature extractor. The feature extractor helps prepare the image so that the model can make sense of it.
from transformers import AutoFeatureExtractor, AutoModel
import torch
extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
model = AutoModel.from_pretrained("google/vit-base-patch16-224")
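Before comparing real photos, it can be worth a quick sanity check that the extractor and model are wired together correctly. The sketch below uses a dummy RGB image (the size is arbitrary, since the extractor resizes inputs to 224x224 for this checkpoint); ViT-base with 16x16 patches yields 196 patch tokens plus one [CLS] token, each 768-dimensional:

from PIL import Image

# Dummy image just to check shapes; any RGB PIL image would do.
dummy = Image.new("RGB", (500, 400))
inputs = extractor(images=dummy, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])

with torch.no_grad():
    out = model(**inputs)
print(out.last_hidden_state.shape)  # torch.Size([1, 197, 768]): 196 patches + [CLS]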
Now let’s grab two images from your dataset and run them through the model to see how similar they are.
image1 = dataset[0]["image"]
image2 = dataset[1]["image"]

inputs1 = extractor(images=image1, return_tensors="pt")
inputs2 = extractor(images=image2, return_tensors="pt")

with torch.no_grad():
    emb1 = model(**inputs1).last_hidden_state.mean(dim=1)
    emb2 = model(**inputs2).last_hidden_state.mean(dim=1)

similarity = torch.nn.functional.cosine_similarity(emb1, emb2)
print(similarity.item())
That number you see? It tells you how similar the two images are. Cosine similarity ranges from -1 to 1: values close to 1 mean the images are very alike, while values near 0 (or below) mean they have little in common.
If you’re looking to expand on this idea and build a working image similarity system, here’s a simple guide. No fluff—just what you need:
Start with any dataset that suits your needs. Make sure your images are in a format compatible with the model. Use the datasets library and handle preprocessing with PIL and NumPy as shown earlier.
Stick to models like google/vit-base-patch16-224 or any other ViT from Hugging Face’s library. Pair it with the corresponding feature extractor to ensure inputs are correctly shaped and normalized.
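If you’re curious what that preparation actually does, you can inspect the extractor’s configuration. The attribute names below are the ones the ViT feature extractor exposes in recent transformers releases, so treat this as a sketch rather than a guaranteed interface:

print(extractor.size)        # target input resolution
print(extractor.image_mean)  # per-channel mean used for normalization
print(extractor.image_std)   # per-channel std used for normalization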
This part is key. Run each image through the model and store the resulting vector. These embeddings are what you’ll compare later. Use .mean(dim=1) to get a single vector per image.
def get_embedding(image):
    inputs = extractor(images=image, return_tensors="pt")
    with torch.no_grad():
        # Mean-pool the token embeddings into a single vector per image.
        output = model(**inputs).last_hidden_state.mean(dim=1)
    return output
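With that helper in place, you can precompute an embedding for every image in the dataset and stack them into a single tensor. This is a minimal sketch; for anything beyond a few thousand images you’d want to batch the forward passes:

# One (1, 768) embedding per image, concatenated into an (N, 768) tensor.
dataset_embeddings = torch.cat([get_embedding(item["image"]) for item in dataset])
dataset_images = [item["image"] for item in dataset]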
Cosine similarity works well because it compares the angle between two vectors rather than their magnitude. This makes it less sensitive to brightness and contrast differences.
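You can verify that magnitude invariance with a toy example (arbitrary numbers, just to illustrate the point):

import torch
from torch.nn.functional import cosine_similarity

v = torch.tensor([[1.0, 2.0, 3.0]])
print(cosine_similarity(v, 2 * v))  # prints 1.0: same direction, different length
print(cosine_similarity(v, -v))     # prints -1.0: opposite direction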
To compare an input image to every image in your dataset:
from torch.nn.functional import cosine_similarity

def find_similar(query_image, dataset_embeddings, dataset_images, top_k=5):
    # dataset_embeddings: (N, hidden) tensor of precomputed image embeddings.
    query_emb = get_embedding(query_image)
    sims = cosine_similarity(query_emb, dataset_embeddings)  # one score per image
    top_indices = sims.argsort(descending=True)[:top_k]
    return [dataset_images[i] for i in top_indices]
This gives you the top matches. You can plug this into a UI or even just save the top images to disk.
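For example, saving the matches to disk (assuming the PIL dataset images and the precomputed tensors from earlier) might look like this:

# Query with the first image and write its closest matches to disk.
matches = find_similar(dataset[0]["image"], dataset_embeddings, dataset_images)
for rank, img in enumerate(matches, start=1):
    img.save(f"match_{rank}.png")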
Let’s not overcomplicate it. You don’t need to train your own model unless you have a very niche dataset. Pre-trained models, especially ones trained on ImageNet or similar collections, already “understand” general visual patterns.
Also, your embeddings should be calculated just once and stored. Doing it on the fly for each comparison will slow everything down. A quick save to disk using torch.save() can help you keep things smooth.
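A minimal caching pattern, reusing the stacked embeddings tensor from earlier, could look like this:

# Compute once, save to disk, and reload in later runs instead of re-embedding.
torch.save(dataset_embeddings, "embeddings.pt")

# In a later session:
dataset_embeddings = torch.load("embeddings.pt")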
Image similarity doesn’t need to feel intimidating. Hugging Face offers everything—from curated datasets to powerful Transformer models—in a way that’s surprisingly practical. With just a few steps, you can build something useful that doesn’t just see images, but starts to understand them. Whether it’s organizing a gallery, improving search results, or building something entirely new, you’ve got the tools. Now it’s about trying it out and seeing where it takes you.