Ever looked at two images and thought, “These feel the same somehow,” even when they aren’t identical? That gut feeling actually has a technical side, and no, you don’t need to be a machine learning whiz to get it. Thanks to Hugging Face’s Datasets and Transformers libraries, working with image similarity is far more accessible than it sounds. Let’s break it down into something you can work with—and maybe even enjoy doing.
You’ll be surprised how much you can achieve with just a few lines of code. The hard part—model training, data wrangling, optimization—has already been done for you. What’s left is plugging the right pieces together and making them do what you need.
Image similarity isn’t just about finding identical copies. It’s about finding images that look alike to the human eye, even if the pixels don’t match one-to-one. Whether you’re building a duplicate photo cleaner, recommending similar products, or clustering visuals for a design tool, this is the stuff that helps machines “see” better.
Traditionally, image similarity relied on things like histograms or pixel comparisons. But those techniques fall short when images have different resolutions, lighting, or angles. That’s where Transformers come in—they help models compare the meaning behind an image, not just the colors.
Before you can compare anything, you need something to compare. Hugging Face makes this step surprisingly light. The datasets library allows you to pull popular image datasets with just a few lines of code—no messy downloading or extracting.
from datasets import load_dataset
dataset = load_dataset("beans", split="train")
Yep, that’s it. This loads a small dataset of bean plant images, which is perfect if you’re just testing things out. Need something bigger? There are hundreds of image datasets available, and you can swap one line to pick another. It really is that flexible.
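For instance, swapping in a larger dataset is a one-line change. This sketch assumes the food101 dataset on the Hub, which also exposes its pictures under an "image" column (column names do vary across datasets, so check yours):

# Swap the dataset name to pull something else from the Hub.
dataset = load_dataset("food101", split="train")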
One caveat: the datasets library usually decodes image columns (including the one in beans) straight into PIL.Image objects, but some datasets store raw arrays instead. If yours does, convert the images first so any model from transformers can handle them.
from PIL import Image
import numpy as np

def convert(example):
    # Rebuild the image as a PIL object so transformers models can consume it.
    example["image"] = Image.fromarray(np.array(example["image"]))
    return example

dataset = dataset.map(convert)
Done. Now you’ve got real images ready to process.
When people hear “Transformers,” they often think of text. But over the past few years, vision transformers (ViTs) have caught up. These models split an image into a grid of patches and treat that patch sequence the way language models treat tokens, which lets them pick up patterns across the whole image rather than just local pixel values.
To get started, pick a pre-trained model and a feature extractor. The feature extractor helps prepare the image so that the model can make sense of it.
from transformers import AutoFeatureExtractor, AutoModel
import torch
extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
model = AutoModel.from_pretrained("google/vit-base-patch16-224")
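Before comparing real photos, it can be worth a quick sanity check that the extractor and model are wired together correctly. The sketch below uses a dummy RGB image (the size is arbitrary, since the extractor resizes inputs to 224x224 for this checkpoint); ViT-base with 16x16 patches yields 196 patch tokens plus one [CLS] token, each 768-dimensional:

from PIL import Image

# Dummy image just to check shapes; any RGB PIL image would do.
dummy = Image.new("RGB", (500, 400))
inputs = extractor(images=dummy, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])

with torch.no_grad():
    out = model(**inputs)
print(out.last_hidden_state.shape)  # torch.Size([1, 197, 768]): 196 patches + [CLS]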
Now let’s grab two images from your dataset and run them through the model to see how similar they are.
image1 = dataset[0]["image"]
image2 = dataset[1]["image"]

inputs1 = extractor(images=image1, return_tensors="pt")
inputs2 = extractor(images=image2, return_tensors="pt")

with torch.no_grad():
    emb1 = model(**inputs1).last_hidden_state.mean(dim=1)
    emb2 = model(**inputs2).last_hidden_state.mean(dim=1)

similarity = torch.nn.functional.cosine_similarity(emb1, emb2)
print(similarity.item())
That number you see? It tells you how similar the two images are. Cosine similarity ranges from -1 to 1: values close to 1 mean the images are very alike, while values near 0 (or below) mean they have little in common.
If you’re looking to expand on this idea and build a working image similarity system, here’s a simple guide. No fluff—just what you need:
Start with any dataset that suits your needs. Make sure your images are in a format compatible with the model. Use the datasets library and handle preprocessing with PIL and NumPy as shown earlier.
Stick to models like google/vit-base-patch16-224 or any other ViT from Hugging Face’s library. Pair it with the corresponding feature extractor to ensure inputs are correctly shaped and normalized.
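If you’re curious what that preparation actually does, you can inspect the extractor’s configuration. The attribute names below are the ones the ViT feature extractor exposes in recent transformers releases, so treat this as a sketch rather than a guaranteed interface:

print(extractor.size)        # target input resolution
print(extractor.image_mean)  # per-channel mean used for normalization
print(extractor.image_std)   # per-channel std used for normalization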
This part is key. Run each image through the model and store the resulting vector. These embeddings are what you’ll compare later. Use .mean(dim=1) to get a single vector per image.
def get_embedding(image):
    inputs = extractor(images=image, return_tensors="pt")
    with torch.no_grad():
        # Mean-pool the token embeddings into a single vector per image.
        output = model(**inputs).last_hidden_state.mean(dim=1)
    return output
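With that helper in place, you can precompute an embedding for every image in the dataset and stack them into a single tensor. This is a minimal sketch; for anything beyond a few thousand images you’d want to batch the forward passes:

# One (1, 768) embedding per image, concatenated into an (N, 768) tensor.
dataset_embeddings = torch.cat([get_embedding(item["image"]) for item in dataset])
dataset_images = [item["image"] for item in dataset]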
Cosine similarity works well because it compares the angle between two vectors rather than their magnitude. This makes it less sensitive to brightness and contrast differences.
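You can verify that magnitude invariance with a toy example (arbitrary numbers, just to illustrate the point):

import torch
from torch.nn.functional import cosine_similarity

v = torch.tensor([[1.0, 2.0, 3.0]])
print(cosine_similarity(v, 2 * v))  # prints 1.0: same direction, different length
print(cosine_similarity(v, -v))     # prints -1.0: opposite direction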
To compare an input image to every image in your dataset:
from torch.nn.functional import cosine_similarity

def find_similar(query_image, dataset_embeddings, dataset_images, top_k=5):
    # dataset_embeddings: (N, hidden) tensor of precomputed image embeddings.
    query_emb = get_embedding(query_image)
    sims = cosine_similarity(query_emb, dataset_embeddings)  # one score per image
    top_indices = sims.argsort(descending=True)[:top_k]
    return [dataset_images[i] for i in top_indices]
This gives you the top matches. You can plug this into a UI or even just save the top images to disk.
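For example, saving the matches to disk (assuming the PIL dataset images and the precomputed tensors from earlier) might look like this:

# Query with the first image and write its closest matches to disk.
matches = find_similar(dataset[0]["image"], dataset_embeddings, dataset_images)
for rank, img in enumerate(matches, start=1):
    img.save(f"match_{rank}.png")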
Let’s not overcomplicate it. You don’t need to train your own model unless you have a very niche dataset. Pre-trained models, especially ones trained on ImageNet or similar collections, already “understand” general visual patterns.
Also, your embeddings should be calculated just once and stored. Doing it on the fly for each comparison will slow everything down. A quick save to disk using torch.save() can help you keep things smooth.
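A minimal caching pattern, reusing the stacked embeddings tensor from earlier, could look like this:

# Compute once, save to disk, and reload in later runs instead of re-embedding.
torch.save(dataset_embeddings, "embeddings.pt")

# In a later session:
dataset_embeddings = torch.load("embeddings.pt")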
Image similarity doesn’t need to feel intimidating. Hugging Face offers everything—from curated datasets to powerful Transformer models—in a way that’s surprisingly practical. With just a few steps, you can build something useful that doesn’t just see images, but starts to understand them. Whether it’s organizing a gallery, improving search results, or building something entirely new, you’ve got the tools. Now it’s about trying it out and seeing where it takes you.