In the world of large language models (LLMs), innovation is driven by the quest for enhanced efficiency, scalability, and the ability to manage longer context windows. AI21 Labs has taken a significant step forward with the release of Jamba 1.5, featuring the groundbreaking hybrid Mamba-Transformer architecture.
Jamba 1.5 is crafted to excel in natural language tasks, offering superior memory management, speed, and contextual understanding. It combines the structured state space modeling (SSM) capabilities of Mamba with the global attention features of the Transformer.
This hybrid architecture allows it to process up to 256,000 tokens—an industry-leading context window for open-source models. In this post, we explore what makes Jamba 1.5 unique, how its hybrid architecture functions, and why it is crucial for the future of AI development and deployment.
Jamba 1.5 is an instruction-tuned language model that combines two architectures: the traditional Transformer and the more recent Mamba SSM. Unlike models that rely solely on attention mechanisms, Jamba leverages both state space models and attention layers, offering improved performance across long-context tasks and low-latency environments.
Jamba 1.5 is available in two main variants: Jamba 1.5 Mini (roughly 12B active parameters out of 52B total) and Jamba 1.5 Large (roughly 94B active parameters out of 398B total).
Despite their size differences, both models benefit from the same hybrid foundation, allowing them to perform diverse NLP tasks—from summarization and translation to question answering and text classification—with exceptional efficiency.
The core of Jamba 1.5’s strength lies in how it merges two distinct design philosophies into a hybrid architecture. Here’s how this architecture is structured:
Jamba 1.5 is constructed using 9 modular blocks, each containing 8 layers. These layers follow a 1:7 ratio—meaning for every Transformer attention layer, there are seven Mamba layers. This design allows the model to benefit from the long-range, low-memory characteristics of Mamba while retaining the attention capabilities of Transformer layers for global pattern recognition.
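To make the layer arithmetic concrete, here is a minimal Python sketch of what such a 1:7 schedule looks like when expanded; the position of the attention layer inside each block is an illustrative assumption, not AI21's implementation.

```python
# Illustrative sketch of the 1:7 attention-to-Mamba schedule (not AI21's code).
LAYERS_PER_BLOCK = 8    # each modular block holds 8 layers
NUM_BLOCKS = 9          # Jamba 1.5 stacks 9 such blocks
ATTENTION_INDEX = 4     # assumed position of the single attention layer within a block

def build_block(block_id):
    """Return layer descriptions for one hybrid block: 7 Mamba layers and 1 attention layer."""
    layers = []
    for i in range(LAYERS_PER_BLOCK):
        kind = "Transformer attention" if i == ATTENTION_INDEX else "Mamba (SSM)"
        layers.append(f"block{block_id}.layer{i}: {kind}")
    return layers

schedule = [layer for b in range(NUM_BLOCKS) for layer in build_block(b)]
print(len(schedule))                              # 72 layers in total
print(sum("attention" in s for s in schedule))    # 9 attention layers, one per block
```

Expanding 9 blocks of 8 layers gives 72 layers in total, only 9 of which are attention layers—which is where most of the memory savings over an attention-only stack come from.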
The architecture integrates a Mixture-of-Experts (MoE) mechanism. It consists of 16 expert models, of which only the top 2 are activated per token. This enables dynamic routing and ensures specialized processing for different input types, boosting performance while keeping computation efficient.
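As a rough illustration of top-2 routing, the toy NumPy sketch below scores a token against 16 experts and keeps only the two highest-scoring ones; the gating function and dimensions are simplified assumptions rather than AI21's actual router.

```python
import numpy as np

# Toy top-2 routing over 16 experts (simplified gating; not AI21's router).
NUM_EXPERTS, TOP_K, HIDDEN = 16, 2, 512
rng = np.random.default_rng(0)
router_weights = rng.normal(size=(HIDDEN, NUM_EXPERTS))     # learned gating matrix (random here)

def route(hidden_state):
    """Score all 16 experts for one token, keep the top 2, and normalize their weights."""
    logits = hidden_state @ router_weights                   # one score per expert
    top_idx = np.argsort(logits)[-TOP_K:]                    # indices of the 2 best experts
    gates = np.exp(logits[top_idx] - logits[top_idx].max())  # softmax over the selected experts
    gates /= gates.sum()
    return top_idx, gates

token = rng.normal(size=HIDDEN)
experts, weights = route(token)
print(experts, weights)   # only these 2 of the 16 experts run for this token
```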
To enhance memory efficiency, Jamba 1.5 uses ExpertsInt8 quantization for both its MoE and MLP layers. This allows it to operate in 8-bit precision without compromising on throughput, significantly reducing memory load—particularly important for real-time or resource-constrained deployments.
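The snippet below shows the general idea behind int8 weight storage with on-the-fly dequantization. It is a generic per-column round-trip, not the ExpertsInt8 kernel itself, but it illustrates why keeping expert weights in 8 bits halves their memory footprint with little numerical error.

```python
import numpy as np

# Generic int8 weight round-trip (illustrative; not the ExpertsInt8 kernel itself).
rng = np.random.default_rng(0)
w_fp16 = rng.normal(scale=0.02, size=(4096, 1024)).astype(np.float16)  # one expert's MLP weight

scale = np.abs(w_fp16).max(axis=0) / 127.0           # per-column scale factors
w_int8 = np.round(w_fp16 / scale).astype(np.int8)    # stored in 8 bits: half the memory of fp16

def matmul_dequant(x, w_q, s):
    """Dequantize on the fly and multiply, as an int8-weight inference step would."""
    return x @ (w_q.astype(np.float32) * s)

x = rng.normal(size=(1, 4096)).astype(np.float32)
y_ref = x @ w_fp16.astype(np.float32)
y_q = matmul_dequant(x, w_int8, scale)
print(w_fp16.nbytes / w_int8.nbytes)                        # 2.0: int8 halves the weight memory
print(np.max(np.abs(y_ref - y_q)) / np.max(np.abs(y_ref)))  # small relative quantization error
```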
With 64 attention heads for queries and 8 key-value heads, Jamba 1.5 maintains high attention capacity. Most importantly, it supports a context window of 256K tokens, currently the largest among publicly available open-source models. Traditional Transformers struggle with long sequences due to memory-intensive key-value (KV) caching. Jamba addresses this with architectural optimizations that reduce KV cache memory while preserving sequence integrity.
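A back-of-the-envelope calculation shows why sharing 8 key-value heads across 64 query heads matters at a 256K-token context. The head dimension, bytes per value, and attention-layer count below are assumptions for illustration, not published Jamba 1.5 figures.

```python
# Back-of-the-envelope KV-cache sizing at a 256K-token context.
# Head dimension, dtype, and attention-layer count are assumed values for illustration.
context_len = 256_000
head_dim = 128           # assumed
attn_layers = 9          # one attention layer per block, 9 blocks
bytes_per_value = 2      # fp16/bf16

def kv_cache_bytes(num_kv_heads):
    """Keys plus values cached for every token in every attention layer."""
    return 2 * context_len * attn_layers * num_kv_heads * head_dim * bytes_per_value

full_mha = kv_cache_bytes(64)   # if every one of the 64 query heads had its own KV head
gqa = kv_cache_bytes(8)         # Jamba 1.5's 8 shared key-value heads
print(f"per-query-head KV cache: {full_mha / 2**30:.1f} GiB")   # ~70 GiB
print(f"8-KV-head cache:         {gqa / 2**30:.1f} GiB")        # ~8.8 GiB, an 8x reduction
```

The Mamba layers need no KV cache at all, so a stack with only one attention layer per block keeps long-context memory manageable even before quantization.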
To ensure consistent performance across extremely deep architectures and long sequences, Jamba 1.5 incorporates auxiliary loss functions that help stabilize activation magnitudes. When combining Mamba and Transformer layers, variations in information flow through the network can lead to unstable gradients or vanishing activations.
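The exact formulation is not spelled out here, but one common stabilization recipe is an auxiliary penalty on the mean squared activation magnitude. The sketch below illustrates that idea under assumed values and should not be read as AI21's training loss.

```python
import numpy as np

# Sketch of an auxiliary penalty on activation magnitude (a common stabilization recipe;
# not necessarily the exact loss used for Jamba 1.5).
AUX_WEIGHT = 1e-5   # assumed small coefficient so the penalty only nudges training

def activation_aux_loss(hidden_states):
    """Mean squared activation across layer outputs, scaled by a small coefficient."""
    return AUX_WEIGHT * np.mean([np.mean(h ** 2) for h in hidden_states])

rng = np.random.default_rng(0)
layer_outputs = [rng.normal(scale=s, size=(4, 1024)) for s in (1.0, 5.0, 50.0)]
task_loss = 2.3                                        # stand-in for the main cross-entropy loss
total_loss = task_loss + activation_aux_loss(layer_outputs)
print(total_loss)   # barely above 2.3 here, but the penalty grows as activations blow up
```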
The hybrid architecture of Jamba 1.5 addresses some of the biggest limitations of earlier LLMs: restricted context windows, memory-hungry KV caching on long inputs, high inference latency, and the cost of activating every parameter for every token.
This combination of advantages makes the model especially suitable for high-performance NLP tasks across domains like healthcare, legal, academic research, and customer service automation.
One major concern with modern LLMs is whether they can run efficiently on real-world hardware. Many models demand multiple GPUs and expensive infrastructure. Jamba 1.5 is designed to be more accessible: its Mamba-heavy layer mix and ExpertsInt8 quantization keep memory requirements well below those of comparably sized attention-only models, even at long context lengths.
This makes Jamba 1.5 a great option for startups, independent developers, and small businesses looking to utilize powerful AI without incurring huge infrastructure costs.
AI21 Labs has released two publicly accessible versions of Jamba 1.5: Jamba 1.5 Mini and Jamba 1.5 Large.
Both models are instruction-tuned and multilingual, supporting nine languages: English, Portuguese, Hebrew, German, Italian, Dutch, Spanish, Arabic, and French.
Developers and researchers can access Jamba 1.5 via AI21 Studio (AI21's hosted API) and the model checkpoints published on the Hugging Face Hub.
Jamba 1.5 can also be integrated into applications using Python with simple API calls, enabling usage in platforms like chatbots, text analytics tools, and content generation services.
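As a minimal sketch of the Hugging Face route, the snippet below loads the published Jamba 1.5 Mini checkpoint with the transformers library. The repository name ai21labs/AI21-Jamba-1.5-Mini and settings follow the public release; consult the model card for exact hardware and version requirements (the full-precision weights are large, so quantized or hosted-API access may be more practical).

```python
# Minimal sketch: loading Jamba 1.5 Mini with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/AI21-Jamba-1.5-Mini"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Summarize the key ideas of hybrid SSM-attention models."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```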
| Feature | Traditional Transformers | Jamba 1.5 Hybrid Model |
|---|---|---|
| Architecture | Attention-only | Mamba + Transformer |
| Context Length | Typically 2K–32K tokens | Up to 256K tokens |
| Memory Usage | High | Lower with Mamba and Int8 |
| Latency | Moderate to High | Lower (fewer attention layers) |
| Specialized Computation (MoE) | No | Yes (dynamic routing) |
| Quantization | Optional (often FP16) | Built-in ExpertsInt8 |
Jamba 1.5 represents a significant leap forward in large language model architecture. By merging the Transformer’s powerful attention mechanism with the Mamba model’s ability to handle long sequences efficiently, AI21 Labs has created a model that sets a new benchmark in open-source LLMs. Its hybrid structure is more than just a technical achievement—it’s a solution to real-world challenges in scaling language models. With 256K context support, modular MoE components, and efficient quantization, Jamba 1.5 is optimized for both performance and practicality.