Mamba-3 SSM Drops With Inference-First Design Beating Transformers at Decode

James Ding Mar 17, 2026 17:48

Together.ai releases Mamba-3, an open-source state space model built for inference that outperforms Mamba-2 and matches Transformer decode speeds at 16K sequences.

Together.ai has released Mamba-3, a state space model architecture designed from the ground up for inference workloads rather than training efficiency. The open-source release marks a philosophical shift in how linear architectures are built, arriving as agentic AI workflows have pushed inference demand to unprecedented levels.

At a sequence length of 16,384, Mamba-3's SISO variant clocks prefill plus decode at 140.61 seconds, versus 149.02 seconds for Mamba-2 and a staggering 976.50 seconds for Llama-3.2-1B running on vLLM. That's nearly 7x faster than the Transformer baseline on the same H100 GPU hardware.

Why Inference Matters Now

The timing isn't accidental. While Mamba-2 bet big on training speed back in mid-2024—delivering 2-8x faster training than its predecessor—the landscape has shifted dramatically. Reinforcement learning with verifiable rewards for coding and math requires massive rollout generation. Tools like Codex, Claude Code, and OpenClaw have made inference the bottleneck, not pretraining.

Previous linear architectures simplified their underlying mechanisms to accelerate training, leaving the inference step "too simple" and memory-bound. GPUs weren't computing—they were mostly shuffling data around.
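
To make that concrete, here is a rough roofline-style sketch in Python. Every number in it is an illustrative assumption rather than a measured Mamba figure; the point is only that a single decode step moves far more bytes than it computes.

```python
# Back-of-envelope roofline check for one decode step of a recurrent layer.
# Every size here is an illustrative assumption, not Mamba-3's configuration.
d_model, d_state = 2048, 128
bytes_per_elem = 2  # bf16

# Rough per-step traffic: read and write the recurrent state, and stream the
# relevant projection weights once.
state_bytes = 2 * d_model * d_state * bytes_per_elem
weight_bytes = 3 * d_model * d_state * bytes_per_elem
flops = 6 * d_model * d_state  # state update plus readout, roughly

intensity = flops / (state_bytes + weight_bytes)
print(f"arithmetic intensity ~ {intensity:.2f} FLOP/byte")  # ~0.6

# An H100 SXM pairs roughly 1,000 bf16 TFLOP/s with ~3.35 TB/s of HBM bandwidth,
# so it needs on the order of 300 FLOPs per byte to stay compute-bound.
# At well under 1 FLOP/byte, the step spends its time moving data, not computing.
```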

Three Core Improvements

Mamba-3 addresses this through changes rooted in classical control theory rather than trendy deep learning interpretations:

Exponential-trapezoidal discretization creates a more expressive recurrence. It also lets Mamba-3 drop the short causal convolution that Mamba-1 and Mamba-2 relied on, a component that had become standard across linear models since H3 and RWKV-4 popularized it.
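
Mamba-3's exact exponential-trapezoidal rule is laid out in the paper; the classical trapezoidal (bilinear) discretization below is textbook numerical analysis, shown here only to convey the idea. Note how the trapezoidal update already mixes two consecutive inputs, which is one plausible reading of why a separate short convolution becomes unnecessary.

```latex
% Continuous-time SSM with state h(t), input x(t), step size \Delta:
\dot h(t) = A\,h(t) + B\,x(t)

% Euler-style step: the recurrence sees only the current input.
h_t = (I + \Delta A)\,h_{t-1} + \Delta B\,x_t

% Trapezoidal (bilinear) step: the input term averages x_{t-1} and x_t.
h_t = \Big(I - \tfrac{\Delta}{2}A\Big)^{-1}
      \Big[\Big(I + \tfrac{\Delta}{2}A\Big)\,h_{t-1}
      + \tfrac{\Delta}{2}\,B\,(x_{t-1} + x_t)\Big]
```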

Complex-valued SSM systems expand state-tracking capabilities. The model can now handle synthetic tasks like parity and arithmetic reasoning that Mamba-2 couldn't reliably solve.
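
There is a standard way to see why complex poles matter for state tracking, sketched below for a scalar state. This is the textbook argument, not the paper's exact construction.

```latex
% A real, stable recurrence can only shrink or preserve the state:
h_t = a\,h_{t-1}, \qquad 0 < a \le 1

% A unit-modulus complex pole rotates it instead:
h_t = e^{i\theta_t}\,h_{t-1}

% For parity over a bit stream x_t \in \{0,1\}, choose \theta_t = \pi x_t:
h_T = e^{\,i\pi \sum_{t \le T} x_t}\,h_0
% h_T equals +h_0 after an even number of 1s and -h_0 after an odd number,
% so a single complex (i.e. 2-D rotational) state tracks parity exactly.
```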

Multi-input, multi-output (MIMO) architecture runs multiple SSMs in parallel. The MIMO variant boosts downstream accuracy by over 1 percentage point at 1B scale compared to standard Mamba-3. The cost is longer training; decode latency, however, stays flat.

That last point deserves emphasis. Training is compute-bound; inference is memory-bound. Adding FLOPs per timestep barely touches inference latency because idle GPU cores simply pick up the work.
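
A minimal PyTorch sketch of that trade-off, with shapes invented for illustration (d_state, d_channels, and rank are assumptions, not the paper's settings): both steps keep the same amount of state, but the MIMO-style update spends roughly `rank` times more multiply-accumulates per token, which costs training compute while leaving a memory-bound decode path essentially untouched.

```python
import torch

# Illustrative shapes only; see the paper for Mamba-3's actual parameterization.
d_state, d_channels, rank = 128, 64, 8

# SISO-style step: a rank-1 (outer-product) update of the state.
h = torch.zeros(d_channels, d_state)
a = torch.sigmoid(torch.randn(d_channels, 1))   # per-channel decay in (0, 1)
x = torch.randn(d_channels)                     # current token, one value per channel
B = torch.randn(d_state)
C = torch.randn(d_state)
h = a * h + x[:, None] * B[None, :]             # ~d_channels * d_state MACs
y_siso = h @ C                                  # readout, shape (d_channels,)

# MIMO-style step: the same-sized state is driven by a matrix-matrix product,
# multiplying the per-step FLOPs without growing the state.
U = torch.randn(rank, d_state)                  # projected inputs for this step
B_m = torch.randn(d_channels, rank)
h = a * h + B_m @ U                             # ~d_channels * rank * d_state MACs
y_mimo = h @ C                                  # same readout cost, same state size
```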

Benchmark Results

On downstream language modeling evaluations, Mamba-3 outperforms both Mamba-2 and Gated DeltaNet across pretrained model scales. The SISO variant matches Mamba-2's architecture shapes exactly while delivering better accuracy. MIMO pushes further ahead.

Retrieval tasks tell a more nuanced story. Pure linear models naturally underperform Transformers here—that fixed-size state can't match an ever-growing KV cache for exact recall. But Mamba-3 holds its own among sub-quadratic alternatives, and MIMO improves retrieval without increasing state size.
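
That trade shows up directly in memory. The back-of-envelope comparison below uses hypothetical 1B-scale dimensions (full multi-head KV, no grouped-query attention), purely for illustration: the Transformer's cache grows with the context, while the SSM state does not, which is both its efficiency advantage and the reason exact recall is harder.

```python
# Illustrative memory comparison at a 16K context in bf16.
# All dimensions are assumptions for a ~1B-parameter model, not paper values.
layers, heads, head_dim, seq_len = 16, 32, 64, 16_384
d_model, d_state = 2048, 128
bytes_per_elem = 2  # bf16

# Transformer KV cache: keys + values for every past token, every layer.
kv_cache_bytes = layers * seq_len * heads * head_dim * 2 * bytes_per_elem

# SSM recurrent state: a fixed-size tensor per layer, independent of context length.
ssm_state_bytes = layers * d_model * d_state * bytes_per_elem

print(f"KV cache : {kv_cache_bytes / 1e9:.2f} GB")  # ~2.15 GB, grows linearly with seq_len
print(f"SSM state: {ssm_state_bytes / 1e6:.1f} MB") # ~8.4 MB, constant in seq_len
```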

The team predicts hybrid models combining linear layers with global self-attention will dominate language modeling going forward. Their experiments show this combination beats vanilla Transformers on retrieval while maintaining efficiency gains.

Open Source From Day One

Kernels are available at the mamba-ssm repository, built across Triton, TileLang, and CuTe DSL depending on the operation. The stack reflects pragmatic engineering: Triton for standard architecture development, TileLang for fine-grained memory control on MIMO prefill, and CuTe DSL for maximizing Hopper GPU performance during decode.

NVIDIA's recent Nemotron 3 Super release, which uses Mamba-2 layers in a hybrid configuration, suggests enterprise interest in SSM architectures is growing. Mamba-3's inference-first approach could accelerate adoption in production environments where token generation speed directly impacts costs and user experience.

The full paper is available on arXiv, with a second blog post covering the mathematical foundations of the three core improvements expected to follow.

  • mamba-3
  • state space models
  • ai inference
  • together.ai
  • open source