Together AI's kernel research team delivers major GPU optimization breakthroughs, cutting inference latency from 281ms to 77ms for enterprise AI deployments.

Together AI Kernels Team Achieves 3.6x Performance Gains on NVIDIA Hardware

2026/04/02 03:17

Timothy Morano Apr 01, 2026 19:17

The team behind FlashAttention has quietly become one of the most consequential groups in AI infrastructure. Together AI's kernel research unit, now about 15 engineers strong, is solving a problem most people don't even know exists: the massive performance gap between AI models and the hardware running them.

Their latest win? Cutting a voice AI company's time to first 64 tokens from 281ms to 77ms, a roughly 3.6x improvement that translated to 7.2x better unit economics.

The Hidden Bottleneck

Here's what most AI discourse misses: having great models and expensive GPUs doesn't guarantee performance. The bottleneck sits in between—the kernel layer that translates mathematical operations into actual silicon instructions.

"The gap between what researchers design and what actually runs fast on hardware is vast," explains Dan Fu, who leads a parallel research lab at UCSD. Get kernels right and you unlock hardware's full potential. Get them wrong and your expensive GPUs sit partially idle.

For companies building AI-native products, this isn't academic. When inference costs run 2x higher than necessary, or when latency breaks the user experience, kernel optimization becomes existential.

One Week Versus One Year

The team's capabilities showed clearly when NVIDIA's Blackwell GPUs arrived in March 2025. NVIDIA had spent a year with dozens of engineers optimizing kernels for the new architecture. Together AI had a week.

Their secret weapon: ThunderKittens, a library developed with Stanford researchers that reduces kernel code from 1,000+ lines of CUDA to roughly 100-200 lines. The abstraction layer is built around NVIDIA's tensor cores, the specialized matrix multiplication units on modern GPUs.
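
To make the tile idea concrete, here is a minimal sketch in plain Python. It is not the ThunderKittens API and not GPU code; the function names and the tile size are hypothetical, chosen only to show how a matrix product breaks into small fixed-size blocks, the shape of work a tensor core consumes.

```python
def matmul_naive(a, b):
    """Reference product of two row-major matrices (lists of lists)."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(a, b, tile=2):
    """Same product, accumulated one tile-by-tile block at a time.

    Each (i0, j0) output block stands in for the work one group of GPU
    threads would own. The p0 loop streams matching tiles of A and B
    through that block and adds their partial products.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Multiply one tile of A by one tile of B.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = 0.0
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] += acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_tiled(a, b, tile=2) == matmul_naive(a, b))
```

On real hardware the innermost tile product would be a single tensor-core instruction, and a library in this style would presumably hide the loop bookkeeping behind tile-level operations, which is where the reduction in line count would come from.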

Within seven days of hardware access, the team had some of the fastest FP4 and FP8 GEMM kernels available for Blackwell, with reported speedups of up to 2x relative to cuBLAS on H100s.

Real-World Impact

The voice AI case study illustrates what this means in production. The customer had a hard constraint: time-to-first-64-tokens above roughly 100ms breaks conversational flow. Their B200 deployment was hitting 281ms.

Together's team hand-optimized a "Megakernel" implementation, which runs an entire model in a single kernel and targets the HBM bandwidth ceiling of NVIDIA H100s. On Llama-3.2-1B, latency fell to 77ms. On Qwen 2.5 1.5B, it fell from 292ms to 127ms.
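
The arithmetic behind these figures, and what a bandwidth ceiling implies, can be sketched in a few lines of Python. The latency numbers come from the article. The model size (about 1.23 billion parameters), the 2 bytes per weight, and the 3.35 TB/s peak HBM bandwidth are assumptions added here for illustration, and the estimate ignores KV-cache traffic, activations, and prefill, so treat it as a rough floor rather than a measurement.

```python
def speedup(before_ms, after_ms):
    """Ratio of old latency to new latency."""
    return before_ms / after_ms


def bandwidth_floor_ms(params, bytes_per_param, hbm_bytes_per_s, tokens):
    """Lower bound on decode time if each token re-reads every weight once.

    Small-batch decoding is typically memory-bound: generating one token
    requires streaming the full set of weights from HBM, so latency cannot
    drop below (bytes moved) / (memory bandwidth).
    """
    bytes_per_token = params * bytes_per_param
    return tokens * bytes_per_token / hbm_bytes_per_s * 1e3


# Latencies reported in the article.
print(round(speedup(281, 77), 2))    # Llama-3.2-1B
print(round(speedup(292, 127), 2))   # Qwen 2.5 1.5B

# Illustrative assumptions, not figures from the article.
ASSUMED_PARAMS = 1.23e9          # approximate Llama-3.2-1B parameter count
ASSUMED_BYTES_PER_PARAM = 2      # 16-bit weights
ASSUMED_HBM_BYTES_PER_S = 3.35e12  # assumed H100-class peak HBM bandwidth

floor = bandwidth_floor_ms(ASSUMED_PARAMS, ASSUMED_BYTES_PER_PARAM,
                           ASSUMED_HBM_BYTES_PER_S, tokens=64)
print(round(floor, 1))
```

Under those assumptions the floor for 64 tokens comes out near 47ms, which suggests why a single fused kernel that keeps memory traffic close to the hardware limit can land at 77ms while a less tightly fused deployment sits several times higher.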

The approach traces back to FlashAttention's original insight. That Memorial Day 2022 paper proved the AI establishment wrong about attention being fully optimized. By applying database systems principles—data locality, memory hierarchies—to transformer attention, the team achieved 2-3x speedups where previous sparsity methods showed only 10% real gains.
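
The data-locality idea can be illustrated with a small Python sketch of blockwise attention for a single query. This is not the FlashAttention kernel itself, which also tiles queries and keeps its working set in on-chip memory; the function names and block size are hypothetical. What it shows is the running-softmax trick that lets keys and values be visited one block at a time without ever storing the full row of scores.

```python
import math


def attention_naive(q, keys, values):
    """softmax(q . K^T) V for one query, materializing every score first."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_blockwise(q, keys, values, block=2):
    """Same result, reading K and V block by block with a running softmax."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first block this factor is exp(-inf) = 0.0.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [x * scale for x in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [x / running_sum for x in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_naive(q, keys, values))
print(attention_blockwise(q, keys, values, block=2))
```

Both functions return the same output up to floating-point rounding. The blockwise version only ever holds one block of scores at a time, which is the property that lets a GPU kernel keep its working set in fast on-chip memory instead of making repeated round trips to HBM.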

Academic-Industry Pipeline

The team operates through an unusual model. Dan Fu runs his UCSD lab on higher-risk fundamental research. Together AI co-founder Tri Dao is at Princeton. Simran Arora is at Caltech. Ideas get de-risked in academia, then productionized at Together AI. PhD students join the company. Interns work on longer-term research in academic labs.

This produces engineers who bridge theory and production—people who, as Fu puts it, "lose sleep over memory access patterns" and "find beauty in data flow diagrams."

The work isn't glamorous. No announcements when a kernel optimization lands. Just faster training times, lower costs, higher throughput. But these margins determine whether AI-native products feel instant or sluggish, whether unit economics work or don't, whether companies scale to millions of users or plateau at thousands.

For enterprise AI deployments where every millisecond matters—and every percentage point of efficiency translates to significant cost savings—this invisible infrastructure layer may be where the real competitive advantage lies.
