NVIDIA Run:ai GPU Fractioning Delivers 77% Throughput at Half Allocation

Darius Baruo Feb 18, 2026 18:31

NVIDIA and Nebius benchmarks show GPU fractioning achieves 86% user capacity on 0.5 GPU allocation, enabling 3x more concurrent users for mixed AI workloads.

NVIDIA's Run:ai platform can deliver 77% of full GPU throughput using just half the hardware allocation, according to joint benchmarks with cloud provider Nebius released February 18. The results demonstrate that enterprises running large language model inference can dramatically expand capacity without proportional GPU investment.

The tests, conducted on clusters with 64 NVIDIA H100 NVL GPUs and 32 NVIDIA HGX B200 GPUs, showed fractional GPU scheduling achieving near-linear performance scaling across 0.5, 0.25, and 0.125 allocations.

Hard Numbers from Production Testing

At 0.5 GPU allocation, the system supported 8,768 concurrent users while maintaining time-to-first-token under one second—86% of the 10,200 users supported at full allocation. Token generation hit 152,694 tokens per second, compared to 198,680 at full capacity.
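
These headline percentages follow directly from the reported figures. As a quick back-of-the-envelope check (the comparison against a strictly linear split is illustrative, not part of the published report):

```python
# Back-of-the-envelope check of the reported half-allocation results
# (figures taken from the benchmark numbers quoted above).

full_users, half_users = 10_200, 8_768    # concurrent users
full_tps, half_tps = 198_680, 152_694     # tokens per second

print(f"User capacity retained: {half_users / full_users:.0%}")   # ~86%
print(f"Throughput retained:    {half_tps / full_tps:.0%}")       # ~77%

# A strictly linear split would give only 50% on a 0.5 allocation,
# so per allocated GPU the fractional deployment is well ahead of linear:
print(f"Throughput per allocated GPU vs. full: {(half_tps / full_tps) / 0.5:.2f}x")
```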

Smaller models pushed these gains further. Phi-4-Mini running on 0.25 GPU fractions handled 72% more concurrent users than full-GPU deployment, achieving approximately 450,000 tokens per second with P95 latency under 300 milliseconds on 32 GPUs.

The mixed workload scenario proved most striking. Running Llama 3.1 8B, Phi-4-Mini, and Qwen-Embeddings simultaneously on fractional allocations tripled the total number of concurrent users compared to single-model deployment. Combined throughput exceeded 350,000 tokens per second at full scale with no cross-model interference.
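
The report does not detail how the three models were packed onto each GPU. The sketch below shows one plausible fractional layout purely for illustration; the fractions and the per-GPU mix are assumptions, not the benchmark's actual configuration.

```python
# Illustrative only: one way three models could share a single GPU via
# fractional allocation. The fractions below are assumptions for the sketch,
# not the layout used in the NVIDIA/Nebius benchmark.

gpu_memory_gb = 80
placements = {
    "llama-3.1-8b":    0.50,   # chat/completions
    "phi-4-mini":      0.25,   # lightweight generation
    "qwen-embeddings": 0.25,   # embedding requests
}

assert sum(placements.values()) <= 1.0, "fractions on one GPU must not exceed 1.0"

for model, fraction in placements.items():
    print(f"{model:<18} gets ~{fraction * gpu_memory_gb:.0f} GB "
          f"({fraction:.0%} of the GPU)")
```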

Why This Matters for GPU Economics

Traditional Kubernetes schedulers allocate whole GPUs to individual models, leaving substantial capacity stranded. The benchmarks noted that even Qwen3-14B, the largest model tested at 14 billion parameters, occupies only 35% of an H100 NVL's 80GB capacity.
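
That 35% figure lines up with a simple weight-memory estimate, assuming 16-bit weights; the precision is not stated in the report, so treat this as a rough check rather than the benchmark's exact accounting.

```python
# Rough weight-memory estimate for a 14B-parameter model, assuming
# 2 bytes per parameter (FP16/BF16); the precision is an assumption here.

params = 14e9
bytes_per_param = 2
gpu_memory_gb = 80            # capacity figure quoted in the article

weights_gb = params * bytes_per_param / 1e9
print(f"Approx. weight footprint: {weights_gb:.0f} GB "
      f"({weights_gb / gpu_memory_gb:.0%} of {gpu_memory_gb} GB)")   # ~28 GB, ~35%
```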

Run:ai's scheduler eliminates this waste through dynamic memory allocation. Users specify requirements directly; the system handles resource distribution without preconfiguration. Memory isolation happens at runtime while compute cycles distribute fairly among active processes.
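
In practice, that means a workload requests a fraction rather than a whole device. The sketch below mirrors a Kubernetes pod spec as a Python dict to show the idea; the annotation key and image name are placeholders for illustration, not the exact Run:ai v2.24 field names, which should be taken from the product documentation.

```python
# Conceptual sketch of a fractional GPU request, expressed as a Kubernetes
# pod spec in Python dict form. The "gpu-fraction" annotation key and the
# image are placeholders; consult the Run:ai docs for the exact fields.

import json

pod_spec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "llama-31-8b-inference",
        "annotations": {
            "gpu-fraction": "0.5",   # ask for half a GPU instead of a whole one
        },
    },
    "spec": {
        "containers": [{
            "name": "inference-server",
            "image": "example.com/llm-inference:latest",   # placeholder image
        }],
    },
}

print(json.dumps(pod_spec, indent=2))
```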

This timing coincides with broader industry moves toward GPU partitioning. SoftBank and AMD announced validation testing on February 16 for similar fractioning capabilities on AMD Instinct GPUs, where single GPUs can split into up to eight logical devices.

Autoscaling Without Latency Spikes

Nebius tested automatic scaling with Llama 3.1 8B configured to add GPUs when concurrent users exceeded 50. Replicas scaled from 1 to 16 with clean ramp-up, stable utilization during pod warm-up, and negligible HTTP errors.
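
The scaling rule described, one extra replica once concurrency passes 50 users up to a ceiling of 16 replicas, is simple enough to express directly. The function below is a plain illustration of that policy, not the Run:ai autoscaler's implementation.

```python
# Plain illustration of the concurrency-based scaling rule described above:
# roughly one replica per 50 concurrent users, clamped to the 1-16 range.

import math

def desired_replicas(concurrent_users: int,
                     users_per_replica: int = 50,
                     min_replicas: int = 1,
                     max_replicas: int = 16) -> int:
    wanted = math.ceil(concurrent_users / users_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

for users in (10, 49, 51, 400, 1_000):
    print(f"{users:>5} users -> {desired_replicas(users):>2} replicas")
```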

The practical implication: enterprises can run multiple inference models on existing GPU inventory, scale dynamically during peak demand, and reclaim idle capacity during off-hours for other workloads. For organizations facing fixed GPU budgets, fractioning transforms capacity planning from hardware procurement into software configuration.

Run:ai v2.24 is available now. NVIDIA plans to discuss the Nebius implementation at GTC 2026.

Image source: Shutterstock
  • nvidia
  • gpu
  • ai infrastructure
  • llm inference
  • run:ai