
NVIDIA NVL72: Revolutionizing MoE Model Scaling with Expert Parallelism



Joerg Hiller
Oct 20, 2025 15:21

NVIDIA’s NVL72 systems are transforming large-scale MoE model deployment by introducing Wide Expert Parallelism, which optimizes performance and reduces costs.

NVIDIA is advancing the deployment of large-scale Mixture of Experts (MoE) models with its NVL72 rack-scale systems, leveraging Wide Expert Parallelism (Wide-EP) to optimize performance and reduce costs, according to NVIDIA’s blog. This approach addresses the challenges of scaling MoE architectures, which achieve greater efficiency than dense models by activating only a subset of their trained parameters for each token.
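To make that sparsity concrete, here is a minimal Python sketch of top-k expert routing: a router scores each token, and only the selected experts' weights participate in computing that token's output. The dimensions, expert count, and layer shapes are illustrative assumptions, not those of DeepSeek-R1 or any other specific model.

```python
import numpy as np

# Minimal top-k MoE routing sketch (illustrative only; the sizes below
# are arbitrary and not tied to any specific model).
rng = np.random.default_rng(0)

d_model, num_experts, top_k = 64, 8, 2
tokens = rng.standard_normal((16, d_model))              # a small batch of token embeddings
router_w = rng.standard_normal((d_model, num_experts))   # router projection
expert_w = rng.standard_normal((num_experts, d_model, d_model))

logits = tokens @ router_w                               # router scores per token
top = np.argsort(logits, axis=1)[:, -top_k:]             # top-k expert indices per token
weights = np.exp(np.take_along_axis(logits, top, axis=1))
weights /= weights.sum(axis=1, keepdims=True)            # normalize gates over the selected experts

out = np.zeros_like(tokens)
for e in range(num_experts):
    # Only tokens routed to expert e touch its weights; every other
    # expert's parameters stay idle for those tokens.
    mask = (top == e)
    rows = mask.any(axis=1)
    if rows.any():
        gate = (weights * mask)[rows].sum(axis=1, keepdims=True)
        out[rows] += gate * (tokens[rows] @ expert_w[e])

print(out.shape)  # (16, 64): each token used only 2 of the 8 experts
```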

Expert Parallelism and Its Impact

Expert Parallelism (EP) strategically distributes MoE model experts across multiple GPUs, enhancing computation and memory bandwidth utilization. As models like DeepSeek-R1 expand to hundreds of billions of parameters, EP becomes crucial for maintaining high performance and reducing memory pressure.
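One way to picture EP is as a static mapping from experts to GPU ranks plus a per-batch dispatch table that tells each rank which tokens it must receive. The sketch below uses a hypothetical round-robin placement and hand-picked routing decisions; it is not how TensorRT-LLM assigns experts internally.

```python
from collections import Counter

# Hypothetical expert-parallel placement: 256 routed experts spread
# round-robin across 64 GPU ranks. Sizes are illustrative only.
num_experts, ep_size = 256, 64
expert_to_rank = {e: e % ep_size for e in range(num_experts)}

# Pretend the router picked these experts for a handful of tokens.
routed = list(enumerate([3, 67, 3, 130, 200, 67, 5, 131]))  # (token_id, expert_id)

# Build the per-rank send lists an all-to-all dispatch would use.
send_lists = {}
for tok, exp in routed:
    send_lists.setdefault(expert_to_rank[exp], []).append(tok)

per_rank_load = Counter({rank: len(toks) for rank, toks in send_lists.items()})
print(expert_to_rank[67], per_rank_load)   # expert 67 lives on rank 3; rank loads are uneven
```

Even in this toy batch, most tokens land on a single rank, which foreshadows the load-balancing problem discussed later in the article.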

Large-scale EP, which spreads experts across many GPUs, increases the aggregate memory bandwidth available to the model and supports larger batch sizes, improving GPU utilization. However, it also introduces new system-level constraints, which NVIDIA’s TensorRT-LLM Wide-EP addresses through algorithmic optimizations targeting compute and memory bottlenecks.

System Design and Architecture

How effectively EP scales depends heavily on system design and architecture, particularly interconnect bandwidth and topology, which determine how efficiently data moves between GPUs. NVIDIA’s NVL72 systems use optimized software and kernels to manage expert-to-expert traffic, making large-scale EP deployment practical and efficient.

Addressing Communication Overhead

Communication overhead is a significant challenge in large-scale EP, particularly during the inference decode phase when distributed experts must exchange information. NVIDIA’s NVLink technology, with its 130 TB/s aggregate bandwidth, plays a crucial role in mitigating these overheads, making large-scale EP feasible.
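A rough back-of-envelope calculation shows why that aggregate bandwidth matters during decode. Every figure below except the 130 TB/s NVLink number is an assumption chosen purely to make the arithmetic concrete.

```python
# Back-of-envelope estimate of decode-phase expert-dispatch traffic.
# hidden_size, precision, top_k, and token count are illustrative
# assumptions, not measured or published values.
hidden_size      = 7168        # assumed per-token activation width
bytes_per_elem   = 1           # assumed FP8 activations
top_k            = 8           # assumed experts consulted per token
tokens_in_flight = 32_768      # assumed concurrent tokens per decode step

# Each token's activation is sent to, and returned from, its top_k expert-owning GPUs.
bytes_per_step = tokens_in_flight * top_k * hidden_size * bytes_per_elem * 2
nvlink_agg_bw  = 130e12        # 130 TB/s aggregate NVLink bandwidth in an NVL72 rack

print(f"{bytes_per_step / 1e9:.1f} GB moved per decode step")
print(f"{bytes_per_step / nvlink_agg_bw * 1e6:.1f} microseconds at full fabric utilization")
```

Under these assumptions a single decode step shuffles a few gigabytes between experts; on a fabric with far less aggregate bandwidth, that exchange alone would dominate the step time.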

Kernel Optimization and Load Balancing

To optimize expert routing, custom communication kernels are implemented to handle the dynamically varying data sizes that expert dispatch produces. NVIDIA’s Expert Parallel Load Balancer (EPLB) further improves load balancing by redistributing experts to prevent over- or under-utilization of individual GPUs, which is crucial for maintaining efficiency in real-time production systems.
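The rebalancing idea can be sketched as a greedy placement pass over observed per-expert load: hot experts are assigned first, each to the currently least-loaded rank. This is a deliberately simplified illustration of the principle, not NVIDIA’s actual EPLB implementation, which works from live routing statistics in production.

```python
import heapq

def rebalance(expert_load, num_ranks):
    """Greedy placement sketch: assign the hottest experts first, each to the
    currently least-loaded rank. Illustrative only; not the EPLB algorithm."""
    ranks = [(0.0, r, []) for r in range(num_ranks)]       # (total_load, rank_id, experts)
    heapq.heapify(ranks)
    for expert, load in sorted(expert_load.items(), key=lambda kv: -kv[1]):
        total, r, members = heapq.heappop(ranks)           # least-loaded rank so far
        members.append(expert)
        heapq.heappush(ranks, (total + load, r, members))
    return sorted(ranks, key=lambda t: t[1])

# Hypothetical per-expert token counts observed over a recent window.
load = {0: 900, 1: 120, 2: 850, 3: 200, 4: 90, 5: 780, 6: 150, 7: 60}
for total, rank, experts in rebalance(load, num_ranks=4):
    print(f"rank {rank}: experts {experts} (load {total})")
```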

Implications for AI Inference

Wide-EP on NVIDIA’s NVL72 systems provides a scalable solution for MoE models, reducing weight-loading pressure and improving GroupGEMM efficiency. In testing, large EP configurations demonstrated up to 1.8x higher per-GPU throughput compared to smaller setups, highlighting the potential for significant performance gains.
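The weight-loading relief is easy to see with a small calculation: the wider the expert-parallel group, the smaller the slice of expert weights each GPU must hold and stream. The model size and precision below are assumptions for illustration, not published figures.

```python
# Rough illustration of how wider EP reduces per-GPU weight-loading pressure.
# The parameter count and precision are assumptions, not published figures.
total_expert_params = 600e9        # assumed routed-expert parameters in a large MoE
bytes_per_param     = 1            # assumed FP8 weights

for ep_size in (8, 32, 72):
    per_gpu_gb = total_expert_params * bytes_per_param / ep_size / 1e9
    print(f"EP={ep_size:3d}: ~{per_gpu_gb:6.1f} GB of expert weights per GPU")
```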

The advancements in Wide-EP not only improve throughput and latency but also strengthen system economics by increasing concurrency and GPU efficiency. This positions NVIDIA’s NVL72 as a pivotal platform for the cost-effective deployment of trillion-parameter models, offering developers, researchers, and infrastructure teams new opportunities to optimize AI workloads.

Image source: Shutterstock

Source: https://blockchain.news/news/nvidia-nvl72-revolutionizing-moe-model-scaling

