NVIDIA Brings CUDA Tile Programming to Julia with cuTile.jl Release

2026/03/04 04:24

James Ding Mar 03, 2026 20:24

NVIDIA releases cuTile.jl, enabling Julia developers to write high-performance GPU kernels using tile-based programming with near-parity Python performance.

NVIDIA has extended its tile-based GPU programming model to Julia developers with the release of cuTile.jl, an open-source package that matches or comes within a few percent of its Python counterpart on most of the benchmarks reported so far.

The package, developed in collaboration with JuliaGPU, represents the latest expansion of CUDA Tile, which NVIDIA has called the most significant addition to CUDA programming since the platform launched in 2006. Python developers gained access to the tile-based model first, in late 2025; Julia's scientific computing community can now tap into the same automatic hardware optimization.

Why Tile-Based Programming Matters

Traditional CUDA development forces programmers to manually manage threads, warps, and memory hierarchies. Tile-based programming flips this: developers describe operations on chunks of data, and the compiler handles hardware mapping automatically. This includes automatic access to Tensor Cores and Tensor Memory Accelerators—specialized hardware that previously required expert-level optimization.

The practical difference shows up in code complexity. A vector addition kernel in traditional CUDA.jl requires explicit thread indexing, bounds checking, and block configuration. The cuTile.jl equivalent reads more like standard array operations, with the compiler handling the low-level details.
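
To make the contrast concrete, here is a conceptual sketch in plain Python that runs on the CPU. It is not CUDA.jl or cuTile.jl code, and the names `vadd_per_thread` and `vadd_per_tile` are hypothetical helpers invented for illustration. The first function mimics the thread-indexed style (explicit global index, explicit bounds check, explicit block count); the second mimics the tile style, where each block loads a chunk, operates on the whole chunk, and stores it.

```python
# Conceptual CPU-only simulation; not CUDA.jl or cuTile.jl code.
# All names here are hypothetical and chosen for illustration.

def vadd_per_thread(a, b, threads_per_block=4):
    """Thread-style: each simulated thread computes one element."""
    n = len(a)
    out = [0.0] * n
    num_blocks = (n + threads_per_block - 1) // threads_per_block  # ceil division
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            i = block_idx * threads_per_block + thread_idx  # explicit global index
            if i < n:  # explicit bounds check for the last, partial block
                out[i] = a[i] + b[i]
    return out


def vadd_per_tile(a, b, tile_size=4):
    """Tile-style: each simulated block loads a tile, adds it, stores it."""
    n = len(a)
    out = [0.0] * n
    for start in range(0, n, tile_size):
        tile_a = a[start:start + tile_size]  # "load" a tile (slicing clips at n)
        tile_b = b[start:start + tile_size]
        tile_c = [x + y for x, y in zip(tile_a, tile_b)]  # whole-tile operation
        out[start:start + len(tile_c)] = tile_c  # "store" the tile
    return out


a = [float(i) for i in range(10)]
b = [10.0 * i for i in range(10)]
assert vadd_per_thread(a, b) == vadd_per_tile(a, b)
```

Both functions produce the same result; the difference is where the indexing and bounds logic lives. In the tile version it is hidden inside the load and store steps, which is roughly the part the article says the compiler takes over.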

Benchmark Results on Blackwell Hardware

Testing on an NVIDIA GeForce RTX 5080 (Blackwell architecture), cuTile.jl matched Python performance across core operations:

Vector addition hit 838 GB/s versus Python's 843 GB/s (99% parity). Matrix multiplication reached 50.9 TFLOPS against Python's 50.5 TFLOPS—actually slightly faster. Matrix transpose achieved 98% parity at 797 GB/s.

Batch matrix multiply showed the largest gap at 91% (43.0 vs 47.5 TFLOPS), while complex control-flow kernels like layer normalization and FFT still need optimization work.
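
The parity percentages can be checked directly from the quoted throughput figures. The short Python snippet below does that arithmetic for the three benchmarks where both numbers are given; matrix transpose is left out because the article does not state the Python figure. The helper name `parity_percent` is just an illustrative choice.

```python
# Throughput figures quoted in the article: (cuTile.jl, cuTile Python).
results = {
    "vector addition (GB/s)": (838.0, 843.0),
    "matrix multiplication (TFLOPS)": (50.9, 50.5),
    "batch matrix multiply (TFLOPS)": (43.0, 47.5),
}


def parity_percent(julia, python):
    """Julia throughput as a percentage of Python throughput."""
    return 100.0 * julia / python


for name, (julia, python) in results.items():
    print(f"{name}: {parity_percent(julia, python):.1f}%")
# prints:
# vector addition (GB/s): 99.4%
# matrix multiplication (TFLOPS): 100.8%
# batch matrix multiply (TFLOPS): 90.5%
```

These agree with the rounded 99% and 91% figures in the text and put the matrix multiplication result a little under 1% ahead of Python.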

Technical Implementation

cuTile.jl uses a custom Julia compiler that intercepts standard library calls—operations like sum, reshape, and basic arithmetic—and routes them to Tile IR operations. This produces the same bytecode format as cuTile Python, feeding into NVIDIA's tileiras compiler for final GPU machine code generation.
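
The general idea of intercepting ordinary operations and turning them into IR can be shown with a toy tracer in Python. This is only an analogy: it uses operator overloading rather than a custom compiler, and `Tracer`, `TracedTile`, the `tile.*` op names, and the textual output format are all hypothetical, not cuTile.jl internals or real Tile IR.

```python
# Toy illustration of "intercept ordinary calls, emit IR ops".
# Not cuTile.jl internals: every class, op name, and the textual
# format below is hypothetical.
import itertools


class Tracer:
    """Collects a linear list of IR-like operations."""

    def __init__(self):
        self.ops = []
        self._ids = itertools.count()

    def emit(self, op, *args):
        name = f"%{next(self._ids)}"
        self.ops.append(f"{name} = {op}({', '.join(args)})")
        return name


class TracedTile:
    """Stand-in value whose operators record ops instead of computing."""

    def __init__(self, tracer, name):
        self.tracer = tracer
        self.name = name

    def _binary(self, op, other):
        return TracedTile(self.tracer, self.tracer.emit(op, self.name, other.name))

    def __add__(self, other):
        return self._binary("tile.add", other)

    def __mul__(self, other):
        return self._binary("tile.mul", other)

    def sum(self):
        return TracedTile(self.tracer, self.tracer.emit("tile.reduce_sum", self.name))


def trace(fn, num_args):
    """Run fn on traced placeholders and return the recorded ops."""
    tracer = Tracer()
    args = [TracedTile(tracer, f"arg{i}") for i in range(num_args)]
    result = fn(*args)
    tracer.ops.append(f"return {result.name}")
    return tracer.ops


def user_kernel(x, y):
    # Reads like ordinary array code; every operation is intercepted.
    return (x * y + x).sum()


ir = trace(user_kernel, 2)
print("\n".join(ir))
# prints:
# %0 = tile.mul(arg0, arg1)
# %1 = tile.add(%0, arg0)
# %2 = tile.reduce_sum(%1)
# return %2
```

The user-facing function contains nothing GPU-specific, yet running it yields a sequence of tile-level operations that a backend could lower further. The article describes cuTile.jl doing the analogous step at the compiler level for Julia's standard library calls.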

The design deliberately mirrors Python's API structure, making documentation and code examples portable between languages. But it embraces Julia conventions where appropriate: 1-based indexing, broadcast syntax with dots (.^, .-, ./), and native integration with CUDA.jl for array management.

Current Limitations

This remains experimental software. Not all cuTile features work yet. Iterator-based for loops either fail or generate inefficient code. APIs may change without warning. The package requires Blackwell GPUs (compute capability 12.0+) and CUDA 13 drivers—hardware that most developers don't have access to yet.

For Julia shops already invested in GPU computing through CUDA.jl, cuTile.jl offers a path toward simpler kernel development as Blackwell hardware becomes available. The source is hosted at github.com/JuliaGPU/cuTile.jl, and the package can be installed through Julia's package manager.
