This article explores how block-based parallelization improves the efficiency of probabilistic circuits by reducing both IO and computation overhead. Starting with fully connected sum layers, it explains how assigning indices, grouping node blocks, and padding with pseudo-nodes enable optimized kernel launches. Using dynamic programming for partitioning ensures minimal overhead while maximizing speed. Results show that larger block sizes cut IO operations dramatically, achieving up to 50x faster performance without significant cost from padded edges.

How Block-Based Parallelization Cuts IO and Computation Overhead

2025/08/25 07:11

Abstract and 1. Introduction

  2. Preliminaries and Related Work

  3. Key Bottlenecks in PC Parallelization

  4. Harnessing Block-Based PC Parallelization

    4.1. Fully Connected Sum Layers

    4.2. Generalizing To Practical Sum Layers

    4.3. Efficient Implementations by Compiling PC Layers

    4.4. Analysis: IO and Computation Overhead

  5. Optimizing Backpropagation with PC Flows

  6. Experiments

    6.1. Faster Models with PyJuice

    6.2. Better PCs At Scale

    6.3. Benchmarking Existing PCs

  7. Conclusion, Acknowledgements, Impact Statement, and References

A. Algorithm Details

B. Additional Technical Details

C. Experimental Details

D. Additional Experiments


4. Harnessing Block-Based PC Parallelization

This section takes gradual steps toward demonstrating how we can reduce both the IO and computation overhead using block-based parallelization. Specifically, we first use a fully connected sum layer to sketch the high-level idea (Sec. 4.1). We then move on to the general case, providing further details of the algorithm (Secs. 4.2, 4.3).

4.1. Fully Connected Sum Layers

Consider a fully connected sum layer composed of M sum nodes, each connected to the same set of N product nodes as inputs. Under the parallelization strategy mentioned in Section 3, with a single sample, we have M processors, each computing the output of a sum node. Since the layer is fully connected, every processor loads all N input log-probabilities, which results in M reloads of every input.

Figure 3. Illustration of block-based parallelization. A processor computes the output of 2 sum nodes by iterating through blocks of 2 input product nodes and accumulating partial results.
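To make the reload count concrete, the following minimal Python sketch compares the one-processor-per-node scheme with the blocked scheme described in the Figure 3 caption. It is an illustration only, not PyJuice's kernel: the function names, the block-size parameters `km` and `kn`, and the `loads` counter (a stand-in for reads of input log-probabilities) are all assumptions made for this example.

```python
import math

NEG_INF = float("-inf")

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def naive_sum_layer(log_w, prod_logp):
    """One simulated processor per sum node; each reads all N inputs."""
    out, loads = [], 0
    for row in log_w:
        loads += len(prod_logp)  # this processor reloads every input
        out.append(logsumexp([lw + lp for lw, lp in zip(row, prod_logp)]))
    return out, loads

def blocked_sum_layer(log_w, prod_logp, km, kn):
    """One simulated processor per block of km sum nodes (hypothetical sizes)."""
    M, N = len(log_w), len(prod_logp)
    out, loads = [NEG_INF] * M, 0
    for m0 in range(0, M, km):
        rows = range(m0, min(m0 + km, M))
        for n0 in range(0, N, kn):
            block = prod_logp[n0:n0 + kn]  # read once, shared by all rows
            loads += len(block)
            for m in rows:
                partial = logsumexp(
                    [log_w[m][n0 + j] + lp for j, lp in enumerate(block)]
                )
                out[m] = logsumexp([out[m], partial])  # accumulate partial result
    return out, loads

# Toy layer: M = 4 sum nodes, N = 6 product nodes, normalized weights.
M, N = 4, 6
raw = [[m + n + 1 for n in range(N)] for m in range(M)]
log_w = [[math.log(v / sum(row)) for v in row] for row in raw]
prod_logp = [math.log(0.1 * (n + 1)) for n in range(N)]

naive_out, naive_loads = naive_sum_layer(log_w, prod_logp)
block_out, block_loads = blocked_sum_layer(log_w, prod_logp, km=2, kn=2)
print(naive_loads, block_loads)  # prints: 24 12
```

In this toy model, each input is read once per block of `km` sum nodes rather than once per sum node, so the number of input reads drops from M·N to ⌈M/km⌉·N while the outputs stay the same.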


4.2. Generalizing To Practical Sum Layers

Figure 4. A sum layer (left) with a block-sparse parameter matrix (middle) is compiled into two kernels (right), each with a balanced workload. During execution, each kernel uses the compiled sum/prod/param indices to compute the outputs of m0, ..., m5.

4.3. Efficient Implementations by Compiling PC Layers

We address both problems through a compilation process, where we assign every node an index and precompute index tensors that enable efficient block-based parallelization. The first step is to partition the sum node blocks into groups, such that every node block within a group has a similar number of connected child node blocks. We then pad the children with pseudo-product node blocks with probability 0 such that all sum node blocks in a group have the same number of children. The partition is generated by a dynamic programming algorithm that aims to divide the layer into the smallest possible number of groups while ensuring that the fraction of added pseudo-node blocks does not exceed a predefined threshold. Due to space constraints, we describe the node block partitioning algorithm in Appendix A.1, along with its optimality and time/memory efficiency.
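As a rough illustration of the partitioning objective, here is a small dynamic programming sketch in Python. It is not the algorithm from Appendix A.1: the function name `partition_node_blocks` is hypothetical, and it assumes that the padding fraction is measured relative to the number of real child blocks and that groups are contiguous once the node blocks are sorted by child count.

```python
def partition_node_blocks(child_counts, max_pad_fraction):
    """Split node blocks into the fewest groups whose total padding stays
    within max_pad_fraction of the real child blocks (assumed definition).
    Returns a list of groups, each a list of indices into child_counts."""
    n = len(child_counts)
    if n == 0:
        return []
    # Sort by child count so each group only pads up to its largest member.
    order = sorted(range(n), key=lambda i: child_counts[i])
    counts = [child_counts[i] for i in order]
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)
    budget = max_pad_fraction * prefix[-1]

    def pad_cost(i, j):
        # Pseudo-children needed to pad sorted items i..j-1 to counts[j-1].
        return counts[j - 1] * (j - i) - (prefix[j] - prefix[i])

    inf = float("inf")
    best = [0] + [inf] * n  # best[j]: min padding for first j items, k groups
    cuts = []
    for k in range(1, n + 1):
        new, cut = [inf] * (n + 1), [0] * (n + 1)
        for j in range(1, n + 1):
            for i in range(j):
                cost = best[i] + pad_cost(i, j)
                if cost < new[j]:
                    new[j], cut[j] = cost, i
        cuts.append(cut)
        best = new
        if best[n] <= budget:  # the smallest feasible k is found first
            groups, j = [], n
            for c in reversed(cuts):  # walk the cut points backwards
                i = c[j]
                groups.append([order[t] for t in range(i, j)])
                j = i
            groups.reverse()
            return groups
    return [[i] for i in order]  # not reached for non-negative thresholds

groups = partition_node_blocks([2, 2, 3, 8, 8, 9], max_pad_fraction=0.2)
print(groups)  # prints: [[0, 1, 2], [3, 4, 5]]
```

With a threshold of 0.2, a single group would need 22 pseudo-children for 32 real ones, so the sketch settles on two groups that need 4 in total. This version runs in cubic time in the number of node blocks, which is acceptable for illustration but would need tightening for large layers.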


Partitioning a layer into groups with the same number of children allows us to use different kernel launching hyperparameters according to the specific setup of every node group (e.g., number of nodes) to achieve better performance.
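The padding that gives every group a uniform shape can be sketched as follows. This is a simplified, node-level Python illustration under stated assumptions, not the compiled index tensors PyJuice actually builds: `pad_group`, `run_group`, and `pseudo_id` are hypothetical names, and the pseudo-product node is modeled as one extra input whose log-probability is negative infinity (probability 0), paired with a zero parameter.

```python
import math

NEG_INF = float("-inf")

def pad_group(children, weights, pseudo_id):
    """Pad every sum node in a group to the same number of children.
    Padded slots point at pseudo_id and carry a zero parameter."""
    width = max(len(c) for c in children)
    child_ids = [c + [pseudo_id] * (width - len(c)) for c in children]
    params = [w + [0.0] * (width - len(w)) for w in weights]
    return child_ids, params

def run_group(child_ids, params, prod_logp):
    """Evaluate a rectangular (padded) group with one uniform loop."""
    out = []
    for ids, ws in zip(child_ids, params):
        m = max(prod_logp[i] for i in ids)
        if m == NEG_INF:
            out.append(NEG_INF)
            continue
        # exp(-inf - m) is 0.0, so padded slots contribute nothing.
        total = sum(w * math.exp(prod_logp[i] - m) for i, w in zip(ids, ws))
        out.append(m + math.log(total) if total > 0 else NEG_INF)
    return out

# Four real product nodes plus one pseudo node (index 4) with probability 0.
prod_logp = [math.log(p) for p in (0.1, 0.2, 0.3, 0.4)] + [NEG_INF]
children = [[0, 1, 2], [1, 3]]            # second node has fewer children
weights = [[0.2, 0.3, 0.5], [0.6, 0.4]]

child_ids, params = pad_group(children, weights, pseudo_id=4)
outputs = run_group(child_ids, params, prod_logp)
print(child_ids)  # prints: [[0, 1, 2], [1, 3, 4]]
```

Because every row of `child_ids` has the same length after padding, one loop shape serves the whole group, which is what makes per-group launch settings possible. The padded slot leaves the second node's output unchanged at log(0.28).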


4.4. Analysis: IO and Computation Overhead

Figure 5. Runtime and IO overhead of a sum layer from the PD structure (with 29K nodes and 30M edges). The results demonstrate significant performance gains from our block-based parallelization, even with small block sizes.

Results are shown in Figure 5. As the block size increases, both the forward and the backward pass become significantly faster. Notably, this is accompanied by a significant drop in IO overhead. Specifically, with a large block size, the kernel consumes 2x fewer reads/writes between the L2 cache and the HBM, and 25-50x fewer IO between the L1 and L2 cache. This corroborates the hypothesis stated in Section 3 that the extensive value reloads significantly slow down the computation.

The speedup obtained by using a larger block size outpaces the overhead caused by padded edges with zero parameters, so the net effect is faster execution.

:::info Authors:

(1) Anji Liu, Department of Computer Science, University of California, Los Angeles, USA ([email protected]);

(2) Kareem Ahmed, Department of Computer Science, University of California, Los Angeles, USA;

(3) Guy Van den Broeck, Department of Computer Science, University of California, Los Angeles, USA;

:::


:::info This paper is available on arXiv under the CC BY 4.0 DEED license.

:::
