This study examines the reasoning limitations of Transformer architectures through the lens of syllogism composition and global reasoning tasks. By formalizing the cycle task, a synthetic benchmark that requires long-chain logical inference, the authors show that the difficulty of learning grows exponentially for Transformers as task complexity increases. To explain this, they introduce distribution locality, a measure of how many tokens, beyond basic statistics, are needed to correlate meaningfully with the target output.

Why Transformers Struggle with Global Reasoning

Abstract and 1. Introduction

1.1 Syllogisms composition

1.2 Hardness of long compositions

1.3 Hardness of global reasoning

1.4 Our contributions

  2. Results on the local reasoning barrier

    2.1 Defining locality and auto-regressive locality

    2.2 Transformers require low locality: formal results

    2.3 Agnostic scratchpads cannot break the locality

  3. Scratchpads to break the locality

    3.1 Educated scratchpad

    3.2 Inductive Scratchpads

  4. Conclusion, Acknowledgments, and References

A. Further related literature

B. Additional experiments

C. Experiment and implementation details

D. Proof of Theorem 1

E. Comment on Lemma 1

F. Discussion on circuit complexity connections

G. More experiments with ChatGPT


1.3 Hardness of global reasoning

Figure 1: Illustration of the cycle task for n = 4 (left) and the complexity to learn it (right).

As discussed previously, the cycle task appears to be challenging for Transformers because it requires some global reasoning. Other tasks, such as subset parities, exhibit the same challenge. However, the latter can be proved not to be efficiently learnable by various regular neural networks and noisy gradient descent, as one can explicitly obtain a class of functions (through orbit arguments [12, 13]) that has large statistical dimension [14] or low cross-predictability [12, 15] (see Appendix A.2). For the cycle task, we have a single distribution, and it is unclear how to use the invariances of Transformers to obtain arguments as in [12, 13], as the input distribution is not invariant under the symmetries of the model. We would thus like to develop a more general complexity measure that unifies why such tasks are hard for Transformer-like models and that formalizes the notion of ‘local reasoning barrier’ when models are trained from scratch. We would also like to understand how the scratchpad methodologies that have proved helpful in various settings (see Section 3) can help here. This raises the questions:

(1) How can we formalize the ‘local reasoning barrier’ in general terms?

(2) Can we break the ‘local reasoning barrier’ with scratchpad methodologies?
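As a small illustration of why subset parities resist "local" learning, the sketch below enumerates every input of a toy parity problem and checks that the label has zero correlation with the parity of any strict subset of the relevant bits; only the full set of relevant bits correlates. The function names and the choice of n = 6 with four relevant bits are illustrative assumptions, not taken from the paper, and this exhaustive computation is a toy stand-in rather than the paper's formal locality measure. It only tests parity features, which suffices here because any function of a set of bits is a linear combination of parities of its subsets.

```python
from itertools import combinations, product


def parity_label(bits, support):
    """Return the +1/-1 parity of the bits indexed by `support` (bits are 0/1)."""
    return (-1) ** sum(bits[i] for i in support)


def correlation_with_subset(n, support, subset):
    """Exact correlation, under uniform inputs, between the target parity
    on `support` and the parity of the bits indexed by `subset`."""
    total = 0
    for bits in product((0, 1), repeat=n):
        total += parity_label(bits, support) * parity_label(bits, subset)
    return total / 2 ** n


def smallest_correlated_subset_size(n, support):
    """Smallest number of positions whose parity correlates with the target.

    Illustrative proxy (an assumption of this sketch) for how many tokens
    must be considered jointly before any signal about the label appears.
    """
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            if correlation_with_subset(n, support, subset) != 0:
                return size
    return None


if __name__ == "__main__":
    n, support = 6, (0, 2, 3, 5)  # hypothetical toy instance
    print(correlation_with_subset(n, support, (0, 2)))   # strict subset
    print(correlation_with_subset(n, support, support))  # full support
    print(smallest_correlated_subset_size(n, support))
```

In this toy setting, no group of fewer than four positions carries any signal about the label, which is the flavor of barrier the questions above aim to formalize.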

1.4 Our contributions

We provide the following contributions:

– A general conjecture (Conjecture 1), backed by experimental results, that claims efficient weak learning is achievable by a regular Transformer if and only if the distribution locality is constant.

– A theorem (Theorem 1) that proves the negative side of the above conjecture, the locality barrier, in the instance of a variant of the cycle task under certain technical assumptions. (The cycle task is also put forward in the paper as a simple benchmark to test the global reasoning capabilities of models.)

• We then switch to the use of ‘scratchpads’ to help with the locality barrier:

– Agnostic scratchpad: we extend Theorem 1 to cases where a polynomial-size scratchpad is used by the Transformer, without any supervision of the scratchpad. I.e., the scratchpad gives additional memory space for the Transformer to compute intermediate steps. This shows that efficient weak learning is still not possible with such an agnostic scratchpad if the locality is non-constant. An educated guess about what to learn in the scratchpad based on some target knowledge is thus required.

– Educated scratchpad: we generalize the measure of locality to the ‘autoregressive locality’ to quantify when an educated scratchpad is able to break the locality of a task with subtasks of lower locality. We give experimental results showing that educated scratchpads with constant autoregressive locality allow Transformers to efficiently learn tasks that may originally have high locality. This gives a way to measure how useful a scratchpad can be to break a target into easier sub-targets.

– We introduce the notion of inductive scratchpad, a type of educated scratchpad that exploits ‘induction’, in contrast to a fully educated scratchpad. We show that when the target admits an inductive decomposition, such as for the cycle, arithmetic, or parity tasks, the inductive scratchpad both breaks the locality and improves out-of-distribution (OOD) generalization compared with fully educated scratchpads. This gives significant length generalization on additions (from 10 to 20 digits or from 4 to 26 digits, depending on the method) and parities (from 30 to 50-55 bits). For comparison, using different methods, [17] can length generalize from 10 to 13 digits for additions, and [11] can get roughly 10 extra bits for parities with moderate accuracy.
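To make the idea of an inductive decomposition more concrete, here is a minimal sketch for the parity task, in which each scratchpad state is computed only from the previous state and one new input bit. The trace format (`?` and `#` separators) and the function names are hypothetical illustrations; they are not the paper's actual scratchpad format or training setup.

```python
def parity_step(state, bit):
    """One induction step: the new running parity depends only on the
    previous state and the next input bit."""
    return state ^ bit


def inductive_scratchpad(bits):
    """Return the list of running-parity states for `bits`.

    states[i] is the parity of bits[:i], so states[0] is 0 and
    states[-1] is the final answer.
    """
    states = [0]
    for bit in bits:
        states.append(parity_step(states[-1], bit))
    return states


def render(bits):
    """Format question, scratchpad, and answer as one training string.

    The layout "<input> ? <states> # <answer>" is an assumption made
    for this sketch only.
    """
    question = "".join(map(str, bits))
    states = inductive_scratchpad(bits)
    scratchpad = " ".join(map(str, states[1:]))
    return f"{question} ? {scratchpad} # {states[-1]}"


if __name__ == "__main__":
    print(render([1, 0, 1, 1]))
```

Because `parity_step` is the same at every position, a model that learns this single step could in principle apply it to inputs longer than those seen in training, which is the intuition behind the length generalization discussed above.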


:::info Authors:

(1) Emmanuel Abbe, Apple and EPFL;

(2) Samy Bengio, Apple;

(3) Aryo Lotfi, EPFL;

(4) Colin Sandon, EPFL;

(5) Omid Saremi, Apple.

:::


:::info This paper is available on arXiv under a CC BY 4.0 license.

:::

[1] Answering ‘yes/1’ if the syllogism can be obtained by composing input ones or ‘cannot tell/0’ otherwise.

[2] At the time of the experiments, ChatGPT was in particular not successful at these two tasks.
