This article introduces a novel arithmetical puzzle dataset designed to test and enhance AI reasoning capabilities. The puzzles involve manipulating integers through arithmetic operations to reach a target, with each number used exactly once. A data synthesis pipeline generates large-scale datasets, with controlled parameters for training, in-distribution testing, and out-of-distribution evaluation. Using the LLaMA architecture with LoRA fine-tuning, the study achieves efficient parameter reduction while benchmarking AI’s ability to generalize across numerical scales and abstract puzzle forms.

A Framework for Synthesizing Arithmetical Puzzle Datasets for Large Language Models

2025/08/24 00:35

:::info Authors:

(1) Haolong Li, Tongji University and work done during internship at ByteDance ([email protected]);

(2) Yu Ma, Seed Foundation, ByteDance ([email protected]);

(3) Yinqi Zhang, East China Normal University and work done during internship at ByteDance ([email protected]);

(4) Chen Ye (Corresponding Author), ESSC Lab, Tongji University ([email protected]);

(5) Jie Chen, Seed Foundation, ByteDance and a Project Leader ([email protected]).

:::

Abstract and 1 Introduction

2 Problem Definition

2.1 Arithmetical Puzzle Problem

2.2 Data Synthesizing

2.3 Dataset

3 Model

4 Experiments

4.1 Evaluation

4.2 Results

4.3 Case Studies

5 Conclusion and Acknowledgements

6 Limitations

7 Ethics Statement and References

A Appendix

A.1 Hyperparameter Settings

A.2 Evaluation of the Base Model

A.3 Case Study

A.4 Visualization of the Proposed Puzzle

2.1 Arithmetical Puzzle Problem

An arithmetical puzzle problem is a mathematical puzzle that involves arithmetic operations and requires logical reasoning and numerical manipulation to derive a solution. The 24 Puzzle and the Arithmetic Grid Puzzle are well-known examples.

In this paper, we propose a challenging arithmetical puzzle. Its objective is to manipulate a set of given integers through a sequence of arithmetic operations to reach a predetermined target integer. Each given integer must be used exactly once. For example, for the integers 3, 6, 7, 51, 58 and the target integer 4, one possible solution is: 58 − 51 = 7, 6 − 7 = −1, 3 × (−1) = −3, −3 + 7 = 4, as shown in Figure 5 in Appendix A.4.
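The rule that every integer is used exactly once can be checked mechanically. The following is a minimal sketch, not the authors' code: the names `apply_op` and `check_solution` and the step format `(a, op, b, result)` are hypothetical. It assumes that intermediate results join the pool of usable numbers and that division, if present, is Python floor division.

```python
from collections import Counter


def apply_op(a: int, op: str, b: int) -> int:
    """Apply one arithmetic operator to two integers."""
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    if op == "/":
        # Assumption: integer division is floor division (Python's //).
        return a // b
    raise ValueError(f"unknown operator: {op}")


def check_solution(candidates, target, steps) -> bool:
    """Return True if the steps use each number exactly once and reach the target.

    Each step is a tuple (a, op, b, claimed_result).
    """
    pool = Counter(candidates)  # numbers still available for use
    result = None
    for a, op, b, claimed in steps:
        for operand in (a, b):
            if pool[operand] <= 0:
                return False  # operand missing or already consumed
            pool[operand] -= 1
        if op == "/" and b == 0:
            return False  # division by zero is never a valid step
        result = apply_op(a, op, b)
        if result != claimed:
            return False  # the stated arithmetic is wrong
        pool[result] += 1  # the intermediate result becomes available
    # Exactly one number (the final result) must remain, and it must be the target.
    return result == target and sum(pool.values()) == 1


example_steps = [(58, "-", 51, 7), (6, "-", 7, -1), (3, "*", -1, -3), (-3, "+", 7, 4)]
print(check_solution([3, 6, 7, 51, 58], 4, example_steps))
```

Run on the example from the text, the checker accepts the four-step solution. A sequence that reuses 58 or leaves a candidate unused is rejected.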


2.2 Data Synthesizing

Given the arithmetical puzzle described above in Section 2.1, we create a data synthesizing pipeline to efficiently generate the proposed dataset.

Denote the set of candidate integers as X = {X1, X2, ..., XN} and the target number as T, where N is the total number of candidate integers in a puzzle sample. Each candidate integer Xi is independently sampled from a uniform distribution Xi ∼ U(1, V), where V is the upper bound of the sampled integers. To avoid data overlapping, we strictly ensure that the candidate integers of each puzzle are distinct. The arithmetic operators involved in this problem are ops = {+, −, ×, ÷}, and all operations are restricted to integer arithmetic. For example, division is treated as integer division, so 14/3 = 4. The detailed steps for synthesizing data for this puzzle are described in Algorithm 1.
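Algorithm 1 is not reproduced in this excerpt, so the following is only an illustrative sketch of one way such a generator could work, not the authors' procedure. It samples N distinct integers and then combines randomly chosen pairs until one number remains, which becomes the target. The function name `synthesize_puzzle` is hypothetical, and the sketch assumes inclusive bounds for U(1, V) and Python floor division for ÷.

```python
import random

# Assumption: "/" is floor division, consistent with 14/3 = 4 in the text.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b,
}


def synthesize_puzzle(n: int, v: int, rng: random.Random):
    """Return (candidates, target, steps) for one randomly built puzzle."""
    # Sample n distinct integers uniformly from 1..v (inclusive bounds assumed).
    candidates = rng.sample(range(1, v + 1), n)
    pool = list(candidates)
    steps = []
    while len(pool) > 1:
        # Pick two different positions in the pool and an operator.
        i, j = rng.sample(range(len(pool)), 2)
        a, b = pool[i], pool[j]
        allowed = [op for op in OPS if not (op == "/" and b == 0)]
        op = rng.choice(allowed)
        result = OPS[op](a, b)
        # Remove both operands and add the intermediate result to the pool.
        pool = [x for k, x in enumerate(pool) if k not in (i, j)]
        pool.append(result)
        steps.append((a, op, b, result))
    return candidates, pool[0], steps


candidates, target, steps = synthesize_puzzle(5, 60, random.Random(0))
print(candidates, target, steps)
```

Building the target from the candidates guarantees that every sample has at least one solution. Whether the original pipeline also constrains the target, for example to a positive range, is not stated here.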

In addition, to construct the SFT dataset, the prompt is deliberately designed to exclude any natural language cues and to use purely symbolic language instead. See Table 1 for an example of the constructed prompt and response.

2.3 Dataset

We split the dataset into a training set, an in-distribution test set, and out-of-distribution test sets by controlling the total number of candidate integers N and the upper bound of the sampled integers V. We set V = 60 for the training dataset and sample the candidate integers with N = 5, 6, 7. Three training datasets of different sizes are generated: 1 million, 10 million, and 100 million samples. Another 7500 samples (2500 for each N) are generated under the same setting as the in-distribution test dataset. Figure 1 shows the distribution of N and X in these three training datasets, and Figure 2 shows the corresponding distribution of the tokenized prompt and response lengths.

To further evaluate the model’s extrapolation performance, we also design two out-of-distribution benchmarks:

Numerical OOD test datasets. The upper bound of the sampled integers V is raised to 100 and 1000, respectively, to test the model’s ability to generalize to unseen larger numbers. Specifically, 6000 samples are generated for each value of V, with 2000 samples for each N. An additional filtering pipeline ensures that each sample contains at least one integer Xi satisfying 60 < Xi < 100 for the dataset with V = 100, and 100 < Xi < 1000 for the dataset with V = 1000.
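The filtering condition can be expressed as a simple predicate combined with rejection sampling. This is a minimal sketch under assumptions: the names `has_unseen_large_value` and `sample_numerical_ood` and the `max_tries` cap are hypothetical, and the actual filtering pipeline may be implemented differently. The strict inequalities follow the text.

```python
import random


def has_unseen_large_value(candidates, lower: int, upper: int) -> bool:
    """True if at least one candidate x satisfies lower < x < upper."""
    return any(lower < x < upper for x in candidates)


def sample_numerical_ood(n: int, v: int, lower: int, rng: random.Random,
                         max_tries: int = 10_000):
    """Draw n distinct integers from 1..v, rejecting draws without a large value."""
    for _ in range(max_tries):
        candidates = rng.sample(range(1, v + 1), n)
        if has_unseen_large_value(candidates, lower, v):
            return candidates
    raise RuntimeError("no valid sample found within max_tries")


rng = random.Random(1)
print(sample_numerical_ood(5, 100, 60, rng))     # V = 100, threshold 60
print(sample_numerical_ood(5, 1000, 100, rng))   # V = 1000, threshold 100
```

Without the filter, a sample drawn with V = 100 could consist entirely of integers at or below 60 and would then be indistinguishable from in-distribution data.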

Form OOD test dataset. In mathematics, abstract forms are often extended, for example from a linear equation in two variables to one in three variables. For the proposed arithmetic puzzle, abstract forms can be extrapolated by changing the number of candidate integers N. When N increases, the exploration space leading to a feasible solution expands exponentially, which increases the demand for precise reasoning steps. From another perspective, a change in the total number of candidate integers tests the model’s ability to absorb and adapt to the puzzle’s abstract form. Therefore, to test generalization from this point of view, we create another OOD benchmark of 5000 samples with N set to 8. To control variables, all candidate integers in this dataset are sampled with the same upper bound V = 60 as the training dataset.
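To get a rough feel for how quickly the exploration space grows with N, one can count candidate step sequences. The sketch below is a crude upper bound under an assumed search model, not a figure from the paper: at each step, an ordered pair is chosen from the k numbers still in the pool, together with one of four operators. The function name `count_step_sequences` is hypothetical.

```python
def count_step_sequences(n: int, num_ops: int = 4) -> int:
    """Upper bound on step sequences that reduce n numbers to one.

    Assumes each step picks an ordered pair from the k remaining numbers
    (k * (k - 1) choices) and one of num_ops operators. This over-counts,
    since a + b and b + a are counted as different steps.
    """
    total = 1
    for k in range(n, 1, -1):
        total *= num_ops * k * (k - 1)
    return total


for n in (5, 6, 7, 8):
    print(n, count_step_sequences(n))
```

Under this model, each additional candidate integer multiplies the count by 4·N·(N − 1). This makes N = 8 a substantially larger search problem than the N = 5, 6, 7 settings seen in training.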

3 Model

3.1 Framework

We adopt the LLaMA architecture (Touvron et al., 2023a) and employ low-rank adaptation (LoRA) tuning (Hu et al., 2021), based on the implementation in the TRL full stack library (von Werra et al., 2020). LoRA reduces our trainable parameters by 89%, from 3B to 0.3B.
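The source of the parameter reduction can be illustrated with a small calculation. LoRA freezes a weight matrix of shape d_out × d_in and trains two low-rank factors of shapes d_out × r and r × d_in, so the trainable count per adapted matrix is r·(d_in + d_out). The layer sizes and rank below are hypothetical values chosen for illustration. The actual settings are those listed in Table 3, which is not reproduced here.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Number of weights in a dense d_out x d_in matrix."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights in the two LoRA factors (d_out x r and r x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical layer: 1024 x 1024 with rank 8 (not the paper's configuration).
dense = full_params(1024, 1024)
lora = lora_trainable_params(1024, 1024, 8)
print(dense, lora, f"{1 - lora / dense:.2%} fewer trainable weights")
```

The overall reduction for a whole model depends on the rank and on which matrices are adapted, so this per-matrix figure should not be read as the paper's 89%.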

3.2 Implementation Details

We train our model by fine-tuning open-llama-3B. We systematically apply left-padding to the query text and right-padding to the answer text to control the overall context length. All experiments are conducted with 8× NVIDIA A100-SXM4-80GB GPUs. The specific hyperparameter settings are listed in Table 3 in Appendix A.1.
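The padding scheme can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the pad token id 0, and the fixed lengths are hypothetical, and the sketch raises an error on over-long inputs because the text does not say how they are handled.

```python
def pad_left(tokens, length: int, pad_id: int):
    """Left-pad a token list to a fixed length."""
    if len(tokens) > length:
        raise ValueError("sequence longer than the target length")
    return [pad_id] * (length - len(tokens)) + list(tokens)


def pad_right(tokens, length: int, pad_id: int):
    """Right-pad a token list to a fixed length."""
    if len(tokens) > length:
        raise ValueError("sequence longer than the target length")
    return list(tokens) + [pad_id] * (length - len(tokens))


def build_example(query, answer, query_len: int, answer_len: int, pad_id: int = 0):
    """Concatenate a left-padded query with a right-padded answer."""
    return pad_left(query, query_len, pad_id) + pad_right(answer, answer_len, pad_id)


print(build_example([5, 6], [7], query_len=4, answer_len=3))
```

With this layout, the last query token sits directly before the first answer token in every example, and the total length is always `query_len + answer_len`.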


:::info This paper is available on arxiv under CC BY-NC-SA 4.0 Deed (Attribution-Noncommercial-Sharelike 4.0 International) license.

:::
