Large language models hold promise as simulators of virtual environments, but new benchmarking with BYTESIZED32 shows that even GPT-4 falls short. While LLMs can generate plausible outcomes, they often fail at capturing complex state transitions requiring arithmetic, common sense, or scientific reasoning. This research highlights both their potential and current limitations, offering a novel benchmark for tracking progress as models evolve.

Are Large Language Models the Future of Game State Simulation?

:::info Authors:

(1) Ruoyao Wang, University of Arizona ([email protected]);

(2) Graham Todd, New York University ([email protected]);

(3) Ziang Xiao, Johns Hopkins University ([email protected]);

(4) Xingdi Yuan, Microsoft Research Montréal ([email protected]);

(5) Marc-Alexandre Côté, Microsoft Research Montréal ([email protected]);

(6) Peter Clark, Allen Institute for AI ([email protected]);

(7) Peter Jansen, University of Arizona and Allen Institute for AI ([email protected]).

:::

Abstract and 1. Introduction and Related Work

  2. Methodology

    2.1 LLM-Sim Task

    2.2 Data

    2.3 Evaluation

  3. Experiments

  4. Results

  5. Conclusion

  6. Limitations and Ethical Concerns, Acknowledgements, and References

A. Model details

B. Game transition examples

C. Game rules generation

D. Prompts

E. GPT-3.5 results

F. Histograms

Abstract

Virtual environments play a key role in benchmarking advances in complex planning and decision-making tasks but are expensive and complicated to build by hand. Can current language models themselves serve as world simulators, correctly predicting how actions change different world states, thus bypassing the need for extensive manual coding? Our goal is to answer this question in the context of text-based simulators. Our approach is to build and use a new benchmark, called BYTESIZED32-State-Prediction, containing a dataset of text game state transitions and accompanying game tasks. We use this to directly quantify, for the first time, how well LLMs can serve as text-based world simulators. We test GPT-4 on this dataset and find that, despite its impressive performance, it is still an unreliable world simulator without further innovations. This work thus contributes both new insights into current LLMs' capabilities and weaknesses and a novel benchmark to track future progress as new models appear.

1 Introduction and Related Work

Simulating the world is crucial for studying and understanding it. In many cases, however, the breadth and depth of available simulations are limited by the fact that their implementation requires extensive work from a team of human experts over weeks or months. Recent advances in large language models (LLMs) have pointed towards an alternate approach by leveraging the huge amount of knowledge contained in their pre-training datasets. But are they ready to be used directly as simulators?

We examine this question in the domain of text-based games, which naturally express the environment and its dynamics in natural language and have long been used as part of advances in decision-making processes (Côté et al., 2018; Fan et al., 2020; Urbanek et al., 2019; Shridhar et al., 2020; Hausknecht et al., 2020; Jansen, 2022; Wang et al., 2023), information extraction (Ammanabrolu and Hausknecht, 2020; Adhikari et al., 2020), and artificial reasoning (Wang et al., 2022).

Figure 1: An overview of our two approaches to using an LLM as a text game simulator. The example shows a cup in the sink being filled with water after the sink is turned on. The full state prediction includes all objects in the game, including the unrelated stove, while the state difference prediction excludes the unrelated stove. State changes caused by F_act and F_env are highlighted in yellow and green, respectively.
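
To make the distinction concrete, the sketch below shows roughly what the two output formats could look like for this sink-and-cup example. The object names and property fields here are invented for illustration and are not the benchmark's actual schema.

```python
# Hypothetical JSON-style game state before the action "turn on sink".
state_t = {
    "sink":  {"isOn": False, "contains": ["cup"]},
    "cup":   {"containsLiquid": False},
    "stove": {"isOn": False},  # unrelated to the action
}

# Full state prediction: the model must reproduce every object,
# including the untouched stove.
full_state_t1 = {
    "sink":  {"isOn": True, "contains": ["cup"]},
    "cup":   {"containsLiquid": True},
    "stove": {"isOn": False},
}

# State difference prediction: the model emits only the objects whose
# properties changed, so the stove is omitted entirely.
state_diff_t1 = {
    "sink": {"isOn": True},
    "cup":  {"containsLiquid": True},
}
```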

Broadly speaking, there are two ways to leverage LLMs in the context of world modeling and simulation. The first is neurosymbolic: a number of efforts use language models to generate code in a symbolic representation that allows for formal planning or inference (Liu et al., 2023; Nottingham et al., 2023; Wong et al., 2023; Tang et al., 2024). REASONING VIA PLANNING (RAP) (Hao et al., 2023) is one such approach: it constructs a world model from LLM priors and then uses a dedicated planning algorithm to decide on agent policies (LLMs themselves continue to struggle to act directly as planners (Valmeekam et al., 2023)). Similarly, BYTESIZED32 (Wang et al., 2023) tasks LLMs with instantiating simulations of scientific reasoning concepts in the form of large Python programs. These efforts stand in contrast to the second, and comparatively less studied, approach of direct simulation. For instance, AI-DUNGEON represents a game world purely through the generated output of a language model, with inconsistent results (Walton, 2020). In this work, we provide the first quantitative analysis of the ability of LLMs to directly simulate virtual environments. We use structured JSON representations as a scaffold that both improves simulation accuracy and allows us to directly probe the LLM's abilities across a variety of conditions.
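
As a minimal sketch of this direct-simulation setup, the snippet below asks a model to simulate a single state transition. The prompt wording, the JSON schema, and the `query_llm` helper are all assumptions made for illustration; they are not the paper's actual prompts or code.

```python
import json

def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM API (e.g., GPT-4); returns the raw completion text."""
    raise NotImplementedError

def predict_next_state(state: dict, action: str) -> dict:
    """Ask the model to act as the world simulator for one game step."""
    prompt = (
        "You are simulating a text game. Given the current game state as JSON "
        "and a player action, respond with the next game state as JSON only.\n\n"
        f"Current state:\n{json.dumps(state, indent=2)}\n\n"
        f"Action: {action}\n\n"
        "Next state:"
    )
    return json.loads(query_llm(prompt))

# Example usage with a toy state (see the figure sketch above):
# next_state = predict_next_state({"sink": {"isOn": False}}, "turn on sink")
```

An analogous prompt that asks only for the objects whose properties changed would implement the state difference variant.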

In a systematic analysis of GPT-4 (Achiam et al., 2023), we find that LLMs broadly fail to capture state transitions not directly related to agent actions, as well as transitions that require arithmetic, common-sense, or scientific reasoning. Across a variety of conditions, model accuracy does not exceed 59.9% for transitions in which a non-trivial change in the world state occurs. These results suggest that, while promising and useful for downstream tasks, LLMs are not yet ready to act as reliable world simulators without further innovation.[1]


:::info This paper is available on arXiv under a CC BY 4.0 license.

:::

[1] Code and data are available at https://github.com/cognitiveailab/GPT-simulator.
