This article explores how GPT-4 and GPT-3.5 perform in simulated game environments under three conditions: human-written rules, LLM-generated rules, and no rules. The results reveal that GPT-4 significantly outpaces GPT-3.5, especially when rules are absent, underscoring its superior ability to apply common sense and accurately predict game states.

GPT-4 vs GPT-3.5 Performance in Game Simulations

By: Hackernoon
2025/09/24 23:00

Abstract and 1. Introduction and Related Work

  2. Methodology

    2.1 LLM-Sim Task

    2.2 Data

    2.3 Evaluation

  3. Experiments

  4. Results

  5. Conclusion

  6. Limitations and Ethical Concerns, Acknowledgements, and References

A. Model details

B. Game transition examples

C. Game rules generation

D. Prompts

E. GPT-3.5 results

F. Histograms

D Prompts

The prompts introduced in this section include game rules, which can be either human-written or LLM-generated. For experiments without game rules, we simply remove the rules from the corresponding prompts.
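The rule-removal step described above can be sketched as follows. This is a minimal illustration, not the authors' prompt code: the function name `build_prompt`, the section labels, and the ordering of sections are all assumptions made for the example.

```python
from typing import Optional


def build_prompt(
    task_instructions: str,
    state_json: str,
    action: str,
    rules: Optional[str] = None,
) -> str:
    """Assemble a simulator prompt (hypothetical layout).

    `rules` may hold human-written or LLM-generated rules. Passing None
    (or an empty string) gives the no-rules setting: the rules section
    is simply left out and the rest of the prompt is unchanged.
    """
    sections = [task_instructions]
    if rules:
        sections.append("Game rules:\n" + rules)
    sections.append("Current state:\n" + state_json)
    sections.append("Action:\n" + action)
    return "\n\n".join(sections)


with_rules = build_prompt(
    "Predict the next game state.",
    '{"stove": {"isOn": false}}',
    "turn on stove",
    rules="Turning on a device sets isOn to true.",
)
without_rules = build_prompt(
    "Predict the next game state.",
    '{"stove": {"isOn": false}}',
    "turn on stove",
)
```

Keeping the rules as a single optional section means the three experimental settings differ only in that one argument.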

D.1 Prompt Example: F_act

D.1.1 Full State Prediction

D.1.2 State Difference Prediction

D.2 Prompt Example: F_env

D.2.1 Full State Prediction

D.2.2 State Difference Prediction

D.3 Prompt Example: F_R (Game Progress)

D.4 Prompt Example: F

D.4.1 Full State Prediction

D.4.2 State Difference Prediction

D.5 Other Examples

Below is an example of the rule of an action:

Below is an example of the rule of an object:

Below is an example of the score rule:

Below is an example of a game state:

Table 5: Average accuracy per game of GPT-3.5 predicting the whole state transitions (F) as well as action-driven transitions (F_act) and environment-driven transitions (F_env). We report settings that use LLM-generated rules, human-written rules, or no rules. Dynamic and static denote whether the game object properties and game progress should be changed; full and diff denote whether the prediction outcome is the full game state or state differences. Numbers are shown as percentages.

Table 6: GPT-3.5 game progress prediction results

Below is an example of a JSON that describes the difference of two game states:
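To make the idea of a state difference concrete, here is a minimal sketch that computes one from two game states. The state layout (object name mapped to a property dictionary), the `modified` and `removed` keys, and the stove and pot objects are assumptions made for illustration; they are not taken from the paper's data.

```python
import json
from typing import Any, Dict

# Hypothetical representation: object name -> {property name: value}.
GameState = Dict[str, Dict[str, Any]]


def state_diff(before: GameState, after: GameState) -> Dict[str, Any]:
    """Return only what changed between two states.

    Objects that are new in `after` are reported with all their properties;
    existing objects are reported with just the properties that are new or
    whose values changed. Properties deleted from a surviving object are
    not tracked in this sketch.
    """
    modified: Dict[str, Dict[str, Any]] = {}
    for name, props in after.items():
        old = before.get(name)
        if old is None:
            modified[name] = dict(props)
            continue
        changed = {k: v for k, v in props.items() if k not in old or old[k] != v}
        if changed:
            modified[name] = changed
    removed = [name for name in before if name not in after]
    return {"modified": modified, "removed": removed}


before = {
    "stove": {"isOn": False, "temperature": 20},
    "pot": {"temperature": 20},
}
after = {
    "stove": {"isOn": True, "temperature": 20},
    "pot": {"temperature": 20},
}
diff = state_diff(before, after)
print(json.dumps(diff, indent=2))
```

A difference of this kind is much shorter than the full state when an action touches only one or two properties, which is the practical distinction between the full and diff prediction settings.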

E GPT-3.5 results

Table 5 and Table 6 show the performance of a GPT-3.5 simulator predicting object properties and game progress, respectively. There is a large gap between GPT-4 and GPT-3.5 performance, providing yet another example of how quickly LLMs have developed in two years. It is also worth noting that the performance difference is larger when no rules are provided, indicating that GPT-3.5 is especially weak at applying common-sense knowledge to this few-shot world simulation task.
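As a rough illustration of the metric named in the Table 5 caption, the sketch below computes an average accuracy per game. It assumes this means scoring each game separately and then taking the unweighted mean across games; that reading, the function name, and the sample records are assumptions rather than details confirmed by the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def average_accuracy_per_game(records: Iterable[Tuple[str, bool]]) -> float:
    """Macro-average accuracy in percent.

    Each record is (game_name, prediction_was_correct). Accuracy is computed
    per game first, then averaged, so a game with many transitions does not
    dominate the result.
    """
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for game, is_correct in records:
        totals[game] += 1
        correct[game] += int(is_correct)
    if not totals:
        raise ValueError("no records to score")
    per_game = [correct[game] / totals[game] for game in totals]
    return 100.0 * sum(per_game) / len(per_game)


# Made-up records for illustration only; these are not results from the paper.
sample = [("game_a", True), ("game_a", False), ("game_b", True)]
score = average_accuracy_per_game(sample)
```

With the sample above, `game_a` scores 50% and `game_b` scores 100%, so the per-game average is 75%, whereas pooling all three records would give about 66.7%.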

F Histograms

1. In Figure 3, we show detailed experimental results on the full state prediction task performed by GPT-4.

2. In Figure 4, we show detailed experimental results on the state difference prediction task performed by GPT-4.

3. In Figure 5, we show detailed experimental results on the full state prediction task performed by GPT-3.5.

4. In Figure 6, we show detailed experimental results on the state difference prediction task performed by GPT-3.5.

Table 7: Description of object properties mentioned in Figure 2

Figure 3: GPT-4 - Full State prediction from a) Human-generated rules, b) LLM-generated rules, and c) No rules.

Figure 4: GPT-4 - Difference prediction from a) Human-generated rules, b) LLM-generated rules, and c) No rules.

Figure 5: GPT-3.5 - Full State prediction from a) Human-generated rules, b) LLM-generated rules, and c) No rules.

Figure 6: GPT-3.5 - Difference prediction from a) Human-generated rules, b) LLM-generated rules, and c) No rules.

:::info Authors:

(1) Ruoyao Wang, University of Arizona;

(2) Graham Todd, New York University;

(3) Ziang Xiao, Johns Hopkins University;

(4) Xingdi Yuan, Microsoft Research Montréal;

(5) Marc-Alexandre Côté, Microsoft Research Montréal;

(6) Peter Clark, Allen Institute for AI;

(7) Peter Jansen, University of Arizona and Allen Institute for AI.

:::

:::info This paper is available on arXiv under the CC BY 4.0 license.

:::
