Is OpenAI GPT-5.2 actually better than Google Gemini 3 Pro? If you strip away the extra "thinking" time used in the benchmarks, the gap disappears. We dug into

OpenAI GPT-5.2: The “Cheating” Controversy

2025/12/15 12:58

OpenAI recently released GPT-5.2 with strong benchmark results. However, online discussion suggests that OpenAI spent more tokens and compute on its benchmark runs than a typical comparison would, which some consider “cheating” on the tests. If everything else is equal, is GPT-5.2 actually just on par with Gemini 3 Pro? Here we try to find out.

The "Cheating" Controversy: Compute & Tokens

The core of the controversy lies in inference-time compute. "Cheating" in this context refers to OpenAI using a configuration for benchmarks that is significantly more powerful (and expensive) than what is available to standard users or what is typical for a "fair" comparison.

  • "xhigh" vs. "Medium" Effort: Reports indicate that OpenAI's published benchmark results were generated using an "xhigh" reasoning effort setting. This mode allows the model to generate a massive number of internal "thought" tokens (reasoning steps) before producing an answer.
  • The Issue: Standard ChatGPT Plus users reportedly only have access to "medium" or "high" effort modes. The "xhigh" mode used for benchmarks consumes vastly more tokens and compute, effectively brute-forcing higher scores by allowing the model to "think" for much longer (sometimes 30-50 minutes for complex tasks) than a standard interaction allows.
  • Inference Scaling: This leverages a concept where allowing a model to generate more tokens during inference (test time) improves performance significantly. Critics argue that comparing GPT-5.2's "xhigh" scores against Gemini 3 Pro's standard outputs is misleading because it compares a "maximum compute" scenario against a "standard usage" scenario.
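To see why extra inference-time compute can lift scores at all, here is a minimal toy sketch. It does not model GPT-5.2's internal reasoning, which the sources do not describe; it swaps in a simpler, well-known form of test-time scaling, majority voting over repeated attempts. The assumptions are loud ones: answers are binary, attempts are independent, and the per-attempt accuracy `p` is a made-up number.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Chance that a majority of n independent attempts is correct.

    Toy model (assumption, not any vendor's method): each attempt is
    right with probability p, answers are binary, and n is odd so the
    vote can never tie.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    need = n // 2 + 1  # smallest number of correct attempts that wins the vote
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(need, n + 1)
    )


if __name__ == "__main__":
    # Hypothetical per-attempt accuracy of 60%; more attempts = more compute.
    for n in (1, 3, 5, 11, 51):
        print(f"{n:>2} attempts -> {majority_vote_accuracy(0.6, n):.3f}")
```

The same underlying "model" (a fixed `p`) scores higher purely because it is given more attempts, which is the heart of the fairness complaint: a score depends on the compute budget as well as on the model.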

Benchmark Comparison (GPT-5.2 vs. Gemini 3 Pro)

When the massive compute boost is factored in, GPT-5.2 does post higher scores, but the gap narrows or reverses when conditions are scrutinized.

| Benchmark | GPT-5.2 (Thinking/Pro) | Gemini 3 Pro | Context |
|----|----|----|----|
| ARC-AGI-2 | 52.9% | ~31.1% | Measures abstract reasoning. GPT-5.2's score is heavily reliant on the "Thinking" process. |
| GPQA Diamond | 92.4% | 91.9% | Graduate-level science. The scores are effectively tied (within margin of error). |
| SWE-Bench Pro | 55.6% | N/A | Real-world software engineering. GPT-5.2 sets a new SOTA here. |
| SWE-Bench Verified | 80.0% | 76.2% | A more established coding benchmark. The models are roughly comparable here. |

  • Private Benchmarks: Some independent evaluations (e.g., restricted "private benchmarks" mentioned in discussions) suggest that Gemini 3 Pro actually outperforms GPT-5.2 in areas like creative writing, philosophy, and tool use when the "gaming" of public benchmarks is removed.
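The "effectively tied" reading of the GPQA Diamond row can be sanity-checked with a rough margin-of-error estimate. The sketch below uses the two scores from the table and a normal approximation for a proportion. The question count of 198 is an assumption (it is the commonly cited size of the Diamond subset and does not come from the sources listed here), and the calculation ignores run-to-run variance.

```python
from math import sqrt


def margin_of_error(score: float, n_questions: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an accuracy score.

    Uses the normal approximation for a binomial proportion.
    """
    return z * sqrt(score * (1 - score) / n_questions)


# Scores come from the table above; the question count is an assumption.
GPQA_DIAMOND_QUESTIONS = 198
gpt_score, gemini_score = 0.924, 0.919

moe = margin_of_error(gpt_score, GPQA_DIAMOND_QUESTIONS)
gap = gpt_score - gemini_score

print(f"gap = {gap:.3f}, margin of error = +/-{moe:.3f}")
print("within noise" if gap < moe else "outside noise")
```

Under these assumptions the half-point gap corresponds to roughly one question, which is far smaller than the sampling noise of a benchmark this size.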

Are They "On Par"?

Yes, and Gemini 3 Pro may even be superior in "base" capability.

If "everything is equal" (meaning both models are restricted to the same amount of inference compute, or thinking time), the available evidence suggests they are highly comparable, with different strengths:

  • Gemini 3 Pro Advantages:
      • Base Intelligence: Appears to have stronger fundamental capability in long-context understanding (massive context window), theoretical reasoning, and creative tasks without needing excessive "thinking" time.
      • Cost Efficiency: For many tasks, it achieves similar results with less compute (and thus lower cost/latency).
  • GPT-5.2 Advantages:
      • Agentic Workflow: With the "Thinking" mode enabled (high compute), it excels at complex, multi-step agents and coding tasks (SWE-Bench). It is "tuned" effectively to use extra compute to solve harder problems.
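To make the cost side of this trade-off concrete, here is a rough sketch of how per-request cost grows with reasoning tokens. The $1.75 per 1M input tokens figure is the one cited from Simon Willison's write-up in the references. Everything else is an assumption for illustration: the output price is a placeholder, the token counts per effort level are invented, and the sketch assumes reasoning tokens are billed at the output rate.

```python
# Input price as cited in the references; output price is a placeholder assumption.
INPUT_PRICE_PER_M = 1.75    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 14.00  # USD per 1M output tokens (hypothetical)


def request_cost(input_tokens: int, reasoning_tokens: int, answer_tokens: int) -> float:
    """Estimated USD cost of one request.

    Assumes hidden reasoning tokens are billed like ordinary output tokens.
    """
    billed_output = reasoning_tokens + answer_tokens
    return (
        input_tokens * INPUT_PRICE_PER_M + billed_output * OUTPUT_PRICE_PER_M
    ) / 1_000_000


# Hypothetical reasoning-token budgets per effort level (not measured values).
effort_budgets = {"medium": 2_000, "high": 10_000, "xhigh": 60_000}

for effort, reasoning_tokens in effort_budgets.items():
    cost = request_cost(input_tokens=2_000, reasoning_tokens=reasoning_tokens, answer_tokens=500)
    print(f"{effort:>6}: ${cost:.4f} per request")
```

With the same prompt and the same visible answer, the bill is dominated by the reasoning budget, which is why a score obtained at a very high effort setting says little about what a typical user gets for a typical price.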

Conclusions

The claim that they are "on par" is accurate. If you strip away OpenAI's "xhigh" compute advantage used in benchmarks, Gemini 3 Pro is likely equal or slightly ahead in raw model intelligence. GPT-5.2's "superiority" in benchmarks largely comes from its ability to spend significantly more time and compute processing a single prompt.

The sources below cover the GPT-5.2 release, the Gemini 3 Pro comparison, and the associated benchmarking controversy.

References

1. Official Release Announcements

OpenAI – System Card Update

  • openai.com/index/gpt-5-system-card-update-gpt-5-2/

Google – The Gemini 3 Era

  • blog.google/products/gemini/gemini-3/

2. Benchmark Performance & Technical Analysis

R&D World – Comparative Analysis

  • Title: "How GPT-5.2 stacks up against Gemini 3.0 and Claude Opus 4.5"
  • Verified Details: Validates the 52.9% score on ARC-AGI-2 (Thinking mode) vs. Gemini 3 Pro's ~31.1%. Confirms GPT-5.2's lead in abstract reasoning is heavily tied to the "Thinking" process.
  • Source: rdworldonline.com/how-gpt-5-2-stacks-up

Vellum AI – Deep Dive

  • Title: "GPT-5.2 Benchmarks"
  • Verified Details: Verifies the 92.4% score on GPQA Diamond, noting that it is effectively tied with Gemini 3 Pro (91.9%) once the margin of error is considered, even though OpenAI marketed it as a "win".
  • Source: vellum.ai/blog/gpt-5-2-benchmarks

Simon Willison’s Weblog

  • Title: "GPT-5.2"
  • Verified Details: Technical breakdown of the API pricing ($1.75/1M input) and the distinction between the "Instant" and "Thinking" API endpoints.
  • Source: simonwillison.net/2025/Dec/11/gpt-52/

3. The "Cheating" & Compute Controversy

Reddit (r/LocalLLaMA & r/Singularity)

  • Threads: "GPT-5.2 Thinking evals" & "OpenAI drops GPT-5.2 'Code Red' vibes"
  • Verified Details: These community discussions are the primary source of the "cheating" allegations. Users identified that OpenAI's benchmarks used "xhigh" (extra high) reasoning effort—a setting that uses significantly more tokens and time than the "Medium" or "High" settings available to standard users or used in Gemini's standard benchmarks.
  • Source: reddit.com/r/singularity/comments/1pk4t5z/gpt52thinkingevals/
  • Source: reddit.com/r/ChatGPTCoding/comments/1pkq4mc/

InfoQ News

  • Title: "OpenAI's New GPT-5.1 Models are Faster and More Conversational" (Contextual coverage including 5.2)
  • Verified Details: Discusses the introduction of the "xhigh" reasoning effort level and the trade-offs between benchmark scores and actual user latency/cost.
  • Source: infoq.com/news/2025/12/openai-gpt-51/
