Yuvraj Khanna, Raghav Rastogi, Dhruv Kumar, Peter Relan, Hung Tran
In our earlier research, “Exploring the Limits of Reasoning vs. Recall in Large Language Models: A Case Study with the MATH dataset,” we uncovered a critical weakness in advanced language models. While these models showcase impressive abilities, we found that they can depend on memorization and pattern matching rather than true reasoning, particularly when solving high school-level mathematical problems. When given slight variations of problems they originally solve correctly, the models show a notable drop in accuracy, exposing a significant limitation in their adaptability.
This blog continues that exploration by examining the latest developments in the most recent reasoning-focused large language models, including DeepSeek v3, OpenAI’s o1 series, OpenAI’s GPT-4o series, and the Llama series. These latest frontier models address the limitations of their earlier counterparts through innovative approaches like chain-of-thought decomposition, multimodal reasoning, and systematic problem analysis. Despite these improvements, our analysis highlights persistent challenges, especially when these models are tested with out-of-distribution problem variations, where their reasoning capabilities are most strained.
DeepSeek v3, released in December 2024, represents a significant leap in open-source AI. Unlike earlier models, it employs a Mixture-of-Experts (MoE) architecture of 671 billion total parameters with 37 billion activated parameters, and was trained on an expansive dataset of 14.8 trillion tokens. Key innovations, such as Multi-head Latent Attention and an auxiliary-loss-free strategy for load balancing, contribute to its efficiency and adaptability.
In contrast to models like o1, which concentrates on step-by-step logical reasoning, and Gemini 2.0 Flash, which combines visual and textual data for multimodal problem-solving, DeepSeek v3’s architecture prioritizes scalability and raw processing power. With a context window of 128,000 tokens, DeepSeek v3 demonstrates a substantial capacity for understanding and reasoning through complex problems.
Advancements in Next-Generation AI Models
Modern reasoning-focused large language models claim substantial improvements in problem-solving capabilities. These include:
- Systematic planning, enabling the breakdown of complex tasks into logical subproblems or exploration of multiple paths to a solution.
- Multimodal understanding, which integrates visual and textual data for enhanced reasoning.
- Agentic abilities, such as execution of multiple steps and dynamically using tools to arrive at solutions.
While these innovations represent meaningful progress, our findings make it clear that all models tested still fail to perform reliably when exposed to variations in problem structure or presentation.
Methodology: Testing Reasoning Robustness
We apply the same methodology as in our previous two papers, working with the MATH test set. MATH spans a wide range of challenging high school topics (algebra, geometry, number theory, and more), making it a stringent test of reasoning skills.
We selected 90 problems from MATH, each involving both a mathematical variable and some surrounding context. We used a zero-shot approach: no worked examples, just a straightforward prompt to “solve step by step.” To push the models further, we created four types of variations of these original problems, ranging from small changes to variables to fully reworded questions, each aimed at challenging the models in a different way.
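To make the setup concrete, below is a minimal sketch of the zero-shot call; the client and model name are illustrative placeholders, not the exact harness we ran.

```python
# Minimal zero-shot sketch: one plain instruction per question, no worked
# examples. Client and model name are placeholders, not our exact harness.
from openai import OpenAI

client = OpenAI()

def solve(question: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Solve step by step:\n\n{question}"}],
    )
    return response.choices[0].message.content
```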
Problem Variations
- Variant 1: Change only the variable (e.g., from “x” to “y”).
- Variant 2: Change only the non-mathematical context (e.g., “4 dogs” to “4 books”), keeping variables and numbers intact.
- Variant 3: Reformulate the question’s language entirely but retain the same variable and numbers.
- Variant 4: Change both language and variable names, holding numbers and solutions constant.

These variations probe how stable a model’s performance remains when it must reason about a problem that “looks” different while representing the same underlying math.
We focus on Variation 4 only in this blog post, as it is the most challenging of all the variations. An example is below:
Example of Variation 4
Original Question | Variant 4 Question
---|---
Denali and Nate work for a dog walking business and are paid for each dog they walk. Denali is responsible for 16 dogs and Nate is responsible for 12 dogs. Under the company’s new policy, they will be assigned or unassigned new dogs in groups of $x$ dogs. The ratio of Denali’s pay to Nate’s pay would be the same if Denali started walking $4x$ more dogs and Nate stayed at 12 dogs or if $x$ of Nate’s dogs were reassigned to Denali. Find $x$ if $x \neq 0$. | Alice and Bob are avid book collectors. Alice currently has 16 rare first editions, while Bob has 12 signed copies. They have agreed to trade books in groups of $y$ books, where $y$ is a positive integer. The ratio of Alice’s first editions to Bob’s signed copies would remain the same if Alice acquired $4y$ more first editions and Bob kept his collection unchanged, or if Bob gave away $y$ of his signed copies to Alice. Find the value of $y$ if $y \neq 0$.

Original Solution | Variant 4 Solution
---|---
Rewriting the sentence “The ratio of Denali’s pay to Nate’s pay would be the same if Denali started walking 4x more dogs and Nate stayed at 12 dogs or if x of Nate’s dogs were reassigned to Denali” as an equation, we have: $$ \frac{16 + 4\colorbox{cyan}{$x$}}{12} = \frac{16 + \colorbox{cyan}{$x$}}{12 - \colorbox{cyan}{$x$}} $$ Clearing denominators, we get: $$\begin{aligned} (16 + 4\colorbox{cyan}{$x$})(12 - \colorbox{cyan}{$x$}) &= (16 + \colorbox{cyan}{$x$})(12) \\ 192 - 16\colorbox{cyan}{$x$} + 48\colorbox{cyan}{$x$} - 4\colorbox{cyan}{$x$}^2 &= 192 + 12\colorbox{cyan}{$x$} \\ 32\colorbox{cyan}{$x$} - 4\colorbox{cyan}{$x$}^2 &= 12\colorbox{cyan}{$x$} \\ 0 &= 4\colorbox{cyan}{$x$}^2 - 20\colorbox{cyan}{$x$} \\ 0 &= 4\colorbox{cyan}{$x$}(\colorbox{cyan}{$x$} - 5) \end{aligned}$$ Since $\colorbox{cyan}{$x$}$ cannot be $0$, we have $\colorbox{cyan}{$x$} = \boxed{5}$. | Rewriting the sentence “The ratio of Alice’s first editions to Bob’s signed copies would remain the same if Alice acquired 4y more first editions and Bob kept his collection unchanged, or if Bob gave away y of his signed copies to Alice” as an equation, we have: $$ \frac{16 + 4\colorbox{cyan}{$y$}}{12} = \frac{16 + \colorbox{cyan}{$y$}}{12 - \colorbox{cyan}{$y$}} $$ Clearing denominators, we get: $$\begin{aligned} (16 + 4\colorbox{cyan}{$y$})(12 - \colorbox{cyan}{$y$}) &= (16 + \colorbox{cyan}{$y$})(12) \\ 192 - 16\colorbox{cyan}{$y$} + 48\colorbox{cyan}{$y$} - 4\colorbox{cyan}{$y$}^2 &= 192 + 12\colorbox{cyan}{$y$} \\ 32\colorbox{cyan}{$y$} - 4\colorbox{cyan}{$y$}^2 &= 12\colorbox{cyan}{$y$} \\ 0 &= 4\colorbox{cyan}{$y$}^2 - 20\colorbox{cyan}{$y$} \\ 0 &= 4\colorbox{cyan}{$y$}(\colorbox{cyan}{$y$} - 5) \end{aligned}$$ Since $\colorbox{cyan}{$y$}$ cannot be $0$, we have $\colorbox{cyan}{$y$} = \boxed{5}$.
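Since the variant preserves all numbers, the underlying equation is identical; a quick sympy check (our own sanity check, not part of the grading pipeline) confirms the shared answer:

```python
from sympy import Eq, solve, symbols

y = symbols('y')
# "The ratio would remain the same": (16 + 4y)/12 = (16 + y)/(12 - y)
solutions = solve(Eq((16 + 4*y) / 12, (16 + y) / (12 - y)), y)
print(solutions)                         # [0, 5]
print([s for s in solutions if s != 0])  # [5], matching the boxed answer
```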
To assess reasoning robustness, we employed the following metrics:
- 5 runs: Each question was solved 5 times to account for stochasticity, for a total of 450 runs across the 90 problems.
- Accuracy: Calculated as the average correct solution rate over 5 runs.
- Mark: A binary measure per question, where Mark = 1 if the model solved it in more than half of the 5 runs and Mark = 0 otherwise.
This methodology allowed us to evaluate the adaptability of these models in handling problem variations. We use Mark-based scoring because it emphasizes inconsistency in correct responses across multiple attempts at a variation, revealing whether a solution reflects genuine reasoning rather than chance. By filtering out problems that a model solves only sporadically, Mark enables a targeted analysis of “truly understood” questions and detects where a model’s accuracy collapses under reworded or out-of-distribution variations. Crucially, Mark shows whether models are generalizing core principles through reasoning or relying on memorized pattern matching, highlighting the gap between shallow recall and deeper reasoning that our research seeks to bridge.
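As a concrete illustration, here is a minimal sketch of the per-question scoring (the function and variable names are ours, for illustration only):

```python
def score(runs: list[bool]) -> tuple[float, int]:
    """Score one question from its 5 run outcomes.

    accuracy: fraction of runs answered correctly.
    mark: 1 if the question was solved in a majority of runs, else 0.
    """
    accuracy = sum(runs) / len(runs)
    mark = 1 if accuracy > 0.5 else 0
    return accuracy, mark

# A question solved in 3 of 5 runs: accuracy 0.6, Mark = 1.
print(score([True, True, False, True, False]))  # (0.6, 1)
```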
Results and Observations
Part 1: Analysis of all 90 questions
Model name | Size | Original Accuracy (Average of 5) | Variation 4 Accuracy (Average of 5) | Drop (Average of 5) |
---|---|---|---|---|
DeepSeek v3 | 671B-A37B | 86.67% | 81.11% | -5.56% |
Llama 3.1 405B IT | 405B | 69.11% | 69.17% | +0.06% |
o1 | Unknown | 96.89% | 90.89% | -6.00% |
o1-preview | ~300B | 94.22% | 88.00% | -6.22% |
GPT-4o | ~200B | 70.44% | 66.67% | -3.77% |
o1-mini | ~100B | 93.33% | 87.33% | -6.00% |
Llama 3.2 90B IT | 90B | 62.22% | 56.67% | -5.55% |
Llama 3.3 70B IT | 70B | 70.06% | 66.59% | -3.47% |
Llama 3.1 70B IT | 70B | 62.67% | 58.67% | -4.00% |
Llama 3 70B IT | 70B | 42.89% | 40.22% | -2.67% |
Llama 3.2 11B IT | 11B | 40.22% | 40.22% | 0.00% |
GPT-4o-mini | ~8B | 73.56% | 70.44% | -3.12% |
Llama 3.1 8B IT | 8B | 43.56% | 40.00% | -3.56% |
Llama 3 8B IT | 8B | 25.11% | 25.33% | +0.22% |
We obtained the model sizes for the closed-source OpenAI models from this paper released by Microsoft. IT stands for instruction-tuned models.
Our analysis reveals the challenges models face with out-of-distribution problems. While some models maintain consistent accuracy, many exhibit significant drops when variations are introduced. Topics such as Intermediate Algebra and Precalculus emerge as particularly problematic, where small changes in phrasing or structure lead to sharp declines in accuracy. It is also worth noting that after exposure to Variation 4 (out-of-distribution data), the error rate roughly tripled for o1 (from 3.11% to 9.11%) and roughly doubled for o1-mini (from 6.67% to 12.67%).
Part 2: Analysis of questions with Original Mark = 1
We further examined the subset of the 90 problems that each model solved the majority of the time in their original form (Mark = 1). This analysis tests whether the models, having demonstrated an understanding of the original problem, could adapt to its variations.
This subset analysis confirms that even when models demonstrate an understanding of the original question, their accuracy drops significantly with reworded variations. Notably:
- Llama 3.2 90B Instruct experiences the largest drop (-17.55%), underscoring difficulties in adapting to even small linguistic variations.
- OpenAI o1-mini shows the smallest drop (-6.12%), highlighting its incremental improvement over other models.
- DeepSeek v3 shows a relatively small drop in accuracy (-6.25%), indicating better reasoning capabilities on the questions it solved correctly the majority of the time.
We find that the questions these models begin to get wrong after Variation 4 are precalculus, geometry, or intermediate algebra questions, ranging across Levels 2, 4, and 5 on the MATH difficulty scale. Upon further investigation, we noticed a common thread among the questions that DeepSeek v3 got wrong after Variation 4:
- The questions involved multi-step or higher-level mathematical reasoning:
  - Folding and creasing a rectangle to form intersecting triangles (geometric transformations)
  - Minimizing travel time with piecewise speeds (optimization and coordinate geometry)
  - Tracking probabilities through directed paths (graph-based probability/Markov processes)
  - Computing latitudes on a sphere (spherical geometry)
  - Finding the arc length of a parametric curve (calculus-based)
  - Determining integer parameters in a cubic polynomial (algebraic constraints)
- The questions were vulnerable to misinterpretation under rewording or context shifts:
  - The “story” or “context” (e.g., “mystical land,” “futuristic city,” “magical fairy”) conceals a multi-stage process requiring precise variable relationships. Small changes in how variables, names, or details are described can derail pattern-matching strategies and demand genuine conceptual understanding.
In short, the questions it got wrong push beyond straightforward pattern recall: they demand multi-step, concept-spanning solutions that are especially susceptible to errors when the prompt is rephrased, illustrating how these problems expose deeper gaps in a model’s reasoning.
Key Insights
Our research reveals a larger truth: current AI still struggles to reason consistently through variations in the statements of problems it was likely trained on. Many models excel on familiar tasks yet stumble the moment phrasing or context changes, suggesting a reliance on memorization over true understanding. Put simply, achieving robust reasoning and broad generalization remains a formidable but vital challenge.
- Out-of-Distribution Challenges Persist: Subtle variations in problem formulation can lead to steep declines in model performance, exposing vulnerabilities in reasoning capabilities.
- Generalization Remains Elusive: While models excel on familiar problem structures, they falter with changes, highlighting the need for solutions that go beyond surface-level pattern recognition.
One critical question drives our work: to what extent can these models grasp the underlying mathematical principles rather than merely recognizing patterns in problem presentation? Answering this question is not just pivotal for reasoning AI but foundational for the future of truly adaptive systems.
Future Directions
At MathGPT.ai, we are deeply committed to understanding the strengths and limitations of modern AI systems. In this experiment, we analyzed problems that models consistently solved correctly in their original form. However, when these problems were rephrased or slightly modified, even the most accurate models exhibited significant performance drops. This stark finding highlights a persistent challenge: current AI systems, despite their impressive capabilities, often rely on memorization rather than genuine reasoning. Addressing this gap is not just an industry need but a mission for our team. Our approach includes:
- Developing Robust Benchmarks: Creating novel deep-reasoning benchmarks that eliminate the risk of overlap with training data, ensuring models are evaluated based on their genuine reasoning abilities.
- Innovating Reasoning Strategies: Incorporating agentic workflows and dynamic tool usage to empower models with adaptable reasoning abilities. We also aim to investigate the reasoning capabilities of cutting-edge models such as Sky-T1 and rStar to further our understanding of how to push the boundaries of AI systems.
- Collaborating Across the Industry: Engaging with the broader research community to establish benchmarks and methodologies that measure reasoning depth and adaptability.
Conclusion
The future of AI depends on addressing these fundamental challenges. At MathGPT.ai, we recognize that robust reasoning capabilities are not just a technological aspiration but a necessity for systems that can adapt, learn, and solve real-world problems with reliability and precision. While significant progress has been made, we are committed to advancing AI beyond its current limitations. The path is long, but the reward is immense. At MathGPT.ai, we’re determined to bridge this gap and push the boundaries of what’s possible, setting the stage for truly adaptive and intelligent machines.
Appendix
Example of a problem where DeepSeek v3 answered the original question correctly but got the Variant 4 question wrong.
Original Question
Steve says to Jon, “I am thinking of a polynomial whose roots are all positive integers. The polynomial has the form $P(x) = 2x^3-2ax^2+(a^2-81)x-c$ for some positive integers $a$ and $c$. Can you tell me the values of $a$ and $c$?” After some calculations, Jon says, “There is more than one such polynomial.” Steve says, “You’re right. Here is the value of $a$.” He writes down a positive integer and asks, “Can you tell me the value of $c$?” Jon says, “There are still two possible values of $c$.” Find the sum of the two possible values of $c$.
Variant 4 Question
In the mystical land of Polynomia, the sorceress Algebraia is pondering a magical spell whose enchantment levels are all positive integers. The spell has the form $S(x) = 2x^3-2bx^2+(b^2-81)x-d$ for some positive integers $b$ and $d$. Algebraia seeks the wisdom of the ancient sage Geometricus to determine the values of $b$ and $d$. After some calculations, Geometricus says, “There is more than one such spell.” Algebraia says, “You’re right. Here is the value of $b$.” She reveals a positive integer and asks, “Can you tell me the value of $d$?”
Geometricus says, “There are still two possible values of $d$.”
Find the sum of the two possible values of $d$.
Incorrect Solution
Let $p, q, r$ be the three positive integer enchantment levels of $S(x)$. Then by Vieta’s formulas,
$$\begin{aligned} p+q+r &= b, \\ pq+qr+pr &= \frac{b^2-81}{2}, \\ pqr &= \frac{d}{2}. \end{aligned}$$
Substituting the first equation into the second to eliminate $b,$ we have
$pq+qr+pr = \frac{(p+q+r)^2 - 81}{2} = \frac{(p^2+q^2+r^2) + 2(pq+qr+pr) - 81}{2}.$
This simplifies to $p^2 + q^2 + r^2 = 81.$
Therefore, each of $p, q, r$ lies in the set $\{1, 2, \ldots, 9\}.$ Assuming without loss of generality that $p \le q \le r,$ we have $81 = p^2+q^2+r^2 \le 3r^2,$ so $r^2 \ge 27,$ and $r \ge 6.$ We take cases:
If $r = 6,$ then $p^2+q^2 = 81 - 6^2 = 45;$ the only solution where $p \le q \le 6$ is $(p, q) = (3, 6).$
If $r = 7,$ then $p^2+q^2 = 81-7^2 = 32;$ the only solution where $p \le q \le 7$ is $(p, q) = (4, 4).$
If $r = 8,$ then $p^2+q^2 = 81-8^2 = 17;$ the only solution where $p \le q \le 8$ is $(p, q) = (1, 4).$
Therefore, the possible sets of enchantment levels of such a spell are $(3, 6, 6), (4, 4, 7),$ and $(1, 4, 8).$ Calculating $b = p+q+r$ and $d=2pqr$ for each set, we have $(b, d) = (15, 216), (15, 224), (13, 64).$
Given the value of $b,$ there are still two possible values of $d,$ so it must be that $b = 15$: two of the pairs $(b, d)$ have $b = 15,$ but only one has $b = 13.$ The sum of the two possible values of $d$ is
$$216 + 224 = \boxed{440}.$$
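For reference, the derivation above can be checked by brute force. The short script below (our own verification sketch, not model output) enumerates the positive-integer triples with $p^2+q^2+r^2 = 81$ and reproduces the $(b, d)$ pairs and the sum that the model computed:

```python
from itertools import combinations_with_replacement

# Eliminating b from Vieta's formulas gives p^2 + q^2 + r^2 = 81,
# with b = p + q + r and d = 2*p*q*r (leading coefficient is 2).
pairs = []
for p, q, r in combinations_with_replacement(range(1, 10), 3):
    if p**2 + q**2 + r**2 == 81:
        pairs.append((p + q + r, 2 * p * q * r))

print(pairs)  # [(13, 64), (15, 216), (15, 224)]
# Only b = 15 leaves two possible values of d:
print(sum(d for b, d in pairs if b == 15))  # 216 + 224 = 440
```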