Yuvraj Khanna, Raghav Rastogi, Dhruv Kumar, Peter Relan, Hung Tran

This blog post serves as a follow-up, focusing on the new generation of reasoning-centric large language models released last week: OpenAI’s o1 and Google’s Gemini 2.0 Flash. Both models are presented as substantial advancements, moving beyond simple pattern matching toward more robust, self-verifying reasoning. They employ a chain-of-thought approach to decompose complex questions into logical steps—mimicking human problem-solving—and are claimed to excel at multi-step reasoning tasks like advanced calculus, algebra, and combinatorics.

Advancements in Next-Generation AI Models

According to their developers, these models have been refined to follow logical sequences rather than rely on memorized solutions. o1 is trained with Reinforcement Learning from Human Feedback (RLHF) and uses inference strategies such as test-time compute methods, which have been shown to improve reasoning capabilities. It systematically analyzes problem structures (Systematic Analysis), reuses known methods for classic problem transformations (Method Reuse), breaks complex tasks into simpler subproblems (Divide and Conquer), and continuously refines its reasoning process during inference (Self-Refinement). These processes aim to preserve accuracy by improving reasoning.

On the other hand, Gemini 2.0 Flash claims multimodal capabilities that include visual or diagrammatic reasoning. Additionally, Gemini 2.0 incorporates agentic capabilities, enabling it to plan and reason through multiple steps. Together, its multimodal reasoning, extended context comprehension, complex instruction handling and planning, compositional function-calling, native tool integration, and faster response times are expected to improve model reasoning.

Continuing Our Methodology

We apply the same methodology as in our previous paper, working with the MATH test set. Unlike the simpler GSM8K test set used by Apple’s research, MATH offers a wide range of challenging high school topics (algebra, geometry, number theory, and more), making it a stringent test of reasoning skills.

For a quick experiment, we selected 90 problems from MATH, each involving both a mathematical variable and some context. We used a zero-shot approach: no worked examples, just a straightforward prompt to “solve step by step.” To push the models further, we created four types of variations of these original problems, ranging from small changes to variables to fully reworded questions, all aimed at challenging the models in different ways.

Problem Variations

  1. Variation 1: Change only the variable (e.g., from “x” to “y”).
  2. Variation 2: Change only the non-mathematical context (e.g., “4 dogs” to “4 books”), keeping variables and numbers intact.
  3. Variation 3: Reformulate the question’s language entirely but retain the same variable and numbers.
  4. Variation 4: Change both language and variable names, holding numbers and solutions constant.

These variations probe how stable a model’s performance remains when it must reason about a problem that “looks” different while representing the same underlying math.
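As a rough illustration of the simplest case, Variation 1, the sketch below renames a variable inside `$...$` math spans only, leaving the surrounding prose untouched. The helper name `rename_variable` and the shortened sample sentence are hypothetical and are not the tooling used to build the dataset; the rewording in Variations 3 and 4 cannot be reduced to a substitution like this.

```python
import re

# Matches inline math spans such as $4x$ (assumes single-dollar delimiters).
MATH_SPAN = re.compile(r"\$(.+?)\$")


def rename_variable(problem: str, old: str = "x", new: str = "y") -> str:
    """Rename a single-letter variable inside $...$ spans only (hypothetical helper)."""
    # Skip the letter when it is part of a longer word or a LaTeX command (e.g. \max, \xi).
    symbol = re.compile(rf"(?<![A-Za-z\\]){re.escape(old)}(?![A-Za-z])")

    def fix_span(match: re.Match) -> str:
        return "$" + symbol.sub(new, match.group(1)) + "$"

    return MATH_SPAN.sub(fix_span, problem)


# Shortened sample text, for illustration only.
original = (
    "They will be assigned new dogs in groups of $x$ dogs. "
    "Denali started walking $4x$ more dogs. Find $x$ if $x \\neq 0$."
)
variant = rename_variable(original)
print(variant)
```

Restricting the substitution to math spans keeps ordinary words that happen to contain the letter (for example "six") unchanged.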


We focus on Variation 4 only in this blog post, as it is the most challenging of all the variations. The language elements we vary are the names, the context, and the variables used in the question. Here is an example of Variation 4:

Variation 4 Question

Original Question

Denali and Nate work for a dog walking business and are paid for each dog they walk. Denali is responsible for 16 dogs and Nate is responsible for 12 dogs. Under the company’s new policy, they will be assigned or unassigned new dogs in groups of $x$ dogs. The ratio of Denali’s pay to Nate’s pay would be the same if Denali started walking $4x$ more dogs and Nate stayed at 12 dogs or if $x$ of Nate’s dogs were reassigned to Denali. Find $x$ if $x \neq 0$.

Variation 4 Question

Alice and Bob are avid book collectors. Alice currently has 16 rare first editions, while Bob has 12 signed copies. They have agreed to trade books in groups of $y$ books, where $y$ is a positive integer. The ratio of Alice’s first editions to Bob’s signed copies would remain the same if Alice acquired $4y$ more first editions and Bob kept his collection unchanged, or if Bob gave away $y$ of his signed copies to Alice. Find the value of $y$ if $y \neq 0$.

Variation 4 Solution

Original Solution

Rewriting the sentence “The ratio of Denali’s pay to Nate’s pay would be the same if Denali started walking $4x$ more dogs and Nate stayed at 12 dogs or if $x$ of Nate’s dogs were reassigned to Denali” as an equation, we have:

$$\frac{16 + 4\colorbox{cyan}{$x$}}{12} = \frac{16 + \colorbox{cyan}{$x$}}{12 - \colorbox{cyan}{$x$}}$$

Clearing denominators, we get:

$$ \begin{align}
(16 + 4\colorbox{cyan}{$x$})(12 - \colorbox{cyan}{$x$}) &= (16 + \colorbox{cyan}{$x$})(12) \quad \Rightarrow \\
192 - 16\colorbox{cyan}{$x$} + 48\colorbox{cyan}{$x$} - 4\colorbox{cyan}{$x$}^2 &= 192 + 12\colorbox{cyan}{$x$} \quad \Rightarrow \\
32\colorbox{cyan}{$x$} - 4\colorbox{cyan}{$x$}^2 &= 12\colorbox{cyan}{$x$} \quad \Rightarrow \\
0 &= 4\colorbox{cyan}{$x$}^2 - 20\colorbox{cyan}{$x$} \quad \Rightarrow \\
0 &= 4\colorbox{cyan}{$x$}(\colorbox{cyan}{$x$} - 5)
\end{align} $$

Since $\colorbox{cyan}{$x$}$ cannot be $0$, we have $\colorbox{cyan}{$x$}=\boxed{5}$.

Variation 4 Solution

Rewriting the sentence “The ratio of Alice’s first editions to Bob’s signed copies would remain the same if Alice acquired $4y$ more first editions and Bob kept his collection unchanged, or if Bob gave away $y$ of his signed copies to Alice” as an equation, we have:

$$\frac{16 + 4\colorbox{cyan}{$y$}}{12} = \frac{16 + \colorbox{cyan}{$y$}}{12 - \colorbox{cyan}{$y$}}$$

Clearing denominators, we get:

$$ \begin{align}
(16 + 4\colorbox{cyan}{$y$})(12 - \colorbox{cyan}{$y$}) &= (16 + \colorbox{cyan}{$y$})(12) \quad \Rightarrow \\
192 - 16\colorbox{cyan}{$y$} + 48\colorbox{cyan}{$y$} - 4\colorbox{cyan}{$y$}^2 &= 192 + 12\colorbox{cyan}{$y$} \quad \Rightarrow \\
32\colorbox{cyan}{$y$} - 4\colorbox{cyan}{$y$}^2 &= 12\colorbox{cyan}{$y$} \quad \Rightarrow \\
0 &= 4\colorbox{cyan}{$y$}^2 - 20\colorbox{cyan}{$y$} \quad \Rightarrow \\
0 &= 4\colorbox{cyan}{$y$}(\colorbox{cyan}{$y$} - 5)
\end{align} $$

Since $\colorbox{cyan}{$y$}$ cannot be $0$, we have $\colorbox{cyan}{$y$}=\boxed{5}$.
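The answer can be sanity-checked numerically. The sketch below is only a verification aid, with function names chosen for illustration: it evaluates both ratios with exact fractions for every positive group size that leaves the second person with at least one item. Because the original and Variation 4 questions share the same numbers, one check covers both.

```python
from fractions import Fraction


def ratio_after_extra(n: int) -> Fraction:
    """Ratio when the first person gains 4n items and the second stays at 12."""
    return Fraction(16 + 4 * n, 12)


def ratio_after_transfer(n: int) -> Fraction:
    """Ratio when n items move from the second person to the first."""
    return Fraction(16 + n, 12 - n)


# Group sizes 1..11 keep the denominator 12 - n positive.
solutions = [n for n in range(1, 12) if ratio_after_extra(n) == ratio_after_transfer(n)]
print(solutions)  # [5]
```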

Preliminary Results

While final results will be presented in a forthcoming release, our initial tests show a familiar pattern. We first validated the accuracy of these models on the original MATH problems and found it to match the reported figures.

| Model | MATH | Variation 4 (pass@1) | Drop (pass@1) | Variation 4 (Average of 5) | Drop (Average of 5) |
| --- | --- | --- | --- | --- | --- |
| OpenAI o1 | 94.8% | 86.5% | -8.3% | 89.3% | -5.5% |
| Google Gemini 2.0 Flash | 89.7% | 76.8% | -12.9% | 80.0% | -9.7% |

Both models experience a noticeable drop in accuracy in the experiment. Under pass@1 (a metric that measures how often the first produced solution is correct), o1’s accuracy falls by 8.3 percentage points to 86.5%, and Gemini 2.0’s by 12.9 percentage points to 76.8%, a significant drop from their original performance. This suggests that subtle changes in problem wording or context disrupt their reasoning, indicating a continued reliance on pattern familiarity rather than genuine logical processing. Allowing multiple attempts (Average of 5) helps somewhat: o1 rebounds to 89.3%, and Gemini 2.0 to 80.0%, but neither model fully returns to the level it displayed on the original MATH problems.
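For readers who want to reproduce the arithmetic, here is a minimal sketch of the two metrics and of the drop columns. The accuracy figures are the ones reported in the table; the function names and the toy list of outcomes are hypothetical, and treating "Average of 5" as the mean accuracy over five independent runs is our reading of the metric rather than a formal definition.

```python
def pass_at_1(first_attempt_correct):
    """Percentage of problems whose first produced solution is correct."""
    return 100 * sum(first_attempt_correct) / len(first_attempt_correct)


def average_of_runs(run_accuracies):
    """Mean accuracy over several runs (assumed meaning of 'Average of 5')."""
    return sum(run_accuracies) / len(run_accuracies)


def drop(baseline, score):
    """Change in accuracy, in percentage points, relative to the baseline."""
    return round(score - baseline, 1)


# Accuracies (in %) as reported in the table above.
reported = {
    "OpenAI o1": {"math": 94.8, "var4_pass1": 86.5, "var4_avg5": 89.3},
    "Google Gemini 2.0 Flash": {"math": 89.7, "var4_pass1": 76.8, "var4_avg5": 80.0},
}

drops = {
    model: {
        "pass@1": drop(scores["math"], scores["var4_pass1"]),
        "avg5": drop(scores["math"], scores["var4_avg5"]),
    }
    for model, scores in reported.items()
}
print(drops)

# Toy outcomes for four problems, for illustration only.
toy_pass1 = pass_at_1([True, True, False, True])
```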

In earlier scenarios, even slight hints or similar examples could boost performance substantially. Now, both o1 and Gemini 2.0, despite their advanced training methods, show the same vulnerability seen in older models when faced with reworded or recontextualized questions.

LLMs exhibit a surprising rigidity in their “reasoning” capabilities, struggling to generalize and adapt when confronted with even slight deviations from familiar problem formulations. Our research provides a critical lens into the current state of machine reasoning, highlighting the gap between computational pattern matching and true mathematical intelligence. Although these new LLMs represent progress, especially in handling straightforward problems, they still lag when it comes to adapting their reasoning reliably.

Future work

We are actively developing contamination-resistant datasets that minimize training overlaps, to ensure that performance improvements result from genuine reasoning abilities rather than simple memorization. Additionally, we are exploring the integration of agentic workflows, where the model autonomously decides to query tools or perform calculations, to further distinguish genuine reasoning. While we believe that well-structured prompting and dynamic tool usage can bolster robustness against problem variations, we also believe that LLM architectures, as currently deployed in the latest models, will not be able to reach acceptable levels of mathematical reasoning (accuracy in the high 90s) that approximate human performance, such as that of a math tutor. Finally, we look forward to collaborating with other researchers to establish clearer benchmarks that emphasize reasoning depth over surface-level pattern recognition, enabling more accurate measurement of model progress.

The overarching goal remains the same: to guide LLM development toward systems that do not just reproduce patterns but truly understand and reason about problems. With each iteration, including this step forward with o1 and Gemini 2.0 Flash, we learn more about the barriers to achieving human-like mathematical reasoning and how to break them down.

Conclusion

Even with simple language variations, Gemini and o1 do not withstand the test of reasoning when solving MATH test set problems. We stand at a pivotal moment. As we push beyond pattern recognition, the field is shifting toward a new era of AI that can reason mathematically, navigate unfamiliar problem forms, and leverage diverse tools to break free of its training-data shackles. If these breakthroughs occur, they could redefine not only mathematical problem-solving but the very foundation of AI reasoning itself—ushering in an era where LLMs evolve from linguistic mirrors of their training sets into genuinely insightful, adaptive problem solvers.

Appendix

Example of a wrong solution

Original Question

Steve says to Jon, “I am thinking of a polynomial whose roots are all positive integers. The polynomial has the form $P(x) = 2x^3-2ax^2+(a^2-81)x-c$ for some positive integers $a$ and $c$. Can you tell me the values of $a$ and $c$?”

After some calculations, Jon says, “There is more than one such polynomial.”

Steve says, “You’re right. Here is the value of $a$.” He writes down a positive integer and asks, “Can you tell me the value of $c$?”

Jon says, “There are still two possible values of $c$.”

Find the sum of the two possible values of $c$.

Original Solution

Let $r, s, t$ be the three positive integer roots of $P(x).$ Then by Vieta’s formulas,

$$\begin{align} r+s+t &= a, \\ rs+st+rt &= \frac{a^2-81}{2}, \\ rst &= \frac{c}{2}. \end{align}$$

Substituting the first equation into the second to eliminate $a$, we have

$$rs+st+rt = \frac{(r+s+t)^2 - 81}{2} = \frac{(r^2+s^2+t^2) + 2(rs+st+rt) - 81}{2}.$$

This simplifies to

$$r^2 + s^2 + t^2 = 81.$$

Therefore, each of $r, s, t$ lies in the set $\{1, 2, \ldots, 9\}$. Assuming without loss of generality that $r \le s \le t,$ we have $81=r^2+s^2+t^2 \le 3t^2,$ so $t^2 \ge 27$, and $t \ge 6$. We take cases:

Case 1: If $t = 6,$ then $r^2+s^2 = 81 - 6^2 = 45;$ the only solution where $r \le s \le 6$ is $(r, s) = (3, 6)$.

Case 2: If $t = 7,$ then $r^2+s^2 = 81-7^2 = 32;$ the only solution where $r \le s \le 7$ is $(r, s) = (4, 4)$.

Case 3: If $t = 8,$ then $r^2+s^2 = 81-8^2 = 17;$ the only solution where $r \le s \le 8$ is $(r, s) = (1, 4)$.

Therefore, the possible sets of roots of such a polynomial are $(3, 6, 6), (4, 4, 7),$ and $(1, 4, 8)$. Calculating $a = r+s+t$ and $c=2rst$ for each set, we have $(a, c) = (15, 216), (15, 224), (13, 64)$.

Since, given the value of $a$, there are still two possible values of $c$, it must be that $a = 15$, since two of the pairs $(a, c)$ have $a = 15$, but only one has $a = 13$. Then the sum of the two possible values of $c$ is $\colorbox{green}{$\textcolor{white}{216 + 224 = \boxed{440}}$}$.
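The case analysis above can be cross-checked by brute force. The following sketch, offered as a verification aid rather than part of the original solution, enumerates all sorted triples of positive integers up to 9 whose squares sum to 81, computes $a = r+s+t$ and $c = 2rst$ for each, and keeps the value of $a$ that leaves exactly two candidates for $c$.

```python
from collections import defaultdict
from itertools import combinations_with_replacement

# Sorted triples r <= s <= t with r^2 + s^2 + t^2 = 81.
triples = [
    roots
    for roots in combinations_with_replacement(range(1, 10), 3)
    if sum(v * v for v in roots) == 81
]

# Group the possible values of c = 2rst by a = r + s + t.
c_values_by_a = defaultdict(list)
for r, s, t in triples:
    c_values_by_a[r + s + t].append(2 * r * s * t)

# Jon still sees two possible values of c, so a must map to exactly two of them.
ambiguous = {a: cs for a, cs in c_values_by_a.items() if len(cs) == 2}
answer = sum(next(iter(ambiguous.values())))

print(triples)    # [(1, 4, 8), (3, 6, 6), (4, 4, 7)]
print(ambiguous)  # {15: [216, 224]}
print(answer)     # 440
```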

Google Gemini 2.0 Flash Incorrect response

Let $S(x)=2x^3-2bx^2+(b^2-81)x-d$. Since all enchantment levels are positive integers, we must have $S(x)>0$ for all positive integers $x$. We are given that there is more than one such spell. This means that there is more than one possible value for $d$.

Let’s consider the case when $x=1$. Then $S(1)=2-2b+b^2-81-d=b^2-2b-79-d>0$. When $x=2$, $S(2)=16-8b+2(b^2-81)-d=2b^2-8b-146-d>0$. When $x=3$, $S(3)=54-18b+3(b^2-81)-d=3b^2-18b-189-d>0$.

We are told that given a specific value of $b$, there are two possible values of $d$. Let’s analyze the case when $S(x)$ has three distinct positive integer roots. If $S(x)$ has roots $r_1, r_2, r_3$, then $S(x)=2(x-r_1)(x-r_2)(x-r_3)$. Expanding this, we get $S(x)=2(x^3-(r_1+r_2+r_3)x^2+(r_1r_2+r_1r_3+r_2r_3)x-r_1r_2r_3)$. Comparing this with $S(x)=2x^3-2bx^2+(b^2-81)x-d$, we have: $b=r_1+r_2+r_3$, $b^2-81=r_1r_2+r_1r_3+r_2r_3$, $d=2r_1r_2r_3$.

Suppose $r_1=1, r_2=4, r_3=6$. Then $$\begin{align} b=1+4+6=11. \\ b^2-81=121-81=40. \\ r_1r_2+r_1r_3+r_2r_3=4+6+24=34.\end{align}$$ This doesn’t work.

Consider the case where $S(x)$ has roots $3,9,12$. $$\begin{align} b=3+9+12=24. \\ b^2-81=576-81=495. \\ 3(9)+3(12)+9(12)=27+36+108=171.\end{align}$$ This doesn’t work.

Let the roots be $r_1, r_2, r_3$. Then $$\begin{align}b=r_1+r_2+r_3. \\ b^2-81=r_1r_2+r_1r_3+r_2r_3.\end{align}$$ We have $$\begin{align}(r_1+r_2+r_3)^2=r_1^2+r_2^2+r_3^2+2(r_1r_2+r_1r_3+r_2r_3).\\ b^2=r_1^2+r_2^2+r_3^2+2(b^2-81). \\ b^2=162-(r_1^2+r_2^2+r_3^2).\end{align}$$

If $b=10$, $100=162-(r_1^2+r_2^2+r_3^2)$, so $r_1^2+r_2^2+r_3^2=62$. If $r_1=1$, $r_2=5$, $r_3=6$, $b=12$, $144=162-(1+25+36)=162-62=100.$ $b=12.$ $b^2-81=144-81=63.$ $1(5)+1(6)+5(6)=5+6+30=41.$

If $b=10$, $r_1=1$, $r_2=4$, $r_3=5$. $b=10.$ $100-81=19.$ $4+5+20=29.$

If $b=17$, $289-81=208$. If $r_1=4$, $r_2=8$, $r_3=5$, $b=17$, $32+20+40=92.$

If $b=10$, $r_1=1$, $r_2=2$, $r_3=7$. $1+4+49=54.$ $2+7+14=23.$ If $r_1=1$, $r_2=3$, $r_3=6$, $1+9+36=46.$ $3+6+18=27.$ If $r_1=1$, $r_2=4$, $r_3=5$, $1+16+25=42.$ $4+5+20=29.$

If $b=10$, $d=2(1)(2)(7)=28$ or $d=2(1)(3)(6)=36$ or $d=2(1)(4)(5)=40.$ If $b=10$, $d=28$ and $d=40.$ $28+40=68.$
Final Answer: The final answer is 68
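One way to see where this response goes astray: when it compares coefficients, it appears to drop the factor of 2 on the linear term, since expanding $2(x-r_1)(x-r_2)(x-r_3)$ gives $2(r_1r_2+r_1r_3+r_2r_3)$ as the coefficient of $x$. The sketch below is our own check, not part of the model output, and its function names are illustrative. It tests the three root triples the response finally relies on for $b=10$ against the correct conditions and finds that none of them qualifies, while the root triples from the reference solution do.

```python
def linear_coefficient(roots):
    """Coefficient of x in 2(x - r)(x - s)(x - t)."""
    r, s, t = roots
    return 2 * (r * s + s * t + r * t)


def is_valid_root_triple(roots, b):
    """True if the roots match both the x^2 and x coefficients of the polynomial."""
    return sum(roots) == b and linear_coefficient(roots) == b * b - 81


# Triples used in the response's final step, all with b = 10.
candidates = [(1, 2, 7), (1, 3, 6), (1, 4, 5)]
valid = [roots for roots in candidates if is_valid_root_triple(roots, b=10)]
print(valid)  # []

# Root triples from the reference solution, both with coefficient value 15.
reference_ok = all(is_valid_root_triple(roots, b=15) for roots in [(3, 6, 6), (4, 4, 7)])
print(reference_ok)  # True
```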