Yuvraj Khanna, Raghav Rastogi, Dhruv Kumar, Peter Relan, Hung Tran
Introduction
This is the 4th blog in our series examining recall vs. mathematical reasoning in LLMs. Building on our previous blog, we compare the performance of OpenAI's o1 model with DeepSeek R1 and six models distilled from R1, evaluating their reasoning robustness, cost efficiency, and overall performance.
Methodology: Testing Reasoning Robustness
Our analysis follows the methodology outlined in our prior research, “Exploring the Limits of Reasoning vs. Recall in Large Language Models: A Case Study with the MATH dataset.” This methodology focuses on testing whether models rely on memorization or demonstrate genuine reasoning capabilities.
We evaluated models using 90 problems from the MATH dataset, covering topics such as algebra, geometry, and number theory. Each problem was tested under a zero-shot approach, where no worked examples were provided—just a direct prompt to “solve step by step.”
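For concreteness, here is a minimal sketch of what that zero-shot setup looks like in code; the exact prompt wording and the query_model helper are illustrative assumptions rather than our actual evaluation harness.

```python
# Minimal sketch of the zero-shot setup (illustrative; not our exact harness).

def build_zero_shot_prompt(problem_text: str) -> str:
    """Wrap a MATH problem in a direct 'solve step by step' instruction,
    with no worked examples provided (zero-shot)."""
    return (
        "Solve the following problem step by step. "
        "Show your reasoning and state the final answer.\n\n"
        f"Problem: {problem_text}"
    )

def query_model(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation
    (o1, DeepSeek R1, or one of the distilled models)."""
    raise NotImplementedError

prompt = build_zero_shot_prompt("If 3x + 5 = 20, what is the value of x?")
# answer = query_model(prompt)
```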
To further challenge model robustness, we introduced four levels of problem variations:
- Variant 1: Change only the mathematical variable (e.g., “x” to “y”).
- Variant 2: Change only the non-mathematical context (e.g., replacing “dogs” with “books”).
- Variant 3: Reformulate the question’s language while maintaining numerical values.
- Variant 4: Modify both language and variable names while keeping the solution unchanged.
Since Variation 4 represents the most challenging scenario, we focus on its impact in this study.
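To make the four variation levels concrete, here is a toy sketch of variant generation; the string substitutions below are deliberately simplistic stand-ins for the curated rewrites used in the actual study.

```python
# Toy sketch of the four variation levels (simplified stand-ins for the
# curated rewrites used in the study).

ORIGINAL = "A farmer has 3 dogs. Each dog eats x pounds of food per day. ..."

def variant_1(problem: str) -> str:
    """Variant 1: change only the mathematical variable (x -> y)."""
    return problem.replace("x", "y")

def variant_2(problem: str) -> str:
    """Variant 2: change only the non-mathematical context (dogs -> books)."""
    return problem.replace("dogs", "books").replace("dog", "book")

def variant_3(problem: str) -> str:
    """Variant 3: a hand-written rephrasing that keeps all numerical values."""
    return "Suppose a farmer owns 3 dogs, each of which consumes x pounds of food daily. ..."

def variant_4(problem: str) -> str:
    """Variant 4: change both the language and the variable names,
    leaving the underlying solution unchanged."""
    return variant_1(variant_3(problem))

print(variant_2(ORIGINAL))  # context change only
print(variant_4(ORIGINAL))  # rephrased wording plus renamed variable
```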
Cost Comparison
| Model | Input Cost (per Million Tokens) | Output Cost (per Million Tokens) |
|---|---|---|
| OpenAI o1 | $15.00 | $60.00 |
| DeepSeek R1 | $0.14 | $0.28 |
DeepSeek R1 offers a significant cost advantage over o1: at the list prices above, its per-token rates are roughly 100x lower for input and 200x lower for output. R1's highly detailed reasoning process generates 2-3x more output tokens per problem, but even after adjusting for this it retains a cost advantage of well over an order of magnitude. Although these token-adjusted estimates are preliminary, R1's superior cost efficiency makes it particularly attractive for large-scale applications, especially when handling similar problem types where token usage patterns remain consistent. This cost-effectiveness, combined with its detailed reasoning capabilities, positions R1 as an economically viable option for widespread deployment.
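As a quick back-of-the-envelope check of these ratios (the per-problem token counts and the 2.5x output multiplier for R1 are illustrative assumptions, not measured values from our runs):

```python
# Back-of-the-envelope cost check using the list prices in the table above.
# The per-problem token counts and R1's 2.5x output multiplier are
# illustrative assumptions, not measured values from our evaluation runs.

O1_IN, O1_OUT = 15.00, 60.00   # $ per million tokens
R1_IN, R1_OUT = 0.14, 0.28     # $ per million tokens

def cost_per_problem(in_tokens, out_tokens, in_rate, out_rate):
    """Dollar cost of one problem given token counts and per-million-token rates."""
    return (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000

o1_cost = cost_per_problem(500, 2_000, O1_IN, O1_OUT)  # o1 answer
r1_cost = cost_per_problem(500, 5_000, R1_IN, R1_OUT)  # R1 answer, ~2.5x longer

print(f"o1 per problem: ${o1_cost:.4f}")
print(f"R1 per problem: ${r1_cost:.4f}")
print(f"token-adjusted advantage: {o1_cost / r1_cost:.0f}x")
```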
Results Comparison
Accuracy Drop from Original to Variation 4
Accuracy Drop = Variation 4 Accuracy – Original Accuracy
| Model | Original Accuracy | Variation 4 Accuracy | Accuracy Drop |
|---|---|---|---|
| OpenAI o1 | 96.89% | 90.89% | -6.00% |
| DeepSeek R1 | 92.22% | 88.89% | -3.33% |
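The drop column follows directly from the formula above; a few lines of code reproduce it from the table values:

```python
# Reproduce the accuracy-drop column from the definition:
# Accuracy Drop = Variation 4 Accuracy - Original Accuracy
results = {
    "OpenAI o1":   (96.89, 90.89),
    "DeepSeek R1": (92.22, 88.89),
}

for model, (original, variation_4) in results.items():
    drop = variation_4 - original
    print(f"{model}: {drop:+.2f}%")  # OpenAI o1: -6.00%, DeepSeek R1: -3.33%
```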
Key Findings
Comparing DeepSeek R1 to OpenAI o1:
- DeepSeek R1 demonstrates significant improvements in handling problem variations compared to o1, showing a smaller accuracy drop and better adaptability.
- o1’s higher original accuracy does not translate to stability under variation, as it suffers the larger drop, implying some reliance on memorized patterns.
The sharp drop in accuracy on problem variations reveals something crucial: high performance on the original MATH dataset likely comes in part from pattern matching, not just reasoning. The o1 results make this especially clear: if these models were truly reasoning, they should handle slight variations just as well as the originals.
Direct Comparison: DeepSeek R1 vs o1
Questions Models Get Wrong
- DeepSeek R1: More resilient to problem rewording under Variation 4, but still struggles mainly with multi-step geometric transformations.
- OpenAI o1: While highly accurate on original MATH questions, o1’s performance declines most in topics such as Intermediate Algebra, Precalculus, and Probability; small changes in phrasing or structure lead to a sharper drop in accuracy.
Improvements Noted in R1
- More Robust Reasoning: Compared to o1, R1 shows superior adaptability to changes in context and phrasing.
- Potential for Focused Training: This raises the question of whether R1, if trained directly on multi-step geometric transformations, might improve significantly with focused training data. In contrast, o1 fails consistently across a broader set of math subjects despite its broader pre-training base, indicating that general mathematical training and the associated RL alone may not be sufficient for mastering mathematical reasoning.
Critical Error Types in both models
- Spatial Relationship Tracking Errors: In reworded geometry problems—such as those involving the folding of a magical rectangular amulet—models often misidentify or misplace key points. For example, when the labels or orientations of edges and vertices are altered, the model may incorrectly locate crease intersections or confuse which segments correspond to the amulet’s width or height. This results in inaccurate calculations of areas or distances.
- Solution Path Complexity Errors: In multi-step problems that require a carefully ordered sequence of operations (like subtracting triangle areas to find a quadrilateral’s area or computing the volume of a pyramid), reworded versions can disrupt the logical flow. The model may skip critical intermediate steps or rearrange the solution path improperly. For instance, even if it correctly determines individual triangle areas in a folding problem, it might misapply these values when combining them, leading to a complete breakdown of the intended solution.
- Computational Efficiency and Reasoning: Models not only required significantly more computation time but also exhibited a form of over-analysis. Chain-of-thought evaluations uncovered a critical insight: as models explored multiple solution approaches, they often became trapped in cycles of second-guessing, which ultimately reduced accuracy despite the increased reasoning effort. Small initial deviations snowballed into cascading errors, severely impacting overall accuracy.

Distilled versions of DeepSeek R1, even at 7B parameters, show extremely promising numbers
Accuracy Drop = Variation 4 Accuracy – Original Accuracy
| Model name* | MATH-500 Accuracy** | Original 90 Accuracy | Variation 4 Accuracy | Accuracy Drop |
|---|---|---|---|---|
| Qwen-1.5B | 83.9% | 34.81% | 42.59% | +7.78% |
| Qwen-7B | 92.8% | 90.37% | 84.44% | -5.93% |
| Llama-8B | 89.1% | 87.78% | 80.00% | -7.78% |
| Qwen-14B | 93.9% | 89.56% | 82.44% | -7.11% |
| Qwen-32B | 89.1% | 91.78% | 84.00% | -7.78% |
| Llama-70B | 94.5% | 93.33% | 78.89% | -14.44% |
*All models are DeepSeek-R1-Distill models. **As reported in the DeepSeek-R1 technical report.
Key Observations
The distilled Qwen-7B model demonstrates remarkable resilience, maintaining over 84% accuracy under the most challenging variation and edging out o1 in stability, a striking challenge to the assumption that larger models inherently generalize better. While the 32B model performs similarly to its 7B counterpart, its tokens-per-parameter profile suggests untapped potential: the 7B model's more extensive training contrasts with the 32B's comparatively incomplete training, indicating that with continued optimization the 32B model could potentially outperform o1.
Llama 3.3's low tokens-per-parameter ratio correlates with its significant accuracy drop in variation testing, revealing a critical efficiency gap. It is remarkable that distilled Qwen models now outperform much larger state-of-the-art models like Llama while using fewer resources, demonstrating that intelligent optimization can trump raw computational power. This shift signals a pivotal moment in AI development where architectural efficiency and reasoning capability are becoming more important than model size.
Future work
Looking ahead, these key research directions will drive continued advancement:
- Developing precise benchmarking tools for geometric reasoning to enable more robust evaluation of spatial understanding capabilities
- Investigating foundational training approaches to enhance visual-spatial comprehension, particularly in handling complex geometric transformations
- Leveraging our proprietary data to fine-tune leading models such as Qwen2.5-7B and Qwen2.5-32B, or other open-source models that are in the process of catching up, to meaningfully improve their reasoning performance
With our proprietary data assets, we’re working to enhance the distilled models to achieve stronger reasoning performance. We will also be publishing results on o3-mini performance shortly.