Yuvraj Khanna, Raghav Rastogi, Dhruv Kumar, Peter Relan, Hung Tran

Introduction

This is the fourth blog in our series examining recall vs. mathematical reasoning in LLMs. Building on our previous blog, we compare the performance of OpenAI’s o1 model with DeepSeek R1 and six models distilled from R1, evaluating their reasoning robustness, cost efficiency, and performance.

Methodology: Testing Reasoning Robustness

Our analysis follows the methodology outlined in our prior research, “Exploring the Limits of Reasoning vs. Recall in Large Language Models: A Case Study with the MATH dataset.” This methodology focuses on testing whether models rely on memorization or demonstrate genuine reasoning capabilities.

We evaluated models using 90 problems from the MATH dataset, covering topics such as algebra, geometry, and number theory. Each problem was tested under a zero-shot approach, where no worked examples were provided—just a direct prompt to “solve step by step.”
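
For concreteness, a minimal version of this evaluation loop is sketched below. It assumes an OpenAI-compatible chat API; the model name, the answer-extraction step, and the `problems` list are illustrative placeholders, so this shows the protocol rather than our exact harness.

```python
# Minimal zero-shot evaluation sketch (not our exact harness). Assumes an
# OpenAI-compatible chat API; model names, the answer-extraction step, and
# the `problems` list are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # point api_key/base_url at the provider being tested

ZERO_SHOT_PROMPT = (
    "Solve the following problem step by step. "
    "Give the final answer on the last line as 'Answer: <value>'.\n\n{problem}"
)

def zero_shot_accuracy(model: str, problems: list[dict]) -> float:
    """Fraction of problems answered correctly; `problems` holds {'question', 'answer'} dicts."""
    correct = 0
    for p in problems:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": ZERO_SHOT_PROMPT.format(problem=p["question"])}],
        )
        text = response.choices[0].message.content or ""
        # Naive extraction; a real harness would normalize LaTeX/numeric answer forms.
        predicted = text.rsplit("Answer:", 1)[-1].strip()
        correct += int(predicted == p["answer"])
    return correct / len(problems)
```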

To further challenge model robustness, we introduced four levels of problem variations:

  1. Variation 1: Change only the mathematical variable (e.g., “x” to “y”).
  2. Variation 2: Change only the non-mathematical context (e.g., replacing “dogs” with “books”).
  3. Variation 3: Rephrase the question’s language while keeping the numerical values.
  4. Variation 4: Modify both the language and the variable names while keeping the solution unchanged.

Since Variation 4 represents the most challenging scenario, we focus on its impact in this study.
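
As an illustration of the four levels, here is how a toy problem (not drawn from the MATH dataset) might be rewritten; the answer remains 4 in every version:

  Original: Maria has x dogs. After adopting 3 more, she has 7 dogs. Find x.
  Variation 1: Maria has y dogs. After adopting 3 more, she has 7 dogs. Find y.
  Variation 2: Maria has x books. After buying 3 more, she has 7 books. Find x.
  Variation 3: Maria ends up with 7 dogs after adopting 3 more; how many dogs, x, did she start with?
  Variation 4: Maria ends up with 7 dogs after adopting 3 more; how many dogs, n, did she start with?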

Cost Comparison

| Model       | Input Cost (per Million Tokens) | Output Cost (per Million Tokens) |
|-------------|---------------------------------|----------------------------------|
| OpenAI o1   | $15.00                          | $60.00                           |
| DeepSeek R1 | $0.14                           | $0.28                            |

DeepSeek R1 offers significant cost advantages over o1, with initial operating costs roughly 10-20x lower. While R1’s highly detailed reasoning process generates 2-3x more output tokens, it still maintains a compelling 3-7x cost advantage. Although these token-adjusted calculations are preliminary, R1’s cost efficiency makes it particularly attractive for large-scale applications, especially when handling similar problem types where token usage patterns remain consistent. This cost-effectiveness, combined with its detailed reasoning capabilities, positions R1 as an economically viable option for widespread deployment.
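
For readers who want to reproduce the token-adjusted comparison, a minimal sketch is below. It only illustrates the formula; the per-problem token counts are placeholders rather than measured values, and the real ratios depend on the token usage observed for each model.

```python
# Back-of-the-envelope, token-adjusted cost per problem.
# Prices are per million tokens (from the table above); the per-problem token
# counts used in the example are hypothetical placeholders, not measured values.
PRICES_PER_MTOK = {
    "o1": {"input": 15.00, "output": 60.00},
    "deepseek-r1": {"input": 0.14, "output": 0.28},
}

def cost_per_problem(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single problem given its input/output token counts."""
    price = PRICES_PER_MTOK[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Example with placeholder counts, assuming R1 emits ~3x the output tokens of o1.
print(f"o1: ${cost_per_problem('o1', 500, 2_000):.4f} per problem")
print(f"R1: ${cost_per_problem('deepseek-r1', 500, 6_000):.4f} per problem")
```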

Results Comparison

Accuracy Drop from Original to Variation 4

Accuracy Drop = Variation 4 Accuracy – Original Accuracy

| Model       | Original Accuracy | Variation 4 Accuracy | Accuracy Drop |
|-------------|-------------------|----------------------|---------------|
| OpenAI o1   | 96.89%            | 90.89%               | -6.00%        |
| DeepSeek R1 | 92.22%            | 88.89%               | -3.33%        |

Key Findings

Comparing DeepSeek R1 to OpenAI o1:

  1. DeepSeek R1 demonstrates significant improvements in handling problem variations compared to o1, showing a smaller accuracy drop and better adaptability.
  2. o1’s higher original accuracy does not translate to stability under variation, as it suffers the larger drop, implying some reliance on memorized patterns.

The sharp drop in accuracy on problem variations reveals something crucial: high performance on the original MATH dataset likely comes in part from pattern matching, not just reasoning. o1’s results make this especially clear: if these models were truly reasoning, they should handle slight variations just as well as the originals.

Direct Comparison: DeepSeek R1 vs o1

Questions Models Get Wrong

Improvements Noted in R1

Critical Error Types in both models

Distilled versions of DeepSeek R1, even at 7B parameters, show extremely promising numbers

Accuracy Drop = Variation 4 Accuracy – Original Accuracy

| Model Name | MATH-500 Accuracy | Original 90 Accuracy | Variation 4 Accuracy | Accuracy Drop |
|------------|-------------------|----------------------|----------------------|---------------|
| Qwen-1.5B  | 83.9%             | 34.81%               | 42.59%               | +7.78%        |
| Qwen-7B    | 92.8%             | 90.37%               | 84.44%               | -5.93%        |
| Llama-8B   | 89.1%             | 87.78%               | 80.00%               | -7.78%        |
| Qwen-14B   | 93.9%             | 89.56%               | 82.44%               | -7.11%        |
| Qwen-32B   | 89.1%             | 91.78%               | 84.00%               | -7.78%        |
| Llama-70B  | 94.5%             | 93.33%               | 78.89%               | -14.44%       |

Key Observations

The Qwen2.5-7B model demonstrates remarkable resilience, maintaining 84% accuracy under challenging variations and surpassing o1’s stability – a striking challenge to the assumption that larger models inherently generalize better. While Qwen2.5-32B performs similarly to its 7B counterpart, its superior tokens/parameter efficiency suggests untapped potential. The 7B model’s extensive training contrasts with the 32B’s incomplete training, indicating that with continued optimization, the 32B model could potentially outperform o1.

Llama 3.3’s low tokens/parameter ratio correlates with its significant accuracy drops in variation testing, revealing a critical efficiency gap. It’s remarkable that distilled Qwen models are now outperforming state-of-the-art models like Llama while using fewer resources, demonstrating that intelligent optimization can trump raw computational power. This shift signals a pivotal moment in AI development, where architectural efficiency and reasoning capabilities are becoming more important than model size.

Future work

Looking ahead, these key research directions will drive continued advancement:

  1. Developing precise benchmarking tools for geometric reasoning to enable more robust evaluation of spatial understanding capabilities
  2. Investigating foundational training approaches to enhance visual-spatial comprehension, particularly in handling complex geometric transformations
  3. Leveraging our proprietary data to fine-tune leading models such as Qwen2.5-7B and Qwen2.5-32B, or other open-source models that are in the process of catching up, to meaningfully improve their reasoning performance

With our proprietary data assets, we’re working to enhance the distilled models to achieve stronger reasoning performance. We will also be publishing results on o3-mini performance shortly.