Updated on February 21, 2025

DeepSeek may have disrupted the global AI market in January 2025, but OpenAI and Google soon regained ground with their O3 and Gemini 2 Pro releases.
This leaves us with the old question we first tackled with the previous versions of these models: which AI model should you use for customer service?
To evaluate these models on customer service tasks, we must look deeper at their performance across various AI benchmarks. We also need to understand the unique properties of each of these models.
This article will comprehensively review these models’ technical performance and provide insights into how businesses can use these models for customer service. We’ll cover:
1. What’s New in DeepSeek R1, OpenAI O3, and Gemini 2 Pro?
2. DeepSeek R1 vs OpenAI O3 vs Gemini 2 Pro: Checking Performance
3. Which Model Provides the Best Performance for Customer Service?
4. The Verdict
What’s New in DeepSeek R1, OpenAI O3, and Gemini 2 Pro?
When OpenAI O1 was released, it was the only “reasoning” model available. OpenAI had begun scaling its models with test-time compute, giving the model more time to think before answering complex questions. O1 showed remarkable capabilities in solving puzzles and graduate-level problems.
DeepSeek R1 replicated this capability with comparable performance. Before pitting these state-of-the-art models against one another, let’s examine what sets each of them apart.
Deepseek R1
We’ve covered how DeepSeek R1 triggered a crash in NVIDIA’s stock. Despite that, an NVIDIA spokesperson said, “DeepSeek is an excellent AI advancement and a perfect example of Test Time Scaling. DeepSeek’s work illustrates how new models can be created using that technique, leveraging widely-available models and compute that is fully export control compliant.”
DeepSeek made some unique, widely lauded advances in AI training:
1. Creating a “reasoning” model like O1 at a fraction of the investment.
2. Optimizing GPU-to-GPU communication so that training becomes faster and more efficient.
3. Improving the Transformer architecture to deliver faster answers.
4. Improving the accuracy of the model’s answers.
5. Improving the cost-efficiency of LLMs (OpenAI O1 costs $15 per 1M tokens, while DeepSeek R1 costs $2.19 per 1M tokens).
When it comes to business use cases, DeepSeek R1 is one of the cheapest reasoning models available. That translates into significant cost savings and lets businesses integrate AI into every domain without spending millions in capex.
Additionally, the model is fully open source and comes with a detailed technical paper, so businesses can deploy it on their own cloud infrastructure without paying the parent company.
OpenAI O3 was released right after DeepSeek R1 and brought plenty of innovation of its own.
OpenAI O3
OpenAI provided the first glimpse of O3’s performance in December 2024 with the announcement that it had scored 88% on the ARC-AGI test.
The ARC-AGI test is designed to measure an AI model’s ability to recognize and complete novel tasks. O3 could solve new problems on its own and beat comparable models (O1 and Claude 3.5 Sonnet) by a mile.
However, O3 achieved this score by spending over $1,000 in compute on every task. So, while O3 was intelligent, it wasn’t efficient, which made it challenging to offer the model to the broader public.
So OpenAI launched O3-Mini instead. O3-Mini is also a reasoning model and far more efficient than O3, though it is not as accurate as the full O3.
For comparison, this is how O3-Mini compares with O1-Mini across a range of tasks:

| Category | Eval | o1-mini | o3-mini (low) | o3-mini (medium) | o3-mini (high) |
|---|---|---|---|---|---|
| General | MMLU (pass@1) | 65.2 | 84.5 | 85.0 | 85.9 |
| Math | MATH (pass@1) | 90.9 | 95.8 | 97.2 | 97.9 |
| Math | GSM (pass@1) | 83.9 | 85.1 | 89.8 | 92.0 |
| Factuality | SimpleQA | 76 | 86 | 87.4 | 88.8 |
O3-Mini’s key contributions are as follows:
1. It’s a specialized model with core expertise in coding and other technical tasks.
2. O3-Mini reduces errors by 39% compared to O1-Mini.
3. 56% of testers prefer O3-Mini over O1-Mini.
4. O3-Mini answers questions 2.5 seconds faster than O1-Mini.
Since O3-Mini scores higher on evaluations and is considerably faster than O1-Mini, it’s a great model to start with. It currently has rate limits for Plus users (subscribers paying $20/month); it’s also available commercially at $4.40 per 1 million output tokens.
However, unlike DeepSeek R1, O3-Mini is completely closed source and can’t be deployed on a company’s own cloud infrastructure.
The latest entrant into the competition is Google Gemini 2 Pro, a capable model that shows remarkable performance across the board.

Gemini 2 Pro

Logan Kilpatrick, the current Product Lead for Google’s AI Studio and DeepMind, launched Gemini 2 Pro, saying, “This is our strongest frontier model yet, with all the things developers love about our pro model lineup.”
Gemini 2 Pro delivers exceptional performance, outpacing many current models with features like:
1. Two-Million-Token Context Window – Gemini 2 Pro has a 2-million-token context window, so you can feed entire books into it without problems. Logan has also shown that Gemini 2 excels at document processing, outperforming current OCR models.
2. Tool Use – Recent models like O3-Mini and DeepSeek R1 come with some tool use. With Gemini 2, you get the power of Google Search inside your AI model, which is great for developers and businesses that want to give customers grounded, accurate answers.
3. Coding – The Gemini 2 Pro model is built to be a technical expert, offering coding expertise on a level similar to O3-Mini’s.
4. Complex Reasoning and Prompts – Like the models above, Gemini 2 is proficient at understanding complex prompts and reasoning, which lets it perform complicated tasks and provide detailed answers.
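To get a feel for what a 2-million-token context window means in practice, here is a minimal Python sketch that estimates whether a document fits. It assumes the common rule of thumb of roughly 4 characters per token; real tokenizers vary by language and content, so treat this as a rough heuristic, not an exact count:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using a chars-per-token heuristic."""
    return int(len(text) / chars_per_token)

def fits_in_context(text: str, context_window: int = 2_000_000) -> bool:
    """Check whether a document likely fits in the model's context window."""
    return estimate_tokens(text) <= context_window

# A ~300-page book is roughly 600,000 characters (~150,000 tokens),
# well within a 2M-token window.
book = "x" * 600_000
print(estimate_tokens(book))   # → 150000
print(fits_in_context(book))   # → True
```

By this estimate, a 2M-token window holds on the order of a dozen full-length books at once, which is what makes whole-corpus document analysis feasible.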
Gemini 2 Flash, a model that offers faster responses than Gemini 2 Pro while maintaining similar performance, is priced at $0.70 per 1 million output tokens, making it the cheapest option for developers.
All three models are commercially available and can be used to build your customer service chatbots or email ticketing clients. Now that we have a core idea of these models and their unique features, let’s look at how they perform against each other.
DeepSeek R1 vs OpenAI O3 vs Gemini 2 Pro: Checking Performance
The overall performance of these models is as follows:

| Model | Reasoning | Math | Language | Factuality | Coding | Price per 1M Output Tokens |
|---|---|---|---|---|---|---|
| OpenAI O3-mini | 86.9% | 97.9% | 50.68% | 13.8% | 82.74% | $4.40 |
| Gemini 2 Flash | 77.6% | 90.9% | 51.29% | 29.9% | 63.49% | $0.70 |
| DeepSeek R1 | 84% | 79.8% | N/A | 30.1% | 66.74% | $2.19 |
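To make the comparison concrete, the scores quoted above can be encoded as plain data so you can programmatically pick the leader in each category. This is just an illustrative sketch over the numbers in the table; the model names and dictionary layout are our own:

```python
# Benchmark scores from the comparison table (percentages).
scores = {
    "OpenAI O3-mini": {"Reasoning": 86.9, "Math": 97.9, "Language": 50.68,
                       "Factuality": 13.8, "Coding": 82.74},
    "Gemini 2 Flash": {"Reasoning": 77.6, "Math": 90.9, "Language": 51.29,
                       "Factuality": 29.9, "Coding": 63.49},
    "DeepSeek R1":    {"Reasoning": 84.0, "Math": 79.8,
                       "Factuality": 30.1, "Coding": 66.74},  # no Language score
}

def leader(category: str) -> str:
    """Return the model with the highest score in a category,
    skipping models that were not evaluated on it."""
    candidates = {m: s[category] for m, s in scores.items() if category in s}
    return max(candidates, key=candidates.get)

for cat in ["Reasoning", "Math", "Language", "Factuality", "Coding"]:
    print(f"{cat}: {leader(cat)}")
```

Running this reproduces the per-category winners discussed below: O3-mini leads reasoning, math, and coding; Gemini 2 Flash leads language; DeepSeek R1 leads factuality.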
These evaluations are based on several benchmarks, which we have listed below:
| Performance Category | Underlying Benchmarks |
|---|---|
| Reasoning | MMLU, GPQA, and other chain-of-thought reasoning tasks |
| Math | Math-specific benchmarks (e.g., the MATH benchmark, numerical reasoning tasks) |
| Language | Standard NLP/language-understanding tasks (e.g., commonsense reasoning, NLI) |
| Factuality | SimpleQA, which tests the model’s ability to answer general-knowledge questions |
| Coding | Coding problems from LeetCode and AtCoder |
Let’s explore these performance categories and try to understand which model is better at which task.
1. Reasoning – GPQA (the Graduate-Level Google-Proof QA benchmark) and MMLU (the Massive Multitask Language Understanding benchmark) test how an AI model reasons through and solves complex problems. These problems can’t be solved by googling, so a model cannot answer them without genuine reasoning. OpenAI O3-Mini is the best at complex tasks requiring reasoning power.
2. Math – Tested with the MATH benchmark, these evaluations check how efficiently a model solves math problems. Since these problems require technical expertise and familiarity with mathematical concepts, they showcase a model’s capability on complex technical tasks. OpenAI O3-Mini is the best at solving math problems.
3. Language – The language tasks given to these LLMs include the NYT Connections puzzles, word puzzles, and synopsis tasks. Currently, Gemini 2 Flash performs best at these tasks.
4. Factuality – In this benchmark, the model is asked domain-specific general-knowledge questions, which tests the knowledge embedded in the model. DeepSeek R1 outperforms Gemini 2 and O3-Mini on this test.
5. Coding – This benchmark tests the models’ ability to generate code and complete programming tasks. OpenAI O3-Mini is the best at coding.
6. Pricing – Cost-effectiveness is one of the primary factors in evaluating AI models for customer service. Gemini 2 is the most cost-efficient at $0.70 per million output tokens.
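Price differences compound quickly at customer-service volumes. Here is a small sketch that projects monthly output-token spend using the per-million prices quoted above; the 50M-tokens-per-month workload is a hypothetical figure for illustration:

```python
# USD per 1M output tokens, as quoted in the comparison table.
PRICE_PER_M_OUTPUT = {
    "OpenAI O3-mini": 4.40,
    "DeepSeek R1":    2.19,
    "Gemini 2 Flash": 0.70,
}

def monthly_cost(model: str, tokens_per_month: int) -> float:
    """Projected monthly spend on output tokens for one model."""
    return PRICE_PER_M_OUTPUT[model] * tokens_per_month / 1_000_000

# Hypothetical support workload: 50M output tokens per month.
for model in PRICE_PER_M_OUTPUT:
    print(f"{model}: ${monthly_cost(model, 50_000_000):.2f}")
```

At that volume the gap is stark: roughly $220/month for O3-mini versus $35/month for Gemini 2 Flash, before input-token costs, which vary separately by provider.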
Looking at this performance data, OpenAI O3-Mini is clearly the most technically proficient, but it is also the most expensive. DeepSeek R1 provides the best factual accuracy, and Gemini 2 is the best at language-heavy, document-oriented tasks.
This concrete performance overview lets us understand which model best suits customer service.
Which Model Provides the Best Performance for Customer Service?
Considering the capabilities of these models, we can construct a clear framework for choosing the right one.

Why is Gemini 2 the best model for customer service?
We evaluate customer service models on the following parameters:
1. Cost – In any customer service project, you must connect with people at scale. Cost efficiency plays a key part here, and Gemini 2 is the most cost-efficient state-of-the-art model available.
2. Language Efficiency – Your AI chatbot and email ticketing system must understand customer complaints and categorize tickets well. Gemini 2 is the best at language-oriented tasks.
3. Factuality – Accuracy is one of the key things to focus on when evaluating AI models. In customer service, however, the required information is supplied to the model at answer time via retrieval-augmented generation (RAG). So, while DeepSeek scores highest on factuality, Gemini 2’s score is sufficient to provide customers with accurate answers efficiently.
4. Technical Expertise – While these models are all great at coding and technical tasks, most customer complaints don’t involve code bugs or complex technical problems. Since the goal is to solve and automate L1 customer complaints with AI, OpenAI O3’s high scores on technical expertise carry the lowest weight in our ratings.
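To illustrate the RAG point above: in a typical support setup, the model doesn’t rely on memorized facts; relevant help-center passages are retrieved and injected into the prompt. Below is a minimal, model-agnostic sketch using naive keyword overlap as the retriever. A production system would use embeddings and an actual LLM call; the `KNOWLEDGE_BASE` entries and `build_prompt` helper here are hypothetical examples of ours, not any vendor’s API:

```python
# Toy knowledge base standing in for help-center articles (hypothetical).
KNOWLEDGE_BASE = [
    "Refunds are processed within 5 business days of approval.",
    "Password resets can be triggered from the login page via 'Forgot password'.",
    "Enterprise plans include 24/7 phone support.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank docs by naive keyword overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    ranked = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Assemble the grounded prompt that would be sent to whichever LLM you use."""
    context = "\n".join(retrieve(query, KNOWLEDGE_BASE))
    return (f"Answer using only this context:\n{context}\n\n"
            f"Customer question: {query}")

print(build_prompt("How long do refunds take?"))
```

Because the answer is carried in the retrieved context rather than the model’s parametric memory, a model with moderate factuality scores (like Gemini 2) can still answer customer questions accurately.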
So, with a specific focus on customer support, Gemini 2 is the best model for customer service. That said, organizations have different requirements, and the other two models are better suited to certain problems.

The Verdict
Our deep dive into DeepSeek R1, OpenAI O3-mini, and Gemini 2 Flash reveals a crucial point: no single “best” AI model for customer service exists. Your choice depends heavily on your specific needs, priorities, and the nature of your customer interactions.
While OpenAI O3-mini consistently leads in raw benchmark scores, particularly in reasoning, math, and coding, its higher cost and closed-source nature make it less accessible for some organizations. DeepSeek R1 offers a compelling open-source alternative with strong factuality, but its overall performance doesn’t quite match O3-mini’s technical prowess.
Our analysis points to Gemini 2 Flash as the most well-rounded choice. Its strengths in language understanding, combined with its unmatched cost-effectiveness and large context window (ideal for processing customer histories and documentation), make it exceptionally well-suited for tasks like:
- Chatbot Interactions: Handling common inquiries, guiding users through troubleshooting steps, and escalating complex issues.
- Email Ticketing: Categorizing support requests, providing automated responses to frequently asked questions, and summarizing long email threads.
- Document Processing: Extracting relevant information from customer-submitted documents (like invoices, contracts, or feedback forms).
However, it’s crucial to remember the nuances:
- Need highly technical support? If your customer service frequently involves debugging code or solving complex mathematical problems, O3-mini’s superior technical skills might justify the higher cost.
- Do you want to prioritize open-source and on-premise deployment? DeepSeek R1 is the clear winner, providing control and cost savings.
- Dealing with extensive documentation or requiring long context? Gemini 2’s two-million-token context window makes it the model of choice.
Want to understand which AI model will work best for your customer service email ticketing and chatbot? We’re here to help!

As a seasoned technologist, Adarsh brings over 14 years of experience in software development, artificial intelligence, and machine learning to his role. His expertise in building scalable and robust tech solutions has been instrumental in the company’s growth and success.