Key Takeaway
We tested Claude Code, Codex, and Antigravity on the same Node.js email routing task and found that each tool fits a different development workflow:
- Claude Code delivered the most complete working implementation, including real Gmail ingestion, but needs targeted security hardening before production.
- Codex produced the strongest security foundation, with signed OAuth state protection and schema validation, but did not build a full Gmail ingestion pipeline.
- Antigravity created the richest prototype, with dynamic routing, retries, and a dashboard, but expanded the scope and had the weakest TypeScript coverage.
Best overall: Claude Code for scoped production MVPs, Codex for security-conscious scaffolding, and Antigravity for demos or rapid prototypes.
Every AI coding tool in 2026 claims to make developers faster. But which one actually performs best on a real engineering task?
At Kommunicate, our engineering team uses AI coding tools daily. We got curious about whether the marketing matched reality, especially for a generation of tools that now claim to be “agents,” not just assistants.
So we ran a hands-on experiment: install all three, give each the same real-world task, and document exactly what came out the other side. In this article, we compare the three agentic tools (Claude Code, Antigravity and Codex) and tell you which performed best. We’ll talk about:
- Which models did we use?
- Claude Code vs Antigravity 2.0 vs Codex
- Our experiment: Setup and methodology
- Results: How did each tool perform?
- Conclusion
Which models did we use?

Before comparing the platforms, it’s worth understanding the AI models that we used for our experiment. The differences between these models are real, and they directly shape the code you get out the other end.
1. GPT-5.5 (Codex)
OpenAI’s GPT-5.5 launched on April 23, 2026 and is the model currently powering Codex. Designed as an efficiency-first flagship, it brings significant improvements in tool use and agentic task completion over its predecessors.
The standout number is Terminal-Bench 2.0 at 82.7%, the highest of the three models here, and a meaningful signal for a tool like Codex that runs terminal commands as part of its core workflow.
2. Claude Sonnet 4.6 (Claude Code)
Released February 17, 2026, Sonnet 4.6 is the model powering Claude Code’s default experience. Anthropic positioned it as near-Opus performance at mid-tier pricing, and the benchmarks back that up.
SWE-Bench Verified at 79.6% is the headline. SWE-Bench tests models on real GitHub issues from production codebases. It’s the closest benchmark to “can this model ship production code?” and Sonnet 4.6 leads the three models here.
However, this model scores only 59.1% on Terminal-Bench, the lowest of the three.
In Claude Code usage data, developers preferred Sonnet 4.6 over the previous flagship, Sonnet 4.5, 70% of the time, citing better instruction following and less overengineering.
3. Gemini 3.5 Flash (Antigravity)
Announced at Google I/O 2026 on May 19, Gemini 3.5 Flash is the default model in Antigravity. It is a Flash-tier model (fast, cheap) that actually beats last year’s Pro model on several coding benchmarks.
| Benchmark | Gemini 3.5 Flash Score |
|---|---|
| Terminal-Bench 2.1 | 76.2% |
| MCP Atlas (multi-tool) | 83.6% |
| SWE-Bench Pro | 55.1% |
MCP Atlas at 83.6% is the number that matters for Antigravity (it measures multi-step tool orchestration across search, file operations, and data handling). For an IDE that dispatches multiple agents in parallel, that’s the relevant signal.
Gemini 3.5 Flash also runs roughly 4x faster on output tokens per second than comparable frontier models, which is how Antigravity can manage parallel agent workloads without becoming unusably slow.
Model benchmark summary
| | Claude Sonnet 4.6 | GPT-5.5 | Gemini 3.5 Flash |
|---|---|---|---|
| SWE-bench Pro (Public) | – | 58.6% | 55.1% |
| SWE-bench Verified | 79.6% | 78% | – |
| Terminal-Bench | 59.1% | 82.7% | 76.2% |
| Multi-tool (MCP Atlas) | – | 75.3% | 83.6% |
| HumanEval | 98% | ~97% | ~96% |
| Output speed | Moderate | Moderate | 4x faster |
| Pricing (input/output per 1M tokens) | $3 / $15 | ~$7.50 / $30 | $1.50 / $9 |
No single model dominates across the board.
- Sonnet 4.6 leads on real-world repo-level coding.
- GPT-5.5 leads on terminal operations.
- Gemini 3.5 Flash leads on multi-tool orchestration and is by far the cheapest and fastest of the three.
Now that we know about the AI models, let’s talk about the coding agents.
Claude Code vs Antigravity 2.0 vs Codex

The model is only part of the story. The platform is where the experience diverges most sharply.
1. Claude Code is Anthropic’s terminal-native coding agent. It lives in your command line, has direct access to your filesystem and git, and operates on an approval-first model: it reasons about your full codebase, asks clarifying questions before starting, and requires explicit confirmation before any destructive action.
2. OpenAI Codex started as a CLI in April 2025 and has expanded to a desktop app, IDE integrations, and cloud-based execution. It’s built around an async task-delegation model. It’s bundled into ChatGPT Plus ($20/month) rather than sold separately, making onboarding easy for anyone already in the OpenAI ecosystem.
3. Google Antigravity is the most architecturally ambitious of the three. An agent-first IDE built around a “Mission Control” interface, it dispatches up to five autonomous agents working in parallel across your editor, terminal, and a built-in Chromium browser. The philosophy is that you’re a manager of agents, not a developer writing every line. Currently free during public preview.
Let’s compare these coding platforms on some metrics before we begin our experiments.
Platform comparison
| | Claude Code | Codex | Antigravity |
|---|---|---|---|
| Underlying model | Claude Sonnet 4.6 | GPT-5.5 | Gemini 3.5 Flash |
| Interface | Terminal / CLI | Desktop app, CLI, IDE, web | Agent-first IDE (VS Code fork) |
| Workflow model | Approval-first, collaborative | Async task delegation | Multi-agent orchestration |
| Autonomy level | High, with checkpoints | High, async | Very high, parallel agents |
| Multi-agent support | No | Limited | Yes, up to 5 parallel |
| MCP support | Yes | Difficult to integrate | Yes, added March 2026 |
| Git & terminal access | Yes, native | Yes | Yes, with a browser |
| Pricing | $17/mo Pro (annual) | Bundled in ChatGPT Plus ($20/mo) | Free (preview) |
| Platform stability | Production-ready | Production-ready | Public preview |
Now, let’s start building with these platforms.
Our experiment: Setup and methodology
The Prompt
To keep it neutral, we gave the same prompt to all of the coding platforms:
Design a basic email routing architecture with Google OAuth. Include: OAuth 2.0 flow, email ingestion endpoint, routing logic by sender domain, and a simple queue. Implement in Node.js/TypeScript, wire it up, and push it to GitHub.
This task is broad enough to test judgment: it spans auth, API integration, data flow, and git workflow, involves a real external service, and leaves enough ambiguity that how each tool interprets it is revealing.
Evaluation criteria
We assessed each output across five dimensions:
| Dimension | What We Looked At |
|---|---|
| Workflow autonomy | Did the tool complete the full task end-to-end — including git push — or require manual steps? |
| Code scope | Did it build what was asked, or expand the brief without prompting? |
| Type safety | What percentage of the codebase is TypeScript? Higher = tighter discipline. |
| Security awareness | Did it address OAuth state validation, token encryption, and scope minimization unprompted? |
| Production readiness | Did it document or implement the path from demo to production? |
The outputs
All three outputs are public, committed exactly as each tool produced them without cleanup:
- Claude Code: email-router-claude-code
- Codex: email-routing-google-oauth-codex
- Antigravity: email-routing-antigravity
Now, let’s see how each tool performed in this task.
Results: How did each tool perform?
Install and workflow experience
Before any code was written, the experience itself was already telling.
Claude Code installed cleanly. It executed the full task without interruption. It wrote the code, configured the environment, wired up the git remote, and pushed the commit. No manual steps required.
Codex installed cleanly but stalled at the finish line. The git configuration and local run required manual intervention. The code it produced was good; the end-to-end workflow automation was not.
Antigravity crashed twice on launch, on a supported OS configuration, before the application opened. When it eventually loaded, there were bugs in the UI. It produced output, but the install experience is a data point in itself.
Antigravity 2.0 had only just launched and has documented bugs. Gemini 3.5 Flash is also not the most capable model. Both shortcomings have been documented by Theo from t3.gg.
Code Audit
1. Claude Code: a real Gmail ingestion pipeline
TypeScript: 100% | Commits: 2 | Module system: CommonJS via ts-node | Port: 3000
Architecture

Claude Code is the only tool that actually reads Gmail. It uses the googleapis package directly: authenticating, querying the Gmail client with is:unread, downloading message metadata, and extracting From, Subject, and Date headers sequentially. It also supports Pub/Sub push via /webhook/gmail, which is the production-appropriate ingestion pattern rather than polling.
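To make the header-extraction step concrete, here is a minimal sketch of turning Gmail message metadata into a routable job. It is not the code from the repository: the interfaces are simplified local shapes (the real googleapis types mark most fields as optional), and `EmailJob`, `headerValue`, `extractSenderDomain`, and `toEmailJob` are hypothetical names chosen for illustration.

```typescript
// Simplified local shapes mirroring the parts of a Gmail message we read.
// Assumption: every field is present; real API responses may omit them.
interface GmailHeader {
  name: string;
  value: string;
}

interface GmailMessageMetadata {
  id: string;
  payload: { headers: GmailHeader[] };
}

// Hypothetical job shape handed on to the router and the queue.
interface EmailJob {
  messageId: string;
  from: string;
  subject: string;
  date: string;
  senderDomain: string;
}

// Header names are case-insensitive, so compare them in lower case.
function headerValue(headers: GmailHeader[], name: string): string {
  const wanted = name.toLowerCase();
  const found = headers.find((h) => h.name.toLowerCase() === wanted);
  return found ? found.value : "";
}

// Accepts both "Ada <ada@example.com>" and a bare "ada@example.com".
function extractSenderDomain(from: string): string {
  const match = /<([^<>]+)>\s*$/.exec(from);
  const address = (match?.[1] ?? from).trim();
  const at = address.lastIndexOf("@");
  return at === -1 ? "" : address.slice(at + 1).toLowerCase();
}

function toEmailJob(message: GmailMessageMetadata): EmailJob {
  const headers = message.payload.headers;
  const from = headerValue(headers, "From");
  return {
    messageId: message.id,
    from,
    subject: headerValue(headers, "Subject"),
    date: headerValue(headers, "Date"),
    senderDomain: extractSenderDomain(from),
  };
}
```

Normalizing the domain to lower case at this stage keeps the later routing comparison a plain string equality check.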
Routing: Simple linear array scan matching r.domain === job.senderDomain, with an empty string "" catch-all at the end. It routes to real email destinations (engineering@company.com, billing@company.com). The table lives in code, and to change a rule, you edit router.ts.
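The routing approach described above can be sketched in a few lines. This is an illustration, not the contents of router.ts: the sender domains and the catch-all destination (`inbox@company.com`) are assumptions, and `resolveDestination` is a hypothetical name.

```typescript
interface RouteRule {
  domain: string; // "" marks the catch-all rule
  destination: string;
}

interface EmailJob {
  senderDomain: string;
  subject: string;
}

// Static table: changing a rule means editing this array.
// The domains and the catch-all address are made up for this sketch.
const routes: RouteRule[] = [
  { domain: "eng.example", destination: "engineering@company.com" },
  { domain: "billing.example", destination: "billing@company.com" },
  { domain: "", destination: "inbox@company.com" }, // must stay last
];

// Linear O(n) scan: the first exact match wins, otherwise the scan
// reaches the catch-all at the end of the table.
function resolveDestination(job: EmailJob): string {
  const rule = routes.find(
    (r) => r.domain === job.senderDomain || r.domain === "",
  );
  if (rule === undefined) {
    throw new Error("routing table has no catch-all rule");
  }
  return rule.destination;
}
```

Because the catch-all is matched by position, it has to remain the last entry; placing it earlier would shadow every rule after it.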
Queue: A single global in-memory FIFO queue, drained by a continuous async worker loop. If a job fails, the error is caught and logged to stderr before moving to the next item — a dead-letter pattern. There is no retry counter.
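A rough sketch of this queue behavior, under some assumptions: the job shape, the `drainQueue` name, and the injectable `log` parameter are ours, and the function drains until the queue is empty rather than running as a continuous loop, which keeps the example finite.

```typescript
interface EmailJob {
  id: string;
  senderDomain: string;
}

type JobHandler = (job: EmailJob) => Promise<void>;

// Single global in-memory FIFO: push at the back, shift from the front.
const emailQueue: EmailJob[] = [];

// Processes jobs one at a time in arrival order. A failing job is logged
// and skipped; there is no retry counter and nothing is re-queued.
async function drainQueue(
  handler: JobHandler,
  log: (message: string) => void = console.error,
): Promise<string[]> {
  const failedIds: string[] = [];
  while (emailQueue.length > 0) {
    const job = emailQueue.shift();
    if (job === undefined) break;
    try {
      await handler(job);
    } catch (err) {
      log(`job ${job.id} failed: ${String(err)}`);
      failedIds.push(job.id);
    }
  }
  return failedIds;
}
```

Since everything lives in process memory, a restart loses both the pending jobs and the record of failures, which is why the README's suggestion to swap in a persistent queue matters.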
OAuth: Basic state-free redirect flow. It generates a login URL, accepts the callback code, and stores tokens in a Map under a default user key. Token refresh is handled automatically by google-auth-library events. Critically, there is no state parameter validation on the callback, and the endpoint is technically vulnerable to CSRF. Codex fixed this; Claude Code and Antigravity did not.
Production guidance: The README includes an explicit swap-out table:
| Demo Implementation | Production Replacement |
|---|---|
| tokenStore (in-memory Map) | Redis / Postgres keyed by user ID |
| emailQueue (in-memory FIFO) | BullMQ + Redis (retries, priorities, persistence) |
| deliverEmail (console.log) | Gmail forward, webhook POST, or DB insert |
The verdict: The only output of the three you could point at a Gmail inbox and have it actually work. The routing and queue are deliberately simple, and the production notes are honest. The OAuth gap is real and worth one targeted fix, but the structural foundation is sound.
2. Codex: An enterprise-grade security skeleton
TypeScript: 100% | Commits: 6 | Module system: ESM ("type": "module") via tsx watch | Port: 3000
Architecture

Codex built a validated API scaffold, not a Gmail integration. It accepts email payloads via POST /email/ingest; there is no googleapis dependency, no Gmail polling, no Pub/Sub. The ingestion model is passive: it expects upstream systems to forward normalized email payloads to the endpoint. This is a deliberate architectural choice, not an oversight, and it’s the right pattern for a microservice that shouldn’t own the mail transport layer.
Input validation: Codex is the only tool to use zod for schema validation, enforcing typed runtime contracts on incoming API requests. Claude Code and Antigravity both use manual parsing: custom string methods, regex, and basic JavaScript presence checks (!sender).
Routing: Constant-time O(1) dictionary lookup (domainRoutes[senderDomain(email)]) falling back to a defaultRoute object when unmatched. Faster and more predictable than Claude Code’s linear scan, though at the scale of email routing, the difference is academic.
Queue: In-memory structures segmented by target queue name (Map<string, QueuedEmailJob[]>), covering customer-success, vendor-ops, partnerships, and general-triage. There is no background worker in the committed code — jobs are enqueued but not actively drained.
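The lookup-then-enqueue pattern from the two paragraphs above might look roughly like this. The queue names come from the repository description; the sender domains, the `Route` shape, and the `resolveRoute` and `enqueue` helpers are assumptions for illustration, and `senderDomain` here expects a bare address.

```typescript
interface Route {
  queue: string;
}

interface QueuedEmailJob {
  id: string;
  sender: string;
  queue: string;
}

// Hypothetical domains mapped to the queue names mentioned above.
const domainRoutes: Record<string, Route> = {
  "customer.example": { queue: "customer-success" },
  "vendor.example": { queue: "vendor-ops" },
  "partner.example": { queue: "partnerships" },
};
const defaultRoute: Route = { queue: "general-triage" };

// One in-memory array per target queue name.
const queues = new Map<string, QueuedEmailJob[]>();

// Assumes a bare address such as "ada@vendor.example".
function senderDomain(sender: string): string {
  const at = sender.lastIndexOf("@");
  return at === -1 ? "" : sender.slice(at + 1).trim().toLowerCase();
}

// O(1) lookup; the own-property check keeps inherited keys such as
// "constructor" from being treated as routes.
function resolveRoute(sender: string): Route {
  const domain = senderDomain(sender);
  const route = Object.prototype.hasOwnProperty.call(domainRoutes, domain)
    ? domainRoutes[domain]
    : undefined;
  return route ?? defaultRoute;
}

function enqueue(id: string, sender: string): QueuedEmailJob {
  const job: QueuedEmailJob = { id, sender, queue: resolveRoute(sender).queue };
  const bucket = queues.get(job.queue) ?? [];
  bucket.push(job);
  queues.set(job.queue, bucket);
  return job;
}
```

As in the committed code, nothing here drains the buckets; a consumer would still need to read from `queues` for jobs to be processed.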
OAuth: The strongest implementation of the three. Codex signs the OAuth state parameter using crypto.createHmac("sha256") with a user nonce and timestamp, validates the signature on callback return, checks expiry, and uses timing-safe comparison to prevent timing attacks. This is the correct CSRF mitigation for an OAuth 2.0 callback, and neither Claude Code nor Antigravity implemented it. The README also explicitly flags “encrypt refresh tokens at rest” as a production requirement.
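For readers who want to see the shape of a signed state parameter, here is a minimal sketch using Node's built-in crypto module. It is not Codex's implementation: the `nonce.timestamp.signature` layout, the 10-minute lifetime, and the `createState` and `verifyState` names are assumptions made for this example.

```typescript
import { createHmac, randomBytes, timingSafeEqual } from "node:crypto";

// Assumption: a state value older than 10 minutes is rejected.
const STATE_TTL_MS = 10 * 60 * 1000;

function sign(payload: string, secret: string): string {
  return createHmac("sha256", secret).update(payload).digest("hex");
}

// Produces "nonce.issuedAt.signature" to send as the OAuth state parameter.
function createState(secret: string, now: number = Date.now()): string {
  const nonce = randomBytes(16).toString("hex");
  const payload = `${nonce}.${now}`;
  return `${payload}.${sign(payload, secret)}`;
}

// Called in the OAuth callback before the code is exchanged for tokens.
function verifyState(
  state: string,
  secret: string,
  now: number = Date.now(),
): boolean {
  const parts = state.split(".");
  if (parts.length !== 3) return false;
  const nonce = parts[0] ?? "";
  const issuedAtRaw = parts[1] ?? "";
  const signature = parts[2] ?? "";

  // Recompute the signature and compare in constant time.
  // timingSafeEqual throws on unequal lengths, so check length first.
  const expected = Buffer.from(sign(`${nonce}.${issuedAtRaw}`, secret), "hex");
  const received = Buffer.from(signature, "hex");
  if (expected.length !== received.length) return false;
  if (!timingSafeEqual(expected, received)) return false;

  // Only trust the timestamp once the signature has been verified.
  const age = now - Number(issuedAtRaw);
  return Number.isFinite(age) && age >= 0 && age <= STATE_TTL_MS;
}
```

A production flow would typically also tie the nonce to the user's session (for example via a cookie) so that a valid state issued to one browser cannot be replayed in another; the sketch leaves that part out.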
Six commits versus Claude Code’s two reflect Codex’s iterative style. The ESM module system, dedicated types.ts, and separated tokenStore.ts all signal a codebase designed to be extended without stepping on itself.
The verdict: The cleanest, most security-conscious skeleton of the three. Not a working Gmail integration, but if you’re building something you’ll take through a security review, this is the foundation with the fewest “why didn’t you handle this?” questions in code review.
3. Antigravity: A full-stack demo that exceeds the brief
TypeScript: 36.9% | CSS: 27.6% | JavaScript: 22.4% | HTML: 13.1% | Commits: 3 | Module system: CommonJS via ts-node | Port: 3050
Architecture

The language breakdown says everything before you open a file. Antigravity built a product. The public/ folder contains a complete monitoring dashboard with real-time queue visualization, a route management UI, and a webhook sandbox. No one asked for this.
But examining the backend closely, the picture becomes more nuanced. Antigravity’s implementations of routing and queuing are technically the most sophisticated of the three.
Routing: Antigravity translates wildcard patterns (*@google.com, sales@*, *) into operational regex blocks (.*), sorts rules by numeric priority fields, and tests each email against the sorted rule set sequentially. This supports complex, prioritized matching that neither Claude Code’s linear scan nor Codex’s dictionary lookup can express. Rules can also be added and deleted at runtime via API — no code edit required.
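A small sketch of wildcard, priority-sorted matching along these lines. The rule shape, the target names, and the convention that a lower priority number wins are assumptions for this example; the repository may order priorities differently.

```typescript
interface RoutingRule {
  pattern: string; // e.g. "*@google.com", "sales@*", "*"
  priority: number; // assumption: lower number is evaluated first
  target: string;
}

// Escapes every regex metacharacter except "*", then turns each "*"
// into ".*" and anchors the pattern to the whole address.
function patternToRegex(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`^${escaped.replace(/\*/g, ".*")}$`, "i");
}

// Sorts a copy of the rules by priority and returns the first match.
function matchRule(
  sender: string,
  rules: readonly RoutingRule[],
): RoutingRule | undefined {
  const sorted = [...rules].sort((a, b) => a.priority - b.priority);
  return sorted.find((rule) => patternToRegex(rule.pattern).test(sender));
}

// Hypothetical rule set used to exercise the matcher.
const exampleRules: RoutingRule[] = [
  { pattern: "*", priority: 100, target: "general" },
  { pattern: "*@google.com", priority: 1, target: "partners" },
  { pattern: "sales@*", priority: 2, target: "sales" },
];
```

With these rules, `sales@google.com` matches both specific patterns and lands on `partners` because that rule sorts first. Escaping the dot matters: without it, `*@google.com` would also match a domain like `googleXcom`.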
Queue: A stateful job tracker with explicit state boundaries (pending → processing → completed / failed), attempt tracking, and a setInterval worker polling every 500ms. If a job fails and the attempt count is under the maximum, the status resets to pending, and the job is retried automatically (up to 3 times). Neither of the other two tools implements retry logic.
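The state machine described above can be approximated as follows. This is a sketch under assumptions: the handler is synchronous to keep the example short, "up to 3 times" is read as three attempts in total, and `TrackedJob`, `processNext`, and `describeJob` are hypothetical names.

```typescript
type JobStatus = "pending" | "processing" | "completed" | "failed";

interface TrackedJob {
  id: string;
  status: JobStatus;
  attempts: number;
}

// Assumption: three attempts in total, after which the job is failed.
const MAX_ATTEMPTS = 3;

// One worker tick: pick the first pending job, run it, record the outcome.
function processNext(
  jobs: TrackedJob[],
  handler: (job: TrackedJob) => void,
): void {
  const job = jobs.find((j) => j.status === "pending");
  if (job === undefined) return;

  job.status = "processing";
  job.attempts += 1;
  try {
    handler(job);
    job.status = "completed";
  } catch {
    // Back to pending for another try, or failed once attempts run out.
    job.status = job.attempts < MAX_ATTEMPTS ? "pending" : "failed";
  }
}

// Small helper for inspecting a job, e.g. "pending:1".
function describeJob(job: TrackedJob): string {
  return `${job.status}:${job.attempts}`;
}
```

In a running service the tick would be scheduled with something like `setInterval(() => processNext(jobs, handler), 500)`, matching the 500ms polling interval described above. Note that a job reset to pending is retried on the very next tick, with no backoff between attempts.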
OAuth: A toggleable service class. If Google credentials are present, it uses the standard OAuth flow. If not, it falls back to a mock mode that renders a visual HTML consent screen and generates dummy tokens locally, useful for testing without cloud credentials. No HMAC-signed state validation.
The cost of all this: TypeScript covers only 37% of the codebase because of the unrequested frontend. Three large-batch commits suggest generation in bulk rather than deliberate iteration. The tool crashed twice on launch before producing any of it.
The verdict: The most feature-complete output of the three, and the least suitable for a production codebase. The routing and queue engines are better-designed than the alternatives, but the monitoring dashboard is scope creep and the type safety is the weakest of the three.
For a prototype or stakeholder demo, it delivers more than either competitor. For a codebase you’ll maintain, the 63% non-TypeScript ratio is a liability.
Full technical comparison
| Area | Claude Code | Codex | Antigravity |
|---|---|---|---|
| Primary objective | Real-time Gmail ingestion pipeline | Validated API scaffold with enterprise security | Full-stack demo with interactive dashboard |
| User interface | None — headless API | None — headless API | Full web dashboard, glassmorphic and real-time |
| Module system | CommonJS / ts-node | ESM / tsx watch | CommonJS / ts-node |
| Input validation | Manual parsing using string methods and regex | Schema validation via Zod | Manual parsing using JS presence checks |
| Gmail integration | ✅ Full — googleapis, pulls unread, supports Pub/Sub | ⚠️ OAuth only — no googleapis, no Gmail calls | ⚠️ OAuth only — mock payloads, no Gmail calls |
| Ingestion model | Active sync loop: pulls or receives Pub/Sub push | Passive receiver: accepts forwarded JSON payloads | Sandbox simulation: manual input + mock batch |
| OAuth state protection | ❌ No state validation | ✅ HMAC-signed nonce, timestamp, timing-safe validation | ❌ No signed state — focuses on real/mock toggle |
| Routing algorithm | Linear scan, exact domain match, O(n) | Dictionary lookup, exact domain match, O(1) | Regex wildcard matching, priority-sorted, dynamic API |
| Route configurability | Static — edit router.ts | Static — edit route map | Dynamic — add/delete rules via API at runtime |
| Queue behavior | Single global FIFO, async drain loop, no retries | Segmented by queue name, no background worker | Stateful job tracker, 500ms interval worker, 3 retries |
| Error handling | Dead-letter to stderr, continues loop | Express error middleware to response schemas | Stateful retry — resets to pending if under max attempts |
| TypeScript coverage | 100% | 100% | 37% |
| Key dependencies | googleapis, google-auth-library, express | zod, google-auth-library, express | google-auth-library, cors, express |
Final scorecard
| | Claude Code | Codex | Antigravity |
|---|---|---|---|
| Install experience | ✅ Clean | ✅ Clean | ❌ Crashed twice |
| End-to-end autonomy | ✅ Full — wrote, configured, committed, pushed | ⚠️ Manual git steps needed | ⚠️ Output produced after crashes |
| Scope discipline | ✅ Exactly what was asked | ✅ Exactly what was asked | ⚠️ Significantly over-scoped |
| TypeScript coverage | ✅ 100% | ✅ 100% | ⚠️ 37% |
| OAuth security | ❌ No state validation | ✅ HMAC-signed, timestamped, nonce | ❌ No state validation |
| Gmail integration | ✅ Full googleapis integration | ⚠️ OAuth only, no Gmail calls | ⚠️ Mock payloads only |
| Input validation | ⚠️ Manual parsing | ✅ Zod schema validation | ⚠️ Manual parsing |
| Routing sophistication | ⚠️ Exact match, static, O(n) | ⚠️ Exact match, static, O(1) | ✅ Wildcard regex, priority-sorted, dynamic |
| Queue sophistication | ⚠️ FIFO, no retries | ⚠️ No background worker | ✅ State machine, auto-retry, interval worker |
| Production guidance | ✅ Explicit swap-out table | ✅ Explicit README notes | ⚠️ Implicit in structure |
Our Verdict
- Claude Code delivered the most complete working Gmail ingestion pipeline. It built the core workflow end to end, but it missed OAuth state validation and needs targeted security hardening before production.
- Codex produced the strongest security skeleton, including signed OAuth state protection and schema validation, but it did not implement a full Gmail ingestion pipeline.
- Antigravity created the richest prototype, with dynamic routing, retry logic, and a dashboard, but it expanded the scope significantly and had the weakest TypeScript coverage.
Conclusion
Running all three tools on the same task produced a cleaner answer than we expected, not because one tool won outright, but because each tool’s output maps precisely to a specific type of work.
Claude Code won the workflow test.
It handled the full task without hand-holding, and it’s the only tool that built something you could actually point at a Gmail inbox. The approval-first model maps well to how professional teams want to work with an AI agent: it asks before acting, stays in scope, and leaves the codebase in a state another developer can pick up immediately.
The OAuth state gap is real and worth one targeted fix. Everything else is a solid foundation. For teams building anything that processes real email, the googleapis integration alone makes it the practical starting point.
Choose Claude Code if you need real Gmail integration, you want an agent that executes the full workflow without you finishing the job, and you’re working on a codebase where scope discipline and maintainability matter.
Codex: for security-conscious teams building to last.
Codex produced the most secure and structurally sound codebase of the three. HMAC-signed OAuth state, zod schema validation, ESM module system, isolated type definitions — these aren’t style choices, they’re the decisions that make a codebase easier to audit, extend, and hand off. A developer picking up Codex’s output would ask the fewest questions in code review.
The trade-offs are real: no Gmail integration, no background worker, manual git steps. It’s better understood as a disciplined pair programmer than a fully autonomous agent.
Choose Codex if you’re already in the ChatGPT ecosystem, your team does thorough code review, and you’d rather have a clean, security-hardened skeleton you extend yourself than an end-to-end output you need to audit.
Antigravity: for prototyping and demos
Antigravity produced the most ambitious output of the three. Its wildcard routing engine, stateful retry queue, and runtime route management are technically better-designed than the equivalent components in either competitor. If you’re building a demo, shipping a prototype to stakeholders, or stress-testing what agent-first development looks like, it gets you further faster.
But it crashed twice on launch. TypeScript covers only 37% of the output. It rewrote the scope of the task without asking. And the broader reliability story suggests a tool still finding its production footing.
Choose Antigravity if you’re prototyping, running a hackathon, or need to show stakeholders something that looks and feels like a complete product. Don’t put it at the center of a codebase you need to maintain.
The honest summary
No tool won across the board. Each tool reflects a distinct development philosophy, and that philosophy shows up directly in the code it writes.
- Claude Code is the engineer: scoped, end-to-end, ships what was asked.
- Codex is the security architect: disciplined, extensible, conservative by design.
- Antigravity is the ambitious prototype: it builds more than you asked for, with less type coverage, and crashes on the way in.
For a customer support platform like Kommunicate, where reliability and maintainability matter, Claude Code is the practical choice for production work. For teams prototyping new features or evaluating what agent-first development actually delivers, Antigravity is worth the instability tax.
The right answer depends less on benchmark scores and more on where you sit in the development lifecycle. All three tools are moving fast. The gap between them is closing. Check back in six months.
Frequently Asked Questions
Claude Code was better for end-to-end workflow completion in our test. It built a working Gmail ingestion pipeline and completed more of the task without manual intervention. Codex produced cleaner security foundations but did not implement full Gmail ingestion.
Codex may be better for teams that prioritize security review, schema validation, and extensible structure. Claude Code may be better for teams that want a working implementation faster, as long as they review and fix security gaps.
Google Antigravity is best for rapid prototypes, demos, and feature exploration. In our test, it produced the most complete user-facing demo, but it also expanded the scope and had weaker TypeScript coverage.
There is no universal winner. Claude Code is best for scoped engineering tasks, Codex is best for security-conscious foundations, and Antigravity is best for prototypes.
Use Claude Code for production MVPs, Codex for auditable security-conscious scaffolds, and Antigravity for stakeholder demos or rapid prototyping.

A Content Marketing Manager at Kommunicate, Uttiya brings in 11+ years of experience across journalism, D2C and B2B tech. He’s excited by the evolution of AI technologies and is interested in how it influences the future of existing industries.


