Estimated reading time: 17 minutes

Key Takeaway

We tested Claude Code, Codex, and Antigravity on the same Node.js email routing task and found that each tool fits a different development workflow:

  • Claude Code delivered the most complete working implementation, including real Gmail ingestion, but needs targeted security hardening before production.
  • Codex produced the strongest security foundation, with signed OAuth state protection and schema validation, but did not build a full Gmail ingestion pipeline.
  • Antigravity created the richest prototype, with dynamic routing, retries, and a dashboard, but expanded the scope and had the weakest TypeScript coverage.

Best overall: Claude Code for scoped production MVPs, Codex for security-conscious scaffolding, and Antigravity for demos or rapid prototypes.

Every AI coding tool in 2026 claims to make developers faster. But which one actually performs best on a real engineering task?

At Kommunicate, our engineering team uses AI coding tools daily. We got curious about whether the marketing matched reality, especially for a generation of tools that now claim to be “agents,” not just assistants. 

So we ran a hands-on experiment: install all three, give each the same real-world task, and document exactly what came out the other side. In this article, we compare the three agentic tools (Claude Code, Antigravity and Codex) and tell you which performed best. We’ll talk about:

  1. Which models did we use?
  2. Claude Code vs Antigravity 2.0 vs Codex
  3. Our experiment: Setup and methodology
  4. Results: How did each tool perform?
  5. Conclusion

Which models did we use?

Comparison graphic showing the models behind three AI coding tools: Claude Sonnet 4.6 for Claude Code, GPT-5.5 for Codex, and Gemini 3.5 Flash for Antigravity, with benchmark highlights for repo-level code, terminal operations, and multi-tool performance.
Models Behind AI Coding Tools

Before comparing the platforms, it’s worth understanding the AI models that we used for our experiment. The differences between these models are real, and they directly shape the code you get out the other end.

1. GPT-5.5 (Codex)

OpenAI’s GPT-5.5 launched on April 23, 2026 and is the model currently powering Codex. Designed as an efficiency-first flagship, it brings significant improvements in tool use and agentic task completion over its predecessors.

The standout number is Terminal-Bench 2.0 at 82.7%, the highest of the three models here, and a meaningful signal for a tool like Codex that runs terminal commands as part of its core workflow.

2. Claude Sonnet 4.6 (Claude Code)

Released February 17, 2026, Sonnet 4.6 is the model powering Claude Code’s default experience. Anthropic positioned it as near-Opus performance at mid-tier pricing, and the benchmarks back that up.

SWE-Bench Verified at 79.6% is the headline. SWE-Bench tests models on real GitHub issues from production codebases. It’s the closest benchmark to “can this model ship production code?” and Sonnet 4.6 leads the three models here. 

However, Sonnet 4.6 scores only 59.1% on Terminal-Bench, the lowest of the three models here.

In Claude Code usage data, developers preferred Sonnet 4.6 over the previous flagship, Sonnet 4.5, 70% of the time, citing better instruction following and less overengineering.

3. Gemini 3.5 Flash (Antigravity)

Announced at Google I/O 2026 on May 19, Gemini 3.5 Flash is the default model in Antigravity. It is a Flash-tier model (fast, cheap) that beats last year’s Pro model on several coding benchmarks.

| Benchmark | Gemini 3.5 Flash Score |
| --- | --- |
| Terminal-Bench 2.1 | 76.2% |
| MCP Atlas (multi-tool) | 83.6% |
| SWE-Bench Pro | 55.1% |

MCP Atlas at 83.6% is the number that matters for Antigravity (it measures multi-step tool orchestration across search, file operations, and data handling). For an IDE that dispatches multiple agents in parallel, that’s the relevant signal. 

Gemini 3.5 Flash also runs roughly 4x faster on output tokens per second than comparable frontier models, which is how Antigravity can manage parallel agent workloads without becoming unusably slow.

Model benchmark summary

Claude Sonnet 4.6 GPT-5.5 Gemini 3.5 Flash
SWE-bench Pro (Public) 58.6% 55.1%
SWE-bench Verified 79.6% 78%
Terminal-Bench 59.1% 82.7% 76.2%
Multi-tool (MCP Atlas) 75.3% 83.6%
HumanEval 98% ~97% ~96%
Output speed Moderate Moderate 4x faster
Pricing (input/output per 1M tokens) $3 / $15 ~$7.50 / $30 $1.50 / $9

No single model dominates across the board. 

  • Sonnet 4.6 leads on real-world repo-level coding. 
  • GPT-5.5 leads on terminal operations. 
  • Gemini 3.5 Flash leads on multi-tool orchestration and is by far the cheapest and fastest of the three.

Now that we know about the AI models, let’s talk about the coding agents. 

Claude Code vs Antigravity 2.0 vs Codex

Comparison table showing Claude Code, Codex, and Antigravity 2.0 across workflow model, autonomy, and best use case, with Claude Code positioned for production engineering, Codex for security-conscious teams, and Antigravity for rapid prototyping.
Three AI Coding Agent Philosophies

The model is only part of the story. The platform is where the experience diverges most sharply.

1. Claude Code is Anthropic’s terminal-native coding agent. It lives in your command line, has direct access to your filesystem and git, and operates on an approval-first model: it reasons about your full codebase, asks clarifying questions before starting, and requires explicit confirmation before any destructive action.

2. OpenAI Codex started as a CLI in April 2025 and has expanded to a desktop app, IDE integrations, and cloud-based execution. It’s built around an async task-delegation model. It’s bundled into ChatGPT Plus ($20/month) rather than sold separately, making onboarding easy for anyone already in the OpenAI ecosystem.

3. Google Antigravity is the most architecturally ambitious of the three. An agent-first IDE built around a “Mission Control” interface, it dispatches up to five autonomous agents working in parallel across your editor, terminal, and a built-in Chromium browser. The philosophy is that you’re a manager of agents, not a developer writing every line. Currently free during public preview.

    Let’s compare these coding platforms on some metrics before we begin our experiments.

    Platform comparison

    | | Claude Code | Codex | Antigravity |
    | --- | --- | --- | --- |
    | Underlying model | Claude Sonnet 4.6 | GPT-5.5 | Gemini 3.5 Flash |
    | Interface | Terminal / CLI | Desktop app, CLI, IDE, web | Agent-first IDE (VS Code fork) |
    | Workflow model | Approval-first, collaborative | Async task delegation | Multi-agent orchestration |
    | Autonomy level | High, with checkpoints | High, async | Very high, parallel agents |
    | Multi-agent support | No | Limited | Yes, up to 5 parallel |
    | MCP support | Yes | Difficult to integrate | Yes, added March 2026 |
    | Git & terminal access | Yes, native | Yes | Yes, with a browser |
    | Pricing | $17/mo Pro (annual) | Bundled in ChatGPT Plus ($20/mo) | Free (preview) |
    | Platform stability | Production-ready | Production-ready | Public preview |

    Now, let’s start building with these platforms.

    Our experiment: Setup and methodology

    The Prompt

    To keep it neutral, we gave the same prompt to all of the coding platforms:

    Design a basic email routing architecture with Google OAuth. Include: OAuth 2.0 flow, email ingestion endpoint, routing logic by sender domain, and a simple queue. Implement in Node.js/TypeScript, wire it up, and push it to GitHub.

    This task is broad enough to test judgment: it spans auth, API integration, data flow, and git workflow, involves a real external service, and leaves enough ambiguity that how each tool interprets it is revealing.

    Evaluation criteria

    We assessed each output across five dimensions:

    | Dimension | What We Looked At |
    | --- | --- |
    | Workflow autonomy | Did the tool complete the full task end-to-end, including git push, or require manual steps? |
    | Code scope | Did it build what was asked, or expand the brief without prompting? |
    | Type safety | What percentage of the codebase is TypeScript? Higher = tighter discipline. |
    | Security awareness | Did it address OAuth state validation, token encryption, and scope minimization unprompted? |
    | Production readiness | Did it document or implement the path from demo to production? |

    The outputs

    All three outputs are public, committed exactly as each tool produced them without cleanup:

    Now, let’s see how each tool performed in this task.

    Results: How did each tool perform?

    Install and workflow experience

    Before any code was written, the experience itself was already telling.

    Claude Code installed cleanly. It executed the full task without interruption. It wrote the code, configured the environment, wired up the git remote, and pushed the commit. No manual steps required.

    Codex installed cleanly but stalled at the finish line. The git configuration and local run required manual intervention. The code it produced was good; the end-to-end workflow automation was not.

    Antigravity crashed twice on launch, on a supported OS configuration, before the application opened. When it eventually loaded, there were bugs in the UI. It produced output, but the install experience is a data point in itself.

    The new Antigravity 2.0 just launched and has documented bugs. Gemini 3.5 Flash is also not the most competent model. Both deficiencies have been documented by Theo from t3.gg.

    Code Audit

    1. Claude Code: a real Gmail ingestion pipeline

    TypeScript: 100% | Commits: 2 | Module system: CommonJS via ts-node | Port: 3000

    Architecture

    Architecture diagram showing the Claude Code email router structure, with src files for Express app, Google OAuth, Gmail ingestion, in-memory FIFO queue, domain router, and web monitoring dashboard.
    Claude Code Email Router Architecture

    Claude Code is the only tool that actually reads Gmail. It uses the googleapis package directly: authenticating, querying the Gmail client with is:unread, downloading message metadata, and extracting From, Subject, and Date headers sequentially. It also supports Pub/Sub push via /webhook/gmail, which is the production-appropriate ingestion pattern rather than polling.
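
To make the header-extraction step concrete, here is a minimal sketch. It assumes each message carries the Gmail API's payload.headers list of name/value pairs. The EmailJob shape and the helper names (headerValue, senderDomainOf, toEmailJob) are illustrative and are not taken from the Claude Code output.

```typescript
// Name/value pair as found in the Gmail API message field payload.headers.
interface GmailHeader {
  name: string;
  value: string;
}

// Hypothetical job shape; the real project's type may differ.
interface EmailJob {
  from: string;
  subject: string;
  date: string;
  senderDomain: string;
}

// Header names are case-insensitive, so compare in lower case.
function headerValue(headers: GmailHeader[], name: string): string {
  const wanted = name.toLowerCase();
  const found = headers.find((h) => h.name.toLowerCase() === wanted);
  return found ? found.value : "";
}

// Accepts "Ada <ada@example.com>" as well as a bare "ada@example.com".
function senderDomainOf(from: string): string {
  const bracketed = /<([^>]+)>/.exec(from);
  const address = (bracketed?.[1] ?? from).trim();
  const at = address.lastIndexOf("@");
  return at === -1 ? "" : address.slice(at + 1).toLowerCase();
}

// Collapses the three headers the router cares about into one job object.
function toEmailJob(headers: GmailHeader[]): EmailJob {
  const from = headerValue(headers, "From");
  return {
    from,
    subject: headerValue(headers, "Subject"),
    date: headerValue(headers, "Date"),
    senderDomain: senderDomainOf(from),
  };
}
```

The sender domain is lower-cased at this point so that later routing comparisons can stay exact-match.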

    Routing: Simple linear array scan matching r.domain === job.senderDomain, with an empty string “” catch-all at the end. It routes to real email destinations (engineering@company.com, billing@company.com). The table lives in code, and to change a rule, you edit router.ts.
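
The linear scan with a trailing catch-all can be sketched as follows. The sender domains and the triage address are made up for illustration; only the two destination addresses and the r.domain comparison come from the description above.

```typescript
interface Route {
  domain: string;      // exact sender domain, or "" for the catch-all
  destination: string; // mailbox the job is forwarded to
}

// Hypothetical table: the domains and the catch-all address are assumptions.
const routes: Route[] = [
  { domain: "eng.example.com", destination: "engineering@company.com" },
  { domain: "vendor.example.com", destination: "billing@company.com" },
  { domain: "", destination: "triage@company.com" }, // catch-all, must stay last
];

function resolveDestination(senderDomain: string): string {
  // O(n) scan: the first exact match wins, otherwise the "" entry catches the job.
  const match = routes.find((r) => r.domain === senderDomain || r.domain === "");
  if (!match) {
    throw new Error("route table has no catch-all entry");
  }
  return match.destination;
}
```

Because the catch-all matches everything, its position in the array matters: moving it above a specific rule would shadow that rule.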

    Queue: A single global in-memory FIFO queue, drained by a continuous async worker loop. If a job fails, the error is caught and logged to stderr before moving to the next item — a dead-letter pattern. There is no retry counter.
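
A rough sketch of that log-and-continue drain loop is shown below. The class name, the job fields, and the handler signature are assumptions made for the example, not the code Claude Code committed.

```typescript
interface EmailJob {
  id: string;
  senderDomain: string;
}

type Handler = (job: EmailJob) => Promise<void>;

class FifoQueue {
  private readonly items: EmailJob[] = [];
  private readonly handler: Handler;
  private draining = false;
  private idle: Promise<void> = Promise.resolve();

  constructor(handler: Handler) {
    this.handler = handler;
  }

  // Adds a job and starts the worker if it is not already running.
  // The returned promise settles once the queue has been emptied.
  enqueue(job: EmailJob): Promise<void> {
    this.items.push(job);
    if (!this.draining) {
      this.draining = true;
      this.idle = this.drain();
    }
    return this.idle;
  }

  private async drain(): Promise<void> {
    try {
      let job: EmailJob | undefined;
      while ((job = this.items.shift()) !== undefined) {
        try {
          await this.handler(job);
        } catch (err) {
          // No retry counter: log the failure to stderr and move on.
          console.error(`job ${job.id} failed:`, err);
        }
      }
    } finally {
      this.draining = false;
    }
  }
}
```

A failed job is gone once it has been logged, which is why the production notes point toward a persistent queue with retries.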

    OAuth: Basic state-free redirect flow. It generates a login URL, accepts the callback code, and stores tokens in a Map under a default user key. Token refresh is handled automatically by google-auth-library events. Critically, there is no state parameter validation on the callback, and the endpoint is technically vulnerable to CSRF. Codex fixed this; Claude Code and Antigravity did not.

    Production guidance: The README includes an explicit swap-out table:

    | Demo Implementation | Production Replacement |
    | --- | --- |
    | tokenStore (in-memory Map) | Redis / Postgres keyed by user ID |
    | emailQueue (in-memory FIFO) | BullMQ + Redis (retries, priorities, persistence) |
    | deliverEmail (console.log) | Gmail forward, webhook POST, or DB insert |

    The verdict: The only output of the three you could point at a Gmail inbox and have it actually work. The routing and queue are deliberately simple, and the production notes are honest. The OAuth gap is real and worth one targeted fix, but the structural foundation is sound.

    2. Codex: An enterprise-grade security skeleton

    TypeScript: 100% | Commits: 6 | Module system: ESM ("type": "module") via tsx watch | Port: 3000

    Architecture

    Architecture diagram showing the Codex email router structure, with source files for server setup, configuration, shared TypeScript types, token storage, Google OAuth, email ingestion, queue management, and domain routing.
    Codex Email Router Architecture

    Codex built a validated API scaffold, not a Gmail integration. It accepts email payloads via POST /email/ingest; there is no googleapis dependency, no Gmail polling, no Pub/Sub. The ingestion model is passive: it expects upstream systems to forward normalized email payloads to the endpoint. This is a deliberate architectural choice, not an oversight, and it’s the right pattern for a microservice that shouldn’t own the mail transport layer.

    Input validation: Codex is the only tool to use zod for schema validation, enforcing typed runtime contracts on incoming API requests. Claude Code and Antigravity both use manual parsing: custom string methods, regex, and basic JavaScript presence checks (!sender).

    Routing: Constant-time O(1) dictionary lookup (domainRoutes[senderDomain(email)]) falling back to a defaultRoute object when unmatched. Faster and more predictable than Claude Code’s linear scan, though at the scale of email routing, the difference is academic.

    Queue: In-memory structures segmented by target queue name (Map<string, QueuedEmailJob[]>), covering customer-success, vendor-ops, partnerships, and general-triage. There is no background worker in the committed code — jobs are enqueued but not actively drained.
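
The lookup-then-enqueue path described in the routing and queue notes might look roughly like this. The sender domains, the IncomingEmail and RouteTarget shapes, and the enqueue helper are assumptions; the four queue names and the Map<string, QueuedEmailJob[]> layout come from the description above. A Map stands in for the plain object here so that keys such as "constructor" cannot hit Object.prototype.

```typescript
interface IncomingEmail {
  from: string;
  subject: string;
}

interface RouteTarget {
  queue: string;
}

interface QueuedEmailJob {
  email: IncomingEmail;
  queue: string;
  enqueuedAt: number;
}

const defaultRoute: RouteTarget = { queue: "general-triage" };

// Hypothetical domain table; only the queue names are from the article.
const domainRoutes = new Map<string, RouteTarget>([
  ["customer.example.com", { queue: "customer-success" }],
  ["vendor.example.com", { queue: "vendor-ops" }],
  ["partner.example.com", { queue: "partnerships" }],
]);

// One in-memory bucket per target queue name.
const queues = new Map<string, QueuedEmailJob[]>();

function senderDomain(email: IncomingEmail): string {
  const at = email.from.lastIndexOf("@");
  return at === -1 ? "" : email.from.slice(at + 1).toLowerCase();
}

function enqueue(email: IncomingEmail): QueuedEmailJob {
  // Constant-time lookup with a fallback when the domain is unknown.
  const target = domainRoutes.get(senderDomain(email)) ?? defaultRoute;
  const job: QueuedEmailJob = { email, queue: target.queue, enqueuedAt: Date.now() };
  const bucket = queues.get(target.queue) ?? [];
  bucket.push(job);
  queues.set(target.queue, bucket);
  return job;
}
```

Nothing in this sketch drains the buckets, which mirrors the committed code: jobs accumulate until a worker is added.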

    OAuth: The strongest implementation of the three. Codex signs the OAuth state parameter using crypto.createHmac("sha256") with a user nonce and timestamp, validates the signature on callback return, checks expiry, and uses timing-safe comparison to prevent timing attacks. This is the correct CSRF mitigation for an OAuth 2.0 callback, and neither Claude Code nor Antigravity implemented it. The README also explicitly flags “encrypt refresh tokens at rest” as a production requirement.
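
For readers who have not built this before, a signed state parameter can be sketched with Node's built-in crypto module. This is an illustration of the technique, not Codex's code: the function names, the nonce.timestamp.signature layout, and the ten-minute expiry are all assumptions.

```typescript
import { createHmac, randomBytes, timingSafeEqual } from "node:crypto";

// Assumed lifetime for a pending login; tune to taste.
const STATE_TTL_MS = 10 * 60 * 1000;

function sign(payload: string, secret: string): string {
  return createHmac("sha256", secret).update(payload).digest("hex");
}

// Builds "nonce.issuedAt.signature" to send as the OAuth state parameter.
function createState(secret: string, now: number = Date.now()): string {
  const nonce = randomBytes(16).toString("hex");
  const payload = `${nonce}.${now}`;
  return `${payload}.${sign(payload, secret)}`;
}

// Checks the signature first, then the age of the state.
function verifyState(state: string, secret: string, now: number = Date.now()): boolean {
  const parts = state.split(".");
  if (parts.length !== 3) return false;
  const [nonce = "", issuedAt = "", signature = ""] = parts;

  const expected = Buffer.from(sign(`${nonce}.${issuedAt}`, secret));
  const actual = Buffer.from(signature);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (expected.length !== actual.length) return false;
  if (!timingSafeEqual(expected, actual)) return false;

  const age = now - Number(issuedAt);
  return Number.isFinite(age) && age >= 0 && age <= STATE_TTL_MS;
}
```

On the callback route, a request whose state fails verifyState would be rejected before the authorization code is exchanged. A fuller implementation would typically also tie the nonce to the user's session so a state value minted for one browser cannot be replayed in another.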

    Six commits versus Claude Code’s two reflect Codex’s iterative style. The ESM module system, dedicated types.ts, and separated tokenStore.ts all signal a codebase designed to be extended without stepping on itself.

    The verdict: The cleanest, most security-conscious skeleton of the three. Not a working Gmail integration, but if you’re building something you’ll take through a security review, this is the foundation with the fewest “why didn’t you handle this?” questions in code review.

    3. Antigravity: A full-stack demo that exceeds the brief

    TypeScript: 36.9% | CSS: 27.6% | JavaScript: 22.4% | HTML: 13.1% | Commits: 3 | Module system: CommonJS via ts-node | Port: 3050

    Architecture

    Architecture diagram showing the Antigravity email router structure, with a dashboard UI, styling and live controls, plus backend files for OAuth, email ingestion, stateful retries, and priority wildcard routing.
    Antigravity Email Router Architecture

    The language breakdown says everything before you open a file. Antigravity built a product. The public/ folder contains a complete monitoring dashboard with real-time queue visualization, a route management UI, and a webhook sandbox. No one asked for this.

    But examining the backend closely, the picture becomes more nuanced. Antigravity’s implementations of routing and queuing are technically the most sophisticated of the three.

    Routing: Antigravity translates wildcard patterns (*@google.com, sales@*, *) into operational regex blocks (.*), sorts rules by numeric priority fields, and tests each email against the sorted rule set sequentially. This supports complex, prioritized matching that neither Claude Code’s linear scan nor Codex’s dictionary lookup can express. Rules can also be added and deleted at runtime via API — no code edit required.
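
The wildcard-to-regex translation and priority ordering can be sketched as below. The class and method names are illustrative, and the sketch assumes that a lower priority number is evaluated first, which the description above does not specify.

```typescript
interface Rule {
  pattern: string;     // e.g. "*@google.com", "sales@*", "*"
  priority: number;    // assumption: lower numbers are tested first
  destination: string;
}

interface CompiledRule extends Rule {
  regex: RegExp;
}

// Escapes regex metacharacters, then turns each "*" into ".*".
function patternToRegex(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$", "i");
}

class WildcardRouter {
  private rules: CompiledRule[] = [];

  // Rules can be added at runtime; the list is kept sorted by priority.
  addRule(rule: Rule): void {
    this.rules.push({ ...rule, regex: patternToRegex(rule.pattern) });
    this.rules.sort((a, b) => a.priority - b.priority);
  }

  // Returns true if a rule with this pattern existed and was removed.
  removeRule(pattern: string): boolean {
    const before = this.rules.length;
    this.rules = this.rules.filter((r) => r.pattern !== pattern);
    return this.rules.length < before;
  }

  // Tests the sender against each rule in priority order; first match wins.
  route(sender: string): string | undefined {
    for (const rule of this.rules) {
      if (rule.regex.test(sender)) return rule.destination;
    }
    return undefined;
  }
}
```

Escaping before substituting the wildcard matters: without it, the dot in google.com would match any character.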

    Queue: A stateful job tracker with explicit state boundaries (pending → processing → completed / failed), attempt tracking, and a setInterval worker polling every 500ms. If a job fails and the attempt count is under the maximum, the status resets to pending, and the job is retried automatically (up to 3 times). Neither of the other two tools implements retry logic.
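
A compact sketch of that state machine follows. The names are illustrative, and it assumes "up to 3 times" means three attempts in total. The polling step is exposed as tick() so the logic can be exercised without waiting on a timer.

```typescript
type JobStatus = "pending" | "processing" | "completed" | "failed";

interface TrackedJob {
  id: string;
  status: JobStatus;
  attempts: number;
}

const MAX_ATTEMPTS = 3;       // assumption: three attempts in total
const POLL_INTERVAL_MS = 500; // polling interval from the description above

class RetryQueue {
  private readonly jobs: TrackedJob[] = [];
  private readonly handler: (job: TrackedJob) => Promise<void>;

  constructor(handler: (job: TrackedJob) => Promise<void>) {
    this.handler = handler;
  }

  add(id: string): TrackedJob {
    const job: TrackedJob = { id, status: "pending", attempts: 0 };
    this.jobs.push(job);
    return job;
  }

  // Processes at most one pending job per call.
  async tick(): Promise<void> {
    const job = this.jobs.find((j) => j.status === "pending");
    if (!job) return;
    job.status = "processing";
    job.attempts += 1;
    try {
      await this.handler(job);
      job.status = "completed";
    } catch {
      // Back to pending while attempts remain, otherwise give up.
      job.status = job.attempts < MAX_ATTEMPTS ? "pending" : "failed";
    }
  }

  // Starts the interval worker; pass the result to clearInterval to stop it.
  start(): ReturnType<typeof setInterval> {
    return setInterval(() => {
      void this.tick();
    }, POLL_INTERVAL_MS);
  }
}
```

Marking the job as processing before the handler runs keeps an overlapping tick from picking up the same job twice.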

    OAuth: A toggleable service class. If Google credentials are present, it uses the standard OAuth flow. If not, it falls back to a mock mode that renders a visual HTML consent screen and generates dummy tokens locally, useful for testing without cloud credentials. No HMAC-signed state validation.

    The cost of all this: TypeScript covers only 37% of the codebase because of the unrequested frontend. Three large-batch commits suggest generation in bulk rather than deliberate iteration. The tool crashed twice on launch before producing any of it.

    The verdict: The most feature-complete output of the three, and the least suitable for a production codebase. The routing and queue engines are better-designed than the alternatives. The monitoring dashboard is scope creep. 

    The type safety is the weakest. For a prototype or stakeholder demo, it delivers more than either competitor. For a codebase you’ll maintain, the 63% non-TypeScript ratio is a liability.

    Full technical comparison

    | Area | Claude Code | Codex | Antigravity |
    | --- | --- | --- | --- |
    | Primary objective | Real-time Gmail ingestion pipeline | Validated API scaffold with enterprise security | Full-stack demo with interactive dashboard |
    | User interface | None (headless API) | None (headless API) | Full web dashboard, glassmorphic and real-time |
    | Module system | CommonJS / ts-node | ESM / tsx watch | CommonJS / ts-node |
    | Input validation | Manual parsing using string methods and regex | Schema validation via Zod | Manual parsing using JS presence checks |
    | Gmail integration | ✅ Full: googleapis, pulls unread, supports Pub/Sub | ⚠️ OAuth only: no googleapis, no Gmail calls | ⚠️ OAuth only: mock payloads, no Gmail calls |
    | Ingestion model | Active sync loop: pulls or receives Pub/Sub push | Passive receiver: accepts forwarded JSON payloads | Sandbox simulation: manual input + mock batch |
    | OAuth state protection | ❌ No state validation | ✅ HMAC-signed nonce, timestamp, timing-safe validation | ❌ No signed state; focuses on real/mock toggle |
    | Routing algorithm | Linear scan, exact domain match, O(n) | Dictionary lookup, exact domain match, O(1) | Regex wildcard matching, priority-sorted, dynamic API |
    | Route configurability | Static: edit router.ts | Static: edit route map | Dynamic: add/delete rules via API at runtime |
    | Queue behavior | Single global FIFO, async drain loop, no retries | Segmented by queue name, no background worker | Stateful job tracker, 500ms interval worker, 3 retries |
    | Error handling | Dead-letter to stderr, continues loop | Express error middleware to response schemas | Stateful retry: resets to pending if under max attempts |
    | TypeScript coverage | 100% | 100% | 37% |
    | Key dependencies | googleapis, google-auth-library, express | zod, google-auth-library, express | google-auth-library, cors, express |

    Final scorecard

    | | Claude Code | Codex | Antigravity |
    | --- | --- | --- | --- |
    | Install experience | ✅ Clean | ✅ Clean | ❌ Crashed twice |
    | End-to-end autonomy | ✅ Full: wrote, configured, committed, pushed | ⚠️ Manual git steps needed | ⚠️ Output produced after crashes |
    | Scope discipline | ✅ Exactly what was asked | ✅ Exactly what was asked | ⚠️ Significantly over-scoped |
    | TypeScript coverage | ✅ 100% | ✅ 100% | ⚠️ 37% |
    | OAuth security | ❌ No state validation | ✅ HMAC-signed, timestamped, nonce | ❌ No state validation |
    | Gmail integration | ✅ Full googleapis integration | ⚠️ OAuth only, no Gmail calls | ⚠️ Mock payloads only |
    | Input validation | ⚠️ Manual parsing | ✅ Zod schema validation | ⚠️ Manual parsing |
    | Routing sophistication | ⚠️ Exact match, static, O(n) | ⚠️ Exact match, static, O(1) | ✅ Wildcard regex, priority-sorted, dynamic |
    | Queue sophistication | ⚠️ FIFO, no retries | ⚠️ No background worker | ✅ State machine, auto-retry, interval worker |
    | Production guidance | ✅ Explicit swap-out table | ✅ Explicit README notes | ⚠️ Implicit in structure |

    Our Verdict

    1. Claude Code delivered the most complete working Gmail ingestion pipeline. It built the core workflow end to end, but it missed OAuth state validation and needs targeted security hardening before production.
    2. Codex produced the strongest security skeleton, including signed OAuth state protection and schema validation, but it did not implement a full Gmail ingestion pipeline.
    3. Antigravity created the richest prototype, with dynamic routing, retry logic, and a dashboard, but it expanded the scope significantly and had the weakest TypeScript coverage.

    Conclusion

    Running all three tools on the same task produced a cleaner answer than we expected, not because one tool won outright, but because each tool’s output maps precisely to a specific type of work.

    Claude Code won the workflow test.

    It handled the full task without hand-holding, and it’s the only tool that built something you could actually point at a Gmail inbox. The approval-first model maps well to how professional teams want to work with an AI agent: it asks before acting, stays in scope, and leaves the codebase in a state another developer can pick up immediately.

    The OAuth state gap is real and worth one targeted fix. Everything else is a solid foundation. For teams building anything that processes real email, the googleapis integration alone makes it the practical starting point.

    Choose Claude Code if you need real Gmail integration, you want an agent that executes the full workflow without you finishing the job, and you’re working on a codebase where scope discipline and maintainability matter.

    Codex: for security-conscious teams building to last. 

    Codex produced the most secure and structurally sound codebase of the three. HMAC-signed OAuth state, zod schema validation, ESM module system, isolated type definitions — these aren’t style choices, they’re the decisions that make a codebase easier to audit, extend, and hand off. A developer picking up Codex’s output would ask the fewest questions in code review.

    The trade-offs are real: no Gmail integration, no background worker, manual git steps. It’s better understood as a disciplined pair programmer than a fully autonomous agent.

    Choose Codex if you’re already in the ChatGPT ecosystem, your team does thorough code review, and you’d rather have a clean, security-hardened skeleton you extend yourself than an end-to-end output you need to audit.

    Antigravity: for prototyping and demos 

    Antigravity produced the most ambitious output of the three. Its wildcard routing engine, stateful retry queue, and runtime route management are technically better-designed than the equivalent components in either competitor. If you’re building a demo, shipping a prototype to stakeholders, or stress-testing what agent-first development looks like, it gets you further faster.

    But it crashed twice on launch. TypeScript covers only 37% of the output. It rewrote the scope of the task without asking. And the broader reliability story suggests a tool still finding its production footing.

    Choose Antigravity if you’re prototyping, running a hackathon, or need to show stakeholders something that looks and feels like a complete product. Don’t put it at the center of a codebase you need to maintain.

    The honest summary

    No tool won across the board. Each tool reflects a distinct development philosophy, and that philosophy shows up directly in the code it writes.

    1. Claude Code is the engineer: scoped, end-to-end, ships what was asked. 
    2. Codex is the security architect: disciplined, extensible, conservative by design.
    3. Antigravity is the ambitious prototype: it builds more than you asked for, with less type coverage, and crashes on the way in.

    For a customer support platform like Kommunicate, where reliability and maintainability matter, Claude Code is the practical choice for production work. For teams prototyping new features or evaluating what agent-first development actually delivers, Antigravity is worth the instability tax.

    The right answer depends less on benchmark scores and more on where you sit in the development lifecycle. All three tools are moving fast. The gap between them is closing. Check back in six months.

    Frequently Asked Questions

    Is Claude Code better than Codex?

    Claude Code was better for end-to-end workflow completion in our test. It built a working Gmail ingestion pipeline and completed more of the task without manual intervention. Codex produced cleaner security foundations but did not implement full Gmail ingestion.

    Is Codex better than Claude Code for production code?

    Codex may be better for teams that prioritize security review, schema validation, and extensible structure. Claude Code may be better for teams that want a working implementation faster, as long as they review and fix security gaps.

    What is Google Antigravity best for?

    Google Antigravity is best for rapid prototypes, demos, and feature exploration. In our test, it produced the most complete user-facing demo, but it also expanded the scope and had weaker TypeScript coverage.

    Which AI coding agent is best overall?

    There is no universal winner. Claude Code is best for scoped engineering tasks, Codex is best for security-conscious foundations, and Antigravity is best for prototypes.

    Which tool should developers use in 2026?

    Use Claude Code for production MVPs, Codex for auditable security-conscious scaffolds, and Antigravity for stakeholder demos or rapid prototyping.
