Claude Opus 4.5 · Anthropic · partial | 76.80% SWE-bench Verified · medium | Input: $5; Output: $25 | Cache read: $0.5; cache write (5m): $6.25; context: not disclosed | Coding agents, Repo refactor, Code review | Strong SWE-bench evidence in captured source, but expensive output pricing; do not present as universal best model. Source: Anthropic Claude API pricing · checked 2026-05-28 · confidence medium
Claude Sonnet 4.5 · Anthropic · partial | 71.40% SWE-bench Verified · medium | Input: $3; Output: $15 | Cache read: $0.3; cache write (5m): $3.75; context: not disclosed | Coding agents, Frontend generation, Repo refactor, Test generation | Good candidate for default coding workflow shortlist, but scenario label must cite benchmark and price fields rather than claim overall superiority. Source: Anthropic Claude API pricing · checked 2026-05-28 · confidence medium
Claude Haiku 4.5 · Anthropic · partial | 66.60% SWE-bench Verified · medium | Input: $1; Output: $5 | Cache read: $0.1; cache write (5m): $1.25; context: not disclosed | Low-cost automation, Test generation, Bug fixing | Lower token price does not guarantee lower task cost if retry rate rises; calculator must expose retry assumptions. Source: Anthropic Claude API pricing · checked 2026-05-28 · confidence medium
GPT-5.4 mini · OpenAI · partial | 56.20% SWE-bench Verified · medium | Input: $0.75; Output: $4.5 | Cache read: $0.075; cache write: not disclosed; context: not disclosed | Low-cost automation, Test generation | OpenAI API pricing is source-backed for GPT-5.4 mini standard short-context pricing; benchmark mapping remains caveated until exact leaderboard alias is verified. Source: OpenAI API pricing · checked 2026-06-02 · confidence medium
Gemini 3 Flash (high reasoning) · Google · partial | 75.80% SWE-bench Verified · medium | Input: not disclosed; Output: not disclosed | Cache: not disclosed; context: not disclosed | Coding agents, Long context, Frontend generation | Strong captured SWE-bench result, but price/context must remain unknown until exact Gemini model docs are mapped. Source: Gemini Developer API pricing · checked 2026-05-28 · confidence low
DeepSeek V4 Flash · DeepSeek · partial | Not publicly benchmarked (SWE-bench / Aider) · low | Input: $0.14; Output: $0.28 | Cache read: $0.0028; cache write: not disclosed; context: 1,000,000 tokens | Low-cost automation, Chinese coding workflow, Long context | Excellent token price and context signal, but exact public coding benchmark row for V4 Flash was not captured; mark coding evidence as incomplete. Source: DeepSeek API Docs: Models & Pricing · checked 2026-05-28 · confidence medium
DeepSeek V4 Pro · DeepSeek · partial | Not publicly benchmarked (SWE-bench / Aider) · low | Input: $0.435; Output: $0.87 | Cache read: $0.0036; cache write: not disclosed; context: 1,000,000 tokens | Chinese coding workflow, Long context, Repo refactor | Pricing/context are source-backed; coding benchmark evidence for the exact V4 Pro model still needs source verification. Source: DeepSeek API Docs: Models & Pricing · checked 2026-05-28 · confidence medium
Kimi K2.5 (high reasoning) · Moonshot AI / Kimi · partial | 70.80% SWE-bench Verified · medium | Input: not disclosed; Output: not disclosed | Cache: not disclosed; context: not disclosed | Chinese coding workflow, Coding agents, Repo refactor | Useful Chinese coding workflow candidate, but price/context must remain unknown until exact Moonshot pricing/model docs are captured. Source: Kimi API Platform pricing index · checked 2026-05-28 · confidence low
Qwen3 235B A22B · Alibaba Cloud / Qwen · partial | 59.6% Aider polyglot coding benchmark · medium | Input: not disclosed; Output: not disclosed | Cache: not disclosed; context: not disclosed | Chinese coding workflow, Low-cost automation | Benchmark-backed partial row only. Do not show price until exact Alibaba Cloud model pricing is captured. Source: Alibaba Cloud Model Studio pricing search result · checked 2026-05-28 · confidence low
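
The Claude Haiku 4.5 note says a lower token price does not guarantee a lower task cost once retries are counted. The sketch below is one way a calculator could expose that assumption. It is not the site's actual calculator. `TokenPrice`, `expected_task_cost`, the token counts, and the retry rates are all hypothetical. It also assumes the listed prices are per 1,000,000 tokens (the table does not state the unit) and that each attempt fails independently and is retried until it succeeds, so the expected number of attempts is 1 / (1 - retry_rate).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TokenPrice:
    """Input/output prices in USD as listed in the table.

    None stands for "not disclosed". The per-1M-token unit is an assumption.
    """
    input_usd: Optional[float]
    output_usd: Optional[float]


def expected_task_cost(
    price: TokenPrice,
    input_tokens: int,
    output_tokens: int,
    retry_rate: float,
    unit_tokens: int = 1_000_000,  # assumed pricing unit
) -> Optional[float]:
    """Expected USD cost of one completed task, retries included.

    Returns None when either price is undisclosed, so a caller cannot
    silently treat a missing price as zero.
    """
    if price.input_usd is None or price.output_usd is None:
        return None
    if not 0.0 <= retry_rate < 1.0:
        raise ValueError("retry_rate must be in [0, 1)")

    per_attempt = (
        input_tokens * price.input_usd + output_tokens * price.output_usd
    ) / unit_tokens
    # Assumed retry model: independent failures, retry until success.
    expected_attempts = 1.0 / (1.0 - retry_rate)
    return per_attempt * expected_attempts


if __name__ == "__main__":
    # Prices copied from the table; workload and retry rates are made up.
    haiku = TokenPrice(input_usd=1.0, output_usd=5.0)
    sonnet = TokenPrice(input_usd=3.0, output_usd=15.0)
    for name, price, retry in [("haiku", haiku, 0.75), ("sonnet", sonnet, 0.20)]:
        cost = expected_task_cost(price, 20_000, 4_000, retry)
        print(f"{name}: retry_rate={retry:.2f} expected_cost=${cost:.3f}")
```

With these made-up inputs, the cheaper-per-token model ends up slightly more expensive per completed task, because its hypothetical retry rate multiplies its per-attempt cost by four. The retry rates here are illustrative only and are not measurements of any model in the table.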