JS AI Trading and Advisory LLC
AI Services & Solutions
AI Model Reference Guide
Last updated: June 2026

AI Model Comparison: Cost & Capability

Claude API models vs. open-source alternatives — June 2026

🔵 CLAUDE API MODELS (ANTHROPIC): Pay per token · Hosted by Anthropic
| Model | Tier | Input /1M tok | Output /1M tok | Context | MMLU | GPQA Diamond | SWE-bench | Best For |
|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.8 (Anthropic · Newest, May 2026) | Frontier | $5.00 | $25.00 | 1M | ~92% | ~88% | ~75% | Complex reasoning, agentic tasks, long-horizon coding, adaptive thinking. Most capable. |
| Claude Opus 4.7 (Anthropic · Apr 2026) | Frontier | $5.00 | $25.00 | 1M | ~91% | ~86% | ~73% | Legal, financial analysis, complex multi-step tasks, high-res vision. |
| Claude Sonnet 4.6 (Anthropic · Recommended) | Balanced | $3.00 | $15.00 | 1M | ~88% | ~80% | ~65% | Best price/quality. General apps, coding, RAG, production services. Sweet spot for most businesses. |
| Claude Haiku 4.5 (Anthropic · Cheapest current) | Fast / Low Cost | $1.00 | $5.00 | 200K | ~74% | ~55% | ~45% | High-volume simple tasks: classification, routing, summarisation, free tier user serving, chatbots. |
| Claude Haiku 3 (legacy) (Anthropic · Absolute cheapest) | Fast / Low Cost | $0.25 | $1.25 | 200K | ~63% | ~42% | | Ultra-budget bulk. Being phased out. Use Haiku 4.5 for new builds unless cost is the only factor. |
💡 Batch API = 50% off all prices above  |  Prompt caching = up to 90% off cached input tokens  |  These are YOUR costs as the service builder, not what you charge customers.
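The discount arithmetic above can be sketched in a few lines of Python. This is an illustrative estimate only, under stated assumptions: cached input tokens receive the full 90% discount (the guide says "up to"), cache-write surcharges are ignored, the batch and caching discounts are assumed to stack, and the dictionary keys are hypothetical labels rather than real API model identifiers. Prices are copied from the table above.

```python
# USD per 1M tokens (input, output), copied from the Claude table above.
# Keys are illustrative labels, NOT official API model IDs.
PRICES = {
    "claude-opus-4.8": (5.00, 25.00),
    "claude-opus-4.7": (5.00, 25.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-haiku-4.5": (1.00, 5.00),
    "claude-haiku-3": (0.25, 1.25),
}

CACHE_DISCOUNT = 0.90  # assumption: best case of "up to 90% off"
BATCH_DISCOUNT = 0.50  # "50% off all prices above"


def estimate_cost(model, input_tokens, output_tokens,
                  cached_fraction=0.0, batch=False):
    """Return an estimated cost in USD for one workload."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    in_price, out_price = PRICES[model]

    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached

    cost = (
        fresh * in_price
        + cached * in_price * (1.0 - CACHE_DISCOUNT)
        + output_tokens * out_price
    ) / 1_000_000

    # Assumption: the batch discount applies on top of caching.
    if batch:
        cost *= (1.0 - BATCH_DISCOUNT)
    return cost


if __name__ == "__main__":
    # 1M input + 200K output tokens on Sonnet, no discounts.
    print(f"${estimate_cost('claude-sonnet-4.6', 1_000_000, 200_000):.2f}")
```

Passing `batch=True` or a non-zero `cached_fraction` shows how much of the bill each discount removes for a given workload; confirm the actual discount rules with the provider before budgeting on them.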
🟢 OPEN-SOURCE / OPEN-WEIGHT MODELS: Free to self-host via Ollama, LM Studio, etc.
| Model | Tier | Hosted API Cost | Self-Host Cost | License | Context | MMLU | GPQA Diamond | SWE-bench | Hardware (Self-Host) | Best For |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek V3.2 (DeepSeek · 671B MoE, 37B active) | Frontier OSS | ~$0.14–0.27/M | $0 API fees | MIT | 128K | 88.5% | 85%+ | 72%+ | ⚠ Server-grade. ~140 GB VRAM (Q4). 4–8× A100 80GB or H100. Not laptop-viable. Use hosted API for most teams. | Best all-round open model. General reasoning, coding, agentic workflows. MIT license = commercial use fully permitted. |
| DeepSeek R1 (DeepSeek · 671B MoE, reasoning specialist) | Deep Reasoning | ~$0.55/M | $0 API fees | MIT | 128K | 84% | 71% | 49% | ⚠ Server-grade. ~136 GB VRAM (Q4). 4–8× A100 80GB or H100. Smaller distilled versions (8B–70B) run on consumer hardware. | Math, logic, chain-of-thought. MATH-500: 97.3%, highest open model. Shows step-by-step reasoning. Great for tutoring, finance. |
| Qwen 3 235B (Alibaba · MoE, 22B active) | Reasoning | ~$0.14/M | $0 API fees | Apache 2.0 | 131K | 84.4% | 81.1% | ~60% | ⚠ Server-grade. ~120 GB VRAM (Q4). 2–4× A100 80GB. Smaller Qwen3-32B runs on 2× RTX 4090. | Maths & multilingual. AIME 2025: 92.3%. Best fully Apache 2.0 option. Strong for non-English services. |
| Llama 4 Maverick (Meta · 400B MoE, 17B active) | General / Long-ctx | ~$0.20/M | $0 API fees | Llama 4 | 1M | ~82% | ~70% | ~62% | ◆ Multi-GPU. ~80 GB VRAM (Q4). 2× A100 80GB or 4× RTX 4090. Mac Studio M4 Ultra (192GB) viable. | Large document RAG, 1M context. Within 3–5% of Sonnet on most everyday tasks. Good production choice. |
| Llama 4 Scout (Meta · 109B MoE, 17B active) | Speed / Long-ctx | ~$0.10/M | $0 API fees | Llama 4 | 10M (!) | ~79% | ~65% | ~55% | ◆ Multi-GPU. ~50–60 GB VRAM (Q4). 2× RTX 4090 (48GB) or 1× A100 80GB. Mac Studio M2 Ultra+ with 64GB RAM. | Longest context of any model (10M tokens). 2,600 tok/s throughput. Retrieving across massive document sets. |
| Mistral Small 4 (Mistral · 24B dense) | Single GPU | ~$0.10/M | $0 API fees | Apache 2.0 | 256K | ~72% | ~52% | ~38% | ✔ Single GPU. ~16–24 GB VRAM. RTX 3090 / 4090 (24GB), or Mac M2/M3 Pro with 18GB+ RAM. Intern laptop-friendly. | Runs on 1 consumer GPU. Ideal for intern dev machines, low-traffic services, European data sovereignty needs. |
| Gemma 3 27B (Google · 27B dense) | Single GPU | ~$0.10/M | $0 API fees | Gemma ToS | 256K | ~75% | ~55% | ~40% | ✔ Single GPU. ~16 GB VRAM. RTX 3090 / 4090, or Mac M2/M3 with 16GB unified memory. Most accessible option. | Needs only 16GB VRAM. Runs on gaming PCs or Apple M-series. Good for simple everyday consumer services. |
💡 "Hosted API Cost" = running via third-party providers (Together.ai, Groq, Fireworks, etc.)  |  "Self-Host Cost" = $0 in API fees via Ollama, but requires your own hardware and electricity  |  Check each model's license terms before large-scale deployment: MIT and Apache 2.0 are permissive, while the Llama 4 and Gemma licenses carry their own conditions.
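Since self-hosting trades API fees for hardware and electricity, a rough break-even check can help decide between the two. The sketch below is a simplification under stated assumptions: `monthly_self_host_usd` is a hypothetical all-in monthly figure you supply (amortised hardware plus power), the hosted price is a single blended rate per 1M tokens, and staffing, throughput limits, and idle time are ignored.

```python
def breakeven_tokens_per_month(monthly_self_host_usd, hosted_usd_per_million):
    """Tokens per month at which self-hosting costs the same as a hosted API.

    Below this volume the hosted API is cheaper; above it, self-hosting is.
    Both inputs are caller-supplied estimates, not figures from this guide.
    """
    if monthly_self_host_usd < 0:
        raise ValueError("monthly_self_host_usd must not be negative")
    if hosted_usd_per_million <= 0:
        raise ValueError("hosted_usd_per_million must be positive")
    return monthly_self_host_usd / hosted_usd_per_million * 1_000_000


if __name__ == "__main__":
    # Hypothetical example: $400/month for a single-GPU box versus a
    # ~$0.10/M hosted rate (the rate listed for the single-GPU models).
    tokens = breakeven_tokens_per_month(400.0, 0.10)
    print(f"Break-even at roughly {tokens / 1e9:.1f} billion tokens/month")
```

At low hosted rates the break-even volume tends to be very large, which is consistent with the guide's advice to use hosted APIs for the server-grade models.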
Benchmark guide:

MMLU: General knowledge across 57 university-level subjects. A broad intelligence measure.
GPQA Diamond: Graduate-level science reasoning. Very hard; expert humans score ~65%. Measures deep reasoning.
SWE-bench Verified: Real GitHub coding tasks resolved autonomously. Best proxy for practical software engineering ability.

Scores marked ~ are directional estimates compiled from multiple published third-party sources as of mid-2026. Treat them as relative guidance, not absolute truth. Benchmarks can be gamed and are one signal among many.

For your customer-facing services (chatbots, FAQs, summaries, form filling, basic advice): even Haiku 4.5 or Mistral Small 4 exceeds what's needed. Save Sonnet/Opus for tasks requiring deep reasoning or complex instructions.
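One way to apply this guidance in a service is a small routing function that picks a model tier per request. The sketch below is only an illustration: the task labels, the `needs_deep_reasoning` flag, and the function name are hypothetical, and the returned strings are display names from this guide, not API identifiers.

```python
# Task types the guide describes as well within reach of the cheapest tier.
# The label strings themselves are hypothetical.
SIMPLE_TASKS = {
    "chatbot", "faq", "summary", "form_filling", "basic_advice",
    "classification", "routing",
}


def pick_model(task_type, needs_deep_reasoning=False):
    """Suggest a model tier for a request, following the guide's advice."""
    if needs_deep_reasoning:
        # Reserve the expensive tiers for deep reasoning or complex instructions.
        return "Claude Sonnet 4.6 or Claude Opus 4.8"
    if task_type in SIMPLE_TASKS:
        return "Claude Haiku 4.5 or Mistral Small 4"
    # Unknown task types default to the balanced tier.
    return "Claude Sonnet 4.6"


if __name__ == "__main__":
    print(pick_model("faq"))
    print(pick_model("contract_review", needs_deep_reasoning=True))
```

Defaulting unknown task types to the balanced tier is a design choice for this sketch; a cost-sensitive service might default to the cheapest tier and escalate only on failure.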