AI Model Comparison | JS AI Trading and Advisory LLC

🔵 CLAUDE API MODELS (ANTHROPIC) — Pay per token · Hosted by Anthropic

Model	Tier	Input (USD per 1M tokens)	Output (USD per 1M tokens)	Context (tokens)	MMLU	GPQA Diamond	SWE-bench Verified	Best For
Claude Opus 4.8 (Anthropic · Newest, May 2026)	Frontier	$5.00	$25.00	1M	~92%	~88%	~75%	Complex reasoning, agentic tasks, long-horizon coding, adaptive thinking. Most capable.
Claude Opus 4.7 (Anthropic · Apr 2026)	Frontier	$5.00	$25.00	1M	~91%	~86%	~73%	Legal, financial analysis, complex multi-step tasks, high-res vision.
Claude Sonnet 4.6 (Anthropic · Recommended)	Balanced	$3.00	$15.00	1M	~88%	~80%	~65%	Best price/quality. General apps, coding, RAG, production services. Sweet spot for most businesses.
Claude Haiku 4.5 (Anthropic · Cheapest current)	Fast / Low Cost	$1.00	$5.00	200K	~74%	~55%	~45%	High-volume simple tasks: classification, routing, summarisation, chatbots, serving free-tier users.
Claude Haiku 3, legacy (Anthropic · Absolute cheapest)	Fast / Low Cost	$0.25	$1.25	200K	~63%	~42%	—	Ultra-budget bulk. Being phased out. Use Haiku 4.5 for new builds unless cost is the only factor.

💡 Batch API = 50% off all prices above | Prompt caching = up to 90% off cached input tokens | These are YOUR costs as the service builder — not what you charge customers.
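The discounts above can be turned into a rough cost estimate. The following is a minimal sketch, not an official pricing calculator: the prices are copied from the table, the short model keys and the function name `estimate_cost` are made up for illustration, the cache discount is applied at its 90% upper bound, any extra charge for writing to the cache is ignored, and it assumes the batch and caching discounts can be combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


# Prices copied from the table above; the keys are illustrative labels,
# not real API model identifiers.
PRICES = {
    "opus-4.8": Pricing(5.00, 25.00),
    "sonnet-4.6": Pricing(3.00, 15.00),
    "haiku-4.5": Pricing(1.00, 5.00),
}


def estimate_cost(model, input_tokens, output_tokens,
                  cached_fraction=0.0, batch=False,
                  cache_discount=0.90, batch_discount=0.50):
    """Return an estimated cost in USD for one workload.

    cached_fraction is the share of input tokens served from the prompt cache.
    Assumption: the two discounts stack multiplicatively.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    price = PRICES[model]
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    # Cached tokens are billed at (1 - cache_discount) of the normal input rate.
    billable_input = fresh + cached * (1.0 - cache_discount)
    total = (billable_input * price.input_per_m
             + output_tokens * price.output_per_m) / 1_000_000
    if batch:
        total *= 1.0 - batch_discount
    return total


if __name__ == "__main__":
    # Example workload: 10M input tokens (80% cached) and 2M output tokens.
    cost = estimate_cost("sonnet-4.6", 10_000_000, 2_000_000, cached_fraction=0.8)
    print(f"Estimated cost: ${cost:.2f}")
```

Real invoices depend on the provider's current terms, so treat the result as a planning figure only.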

🟢 OPEN-SOURCE / OPEN-WEIGHT MODELS — Free to self-host via Ollama, LM Studio, etc.

Model	Tier	Hosted API Cost (USD per 1M tokens)	Self-Host Cost	License	Context (tokens)	MMLU	GPQA Diamond	SWE-bench Verified	Hardware (Self-Host)	Best For
DeepSeek V3.2 (DeepSeek · 671B MoE, 37B active)	Frontier OSS	~$0.14–0.27/M	$0 API fees	MIT	128K	88.5%	85%+	72%+	⚠ Server-grade: ~140 GB VRAM (Q4); 4–8× A100 80GB or H100. Not laptop-viable. Use hosted API for most teams.	Best all-round open model. General reasoning, coding, agentic workflows. MIT license = commercial use fully permitted.
DeepSeek R1 (DeepSeek · 671B MoE, reasoning specialist)	Deep Reasoning	~$0.55/M	$0 API fees	MIT	128K	84%	71%	49%	⚠ Server-grade: ~136 GB VRAM (Q4); 4–8× A100 80GB or H100. Smaller distilled versions (8B–70B) run on consumer hardware.	Math, logic, chain-of-thought. MATH-500: 97.3%, the highest among open models. Shows step-by-step reasoning. Great for tutoring, finance.
Qwen 3 235B (Alibaba · MoE, 22B active)	Reasoning	~$0.14/M	$0 API fees	Apache 2.0	131K	84.4%	81.1%	~60%	⚠ Server-grade: ~120 GB VRAM (Q4); 2–4× A100 80GB. Smaller Qwen3-32B runs on 2× RTX 4090.	Maths & multilingual. AIME 2025: 92.3%. Best fully Apache 2.0 option. Strong for non-English services.
Llama 4 Maverick (Meta · 400B MoE, 17B active)	General / Long-ctx	~$0.20/M	$0 API fees	Llama 4	1M	~82%	~70%	~62%	◆ Multi-GPU: ~80 GB VRAM (Q4); 2× A100 80GB or 4× RTX 4090. Mac Studio M4 Ultra (192GB) viable.	Large document RAG, 1M context. Within 3–5% of Sonnet on most everyday tasks. Good production choice.
Llama 4 Scout (Meta · 109B MoE, 17B active)	Speed / Long-ctx	~$0.10/M	$0 API fees	Llama 4	10M (!)	~79%	~65%	~55%	◆ Multi-GPU: ~50–60 GB VRAM (Q4); 2× RTX 4090 (48GB) or 1× A100 80GB. Mac Studio M2 Ultra+ with 64GB RAM.	Longest context of any model (10M tokens). 2,600 tok/s throughput. Retrieval across massive document sets.
Mistral Small 4 (Mistral · 24B dense)	Single GPU	~$0.10/M	$0 API fees	Apache 2.0	256K	~72%	~52%	~38%	✔ Single GPU: ~16–24 GB VRAM; RTX 3090 / 4090 (24GB), or Mac M2/M3 Pro with 18GB+ RAM. Intern laptop-friendly.	Runs on 1 consumer GPU. Ideal for intern dev machines, low-traffic services, European data sovereignty needs.
Gemma 3 27B (Google · 27B dense)	Single GPU	~$0.10/M	$0 API fees	Gemma ToS	128K	~75%	~55%	~40%	✔ Single GPU: ~16 GB VRAM; RTX 3090 / 4090, or Mac M2/M3 with 16GB unified memory. Most accessible option.	Needs only 16GB VRAM. Runs on gaming PCs or Apple M-series. Good for simple everyday consumer services.

💡 "Hosted API Cost" = running via third-party providers (Together.ai, Groq, Fireworks, etc.) | "Self-Host Cost" = $0 in API fees via Ollama, but requires your own hardware + electricity | Check each model's license terms before large-scale deployment, especially the custom Llama 4 and Gemma licenses; MIT and Apache 2.0 permit commercial use.
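Whether self-hosting actually saves money depends on volume. As a rough sketch of that trade-off, the function below estimates how many months it takes for purchased hardware to pay for itself compared with a hosted API. The function name and every input in the example (hardware price, running cost, monthly volume) are hypothetical placeholders, not figures from the tables; only the $0.20 per 1M tokens rate is taken from the Llama 4 Maverick row. Staff time, depreciation and downtime are ignored.

```python
def breakeven_months(hardware_usd, monthly_running_usd,
                     monthly_tokens_m, hosted_usd_per_m):
    """Months until self-hosting is cheaper than a hosted API.

    hardware_usd        -- one-off hardware purchase (hypothetical input)
    monthly_running_usd -- electricity and upkeep per month (hypothetical input)
    monthly_tokens_m    -- tokens processed per month, in millions
    hosted_usd_per_m    -- hosted API price per 1M tokens

    Returns None when self-hosting never pays off at this volume.
    """
    hosted_monthly = monthly_tokens_m * hosted_usd_per_m
    monthly_saving = hosted_monthly - monthly_running_usd
    if monthly_saving <= 0:
        return None
    return hardware_usd / monthly_saving


if __name__ == "__main__":
    # Illustrative numbers only: a $10,000 machine, $100/month to run,
    # 5,000M tokens per month, hosted at $0.20 per 1M tokens.
    months = breakeven_months(10_000, 100, 5_000, 0.20)
    print("Never breaks even" if months is None
          else f"Break-even after {months:.1f} months")
```

At low volumes the function returns `None`, which matches the table's advice to use a hosted API unless traffic is high or data must stay on your own hardware.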

Benchmark guide:

MMLU — General knowledge across 57 university-level subjects. A broad intelligence measure.
GPQA Diamond — Graduate-level science reasoning. Very hard; expert humans score ~65%. Measures deep reasoning.
SWE-bench Verified — Real GitHub coding tasks resolved autonomously. Best proxy for practical software engineering ability.

Scores marked ~ are directional estimates compiled from multiple published third-party sources as of mid-2026 — treat as relative guidance, not absolute truth. Benchmarks can be gamed and are one signal among many.

For your customer-facing services (chatbots, FAQs, summaries, form filling, basic advice): even Haiku 4.5 or Mistral Small 4 exceeds what's needed. Save Sonnet/Opus for tasks requiring deep reasoning or complex instructions.
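This recommendation amounts to a simple routing rule. A minimal sketch, assuming each request is already labelled with a task type: the task labels, the `needs_deep_reasoning` flag and the function name `pick_model` are hypothetical, and the returned strings are display names from the table, not API model identifiers.

```python
# Task types the text above lists as simple, customer-facing work.
SIMPLE_TASKS = {"chatbot", "faq", "summary", "form_filling", "basic_advice"}

CHEAP_MODEL = "Claude Haiku 4.5"     # could also be Mistral Small 4
STRONG_MODEL = "Claude Sonnet 4.6"   # escalate to Opus for the hardest cases


def pick_model(task_type: str, needs_deep_reasoning: bool = False) -> str:
    """Route simple tasks to the cheap tier and everything else upward."""
    if task_type in SIMPLE_TASKS and not needs_deep_reasoning:
        return CHEAP_MODEL
    return STRONG_MODEL


if __name__ == "__main__":
    for task in ("faq", "summary", "contract_review"):
        print(task, "->", pick_model(task))
```

Unknown task types fall through to the stronger model, which trades some cost for a lower risk of poor answers.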
