Best Open Source LLM 2026: Llama 4 vs Qwen 3.5 vs DeepSeek V4 vs Kimi K2.5 vs GLM-5 vs Mistral

Last updated: May 2026

The best open-source LLM in May 2026 depends on workload: DeepSeek V4 Pro (Max) leads the overall open-weight leaderboard at BenchLM (score 87), Qwen 3.5 397B is the strongest Apache 2.0 all-rounder, Kimi K2.5 leads HumanEval (99%), GLM-5 Reasoning leads knowledge benchmarks (MMLU 96), and Llama 4 Scout owns the long-context segment with a 10M token window. This article ranks every flagship open-weight model released between November 2025 and April 2026, with license terms, parameter counts, benchmark scores, and the realistic path to running each one.

If you need free hosted access to these models without renting a GPU, see Free LLM API Credits — Every Route from $0 to $10K and Free GPU Compute — Where to Get Hours Without a Credit Card.

Open-weight LLMs at a glance (May 2026)

| Model | Total / Active | License | Best for | Context |
|---|---|---|---|---|
| DeepSeek V4 Pro (Max) | 1.6T / 49B | MIT | Overall + coding | 1M |
| Qwen 3.5 397B | 397B / 17B | Apache 2.0 | All-round, multilingual | 256K |
| Qwen 3 235B-A22B | 235B / 22B | Apache 2.0 | Broad benchmarks | 128K |
| Llama 4 Maverick | 400B / 17B | Llama Community | General + ecosystem | 1M |
| Llama 4 Scout | 109B / 17B | Llama Community | Long-context (10M) | 10M |
| Kimi K2.5 / K2.6 | ~1T / ~32B | MIT | Code (HumanEval 99) | 200K |
| GLM-5 (Reasoning) | 744B / — | MIT | Knowledge (MMLU 96) | 128K |
| Mistral Medium 3.5 | 675B / 41B | Apache 2.0 | EU-friendly all-round | 128K |
| Gemma 4 medium | ~27B dense | Gemma | Single-GPU inference | 256K |
| GPT-oss 120B | 117B / 5.1B | Apache 2.0 | OpenAI-format, easy port | 128K |
| Qwen3-Coder-Next | 80B / 3B | Apache 2.0 | Code on small VRAM | 128K |

"Active" = parameters used per token in MoE models — what determines GPU VRAM and latency. "Total" = full weights to download.

DeepSeek V4 Pro — overall leader

DeepSeek V4 (released April 24, 2026) ships in Pro (Max + High) and Flash variants. Pro Max scores 87 on BenchLM overall, 89.8 on coding (the best of any open-weight model), and supports a 1M token context. The architecture is the sparsest flagship MoE design in the open ecosystem — 256 experts with 9 active per token, so roughly 3% of the 1.6T total weights fire on any given token. License is MIT, so commercial use is unrestricted.
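
For intuition, here is that routing pattern in miniature — a toy top-k MoE layer in PyTorch with the same 256-expert / 9-active shape. The hidden size and the experts themselves are illustrative stand-ins, not DeepSeek's actual architecture:

```python
import torch

n_experts, k, d_model = 256, 9, 1024   # DeepSeek-style sparsity; d_model is made up
router = torch.nn.Linear(d_model, n_experts, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Linear(d_model, d_model) for _ in range(n_experts))

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """x: (tokens, d_model). Each token is routed to only 9 of the 256 experts."""
    weights, idx = router(x).softmax(-1).topk(k, dim=-1)   # (tokens, 9)
    weights = weights / weights.sum(-1, keepdim=True)      # renormalize over the top-9
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        tok, slot = (idx == e).nonzero(as_tuple=True)      # tokens that picked expert e
        if tok.numel():
            out[tok] += weights[tok, slot].unsqueeze(-1) * expert(x[tok])
    return out

print(moe_forward(torch.randn(4, d_model)).shape)          # torch.Size([4, 1024])
```

Only the selected experts run per token, which is why a 1.6T-total model can serve with per-token compute closer to that of a ~49B dense model.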

Practical reality: Pro Max needs an 8× H100 server to serve at decent latency. DeepSeek V4 Flash (smaller variant) runs on 2× H100 and trades ~6 points of overall score for half the cost. For self-hosting, start with Flash unless you already have multi-H100 capacity.

Qwen 3.5 397B — best Apache 2.0 all-rounder

Alibaba's Qwen 3.5 family completed its rollout in early March 2026. The flagship Qwen 3.5 397B (17B active) is the strongest open model under a fully permissive Apache 2.0 license — no MAU caps, no use-case carve-outs. On coding it lands within ~3 points of DeepSeek V4 Pro (Qwen 3.5 397B Reasoning scores 86.7 vs 89.8), and it pulls ahead on multilingual reasoning.

The smaller Qwen 3 235B-A22B (still Apache 2.0) is the most popular real-world deployment — it fits on 4× H100 in 4-bit, runs at a few hundred tokens/sec, and matches Llama 4 Maverick on most benchmarks. If you want one model to default to and never worry about licensing, this is the pick.
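
A minimal serving sketch with vLLM's offline API, assuming a published 4-bit (AWQ) checkpoint — the repo id and quantization flag below are assumptions to verify against the model card:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-235B-A22B",  # verify the exact repo id on the Hub
    tensor_parallel_size=4,        # shard across the 4x H100 mentioned above
    quantization="awq",            # assumes an official AWQ 4-bit release exists
)
outputs = llm.generate(
    ["Summarize the Apache 2.0 license in two sentences."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```

For production you would run `vllm serve` instead, which puts the same model behind an OpenAI-compatible HTTP endpoint.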

Llama 4 — the long-context play

Meta released Llama 4 Scout and Maverick in April 2026. Scout (109B / 17B active) ships with the 10M token context window — by a wide margin the longest in open-weight LLMs. Maverick (400B / 17B active) is the flagship for general tasks.

License: the Llama Community License permits commercial use below 700M monthly active users. Above that threshold, you need a separate agreement with Meta. For 99% of teams this is a non-issue; for hyperscalers and enterprise SaaS at scale, this is the reason to default to Qwen 3.5 instead.

Llama 4 Scout's 10M context unlocks workflows that no other open model practically supports — full-codebase analysis, multi-document RAG, long-horizon agents. The trade-off is inference cost: expect ~$0.30 per million input tokens on hosted providers vs ~$0.05 for a 128K-context Qwen 3.5.
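
The arithmetic behind that trade-off, using the illustrative prices above (input tokens only; both rates are assumptions):

```python
scout_per_tok = 0.30 / 1e6   # $/input token, Llama 4 Scout hosted (assumed)
qwen_per_tok  = 0.05 / 1e6   # $/input token, 128K-context Qwen 3.5 (assumed)

corpus = 10_000_000          # one full 10M-token context
print(f"Scout, one 10M-token call:    ${corpus * scout_per_tok:.2f}")  # $3.00
print(f"Qwen 3.5, ~80 chunks of 128K: ${corpus * qwen_per_tok:.2f}")   # $0.50
```

The 6x price gap is what buys a single coherent context instead of chunking and retrieval machinery.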

Kimi K2.5 / K2.6 — the coding monster

Moonshot AI's Kimi K2.5 leads HumanEval at 99% and posts MMLU 92.0, MMLU-Pro 87.1, AIME 2025 96.1, and GPQA Diamond 87.6. The follow-up Kimi K2.6 gives up a few points on knowledge benchmarks for further coding gains. License: MIT, fully commercial.

Kimi is the right pick for code-heavy production: code completion, code review, autonomous coding agents. The trade-off is ecosystem maturity — Moonshot's tooling and quantization support lag Llama / Qwen by 1-2 months. If you self-host, expect to wait for community quants (typically released within a week of the official weights).

GLM-5 — knowledge and reasoning specialist

Zhipu AI's GLM-5 (744B params) is the knowledge benchmark leader: MMLU 96, GPQA 94, SuperGPQA 92, SimpleQA 92, HumanEval 90, SWE-bench 77.8. The Reasoning variant uses a chain-of-thought scratchpad similar to OpenAI's o-series. MIT license; commercial use is unrestricted.

GLM-5 is the pick for research, analysis, and tasks where factual recall matters more than raw code generation. It's also the open-weight model closest in style to the OpenAI reasoning family — if your application was built for o1 / o3, GLM-5 Reasoning is the easiest drop-in.

Mistral Medium 3.5 — EU-native all-rounder

Mistral released Medium 3.5 on April 29, 2026 (675B / 41B active) under Apache 2.0, with EU-hosted inference and one of the highest active-parameter counts in the flagship MoE field — more compute per token, but more capability per forward pass. Mistral's chief edge is regulatory: GDPR-aligned hosting, EU data residency, and the strongest enterprise support contracts among open-weight labs.

If you operate in the EU or in a regulated industry such as healthcare or finance, Mistral Medium 3.5 is the default open model. Otherwise, Qwen 3.5 or DeepSeek V4 Flash are harder to beat on price/performance.

Gemma 4 — Google's single-GPU pick

Google's Gemma 4 family runs from ~2B (mobile / edge) up to the ~27B medium variant, which fits on a single 24GB consumer GPU (RTX 4090 / 5090) in 4-bit and offers a 256K context window. The Gemma license permits commercial use but carries use-case restrictions (no weapons, no surveillance, etc.).

Gemma 4 is the right pick for self-hosters with one good GPU who want predictable behavior, strong safety tuning, and Google's distillation discipline. It loses to MoE flagships on raw benchmarks but wins on local-inference cost.
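
A sketch of loading the ~27B medium variant in 4-bit on one 24GB GPU with transformers + bitsandbytes. The repo id is a placeholder — substitute whatever Google publishes on the Hub:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-4-27b-it"  # hypothetical repo id; check the Hub
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto")  # ~13.5 GiB of weights

inputs = tok("Summarize the Gemma license terms.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```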

GPT-oss 120B — the OpenAI-format port

OpenAI's GPT-oss 120B (released late 2025) is a 117B-total / 5.1B-active MoE under Apache 2.0. The selling point is native OpenAI API format — if your code is built against the OpenAI SDK, GPT-oss is a one-line endpoint swap. Benchmarks land mid-pack (well below DeepSeek V4 and Qwen 3.5), but for teams already on the OpenAI stack, the migration cost is the lowest of any open model.
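
The "one-line endpoint swap" in practice — the OpenAI Python SDK pointed at any OpenAI-compatible host serving GPT-oss. The base_url and model id below are placeholders for whichever provider or local vLLM server you use:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # e.g. a local vLLM server; swap per provider
    api_key="not-needed-for-local",
)
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",          # provider-specific model id
    messages=[{"role": "user", "content": "Hello from the open weights."}],
)
print(resp.choices[0].message.content)
```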

Qwen3-Coder-Next — small-VRAM code model

Released early February 2026, Qwen3-Coder-Next is an 80B-total / 3B-active MoE specialized for code. The 3B active count means consumer-GPU token speeds; quantized to 4-bit with the inactive experts offloaded to system RAM, it is practical on a single 24GB GPU. It outperforms much larger models like DeepSeek V3.2 (37B active) on coding tasks at a fraction of the inference cost. Apache 2.0.

For a single-developer self-hosted coding agent, Qwen3-Coder-Next is currently the price/performance leader.

How to pick — by workload

Coding agents and code completion: DeepSeek V4 Pro (best overall), Kimi K2.5 (HumanEval leader), Qwen3-Coder-Next (cheap and fast).

General-purpose chat / RAG: Qwen 3.5 397B or Qwen 3 235B-A22B (Apache 2.0, safe default).

Knowledge / research / reasoning: GLM-5 Reasoning, DeepSeek V4 Pro Max.

Long-context (>1M tokens): Llama 4 Scout (10M) is the only practical choice.

EU / regulated industries: Mistral Medium 3.5.

Single consumer GPU, self-host: Gemma 4 medium, Qwen3-Coder-Next, or Llama 4 Scout in 4-bit with its inactive experts offloaded to system RAM.

OpenAI SDK migration: GPT-oss 120B.

Where to run them

Hosted with free tier (no GPU needed):

  • OpenRouter — aggregates almost every open model. Free tier covers Llama 4, Qwen 3.5, DeepSeek V4 Flash, GLM-5 with rate limits.
  • Groq — fastest inference per token on its LPU hardware. Llama and Qwen variants. Free tier with generous rate limits.
  • Together AI — broad open-model catalog, $1 in free credits.
  • HuggingFace Inference API — serverless access to open models on the Hub, free tier.
  • Fireworks, Replicate, DeepInfra — alternatives with different price points.

Rent a GPU (you control inference):

  • RunPod, Lambda, Vast.ai — on-demand H100 / A100 instances billed by the hour; bring your own inference stack from the self-host list below.

Self-host (you own the hardware):

  • llama.cpp — CPU + single GPU, GGUF quantization, broadest hardware support.
  • vLLM — production inference server, multi-GPU, OpenAI-compatible API.
  • Ollama — easiest local install, wraps llama.cpp with a model registry (see the sketch below).
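
The Ollama route end to end — a minimal sketch using its Python client; the model tag is a placeholder for whatever the Ollama registry lists for your pick from the table above:

```python
import ollama  # pip install ollama; assumes `ollama serve` is running locally

resp = ollama.chat(
    model="qwen3-coder",  # placeholder tag, pulled first via `ollama pull qwen3-coder`
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(resp["message"]["content"])
```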

The crossover point: below ~20% GPU utilization, hosted is cheaper; above ~50%, renting an H100 wins; above ~70% sustained, owning hardware wins by month 6.
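
To see where those thresholds come from, here is the break-even arithmetic with illustrative prices — every constant below is an assumption, so plug in your own quotes:

```python
HOSTED_PER_M_TOKENS = 0.60   # $/1M tokens on a hosted open-model API (assumed)
H100_RENT_PER_HOUR  = 2.50   # $/hr on-demand H100 rental (assumed)
TOKENS_PER_SEC      = 2500   # sustained single-H100 throughput (assumed)

def monthly_cost(utilization: float) -> tuple[float, float]:
    busy_hours = 730 * utilization                  # 730 hours in a month
    tokens_m = busy_hours * 3600 * TOKENS_PER_SEC / 1e6
    hosted = tokens_m * HOSTED_PER_M_TOKENS         # pay only per token
    rented = 730 * H100_RENT_PER_HOUR               # pay for idle time too
    return hosted, rented

for u in (0.2, 0.5, 0.7):
    hosted, rented = monthly_cost(u)
    print(f"{u:.0%} util: hosted ${hosted:,.0f}/mo vs rented ${rented:,.0f}/mo")
# At 20% utilization hosted wins; with these assumed prices the curves
# cross near ~45%, consistent with the thresholds above.
```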

What changed in the last 90 days

  • April 29, 2026 — Mistral Medium 3.5 released, Apache 2.0.
  • April 24, 2026 — DeepSeek V4 released, took the overall leaderboard.
  • April 2026 — Llama 4 Scout + Maverick released by Meta.
  • Early March 2026 — Qwen 3.5 family completed rollout.
  • Early February 2026 — Qwen3-Coder-Next released, set new price/perf bar for code.
  • Late 2025 — GPT-oss 120B released under Apache 2.0.

If you are reading this 30+ days after the publication date, expect at least one new flagship release. Open-weight LLM cadence is roughly monthly as of mid-2026.
