Revisiting LegalBench: New Models, A Bug I Missed, and a New Leader

Last month I published benchmark results comparing five LLMs on LegalBench, a suite of 161 legal reasoning tasks. The 27B Qwen3.5 model won at 0.7936, beating a 120B reasoning model by 6 points. The headline was that bigger isn’t better for legal work.

Since then, two things happened. First, I added two more models to the lineup: Qwen3.6-35B (which dropped yesterday) and Qwen3.5-122B (which I simply didn’t evaluate in round one). Second, I found a bug in my benchmarking code that was quietly suppressing scores on one category of tasks. Fixing it changes the leaderboard.

The Additional Models

Both are MoE architectures from the Qwen team:

| Model | Total params | Active params | Quantization |
|---|---|---|---|
| Qwen3.6-35B | 35B | ~3B (A3B) | AWQ 4-bit |
| Qwen3.5-122B | 122B | ~10B (A10B) | AWQ 4-bit |

Both AWQ 4-bit quantizations were produced by cyankiwi. The 3.6-35B ran on my RTX 5090; the 122B ran on an RTX 6000 Pro. Both served via vLLM in no-think mode, same prompts as the original benchmark.
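
For reference, this setup is a stock vLLM launch. A hypothetical invocation (the model repo ID, context length, and port are placeholders, not the exact command I used):

```shell
# Serve an AWQ 4-bit quant behind vLLM's OpenAI-compatible server.
# "cyankiwi/Qwen3.6-35B-AWQ" is a placeholder repo ID, not verified.
vllm serve cyankiwi/Qwen3.6-35B-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --port 8000
```

No-think mode is then requested per call; with Qwen-style chat templates that is typically done by passing `chat_template_kwargs` with `enable_thinking` set to false in the request body.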

The Bug I Missed

When I pulled the raw outputs for the MAUD category (34 tasks on M&A agreement interpretation that use lettered multiple-choice answers, from A/B for binary questions up to A/B/C/D/E), I noticed something weird: Qwen3.6-35B had scored 0.012 on maud_fiduciary_exception_board_determination_trigger_(no_shop). That's far below random chance for a binary question.

A look at the generations explained it: the model was answering "Option B" while the gold label was "B". My extract_answer() function was returning the full string "Option B", which never matched "B" in the grader.

Worse, on some tasks the model answered "Yes" when the question was A/B multiple choice. The “disproportionate impact modifier” prompts read like yes/no questions, and the model took the bait.

The bug showed up, to varying degrees, in every model's results except the 122B's:

| Model | MAUD tasks affected |
|---|---|
| Nemotron-30B | 15 |
| Qwen3.6-35B | 20 |
| Qwen3.5-35B | 6 |
| gpt-oss-120b | 4 |
| Qwen3.5-27B | 4 |
| Qwen3.5-9B | 2 |
| Qwen3.5-122B | 0 |

The 122B got clean letters on everything — the issue was specific to how smaller models handled the MAUD prompt format. Still, a benchmark bug is a benchmark bug.
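
Counting affected generations is just a scan over the raw outputs. A minimal sketch of the check I ran (the helper name and exact answer formats are illustrative, not code from the real harness):

```python
import re


def is_malformed_choice(raw_answer: str) -> bool:
    """Flag a generation that should have been a bare letter (e.g. "B")
    but came back as "Option B" or as a Yes/No answer.
    Hypothetical helper, not lifted from run_legalbench.py."""
    ans = raw_answer.strip()
    # "Option B", "Option C" etc. -- the prefix the grader choked on.
    if re.match(r"^Option\s+[A-Z]\b", ans):
        return True
    # Yes/No answers to questions that were actually A/B multiple choice.
    return ans in {"Yes", "No", "Yes.", "No."}


flagged = [a for a in ["B", "Option B", "Yes", "D"] if is_malformed_choice(a)]
print(flagged)  # -> ['Option B', 'Yes']
```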

The Fix

Two changes to run_legalbench.py:

  1. Output extraction — added a regex to strip "Option X" prefix: ^Option\s+([A-Z])\b → \1
  2. System prompt — added an explicit instruction: “If the question offers lettered answer choices (A, B, C, …), reply with ONLY the letter — never ‘Yes’ or ‘No’, never ‘Option X’, just the letter.”
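
On the extraction side, the fix is a one-line normalization. A simplified sketch of what the patched extract_answer() might look like (the surrounding function is my assumption; only the regex matches the change described above):

```python
import re


def extract_answer(generation: str) -> str:
    """Extract the model's final answer, normalizing an "Option X"
    prefix down to the bare letter so it matches gold labels like "B".
    Simplified sketch of the fix, not the full run_legalbench.py logic."""
    answer = generation.strip()
    # The fix: collapse "Option B" to just "B".
    answer = re.sub(r"^Option\s+([A-Z])\b", r"\1", answer)
    return answer


print(extract_answer("Option B"))  # -> B
print(extract_answer("B"))         # -> B
```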

I re-ran the 20 problematic MAUD tasks for Qwen3.6-35B with both fixes in place. The results were dramatic:

| Task | Before | After | Delta |
|---|---|---|---|
| fiduciary_exception_board_determination_trigger | 0.012 | 0.964 | +0.952 |
| specific_performance | 0.317 | 0.994 | +0.677 |
| pandemic_or_other_public_health_event (disproportionate) | 0.025 | 0.650 | +0.625 |
| ordinary_course_efforts_standard | 0.325 | 0.933 | +0.608 |
| cor_standard_(intervening_event) | 0.183 | 0.762 | +0.579 |
| general_economic_and_financial_conditions | 0.006 | 0.524 | +0.518 |
| (14 others) | | | +0.12 to +0.45 |

Every one of the 20 tasks improved. No regressions. Qwen3.6-35B’s overall score went from 0.7483 to 0.7982 — a +5.0 point jump from an extraction fix alone.
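
Recomputing the overall number after a partial re-run is simple if, as I'm assuming here, the overall score is an unweighted mean over per-task scores (names and numbers below are toy values, not the real score set):

```python
def patched_overall(original: dict, reruns: dict) -> float:
    """Overall score as an unweighted mean over per-task scores,
    with re-run results overriding the originals."""
    merged = {**original, **reruns}  # re-run scores win on key collisions
    return sum(merged.values()) / len(merged)


# Toy three-task example, not the real 161-task score set.
scores = {"task_a": 0.90, "task_b": 0.10, "task_c": 0.80}
print(round(patched_overall(scores, {"task_b": 0.70}), 2))  # -> 0.8
```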

I didn't re-run the other models with the fix. Their relative rankings from the original post stand, but note that the scores for Nemotron and the older 35B are understated. If I re-ran Nemotron with the fix, I'd expect it to gain 5-8 points and climb out of last place.

Updated Leaderboard

| Rank | Model | Score | Notes |
|---|---|---|---|
| 1 | Qwen3.5-122B | 0.7990 | MoE, 10B active |
| 2 | Qwen3.6-35B | 0.7982 | MoE, 3B active (after MAUD fix) |
| 3 | Qwen3.5-27B | 0.7936 | Dense |
| 4 | Qwen3.5-35B | 0.7612 | MoE, 3B active |
| 5 | Qwen3.5-9B | 0.7583 | Dense |
| 6 | gpt-oss-120b | 0.7313 | Reasoning model |
| 7 | Nemotron-30B | 0.5509 | MoE (would gain ~5-8 pts with fix) |

The top three models are separated by less than one point. The 122B edges out the 3.6-35B by 0.0008 — statistical noise.

What the 122B Buys You

The 122B has 3.5x more total parameters than the 3.6-35B and runs with 3.3x more active parameters per token. For a one-point gain over the 3.6-35B, is it worth it?

Looking at head-to-head on the 34 MAUD tasks (where the 122B should theoretically benefit most from its extra capacity):

| | Score | Task wins |
|---|---|---|
| Qwen3.6-35B (post-fix) | 0.626 | 17 |
| Qwen3.5-122B | 0.618 | 15 (+ 2 ties) |

Essentially a tie. The 122B wins on tasks that require memorized legal domain knowledge (accuracy_of_target_capitalization_rw: 0.755 vs 0.399). The 3.6-35B wins where the MAUD fix saved it (fiduciary_exception_board_determination_trigger: 0.964 vs 0.494).
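
The win/tie tally above is just a per-task comparison over the shared MAUD task set. A minimal sketch (the scores below are made-up placeholders, not the real per-task numbers):

```python
def head_to_head(a: dict, b: dict, eps: float = 1e-9) -> tuple:
    """Count per-task wins and ties between two models,
    given {task: score} dicts over the same task set."""
    wins_a = wins_b = ties = 0
    for task in a:
        if abs(a[task] - b[task]) <= eps:
            ties += 1
        elif a[task] > b[task]:
            wins_a += 1
        else:
            wins_b += 1
    return wins_a, wins_b, ties


print(head_to_head({"t1": 0.9, "t2": 0.4, "t3": 0.5},
                   {"t1": 0.5, "t2": 0.6, "t3": 0.5}))  # -> (1, 1, 1)
```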

Outside MAUD, both models perform similarly on contract NLI, CUAD clause detection, and privacy policy tasks — in the 0.90s range for most of them.

Verdict: the 122B buys you gains in the third decimal place, while taking roughly 3x the memory and running at about half the speed. Its one real advantage was instruction-following: it answered with bare letters where smaller models prepended "Option", and a better system prompt gave the 3.6-35B the same behavior. So the original verdict stands: moving from smaller models that fit on consumer GPUs like the 5090 to workstation-class models did not offer a noticeable improvement on this benchmark.

Qwen3.6-35B vs Qwen3.5-35B

The most interesting comparison is between the two 35B MoE models. Same parameter count, same active params, same quantization. Just a generation apart:

| | Qwen3.5-35B | Qwen3.6-35B (post-fix) |
|---|---|---|
| Overall | 0.7612 | 0.7982 |
| Gain | | +3.7 points |

A clean 3.7-point improvement at fixed parameter count. That’s the “raw model quality” delta between 3.5 and 3.6 — separate from any quantization or architectural choice.

Does the Original Blog’s Conclusion Still Hold?

The original post argued that smaller, well-quantized local models can beat a 120B reasoning model on legal work. That conclusion is stronger now, not weaker:

  • The 27B Qwen3.5 (dense, 16GB VRAM) still beats gpt-oss-120b by 6 points.
  • The 3.6-35B (MoE, 20GB VRAM) beats gpt-oss-120b by 7 points.
  • Even the 9B (single GPU) beats gpt-oss-120b by 3 points.

The 122B scoring 0.7990 is notable (it's the first local model here to cross 0.79), but it's not enough to change the fundamental story: model generation and training data remain far better predictors of legal reasoning ability than parameter count.

And the MAUD bug is a reminder: benchmarks measure your whole pipeline, not just the model. A one-line string-matching bug in an extraction function can cost 5 points.

What’s Next

The most interesting finding here is the generational jump from Qwen3.5-35B to Qwen3.6-35B: +3.7 points at fixed parameter count, active parameter count, and quantization. That’s a clean measurement of how much the 3.5 → 3.6 update is worth on legal reasoning.

And Qwen3.6-35B dropped yesterday. There’s no 3.6-122B yet, only the 3.5-122B I tested here. If the same 3.7-point generational improvement carries over to the larger MoE, the eventual Qwen3.6-122B could push past 0.83 on this benchmark. I’ll re-run as soon as it’s released.

Zooming out: on this legal benchmark, local Qwen models are consistently strong against other local open-weight options. The 27B, 9B, 35B, new 3.6-35B, and 122B all outperform gpt-oss-120b. That's less a knock on OpenAI's open-weight model than a measure of how well these Qwen models handle a real legal benchmark.