Benchmark Results¶
Comprehensive benchmark results for frontier multimodal LLMs solving jigsaw puzzles across different grid sizes.
Summary¶
| Grid Size | Pieces | GPT-5.2 Accuracy | GPT-5.2 Solve Rate | Gemini 3 Pro Accuracy | Gemini 3 Pro Solve Rate |
|---|---|---|---|---|---|
| 3×3 | 9 | 96.7% | 95% | 93.3% | 85% |
| 4×3/3×4 | 12 | 93.5% | 85% | 85.8% | 62.5% |
| 4×4 | 16 | 76.9% | 40% | 71.9% | 25% |
| 5×5 | 25 | 46.4% | 0% | 49.2% | 10% |
Test configuration
Results are averaged across 20 images per model per grid size. All models received the reference image, the correct piece count, and the last 3 moves as context. Claude Opus 4.5 was tested only on 3×3 (20% solve rate, 47.2% piece accuracy). GPT-5.2 and Gemini 3 Pro ran with low reasoning effort, while Opus 4.5 ran with high reasoning effort.
Performance vs Puzzle Complexity¶
Both solve rate and piece accuracy fall off steeply as grid size grows; the summary table above and the per-grid tables below quantify the decline.
Key Findings¶
- 📉 **Steep Difficulty Scaling.** Solve rates drop dramatically as puzzle complexity increases:
    - GPT-5.2: 95% → 0% solve rate from 3×3 to 5×5
    - Gemini 3 Pro: 85% → 10% solve rate from 3×3 to 5×5
- 🪙 **Token Usage Increases.** Models require significantly more tokens for larger puzzles:
    - GPT-5.2: ~15K → ~116K tokens (3×3 to 5×5)
    - Gemini 3 Pro: ~55K → ~345K tokens
- ❌ **5×5 Remains Unsolved.** No model reliably solves 5×5 puzzles; even frontier models struggle with 25 pieces.
- 📊 **Partial Progress Common.** Piece accuracy remains reasonable (50-80%) even when puzzles aren't fully solved (a scoring sketch follows below).
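The exact scoring code isn't shown on this page, so as a rough illustration, here is a minimal sketch of how the two metrics could be computed. The function and variable names are hypothetical, and it assumes piece accuracy means the fraction of pieces sitting in their correct cells when a run ends:

```python
# Hypothetical scoring sketch (not the benchmark's actual code).
# Assumes: final_positions and solution both map piece id -> cell index.

def piece_accuracy(final_positions: dict[int, int], solution: dict[int, int]) -> float:
    """Fraction of pieces whose final cell matches the solution cell."""
    correct = sum(1 for piece, cell in final_positions.items() if solution[piece] == cell)
    return correct / len(solution)

def solve_rate(runs: list[tuple[dict[int, int], dict[int, int]]]) -> float:
    """Fraction of runs in which every piece ended up in its correct cell."""
    solved = sum(1 for final, sol in runs if piece_accuracy(final, sol) == 1.0)
    return solved / len(runs)
```

Under this reading, the 5×5 results below mean roughly half the pieces end up correctly placed even though almost no run reaches accuracy 1.0.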
Detailed Results by Grid Size¶
3×3 Grid (9 pieces)¶
| Model | Piece Accuracy | Solve Rate | Avg Turns | Avg Tokens |
|---|---|---|---|---|
| GPT-5.2 | 96.7% ± 14.9% | 95% | 2.9 | 14,487 |
| Gemini 3 Pro | 93.3% ± 17.4% | 85% | 3.8 | 54,770 |
| Claude Opus 4.5 | 47.2% ± 33.2% | 20% | 11.3 | 33,822 |
Best performance
Both GPT-5.2 and Gemini 3 Pro solve 3×3 puzzles reliably with 85%+ solve rates.
4×4 Grid (16 pieces)¶
| Model | Piece Accuracy | Solve Rate | Avg Turns | Avg Tokens |
|---|---|---|---|---|
| GPT-5.2 | 76.9% ± 21.9% | 40% | 13.6 | 76,936 |
| Gemini 3 Pro | 71.9% ± 24.3% | 25% | 14.8 | 281,648 |
Degraded performance
Performance drops sharply: only 25-40% of puzzles are solved completely. Higher reasoning effort would probably improve the solve rate substantially.
5×5 Grid (25 pieces)¶
| Model | Piece Accuracy | Solve Rate | Avg Turns | Avg Tokens |
|---|---|---|---|---|
| GPT-5.2 | 46.4% ± 13.9% | 0% | 20.0 | 115,918 |
| Gemini 3 Pro | 49.2% ± 27.3% | 10% | 18.6 | 345,060 |
Spatial reasoning limit
Neither model reliably solves 25-piece puzzles; both consistently hit a wall around 50% piece accuracy. In limited testing on a few images, higher reasoning effort marginally improved accuracy but had minimal impact on solve rate.
Methodology¶
| Parameter | Value |
|---|---|
| Images | 20 diverse test images |
| Categories | Landscapes, portraits, abstract art, photos |
| Seed | Fixed seed (42) for reproducibility |
| Max turns | 12 (3×3), 17 (4×4), 20 (5×5) |
| Hints | Reference image, correct count, last 3 moves |
| Image size | Resized to 512px shortest side |
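To make the preprocessing concrete, here is a minimal sketch of what the parameters above imply: resize so the shortest side is 512 px, slice into a grid, and shuffle with a fixed seed. It assumes Pillow and uses illustrative helper names; it is not the benchmark's actual code:

```python
# Illustrative preprocessing sketch (not the benchmark's actual code):
# resize so the shortest side is 512 px, slice into rows x cols pieces,
# then shuffle piece order with a fixed seed for reproducibility.
import random
from PIL import Image

def make_puzzle(path: str, rows: int, cols: int, seed: int = 42, short_side: int = 512):
    img = Image.open(path)
    scale = short_side / min(img.size)
    img = img.resize((round(img.width * scale), round(img.height * scale)))

    w, h = img.width // cols, img.height // rows
    pieces = [img.crop((c * w, r * h, (c + 1) * w, (r + 1) * h))
              for r in range(rows) for c in range(cols)]

    order = list(range(len(pieces)))
    random.Random(seed).shuffle(order)  # fixed seed -> identical shuffle every run
    return [pieces[i] for i in order], order  # shuffled pieces + ground-truth order
```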
Reasoning Effort¶
Reasoning configuration
GPT-5.2 and Gemini 3 Pro used low reasoning effort; Claude Opus 4.5 used high reasoning effort. Without reasoning enabled, neither GPT-5.2 nor Gemini 3 Pro can solve puzzles, even at 3×3.
Higher reasoning tradeoffs
Informal testing of GPT-5.2 and Gemini 3 Pro with high reasoning effort showed slightly better performance (up to ~10%), but at significantly higher cost:
- A single puzzle could consume ~1M tokens with Gemini
- Much longer solving times
- Requests would quite often time out with `medium` or `high` reasoning for GPT-5.2
- Both GPT-5.2 and Gemini 3 Pro would still be stuck at ~50-80% piece accuracy on average
We opted for low reasoning to keep the benchmark practical.
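For reference, reasoning effort is a per-request setting in current provider SDKs. The sketch below shows how a low-effort request might look with the OpenAI Python SDK's Responses API; the model id is taken from this benchmark's CLI, the prompt is a placeholder, and this is not the benchmark's own client code:

```python
# Illustrative only: a low-reasoning-effort request via the OpenAI Python SDK.
# This is not the benchmark's own client code; the prompt is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.2",  # this benchmark's CLI refers to it as openai/gpt-5.2
    reasoning={"effort": "low"},  # the setting the headline results used
    input="Given the reference image and board state, name your next move.",
)
print(response.output_text)
```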
Reproducing Results¶
```bash
# Run the benchmark for a specific grid size
python benchmark.py \
  --grid-size 3 \
  --models openai/gpt-5.2 google/gemini-3-pro-preview \
  --images-dir images/ \
  --seed 42 \
  --resize 512
```
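To reproduce the full sweep, the same command can be repeated per grid size. A hypothetical convenience script is below; flag names are taken verbatim from the command above, and the 4×3/3×4 configuration may need a different flag form that isn't shown here:

```python
# Hypothetical sweep script: rerun benchmark.py for each square grid size tested above.
import subprocess

for grid_size in (3, 4, 5):
    subprocess.run(
        [
            "python", "benchmark.py",
            "--grid-size", str(grid_size),
            "--models", "openai/gpt-5.2", "google/gemini-3-pro-preview",
            "--images-dir", "images/",
            "--seed", "42",
            "--resize", "512",
        ],
        check=True,  # stop the sweep if any run fails
    )
```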
See the CLI Usage Guide and Benchmark Guide for full documentation.
