Benchmark Results¶
Comprehensive benchmark results for frontier multimodal LLMs solving jigsaw puzzles across different grid sizes.
Summary¶
| Grid Size | Pieces | GPT-5.2 Accuracy | GPT-5.2 Solve Rate | Gemini 3 Pro Accuracy | Gemini 3 Pro Solve Rate |
|---|---|---|---|---|---|
| 3×3 | 9 | 96.7% | 95% | 93.3% | 85% |
| 4×3/3×4 | 12 | 93.5% | 85% | 85.8% | 62.5% |
| 4×4 | 16 | 76.9% | 40% | 71.9% | 25% |
| 5×5 | 25 | 46.4% | 0% | 49.2% | 10% |
Test configuration
Results are averaged across 20 images per model per grid size. All models received the reference image, the correct piece count, and the last 3 moves as context. Claude Opus 4.5 was tested only on 3×3 (20% solve rate, 47.2% piece accuracy). GPT-5.2 and Gemini 3 Pro ran with low reasoning effort, while Opus 4.5 ran with high reasoning effort.
Performance vs Puzzle Complexity¶
Both solve rate and piece accuracy fall off steeply as grid size grows; the summary table above and the per-grid tables below quantify the decline.
Key Findings¶
- 📉 **Steep Difficulty Scaling.** Solve rates drop dramatically as puzzle complexity increases:
    - GPT-5.2: 95% → 0% solve rate from 3×3 to 5×5
    - Gemini 3 Pro: 85% → 10% solve rate from 3×3 to 5×5
- 🪙 **Token Usage Increases.** Models require significantly more tokens for larger puzzles:
    - GPT-5.2: ~15K → ~116K tokens (3×3 to 5×5)
    - Gemini 3 Pro: ~55K → ~345K tokens
- ❌ **5×5 Remains Unsolved.** No model reliably solves 5×5 puzzles; even frontier models struggle with 25 pieces.
- 📊 **Partial Progress Common.** Piece accuracy remains reasonable (50-80%) even when puzzles aren't fully solved (a scoring sketch follows below).
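The exact scoring code isn't shown on this page, so as a rough illustration, here is a minimal sketch of how the two metrics could be computed. The function and variable names are hypothetical, and it assumes piece accuracy means the fraction of pieces sitting in their correct cells when a run ends:

```python
# Hypothetical scoring sketch (not the benchmark's actual code).
# Assumes: final_positions and solution both map piece id -> cell index.

def piece_accuracy(final_positions: dict[int, int], solution: dict[int, int]) -> float:
    """Fraction of pieces whose final cell matches the solution cell."""
    correct = sum(1 for piece, cell in final_positions.items() if solution[piece] == cell)
    return correct / len(solution)

def solve_rate(runs: list[tuple[dict[int, int], dict[int, int]]]) -> float:
    """Fraction of runs in which every piece ended up in its correct cell."""
    solved = sum(1 for final, sol in runs if piece_accuracy(final, sol) == 1.0)
    return solved / len(runs)
```

Under this reading, the 5×5 results below mean roughly half the pieces end up correctly placed even though almost no run reaches accuracy 1.0.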
Detailed Results by Grid Size¶
3×3 Grid (9 pieces)¶
| Model | Piece Accuracy | Solve Rate | Avg Turns | Avg Tokens |
|---|---|---|---|---|
| GPT-5.2 | 96.7% ± 14.9% | 95% | 2.9 | 14,487 |
| Gemini 3 Pro | 93.3% ± 17.4% | 85% | 3.8 | 54,770 |
| Claude Opus 4.5 | 47.2% ± 33.2% | 20% | 11.3 | 33,822 |
Best performance
Both GPT-5.2 and Gemini 3 Pro solve 3×3 puzzles reliably with 85%+ solve rates.
4×4 Grid (16 pieces)¶
| Model | Piece Accuracy | Solve Rate | Avg Turns | Avg Tokens |
|---|---|---|---|---|
| GPT-5.2 | 76.9% ± 21.9% | 40% | 13.6 | 76,936 |
| Gemini 3 Pro | 71.9% ± 24.3% | 25% | 14.8 | 281,648 |
Degraded performance
Performance drops sharply: only 25-40% of puzzles are solved completely. Higher reasoning effort would probably improve the solve rate substantially.
5×5 Grid (25 pieces)¶
| Model | Piece Accuracy | Solve Rate | Avg Turns | Avg Tokens |
|---|---|---|---|---|
| GPT-5.2 | 46.4% ± 13.9% | 0% | 20.0 | 115,918 |
| Gemini 3 Pro | 49.2% ± 27.3% | 10% | 18.6 | 345,060 |
Spatial reasoning limit
Neither model reliably solves 25-piece puzzles; both consistently hit a wall around 50% piece accuracy. In limited testing on a few images, higher reasoning effort marginally improved accuracy but had minimal impact on solve rate.
Methodology¶
| Parameter | Value |
|---|---|
| Images | 20 diverse test images |
| Categories | Landscapes, portraits, abstract art, photos |
| Seed | Fixed seed (42) for reproducibility |
| Max turns | 12 (3×3), 17 (4×4), 20 (5×5) |
| Hints | Reference image, correct count, last 3 moves |
| Image size | Resized to 512px shortest side |
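To make the preprocessing concrete, here is a minimal sketch of what the parameters above imply: resize so the shortest side is 512 px, slice into a grid, and shuffle with a fixed seed. It assumes Pillow and uses illustrative helper names; it is not the benchmark's actual code:

```python
# Illustrative preprocessing sketch (not the benchmark's actual code):
# resize so the shortest side is 512 px, slice into rows x cols pieces,
# then shuffle piece order with a fixed seed for reproducibility.
import random
from PIL import Image

def make_puzzle(path: str, rows: int, cols: int, seed: int = 42, short_side: int = 512):
    img = Image.open(path)
    scale = short_side / min(img.size)
    img = img.resize((round(img.width * scale), round(img.height * scale)))

    w, h = img.width // cols, img.height // rows
    pieces = [img.crop((c * w, r * h, (c + 1) * w, (r + 1) * h))
              for r in range(rows) for c in range(cols)]

    order = list(range(len(pieces)))
    random.Random(seed).shuffle(order)  # fixed seed -> identical shuffle every run
    return [pieces[i] for i in order], order  # shuffled pieces + ground-truth order
```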
Reasoning Effort¶
Reasoning configuration
GPT-5.2 and Gemini 3 Pro used low reasoning effort; Claude Opus 4.5 used high reasoning effort. Without reasoning enabled, neither GPT-5.2 nor Gemini 3 Pro can solve puzzles, even at 3×3.
Higher reasoning tradeoffs
Informal testing of GPT-5.2 and Gemini 3 Pro with high reasoning effort showed slightly better performance (up to ~10%), but at significantly higher cost:
- A single puzzle could consume ~1M tokens with Gemini
- Much longer solving times
- Requests would quite often time out with `medium` or `high` reasoning for GPT-5.2
- Both GPT-5.2 and Gemini 3 Pro would still be stuck at ~50-80% piece accuracy on average
We opted for low reasoning to keep the benchmark practical.
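For reference, reasoning effort is a per-request setting in current provider SDKs. The sketch below shows how a low-effort request might look with the OpenAI Python SDK's Responses API; the model id is taken from this benchmark's CLI, the prompt is a placeholder, and this is not the benchmark's own client code:

```python
# Illustrative only: a low-reasoning-effort request via the OpenAI Python SDK.
# This is not the benchmark's own client code; the prompt is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.2",  # this benchmark's CLI refers to it as openai/gpt-5.2
    reasoning={"effort": "low"},  # the setting the headline results used
    input="Given the reference image and board state, name your next move.",
)
print(response.output_text)
```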
Reproducing Results¶
```bash
# Run the benchmark for a specific grid size
python benchmark.py \
  --grid-size 3 \
  --models openai/gpt-5.2 google/gemini-3-pro-preview \
  --images-dir images/ \
  --seed 42 \
  --resize 512
```
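To reproduce the full sweep, the same command can be repeated per grid size. A hypothetical convenience script is below; flag names are taken verbatim from the command above, and the 4×3/3×4 configuration may need a different flag form that isn't shown here:

```python
# Hypothetical sweep script: rerun benchmark.py for each square grid size tested above.
import subprocess

for grid_size in (3, 4, 5):
    subprocess.run(
        [
            "python", "benchmark.py",
            "--grid-size", str(grid_size),
            "--models", "openai/gpt-5.2", "google/gemini-3-pro-preview",
            "--images-dir", "images/",
            "--seed", "42",
            "--resize", "512",
        ],
        check=True,  # stop the sweep if any run fails
    )
```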
See the CLI Usage Guide and Benchmark Guide for full documentation.
