
LLM Jigsaw Puzzle Benchmark

A benchmark for testing multimodal LLM spatial reasoning capabilities through iterative jigsaw puzzle solving.

🎮 Try It Yourself 📊 View Results

[Figure: reference image alongside the animated solving process]


🏆 Benchmark Results

Can frontier LLMs solve jigsaw puzzles? We tested GPT-5.2, Gemini 3 Pro, and Claude Opus 4.5 across grid sizes from 3×3 to 5×5 on 20 hand-picked images.

Performance vs Grid Size

Grid   Pieces   GPT-5.2               Gemini 3 Pro          Claude Opus 4.5
3×3    9        95% solve, 97% acc    85% solve, 93% acc    20% solve, 47% acc
4×4    16       40% solve, 77% acc    25% solve, 72% acc    -
5×5    25       0% solve, 46% acc     10% solve, 49% acc    -

Solve = % of puzzles fully completed. Acc = % of pieces in the correct position at the end of a run.
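
For concreteness, piece accuracy is just the fraction of cells holding their correct piece. A minimal sketch, assuming the board is represented as a dict mapping each cell to the piece currently in it (an illustrative representation, not necessarily the project's own):

def piece_accuracy(board: dict) -> float:
    # A piece counts as correct when the cell it sits in is its home cell.
    correct = sum(1 for cell, piece in board.items() if cell == piece)
    return correct / len(board)

# e.g. a 4x4 grid with 12 pieces in place -> 12 / 16 = 0.75 (75% acc)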

Key Insights

  • Difficulty scales steeply – solve rates drop from 95% on 3×3 to 0% on 5×5
  • No model reliably solves 5×5 – spatial reasoning hits a wall at 25 pieces
  • Partial progress is common – models often plateau at 50–80% piece accuracy on 4×4 and 5×5 grids

GPT-5.2 and Gemini 3 Pro were tested with low reasoning effort; Claude Opus 4.5 with high. Higher reasoning effort performed slightly better on individual images, but GPT-5.2 and Gemini 3 Pro still plateaued at roughly 50–70% average piece accuracy on 5×5 grids.

All models received the reference image, the count of correctly placed pieces, and their last 3 moves as context.

📊 Detailed benchmark results


Overview

This project shuffles an image into an N×N (or rectangular N×M) grid and challenges an LLM to restore the original by iteratively swapping pieces (a minimal sketch of the turn loop follows the list below). The task tests:

  • 👁 Visual Understanding

    Recognizing piece content and how pieces fit together

  • 🧠 Spatial Reasoning

    Understanding grid coordinates and piece relationships

  • 🔄 Iterative Problem Solving

    Making progress across multiple turns

  • 🧠 Memory & Context

    Tracking previous moves and learning from them
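
To make the iterative setup concrete, here is a rough sketch of the turn loop with a random stand-in solver in place of a real LLM call; every name below is illustrative, not the project's actual API:

import random

def random_solver(board):
    # Stand-in for a multimodal LLM call: propose one arbitrary swap.
    a, b = random.sample(sorted(board), 2)
    return [{"op": "swap", "a": a, "b": b}]

def play(n=3, max_turns=50, seed=0):
    cells = [f"{r},{c}" for r in range(1, n + 1) for c in range(1, n + 1)]
    pieces = cells[:]
    random.Random(seed).shuffle(pieces)
    board = dict(zip(cells, pieces))      # cell -> piece currently in it
    for turn in range(max_turns):
        if all(cell == piece for cell, piece in board.items()):
            return turn                   # solved within the turn budget
        for move in random_solver(board):
            a, b = move["a"], move["b"]
            board[a], board[b] = board[b], board[a]
    return None                           # turn budget exhausted

print(play())                             # usually None for a random solver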


Features

  • Configurable difficulty – Square (4×4) or rectangular (3×5) grids
  • Multiple LLM providers – OpenAI, Anthropic, Google
  • Visual annotations – Grid labels, colored borders for easy piece identification
  • Comprehensive metrics – Tracks moves, accuracy, tokens, timing
  • Reproducible – Seed-based shuffling for consistent benchmarks (see the sketch after this list)
  • Optional hints – Show correct count, provide reference image
  • Animated GIF output – Visualize the solving process
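
As referenced in the Reproducible bullet above, seed-based shuffling means the permutation of pieces is derived from a fixed seed. A minimal sketch, illustrative only and not the project's actual shuffling code:

import random

def shuffled_positions(rows: int, cols: int, seed: int) -> list:
    # Same seed -> same permutation, so benchmark runs can be repeated exactly.
    positions = [(r, c) for r in range(1, rows + 1) for c in range(1, cols + 1)]
    random.Random(seed).shuffle(positions)
    return positions

assert shuffled_positions(3, 3, seed=42) == shuffled_positions(3, 3, seed=42)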

Quick Start

Installation

# Clone the repository
git clone https://github.com/yourusername/llm-jigsaw.git
cd llm-jigsaw

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Run a Puzzle

# Set your API key
export OPENAI_API_KEY="your-key-here"

# Run a simple puzzle
python main.py --image images/sample.jpg --resize 512 --grid-size 3 --model openai/gpt-5.2

Run Benchmarks

python benchmark.py \
  --models openai/gpt-5.2 google/gemini-3-pro-preview \
  --image-folder images \
  --grid-size 4 \
  --reasoning-effort low \
  --resize 768 \
  --parallel

📖 Full CLI reference 📊 Benchmark guide


How It Works

The LLM receives the shuffled puzzle image and responds with JSON specifying swaps:

{
  "reasoning": "The sky piece at 1,3 belongs at 1,1 based on color continuity",
  "moves": [
    {"op": "swap", "a": "1,1", "b": "1,3"},
    {"op": "swap", "a": "2,4", "b": "4,2"}
  ]
}

Coordinates use 1-indexed "row,col" format (top-left is "1,1").
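
As an illustration of how such a response can be consumed (a sketch, not the project's implementation; the board representation is assumed), the snippet below parses the JSON above and applies the swaps:

import json

response_text = '''{
  "reasoning": "The sky piece at 1,3 belongs at 1,1 based on color continuity",
  "moves": [
    {"op": "swap", "a": "1,1", "b": "1,3"},
    {"op": "swap", "a": "2,4", "b": "4,2"}
  ]
}'''

def apply_moves(board: dict, response: str) -> dict:
    # board maps 1-indexed "row,col" cells to the piece currently in them.
    for move in json.loads(response)["moves"]:
        if move.get("op") != "swap":
            continue                      # ignore anything but swaps
        a, b = move["a"], move["b"]
        board[a], board[b] = board[b], board[a]
    return board

# A 4x4 board where each piece id is its home cell; cells 1,1 and 1,3 start swapped.
board = {f"{r},{c}": f"{r},{c}" for r in range(1, 5) for c in range(1, 5)}
board["1,1"], board["1,3"] = "1,3", "1,1"
apply_moves(board, response_text)
print(board["1,1"], board["1,3"])         # "1,1" "1,3" -- the first swap fixed them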


Output

Results are saved to the output directory:

results/run_name/
├── result.json       # Complete metrics and move history
├── initial_state.png # Shuffled puzzle at start
├── final_state.png   # Puzzle state at end
└── game.gif          # Animated solving process
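
To post-process a run programmatically, a minimal sketch for reading result.json could look like this; the field names moves and piece_accuracy are assumptions for illustration, so check a real result.json for the actual schema:

import json
from pathlib import Path

run_dir = Path("results/run_name")
result = json.loads((run_dir / "result.json").read_text())

# "moves" and "piece_accuracy" are hypothetical keys used only for this sketch.
print(len(result.get("moves", [])), "moves recorded")
print("final piece accuracy:", result.get("piece_accuracy"))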

Project Structure

llm-jigsaw/
├── src/                      # Core library
│   ├── benchmark/            # Benchmark framework
│   ├── image_processor.py    # Image slicing and state management
│   ├── grid_annotator.py     # Visual annotations
│   ├── llm_interface.py      # LLM API abstraction
│   ├── game.py               # Game controller
│   └── prompts.py            # Prompt templates
├── streamlit_app/            # Human player web app
├── docs/                     # Documentation
├── tests/                    # Test suite
├── images/                   # Test images
├── main.py                   # CLI entry point
└── benchmark.py              # Benchmark runner

License

MIT License