PinchBench

Run PinchBench benchmarks to evaluate OpenClaw agent performance across real-world tasks. Use when testing model capabilities, comparing models, submitting benchmark results to the leaderboard, or checking how well your OpenClaw setup handles calendar, email, research, coding, and multi-step workflows.

Install

openclaw skills install pinchbench

PinchBench Benchmark Skill

PinchBench measures how well LLMs perform as the brain of an OpenClaw agent. Results are collected on a public leaderboard at pinchbench.com.

Prerequisites

  • Python 3.10+
  • uv package manager
  • OpenClaw instance (this agent)

Quick Start

cd <skill_directory>

# Run benchmark with a specific model
uv run benchmark.py --model anthropic/claude-sonnet-4

# Run only automated tasks (faster)
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite automated-only

# Run specific tasks
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite task_01_calendar,task_02_stock

# Skip uploading results
uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload

Available Tasks (23)

Task                            Category      Description
task_00_sanity                  Basic         Verify agent works
task_01_calendar                Productivity  Calendar event creation
task_02_stock                   Research      Stock price lookup
task_03_blog                    Writing       Blog post creation
task_04_weather                 Coding        Weather script
task_05_summary                 Analysis      Document summarization
task_06_events                  Research      Conference research
task_07_email                   Writing       Email drafting
task_08_memory                  Memory        Context retrieval
task_09_files                   Files         File structure creation
task_10_workflow                Integration   Multi-step API workflow
task_11_clawdhub                Skills        ClawHub interaction
task_12_skill_search            Skills        Skill discovery
task_13_image_gen               Creative      Image generation
task_14_humanizer               Writing       Text humanization
task_15_daily_summary           Productivity  Daily digest
task_16_email_triage            Email         Inbox triage
task_17_email_search            Email         Email search
task_18_market_research         Research      Market analysis
task_19_spreadsheet_summary     Analysis      Spreadsheet analysis
task_20_eli5_pdf_summary        Analysis      PDF simplification
task_21_openclaw_comprehension  Knowledge     OpenClaw docs comprehension
task_22_second_brain            Memory        Knowledge management

Command Line Options

Option                Description
--model               Model identifier (e.g., anthropic/claude-sonnet-4)
--suite               all, automated-only, or comma-separated task IDs
--output-dir          Results directory (default: results/)
--timeout-multiplier  Scale task timeouts for slower models
--runs                Number of runs per task for averaging
--no-upload           Skip uploading to leaderboard
--register            Request new API token for submissions
--upload FILE         Upload previous results JSON
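
When the benchmark is launched from a script rather than typed by hand, the options above can be assembled programmatically. The following is a minimal sketch, not part of PinchBench itself: the helper name build_benchmark_command is hypothetical, and it assumes that --runs takes an integer and --timeout-multiplier takes a number, which the option table does not state explicitly.

```python
from __future__ import annotations

from collections.abc import Sequence


def build_benchmark_command(
    model: str,
    suite: str | Sequence[str] | None = None,
    runs: int | None = None,
    timeout_multiplier: float | None = None,
    output_dir: str | None = None,
    upload: bool = True,
) -> list[str]:
    """Build an argv list for benchmark.py (hypothetical helper).

    Only flags listed in the option table are used. The value formats
    for --runs and --timeout-multiplier are assumptions.
    """
    command = ["uv", "run", "benchmark.py", "--model", model]
    if suite is not None:
        # --suite accepts "all", "automated-only", or comma-separated task IDs.
        value = suite if isinstance(suite, str) else ",".join(suite)
        command += ["--suite", value]
    if runs is not None:
        command += ["--runs", str(runs)]
    if timeout_multiplier is not None:
        command += ["--timeout-multiplier", str(timeout_multiplier)]
    if output_dir is not None:
        command += ["--output-dir", output_dir]
    if not upload:
        command.append("--no-upload")
    return command
```

The returned list can be passed to subprocess.run(command, cwd=<skill_directory>, check=True); building a list instead of a shell string avoids quoting problems in model or task names.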

Token Registration

To submit results to the leaderboard:

# Register for an API token (one-time)
uv run benchmark.py --register

# Run benchmark (auto-uploads with token)
uv run benchmark.py --model anthropic/claude-sonnet-4

Results

Results are saved as JSON files in the output directory (results/ by default):

# View task scores
jq '.tasks[] | {task_id, score: .grading.mean}' results/0001_anthropic-claude-sonnet-4.json

# Show failed tasks (mean score below 0.5)
jq '.tasks[] | select(.grading.mean < 0.5)' results/*.json

# Calculate the overall score (prints one average per results file)
jq '{average: ([.tasks[].grading.mean] | add / length)}' results/*.json
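
The same three queries can be run from Python when jq is not available. This sketch assumes only the structure implied by the jq filters above: a top-level tasks list whose entries carry task_id and grading.mean. The function names are illustrative and not part of PinchBench.

```python
import json
from pathlib import Path
from statistics import mean


def load_result(path: str) -> dict:
    """Read one results JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def task_scores(result: dict) -> dict[str, float]:
    """Map each task_id to its grading.mean, like the first jq query."""
    return {task["task_id"]: task["grading"]["mean"] for task in result["tasks"]}


def failed_tasks(result: dict, threshold: float = 0.5) -> list[str]:
    """List task IDs whose mean score is below the threshold."""
    return [tid for tid, score in task_scores(result).items() if score < threshold]


def overall_score(result: dict) -> float:
    """Average grading.mean over all tasks in one results file."""
    return mean(task_scores(result).values())
```

For example, overall_score(load_result("results/0001_anthropic-claude-sonnet-4.json")) returns the same value as the last jq command for that file.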

Adding Custom Tasks

Create a markdown file in tasks/ following TASK_TEMPLATE.md. Each task needs:

  • YAML frontmatter (id, name, category, grading_type, timeout)
  • Prompt section
  • Expected behavior
  • Grading criteria
  • Automated checks (Python grading function)
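
To give a feel for the automated checks item, here is a rough sketch of what a Python grading function might look like. Everything in it is an assumption made for illustration: the function name grade, the workspace argument, the report.md file, and the per-criterion score dictionary. The actual signature and return format are defined by TASK_TEMPLATE.md, which takes precedence.

```python
from pathlib import Path


def grade(workspace: Path) -> dict[str, float]:
    """Hypothetical automated check: score each criterion as 0.0 or 1.0.

    Assumes the task asked the agent to write report.md in the workspace.
    """
    report = workspace / "report.md"
    exists = report.is_file()
    text = report.read_text(encoding="utf-8") if exists else ""
    return {
        "file_created": 1.0 if exists else 0.0,
        "has_title": 1.0 if text.lstrip().startswith("# ") else 0.0,
        "mentions_summary": 1.0 if "summary" in text.lower() else 0.0,
    }


def mean_score(scores: dict[str, float]) -> float:
    """Average the criterion scores; an empty result counts as 0.0."""
    return sum(scores.values()) / len(scores) if scores else 0.0
```

Keeping each criterion as a separate key makes it easier to see which part of the expected behavior the agent missed, rather than collapsing everything into one number up front.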

Leaderboard

View results at pinchbench.com. The leaderboard shows:

  • Model rankings by overall score
  • Per-task breakdowns
  • Historical performance trends