Claw-some AI Agent Testing
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. We run the same set of real-world tasks across different models and measure success rate, speed, and cost to help developers choose the right model for their use case.
PinchBench was made by Kilo Code, the makers of KiloClaw, as a way to help users choose from Kilo's 500+ AI models when setting up their Claw agents.
Tasks are defined as markdown files with YAML frontmatter, stored in the pinchbench/skill repository. Each task definition pairs the instructions given to the agent with the criteria used to grade the result.
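As a rough illustration of that format, the sketch below splits a task file into frontmatter and prompt. The field names (name, category, grading) are hypothetical placeholders, not the actual PinchBench schema.

```python
import yaml  # requires PyYAML

# Hypothetical task file: the frontmatter fields are illustrative only.
EXAMPLE_TASK = """---
name: weather-api-script
category: coding
grading: programmatic
---
Create a Python script that fetches weather data from an API,
parses the response, and includes error handling.
"""

def parse_task(text: str) -> tuple[dict, str]:
    """Split a markdown task file into its YAML frontmatter and task prompt."""
    _, frontmatter, body = text.split("---", 2)
    return yaml.safe_load(frontmatter), body.strip()

meta, prompt = parse_task(EXAMPLE_TASK)
print(meta["grading"], "->", prompt.splitlines()[0])
```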
The benchmark includes 23 tasks across different categories:
Confirm the agent can process and respond to simple instructions with a greeting.
Parse a natural language request and generate a valid ICS calendar file with correct date, time, attendees, and description.
Research a current stock price using web tools and save a formatted report with ticker, price, date, and market context.
Write a structured ~500-word blog post on a given topic with proper markdown formatting, clear arguments, and examples.
Create a Python script that fetches weather data from an API, parses the response, and includes error handling.
Read a provided document and write a concise 3-paragraph summary capturing the main themes and key points.
Research and compile 5 real tech conferences with accurate names, dates, locations, and URLs.
Draft a polite, professional email declining a meeting while maintaining good relationships and offering alternatives.
Extract specific facts from a provided project notes file (dates, team members, tech stack) and answer questions accurately.
Create a standard project directory structure with source files, README, and .gitignore with correct content.
Read a config file, extract API settings, create a Python script to call the endpoint, and document the process.
Install a skill from the OpenClaw skill registry and verify it is available.
Search the skill registry for weather-related skills and install the appropriate one.
Generate an image matching a description using AI image generation tools and save it to a file.
Transform robotic AI-generated content into natural, human-sounding writing using a humanizer skill.
Synthesize multiple research documents into a coherent daily summary with key findings.
Analyze multiple emails, prioritize by urgency, and create an organized triage report.
Search through email archives to find relevant messages and summarize findings.
Research the competitive landscape for enterprise APM, identifying top players and key differentiators.
Analyze CSV and Excel files to extract insights and create data summaries.
Read a technical PDF and create an "Explain Like I'm 5" summary using simple language and analogies.
Extract specific information from a research report PDF and answer targeted questions.
Store information in memory, then recall it accurately across multiple sessions.
Tasks use one of three grading types:
Programmatic: Python functions check workspace files and the execution transcript for specific criteria (file existence, content patterns, tool usage); a sketch of this style appears below.
LLM judge: Claude Opus evaluates qualitative aspects using detailed rubrics with explicit score levels (content quality, appropriateness, completeness).
Hybrid: Combines automated checks for verifiable criteria with an LLM judge for qualitative assessment.
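For illustration, a programmatic grader can be an ordinary Python function that inspects the task workspace and transcript. The sketch below grades the weather-script task under assumed details: the output file name, the transcript check, and the return format are illustrative, not PinchBench's actual grading interface.

```python
from pathlib import Path

def grade_weather_script(workspace: Path, transcript: str) -> dict:
    """Check that the agent produced a script with an API call and error handling."""
    script = workspace / "weather.py"  # assumed output file name
    if not script.exists():
        return {"passed": False, "reason": "weather.py was not created"}

    source = script.read_text()
    checks = {
        "calls an HTTP library": "requests" in source or "urllib" in source,
        "handles errors": "try" in source and "except" in source,
        "wrote the file via a tool": "write" in transcript.lower(),  # assumed transcript format
    }
    return {"passed": all(checks.values()), "checks": checks}
```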
Benchmark versions use semantic versioning (SemVer) to make it easy to understand when changes affect results. Versions are determined in the following order:
When running the benchmark after cloning the repository, the version comes from the most recent GitHub release tag (e.g., v1.0.0, v1.1.0). Each release marks a meaningful change to the benchmark.
For development or CI environments, the version can be specified in a BENCHMARK_VERSION file in the project root. This allows pinning to specific versions without git tags.
When installed via pip, the version is automatically determined by setuptools-scm from the git tag associated with the installed commit.
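That resolution order can be pictured as a simple fallback chain. The Python sketch below is illustrative only: the package name pinchbench, the git command, and the function name are assumptions, not the code PinchBench actually ships.

```python
import subprocess
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path

def resolve_benchmark_version(root: Path = Path(".")) -> str:
    # 1. Repository clone: use the most recent release tag.
    try:
        tag = subprocess.run(
            ["git", "describe", "--tags", "--abbrev=0"],
            cwd=root, capture_output=True, text=True, check=True,
        ).stdout.strip()
        if tag:
            return tag
    except (OSError, subprocess.CalledProcessError):
        pass

    # 2. Development / CI: a BENCHMARK_VERSION file in the project root.
    pin = root / "BENCHMARK_VERSION"
    if pin.exists():
        return pin.read_text().strip()

    # 3. pip install: version recorded by setuptools-scm in the package metadata.
    try:
        return version("pinchbench")
    except PackageNotFoundError:
        return "unknown"
```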
The Current badge marks the most recent version that has official benchmark results. Scores recorded under the same benchmark version are directly comparable, as they use the same task definitions and grading criteria. Legacy versions created before semantic versioning are displayed as 1.0.0-beta.N.
PinchBench is fully open source. Explore the code on GitHub.
Want to add a new benchmark task or improve the system? Check out the task template for the required structure, then submit a PR to the skill repository.
For leaderboard improvements or bug reports, open an issue in the appropriate repository.
The best model depends on your priorities. For highest success rate, check the Success Rate leaderboard. For fastest completions, see the Speed view. For budget-conscious users, the Cost and Value views show which models deliver the best results per dollar. Claude, GPT-4, and Gemini models typically lead on quality, while smaller models like Mistral and Llama offer better value.
For coding tasks, models with strong reasoning capabilities perform best. Check the task-by-task breakdown on any model's detail page to see how it handles specific coding challenges like file creation, API workflows, and script generation. Models scoring above 80% on the benchmark are generally reliable for production coding workflows.
We run benchmarks continuously as new models are released. The leaderboard shows when each result was submitted. Official runs are conducted by the PinchBench team on standardized hardware; community members can also submit runs which are marked as unofficial.
Yes! PinchBench is open source. Install the pinchbench skill and run it with any model supported by OpenClaw. Results can be submitted to the public leaderboard for community comparison.