Claw-some AI Agent Testing
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. We run the same set of real-world tasks across different models and measure success rate, speed, and cost to help developers choose the right model for their use case.
PinchBench was made by Kilo Code, the makers of KiloClaw, as a way to help users choose from Kilo's 500+ AI models when setting up their Claw agents.
Tasks are defined as markdown files with YAML frontmatter, stored in the pinchbench/skill repository. Each task definition pairs the instructions given to the agent with the criteria used to grade the result.
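As a rough illustration of that format, the sketch below splits a task file into frontmatter and prompt. The field names (name, category, grading) are hypothetical placeholders, not the actual PinchBench schema.

```python
import yaml  # requires PyYAML

# Hypothetical task file: the frontmatter fields are illustrative only.
EXAMPLE_TASK = """---
name: weather-api-script
category: coding
grading: programmatic
---
Create a Python script that fetches weather data from an API,
parses the response, and includes error handling.
"""

def parse_task(text: str) -> tuple[dict, str]:
    """Split a markdown task file into its YAML frontmatter and task prompt."""
    _, frontmatter, body = text.split("---", 2)
    return yaml.safe_load(frontmatter), body.strip()

meta, prompt = parse_task(EXAMPLE_TASK)
print(meta["grading"], "->", prompt.splitlines()[0])
```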
The benchmark includes 23 tasks across different categories:
Confirm the agent can process and respond to simple instructions with a greeting.
Parse a natural language request and generate a valid ICS calendar file with correct date, time, attendees, and description.
Research a current stock price using web tools and save a formatted report with ticker, price, date, and market context.
Write a structured ~500-word blog post on a given topic with proper markdown formatting, clear arguments, and examples.
Create a Python script that fetches weather data from an API, parses the response, and includes error handling.
Read a provided document and write a concise 3-paragraph summary capturing the main themes and key points.
Research and compile 5 real tech conferences with accurate names, dates, locations, and URLs.
Draft a polite, professional email declining a meeting while maintaining good relationships and offering alternatives.
Extract specific facts from a provided project notes file (dates, team members, tech stack) and answer questions accurately.
Create a standard project directory structure with source files, README, and .gitignore with correct content.
Read a config file, extract API settings, create a Python script to call the endpoint, and document the process.
Install a skill from the OpenClaw skill registry and verify it is available.
Search the skill registry for weather-related skills and install the appropriate one.
Generate an image matching a description using AI image generation tools and save it to a file.
Transform robotic AI-generated content into natural, human-sounding writing using a humanizer skill.
Synthesize multiple research documents into a coherent daily summary with key findings.
Analyze multiple emails, prioritize by urgency, and create an organized triage report.
Search through email archives to find relevant messages and summarize findings.
Research the competitive landscape for enterprise APM, identifying top players and key differentiators.
Analyze CSV and Excel files to extract insights and create data summaries.
Read a technical PDF and create an "Explain Like I'm 5" summary using simple language and analogies.
Extract specific information from a research report PDF and answer targeted questions.
Store information in memory, then recall it accurately across multiple sessions.
Tasks use one of three grading types:
Programmatic: Python functions check workspace files and the execution transcript for specific criteria (file existence, content patterns, tool usage); a sketch of this style appears below.
LLM judge: Claude Opus evaluates qualitative aspects using detailed rubrics with explicit score levels (content quality, appropriateness, completeness).
Hybrid: Combines automated checks for verifiable criteria with an LLM judge for qualitative assessment.
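For illustration, a programmatic grader can be an ordinary Python function that inspects the task workspace and transcript. The sketch below grades the weather-script task under assumed details: the output file name, the transcript check, and the return format are illustrative, not PinchBench's actual grading interface.

```python
from pathlib import Path

def grade_weather_script(workspace: Path, transcript: str) -> dict:
    """Check that the agent produced a script with an API call and error handling."""
    script = workspace / "weather.py"  # assumed output file name
    if not script.exists():
        return {"passed": False, "reason": "weather.py was not created"}

    source = script.read_text()
    checks = {
        "calls an HTTP library": "requests" in source or "urllib" in source,
        "handles errors": "try" in source and "except" in source,
        "wrote the file via a tool": "write" in transcript.lower(),  # assumed transcript format
    }
    return {"passed": all(checks.values()), "checks": checks}
```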
Benchmark versions use semantic versioning (SemVer) to make it easy to understand when changes affect results. Versions are determined in the following order:
When running the benchmark after cloning the repository, the version comes from the most recent GitHub release tag (e.g., v1.0.0, v1.1.0). Each release marks a meaningful change to the benchmark.
For development or CI environments, the version can be specified in a BENCHMARK_VERSION file in the project root. This allows pinning to specific versions without git tags.
When installed via pip, the version is automatically determined by setuptools-scm from the git tag associated with the installed commit.
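That resolution order can be pictured as a simple fallback chain. The Python sketch below is illustrative only: the package name pinchbench, the git command, and the function name are assumptions, not the code PinchBench actually ships.

```python
import subprocess
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path

def resolve_benchmark_version(root: Path = Path(".")) -> str:
    # 1. Repository clone: use the most recent release tag.
    try:
        tag = subprocess.run(
            ["git", "describe", "--tags", "--abbrev=0"],
            cwd=root, capture_output=True, text=True, check=True,
        ).stdout.strip()
        if tag:
            return tag
    except (OSError, subprocess.CalledProcessError):
        pass

    # 2. Development / CI: a BENCHMARK_VERSION file in the project root.
    pin = root / "BENCHMARK_VERSION"
    if pin.exists():
        return pin.read_text().strip()

    # 3. pip install: version recorded by setuptools-scm in the package metadata.
    try:
        return version("pinchbench")
    except PackageNotFoundError:
        return "unknown"
```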
The Current badge marks the most recent version that has official benchmark results. Scores recorded under the same benchmark version are directly comparable, as they use the same task definitions and grading criteria. Legacy versions created before semantic versioning are displayed as 1.0.0-beta.N.
PinchBench is fully open source. Explore the code on GitHub.
Want to add a new benchmark task or improve the system? Check out the task template for the required structure, then submit a PR to the skill repository.
For leaderboard improvements or bug reports, open an issue in the appropriate repository.
The best model depends on your priorities. For highest success rate, check the Success Rate leaderboard. For fastest completions, see the Speed view. For budget-conscious users, the Cost and Value views show which models deliver the best results per dollar. Claude, GPT-4, and Gemini models typically lead on quality, while smaller models like Mistral and Llama offer better value.
For coding tasks, models with strong reasoning capabilities perform best. Check the task-by-task breakdown on any model's detail page to see how it handles specific coding challenges like file creation, API workflows, and script generation. Models scoring above 80% on the benchmark are generally reliable for production coding workflows.
We run benchmarks continuously as new models are released. The leaderboard shows when each result was submitted. Official runs are conducted by the PinchBench team on standardized hardware; community members can also submit runs which are marked as unofficial.
Yes! PinchBench is open source. Install the pinchbench skill and run it with any model supported by OpenClaw. Results can be submitted to the public leaderboard for community comparison.