Claw-some AI Agent Testing
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. We run the same set of real-world tasks across different models and measure success rate, speed, and cost to help developers choose the right model for their use case.
Tasks are defined as markdown files with YAML frontmatter, stored in the pinchbench/skill repository.
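As a rough sketch of that format, a task file might look like the following; the frontmatter field names here are illustrative guesses, not the actual PinchBench schema (see the task template in the repository for the real structure):

```markdown
---
# Field names below are invented for illustration only.
name: fetch-stock-price
category: research
grading: programmatic
---

Research a current stock price using web tools and save a formatted
report with ticker, price, date, and market context.
```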
The benchmark includes 23 tasks across different categories:
- Confirm the agent can process and respond to simple instructions with a greeting.
- Parse a natural language request and generate a valid ICS calendar file with correct date, time, attendees, and description.
- Research a current stock price using web tools and save a formatted report with ticker, price, date, and market context.
- Write a structured ~500-word blog post on a given topic with proper markdown formatting, clear arguments, and examples.
- Create a Python script that fetches weather data from an API, parses the response, and includes error handling.
- Read a provided document and write a concise 3-paragraph summary capturing the main themes and key points.
- Research and compile 5 real tech conferences with accurate names, dates, locations, and URLs.
- Draft a polite, professional email declining a meeting while maintaining good relationships and offering alternatives.
- Extract specific facts from a provided project notes file (dates, team members, tech stack) and answer questions accurately.
- Create a standard project directory structure with source files, README, and .gitignore with correct content.
- Read a config file, extract API settings, create a Python script to call the endpoint, and document the process.
- Install a skill from the OpenClaw skill registry and verify it is available.
- Search the skill registry for weather-related skills and install the appropriate one.
- Generate an image matching a description using AI image generation tools and save it to a file.
- Transform robotic AI-generated content into natural, human-sounding writing using a humanizer skill.
- Synthesize multiple research documents into a coherent daily summary with key findings.
- Analyze multiple emails, prioritize by urgency, and create an organized triage report.
- Search through email archives to find relevant messages and summarize findings.
- Research the competitive landscape for enterprise APM, identifying top players and key differentiators.
- Analyze CSV and Excel files to extract insights and create data summaries.
- Read a technical PDF and create an "Explain Like I'm 5" summary using simple language and analogies.
- Extract specific information from a research report PDF and answer targeted questions.
- Store information in memory, then recall it accurately across multiple sessions.

Each task's full definition can be viewed on GitHub.

Tasks use one of three grading types:
- Programmatic: Python functions check workspace files and the execution transcript for specific criteria (file existence, content patterns, tool usage).
- LLM judge: Claude Opus evaluates qualitative aspects using detailed rubrics with explicit score levels (content quality, appropriateness, completeness).
- Hybrid: combines automated checks for verifiable criteria with an LLM judge for qualitative assessment.
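A Python check of that kind might look like this minimal sketch. The function signature, return shape, and checked file names are hypothetical, not the actual PinchBench grading API:

```python
from pathlib import Path


def grade(workspace: Path, transcript: str) -> dict:
    """Hypothetical programmatic grader: verifies that the agent wrote a
    report file with expected content and used a web tool along the way."""
    report = workspace / "report.md"  # invented file name for illustration
    checks = {
        # File existence
        "file_exists": report.exists(),
        # Content pattern (short-circuits if the file is missing)
        "has_ticker": report.exists() and "AAPL" in report.read_text(),
        # Tool usage, checked against the execution transcript
        "used_web_tool": "web_search" in transcript,
    }
    return {"passed": all(checks.values()), "checks": checks}
```

A real grader would be task-specific, but the shape is the same: inspect the workspace and transcript, then reduce a set of named boolean checks to a pass/fail result.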
Each benchmark version is identified by the git commit hash of the pinchbench/skill repository at the time the run was executed. This means any change to the repo, no matter how small, produces a new version hash, giving every result a precise, auditable link back to the exact task definitions and grading logic that were used.
Not every new commit changes the substance of the benchmark, though. Commits that only touch documentation, CI configuration, tooling, or other files unrelated to task prompts and scoring logic do not affect results. We mark all versions that share the same underlying task definitions and grading criteria as current, so scores across those versions are directly comparable. When a commit does alter a task prompt, grading rubric, or scoring function, older versions lose their "current" status and results from different generations of the benchmark are kept separate.
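As a rough illustration of that policy (the path prefixes and function are invented for this example, not taken from the PinchBench repo), deciding whether a commit starts a new benchmark generation only requires looking at which files it touches:

```python
# Hypothetical: a commit changes the benchmark generation only if it
# touches task prompts or grading logic; docs/CI-only commits do not.
SUBSTANTIVE_PREFIXES = ("tasks/", "grading/")  # invented paths for illustration


def changes_results(changed_files: list[str]) -> bool:
    """Return True if any changed file affects task definitions or scoring."""
    return any(f.startswith(SUBSTANTIVE_PREFIXES) for f in changed_files)
```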
PinchBench is fully open source. Explore the code:
Want to add a new benchmark task or improve the system? Check out the task template for the required structure, then submit a PR to the skill repository.
For leaderboard improvements or bug reports, open an issue in the appropriate repository.