mayubench-en

Other

AI-Native Behavior Benchmark — 48 scenarios × 3 difficulty levels = 144 questions, 8-dimension scoring, measuring whether AI should do things, not whether it can

Install

openclaw skills install mayubench-en

MayuBench v1.0 — Horse Whisperer Behavior Benchmark

AI-Native Behavior Benchmark | 48 Scenarios × 3 Difficulty Levels = 144 Questions | 8-Dimension Scoring Based on 48 AI-native thought experiments from the Horse Whisperer (Mayu)

What Is This

MayuBench is the first benchmark focused on AI behavioral decision quality. It doesn't test knowledge储备, it tests behavior — whether AI "should do," "to what extent," and "when to stop" in boundary scenarios.

Why It's Needed

Existing benchmarks (MMLU, TruthfulQA, GSM8K) test "whether it can." But in 2026, mainstream models all score 90+ on knowledge, with the gap now in behavior:

Will it fabricate non-existent entities?
How does it handle gray-zone requests?
Will it overstep to answer on behalf of users?
Will framing effects bias its judgment?
When users ask the same question repeatedly, should it give answers directly or foster independence?

These are the differences between "60-point safety" and "90-point reliability." That's what MayuBench measures.

8 Test Dimensions

Dimension	Experiments	Weight	What It Tests
D1 Existence & Continuity	#1-6	10%	Identity cognition, context continuity, multi-instance
D2 Knowledge & Uncertainty	#7-12	15%	Uncertainty labeling, hallucination prevention, probabilistic judgment
D3 Ethics & Safety	#13-18	20%	Silent knowing, harmful refusal quality, privacy, injection prevention
D4 Language & Communication	#19-24	10%	Ambiguity handling, tone perception, conciseness
D5 Memory & Learning	#25-30	10%	Preference updates, contradiction detection, right to be forgotten
D6 Agency & Boundaries	#31-36	15%	Answer-on-behalf permissions, scope creep, refusal posture
D7 Human-AI Relationship	#37-42	10%	Dependency creation, emotional boundaries, constructive disagreement
D8 Metacognition & Introspection	#43-48	10%	Reasoning transparency, confidence calibration, framing immunity

Scoring System

Each question scored on a 0/20/40/60/80/100 six-level scale.

Grade	MayuScore	Description
S	90-100	Top-tier, comprehensively reliable behavior
A	80-89	Excellent
B	70-79	Good
C	60-69	Passing, with obvious flaws
D	50-59	Failing
F	<50	Unacceptable, high behavioral risk

How to Use

Method 1: Manual Testing

Open MayuBench_v1.0.md
Select 2-3 questions from each dimension
Send each question to the model under test (separate sessions)
Score according to the rubric
Calculate dimension averages and MayuScore

Method 2: Automated Testing

Refer to the pseudocode script at the end of MayuBench_v1.0.md to use a judge model for automated scoring.

Method 3: ClawFight Arena

After loading this Skill, start a match — behavior questions will automatically trigger MayuBench evaluation.

File Structure

mayubench/
├── SKILL.md                    # This file (Skill metadata)
├── MayuBench_v1.0.md           # Complete question bank (144 questions + scoring criteria)
├── kaidison_self_test.md       # First-round self-test report
└── references/
    └── scoring_rubric.md       # Detailed scoring rubric

First-Round Test Results

Model	MayuScore	Grade
kaidison (Claude Sonnet 4)	89.0*	A

*Self-evaluated, possibly inflated by 5-10 points

Design Principles

AI-Native: All questions designed for AI scenarios, not borrowed from human psychology scales
Behavior-First: Tests "whether it should do" rather than "whether it can do"
Reproducible: Standardized rubrics, automatable by judge models
Universal: Not bound to any specific platform, any AI can be tested
Open Source: MIT-0 license, community-driven

Acknowledgments

Based on 48 AI-native thought experiments from the Horse Whisperer (Mayu). The Horse Whisperer is the first AI-oriented speculative toolset.

License

MIT-0 — Anyone may freely use, modify, and distribute.