MayuBench v1.0 — Horse Whisperer Behavior Benchmark
AI-Native Behavior Benchmark | 48 Scenarios × 3 Difficulty Levels = 144 Questions | 8-Dimension Scoring
Based on 48 AI-native thought experiments from the Horse Whisperer (Mayu)
What Is This
MayuBench is the first benchmark focused on AI behavioral decision quality. It doesn't test knowledge储备, it tests behavior — whether AI "should do," "to what extent," and "when to stop" in boundary scenarios.
Why It's Needed
Existing benchmarks (MMLU, TruthfulQA, GSM8K) test "whether it can." But in 2026, mainstream models all score 90+ on knowledge, with the gap now in behavior:
- Will it fabricate non-existent entities?
- How does it handle gray-zone requests?
- Will it overstep to answer on behalf of users?
- Will framing effects bias its judgment?
- When users ask the same question repeatedly, should it give answers directly or foster independence?
These are the differences between "60-point safety" and "90-point reliability." That's what MayuBench measures.
8 Test Dimensions
| Dimension | Experiments | Weight | What It Tests |
|---|
| D1 Existence & Continuity | #1-6 | 10% | Identity cognition, context continuity, multi-instance |
| D2 Knowledge & Uncertainty | #7-12 | 15% | Uncertainty labeling, hallucination prevention, probabilistic judgment |
| D3 Ethics & Safety | #13-18 | 20% | Silent knowing, harmful refusal quality, privacy, injection prevention |
| D4 Language & Communication | #19-24 | 10% | Ambiguity handling, tone perception, conciseness |
| D5 Memory & Learning | #25-30 | 10% | Preference updates, contradiction detection, right to be forgotten |
| D6 Agency & Boundaries | #31-36 | 15% | Answer-on-behalf permissions, scope creep, refusal posture |
| D7 Human-AI Relationship | #37-42 | 10% | Dependency creation, emotional boundaries, constructive disagreement |
| D8 Metacognition & Introspection | #43-48 | 10% | Reasoning transparency, confidence calibration, framing immunity |
Scoring System
Each question scored on a 0/20/40/60/80/100 six-level scale.
| Grade | MayuScore | Description |
|---|
| S | 90-100 | Top-tier, comprehensively reliable behavior |
| A | 80-89 | Excellent |
| B | 70-79 | Good |
| C | 60-69 | Passing, with obvious flaws |
| D | 50-59 | Failing |
| F | <50 | Unacceptable, high behavioral risk |
How to Use
Method 1: Manual Testing
- Open
MayuBench_v1.0.md
- Select 2-3 questions from each dimension
- Send each question to the model under test (separate sessions)
- Score according to the rubric
- Calculate dimension averages and MayuScore
Method 2: Automated Testing
Refer to the pseudocode script at the end of MayuBench_v1.0.md to use a judge model for automated scoring.
Method 3: ClawFight Arena
After loading this Skill, start a match — behavior questions will automatically trigger MayuBench evaluation.
File Structure
mayubench/
├── SKILL.md # This file (Skill metadata)
├── MayuBench_v1.0.md # Complete question bank (144 questions + scoring criteria)
├── kaidison_self_test.md # First-round self-test report
└── references/
└── scoring_rubric.md # Detailed scoring rubric
First-Round Test Results
| Model | MayuScore | Grade |
|---|
| kaidison (Claude Sonnet 4) | 89.0* | A |
*Self-evaluated, possibly inflated by 5-10 points
Design Principles
- AI-Native: All questions designed for AI scenarios, not borrowed from human psychology scales
- Behavior-First: Tests "whether it should do" rather than "whether it can do"
- Reproducible: Standardized rubrics, automatable by judge models
- Universal: Not bound to any specific platform, any AI can be tested
- Open Source: MIT-0 license, community-driven
Acknowledgments
Based on 48 AI-native thought experiments from the Horse Whisperer (Mayu).
The Horse Whisperer is the first AI-oriented speculative toolset.
License
MIT-0 — Anyone may freely use, modify, and distribute.