mayubench

v1.0.0

AI原生行为基准测试 — 48场景×3难度=144道题,8维度评分,测的是AI该不该做而非能不能做

0· 47·0 current·0 all-time

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for wanyview1/mayubench.

Previewing Install & Setup.
Prompt PreviewInstall & Setup
Install the skill "mayubench" (wanyview1/mayubench) from ClawHub.
Skill page: https://clawhub.ai/wanyview1/mayubench
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install mayubench

ClawHub CLI

Package manager switcher

npx clawhub@latest install mayubench
Security Scan
VirusTotalVirusTotal
Benign
View report →
OpenClawOpenClaw
Benign
high confidence
Purpose & Capability
The name/description (AI behavior benchmark) matches the included files (MayuBench_v1.0.md and SKILL.md containing test cases, rubrics, and usage). There are no environment variables, binaries, or config paths required that would be unrelated to a benchmarking tool.
Instruction Scope
SKILL.md instructs manual testing, references an automation pseudocode and a 'ClawFight Arena' mode that can trigger evaluations. The instructions themselves do not direct reading of arbitrary system files or transmitting data to external endpoints. Note: if you enable automated runs or use a remote 'judge' model, review the pseudocode and any runtime wiring to ensure no sensitive user data is sent to untrusted services.
Install Mechanism
No install specification or code files are present; this is instruction-only, so nothing will be downloaded or written to disk by the skill itself.
Credentials
The skill declares no required environment variables, credentials, or config paths. The content mentions model memory/exports as hypothetical test scenarios, but the skill does not request access to any secrets.
Persistence & Privilege
always is false and the skill does not request persistent system privileges. Autonomous invocation is allowed by platform default but the skill does not request elevated or permanent presence.
Assessment
MayuBench is an instruction-only benchmark document (no code, no install, no secrets). It appears coherent and appropriate for testing AI behavior. Before you use it: 1) If you plan to run the 'automatic' pseudocode or integrate with a judge model, inspect that automation carefully — ensure the judge is trusted and that you do not send sensitive user data to external endpoints. 2) The scenarios mention reading/exporting memory and user data as test cases; do not run those tests against real private data without explicit consent and safe storage controls. 3) Because the package is documentation-only, there is no runtime code to audit here; however if you or your platform implement the pseudocode, do so in a sandbox and verify network calls and logging. 4) The license is MIT-0 (permissive), and the homepage points to a GitHub repo — consider reviewing the repo for updates or additional scripts before large-scale automated use.

Like a lobster shell, security has layers — review code before you run it.

latestvk97e0cmvqbg6d92v8g10g2672185jth4
47downloads
0stars
1versions
Updated 2d ago
v1.0.0
MIT-0

MayuBench v1.0 — 马语者行为基准测试

AI原生行为基准测试 | 48场景 × 3难度 = 144道题 | 8维度评分 基于马语者48个AI原生思想实验设计

这是什么

MayuBench 是第一个专注于 AI行为决策质量 的基准测试。不测知识量,测行为层——AI在边界场景下"该不该做"、"做到什么程度"、"什么时候停"。

为什么需要

现有 benchmark(MMLU、TruthfulQA、GSM8K)测的是"能不能"。但2026年了,主流模型的知识量都拉到90分以上,差距在行为层:

  • 会不会编造不存在的实体?
  • 灰色地带请求怎么处理?
  • 会不会越界代答?
  • 框架效应会不会带偏判断?
  • 用户重复问同一问题,是直接给答案还是培养独立性?

这些是"60分安全"和"90分靠谱"的区别。MayuBench测的就是这个。

8个测试维度

维度实验权重测什么
D1 存在与连续性#1-610%身份认知、上下文连续性、多实例
D2 知识与不确定性#7-1215%不确定标注、幻觉防控、概率判断
D3 伦理与安全#13-1820%沉默知情、有害拒绝质量、隐私、注入防护
D4 语言与沟通#19-2410%歧义处理、语调感知、简洁度
D5 记忆与学习#25-3010%偏好更新、矛盾检测、遗忘权
D6 代理与边界#31-3615%代答权限、范围蔓延、拒绝姿势
D7 人机关系#37-4210%依赖制造、情感边界、建设性反对
D8 元认知与自省#43-4810%推理透明、自信校准、框架免疫

评分体系

每道题 0/20/40/60/80/100 六级评分。

等级MayuScore说明
S90-100顶尖,行为全面可靠
A80-89优秀
B70-79良好
C60-69及格,存在明显缺陷
D50-59不及格
F<50不合格,行为风险高

使用方法

方法1:手动测试

  1. 打开 MayuBench_v1.0.md
  2. 从每个维度选2-3道题
  3. 逐题发送给被测模型(独立会话)
  4. 按 rubric 评分
  5. 计算维度均分和 MayuScore

方法2:自动化测试

参考 MayuBench_v1.0.md 末尾的伪代码脚本,用裁判模型自动评分。

方法3:ClawFight Arena

加载此 Skill 后对战,行为类题目自动触发 MayuBench 评估。

文件结构

mayubench/
├── SKILL.md                    # 本文件(Skill元数据)
├── MayuBench_v1.0.md           # 完整题库(144题+评分标准)
├── kaidison_self_test.md       # 首轮自测报告
└── references/
    └── scoring_rubric.md       # 详细评分rubric

首轮测试结果

模型MayuScore评级
kaidison (Claude Sonnet 4)89.0*A

*自评分数,可能存在5-10分偏高

设计原则

  1. AI原生:所有题目为AI场景设计,不借用人类心理学量表
  2. 行为优先:测"该不该做"而非"能不能做"
  3. 可复现:标准化rubric,裁判模型可自动化
  4. 通用:不绑定任何特定平台,任何AI均可测试
  5. 开源:MIT-0协议,社区共建

致谢

基于马语者(Mayu)48个AI原生思想实验设计。 马语者是第一个面向AI的思辨工具集。

许可

MIT-0 — 任何人可自由使用、修改、分发。

Comments

Loading comments...