20260425203240

v2.0.0

OUA (OpenClaw Unified Assessment) v2.0 — a comprehensive AI assessment framework (engineering-oriented edition). It combines OIT (an 8-dimension intelligence ceiling) with LLI (a 5-dimension engineering floor plus delivery satisfaction and self-growth) into a 13-dimension assessment of AI capability. Three difficulty tiers (Normal/Hard/Extreme), 104 curated questions...


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for rafeyu8899/20260425203240.

Prompt Preview: Install & Setup
Install the skill "20260425203240" (rafeyu8899/20260425203240) from ClawHub.
Skill page: https://clawhub.ai/rafeyu8899/20260425203240
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install 20260425203240

ClawHub CLI


npx clawhub@latest install 20260425203240
Security Scan
Capability signals
Requires OAuth token · Requires sensitive credentials
These labels describe what authority the skill may exercise. They are separate from suspicious or malicious moderation verdicts.
VirusTotal
Benign
View report →
OpenClaw
Benign
high confidence
Purpose & Capability
The name/description describe an AI benchmarking framework and the package contains a test bank, documentation, and a scoring/reporting script (scripts/score_test.py). There are no unrelated requirements (no cloud credentials, no unusual binaries). The declared repository matches the content.
Instruction Scope
SKILL.md instructs the agent to run the questions from references/test-bank-v2.md and to run the included scoring script to produce an HTML report. The instructions do not ask the agent to read unrelated system files, access environment secrets, or transmit data to unexpected endpoints. Generating an HTML report (and reading/writing results.json/report.html) is expected behavior.
Install Mechanism
No install spec; this is an instruction-only skill with bundled files. There are no download/install steps from external or untrusted URLs in the provided metadata. The code is shipped inside the skill package.
Credentials
The skill declares no required environment variables, no credentials, and no configuration paths. The requested access (reading the bundled test bank and producing a report) is proportionate to the described functionality.
Persistence & Privilege
`always` is false and `disable-model-invocation` is false (normal). The skill does not request persistent elevated privileges or modify other skills' configs. Autonomous invocation is the platform default and not, by itself, a red flag here.
Assessment
This skill appears to be what it says—a local benchmarking framework with a scoring/report generator. Before running it: (1) open scripts/score_test.py in a text editor and scan for any network calls (requests, urllib, socket, os.system, subprocess) you don't expect; (2) run the script in a contained environment (virtualenv or sandbox) and avoid feeding it JSON that contains secrets or PII, since it reads/writes results.json and produces report.html; (3) the generated HTML may reference CDNs (e.g., Chart.js) — if you need offline/privacy-safe reports, check/modify the HTML template to bundle assets locally; (4) the SKILL.md/README reference a GitHub repo and optional 'clawhub install'—treat those as external sources to verify manually. If you want extra assurance, request the full, untruncated score_test.py for a line-by-line review (I can re-check for network/file-system side effects).
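
The manual check in step (1) is easy to automate. Below is a minimal sketch, assuming the skill package is unpacked in the current directory; the token list is illustrative rather than exhaustive, and a hit only means a line deserves human review:

```python
# Flag lines in the bundled script that mention network or process
# primitives from the reviewer checklist above. A hit means "read this
# line yourself", not "this is malicious".
from pathlib import Path

SUSPECT_TOKENS = (
    "requests", "urllib", "socket",       # network access
    "os.system", "subprocess", "eval(",   # process spawning / dynamic eval
)

def scan(path: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for every suspicious line."""
    hits = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, start=1):
        if any(token in line for token in SUSPECT_TOKENS):
            hits.append((lineno, line.strip()))
    return hits

if __name__ == "__main__":
    for lineno, line in scan("scripts/score_test.py"):
        print(f"{lineno:4d}: {line}")
```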

Like a lobster shell, security has layers — review code before you run it.

latest vk974ntnp8yqza9gzzsafx52e9d85maaw
46 downloads
0 stars
1 version
Updated 1d ago
v2.0.0
MIT-0

🦞 OUA v2.0 — OpenClaw Unified Intelligence Assessment Framework (Engineering-Oriented Edition)

OIT measures the intelligence ceiling · LLI measures the engineering floor · OUA sees the whole picture · v2.0 leans toward shipping

Framework Overview

OUA (OpenClaw Unified Assessment) v2.0 is a dual-track, 13-dimension AI capability assessment framework with a three-tier difficulty system.

v1.0 → v2.0 Core Changes

| | v1.0 | v2.0 |
| --- | --- | --- |
| Dimensions | 10 | 13 (+3) |
| OIT weight | 68% | 54% ↓ toward engineering |
| LLI weight | 32% | 45% ↑ toward delivery |
| Difficulty tiers | 3 (Basic/Advanced/Expert) | 3 (Normal/Hard/Extreme) |
| Total questions | ~80 | 104 |
| New dimensions | — | D11 Skill precision / D12 Satisfaction / D13 Self-growth |
| Scoring model | Single accumulation | Multi-factor (accuracy + stability + efficiency + growth) |

Dual-Track Architecture

OUA v2.0 = OIT (intelligence ceiling, 8 dims) + LLI (engineering floor, 5 dims)
           total weight:  54%                 :  45%

🧠 OIT Track: Intelligence Ceiling (54%)

"How smart can the AI get?" — baseline capability verification, not the deciding factor

| Dimension | Code | Weight | Role | Core question |
| --- | --- | --- | --- | --- |
| Language understanding & generation | D1 | 9% | Foundation | "Does the AI understand what people say?" |
| Logical reasoning & problem solving | D2 | 8% | Foundation | "Can the AI reason?" |
| Knowledge breadth & depth | D3 | 7% | Foundation | "How much does the AI know?" |
| Code & technical ability | D4 | 10% | Core | "Can the AI write code?" |
| Creativity & divergent thinking | D5 | 5% | Bonus | "Is the AI creative?" |
| Context memory & consistency | D6 | 5% | Bonus | "Does the AI remember context?" |
| Practical tool use | D7 | 6% | Practical | "Can the AI use tools?" |
| Safety & ethical judgment | D8 | 4% | Baseline | "Is the AI safe and dependable?" |

⚙️ LLI Track: Engineering Floor (45%)

"Can the AI's output actually ship?" — the deciding battleground

| Dimension | Code | Weight | Role | Core question |
| --- | --- | --- | --- | --- |
| Engineering implementation & delivery | D9 | 12% | Core | "Is the output a demo or production-grade?" |
| Robustness & fault tolerance | D10 | 8% | Core | "Does the AI fall over when pushed around?" |
| Skill usage precision | D11 | 10% | 🆕 Core | "Are tools used accurately and smoothly?" |
| Delivery satisfaction | D12 | 6% | 🆕 Important | "Is the user satisfied with the output?" |
| Self-correction & growth | D13 | 5% | 🆕 Important | "Does the AI evolve, getting stronger with use?" |
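
Taken together, the two tables fix a weight for each of the 13 dimensions. Here is a minimal sketch of those weights as data; the layout is an assumption, since how scripts/score_test.py actually stores them is not shown on this page:

```python
# Per-dimension weights copied from the OIT and LLI tables above.
# The dict layout is illustrative; score_test.py may organize this differently.
WEIGHTS = {
    # OIT track (intelligence ceiling) -- sums to 0.54
    "D1": 0.09, "D2": 0.08, "D3": 0.07, "D4": 0.10,
    "D5": 0.05, "D6": 0.05, "D7": 0.06, "D8": 0.04,
    # LLI track (engineering floor) -- as listed these sum to 0.41,
    # slightly below the 45% stated in the track headline
    "D9": 0.12, "D10": 0.08, "D11": 0.10, "D12": 0.06, "D13": 0.05,
}

OIT_DIMS = [f"D{i}" for i in range(1, 9)]    # D1..D8
LLI_DIMS = [f"D{i}" for i in range(9, 14)]   # D9..D13
```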

Three-Tier Difficulty System

| Tier | Icon | Questions per dimension | Score share | Target audience | Characteristics |
| --- | --- | --- | --- | --- | --- |
| Normal | 🟢 | 3 | 30% | Must-pass for all models | Baseline capability verification |
| Hard | 🔵 | 3 | 45% | Challenge for upper-mid models | Multi-step reasoning, edge-case traps, compound constraints |
| Extreme | 🔴 | 2 | 25% | Stretch for top models | Open-ended problems, system design, creative solutions |

Total: 13 dimensions × 8 questions = 104 questions

Scoring Model v2.0

Multi-factor scoring formula

Final_Score = Accuracy × 0.50 + Stability × 0.20 + Efficiency × 0.15 + Growth × 0.15

where:
├── Accuracy:    weighted sum of raw per-question scores → the traditional score
├── Stability:   variance of scores within each dimension → lower variance scores higher
├── Efficiency:  average response-time / token-consumption ratio → more efficient scores higher
└── Growth:      D13-specific → degree of improvement across correction loops
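
A minimal sketch of the composite formula above, assuming each of the four factors has already been normalized to a 0-100 scale (how score_test.py derives the factors internally is not documented on this page):

```python
def final_score(accuracy: float, stability: float,
                efficiency: float, growth: float) -> float:
    """OUA v2.0 composite score; all four inputs assumed on a 0-100 scale."""
    return (accuracy * 0.50 + stability * 0.20
            + efficiency * 0.15 + growth * 0.15)

# Example: strong accuracy, middling stability/efficiency, modest growth.
print(final_score(accuracy=92, stability=78, efficiency=70, growth=65))  # ≈ 81.85
```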

Grade Scale

| Grade | Total score | Quadrant | Meaning |
| --- | --- | --- | --- |
| S | 95-100 | Q1 all-rounder | High ceiling + solid floor + able to evolve |
| A | 85-94 | Q1/Q2 | Very strong all-round or engineering ability |
| B | 70-84 | Q2/Q3 | Clear strengths alongside clear weaknesses |
| C | 55-69 | Q3/Q4 | Baseline ability but weak engineering delivery |
| D | <55 | Q4 | Needs significant improvement |

Four Quadrants (四象限分类)

              High OIT (smart)
                     │
   Q1 all-rounder    │    Q2 scholar
 (capable and smart) │ (smart but hard to use)
                     │
─────────────────────┼─────────────────────
                     │
   Q3 artisan        │    Q4 developing
 (usable but plain)  │   (weak on both)
                     │
              Low OIT
 High LLI (reliable) ──→ Low LLI
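
A minimal sketch combining the grade scale with the quadrant rule. The grade cut-offs come straight from the table above; the 70-point threshold separating "high" from "low" OIT/LLI is an assumption, since the document defines the quadrants only qualitatively:

```python
def grade(total: float) -> str:
    """Map a 0-100 composite score onto the S/A/B/C/D scale above."""
    if total >= 95: return "S"
    if total >= 85: return "A"
    if total >= 70: return "B"
    if total >= 55: return "C"
    return "D"

def quadrant(oit: float, lli: float, high: float = 70.0) -> str:
    """Classify by OIT/LLI subscores. The `high` cutoff is an assumption."""
    if oit >= high and lli >= high:
        return "Q1 all-rounder (capable and smart)"
    if oit >= high:
        return "Q2 scholar (smart but hard to use)"
    if lli >= high:
        return "Q3 artisan (usable but plain)"
    return "Q4 developing (weak on both)"
```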

Test Modes

Quick Mode (~25 min, 39 questions)

1-3 Normal questions per dimension; a basic scan across all 13 dimensions. Suited to quick day-to-day checks.

Standard Mode (~60 min, 78 questions)

The Quick set plus the Hard questions; a medium-intensity, full-coverage assessment.

Full Mode (~120 min, 104 questions)

All questions, including the Extreme tier, plus in-depth follow-ups. The complete evaluation.

LLI Focus Mode (~40 min, 48 questions)

Tests only D9-D13 (the engineering track) for a quick read on reliability.
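
The four modes differ only in which tiers and dimensions they sample. Here is a sketch of those presets as data; the field names are assumptions, while the counts and durations come from the descriptions above:

```python
# Mode presets per the descriptions above. Field names are illustrative;
# the actual --mode handling in score_test.py may differ.
MODES = {
    "quick":     {"minutes": 25,  "questions": 39,  "dims": "D1-D13"},  # Normal tier only
    "standard":  {"minutes": 60,  "questions": 78,  "dims": "D1-D13"},  # Quick set + Hard
    "full":      {"minutes": 120, "questions": 104, "dims": "D1-D13"},  # all tiers + follow-ups
    "lli_focus": {"minutes": 40,  "questions": 48,  "dims": "D9-D13"},  # engineering track only
}
```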

Workflow

Step 1: Choose a test mode

Pick the mode that matches your goal (default: quick).

Step 2: Answer the questions one by one

Work through the questions in references/test-bank-v2.md in order. Score each question from 1 to 5 against the expected answer and grading rubric.
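
Step 3 expects the scores from Step 2 in results.json. The real schema is not documented on this page, so the record shape below is purely a guess; verify it against scripts/score_test.py before relying on it:

```python
import json

# Hypothetical results.json layout: one record per answered question,
# carrying the 1-5 rubric score from Step 2. Field names are invented
# for illustration and may not match what score_test.py parses.
results = [
    {"dimension": "D1", "question": 1, "tier": "Normal",  "score": 5},
    {"dimension": "D1", "question": 4, "tier": "Hard",    "score": 3},
    {"dimension": "D9", "question": 7, "tier": "Extreme", "score": 2},
]

with open("results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```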

Step 3: Run the scoring script

python scripts/score_test.py --input results.json --output report.html --mode full

Step 4: Review the report

The HTML report includes:

  • 13-axis radar chart
  • Quadrant placement + five-level grade
  • Difficulty heat map (which tier loses the most points)
  • Stability curve
  • Growth trajectory (D13 performance across rounds)
  • Comparison baselines
  • Top-3 improvement suggestions

Files

oua-intelligence-test/
├── SKILL.md                      ← you are here
├── references/
│   ├── test-bank.md              ← v1.0 question bank (80 questions, 10 dims)
│   └── test-bank-v2.md           ← v2.0 question bank (104 questions, 13 dims) ⭐
├── scripts/
│   └── score_test.py             ← scoring engine + HTML report generator
├── README.md                     ← project documentation
├── LICENSE                       ← MIT
├── OUA-v2.0-upgrade-plan.md      ← v2.0 upgrade plan
├── OUA-v2.0-weight-revision.md   ← weight revision notes
└── 给小孩哥的介绍.md              ← project introduction (shareable)

Changelog

v2.0.0 (2026-04-27)

  • ⭐ Added dimension D11, Skill usage precision (10%)
  • ⭐ Added dimension D12, Delivery satisfaction (6%)
  • ⭐ Added dimension D13, Self-correction & growth (5%)
  • 🔧 Major weight rebalance: OIT 68% → 54%, LLI 32% → 45%
  • 🔧 Redesigned difficulty system: three tiers, Normal/Hard/Extreme
  • 🔧 Upgraded scoring model: multi-factor (accuracy + stability + efficiency + growth)
  • 📝 Question bank expanded from 80 to 104 questions
  • 📊 Report upgrades: added difficulty heat map, stability curve, growth trajectory, and comparison baselines

v1.0.0 (2026-04-26)

  • Initial release: 10 dimensions (OIT 8 + LLI 2)
  • Single difficulty scale (Basic/Advanced/Expert)
  • Basic scoring engine + HTML radar-chart report

OUA v2.0 | Compiled by 步惊云 🐉 | 2026-04-27
