ML Model Eval Benchmark

v0.1.0

Compare model candidates using weighted metrics and deterministic ranking outputs. Use for benchmark leaderboards and model promotion decisions.

by Muhammad Mazhar Saeed (@0x-professor)
License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name and description match the included files: SKILL.md, a benchmarking guide, and a Python script that computes weighted scores and rankings. Nothing in the bundle requests unrelated capabilities or credentials.
Instruction Scope
The runtime instructions direct the agent to run the bundled script and consult the guide. The script only reads a user-supplied JSON input (size-limited), computes scores, and writes an output artifact. The instructions do not ask the agent to read other system files or environment variables, or to transmit data externally.
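The read-and-score step described above might look roughly like the following sketch. It is not the bundled script: the size limit, the function names, and the input layout (a "weights" mapping plus per-candidate "metrics") are all assumptions made for illustration.

```python
import json
from pathlib import Path

# Hypothetical cap; the bundled script's actual size limit is not stated.
MAX_INPUT_BYTES = 1_000_000


def load_input(path):
    """Read a JSON file, refusing anything larger than MAX_INPUT_BYTES."""
    p = Path(path)
    size = p.stat().st_size
    if size > MAX_INPUT_BYTES:
        raise ValueError(f"input is {size} bytes; limit is {MAX_INPUT_BYTES}")
    with p.open(encoding="utf-8") as fh:
        return json.load(fh)


def weighted_score(metrics, weights):
    """Weighted sum of the named metrics, normalized by the total weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * metrics[name] for name, w in weights.items()) / total
```

Normalizing by the total weight keeps scores comparable when the weights do not sum to 1; whether the real script normalizes is something to confirm by reading it.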
Install Mechanism
No install spec is provided (instruction-only with a bundled script). No downloads, package installs, or external package registry usage are present.
Credentials
The skill declares no environment variables, credentials, or config paths. The script operates solely on an explicit input file and an explicit output path; there are no hidden secret requirements.
Persistence & Privilege
The always flag is set to false, and the skill does not request persistent system presence or modify other skills. The script writes only to the user-specified output path and creates parent directories as needed.
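For reference, writing an artifact while creating missing parent directories is typically done as in the sketch below. The function name and the JSON output format are assumptions, not taken from the bundled script.

```python
import json
from pathlib import Path


def write_output(result, output_path):
    """Write result as JSON, creating parent directories if they are missing.

    Note: an existing file at output_path is overwritten without warning.
    """
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    # sort_keys keeps the serialized artifact byte-stable across runs.
    out.write_text(json.dumps(result, indent=2, sort_keys=True) + "\n",
                   encoding="utf-8")
    return out
```

Because a write like this silently replaces an existing file, pointing the output path at a dedicated directory is the safer habit.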
Assessment
This skill appears low-risk and does what it says: it runs the bundled script on a JSON input to produce a leaderboard. Before installing or using it: (1) review the script, or run it locally on non-sensitive sample data, to confirm its behavior; (2) ensure the input JSON and requested output path are trusted (the script will create parent directories and may overwrite the specified output file); (3) note that there are no network calls or credential accesses, so it won't exfiltrate data, but it also performs minimal validation of metric values and tie-break behavior. Verify that the weighting and tie-break rules meet your policy for model promotion decisions.
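If stricter checks are needed before a promotion decision, metric validation and an explicit tie-break could be layered on top. The sketch below shows one possible policy, not the script's own: it rejects non-numeric or non-finite metric values and breaks score ties alphabetically by model name. All names here are hypothetical.

```python
import math


def validate_metric(name, value):
    """Return value as a float, rejecting non-numeric, NaN, or infinite input."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"metric {name!r} must be a number, got {value!r}")
    if not math.isfinite(value):
        raise ValueError(f"metric {name!r} must be finite, got {value!r}")
    return float(value)


def rank_candidates(scores):
    """Rank a {model_name: score} mapping.

    Higher scores rank first; ties are broken by model name in ascending
    order (an assumed policy), so the same input always gives the same order.
    """
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        {"rank": i, "model": name, "score": score}
        for i, (name, score) in enumerate(ordered, start=1)
    ]
```

With this policy, two models that both score 0.9 are always ordered by name, regardless of the order in which they appear in the input.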



