ML Model Eval Benchmark
v0.1.0
Compare model candidates using weighted metrics and deterministic ranking outputs. Use for benchmark leaderboards and model promotion decisions.
Security Scan
OpenClaw
Benign
high confidence
Purpose & Capability
Name and description match the included files: SKILL.md, a benchmarking guide, and a Python script that computes weighted scores and rankings. Nothing in the bundle requests unrelated capabilities or credentials.
Instruction Scope
The runtime instructions direct the agent to run the bundled script and consult the guide. The script only reads a user-supplied JSON input (size-limited), computes scores, and writes an output artifact. The instructions do not ask the agent to read other system files or environment variables, or to transmit data externally.
Install Mechanism
No install spec is provided (instruction-only with a bundled script). No downloads, package installs, or external package registry usage are present.
Credentials
The skill declares no environment variables, credentials, or config paths. The script operates solely on an explicit input file and an explicit output path; there are no hidden secret requirements.
Persistence & Privilege
The always flag is false, and the skill does not request persistent system presence or modify other skills. The script writes only to the user-specified output path, creating parent directories as needed.
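The write behavior described above (output restricted to the user-specified path, with parent directories created on demand) can be sketched in Python. This is an illustrative pattern, not the bundled script's actual code; the function name and JSON payload shape are assumptions.

```python
import json
from pathlib import Path


def write_artifact(output_path: str, payload: dict) -> None:
    """Write a JSON artifact to an explicit, user-supplied path.

    Creates parent directories as needed. Note: an existing file at
    output_path is silently overwritten, which matches the reviewer's
    caution to only pass trusted output paths.
    """
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(payload, indent=2))
```

Because the function touches nothing outside the given path, the "writes only to the user-specified output path" claim is easy to verify by inspection.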
Assessment
This skill appears low-risk and does what it says: it runs the bundled script on a JSON input to produce a leaderboard. Before installing or using it: (1) review or run the script locally on non-sensitive sample data to confirm its behavior; (2) ensure the input JSON and requested output path are trusted (the script creates parent directories and may overwrite the specified output file); (3) note that there are no network calls or credential accesses, so it won't exfiltrate data; however, it does minimal validation of metric values and tie-break behavior, so verify that the weighting and tie-break rules meet your policy for model promotion decisions.
ML Model Eval Benchmark
Overview
Produce consistent model ranking outputs from metric-weighted evaluation inputs.
Workflow
- Define metric weights and accepted metric ranges.
- Ingest model metrics for each candidate.
- Compute weighted score and ranking.
- Export leaderboard and promotion recommendation.
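The scoring and ranking steps above can be sketched in Python. This is a minimal illustration of weighted scoring with a deterministic tie-break, not the bundled script itself; the function names, the normalization by total weight, and the alphabetical tie-break on model name are all assumptions to be checked against references/benchmarking-guide.md.

```python
def weighted_scores(candidates: dict, weights: dict) -> dict:
    """Compute a weighted score per candidate.

    candidates: {model_name: {metric_name: value}}
    weights:    {metric_name: weight}
    Scores are normalized by the total weight so they stay on the
    same scale as the input metrics.
    """
    total = sum(weights.values())
    return {
        name: sum(metrics[m] * w for m, w in weights.items()) / total
        for name, metrics in candidates.items()
    }


def rank(scores: dict) -> list:
    """Rank deterministically: score descending, then name ascending.

    The secondary sort key makes ties reproducible across runs, which
    matters for promotion decisions based on leaderboard order.
    """
    return sorted(scores, key=lambda name: (-scores[name], name))
```

For example, with weights {"accuracy": 0.7, "latency_score": 0.3}, a candidate scoring 0.8 on accuracy and 0.9 on latency outranks one scoring 0.9 and 0.5, because the weighted sums are 0.83 versus 0.78.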
Use Bundled Resources
- Run scripts/benchmark_models.py to generate benchmark outputs.
- Read references/benchmarking-guide.md for weighting and tie-break guidance.
Guardrails
- Keep metric names and scales consistent across candidates.
- Record weighting assumptions in the output.
