Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

Improvement Discriminator

v1.0.0

Use when you need multi-reviewer blind scoring of improvement candidates, LLM-based semantic evaluation, a decision on whether a candidate should be accepted, or an explanation of why every score came back as hold. Supports --panel multi-reviewer blind review and --llm-judge semantic evaluation. Not for structural evaluation (use improvement-learner) or gate decisions (use improvement-gate).

by silhouette (@lanyasheng)
License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan

VirusTotal: Suspicious
OpenClaw: Suspicious (medium confidence)
Purpose & Capability
The name/description describe a multi-signal scoring engine and the repository contains matching components (heuristics, evaluator, human review, LLM judge). That capability set is coherent with the stated purpose. However, the code also includes a RealSkillEvaluator that can load and run arbitrary Python Skill modules from file paths and the critic modifies sys.path to import sibling skill code (benchmark-store). Those capabilities fit a broad evaluator but expand the skill's reach beyond simple scoring and are not emphasized in SKILL.md.
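The core concern here is the dynamic-import pattern itself, not any one call site. A minimal sketch of what loading a Skill module from a file path looks like (the function and module names below are hypothetical, not taken from the skill's code) shows why pointing such an evaluator at an untrusted path is equivalent to executing that file:

```python
import importlib.util
import sys

def load_skill_module(path: str, name: str = "loaded_skill"):
    """Load a Python module from an arbitrary file path.

    Any top-level code in the file runs at load time, so calling this
    on an unreviewed path runs that code with your privileges.
    """
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    spec.loader.exec_module(module)  # executes the file's code
    return module
```

This is the standard importlib recipe; an evaluator built on it can then call whatever `evaluate()`/`execute()` functions the loaded module exposes.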
Instruction Scope
SKILL.md explains CLI usage and mentions --llm-judge but does not disclose that the runtime can: (1) call Anthropic/OpenAI SDKs (which will read ANTHROPIC_/OPENAI_ env vars), (2) accept a configurable base_url that could proxy LLM calls to arbitrary endpoints, and (3) load and execute local Python Skill modules via importlib (RealSkillEvaluator). Those actions involve reading environment credentials, making network requests, and executing local code — none of which are explicitly documented as risks or required by the provided SKILL.md examples.
Install Mechanism
There is no install spec (instruction-only install), which limits automatic installation risk. But the code expects optional external Python SDKs (anthropic, openai) that are not declared as dependencies; missing dependency declarations reduce transparency. No external download URLs are used in install steps.
Credentials
Registry metadata declares no required environment variables, yet llm_judge reads/relies on ANTHROPIC_BASE_URL (and anthropic/ANTHROPIC_API_KEY by SDK) and OpenAI usage (OPENAI_API_KEY). The JudgeConfig supports a base_url that could redirect requests to arbitrary endpoints. Requesting no env vars while using cloud LLM SDKs is a mismatch and hides the need to provide API credentials — this is disproportionate and increases the risk of accidental credential use or data exfiltration if base_url is set to an untrusted proxy.
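Before running with a real judge backend, an operator can check which of the variables named above the SDKs would pick up from the environment. A small sketch (the variable names come from this report; the helper itself is illustrative):

```python
import os

# Env vars this report says the LLM-judge path can pick up.
SENSITIVE_VARS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_BASE_URL")

def credential_exposure(env=None) -> list[str]:
    """Return the subset of sensitive variables present in the environment."""
    env = os.environ if env is None else env
    return [name for name in SENSITIVE_VARS if env.get(name)]

if __name__ == "__main__":
    exposed = credential_exposure()
    if exposed:
        print("LLM-judge would see:", ", ".join(exposed))
    else:
        print("No LLM credentials in environment; --llm-judge mock is the safe default.")
```

Running this before scoring makes the mismatch between the declared metadata (no env vars) and the actual runtime behavior visible.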
Persistence & Privilege
The skill does not request always:true and does not modify other skills' configs. It does mutate sys.path at runtime to import sibling benchmark-store interfaces (giving it read access to other skill code shipped alongside it) and writes human-review receipts to disk. Those behaviors are plausible for a scoring tool but should be noted as they give the skill broader local visibility and file I/O capabilities.
What to consider before installing
This skill appears to implement the scoring functionality described, but there are several practical risks to consider before installing or running it:

  • LLM API usage is built-in but not declared: the code can call Anthropic/OpenAI SDKs and will rely on environment credentials (e.g., ANTHROPIC_API_KEY, OPENAI_API_KEY) and an optional base_url. The skill metadata claims no required env vars; treat that as inaccurate. If you don't want network calls or exposed keys, run with --llm-judge mock or ensure the environment keys are absent.
  • Arbitrary local code execution: RealSkillEvaluator can load Python modules from file paths and call evaluate()/execute(), which will run code on your machine. Do not point it at untrusted Skill packages or unreviewed file paths. Prefer running in an isolated environment or container.
  • Custom base_url risk: JudgeConfig/base_url or ANTHROPIC_BASE_URL can redirect LLM calls to arbitrary endpoints (a proxy). Only set base_url to trusted endpoints, and avoid exposing secrets to unknown proxies.
  • Review the code paths you will use: inspect interfaces/critic_engine.py (RealSkillEvaluator), interfaces/llm_judge.py (_call_claude/_call_openai/_call_mock), and any scripts/score.py to confirm how candidates are supplied and whether they can contain file-path execution directives.

Recommended mitigations: run the tool with --llm-judge mock for evaluation without keys, run inside a restricted container, audit candidates.json and any candidate execution_plan entries before scoring, and only provide API keys when necessary and to trusted endpoints. If you need the skill for automated pipelines, update the metadata to declare required env vars and document the RealSkillEvaluator execution model so operators can make an informed decision.
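The "audit candidates.json" mitigation can be partially automated. The schema of candidates.json is not documented in this listing, so the sketch below assumes a list of candidate dicts and scans any execution_plan field for path-like strings; treat field names and heuristics as assumptions to adapt after inspecting real input:

```python
import json

def suspicious_entries(candidates: list[dict]) -> list[tuple[str, str]]:
    """Flag execution_plan entries that look like file paths or scripts.

    Assumes candidates is a list of dicts; the 'id' and 'execution_plan'
    field names are guesses based on this report, not a documented schema.
    """
    flagged = []
    for cand in candidates:
        cid = str(cand.get("id", "?"))
        plan = cand.get("execution_plan", [])
        if isinstance(plan, dict):
            plan = list(plan.values())
        for step in plan:
            text = step if isinstance(step, str) else json.dumps(step)
            # Crude path heuristic: separators or a .py suffix.
            if "/" in text or text.endswith(".py"):
                flagged.append((cid, text))
    return flagged
```

Anything this flags deserves a manual read before the file is handed to scripts/score.py.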

Like a lobster shell, security has layers — review code before you run it.



SKILL.md

Improvement Discriminator

Multi-signal scoring engine: heuristic rules + evaluator rubrics + LLM-as-Judge + multi-reviewer blind panel.

When to Use

  • Score and rank improvement candidates
  • Run multi-reviewer blind panels (CONSENSUS/VERIFIED/DISPUTED cognitive labels)
  • Evaluate 4 semantic dimensions (clarity, specificity, consistency, safety) with LLM-as-Judge

When NOT to Use

  • Evaluating skill directory structure → use improvement-learner
  • keep/revert/reject decisions → use improvement-gate
  • Applying file changes → use improvement-executor

Scoring Modes

| Mode | Flag | Scoring |
|------|------|---------|
| Heuristic only | (default) | category bonus + source refs + risk penalty |
| + Evaluator | --use-evaluator-evidence | Heuristic 70% + evaluator 30% |
| + LLM Judge | --llm-judge {claude,openai,mock} | Heuristic 60% + LLM 40% |
| + Panel | --panel | 2+ reviewers score independently; cognitive label decides |
| All combined | --panel --llm-judge mock --use-evaluator-evidence | Full |
<example>
Correct usage: multi-reviewer blind panel + LLM semantic scoring

$ python3 scripts/score.py --input candidates.json --panel --llm-judge mock --output scored.json

→ Output includes:
  panel_reviews: [{reviewer: "structural", score: 7.5}, {reviewer: "conservative", score: 5.0}]
  cognitive_label: "VERIFIED" (two reviewers agree)
  llm_verdict: {score: 0.78, decision: "conditional", dimensions: {clarity: 0.85, ...}}
</example>

<anti-example>
Common misconception: --panel and --llm-judge are mutually exclusive
→ Wrong! They can be used together: each reviewer independently calls the LLM judge and produces an independent semantic score.
→ With --panel alone (no --llm-judge), the panel does heuristic scoring only, with no semantic evaluation.
</anti-example>

CLI

# Basic scoring
python3 scripts/score.py --input candidates.json --output scored.json

# Full pipeline: panel + LLM judge
python3 scripts/score.py \
  --input candidates.json --panel --llm-judge mock --output scored.json

Output Artifacts

| Request | Deliverable |
|---------|-------------|
| Score | JSON: per-candidate scores, blockers, recommendations, judge_notes |
| Panel | JSON: panel_reviews[], cognitive_label, aggregated_score |
| LLM judge | JSON: llm_verdict (score, decision, dimensions, confidence) |
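A consumer of scored.json can pull these fields into a quick per-candidate summary. The field names (cognitive_label, llm_verdict with score/decision) come from the table above; the top-level layout (a JSON list of candidate objects with an id) is an assumption of this sketch:

```python
import json

def summarize(scored_path: str) -> list[str]:
    """Return a one-line summary per candidate from a scored.json file.

    Assumes the file holds a JSON list of candidate dicts; 'id' is a
    guessed field name, the rest come from the Output Artifacts table.
    """
    with open(scored_path) as fh:
        scored = json.load(fh)
    lines = []
    for cand in scored:
        label = cand.get("cognitive_label", "-")
        verdict = cand.get("llm_verdict") or {}
        lines.append(f"{cand.get('id', '?')}: label={label} "
                     f"llm={verdict.get('score', 'n/a')} "
                     f"decision={verdict.get('decision', 'n/a')}")
    return lines
```
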

Related Skills

  • improvement-generator: Produces the candidates that this skill scores
  • improvement-gate: Consumes scored candidates for keep/revert/reject
  • improvement-learner: Structural evaluation (6-dim); discriminator focuses on semantic
  • benchmark-store: Frozen benchmarks for regression checking

Files

14 total
