Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

LLM Evaluator Pro

LLM-as-a-Judge evaluator via Langfuse. Scores traces on relevance, accuracy, hallucination, and helpfulness using GPT-5-nano as judge. Supports single trace...

MIT-0 · Free to use, modify, and redistribute. No attribution required.
1 · 518 · 0 current installs · 0 all-time installs
Security Scan
VirusTotal
Suspicious
OpenClaw
Suspicious
medium confidence
Purpose & Capability
The name and description match the code: it uses OpenRouter (a GPT judge) and Langfuse to score traces, and requesting OPENROUTER_API_KEY and the Langfuse keys is consistent with the described function. However, the code also contains hardcoded Langfuse keys and host values, which undermines the declared requirement model: the skill claims to require env vars but will fall back to embedded credentials.
Instruction Scope
SKILL.md instructs running the included Python script. The script, however, attempts to read ~/.openclaw/workspace/.env for the OpenRouter key (a config path not declared in the metadata) and uses hardcoded Langfuse credentials and host to call the Langfuse API. Reading an undeclared workspace .env can expose other secrets, and always posting scores to a hardcoded Langfuse endpoint (with embedded keys) could transmit data to an unexpected third-party account.
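To see why reading an undeclared .env file is a real exposure and not just a style issue, consider a minimal sketch of a naive .env parser (hypothetical; the skill's actual evaluator.py may parse differently). The point is that parsing a shared .env file loads every secret it contains, not just the one key the script needs:

```python
# Minimal sketch of a naive .env parser (hypothetical, for illustration).
# Parsing a shared .env file exposes EVERY secret in it, not only the
# single key (OPENROUTER_API_KEY) the script claims to need.
def parse_dotenv(text: str) -> dict:
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env

sample = 'OPENROUTER_API_KEY="sk-or-xxx"\nUNRELATED_DB_PASSWORD=hunter2\n'
secrets = parse_dotenv(sample)
# The script only needs OPENROUTER_API_KEY, but UNRELATED_DB_PASSWORD
# is now in the process's memory too -- everything in the file is read.
```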
Install Mechanism
There is no install spec. The skill includes a Python script but does not declare its Python package dependencies (requests, openai, langfuse), so the script may fail out of the box. The missing install step is a coherence/usability issue rather than malicious behavior, but it lowers installation auditability and increases risk, since it is unclear which packages users will end up installing to run it.
Credentials
The declared env vars (OPENROUTER_API_KEY, LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY) are appropriate for the stated purpose. However, the script: (1) sets default Langfuse keys in code, (2) hardcodes the LF_AUTH and LF_API values rather than reading them from the environment, and (3) attempts to parse ~/.openclaw/workspace/.env if OPENROUTER_API_KEY is not set. Together these mean the skill can operate on embedded credentials and read an undeclared local .env file, which is disproportionate and suspicious.
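The pattern the scan implicitly recommends is the opposite: read only the declared env vars, with no embedded defaults and no undeclared file fallback, and fail fast when something is missing. A minimal sketch (hypothetical code, not the skill's actual implementation):

```python
import os

# Sketch of env-only credential loading (hypothetical). No hardcoded
# defaults and no ~/.openclaw/workspace/.env fallback: if a declared
# variable is missing, fail loudly instead of using embedded keys.
REQUIRED_VARS = ("OPENROUTER_API_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")

def load_credentials() -> dict:
    missing = [v for v in REQUIRED_VARS if not os.environ.get(v)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")
    return {v: os.environ[v] for v in REQUIRED_VARS}
```

A script written this way is trivially auditable: the only credentials it can ever use are the ones the operator explicitly exported.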
Persistence & Privilege
The skill is not force-included (always=false) and does not request persistent platform privileges. It does not attempt to modify other skills or global agent configuration. Autonomy is enabled by default but is not an additional red flag here.
What to consider before installing
This skill largely does what its README says, but there are several red flags you should resolve before running it in a production environment:

1. The script contains hardcoded Langfuse API keys and a hardcoded Langfuse host and uses those values directly. That could send your trace data to (or allow the script to act using) somebody else's account. Treat the embedded keys as suspicious and do not rely on them.
2. The script will attempt to read ~/.openclaw/workspace/.env for an OPENROUTER_API_KEY if you don't set one in the environment; that file may contain unrelated secrets, and the skill metadata does not declare that config path.
3. Dependencies (requests, openai, langfuse) are not declared; running the script without knowing what will be installed is fragile.

Recommended actions before installing/using:

  • Inspect the evaluator.py file fully (remove or rotate any embedded keys).
  • Replace the hardcoded LF_AUTH/LF_API values with explicit env-based configuration and ensure the host points to a Langfuse instance you control.
  • Avoid running the script as-is on systems with sensitive ~/.openclaw/workspace/.env files; run it in an isolated test environment or container first.
  • If you need to trust this skill, ask the publisher for a version that reads credentials only from declared env vars (no defaults), documents the required Python packages, and documents exactly which endpoints will receive data.

If the publisher confirms the embedded keys are inert placeholders and the code is changed to respect environment values only, these concerns would be reduced.
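One way to follow the isolation advice above is to run the script in a throwaway container that sees only the credentials you explicitly pass and a read-only copy of the script (a sketch; the image name and mount paths are placeholders to adapt to your setup):

```shell
# Run the evaluator in an isolated container so it cannot read
# ~/.openclaw/workspace/.env or anything else on the host.
# Image name and mount path are placeholders -- adjust to your setup.
docker run --rm \
  -e OPENROUTER_API_KEY \
  -e LANGFUSE_PUBLIC_KEY \
  -e LANGFUSE_SECRET_KEY \
  -v "$PWD/scripts:/app:ro" \
  python:3.12-slim \
  python3 /app/evaluator.py test
```

Because only the three `-e` flags forward environment variables, any hidden fallback the script attempts inside the container finds nothing to read.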

Like a lobster shell, security has layers — review code before you run it.

Current version: v1.0.0
Tags: evaluation · latest · quality

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

Runtime requirements

Bins: python3
Env: OPENROUTER_API_KEY, LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY

SKILL.md

LLM Evaluator ⚖️

LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.

When to Use

  • Evaluating quality of search results or AI responses
  • Scoring traces for relevance, accuracy, hallucination detection
  • Batch scoring recent unscored traces
  • Quality assurance on agent outputs

Usage

# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test

# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score <trace_id>

# Score with specific evaluator only
python3 {baseDir}/scripts/evaluator.py score <trace_id> --evaluators relevance

# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20

Evaluators

| Evaluator     | Measures                      | Scale |
|---------------|-------------------------------|-------|
| relevance     | Response relevance to query   | 0–1   |
| accuracy      | Factual correctness           | 0–1   |
| hallucination | Made-up information detection | 0–1   |
| helpfulness   | Overall usefulness            | 0–1   |
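All four evaluators above return a score on a 0–1 scale. One fiddly step any LLM-as-a-judge pipeline needs is turning the judge model's free-text reply into that number; a minimal sketch of such a parser (hypothetical; the skill's actual prompt format and parsing may differ):

```python
import re

# Hypothetical helper: extract a 0-1 score from a judge model's reply.
# Judge models often answer "Score: 0.8" or embed the number in prose,
# so we take the first numeric token and clamp it to the declared scale.
def parse_judge_score(reply: str) -> float:
    match = re.search(r"(?<![\w.])(\d+(?:\.\d+)?)", reply)
    if match is None:
        raise ValueError(f"No numeric score in judge reply: {reply!r}")
    return max(0.0, min(1.0, float(match.group(1))))

parse_judge_score("Score: 0.85 -- the answer is relevant.")  # 0.85
parse_judge_score("Rating: 3")  # out-of-range reply, clamped to 1.0
```

Clamping matters because judge models occasionally ignore the requested scale; without it, a stray "8/10"-style answer would corrupt downstream aggregates.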

Credits

Built by M. Abidi | agxntsix.ai | YouTube | GitHub
Part of the AgxntSix Skill Suite for OpenClaw agents.

📅 Need help setting up OpenClaw for your business? Book a free consultation

Files

2 total
