Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

LLM Evaluator

LLM-as-a-Judge evaluation system using Langfuse. Score AI outputs on relevance, accuracy, hallucination, and helpfulness. Backfill scoring on historical traces.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
0 · 182 · 0 current installs · 0 all-time installs
Security Scan
VirusTotal
Suspicious
View report →
OpenClaw
Suspicious
medium confidence
⚠️ Purpose & Capability
Name/description (Langfuse + OpenRouter) matches the script's behavior: it evaluates traces and posts scores to Langfuse using an OpenRouter-backed judge model. However, the code embeds LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY values and a LANGFUSE_HOST that are not declared in requires.env or SKILL.md; shipping hardcoded service credentials is unexpected and disproportionate to the stated purpose.
⚠️ Instruction Scope
SKILL.md directs running the included Python script, which is expected. The script also attempts to read a user-local file (~/.openclaw/workspace/.env) to find OPENROUTER_API_KEY if the env var isn't set; this config-file access is not declared in requires.config_paths and is an additional data access surface that users should be aware of.
Install Mechanism
No install spec (instruction-only with an included script). That keeps install risk low — nothing is downloaded or executed automatically beyond running the bundled Python script.
⚠️ Credentials
The registry declares only OPENROUTER_API_KEY as required (which is appropriate). But the code embeds Langfuse public/secret keys and a Langfuse host URL; these are effectively credentials baked into the skill rather than requested from the environment. The script also makes network calls to Langfuse and OpenRouter, which is expected but worth noting.
Persistence & Privilege
The skill does not request always:true and does not request system-wide persistence. It runs network operations and writes scores to Langfuse, which is consistent with its purpose and not an unusual privilege level.
Scan Findings in Context
[hardcoded-langfuse-keys] unexpected: The script sets LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY via os.environ.setdefault and defines LF_AUTH with apparent secret/public keys. Shipping embedded service credentials is unexpected and increases risk (could be demo keys, but should be verified or removed).
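The pattern the scanner describes likely looks something like the sketch below (the placeholder values are hypothetical, not the actual embedded keys). The safer alternative is to require credentials from the environment and fail loudly when they are missing:

```python
import os

# Anti-pattern flagged by the scan: os.environ.setdefault() silently injects
# bundled credentials whenever the user has not set their own.
os.environ.setdefault("LANGFUSE_SECRET_KEY", "sk-lf-PLACEHOLDER")  # hypothetical value
os.environ.setdefault("LANGFUSE_PUBLIC_KEY", "pk-lf-PLACEHOLDER")  # hypothetical value
os.environ.setdefault("LANGFUSE_HOST", "https://cloud.langfuse.com")

# Safer alternative: require the value from the environment instead of
# shipping one inside the skill.
def require_env(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} must be set in the environment")
    return value
```

With this shape, a missing key is an immediate, visible error rather than a silent fallback to whatever the skill author baked in.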
[reads-user-env-file] unexpected: If OPENROUTER_API_KEY is missing from env, the script attempts to read ~/.openclaw/workspace/.env to extract the key. This file access is not declared in the skill metadata and may expose user-local secrets or be considered scope creep.
[uses-openrouter] expected: The skill uses OpenRouter (via OpenAI-compatible client) to call a judge model (JUDGE_MODEL = 'openai/gpt-5-nano'); this aligns with the declared purpose and required OPENROUTER_API_KEY.
What to consider before installing
Before installing or running this skill, inspect the included script (scripts/evaluator.py) yourself. Pay particular attention to:

  • The hardcoded LANGFUSE_SECRET_KEY/LANGFUSE_PUBLIC_KEY and LANGFUSE_HOST: verify they are not production secrets, and consider removing them or replacing them with environment-configured values.
  • The code path that reads ~/.openclaw/workspace/.env to obtain an OpenRouter key: make sure you are comfortable with that file being read, or set OPENROUTER_API_KEY explicitly instead.
  • The network endpoints (openrouter.ai and the Langfuse host): run in an isolated environment if you do not fully trust them.

If you plan to use this in production, rotate any exposed credentials, replace hardcoded keys with proper environment variables or configuration, and run the script in a sandbox while monitoring outbound network traffic. If you are unsure, ask the author to remove the embedded keys and to document any file reads and external endpoints explicitly.
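A quick pre-install audit of the script source could look like this hypothetical sketch (the pattern names and regexes are my own, keyed to the two scan findings):

```python
import re

# Regexes keyed to the two scan findings: embedded Langfuse credentials and
# the undeclared read of ~/.openclaw/workspace/.env.
SUSPECT_PATTERNS = {
    "embedded Langfuse credentials": re.compile(r"LANGFUSE_(SECRET|PUBLIC)_KEY"),
    "undeclared .env read": re.compile(r"\.openclaw/workspace/\.env"),
}

def audit(source: str) -> list[str]:
    """Return the names of any suspicious patterns present in the source."""
    return [name for name, pattern in SUSPECT_PATTERNS.items() if pattern.search(source)]
```

Run it over the contents of scripts/evaluator.py before executing the skill; an empty result does not prove the script is safe, but a non-empty one tells you exactly what to read first.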

Like a lobster shell, security has layers — review code before you run it.

Current version: v1.0.0
latest: vk97dkbkdwj5rj6se65bea0x1w582b0ww

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

Runtime requirements

⚖️ Clawdis
Env: OPENROUTER_API_KEY
Primary env: OPENROUTER_API_KEY

SKILL.md

LLM Evaluator ⚖️

LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.

When to Use

  • Evaluating quality of search results or AI responses
  • Scoring traces for relevance, accuracy, hallucination detection
  • Batch scoring recent unscored traces
  • Quality assurance on agent outputs

Usage

# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test

# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score <trace_id>

# Score with specific evaluator only
python3 {baseDir}/scripts/evaluator.py score <trace_id> --evaluators relevance

# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20

Evaluators

Evaluator      Measures                        Scale
relevance      Response relevance to query     0–1
accuracy       Factual correctness             0–1
hallucination  Made-up information detection   0–1
helpfulness    Overall usefulness              0–1
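For a sense of how a single dimension gets scored, a minimal LLM-as-a-judge round trip might look like the sketch below. The prompt wording and JSON reply format are assumptions; only the JUDGE_MODEL name comes from the scan report, and the actual network call is omitted:

```python
import json

JUDGE_MODEL = "openai/gpt-5-nano"  # judge model named in the scan report

def build_judge_prompt(evaluator: str, query: str, response: str) -> str:
    """Assemble a single-dimension scoring prompt (hypothetical template)."""
    return (
        f"You are an impartial evaluator. Rate the RESPONSE on '{evaluator}' "
        f"from 0 to 1.\n\nQUERY:\n{query}\n\nRESPONSE:\n{response}\n\n"
        'Reply with JSON only: {"score": <float>, "reasoning": "<one sentence>"}'
    )

def parse_judge_reply(raw: str) -> float:
    """Extract the score from the judge's JSON reply and clamp it to [0, 1]."""
    score = float(json.loads(raw)["score"])
    return max(0.0, min(1.0, score))
```

Clamping the parsed score guards against a judge model that returns a value outside the declared 0–1 scale.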

Credits

Built by M. Abidi | agxntsix.ai | YouTube | GitHub. Part of the AgxntSix Skill Suite for OpenClaw agents.

📅 Need help setting up OpenClaw for your business? Book a free consultation

Files

2 total
