Leyline Evaluation Framework

v1.0.0

Patterns for building evaluation and scoring systems, quality gates, rubrics, and decision frameworks. Use for any scored assessment.

Security Scan

  • VirusTotal: Benign
  • OpenClaw: Benign (high confidence)
Purpose & Capability
Name, description, and included modules (scoring-patterns, decision-thresholds) match: the skill is an authoring/reference framework for scoring and thresholds and does not request unrelated capabilities or credentials.
Instruction Scope
SKILL.md contains only patterns, examples, and pseudocode; it does not instruct the agent to read files, call external endpoints, or access secrets. Minor inconsistency: many examples include verification steps like 'Run the command with --help' or 'Run pytest -v', but no concrete command, tests, or dependencies are provided in the package—this appears to be template text for consumers rather than active instructions to access system resources.
Install Mechanism
No install spec or code files are present (instruction-only). Nothing is downloaded or written to disk by the skill itself.
Credentials
The skill declares no required environment variables, credentials, or config paths. The content contains no references to hidden tokens or unrelated service credentials.
Persistence & Privilege
Skill is not always-on and is user-invocable. It does not request elevated persistence or modify other skills or system settings.
Assessment
This skill is documentation-only: it provides templates and code examples for building scoring/threshold systems and does not require credentials or install anything. Before using it in an automated agent: (1) review and adapt the example commands/tests—the SKILL.md includes generic 'run --help' and 'pytest -v' checks but no bundled binaries or tests; (2) if you intend to automate decisions based on these thresholds, validate them with historical data and add veto/safety checks to avoid unwanted automated actions; (3) the agent may call this skill autonomously (normal default), so ensure decision rules and escalation paths are explicit and safe for your environment.


License: MIT-0

Night Market Skill — ported from claude-night-market/leyline. For the full experience with agents, hooks, and commands, install the Claude Code plugin.


Evaluation Framework

Overview

A generic framework for weighted scoring and threshold-based decision making. Provides reusable patterns for evaluating any artifact against configurable criteria with consistent scoring methodology.

This framework abstracts the common pattern of: define criteria → assign weights → score against criteria → apply thresholds → make decisions.

When To Use

  • Implementing quality gates or evaluation rubrics
  • Building scoring systems for artifacts, proposals, or submissions
  • Applying a consistent evaluation methodology across different domains
  • Automating threshold-based decision making
  • Creating assessment tools with weighted criteria

When NOT To Use

  • Simple pass/fail without scoring needs

Core Pattern

1. Define Criteria

criteria:
  - name: criterion_name
    weight: 0.30          # 30% of total score
    description: What this measures
    scoring_guide:
      90-100: Exceptional
      70-89: Strong
      50-69: Acceptable
      30-49: Weak
      0-29: Poor

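Before scoring, it is worth validating the criteria definition. A minimal sketch, assuming the YAML above is saved as criteria.yaml and PyYAML is available (neither the file nor the dependency is bundled with this skill):

import yaml  # PyYAML, an assumed dependency

with open("criteria.yaml") as f:
    config = yaml.safe_load(f)

weights = {c["name"]: c["weight"] for c in config["criteria"]}
total_weight = sum(weights.values())

# Weights must sum to 1.0 so the weighted total stays on the 0-100 scale.
assert abs(total_weight - 1.0) < 1e-9, f"weights sum to {total_weight}, expected 1.0"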

2. Score Each Criterion

scores = {
    "criterion_1": 85,  # Out of 100
    "criterion_2": 92,
    "criterion_3": 78,
}

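To report which scoring-guide band a raw score falls in, a small sketch; band_for is a hypothetical helper, and guide is an inline copy of the ranges from step 1:

def band_for(score, scoring_guide):
    """Return the label of the band containing score."""
    for rng, label in scoring_guide.items():
        low, high = (int(x) for x in rng.split("-"))
        if low <= score <= high:
            return label
    raise ValueError(f"score {score} not covered by any band")

guide = {"90-100": "Exceptional", "70-89": "Strong", "50-69": "Acceptable",
         "30-49": "Weak", "0-29": "Poor"}
print(band_for(85, guide))  # Strong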

3. Calculate Weighted Total

total = sum(score * weights[criterion] for criterion, score in scores.items())
# Example: (85 × 0.30) + (92 × 0.40) + (78 × 0.30) = 85.7

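The same calculation as a defensive function that refuses to run when scores and weights cover different criteria (weighted_total is an illustrative name):

def weighted_total(scores, weights):
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same criteria")
    return sum(score * weights[name] for name, score in scores.items())

weights = {"criterion_1": 0.30, "criterion_2": 0.40, "criterion_3": 0.30}
scores = {"criterion_1": 85, "criterion_2": 92, "criterion_3": 78}
print(weighted_total(scores, weights))  # 85.7, up to float rounding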

4. Apply Decision Thresholds

thresholds:
  80-100: Accept with priority
  60-79: Accept with conditions
  40-59: Review required
  20-39: Reject with feedback
  0-19: Reject

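A sketch of the threshold lookup, with the ranges above hard-coded as (lower bound, action) pairs; in practice they would come from your threshold config:

THRESHOLDS = [
    (80, "Accept with priority"),
    (60, "Accept with conditions"),
    (40, "Review required"),
    (20, "Reject with feedback"),
    (0, "Reject"),
]

def decide(total):
    # Pairs are ordered highest bound first, so the first match wins.
    for lower, action in THRESHOLDS:
        if total >= lower:
            return action
    raise ValueError(f"total {total} is below every threshold")

print(decide(85.7))  # Accept with priority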

Quick Start

Define Your Evaluation

  1. Identify criteria: What aspects matter for your domain?
  2. Assign weights: Which criteria are most important? (sum to 1.0)
  3. Create scoring guides: What does each score range mean?
  4. Set thresholds: What total scores trigger which decisions?

Example: Code Review Evaluation

criteria:
  correctness: {weight: 0.40, description: Does code work as intended?}
  maintainability: {weight: 0.25, description: Is it readable?}
  performance: {weight: 0.20, description: Meets performance needs?}
  testing: {weight: 0.15, description: Is it adequately tested?}

thresholds:
  85-100: Approve immediately
  70-84: Approve with minor feedback
  50-69: Request changes
  0-49: Reject, major issues

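Putting the example together as one self-contained sketch; the scores here are invented for illustration, not taken from a real review:

weights = {"correctness": 0.40, "maintainability": 0.25,
           "performance": 0.20, "testing": 0.15}
scores = {"correctness": 90, "maintainability": 75,
          "performance": 80, "testing": 60}  # hypothetical review scores

total = sum(scores[name] * weight for name, weight in weights.items())
# (90 × 0.40) + (75 × 0.25) + (80 × 0.20) + (60 × 0.15) = 79.75

if total >= 85:
    decision = "Approve immediately"
elif total >= 70:
    decision = "Approve with minor feedback"
elif total >= 50:
    decision = "Request changes"
else:
    decision = "Reject, major issues"

print(f"{total:.2f}: {decision}")  # 79.75: Approve with minor feedback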

Evaluation Workflow

1. Review artifact against each criterion
2. Assign 0-100 score for each criterion
3. Calculate: total = Σ(score × weight)
4. Compare total to thresholds
5. Take action based on threshold range


Common Use Cases

Quality Gates: Code review, PR approval, release readiness
Content Evaluation: Document quality, knowledge intake, skill assessment
Resource Allocation: Backlog prioritization, investment decisions, triage

Integration Pattern

# In your skill's frontmatter
dependencies: [leyline:evaluation-framework]


Then customize the framework for your domain:

  • Define domain-specific criteria
  • Set appropriate weights for your context
  • Establish meaningful thresholds
  • Document what each score range means

Detailed Resources

  • Scoring Patterns: See modules/scoring-patterns.md for detailed methodology
  • Decision Thresholds: See modules/decision-thresholds.md for threshold design

Exit Criteria

  • Criteria defined with clear descriptions
  • Weights assigned and sum to 1.0
  • Scoring guides documented for each criterion
  • Thresholds mapped to specific actions
  • Evaluation process documented and reproducible
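
Most of these criteria can be checked mechanically. A sketch (check_exit_criteria is an illustrative name) that verifies the weight and threshold items, using the code review example from the Quick Start:

def check_exit_criteria(weights, thresholds):
    # Weights assigned and sum to 1.0
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1.0"
    # Thresholds mapped to actions, tiling 0-100 with no gaps or overlaps
    ranges = sorted(tuple(map(int, r.split("-"))) for r in thresholds)
    assert ranges[0][0] == 0 and ranges[-1][1] == 100, "thresholds must span 0-100"
    for (_, hi), (lo, _) in zip(ranges, ranges[1:]):
        assert lo == hi + 1, f"gap or overlap between {hi} and {lo}"

check_exit_criteria(
    {"correctness": 0.40, "maintainability": 0.25,
     "performance": 0.20, "testing": 0.15},
    {"85-100": "Approve immediately", "70-84": "Approve with minor feedback",
     "50-69": "Request changes", "0-49": "Reject, major issues"},
)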

Troubleshooting

Common Issues

Command not found: Ensure all dependencies are installed and in PATH.

Permission errors: Check file permissions and run with appropriate privileges.

Unexpected behavior: Enable verbose logging with the --verbose flag.
