Session Feedback Analyzer

v1.0.1

Parse Claude Code session JSONL to extract implicit user feedback signals. Detects skill invocations (tool_use blocks with name="Skill" or /slash-commands),...

by_silhouette@lanyasheng

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for lanyasheng/session-feedback-analyzer.

Prompt preview: Install & Setup
Install the skill "Session Feedback Analyzer" (lanyasheng/session-feedback-analyzer) from ClawHub.
Skill page: https://clawhub.ai/lanyasheng/session-feedback-analyzer
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install session-feedback-analyzer

ClawHub CLI


npx clawhub@latest install session-feedback-analyzer
Security Scan

VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
Name/description match the behavior: the scripts parse ~/.claude/projects JSONL, detect skill invocations, classify user responses, and compute metrics. The requested resources (local session files) are proportional to the stated goal. Minor incoherence: the scripts import lib.common from a parent repo root (sys.path manipulation), but lib/common is not included in the bundle — this indicates the code expects to run inside a larger repo layout and may fail when run standalone.
Instruction Scope
SKILL.md instructs running python3 scripts/analyze.py against a session directory and to write feedback.jsonl. The instructions only reference local session files and the analyzer output; they do not send data externally or request unrelated system credentials. Privacy note: the analyzer reads user message text from session files (it offers --no-snippets to strip text), so running it will access potentially sensitive conversation content.
Install Mechanism
There is no install specification (instruction-only), so nothing will be downloaded or installed automatically. The code requires a Python runtime and expects to be executed from a repo layout that provides lib.common; that dependency is not bundled here and could cause runtime errors. No network downloads or external package installs are requested.
Credentials
The skill requires no environment variables, binaries, or external credentials. It accesses the filesystem (session JSONL under ~/.claude/projects by default), which is appropriate for a session-log analyzer but should be considered sensitive. The code may write output files (feedback.jsonl) to the current directory or a provided output path.
Persistence & Privilege
The skill does not request persistent/always-on privileges. Flags show always:false and no special persistence. It writes its own output file when run, which is expected behavior for a CLI analyzer.
Assessment
This skill appears to do what it says: it parses local Claude session JSONL files to emit feedback events and compute per-skill metrics. Before installing/running, consider:

  1. It reads conversation files under ~/.claude/projects (sensitive data); run with --no-snippets if you want to avoid storing message text.
  2. The scripts expect to import lib.common from a parent repo root that is not included in the package, so you may need to run the tool from the larger repo or supply the missing helper functions.
  3. Run the bundled tests (pytest) in a safe environment to ensure the script runs correctly.
  4. Inspect where output (feedback.jsonl) will be written and treat that file as sensitive.
  5. Because there is no install step, no external code is fetched, but exercise caution when running any script that reads local data.


License: MIT-0

Session Feedback Analyzer

Mines Claude Code session JSONL for implicit user feedback. When a user corrects, redoes, reverts, or partially accepts AI output after a skill invocation, that signals a skill gap. Outputs structured feedback.jsonl with per-event dimension attribution for the improvement pipeline.

When to Use

  • Compute per-skill correction rates to find which skills users correct most often.
  • Generate feedback.jsonl as input for improvement-generator's candidate prioritization.
  • Track correction trends over time (30-day rolling windows) to detect skill quality regression.
  • Identify hotspot dimensions (accuracy vs coverage vs trigger_quality vs efficiency) per skill.
  • Compare correction_rate before and after an improvement to validate whether a change actually helped.
  • Audit a single skill's feedback history with --skill-filter to understand why users reject its output.
  • Feed dimension hotspots into improvement-generator so candidates target the dimensions users care about.
  • Bootstrap the auto-improvement loop: analyzer output is the starting signal that tells the pipeline which skills need work.
  • Investigate spikes in correction_rate after a skill update to decide whether to rollback.

When NOT to Use

  • Synthetic task evaluation against a predefined task suite -- use improvement-evaluator instead.
  • Structural scoring of SKILL.md quality (knowledge_density, coverage, completeness) -- use improvement-learner instead.
  • Candidate multi-reviewer scoring with LLM judges -- use improvement-discriminator instead.
  • Gate/accept decisions on improvement candidates -- use improvement-gate instead.
  • Executing approved changes to skill files -- use improvement-executor instead.
  • Generating improvement candidates from scratch -- use improvement-generator instead.
  • Benchmark comparison against historical baselines -- use benchmark-store instead.
  • Orchestrating the full generate-score-evaluate-execute-gate pipeline -- use improvement-orchestrator instead.
  • Analyzing test runs or sub-agent sessions (these are filtered out automatically by iter_session_files).
<example name="find-worst-skills"> Run the analyzer against all sessions, then query metrics to find the three skills with the highest correction rate:
python3 scripts/analyze.py --session-dir ~/.claude/projects/ --output feedback-store/feedback.jsonl
from pathlib import Path
from scripts.metrics import load_feedback_events, compute_all_skill_metrics, format_metrics_report
events = load_feedback_events(Path("feedback-store/feedback.jsonl"))
report = format_metrics_report(compute_all_skill_metrics(events))
print(report)
# Skill Feedback Metrics
# ========================================
#   cpp-expert: correction_rate=0.40 (n=20, corrections=6, partials=4, acceptances=10)
#     hotspots: accuracy=5, coverage=3
#   deslop: correction_rate=0.15 (n=40, corrections=4, partials=4, acceptances=32)
#     hotspots: accuracy=3, efficiency=1

The output tells you cpp-expert has the highest correction rate (0.40) and its hotspot dimension is accuracy -- users most often correct naming/format issues. Feed this into improvement-generator with --source feedback-store/feedback.jsonl to generate candidates that prioritize accuracy fixes. </example>

<example name="single-skill-audit"> Analyze only the `deslop` skill and suppress user message snippets for privacy:
python3 scripts/analyze.py --skill-filter deslop --no-snippets --output feedback-store/deslop-feedback.jsonl

Then compute trend to check if recent changes improved the skill:

from pathlib import Path
from scripts.metrics import load_feedback_events, compute_correction_trend
events = load_feedback_events(Path("feedback-store/deslop-feedback.jsonl"))
trend = compute_correction_trend(events, "deslop")
print(trend)
# {'skill_id': 'deslop', 'trend': -0.12, 'recent_rate': 0.10, 'prior_rate': 0.22,
#  'recent_sample': 18, 'prior_sample': 22, 'direction': 'improving'}

A negative trend (-0.12) with direction "improving" means the last 30 days had fewer corrections than the prior 30 days. The improvement worked. </example>

<anti-example name="wrong-tool-for-task-suite"> Do NOT use session-feedback-analyzer to run synthetic evaluations. If you have a `task_suite.yaml` and want to measure execution pass rate, use improvement-evaluator instead:
# WRONG: session-feedback-analyzer does not execute tasks
python3 scripts/analyze.py --session-dir task_suite_results/  # meaningless

# RIGHT: use improvement-evaluator for synthetic task evaluation
python3 -m skills.improvement-evaluator.scripts.evaluate --task-suite task_suite.yaml
</anti-example> <anti-example name="wrong-tool-for-structural-scoring"> Do NOT use session-feedback-analyzer to score SKILL.md structure. The analyzer reads session JSONL (runtime user interactions), not SKILL.md files. For structural quality scoring (knowledge_density, coverage, completeness), use improvement-learner:
# WRONG: analyzer has no concept of SKILL.md structure
python3 scripts/analyze.py --session-dir ./skills/  # no JSONL files here

# RIGHT: use improvement-learner for SKILL.md structural scoring
python3 -m skills.improvement-learner.scripts.evaluate_skill --skill-dir skills/deslop/
</anti-example>

Why Implicit Feedback Matters

Problem: improvement-evaluator validates skill quality against a predefined task_suite.yaml, but a task suite only covers scenarios the author anticipated. Real users exercise a skill far more broadly than any task suite -- they use it for things the author never envisioned. When a user corrects AI output, that correction is implicit feedback pointing at where the skill falls short in real-world scenarios.

Because implicit feedback comes from real usage rather than synthetic tests, it surfaces problems a task suite can never catch. For example, a skill may pass its task suite at 100% while its user correction_rate sits at 40% -- a sign that the task suite's test cases are badly out of step with real needs. The analyzer's output (feedback.jsonl) feeds directly into improvement-generator, so generated candidates prioritize the dimensions users correct most often.

Tradeoff: the limitation of implicit feedback is signal noise. When a user says "不对" ("that's wrong"), they may be correcting the skill's output or correcting their own earlier instruction. The current keyword-based classifier cannot distinguish the two; the measured misclassification rate is about 8%. Precision could be improved with LLM-based classification (a small model deciding what the "wrong" refers to), but that would add latency and cost. The keyword heuristic was chosen because the 8% error rate is diluted by statistical aggregation in correction_rate -- a single misclassified event does not change the per-skill trend.

Why 3-Turn Influence Window

Tradeoff: a window that is too narrow (1 turn) misses delayed corrections -- the user sees the AI output, asks a follow-up question, and only on turn 3 says "that earlier one was wrong". A window that is too wide (5+ turns) introduces noise -- the user may already be discussing an entirely different topic, and a "wrong" there is not a correction of the earlier skill invocation. In measurements, a 3-turn window gave the best precision/recall balance: it captured 92% of real corrections with a misclassification rate of about 8%. By comparison, a 1-turn window captured only 71% of corrections, and a 5-turn window pushed the misclassification rate up to 18%.

The window boundary is additionally constrained by the next invocation: if a new skill invocation appears inside the 3-turn window, the current window is cut off at that new call. This prevents feedback aimed at the second skill from being misattributed to the first. In the code, the next_invocation_idx parameter of classify_outcome implements this truncation. In practice most windows span only 1-2 turns (users usually respond immediately); only about 15% of cases actually reach the third turn.

Priority rule within the window: when multiple turns in the window carry different signals (e.g. turn 1 says "可以" ("fine") but turn 2 says "但是命名不对" ("but the naming is wrong")), the classifier decides by the following priority: revert > redo > partial > correction > acceptance. This means that if any correction signal appears anywhere in the window, the final result is still correction or partial even when the first turn was an acceptance.
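
To make the mechanics concrete, here is a minimal sketch of the window truncation and priority rule. It is not the bundled classify_outcome: the keyword lists are abbreviated and revert detection (git commands in assistant tool_use) is left out.

# Illustrative sketch only -- not the bundled classify_outcome implementation.
REDO_WORDS = ("try again", "redo")
CORRECTION_WORDS = ("wrong", "incorrect")
QUALIFIER_WORDS = ("but", "however")
PRIORITY = ("redo", "partial", "correction", "acceptance")  # revert would rank first

def classify_window(user_turns, next_invocation_idx=None):
    # Truncate the window when a new skill invocation appears inside it, so
    # feedback aimed at the second skill is not attributed to the first.
    if next_invocation_idx is not None:
        user_turns = user_turns[:next_invocation_idx]
    signals = set()
    for text in user_turns[:3]:  # at most 3 user turns are inspected
        lower = text.lower()
        if any(w in lower for w in REDO_WORDS):
            signals.add("redo")
        elif any(w in lower for w in CORRECTION_WORDS):
            has_qualifier = any(q in lower for q in QUALIFIER_WORDS)
            signals.add("partial" if has_qualifier else "correction")
        else:
            signals.add("acceptance")
    # Any correction-like signal in the window outranks an earlier acceptance.
    for label in PRIORITY:
        if label in signals:
            return label
    return "acceptance"

classify_window(["looks good", "but the naming format is wrong"])  # -> "partial"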

CLI

# Basic: analyze all sessions, write to default output
python3 scripts/analyze.py

# Custom session directory and output
python3 scripts/analyze.py --session-dir ~/.claude/projects/ --output feedback-store/feedback.jsonl

# Privacy mode: strip user message snippets
python3 scripts/analyze.py --no-snippets

# Filter to a single skill
python3 scripts/analyze.py --skill-filter cpp-expert

# Require at least 10 invocations before computing metrics
python3 scripts/analyze.py --min-invocations 10

# Combine flags: audit one skill privately with high threshold
python3 scripts/analyze.py --skill-filter deslop --no-snippets --min-invocations 10 --output feedback-store/deslop-audit.jsonl

| Param | Default | Description |
| --- | --- | --- |
| --session-dir | ~/.claude/projects/ | Root directory containing session JSONL files |
| --output | feedback-store/feedback.jsonl | Output path for the feedback JSONL file |
| --no-snippets | off | Omit user message snippets from output (privacy mode) |
| --skill-filter | none | Only analyze invocations of this specific skill |
| --min-invocations | 5 | Minimum invocations before correction_rate is considered statistically meaningful |

Detection Rules

Skill detection is triggered in two ways:

Tool use detection: an assistant message contains a tool_use block with name == "Skill"; the skill_id is extracted from input.skill. This is the standard Claude Code skill invocation path.

Slash command detection: a system message with subtype == "local_command"; the skill name is extracted from the <command-name> tag. Built-in commands are excluded: help, clear, resume, compact, config.

| Path | Condition |
| --- | --- |
| Tool use | tool_use block, name == "Skill", skill_id from input.skill |
| Slash command | subtype == "local_command" + <command-name> tag |
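
For reference, the two detection paths could be sketched roughly as follows. Only the fields named above (name == "Skill", input.skill, subtype == "local_command", the <command-name> tag, and the built-in command list) come from the detection rules; the surrounding JSONL envelope fields are assumptions.

import json
import re

BUILTIN_COMMANDS = {"help", "clear", "resume", "compact", "config"}

def detect_skill_invocation(record):
    """Return a skill_id for one session record, or None (illustrative only)."""
    # Path 1: assistant message with a tool_use block named "Skill"
    if record.get("type") == "assistant":
        content = record.get("message", {}).get("content", [])
        if isinstance(content, list):
            for block in content:
                if block.get("type") == "tool_use" and block.get("name") == "Skill":
                    return block.get("input", {}).get("skill")
    # Path 2: system message for a local slash command, excluding built-ins
    if record.get("type") == "system" and record.get("subtype") == "local_command":
        m = re.search(r"<command-name>/?([\w-]+)</command-name>", record.get("content", ""))
        if m and m.group(1) not in BUILTIN_COMMANDS:
            return m.group(1)
    return None

for line in open("session.jsonl"):  # hypothetical session file
    skill_id = detect_skill_invocation(json.loads(line))
    if skill_id:
        print("invocation:", skill_id)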

Outcome Classification (3-turn influence window)

| Outcome | Type | Confidence | Trigger |
| --- | --- | --- | --- |
| correction | rejection | 0.9 | Keywords: "wrong", "incorrect", "no," (zh: "不对", "错了") |
| correction | revert | 0.9 | Git revert commands in assistant tool_use (git checkout/restore/reset) |
| correction | redo | 0.9 | Keywords: "try again", "redo" (zh: "重新来", "换个方案") |
| partial | partial | 0.7 | Qualifier ("but", "however", "但是") + correction or acceptance keyword |
| acceptance | explicit | 0.8 | Keywords: "lgtm", "looks good", "correct" (zh: "好", "可以", "对的") |
| acceptance | implicit | 0.6 | User message >20 chars, no question marks, no correction keywords |

Dimension Attribution

Each correction/partial gets a dimension_hint from keyword matching. When the user's correction message contains specific keywords, the correction event is attributed to the corresponding evaluation dimension. The attribution is passed to improvement-generator via feedback.jsonl, so the candidates it generates target the dimensions users correct most often. If keywords match multiple dimensions, the one with the highest confidence wins; if nothing matches, the hint is marked "unknown".

| Dimension | Keywords |
| --- | --- |
| accuracy | naming, format, style, typo, 命名, 格式, 拼写 |
| coverage | missing, forgot, incomplete, 缺少, 漏了 |
| reliability | again, inconsistent, 重复, 不稳定 |
| efficiency | slow, verbose, 太慢, 冗余 |
| security | security, secret, token, credential, 密钥 |
| trigger_quality | "wrong skill", "shouldn't trigger", "不该触发" (wrong skill invoked entirely; distinct from accuracy, which is correct skill, wrong output) |
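
A minimal sketch of this keyword attribution, with abbreviated keyword lists; the bundled analyzer may break ties by per-keyword confidence rather than by hit count.

# Illustrative sketch; keyword lists abbreviated from the table above.
DIMENSION_KEYWORDS = {
    "accuracy": ["naming", "format", "style", "typo", "命名", "格式", "拼写"],
    "coverage": ["missing", "forgot", "incomplete", "缺少", "漏了"],
    "efficiency": ["slow", "verbose", "太慢", "冗余"],
    "security": ["security", "secret", "token", "credential", "密钥"],
    "trigger_quality": ["wrong skill", "shouldn't trigger", "不该触发"],
}

def attribute_dimension(message: str) -> str:
    text = message.lower()
    # Count keyword hits per dimension; approximate the "highest confidence"
    # tie-break with the highest hit count.
    hits = {dim: sum(kw in text for kw in kws) for dim, kws in DIMENSION_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "unknown"

attribute_dimension("the naming and format are wrong")  # -> "accuracy"
attribute_dimension("nope, start over")                 # -> "unknown"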

correction_rate Formula

correction_rate = (corrections + 0.5 * partials) / total_invocations

Partials are weighted at 0.5: a partial acceptance means the skill's output was partly correct, which is lighter than a full correction but still signals room for improvement. When sample_size < --min-invocations (default 5), the result returns sufficient_data: false to avoid statistical noise from small samples.
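
As a rough sketch of the formula plus the sample-size gate: the returned fields mirror compute_correction_rate's documented output, but the body here is illustrative rather than the bundled code.

def correction_rate(corrections: int, partials: int, acceptances: int,
                    min_invocations: int = 5) -> dict:
    # correction_rate = (corrections + 0.5 * partials) / total_invocations
    total = corrections + partials + acceptances
    rate = (corrections + 0.5 * partials) / total if total else 0.0
    return {
        "correction_rate": round(rate, 2),
        "sample_size": total,
        "sufficient_data": total >= min_invocations,  # below --min-invocations -> False
        "corrections": corrections,
        "partials": partials,
        "acceptances": acceptances,
    }

correction_rate(5, 4, 11)  # -> correction_rate 0.35 over 20 invocations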

Trend computation: correction_rate over the last 30 days vs. the prior 30 days. Positive delta = worsening (correction rate rising). Negative delta = improving (correction rate falling). |delta| <= 0.05 = stable (the change is within statistical noise). autoloop-controller uses the trend to decide whether to keep iterating: two consecutive stable periods stop the loop.
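
The direction rule can be expressed compactly. The helper below is illustrative, assumes the two 30-day rates are already computed, and returns only a subset of compute_correction_trend's documented fields.

# Illustrative sketch of the trend rule; not the bundled compute_correction_trend.
def trend_direction(recent_rate: float, prior_rate: float) -> dict:
    delta = recent_rate - prior_rate
    if abs(delta) <= 0.05:
        direction = "stable"      # change within statistical noise
    elif delta > 0:
        direction = "worsening"   # correction rate rising
    else:
        direction = "improving"   # correction rate falling
    return {"trend": round(delta, 2), "direction": direction,
            "recent_rate": recent_rate, "prior_rate": prior_rate}

trend_direction(0.10, 0.22)  # -> {'trend': -0.12, 'direction': 'improving', ...}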

Output Artifacts

The primary output is feedback-store/feedback.jsonl -- one JSON object per line, one line per detected feedback event. Each event captures a single user reaction to a single skill invocation.

Schema (all fields present on every line):

{
  "event_id": "a1b2c3d4...",
  "timestamp": "2026-04-05T10:00:00Z",
  "session_id": "uuid",
  "skill_id": "cpp-expert",
  "invocation_uuid": "msg-uuid",
  "outcome": "correction",
  "confidence": 0.9,
  "correction_type": "rejection",
  "user_message_snippet": "not right, should use const ref...",
  "turns_to_feedback": 1,
  "ai_tools_used": ["Read", "Edit"],
  "dimension_hint": "accuracy"
}

| Field | Type | Description |
| --- | --- | --- |
| event_id | string | SHA-256 hash prefix (16 chars) of invocation_id:skill_id, guarantees deduplication |
| timestamp | ISO 8601 | When the skill was invoked (not when the user responded) |
| session_id | string | JSONL filename stem, identifies the Claude Code session |
| skill_id | string | Which skill was invoked (e.g. "cpp-expert", "deslop") |
| invocation_uuid | string | UUID of the assistant message that triggered the skill |
| outcome | enum | One of "correction", "partial", "acceptance" |
| confidence | float | 0.6-0.9, how confident the classifier is in the outcome label |
| correction_type | string? | "rejection", "revert", "redo", "partial", or null for acceptances |
| user_message_snippet | string | First 200 chars of the user's response (empty when --no-snippets) |
| turns_to_feedback | int | How many user turns after invocation the feedback appeared (1-3) |
| ai_tools_used | string[] | Tools the assistant called between invocation and user response |
| dimension_hint | string? | Attributed evaluation dimension ("accuracy", "coverage", etc.) or null |
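
For illustration, a deterministic event_id consistent with this schema could be derived as below; the exact key string is an assumption, but a SHA-256 prefix makes the id stable, so re-running the analyzer deduplicates instead of duplicating events.

import hashlib

def make_event_id(invocation_uuid: str, skill_id: str) -> str:
    # Illustrative: 16-char prefix of SHA-256 over "invocation:skill".
    return hashlib.sha256(f"{invocation_uuid}:{skill_id}".encode()).hexdigest()[:16]

make_event_id("msg-uuid", "cpp-expert")  # always the same 16-char id for this pair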

Secondary artifacts:

  • feedback-store/archive/feedback-YYYYMMDD.jsonl -- events older than 90 days, auto-archived by archive_old_events() (sketched after this list).
  • Console summary printed to stdout after each run: event count, outcome distribution, top-5 skills by invocation count.
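
A rough sketch of that 90-day archival; the bundled archive_old_events() may differ in file naming, locking, and cutoff handling.

import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

def archive_old_events(store: Path, archive_dir: Path, days: int = 90) -> None:
    """Illustrative only: move events older than `days` into a dated archive file."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    keep, old = [], []
    for line in store.read_text().splitlines():
        event = json.loads(line)
        ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
        (old if ts < cutoff else keep).append(line)
    if old:
        archive_dir.mkdir(parents=True, exist_ok=True)
        name = f"feedback-{datetime.now(timezone.utc):%Y%m%d}.jsonl"
        (archive_dir / name).write_text("\n".join(old) + "\n")
        store.write_text("\n".join(keep) + ("\n" if keep else ""))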

Metrics API (scripts/metrics.py)

from pathlib import Path
from scripts.metrics import (
    load_feedback_events,
    compute_correction_rate,
    compute_correction_trend,
    compute_hotspot_dimensions,
    compute_all_skill_metrics,
    format_metrics_report,
)

events = load_feedback_events(Path("feedback-store/feedback.jsonl"))

# Per-skill correction rate
compute_correction_rate(events, "cpp-expert")
# -> {"correction_rate": 0.35, "sample_size": 20, "sufficient_data": True,
#     "corrections": 5, "partials": 4, "acceptances": 11}

# Trend over rolling 30-day windows
compute_correction_trend(events, "cpp-expert")
# -> {"trend": -0.08, "direction": "improving", "recent_rate": 0.30,
#     "prior_rate": 0.38, "recent_sample": 12, "prior_sample": 8}

# Dimension hotspots (which dimensions get corrected most)
compute_hotspot_dimensions(events, "cpp-expert")
# -> {"accuracy": 5, "coverage": 3}

# All skills at once
all_metrics = compute_all_skill_metrics(events)
print(format_metrics_report(all_metrics))

Privacy Controls

--no-snippets strips user message snippets from feedback.jsonl output. ~/.claude/feedback-config.json with {"enabled": false} disables all collection. analyze.py checks this config at startup and exits immediately if collection is disabled.

Directories skipped automatically:

  • pytest/ -- test-generated sessions are not real user behavior
  • /tmp/ -- temporary sessions
  • /subagents/ -- sub-agent sessions do not reflect direct user intent

Auto-archives events >90 days old to feedback-store/archive/ to keep the active feedback store small and fast to query.
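
A minimal sketch of the opt-out check and directory filtering described above; analyze.py's actual startup logic may differ, but the config location and {"enabled": false} contract come from this section.

import json
import sys
from pathlib import Path

SKIP_MARKERS = ("pytest/", "/tmp/", "/subagents/")

def collection_enabled(config_path=Path.home() / ".claude" / "feedback-config.json") -> bool:
    # Missing config means collection stays enabled; {"enabled": false} disables it.
    if config_path.exists():
        return json.loads(config_path.read_text()).get("enabled", True)
    return True

def should_skip(session_path: str) -> bool:
    # Skip test, temporary, and sub-agent sessions.
    return any(marker in session_path for marker in SKIP_MARKERS)

if not collection_enabled():
    sys.exit("feedback collection disabled in ~/.claude/feedback-config.json")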

Related Skills

| Skill | Relationship | Data Flow |
| --- | --- | --- |
| improvement-generator | Primary consumer | Reads feedback.jsonl via --source; uses dimension hotspots to prioritize candidates |
| improvement-evaluator | Complementary | Synthetic evaluation (task_suite.yaml) covers designed scenarios; analyzer covers real usage |
| improvement-learner | Orthogonal | Learner scores SKILL.md document structure; analyzer scores user interactions at runtime |
| improvement-discriminator | Downstream | Discriminator scores candidates that were generated based on analyzer's feedback signals |
| autoloop-controller | Control loop | Uses correction_rate trend for plateau/convergence detection; stable trend = stop iterating |
| improvement-gate | Downstream | Gate validates changes; analyzer provides the "before" baseline that gate compares against |
| benchmark-store | Historical | Stores correction_rate snapshots for long-term Pareto front tracking |

Generator auto-discovers feedback-store/ via lib/common.py:load_source_paths(). Hotspots inform prioritization -- when one dimension's correction count is significantly higher than the others, the generator prioritizes candidates targeting that dimension. autoloop-controller uses a correction_rate plateau as its termination condition: two consecutive 30-day windows with a stable trend are treated as convergence.

Scripts & Tests Reference

| File | Purpose |
| --- | --- |
| scripts/analyze.py | Main analyzer: parses sessions, classifies outcomes, writes feedback.jsonl |
| scripts/metrics.py | Metrics library: correction_rate, trend, hotspots, report formatting |
| tests/test_analyze.py | 16 test cases covering invocation detection, outcome classification, deduplication, archival |
| tests/test_metrics.py | 11 test cases covering rate computation, trend direction, hotspot grouping, report format |

Run all tests:

cd skills/session-feedback-analyzer && python3 -m pytest tests/ -v
