Debugging Reinforcement Learning

v1.0.0

Tools and methods for controlling randomness, ensuring reproducibility, analyzing agent behavior, and debugging reward issues in stochastic reinforcement learning.

by Roamer 徐 (@roamer-remote)

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for roamer-remote/debugging-reinforcement-learning.

Prompt Preview: Install & Setup
Install the skill "Debugging Reinforcement Learning" (roamer-remote/debugging-reinforcement-learning) from ClawHub.
Skill page: https://clawhub.ai/roamer-remote/debugging-reinforcement-learning
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install debugging-reinforcement-learning

ClawHub CLI

npx clawhub@latest install debugging-reinforcement-learning
Security Scan

VirusTotal: Benign
OpenClaw: Benign (medium confidence)

Purpose & Capability
The skill claims to provide tools for controlling randomness, reproducibility, behavior analysis, and reward debugging, and the bundled index.js implements functions with those names (seedAll, createEpisodeRecorder, diffTrajectories, etc.). However, SKILL.md explicitly references Python ecosystems (Gym/Gymnasium, PyTorch/TensorFlow/JAX, torch.backends.cudnn), while the packaged implementation is JavaScript. That mismatch is a documentation/compatibility concern: the JS utilities cannot directly manipulate Python RNGs or torch backends without a bridging layer.

Instruction Scope
SKILL.md stays on-topic: it describes seed management, recording/replay, trajectory diffing, and reward debugging. It does not instruct reading arbitrary system files, contacting external endpoints, or accessing unrelated credentials. It does provide Python-specific tips (e.g., torch.backends.cudnn.deterministic = True), which are helpful guidance but imply that the user is expected to run these checks in a Python runtime; again, this is a compatibility note rather than scope creep or an exfiltration risk.

Install Mechanism
No install specification is provided (instruction-only skill with code files). That is low risk because nothing is downloaded or executed automatically by an installer. The package.json is minimal and lists no dependencies, and there are no install scripts shown.

Credentials
The skill does not request environment variables, credentials, or config paths. The functions operate on data you pass (trajectories, action probabilities, seeds) and do not reference or require external secrets.

Persistence & Privilege
Flags show always:false and normal model invocation. The skill does not request persistent agent-level privileges or modification of other skills' configurations.

Assessment
This skill appears to be a coherent toolkit for RL debugging and contains no obvious attempts to access secrets or the network. Two practical cautions:

  • SKILL.md repeatedly references Python libraries (Gym, PyTorch, TensorFlow, torch.backends.cudnn) while the shipped implementation is JavaScript. Confirm your agent/runtime can execute the provided JS utilities, or that you have a Python wrapper if you expect to operate on Python-based environments.
  • Review the remainder of index.js (the file was truncated in the manifest) to ensure there are no hidden network calls, file writes, or dynamic code-eval behaviors before running it in a sensitive environment.

If you need this to integrate with Python tooling, prefer a native Python implementation or add a well-audited bridging layer; if you only want the conceptual algorithms, you can port or call the relevant functions in a sandboxed environment first.

Like a lobster shell, security has layers — review code before you run it.

latest: vk97ahv1gvdxddzhc7ce853pmm1850z05
82 downloads
0 stars
1 version
Updated 1w ago
v1.0.0
MIT-0

Debugging Non-Deterministic Agent Behavior in Reinforcement Learning Environments

Overview

This skill provides a comprehensive toolkit for debugging reinforcement learning (RL) agents that exhibit non-deterministic behavior — one of the most challenging aspects of RL development. Non-determinism arises from environment stochasticity, policy randomness, seed mismanagement, and subtle numerical issues, making bugs notoriously hard to reproduce and diagnose.

Core Modules

1. Stochasticity Control

Strategies for controlling and isolating sources of randomness in RL pipelines:

  • Seed Management: Set and track seeds across all random sources (Python random, NumPy, PyTorch/TF, environment RNG, custom samplers); a sketch of this step follows the list.
  • Entropy Scheduling: Monitor and clamp policy entropy to detect exploration collapse or excessive randomness.
  • Action Distribution Inspection: Log full action distributions (not just sampled actions) to verify the policy is learning correctly.
  • Environment Stochasticity Toggle: Identify which environment transitions are stochastic vs. deterministic, and temporarily freeze stochastic dimensions for debugging.
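
The seed-management step above can be sketched in Python as follows. This is an illustrative stand-in, not the skill's bundled JavaScript seedAll; the function name and signature are assumptions, and the environment calls assume a Gymnasium-style API.

    import os
    import random

    import numpy as np
    import torch


    def seed_all(seed: int, env=None) -> None:
        """Seed every common RNG source in a typical RL pipeline."""
        random.seed(seed)                          # Python stdlib RNG
        np.random.seed(seed)                       # NumPy global RNG
        torch.manual_seed(seed)                    # PyTorch CPU and CUDA RNGs
        os.environ["PYTHONHASHSEED"] = str(seed)   # only effective if set before interpreter start
        torch.backends.cudnn.deterministic = True  # trade speed for reproducibility while debugging
        torch.backends.cudnn.benchmark = False
        if env is not None:
            env.reset(seed=seed)                   # Gymnasium-style environment RNG
            env.action_space.seed(seed)            # action-space sampling is seeded too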

2. Reproducibility Tools

Utilities for making RL experiments reproducible:

  • ReproWrapper: Wraps any env+agent pair to capture full episode trajectories (observations, actions, rewards, dones, seeds, RNG states).
  • Episode Replay: Replays a recorded episode step-by-step for comparison against expected behavior.
  • State Snapshot: Saves/restores complete training state (model weights, optimizer state, RNG state, env state).
  • Diff Replay: Compares two episode trajectories and highlights divergences with step-level granularity (see the sketch after this list).
  • Seed Cascade: Generates deterministic seed sequences for parallel workers to avoid seed collisions.
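
A framework-agnostic Python sketch of the recording and diffing ideas above. The bundled createEpisodeRecorder and diffTrajectories are JavaScript; the record layout below is an assumption made for illustration.

    import numpy as np


    def record_episode(env, policy, seed):
        """Run one episode and capture the data needed to compare runs later."""
        obs, info = env.reset(seed=seed)
        trajectory = []
        done = False
        while not done:
            action = policy(obs)                   # assumes a discrete (scalar) action
            next_obs, reward, terminated, truncated, info = env.step(action)
            trajectory.append({"obs": np.asarray(obs), "action": action, "reward": float(reward)})
            obs = next_obs
            done = terminated or truncated
        return trajectory


    def diff_trajectories(a, b, atol=1e-6):
        """Return the first step at which two recorded trajectories diverge, or None."""
        for t, (sa, sb) in enumerate(zip(a, b)):
            if sa["action"] != sb["action"]:
                return t, "action", sa["action"], sb["action"]
            if abs(sa["reward"] - sb["reward"]) > atol:
                return t, "reward", sa["reward"], sb["reward"]
            if not np.allclose(sa["obs"], sb["obs"], atol=atol):
                return t, "obs", None, None
        if len(a) != len(b):
            return min(len(a), len(b)), "length", len(a), len(b)
        return None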

3. Behavior Analysis

Techniques for understanding what the agent is actually doing:

  • Trajectory Clustering: Groups similar trajectories to identify behavioral modes (e.g., "agent always fails at corner cases").
  • Action Frequency Heatmap: Visualizes action distributions over state space regions.
  • Policy Consistency Check: Detects whether the same state produces different action distributions across episodes (a sign of state encoding bugs or hidden state leakage); see the sketch after this list.
  • Temporal Correlation Detector: Finds unintended correlations between consecutive actions that indicate the agent isn't respecting Markov assumptions.
  • Behavioral Mode Detection: Identifies distinct behavioral regimes the agent switches between (e.g., cautious vs. reckless).
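
A sketch of the policy-consistency idea: group logged action distributions by an exact hash of the observation and flag states whose distributions disagree across episodes. This is assumed to mirror what policyConsistencyCheck does; the data layout here is an assumption.

    from collections import defaultdict

    import numpy as np


    def policy_consistency_check(records, atol=1e-4):
        """records: iterable of (obs, action_probs) pairs gathered across episodes.

        If the policy is a pure function of the observation, every occurrence of
        the same observation should yield the same action distribution.
        """
        by_state = defaultdict(list)
        for obs, probs in records:
            key = np.asarray(obs).tobytes()        # exact-match hash of the observation
            by_state[key].append(np.asarray(probs))
        inconsistent = []
        for key, dists in by_state.items():
            reference = dists[0]
            if any(not np.allclose(d, reference, atol=atol) for d in dists[1:]):
                inconsistent.append(key)           # candidate for encoding or hidden-state bugs
        return inconsistent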

4. Reward Debugging

Methods for diagnosing reward-related issues:

  • Reward Decomposition: Breaks multi-component rewards into individual signals to identify which component drives behavior.
  • Reward Shaping Validator: Checks if shaped rewards accidentally create local optima or reward cycling.
  • Sparse Reward Tracer: For sparse-reward environments, logs the full trajectory leading up to reward events for analysis.
  • Reward Scale Analyzer: Detects reward scale mismatches between components that cause gradient domination; see the sketch after this list.
  • Episode Return Sanity Check: Verifies that discounted returns are computed correctly and that reward normalization isn't destroying the signal.
  • Reward Hacking Detector: Flags when the agent achieves high reward through unintended behavior (exploiting bugs in reward computation).
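
For reward decomposition and scale analysis, a small Python sketch assuming you already log each reward component per step; the dict layout and component names are hypothetical.

    import numpy as np


    def analyze_reward_scales(component_logs):
        """component_logs: dict mapping component name -> list of per-step rewards.

        Reports per-component magnitudes so a term that dominates the learning
        signal (e.g. a collision penalty 100x larger than a progress reward)
        is easy to spot.
        """
        stats = {}
        for name, values in component_logs.items():
            arr = np.asarray(values, dtype=np.float64)
            stats[name] = {
                "mean_abs": float(np.mean(np.abs(arr))),
                "std": float(arr.std()),
                "max_abs": float(np.max(np.abs(arr))),
            }
        largest = max(stats, key=lambda n: stats[n]["mean_abs"])
        smallest = min(stats, key=lambda n: stats[n]["mean_abs"])
        ratio = stats[largest]["mean_abs"] / max(stats[smallest]["mean_abs"], 1e-12)
        return stats, {"dominant": largest, "weakest": smallest, "scale_ratio": ratio}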

Usage Patterns

Quick Reproducibility Check

1. Set global seed via seedAll()
2. Run episode with EpisodeRecorder
3. Replay and compare
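
Put together with Gymnasium, the quick check can look like the sketch below; seed_all, record_episode, and diff_trajectories refer to the illustrative Python helpers sketched earlier, not to the skill's JavaScript exports.

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    policy = lambda obs: env.action_space.sample()   # stand-in for the real agent

    seed_all(1234, env=env)
    first = record_episode(env, policy, seed=1234)

    seed_all(1234, env=env)
    second = record_episode(env, policy, seed=1234)

    divergence = diff_trajectories(first, second)
    if divergence is None:
        print("episodes match step for step")
    else:
        step, field, got, expected = divergence
        print(f"first divergence at step {step} in field {field!r}")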

Diagnose Erratic Behavior

1. Run 50 episodes with fixed seeds
2. Cluster trajectories
3. Inspect divergent clusters
4. Use policyConsistencyCheck on divergent states
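
Step 2 can be sketched with fixed-length trajectory features and scikit-learn's KMeans; the featurization below is an assumption and should be adapted to your observation and action spaces.

    import numpy as np
    from sklearn.cluster import KMeans


    def featurize(trajectory):
        """Collapse a variable-length trajectory into a fixed-length feature vector."""
        rewards = np.array([step["reward"] for step in trajectory], dtype=np.float64)
        actions = np.array([step["action"] for step in trajectory], dtype=np.float64)
        return np.array([
            len(trajectory),      # episode length
            rewards.sum(),        # undiscounted return
            rewards.mean(),
            actions.mean(),       # crude statistic; replace for continuous action spaces
            actions.std(),
        ])


    def cluster_trajectories(trajectories, n_clusters=4, seed=0):
        """Group episodes into behavioral modes for manual inspection."""
        features = np.stack([featurize(t) for t in trajectories])
        labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(features)
        return labels, features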

Reward Signal Investigation

1. Decompose reward into components
2. Run rewardScaleAnalyzer
3. Check for hacking via rewardHackingDetector
4. Validate return computation
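
Step 4 amounts to recomputing the discounted return from raw (pre-normalization) rewards and comparing it against whatever your training code logged; a sketch, where logged_return is a placeholder for your own logging.

    import numpy as np


    def discounted_return(rewards, gamma=0.99):
        """Discounted return of one episode, computed from raw rewards."""
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g


    def check_return(trajectory, logged_return, gamma=0.99, rtol=1e-4):
        """Flag mismatches between the logged return and a recomputed one."""
        rewards = [step["reward"] for step in trajectory]
        recomputed = discounted_return(rewards, gamma)
        return bool(np.isclose(recomputed, logged_return, rtol=rtol)), recomputed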

Anti-Patterns to Watch For

  • Seed per episode but not per step: Environment internal RNG can diverge even with episode-level seeding.
  • Caching state without RNG: Replay buffers that store (s, a, r, s') without the RNG state cannot reproduce the exact transition; see the sketch after this list.
  • Floating point mode differences: GPU kernels can be non-deterministic (atomic reductions, cuDNN algorithm auto-tuning), and reduced-precision modes change numerics. Set torch.backends.cudnn.deterministic = True (and consider torch.use_deterministic_algorithms(True)) while debugging.
  • Hidden environment state: Some environments (e.g., Atari with frame-skipping) have internal state not exposed in the observation.
  • Reward normalization drift: Running mean/std normalization changes the effective reward over training, making early episodes non-reproducible.
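
For the caching-without-RNG anti-pattern above, a sketch of capturing and restoring the environment RNG state per transition. It assumes a Gymnasium environment whose np_random generator is its only source of stochasticity, and that the environment has already been restored to the matching state (e.g. from a snapshot) before replay.

    import copy


    def capture_transition(env, obs, action):
        """Step the env, keeping the RNG state needed to reproduce this exact transition."""
        rng_state = copy.deepcopy(env.unwrapped.np_random.bit_generator.state)
        next_obs, reward, terminated, truncated, info = env.step(action)
        return {
            "obs": obs,
            "action": action,
            "reward": reward,
            "next_obs": next_obs,
            "rng_state": rng_state,
        }


    def replay_transition(env, transition):
        """Restore the saved RNG state, then re-issue the action to reproduce the step."""
        env.unwrapped.np_random.bit_generator.state = transition["rng_state"]
        return env.step(transition["action"])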

Integration Tips

  • Works with Gym/Gymnasium, PettingZoo, and custom env wrappers.
  • Compatible with PyTorch, TensorFlow, and JAX-based agents.
  • Output formats: JSON trajectories, CSV logs, and structured debug reports.
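
As a sketch of the output side, a recorded episode (the list-of-dicts layout used in the sketches above) can be written to the stated formats with the standard library; file names are arbitrary.

    import csv
    import json

    import numpy as np


    def save_trajectory(trajectory, json_path="episode.json", csv_path="episode.csv"):
        """Dump one recorded episode as a JSON trajectory plus a flat CSV log."""
        rows = [
            {
                "step": t,
                "obs": np.asarray(s["obs"]).tolist(),
                "action": int(s["action"]),          # assumes discrete actions
                "reward": float(s["reward"]),
            }
            for t, s in enumerate(trajectory)
        ]
        with open(json_path, "w") as f:
            json.dump(rows, f, indent=2)
        with open(csv_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["step", "action", "reward"])
            writer.writeheader()
            for row in rows:
                writer.writerow({"step": row["step"], "action": row["action"], "reward": row["reward"]})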
