Install
openclaw skills install self-improving-science

Log learnings, experiment issues, and methodology corrections to markdown files for continuous improvement in scientific research, data science, and ML/AI experimentation. Important findings get promoted to experiment checklists, data governance docs, model cards, and methodology standards.
Before logging anything, ensure the .learnings/ directory and files exist in the project or workspace root. If any are missing, create them:
mkdir -p .learnings
[ -f .learnings/LEARNINGS.md ] || printf "# Learnings\n\nMethodology insights, statistical corrections, and knowledge gaps captured during research.\n\n**Categories**: methodology_flaw | data_quality | reproducibility_issue | statistical_error | hypothesis_revision | experiment_design\n\n---\n" > .learnings/LEARNINGS.md
[ -f .learnings/EXPERIMENT_ISSUES.md ] || printf "# Experiment Issues\n\nFailed experiments, data quality problems, and reproducibility failures.\n\n---\n" > .learnings/EXPERIMENT_ISSUES.md
[ -f .learnings/FEATURE_REQUESTS.md ] || printf "# Feature Requests\n\nResearch tooling and ML pipeline capabilities requested by the user.\n\n---\n" > .learnings/FEATURE_REQUESTS.md
Never overwrite existing files. This is a no-op if .learnings/ is already initialised.
Do not log proprietary datasets, patient identifiers, API keys, or raw data samples unless the user explicitly asks. Prefer summary statistics and redacted excerpts over full data dumps.
If you want automatic reminders or setup assistance, use the opt-in hook workflow described in Hook Integration.
| Situation | Action |
|---|---|
| Data leakage found in pipeline | Log to .learnings/EXPERIMENT_ISSUES.md with data_quality |
| Model fails to reproduce | Log to .learnings/EXPERIMENT_ISSUES.md with reproducibility_issue |
| Statistical test misapplied | Log to .learnings/LEARNINGS.md with statistical_error |
| Hypothesis test fails | Log to .learnings/LEARNINGS.md with hypothesis_revision |
| Methodology flaw discovered | Log to .learnings/LEARNINGS.md with methodology_flaw |
| Experiment design improvement | Log to .learnings/LEARNINGS.md with experiment_design |
| Feature distribution shift | Log to .learnings/EXPERIMENT_ISSUES.md with data_quality |
| User wants missing ML tool | Log to .learnings/FEATURE_REQUESTS.md |
| NaN loss or training divergence | Log to .learnings/EXPERIMENT_ISSUES.md |
| Missing data pattern discovered | Log to .learnings/LEARNINGS.md with data_quality |
| Similar to existing entry | Link with **See Also**, consider priority bump |
| Broadly applicable finding | Promote to experiment checklist, model card, or methodology standard |
| Data governance insight | Promote to data governance docs |
| Model behavior documentation | Promote to model card |
| Pipeline best practice | Promote to AGENTS.md (OpenClaw workspace) |
OpenClaw is the primary platform for this skill. It uses workspace-based prompt injection with automatic skill loading.
Via ClawdHub (recommended):
clawdhub install self-improving-science
Manual:
git clone https://github.com/jose-compu/self-improving-science.git ~/.openclaw/skills/self-improving-science
OpenClaw injects these files into every session:
~/.openclaw/workspace/
├── AGENTS.md # Multi-agent workflows, experiment orchestration
├── SOUL.md # Research principles, scientific rigor guidelines
├── TOOLS.md # ML framework gotchas, data tool capabilities
├── MEMORY.md # Long-term memory (main session only)
├── memory/ # Daily memory files
│ └── YYYY-MM-DD.md
└── .learnings/ # This skill's log files
├── LEARNINGS.md
├── EXPERIMENT_ISSUES.md
└── FEATURE_REQUESTS.md
mkdir -p ~/.openclaw/workspace/.learnings
Then create the log files (or copy from assets/):
- LEARNINGS.md — methodology corrections, statistical insights, experiment design lessons
- EXPERIMENT_ISSUES.md — data quality failures, reproducibility problems, model drift events
- FEATURE_REQUESTS.md — requested research tooling and pipeline capabilities

When learnings prove broadly applicable, promote them to research artifacts:
| Learning Type | Promote To | Example |
|---|---|---|
| Experiment design patterns | Experiment Checklist | "Always check class balance before training" |
| Data handling rules | Data Governance Docs | "PII must be hashed before feature extraction" |
| Model documentation | Model Card | "Model degrades on inputs > 512 tokens" |
| Pipeline best practices | AGENTS.md | "Run distribution check before retraining" |
| ML framework gotchas | TOOLS.md | "PyTorch DataLoader workers leak memory on macOS" |
| Research communication | SOUL.md | "Report confidence intervals, not just point estimates" |
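A promotion can be as small as appending the distilled rule to the target file. A minimal sketch, reusing the TOOLS.md example from the table (the heading and workspace path are illustrative):

# Sketch: promote an ML framework gotcha into workspace-level TOOLS.md.
cat >> ~/.openclaw/workspace/TOOLS.md <<'EOF'

## PyTorch
- DataLoader workers leak memory on macOS
EOF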
OpenClaw provides tools to share learnings across sessions.
Use these only in trusted environments and only when the user explicitly wants cross-session sharing. Prefer sending summary statistics and methodology notes, not raw datasets or credentials.
For automatic reminders at session start:
cp -r hooks/openclaw ~/.openclaw/hooks/self-improving-science
openclaw hooks enable self-improving-science
See references/openclaw-integration.md for complete details.
For Claude Code, Codex, Copilot, or other agents, create .learnings/ in the project or workspace root:
mkdir -p .learnings
Create the files inline using the headers shown above. Avoid reading templates from the current repo or workspace unless you explicitly trust that path.
When experiment issues or methodology corrections occur:
- .learnings/EXPERIMENT_ISSUES.md, LEARNINGS.md, or FEATURE_REQUESTS.md — this skill's logs
- CLAUDE.md or AGENTS.md — project-level conventions

Append to .learnings/LEARNINGS.md:
## [LRN-YYYYMMDD-XXX] category
**Logged**: ISO-8601 timestamp
**Priority**: low | medium | high | critical
**Status**: pending
**Area**: data_collection | preprocessing | analysis | modeling | validation | publication
### Summary
One-line description of what was learned
### Details
Full context: what happened, what was wrong, what's correct.
Include relevant metrics, sample sizes, or statistical values.
### Suggested Action
Specific fix or improvement to make
### Metadata
- Source: experiment | peer_review | user_feedback | analysis
- Related Files: path/to/notebook.ipynb, path/to/data.csv
- Tags: tag1, tag2
- See Also: LRN-20260101-001 (if related to existing entry)
- Dataset: dataset_name (optional)
- Model: model_name_or_version (optional)
- Metric-Before: 0.85 (optional)
- Metric-After: 0.91 (optional)
- Pattern-Key: leakage.timestamp | stats.normality_assumption (optional)
- Recurrence-Count: 1 (optional)
- First-Seen: 2026-01-15 (optional)
- Last-Seen: 2026-01-15 (optional)
---
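A minimal sketch of appending a filled-in entry from the shell; the ID, timestamp, file path, and content below are illustrative (they mirror the t-test example later in this document):

# Sketch: append a statistical_error learning matching the template above.
cat >> .learnings/LEARNINGS.md <<'EOF'
## [LRN-20260412-001] statistical_error
**Logged**: 2026-04-12T10:30:00Z
**Priority**: high
**Status**: pending
**Area**: analysis

### Summary
t-test applied to heavily skewed data

### Details
Normality assumption violated; Mann-Whitney U is the appropriate test here.

### Suggested Action
Add a normality check before selecting parametric tests.

### Metadata
- Source: analysis
- Related Files: notebooks/revenue_test.ipynb
- Tags: statistics, normality
---
EOF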
Append to .learnings/EXPERIMENT_ISSUES.md:
## [EXP-YYYYMMDD-XXX] category
**Logged**: ISO-8601 timestamp
**Priority**: high
**Status**: pending
**Area**: data_collection | preprocessing | analysis | modeling | validation | publication
### Summary
Brief description of what failed or went wrong
### Error
Actual error message, unexpected metric, or reproducibility delta
### Context
- Experiment/notebook attempted
- Dataset and split used
- Model architecture and hyperparameters (if relevant)
- Hardware/environment details
- Summary of relevant output (avoid full data dumps)
### Root Cause
If identifiable, what caused the issue
### Suggested Fix
How to prevent or resolve this
### Metadata
- Reproducible: yes | no | unknown
- Related Files: path/to/notebook.ipynb
- Seeds Tested: 42, 123, 7 (if reproducibility issue)
- See Also: EXP-20260101-001 (if recurring)
---
Append to .learnings/FEATURE_REQUESTS.md:
## [FEAT-YYYYMMDD-XXX] capability_name
**Logged**: ISO-8601 timestamp
**Priority**: medium
**Status**: pending
**Area**: data_collection | preprocessing | analysis | modeling | validation | publication
### Requested Capability
What the user wanted to do
### Research Context
Why they need it — what experiment, analysis, or pipeline step it supports
### Complexity Estimate
simple | medium | complex
### Suggested Implementation
How this could be built, what libraries or tools it might use
### Metadata
- Frequency: first_time | recurring
- Related Features: existing_pipeline_step
---
Format: TYPE-YYYYMMDD-XXX
- TYPE: LRN (learning), EXP (experiment issue), FEAT (feature request)
- XXX: unique suffix (e.g., 001, A7B)

Examples: LRN-20260412-001, EXP-20260412-A3F, FEAT-20260412-002
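If you script entry creation, the next suffix can be derived by counting today's entries. A minimal sketch (new_id is a hypothetical helper, not part of the skill):

# Hypothetical helper: generate the next sequential ID for a given type.
new_id() {
  local type="$1" today count
  today="$(date +%Y%m%d)"
  # Count today's entries of this type across all log files.
  count="$(grep -h "^## \[${type}-${today}-" .learnings/*.md 2>/dev/null | wc -l)"
  printf '%s-%s-%03d\n' "$type" "$today" "$((count + 1))"
}

new_id LRN   # e.g., LRN-20260412-001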
When an issue is fixed, update the entry:
**Status**: pending → **Status**: resolved

### Resolution
- **Resolved**: 2026-04-13T09:00:00Z
- **Commit/PR**: abc123 or #42
- **Experiment-Run**: run_id_or_notebook_version
- **Notes**: Brief description of what was done
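Status updates can be scripted too. A hedged awk sketch that flips one entry's status (the entry ID is illustrative; review the result before trusting it):

# Sketch: mark a single entry resolved without touching other pending entries.
awk -v id="EXP-20260412-001" '
  /^## \[/ { inside = (index($0, "[" id "]") > 0) }   # track the target entry
  inside && /^\*\*Status\*\*:/ { sub(/pending/, "resolved") }
  { print }
' .learnings/EXPERIMENT_ISSUES.md > tmp.md && mv tmp.md .learnings/EXPERIMENT_ISSUES.md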
Other status values:
- in_progress — Actively being investigated
- wont_fix — Decided not to address (add reason in Resolution notes)
- promoted — Elevated to experiment checklist, model card, or methodology standard
- promoted_to_skill — Extracted as a reusable skill

When a learning is broadly applicable (not a one-off fix), promote it to permanent research memory.
| Target | What Belongs There |
|---|---|
| Experiment Checklist | Pre-run validation: data checks, split verification, seed logging |
| Model Card | Known limitations, performance bounds, failure modes, training data description |
| Data Governance Docs | PII handling, data quality gates, provenance requirements |
| Methodology Standards | Statistical test selection, sample size requirements, reporting conventions |
| CLAUDE.md / AGENTS.md | Project-level facts, pipeline conventions, automation rules |
| TOOLS.md | ML framework gotchas, library version constraints (OpenClaw) |
| SOUL.md | Research communication style, rigor principles (OpenClaw) |
When promoting, update the entry:

**Status**: pending → **Status**: promoted
**Promoted**: experiment-checklist.md (or target doc)

Learning (verbose):
Used t-test on highly skewed revenue data. User pointed out normality assumption was violated. Switched to Mann-Whitney U test. P-value changed from 0.03 to 0.12 — original conclusion was invalid.
In Methodology Standards (concise):
## Statistical Test Selection
- Check normality (Shapiro-Wilk) before parametric tests
- Skewed data → use non-parametric alternatives (Mann-Whitney U, Kruskal-Wallis)
- Report both test choice rationale and assumption checks
Learning (verbose):
Timestamp feature in training data was leaking the target. Model had 0.99 AUC in validation but 0.52 in production. The timestamp encoded when the label was assigned, not when the event occurred.
In Experiment Checklist (actionable):
## Pre-Training Checks
- [ ] Verify no temporal leakage: features must predate the label event
- [ ] Check feature-target correlation for suspiciously high values (>0.95)
- [ ] Validate that train/test split respects time ordering if data is temporal
If logging something similar to an existing entry:
grep -r "keyword" .learnings/**See Also**: EXP-20260101-001 in MetadataAutomatically log when you notice:
- Data Quality Issues → experiment issue with data_quality
- Statistical Errors → learning with statistical_error
- Methodology Flaws → learning with methodology_flaw
- Reproducibility Issues → experiment issue with reproducibility_issue
- Hypothesis Revisions → learning with hypothesis_revision
- Model/Training Errors → experiment issue
- Feature Requests → feature request
| Priority | When to Use | Example |
|---|---|---|
| critical | Data leakage in production model, results published with error | Target leakage shipped to production scoring |
| high | Irreproducible published result, major statistical error | T-test on non-normal data changing conclusion |
| medium | Methodology improvement, better experiment design | Adding stratified splitting to pipeline |
| low | Documentation of approach, minor analysis note | Noting which random seed was used |
Use to filter learnings by research phase:
| Area | Scope |
|---|---|
| data_collection | Surveys, scraping, APIs, sensor data, database queries |
| preprocessing | Cleaning, imputation, encoding, normalization, feature engineering |
| analysis | EDA, statistical tests, hypothesis testing, visualization |
| modeling | Model selection, training, hyperparameter tuning, architecture |
| validation | Cross-validation, holdout testing, A/B tests, model evaluation |
| publication | Reports, papers, model cards, dashboards, presentations |
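Area makes phase-scoped review easy. For example, to list the headers of modeling-related entries (assuming one area value per logged entry, as the templates imply):

# List entry headers for a given research phase.
grep -h -B4 "^\*\*Area\*\*: modeling" .learnings/*.md | grep "^## \["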
Keep learnings local (per-researcher):
.learnings/
This repo uses that default to avoid committing sensitive data or noisy local logs.
Track learnings in repo (team-wide): Don't add to .gitignore — learnings become shared research knowledge.
Hybrid (track templates, ignore entries):
.learnings/*.md
!.learnings/.gitkeep
Enable automatic reminders through agent hooks. This is opt-in — you must explicitly configure hooks.
Create .claude/settings.json in your project:
{
"hooks": {
"UserPromptSubmit": [{
"matcher": "",
"hooks": [{
"type": "command",
"command": "./skills/self-improving-science/scripts/activator.sh"
}]
}]
}
}
This injects a science-specific learning evaluation reminder after each prompt (~60-120 tokens overhead).
With error detection (adds PostToolUse):

{
"hooks": {
"UserPromptSubmit": [{
"matcher": "",
"hooks": [{
"type": "command",
"command": "./skills/self-improving-science/scripts/activator.sh"
}]
}],
"PostToolUse": [{
"matcher": "Bash",
"hooks": [{
"type": "command",
"command": "./skills/self-improving-science/scripts/error-detector.sh"
}]
}]
}
}
Enable PostToolUse only if you want error-pattern reminders from ML training output and data pipeline commands.
| Script | Hook Type | Purpose |
|---|---|---|
| scripts/activator.sh | UserPromptSubmit | Reminds to evaluate experiment learnings |
| scripts/error-detector.sh | PostToolUse (Bash) | Triggers on ML/data errors |
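For reference, a UserPromptSubmit activator can be as simple as printing a reminder to stdout, which the agent receives as injected context. A minimal sketch of the idea (not the actual scripts/activator.sh, which may differ):

#!/usr/bin/env bash
# Sketch: remind the agent to evaluate the exchange for loggable learnings.
echo "Reminder: check this exchange for experiment issues or methodology learnings (.learnings/)."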
See references/hooks-setup.md for detailed configuration and troubleshooting.
When a learning is valuable enough to become a reusable skill, extract it:
./skills/self-improving-science/scripts/extract-skill.sh skill-name --dry-run
./skills/self-improving-science/scripts/extract-skill.sh skill-name
Extraction criteria — any of: recurring (2+ See Also links), verified (resolved status), non-obvious (required investigation), broadly applicable, or user-flagged.
After extraction: set status to promoted_to_skill, add Skill-Path, verify in fresh session.
Review .learnings/ before new experiments, after training runs, and before publication.
grep -h "Status\*\*: pending" .learnings/*.md | wc -l
grep -B5 "Priority\*\*: high" .learnings/*.md | grep "^## \["
| Agent | Activation | Setup |
|---|---|---|
| Claude Code / Codex | Hooks (UserPromptSubmit, PostToolUse) | .claude/settings.json |
| GitHub Copilot | Manual | .github/copilot-instructions.md |
| OpenClaw | Workspace injection | See OpenClaw Setup above |
Apply self-improvement when you: discover data leakage, get irreproducible results, misapply a statistical test, find methodology flaws, hit training errors, or learn dataset quirks.
This skill is standalone-compatible and stackable with other self-improving skills.
- .learnings/science/ — this skill's log directory
- .learnings/INDEX.md — shared index

Every new entry must include:
**Skill**: science
Reminder dedup key: event + matcher + file + 5m_window; max 1 reminder per skill every 5 minutes.

Only trigger this skill automatically for science signals such as:
- experiment|hypothesis|p-value|confidence interval|reproducibility
- dataset shift|data leakage|methodology flaw|benchmark drift

When guidance conflicts, apply:
1. security
2. engineering
3. coding
4. ai
5. meta as tie-breaker

This skill logs to .learnings/science/ in stackable mode.