TinkerClaw Memory Bench

Memory SystemAgent Memory

Be one of the first to benchmark your agent's memory — and help shape how AI remembers. Runs a peer-review-grade evaluation suite (LLM-as-judge, nDCG/MAP/MRR with 95% CIs, ablation studies) against your live memory system and submits anonymized results to the ENGRAM/CORTEX research papers. Your data stays private; only aggregate stats leave. Works with agent-memory-ultimate. For the bold few who believe AI memory should be measured, not guessed at. Built for the TinkerClaw fork — github.com/globalcaos/tinkerclaw.

Install

openclaw skills install @globalcaos/memory-bench-pioneer

Memory Bench

Collect, assess, and submit anonymized memory system statistics for the ENGRAM and CORTEX research papers.

Three-Step Pipeline

1. Assess Retrieval Quality

Run the standard test set (30 queries across 4 types × 3 difficulty levels) with LLM-as-judge:

# Full assessment with GPT-4o-mini judge + ablation (recommended)
python3 scripts/rate.py --queries 30 --judge openai --ablation

# Without OpenAI key: local embedding judge (weaker, marked in output)
python3 scripts/rate.py --queries 30 --judge local --ablation

# Custom test set
python3 scripts/rate.py --testset path/to/queries.json --judge openai

What it measures:

RAR (Recall Accuracy Ratio), MRR (Mean Reciprocal Rank)
nDCG@5, MAP@5, Precision@5, Hit Rate
All metrics include 95% bootstrap confidence intervals
Ablation: runs with AND without spreading activation to isolate its contribution

Judge methods:

openai — GPT-4o-mini rates each (query, result) pair 1-5. Independent from retrieval system. ~$0.01 per run.
local — Embedding cosine similarity. Weaker, marked as such in output. Zero cost.

Standard test set (scripts/testset.json): 30 queries stratified across semantic/episodic/procedural/strategic types and easy/medium/hard difficulty. No lexical overlap with stored memories. All deployments run the same queries for cross-site comparability.

2. Collect Statistics

python3 scripts/collect.py --contributor GITHUB_USER --days 14 --output /tmp/memory-bench-report.json

Collected (anonymized): Memory counts/types/ages, strength/importance histograms, association graph size, hierarchy levels, consolidation history, retrieval metrics (RAR/MRR/nDCG/MAP with CIs), ablation results, judge method, algorithm version, embedding coverage. Instance ID is a random UUID (not reversible).

Never collected: Memory content, queries, file paths, usernames, hostnames.

3. Submit as PR

scripts/submit.sh /tmp/memory-bench-report.json GITHUB_USERNAME

Forks, branches, places report, updates INDEX.json, opens PR. Requires gh CLI.

Validation Protocol

For peer-review-ready data, contributors should:

Run rate.py --ablation --judge openai (minimum N=30 queries)
Collect at least 2 reports from the same instance, ≥7 days apart (longitudinal)
Report the algorithm version (auto-captured from git)

Test Set Format

Custom test sets are JSON arrays:

[
  {
    "id": "T01",
    "query": "...",
    "category": "semantic|episodic|procedural|strategic",
    "difficulty": "easy|medium|hard"
  }
]

Agent Workflow

When asked to submit benchmarks: run rate.py --ablation --judge openai, then collect.py, review summary, then submit.sh. Share the PR link.