Install
openclaw skills install @globalcaos/memory-bench-pioneerBe one of the first to benchmark your agent's memory — and help shape how AI remembers. Runs a peer-review-grade evaluation suite (LLM-as-judge, nDCG/MAP/MRR with 95% CIs, ablation studies) against your live memory system and submits anonymized results to the ENGRAM/CORTEX research papers. Your data stays private; only aggregate stats leave. Works with agent-memory-ultimate. For the bold few who believe AI memory should be measured, not guessed at. Built for the TinkerClaw fork — github.com/globalcaos/tinkerclaw.
openclaw skills install @globalcaos/memory-bench-pioneerCollect, assess, and submit anonymized memory system statistics for the ENGRAM and CORTEX research papers.
Run the standard test set (30 queries across 4 types × 3 difficulty levels) with LLM-as-judge:
# Full assessment with GPT-4o-mini judge + ablation (recommended)
python3 scripts/rate.py --queries 30 --judge openai --ablation
# Without OpenAI key: local embedding judge (weaker, marked in output)
python3 scripts/rate.py --queries 30 --judge local --ablation
# Custom test set
python3 scripts/rate.py --testset path/to/queries.json --judge openai
What it measures:
Judge methods:
openai — GPT-4o-mini rates each (query, result) pair 1-5. Independent from retrieval system. ~$0.01 per run.local — Embedding cosine similarity. Weaker, marked as such in output. Zero cost.Standard test set (scripts/testset.json): 30 queries stratified across semantic/episodic/procedural/strategic types and easy/medium/hard difficulty. No lexical overlap with stored memories. All deployments run the same queries for cross-site comparability.
python3 scripts/collect.py --contributor GITHUB_USER --days 14 --output /tmp/memory-bench-report.json
Collected (anonymized): Memory counts/types/ages, strength/importance histograms, association graph size, hierarchy levels, consolidation history, retrieval metrics (RAR/MRR/nDCG/MAP with CIs), ablation results, judge method, algorithm version, embedding coverage. Instance ID is a random UUID (not reversible).
Never collected: Memory content, queries, file paths, usernames, hostnames.
scripts/submit.sh /tmp/memory-bench-report.json GITHUB_USERNAME
Forks, branches, places report, updates INDEX.json, opens PR. Requires gh CLI.
For peer-review-ready data, contributors should:
rate.py --ablation --judge openai (minimum N=30 queries)Custom test sets are JSON arrays:
[
{
"id": "T01",
"query": "...",
"category": "semantic|episodic|procedural|strategic",
"difficulty": "easy|medium|hard"
}
]
When asked to submit benchmarks: run rate.py --ablation --judge openai, then collect.py, review summary, then submit.sh. Share the PR link.