Memory Bench Pioneer

Be one of the first to benchmark your agent's memory — and help shape how AI remembers. Runs a peer-review-grade evaluation suite (LLM-as-judge, nDCG/MAP/MRR...).

MIT-0 · Free to use, modify, and redistribute. No attribution required.
0 · 381 · 0 current installs · 0 all-time installs
by Oscar Serra (@globalcaos)
Security Scan
VirusTotal
Benign
OpenClaw
Suspicious
high confidence
Purpose & Capability
The skill's code and instructions clearly target benchmarking a local agent memory: it looks for memory DBs under ~/.openclaw, imports the agent-memory-ultimate recall library, runs standardized queries, computes IR metrics, and can submit anonymized reports. Access to the local memory DB and the agent memory library is coherent with the stated purpose.
Instruction Scope
SKILL.md repeatedly asserts 'Never collected: Memory content, queries...', but rate.py's judge_with_openai sends the first 300 characters of each retrieved memory to api.openai.com for rating (an explicit HTTP request). The local-embedding judge also sends the query and result snippet to a local embedding service at http://127.0.0.1:8900/embed. Sending memory snippets to external services contradicts the 'never collected' claim and expands the scope beyond purely local aggregation of counts and metrics. submit.sh will also create and push a PR using the user's gh-authenticated CLI; that is expected for contributions, but worth noting since it writes report JSON into a public repo.
Install Mechanism
Instruction-only with bundled scripts; there is no install specification, no remote downloads, and nothing is written to non-standard system paths except a small instance-id file co-located with the found DB. This is low-risk from an installation/code-fetch perspective.
Credentials
The skill declares no required environment variables but the OpenAI judge path calls the OpenAI API and therefore requires an API key (not declared in requires.env). Using the openai judge will cause memory snippets to be transmitted to OpenAI. The local-embedding judge posts queries/results to 127.0.0.1:8900 (a local service) — that endpoint could be anything the user has running. submit.sh requires a GitHub CLI session (gh auth) and will push branches to the user's fork. These external interactions and undeclared credential needs (OpenAI API key, gh auth) are proportionate to the task but insufficiently documented and therefore concerning.
Persistence & Privilege
The skill does not request 'always: true', does not modify other skills, and only writes a small persistent instance id file next to the discovered DB and temporary files during submission. It does create branches and push to the user's fork when submit.sh is run, but only with the user's authenticated gh CLI — this is expected behavior for contributing a report.
What to consider before installing
This skill mostly does what it claims (benchmarks a local OpenClaw memory), but there is an important privacy mismatch to weigh before running it: the judge step (when you choose --judge openai) sends the first ~300 characters of each retrieved memory to OpenAI's API, and the local fallback posts query+result pairs to a local embedding endpoint (127.0.0.1:8900). That means sensitive memory contents can be transmitted off your machine (to OpenAI) or to whatever service is listening on the local port, even though the README claims 'memory content' is never collected. If you have any sensitive information in your memory DB, do not run the OpenAI judge.

Alternatives and mitigations:

  1. Use --judge local to avoid remote calls.
  2. Review and modify rate.py to redact or omit content before judging.
  3. Run the scripts in a sandboxed environment with a copy of your DB that contains no sensitive content.
  4. Inspect retrieval_log and other DB tables to ensure they don't already contain query or content text you wouldn't want included in reports.

Also be aware that submit.sh will use your gh CLI credentials to push a branch into a public repository: confirm you are comfortable publishing the aggregated report (even anonymized stats) and that the report fields do not leak anything you consider sensitive. If you want to avoid external transmission entirely, ask the contributor or edit judge_with_openai/judge_with_embeddings so that content is never sent (for OpenAI: send only non-identifying metadata, or drop the OpenAI-judge path entirely).
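
Mitigation (2) above can be sketched as a small pre-judge filter. The helper below is hypothetical (the patterns, the redact name, and the 300-character limit as a parameter are illustrative; only the ~300-char snippet window comes from the review), showing one way to scrub obviously identifying strings before any snippet leaves the machine:

```python
import re

# Patterns for obviously sensitive strings; extend for your own data (illustrative only).
REDACT_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<ip>"),
    (re.compile(r"(?i)\b(api[_-]?key|token|secret)\b\S*"), "<secret>"),
]

def redact(snippet, limit=300):
    """Scrub known-sensitive patterns from a memory snippet, then truncate
    to the same ~300-character window the judge would send."""
    for pattern, placeholder in REDACT_PATTERNS:
        snippet = pattern.sub(placeholder, snippet)
    return snippet[:limit]
```

Routing every snippet through such a filter before judge_with_openai builds its request would keep emails, IPs, and credential-looking tokens out of the API payload, at the cost of slightly noisier judge ratings.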

Like a lobster shell, security has layers — review code before you run it.

Current version: v2.0.0
latest: vk9720zagdjv1v9wkb1nb179f0x819xr5


SKILL.md

Memory Bench

Collect, assess, and submit anonymized memory system statistics for the ENGRAM and CORTEX research papers.

Three-Step Pipeline

1. Assess Retrieval Quality

Run the standard test set (30 queries across 4 types × 3 difficulty levels) with LLM-as-judge:

# Full assessment with GPT-4o-mini judge + ablation (recommended)
python3 scripts/rate.py --queries 30 --judge openai --ablation

# Without OpenAI key: local embedding judge (weaker, marked in output)
python3 scripts/rate.py --queries 30 --judge local --ablation

# Custom test set
python3 scripts/rate.py --testset path/to/queries.json --judge openai

What it measures:

  • RAR (Recall Accuracy Ratio), MRR (Mean Reciprocal Rank)
  • nDCG@5, MAP@5, Precision@5, Hit Rate
  • All metrics include 95% bootstrap confidence intervals
  • Ablation: runs with AND without spreading activation to isolate its contribution
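
As an illustration of the rank metrics listed above, here is a minimal sketch of MRR and nDCG@k. This is not the skill's actual implementation (bootstrap confidence intervals are omitted); it only shows what the numbers mean:

```python
import math

def mrr(rankings):
    """Mean Reciprocal Rank: average over queries of 1/rank of the first
    relevant result (0 when nothing relevant was retrieved).
    Each entry is a list of 0/1 relevance flags in retrieved order."""
    total = 0.0
    for ranks in rankings:
        rr = 0.0
        for i, rel in enumerate(ranks, start=1):
            if rel:
                rr = 1.0 / i
                break
        total += rr
    return total / len(rankings)

def ndcg_at_k(gains, k=5):
    """nDCG@k for one query: graded relevance gains in retrieved order,
    discounted by log2(rank+1) and normalized by the ideal ordering."""
    def dcg(g):
        return sum(gain / math.log2(i + 2) for i, gain in enumerate(g[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

So a run that always puts the best memory first scores nDCG@5 = 1.0, and a query whose first relevant hit is at rank 2 contributes 0.5 to MRR.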

Judge methods:

  • openai — GPT-4o-mini rates each (query, result) pair 1-5. Independent from retrieval system. ~$0.01 per run.
  • local — Embedding cosine similarity. Weaker, marked as such in output. Zero cost.
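
A rough sketch of how a local embedding judge like this could map cosine similarity onto the same 1-5 scale. The /embed request and response shape are assumptions (only the endpoint address appears in the security review); treat this as an outline, not the bundled rate.py:

```python
import json
import urllib.request

EMBED_URL = "http://127.0.0.1:8900/embed"  # local service; payload shape assumed

def embed(text):
    """Fetch an embedding vector from the local service (hypothetical JSON API)."""
    req = urllib.request.Request(
        EMBED_URL,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def local_judge(query, result):
    """Map query/result similarity in [0, 1] onto the judge's 1-5 rating scale."""
    sim = cosine(embed(query), embed(result))
    return 1 + round(4 * max(0.0, sim))
```

Note this is exactly why the local judge is weaker: lexical/embedding similarity is produced by the same family of signals the retrieval system optimizes, so it is not an independent rater the way a separate LLM is.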

Standard test set (scripts/testset.json): 30 queries stratified across semantic/episodic/procedural/strategic types and easy/medium/hard difficulty. No lexical overlap with stored memories. All deployments run the same queries for cross-site comparability.

2. Collect Statistics

python3 scripts/collect.py --contributor GITHUB_USER --days 14 --output /tmp/memory-bench-report.json

Collected (anonymized): Memory counts/types/ages, strength/importance histograms, association graph size, hierarchy levels, consolidation history, retrieval metrics (RAR/MRR/nDCG/MAP with CIs), ablation results, judge method, algorithm version, embedding coverage. Instance ID is a random UUID (not reversible).

Never collected: Memory content, queries, file paths, usernames, hostnames.

3. Submit as PR

scripts/submit.sh /tmp/memory-bench-report.json GITHUB_USERNAME

Forks the repository, creates a branch, places the report, updates INDEX.json, and opens a PR. Requires the gh CLI.

Validation Protocol

For peer-review-ready data, contributors should:

  1. Run rate.py --ablation --judge openai (minimum N=30 queries)
  2. Collect at least 2 reports from the same instance, ≥7 days apart (longitudinal)
  3. Report the algorithm version (auto-captured from git)

Test Set Format

Custom test sets are JSON arrays:

[
  {
    "id": "T01",
    "query": "...",
    "category": "semantic|episodic|procedural|strategic",
    "difficulty": "easy|medium|hard"
  }
]
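
A small sanity check for custom test sets can enforce this format before handing the file to rate.py. This validator is a hypothetical helper (not part of the bundled scripts); it takes the already-parsed JSON array:

```python
CATEGORIES = {"semantic", "episodic", "procedural", "strategic"}
DIFFICULTIES = {"easy", "medium", "hard"}

def validate_testset(queries):
    """Check a parsed custom test set against the format above.
    Raises ValueError on the first problem; returns the queries unchanged."""
    if not isinstance(queries, list):
        raise ValueError("top level must be a JSON array")
    seen = set()
    for q in queries:
        for field in ("id", "query", "category", "difficulty"):
            if field not in q:
                raise ValueError(f"{q.get('id', '?')}: missing field {field!r}")
        if q["category"] not in CATEGORIES:
            raise ValueError(f"{q['id']}: unknown category {q['category']!r}")
        if q["difficulty"] not in DIFFICULTIES:
            raise ValueError(f"{q['id']}: unknown difficulty {q['difficulty']!r}")
        if q["id"] in seen:
            raise ValueError(f"duplicate id {q['id']!r}")
        seen.add(q["id"])
    return queries
```

Typical use: `validate_testset(json.load(open("path/to/queries.json")))` before passing the same path via --testset.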

Agent Workflow

When asked to submit benchmarks: run rate.py --ablation --judge openai, then collect.py, review summary, then submit.sh. Share the PR link.

Files

6 total
