Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

Memory Bench Designer

v0.1.0

Designs a custom agent-memory benchmark for the user's specific use case. Activate when the user asks which memory strategy fits their agent, how to evaluate...

by TatsuKo Tsukimi (@tatsuko-tsukimi)

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for tatsuko-tsukimi/memory-bench-designer.

Prompt Preview: Install & Setup
Install the skill "Memory Bench Designer" (tatsuko-tsukimi/memory-bench-designer) from ClawHub.
Skill page: https://clawhub.ai/tatsuko-tsukimi/memory-bench-designer
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install memory-bench-designer

ClawHub CLI

npx clawhub@latest install memory-bench-designer
Security Scan

Capability signals: Crypto
These labels describe what authority the skill may exercise. They are separate from suspicious or malicious moderation verdicts.

VirusTotal: Benign
OpenClaw: Suspicious (medium confidence)

Purpose & Capability
SKILL.md clearly expects to run an external runner (memory-bench run ...) and to use template files (templates/scenario.yaml.tmpl, templates/weights.yaml.tmpl). The registry metadata declares no required binaries and includes no templates or install instructions. Either the metadata is incomplete or the skill assumes tools/files that are not provided — that is an incoherence between claimed functionality and declared requirements.
Instruction Scope
Runtime instructions tell the agent to (a) conduct multi-turn elicitation, (b) write scenario-<name>.yaml and weights-<name>.yaml into the user's current working directory, (c) invoke a CLI command that produces results/<name>/results.md and results.json, and (d) read and interpret results.md. The instructions reference template files that are not present in the file manifest. They also note that the runner will download a ~90 MB sentence-transformers model on first run. These behaviors involve filesystem writes, executing a local CLI, and network downloads — all beyond what's declared in the metadata.
Install Mechanism
There is no install spec (instruction-only), which is low-risk in itself. However, the skill assumes the 'memory-bench' CLI exists on PATH and will cause a model download (Hugging Face / sentence-transformers) when invoked with --embedding. The absence of an install step or a declared source for the runner binary means it's unclear how that binary would be obtained or whether it is safe/trusted.
Credentials
The skill declares no required environment variables, credentials, or config paths, and SKILL.md does not ask for secrets. That is proportionate. Note: network activity (model download) and reading/writing local files will still occur; no explicit credential access is requested.
Persistence & Privilege
always:false (good) and no install-time persistence is requested. The default autonomous-invocation setting is enabled (disable-model-invocation:false) — normal for skills — but combined with the instruction to execute a CLI and write files, this increases the impact if the agent runs the skill without clear user approval. The skill does not request to modify other skills or system-wide settings.
What to consider before installing
Before installing or enabling this skill, verify the following:

1. Where does the 'memory-bench' runner come from? The skill expects to run 'memory-bench' but the registry gives no install or binary source — ask the publisher for an official install instruction or a trusted release URL.
2. Templates referenced in SKILL.md (templates/scenario.yaml.tmpl, templates/weights.yaml.tmpl) are not included in the package; request those or confirm how they are created.
3. Running the skill will write files into your current working directory and will run a local CLI that may download ~90 MB of model data from external hosts (Hugging Face or similar). If you are uncomfortable with filesystem writes or network/model downloads, do not enable autonomous runs; prefer manual invocation.
4. Because the skill's source and homepage are unknown, exercise extra caution: ask the owner for provenance, an install guide, and a checksum/verified release of the runner.

If those clarifications are provided (runner origin, templates included, or an explicit install spec), the coherence issues would be resolved; otherwise, treat the skill as suspicious and avoid giving it autonomous execution privileges.

Like a lobster shell, security has layers — review code before you run it.

latest: vk97cx3263bqwqnydgsqq0x2gs58595kg
82 downloads
0 stars
1 version
Updated 6d ago
v0.1.0
MIT-0

Memory Bench Designer

An agent memory benchmark designer. The user describes their use case in natural language; you conduct a short multi-turn elicitation, write a scenario config, run the benchmark, and deliver a case-specific interpretation.

The central premise: no single memory strategy wins across use cases. Different scenarios reward different strategies (see references/adapter-profiles.md for empirical evidence). Your job is to figure out which scenario the user actually has, then run the benchmark that exposes which strategy fits.

Four-stage flow

  • Stage 1 Understanding — conversation with the user (3–5 turns)
  • Stage 2 Ideation — generate scenario.yaml + weights.yaml
  • Stage 3 Rollout — invoke the runner CLI
  • Stage 4 Judgment — interpret the results.md for this specific use case

After Stage 4, always offer: "Want to refine the scenario and re-run?" This is the AdaTest-style inner loop.

Stage 1 — Understanding

Goal: extract enough about the user's use case to fill in the scenario DSL.

Turn 1 — examples, not criteria. Ask:

"Give me 1–2 concrete examples of things your agent's memory should keep and retrieve later, and 1–2 examples of things it should discard or at least de-prioritize. Don't worry about defining the rules — just the examples."

Rationale: EvalGen's "criteria drift" finding. Users can't define criteria upfront; they can recognize good/bad examples.

Turn 2 — session shape. Ask two short questions:

"How many conversations/sessions does a typical user have with your agent before memory matters? And how long is one session — roughly how many turns?"

If the user is vague, offer defaults: 10 sessions × 40 steps. These are runner defaults.

Turn 3 — taxonomy check. Show the 4-family × 8-dimension matrix from references/taxonomy.md. Ask which 2–3 dimensions matter most for this use case. Do not force the user to rank all 8 — cognitive load is too high. You are looking for which families to weight.

Turn 4 (optional) — archetype mix. If the use case is ambiguous, show 3 candidate archetype mixes (see references/use-case-patterns.md), let the user pick or modify. Never show more than 3 candidates at once (AdaTest's 3–7 cap, we lean to 3).

By the end of Stage 1 you should know:

  • Archetype mix: fractions for core / evolving / episode / noise
  • Context evolution: random / narrow-band-drift / stable / mode-shifts
  • Themes: 3–6 short lists of vocabulary tokens (ask the user for domain words if they're non-obvious)
  • Which families they care about (for the Judgment stage)

If anything is ambiguous, default to the closest pattern in references/use-case-patterns.md and tell the user which pattern you chose and why.

Stage 2 — Ideation

Write two files into the user's current working directory:

  • scenario-<name>.yaml — the scenario config
  • weights-<name>.yaml — family weights for Judgment (optional)

Use templates/scenario.yaml.tmpl and templates/weights.yaml.tmpl as starting points. Substitute the values from Stage 1.
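
For orientation, a generated scenario might look roughly like the sketch below. Treat it as an illustration under assumptions, not the skill's actual DSL: the authoritative key names live in templates/scenario.yaml.tmpl (not reproduced here), and only the kinds of values reflect what Stage 1 elicits (session shape, archetype fractions, context evolution, themes).

# Hypothetical sketch — real key names come from templates/scenario.yaml.tmpl;
# values mirror the Stage 1 defaults and an example support-agent use case.
name: support-copilot
sessions: 10                 # runner default when the user is vague
steps_per_session: 40        # runner default when the user is vague
archetype_mix:               # fractions for core / evolving / episode / noise
  core: 0.35
  evolving: 0.30
  episode: 0.20
  noise: 0.15
context_evolution: narrow-band-drift   # random | narrow-band-drift | stable | mode-shifts
themes:                      # 3–6 short lists of domain vocabulary tokens
  - [refund, invoice, chargeback]
  - [onboarding, trial, upgrade]
  - [outage, incident, escalation]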

Show the user the generated scenario.yaml and ask: "Look right, or tweak anything before we run?" Keep this confirmation to one round — don't re-litigate Stage 1.
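
The optional weights file follows the same pattern. Again a hypothetical sketch: the four family names and the real skeleton come from references/taxonomy.md and templates/weights.yaml.tmpl, so the keys below are placeholders that only show the shape (weights for the families the user prioritized in Turn 3).

# Hypothetical sketch — family names are placeholders; the real ones are defined
# in references/taxonomy.md and templates/weights.yaml.tmpl.
families:
  family_a: 0.40   # the family the user ranked highest in Turn 3
  family_b: 0.30
  family_c: 0.20
  family_d: 0.10   # used by the Judgment stage to prioritize the interpretation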

Stage 3 — Rollout

Invoke the runner via Bash:

memory-bench run --scenario scenario-<name>.yaml --out results/<name>/ --embedding --composite

The --embedding flag enables the sentence-transformers adapter (first run downloads ~90 MB model). The --composite flag enables the weighted multi-signal adapter. Both are recommended — without them you only get three cheap baselines and the leaderboard is thin.

The runner writes results/<name>/results.md and results/<name>/results.json. Read the markdown file.

Expected runtime: 1–5 minutes. If it's slower, sentence-transformers is doing a cold model download — this is normal on first run.

Stage 4 — Judgment

Read results.md. Do not just paste it back to the user. Write a case-specific interpretation with three sections:

1. Capability profile. For each family the user said matters in Stage 1, state the winner, its score, and whether that score is high or low relative to the other scenarios in references/adapter-profiles.md. A winner with score 0.4 means "best available but still weak" — say that out loud.

2. Tradeoffs observed. Point to 1–2 dimensions where a non-winner adapter came close, and what that means. Example: "Composite edges out Embedding in Update Coherence by 5%, but loses Personalization by 10%. For your use case, you care more about X, so Embedding is the safer default."

3. Recommended starting strategy. One sentence: "Start with <adapter> because <why>. If you see <symptom> in production, try <alternative>." Be specific.

After these three sections, ask: "Want to refine the scenario and re-run?" Common refinements:

  • Bump up an archetype fraction that felt underrepresented
  • Switch context evolution type
  • Add or remove themes
  • Adjust weights.yaml to shift family priorities

Key UX rules (full detail in references/elicitation-flow.md)

  • Grade before criteria — ask for examples before asking for rules
  • Cap at 7 — never show more than 7 candidates/options/dimensions at once; prefer 3
  • Ranking always visible — when you show candidates, show why they're ranked in that order
  • Iterate every 5–8 interactions — surface pattern-detected summaries, don't let the conversation wander
  • Organization optional — don't force a taxonomy on the user upfront; let structure emerge from the examples they give

References

  • references/taxonomy.md — the 4×8 matrix shown in Turn 3
  • references/adapter-profiles.md — empirical profile of each strategy (what it wins, what it loses)
  • references/use-case-patterns.md — canonical patterns (game / companion / RAG / coding)
  • references/elicitation-flow.md — the UX rules above, with rationale
  • examples/game-ai-walkthrough.md — a full game-AI scenario elicitation and result
  • examples/npc-cognition-walkthrough.md — long-running NPC with stable persona
  • examples/coding-agent-walkthrough.md — code/PR/design memory with frequent supersedes
  • templates/scenario.yaml.tmpl — the scenario DSL skeleton
  • templates/weights.yaml.tmpl — family weights skeleton

What this skill does not do

  • It does not call any LLM judges — all metrics are mechanical
  • It does not evaluate actual agent responses — it evaluates the retrieval layer feeding them
  • It does not benchmark external memory services (Mem0, Zep, Letta) — it benchmarks algorithmic primitives (Recency, BM25, ACT-R, Embedding, Composite)
  • It does not replace production telemetry — it de-risks the initial strategy choice before you build
