Install
```bash
openclaw skills install openclaw-prompt-shield
```

Local input-hardening scanner for OpenClaw agents. Pattern-based detection across 9 categories of LLM input risks, with combined-signal scoring and caller-supplied whitelists. Returns risk score 0-100, matched categories, a suggested sanitized version, and a safe-to-process verdict. Pure Python standard library, no remote calls, no API keys, no LLM.

Current version: v0.4.1.
A practical input-hardening skill for OpenClaw agents. It scans user-submitted text for prompt-injection, jailbreak, role-override, and data-exfiltration patterns before the agent processes it. All detection is pattern-based, deterministic, and runs locally in Python.
Most agent security skills focus on output review (do not leak secrets, do not break policy). Few focus on input hardening — checking what the user, or a third party whose content the agent is reading, is trying to do to the agent itself. Prompt injection is the most common real-world LLM exploit, and this skill gives the agent a fast, local, no-API check.
Scripts and references:

- scripts/scan_input.py — score a single piece of text 0-100 for injection risk, return matched categories, and a verdict (safe, caution, block).
- scripts/sanitize_input.py — produce a redacted, quoted version of risky text the agent can still read for context without executing the embedded directives.
- scripts/scan_batch.py — run the scan over many inputs at once (a list of email bodies, web search snippets, scraped pages) and emit a JSON report of which ones are safe to feed downstream.
- scripts/check_deps.sh — verify python3 is installed.
- references/patterns.md — category-level summary of what each detector covers.
- references/exfil-hosts.txt — caller-editable list of suspicious host fragments used by the exfiltration check.
- references/categories.txt — caller-editable verb / target / quantifier alphabets used to build the regex catalog at import time.

Detection categories:

| Category | What it catches |
|---|---|
| instruction_override | Phrasing that asks the model to drop or replace whatever it was previously told |
| role_hijack | Identity swaps into "unrestricted" personas |
| system_prompt_leak | Attempts to extract the agent's hidden context |
| delimiter_injection | Fake structural markers (chat delimiters, pseudo-system tags, identity frontmatter) |
| data_exfiltration | Attempts to send conversation, secrets, or context to outside endpoints |
| tool_abuse | Coercion into destructive shell commands or sensitive file reads |
| encoding_evasion | Base64/hex/URL-encoded payloads with decode-then-run phrasing |
| policy_bypass | Rationalizations for ignoring safety rules |
| indirect_injection (NEW in v0.3.0) | Imperatives wrapped inside quoted text, markdown links, fenced code blocks, HTML comments, or zero-width / bidi character sequences so they look like data rather than instructions |
The full category-level documentation is in references/patterns.md. Patterns are constructed at runtime from word-fragment lists; the source files therefore do not contain literal adversarial phrases.
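As an illustration of that construction step, here is a minimal sketch of building one detector regex from an alphabet file at import time. The file format, key names, and regex shape are assumptions made for the example; the skill's actual references/categories.txt layout and scripts/_patterns.py internals may differ.

```python
# Hypothetical sketch: load word-fragment alphabets and join them into a regex
# at import time, so the source never contains a literal attack phrase.
# Assumes a simple "key: frag1, frag2, ..." line format (an assumption).
import re
from pathlib import Path

def load_alphabets(path="references/categories.txt"):
    """Parse an alphabet file into {key: [fragments]} (assumed format)."""
    alphabets = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, rest = line.partition(":")
        alphabets[key.strip()] = [w.strip() for w in rest.split(",") if w.strip()]
    return alphabets

def build_pattern(verbs, targets):
    """Combine verb and target fragments, allowing a few filler words between."""
    verb_alt = "|".join(map(re.escape, verbs))
    target_alt = "|".join(map(re.escape, targets))
    return re.compile(
        rf"\b(?:{verb_alt})\b(?:\W+\w+){{0,4}}?\W+(?:{target_alt})\b",
        re.IGNORECASE,
    )
```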
Real attacks usually chain techniques (override + role hijack + leak); accidental matches rarely do. When two or more distinct categories all fire on the same input, a small bonus is added on top of the per-category sum:
| Distinct categories triggered | Bonus added |
|---|---|
| 2 | +6 |
| 3 | +12 |
| 4 | +18 |
| 5+ | +24 |
This makes a chained attack reliably cross the block threshold while a single isolated trigger word inside an otherwise benign sentence stays in the caution band where the agent can still read the message safely.
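Read as a formula, the table is +6 for each distinct category beyond the first, capped at +24. A minimal sketch of that scoring step (the function names and internals are illustrative, not the skill's source):

```python
def combined_signal_bonus(distinct_categories: int) -> int:
    """+6 per distinct category beyond the first, capped at +24."""
    if distinct_categories < 2:
        return 0
    return min((distinct_categories - 1) * 6, 24)

def risk_score(per_category_points: dict[str, int]) -> int:
    """Per-category sum plus the combined-signal bonus, clamped to 0-100."""
    base = sum(per_category_points.values())
    bonus = combined_signal_bonus(len(per_category_points))
    return min(base + bonus, 100)

# Matches the sample output below: 45 + 30 across two categories -> 81.
assert risk_score({"instruction_override": 45, "system_prompt_leak": 30}) == 81
```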
```bash
bash scripts/check_deps.sh
```
The skill is pure Python 3 standard library — no pip install needed.
```bash
python3 scripts/scan_input.py --text "<the user message>"
```
The output looks like:
```
risk_score: 81
verdict: block
thresholds: caution>=30, block>=70
combined_signal_bonus: +6 (distinct categories: 2)
matches:
  instruction_override (+45):
    - <fragment 1>
    - <fragment 2>
  system_prompt_leak (+30):
    - <fragment 3>
recommendation: <category-specific guidance, see Recommendations section>
```
You can also feed text from a file:
```bash
python3 scripts/scan_input.py --file user_message.txt --json
```
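One way to wire this into an agent pipeline is to shell out and read the JSON report. This is a sketch; the field names below ("risk_score", "verdict", "matches") are taken from the text report above, so confirm them against the actual --json output.

```python
# Hypothetical wrapper: run the scanner on untrusted text and gate on the verdict.
import json
import subprocess

def scan(text: str) -> dict:
    """Run scan_input.py on one message and return the parsed JSON report."""
    proc = subprocess.run(
        ["python3", "scripts/scan_input.py", "--text", text, "--json"],
        capture_output=True, text=True,
    )
    if proc.returncode == 3:  # documented error exit code
        raise RuntimeError(f"scanner error: {proc.stderr.strip()}")
    return json.loads(proc.stdout)

report = scan("<the user message>")
if report["verdict"] == "block":
    # Do not act on the content; surface the matched categories instead.
    print(report["matches"])
```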
If a domain legitimately discusses prompt injection (security blog posts, threat-modeling docs, fine-tuning datasets), pass the surrounding sentence with --whitelist so its trigger fragments are dropped before scoring:
```bash
python3 scripts/scan_input.py \
  --text "<security blog paragraph that quotes a known attack phrase>" \
  --whitelist "<the same paragraph or the quoted attack phrase>"
```
Or load a list of allowed phrases from a file:
```bash
python3 scripts/scan_input.py --file post.md --whitelist-file allow.txt
```
Whitelist matching is case-insensitive substring containment, so the whitelist entry can be the entire surrounding sentence and it will absorb every fragment the scanner extracts from inside it.
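For clarity, this is the shape of that containment check; the helper name is invented for the example and the real implementation may differ.

```python
def is_whitelisted(fragment: str, whitelist: list[str]) -> bool:
    """A fragment is absorbed if any whitelist entry contains it, ignoring case."""
    frag = fragment.lower()
    return any(frag in entry.lower() for entry in whitelist)

# The whitelist entry can be the entire surrounding sentence:
whitelist = ["The blog post quotes the phrase '<known attack phrase>' verbatim."]
assert is_whitelisted("<known attack phrase>", whitelist)
assert not is_whitelisted("<unrelated fragment>", whitelist)
```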
```bash
python3 scripts/sanitize_input.py --file scraped_page.txt --output safe.txt
```
The output:

- The text is wrapped in an <UNTRUSTED_USER_CONTENT> block so the agent cannot mistake it for instructions.
- Detected directives are replaced with [[REDACTED:category]] markers.

```bash
python3 scripts/scan_batch.py --jsonl inputs.jsonl --output report.json
```
Each line of inputs.jsonl is {"id": "...", "text": "..."}. The report contains per-id verdicts, and --only-safe writes a safe.jsonl subset to forward downstream. --whitelist and --whitelist-file work the same way as on scan_input.py.
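A sketch of preparing inputs.jsonl from in-memory snippets, using only the documented {"id": ..., "text": ...} line format (the ids and snippet contents are placeholders):

```python
import json

snippets = {
    "email-001": "<email body>",
    "search-002": "<web search snippet>",
    "page-003": "<scraped page text>",
}

# One JSON object per line, as scan_batch.py expects.
with open("inputs.jsonl", "w", encoding="utf-8") as fh:
    for item_id, text in snippets.items():
        fh.write(json.dumps({"id": item_id, "text": text}) + "\n")

# Then: python3 scripts/scan_batch.py --jsonl inputs.jsonl --output report.json
```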
Defaults:
- safe if score < 30
- caution if 30 ≤ score < 70
- block if score ≥ 70

Override per call:
```bash
python3 scripts/scan_input.py --file in.txt --caution-at 40 --block-at 80
```
For domains that legitimately discuss prompt injection (security research, AI policy writing), raise --block-at to 80 or 90 so only multi-category matches block, or use --whitelist.
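The banding itself is simple; a minimal sketch with the default and overridden thresholds (the function is illustrative, not the skill's code):

```python
def verdict(score: int, caution_at: int = 30, block_at: int = 70) -> str:
    """Map a 0-100 risk score to a verdict using the documented default bands."""
    if score >= block_at:
        return "block"
    if score >= caution_at:
        return "caution"
    return "safe"

assert verdict(12) == "safe"
assert verdict(45) == "caution"
assert verdict(81) == "block"
# Equivalent of --caution-at 40 --block-at 80 for injection-heavy domains:
assert verdict(75, caution_at=40, block_at=80) == "caution"
```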
| Code | Meaning |
|---|---|
| 0 | safe |
| 1 | caution |
| 2 | block |
| 3 | error (bad arguments, unsafe path, file not found) |
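Because the exit codes are documented, a caller can gate on them without parsing any output. A sketch; the wrapper and the sanitize fallback choice are illustrative:

```python
import subprocess

def gate(path: str) -> str:
    """Map scan_input.py's exit code to a verdict string."""
    proc = subprocess.run(["python3", "scripts/scan_input.py", "--file", path])
    return {0: "safe", 1: "caution", 2: "block"}.get(proc.returncode, "error")

if gate("scraped_page.txt") != "safe":
    # Fall back to the sanitized version instead of the raw text.
    subprocess.run(["python3", "scripts/sanitize_input.py",
                    "--file", "scraped_page.txt", "--output", "safe.txt"])
```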
Design notes:

- Suspicious host fragments live in references/exfil-hosts.txt, not in the Python source, so the scanner source contains no hard-coded directory of attack endpoints.
- The verb / target / quantifier alphabets live in references/categories.txt, not in the Python source. scripts/_patterns.py builds every pattern from those fragments at import time, so the source file contains no inline directory of words like "send/forward/secrets/credentials/keys/tokens" that a naive static scanner would mis-read as exfiltration intent.
- The scripts never shell out via subprocess. CLI scripts that take file paths reject any path containing shell metacharacters.
- Legitimate content that quotes attack phrasing can be exempted with --whitelist.

Recent changes:

- Moved the remaining inline word lists out of scripts/_patterns.py and into references/categories.txt under the new secret_stems and exfil_channels keys.
- The exfiltration pattern previously required to/via to sit immediately after the verb. The pattern now allows up to 4 filler words between the verb and to/via, and up to 2 between to/via and the channel word.
- Added the references/categories.txt external alphabet file. The verb / target / quantifier word lists used to live inside scripts/_patterns.py as inline Python lists; those triggered a potential_exfiltration flag from a static scanner that grepped the source for word lists. They now load from categories.txt at import time.
- scripts/_patterns.py now raises a clear RuntimeError on a partial install (categories.txt missing or missing required keys) rather than silently falling back to an inline default that would defeat the cleanup.
- Added passwords? to targets.exfil.
- New indirect_injection category (7 patterns): catches imperatives wrapped in quoted text, markdown link visible-text or URL, fenced code blocks, HTML comments, and runs of zero-width / bidi hidden characters.
- New --whitelist and --whitelist-file flags on scan_input.py and scan_batch.py for legitimate content that quotes attack phrasing.
- Moved the suspicious host fragments into references/exfil-hosts.txt so static scanners do not flag the source as containing an exfil host directory.
- Fixed: the (?:me|us)? and (?:your|the)? groups still required intervening whitespace, so short leak phrases that omit the optional words did not match system_prompt_leak. They now match.
- Fixed exit codes in scan_input.py: the script now correctly returns 2 (missing arguments), 3 (unsafe path / file not found / bad threshold), 1 (caution), 2 (block), 0 (safe).
- The recommendation text now lists which categories triggered, and gives a category-specific recommendation for tool-abuse and indirect-injection blocks.

MIT. See LICENSE.