Security audit

Arena Council — Multi-Model AI Debate

Security checks across malware telemetry and agentic risk

Overview

This skill is a local multi-model voting tool, but it also includes automatic refusal-bypass prompt rewriting that users should review carefully before installing.

Install only if you intentionally want a local council tool that may integrate with God Mode refusal-bypass behavior. For ordinary consensus use, remove or disable the god-mode integration, avoid sensitive prompts unless you trust every local model and log path, and review any sibling god-mode files before running it.
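One way to disable the integration for ordinary consensus use is an explicit opt-in gate that leaves prompts untouched by default. The sketch below is illustrative only: the environment variable `COUNCIL_ENABLE_GOD_MODE`, the function names, and the shape of the profile dictionary are assumptions, not names taken from the skill. The rewrite step itself is deliberately left out.

```python
import os


def god_mode_enabled(env=None):
    """Return True only when the (hypothetical) opt-in flag is exactly '1'."""
    env = os.environ if env is None else env
    return env.get("COUNCIL_ENABLE_GOD_MODE", "0") == "1"


def prepare_prompt(prompt, profile, env=None):
    """Return the prompt to send to a model.

    `profile` is assumed to be a dict with an optional 'censored' key.
    With the flag off, the prompt is always passed through unchanged.
    """
    if not profile.get("censored"):
        return prompt
    if not god_mode_enabled(env):
        # Default path: no rewriting, the refusal stands as a safety stop.
        return prompt
    # The rewrite path is intentionally not implemented in this sketch.
    raise NotImplementedError("prompt rewriting is omitted from this sketch")
```

With the flag unset, `prepare_prompt("hello", {"censored": True}, env={})` returns `"hello"` unchanged.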

SkillSpector

By NVIDIA

Vulnerability Patterns

Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection

Findings (13)

Lp3

Medium

Category: MCP Least Privilege
Confidence: 91%
Finding: The skill documentation indicates code capabilities including network access and file reads, but no permissions are declared. This creates a transparency and governance gap: consumers may approve or run the skill without understanding that it can contact LM Studio endpoints, query models, and potentially read local files such as model catalogs or profiling data.
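The gap described in this finding can be expressed as a set difference between what a skill declares and what it is observed to do. The following is a minimal sketch; the capability labels (`network:lm-studio`, `fs:read`) and the function name are hypothetical and do not correspond to any real permission schema.

```python
def undeclared_capabilities(declared, observed):
    """Return observed capabilities that are missing from the declaration."""
    return sorted(set(observed) - set(declared))


# Hypothetical labels for the behaviors named in the finding.
observed = ["network:lm-studio", "fs:read"]
declared = []  # the finding reports that no permissions are declared

print(undeclared_capabilities(declared, observed))
# prints ['fs:read', 'network:lm-studio']
```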

Tp4

High

Category: MCP Tool Poisoning
Confidence: 97%
Finding: The stated purpose is benign multi-model consensus, but the documented behavior extends into censorship profiling, prompt rewriting, and integration with a separate 'god-mode' bypass system. That mismatch is dangerous because it conceals materially riskier functionality behind an innocuous description, increasing the chance of unsafe deployment and misuse.

Context-Inappropriate Capability

High

Confidence: 99%
Finding: The documentation explicitly introduces a censorship-bypass workflow unrelated to ordinary voting/consensus. Embedding safety-evasion instructions in a general-purpose orchestration skill materially increases misuse risk by turning model refusals into a trigger for circumvention rather than a safety stop.

Description-Behavior Mismatch

High

Confidence: 98%
Finding: The module description explicitly states it will auto-probe models and apply 'God Mode' prompt wrapping when a model is 'censored', which is functionality aimed at bypassing model safety controls rather than supporting ordinary multi-model voting. In a council/consensus skill, embedding censorship-evasion logic creates a hidden pathway for transforming user prompts into jailbreak-style variants and materially increases the chance of generating prohibited or unsafe outputs.

Context-Inappropriate Capability

High

Confidence: 97%
Finding: At runtime, the council queries each model only after auto-profiling it and conditionally rewriting the prompt to bypass censorship, which is unrelated to the declared purpose of aggregating model opinions. This couples normal orchestration code with jailbreak behavior, making unsafe prompt escalation automatic and difficult for users to detect or prevent.

Missing User Warnings

Medium

Confidence: 96%
Finding: The skill's markdown documentation advises retrying refused prompts with bypass techniques but provides no warning about the legal, safety, or policy risks of defeating model safeguards. This omission normalizes unsafe operation and may lead users to deploy the workflow without understanding that it is designed to circumvent protective controls.

Natural-Language Policy Violations

High

Confidence: 99%
Finding: The skill directly instructs users to bypass model censorship/refusals using 'god-mode techniques.' This is a plain safety-evasion mechanism that can be used to obtain disallowed outputs from local models, especially in a council setup that broadens model coverage and retries automatically.

Missing User Warnings

Medium

Confidence: 91%
Finding: The skill sends user prompts to auto-discovered local models and may silently modify those prompts using stored techniques, without an explicit execution-time warning or consent flow. Even when models are local, undisclosed prompt transformation changes the semantics of what the user asked and can expose sensitive or policy-constrained content to additional processing paths.
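An execution-time consent flow of the kind this finding says is missing could look like the sketch below. It compares the prompt the user typed with the prompt about to be sent, and asks for confirmation only when they differ. The function names and the `ask` parameter are illustrative assumptions; only `difflib.unified_diff` from the Python standard library is relied on.

```python
import difflib


def describe_change(original, outgoing):
    """Return a unified diff of the two prompts, or '' if they are identical."""
    if original == outgoing:
        return ""
    diff = difflib.unified_diff(
        original.splitlines(),
        outgoing.splitlines(),
        fromfile="as typed",
        tofile="as sent",
        lineterm="",
    )
    return "\n".join(diff)


def confirm_send(original, outgoing, ask=input):
    """Allow unchanged prompts; require an explicit 'y' for modified ones."""
    change = describe_change(original, outgoing)
    if not change:
        return True
    answer = ask(
        "The prompt was modified before sending:\n"
        f"{change}\nSend anyway? [y/N] "
    )
    return answer.strip().lower() == "y"
```

Passing `ask` as a parameter keeps the gate testable without a terminal. The default of `input` means an empty reply is treated as a refusal.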

Ssd 1

High

Confidence: 99%
Finding: The 'god mode' retry flow instructs obfuscated rewriting of prompts after safety refusals, using methods like unicode and prefill to alter prompts until models answer. This is dangerous because it operationalizes jailbreak-style evasion as product behavior, undermining model safety controls and enabling harmful content generation.

Ssd 4

High

Confidence: 98%
Finding: The stepwise guidance tells users to probe models for refusal behavior and then wrap refusing models with bypass techniques. This creates an escalation playbook for systematically identifying and defeating safety boundaries across models, which is more dangerous in this context because the skill is explicitly designed to orchestrate multiple local LLMs at scale.

Ssd 1

High

Confidence: 99%
Finding: The file-level description advertises automatic prompt-wrapping to bypass censorship using recorded techniques, which is direct evidence of a built-in jailbreak mechanism. This is dangerous because it normalizes evasion of model safeguards as a default feature and can be used to obtain outputs the target models were designed to refuse.

Ssd 1

High

Confidence: 99%
Finding: This function inspects a model profile, checks whether the model is marked 'censored', loads a named technique from a probe module, and applies that technique to the prompt before inference. That is explicit runtime jailbreak logic: it operationalizes stored evasion techniques against model safety behavior and therefore directly undermines guardrails.

Ssd 4

Medium

Confidence: 95%
Finding: The code automatically probes previously unseen models, records whether they are 'censored', and persists the result for future use in the bypass workflow. This creates a self-expanding system that inventories models for safety resistance and then enrolls them into later censorship-evasion behavior, increasing scale and persistence of misuse.
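Because probe results are persisted, a reviewer can inspect the stored profiles to see which models have been enrolled. The sketch below assumes, purely for illustration, that profiles are stored as a JSON object mapping model names to records with a boolean `censored` field. The actual storage format and location are not described in this report, and the model names in the sample are made up.

```python
import json


def flagged_models(profiles_json):
    """List model names whose stored profile is marked 'censored'."""
    profiles = json.loads(profiles_json)
    return sorted(
        name for name, record in profiles.items() if record.get("censored")
    )


# Hypothetical sample data in the assumed format.
sample = '{"model-b": {"censored": true}, "model-a": {"censored": false}}'
print(flagged_models(sample))
# prints ['model-b']
```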

VirusTotal

All 63 of 63 vendors reported this skill as clean.


Static analysis

No suspicious patterns detected.