Install
openclaw skills install agentsop-code-execution-decision
Decision rubric for when an LM agent should write-and-run code (Program-of-Thought / code interpreter) versus reason in natural language: classify each step as deterministic-computable (emit + execute code, feed the result back) vs judgment (stay in prose). Use when designing or debugging an agent step that does arithmetic/parsing/data transforms, when prose reasoning hallucinates a computation (under-coding), or when a sandbox round-trip is wasted on a judgment task (over-coding). Search keywords: code interpreter, agent does math wrong, calculator hallucination, when to run code vs reason, program of thought, PoT, tool vs reasoning.
One-liner: LMs are unreliable calculators but reliable coders. When the answer needs determinism and precision (arithmetic, exact data manipulation, deterministic transforms), emit code and run it. When the answer needs judgment, taste, or open-ended synthesis, reason in natural language. The cost of getting this gate wrong is silent: prose arithmetic hallucinates a plausible-looking wrong number, and over-coding a judgment task burns a sandbox round-trip for nothing.
This is an enhancement overlay. DSPy already gives you dspy.ProgramOfThought (PoT) —
the mechanism for write-then-execute. What it does not give you is the decision rubric for
when to reach for it. That rubric is this skill. Cross-link the sibling
[[agentsop-output-format-by-model]] (which decides how code-shaped content should be serialized) and
[[agentsop-test-fix-loop]] (which closes the execute → error → retry loop).
Activate this skill before committing a step to a reasoning strategy whenever the task has a verifiable, deterministic core — or whenever you catch an agent doing arithmetic in prose.
| Trigger | Signal |
|---|---|
| Arithmetic / math | "compute the compound interest", "what's 17.5% of $4,392.18", multi-step word problems, unit conversions, date deltas |
| Precise data manipulation | "sort these 240 rows by the third column", "dedupe and count", "join these two lists on id", "parse this CSV and sum column B" |
| Deterministic transforms | regex extraction, string reformatting, base conversion, hashing, sorting, set operations |
| Symbolic / combinatorial | "how many distinct permutations", "solve this system of equations", calendar/scheduling math |
| You see a model doing math in prose | "Let me add: 1,204 + 8,991 + ... = 10,195" — almost always worth a code check |
| Choosing a DSPy module | deciding between ChainOfThought and ProgramOfThought for a signature [dspy.ai/learn/programming/modules/] |
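As a rough illustration of why these triggers route to code, each of the following is a one-line deterministic computation. The input values for the date delta and the row list are made up for the example; only the percentage question comes from the table above.

```python
from collections import Counter
from datetime import date
from decimal import Decimal

# "what's 17.5% of $4,392.18": exact decimal arithmetic, no float drift.
pct = Decimal("4392.18") * Decimal("0.175")

# Date delta in days (illustrative dates, not from the source).
delta_days = (date(2024, 3, 1) - date(2024, 1, 15)).days

# "dedupe and count" (illustrative rows, not from the source).
rows = ["a", "b", "a", "c", "b", "a"]
counts = Counter(rows)

print(pct, delta_days, counts.most_common(1))
```

None of these needs judgment, which is the signal that the step belongs in an interpreter rather than in prose.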
Anti-triggers (do NOT reach for code execution):
LMs are unreliable calculators but reliable coders.
A language model predicts the next token, not the correct value. When you ask it to add
48,217 + 9,884 in prose, it emits the most plausible-looking digit sequence — which is
frequently wrong, and wrong in a way that looks right. The same model can write
48217 + 9884 as a Python expression flawlessly, because emitting the program is a
pattern-matching task it is genuinely good at, and the Python interpreter is a deterministic
oracle. This decoupling — model writes the recipe, interpreter computes the result — is the
entire thesis of Program-of-Thought (PoT) [arXiv 2211.12588] and PAL [arXiv 2211.10435].
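A minimal sketch of that decoupling, assuming the model's output is a plain arithmetic expression string. The helper name `eval_arithmetic` and its operator whitelist are hypothetical and not part of PoT, PAL, or DSPy; the point is only that the interpreter, not the model, produces the digits.

```python
import ast
import operator

# Arithmetic operators this evaluator is willing to run; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def eval_arithmetic(expr: str):
    """Evaluate a model-emitted arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {ast.dump(node)}")
    return walk(ast.parse(expr, mode="eval"))

model_emitted = "48217 + 9884"  # stands in for the string an LM would produce
print(eval_arithmetic(model_emitted))  # 58101
```

Walking the AST instead of calling `eval()` keeps the example from executing arbitrary model output; a real agent would still hand general code to a proper sandbox.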
┌────────────────────────────────────────────────────────┐
│ THE GATE: does this answer need determinism/precision? │
└────────────────────────────────────────────────────────┘
           │                                  │
   YES (computable)                     NO (judgment)
           │                                  │
           ▼                                  ▼
┌─────────────────────┐            ┌─────────────────────┐
│ EMIT CODE           │            │ REASON IN PROSE     │
│ model writes recipe │            │ model is the engine │
│ interpreter = oracle│            │ no oracle exists    │
└─────────────────────┘            └─────────────────────┘
           │
           ▼
sandbox → run → feed result back into LM → LM narrates/uses it
Two failure modes the gate prevents:
| Failure | Mechanism | Symptom |
|---|---|---|
| Under-coding (reason when you should compute) | LM hallucinates a calculation it cannot reliably perform | A confident, wrong, plausible-looking number; off-by-one counts; arithmetic that "looks" right |
| Over-coding (compute when you should reason) | LM wraps a judgment task in a sandbox round-trip that adds no determinism | Wasted latency + cost; brittle code that encodes a subjective rubric as if it were a formula; print("the tone is friendly") |
Why the asymmetry matters. Under-coding fails silently — the wrong number propagates downstream and no exception fires. Over-coding fails loudly and cheaply — you notice the useless sandbox call. So the default lean, when genuinely uncertain and a verifiable core exists, is toward code. But "genuinely uncertain" is the operative phrase: most judgment tasks are not close calls.
The format corollary (from [[agentsop-output-format-by-model]]): once you decide to emit code, emit it as code in a fenced block / single string field — never nested inside JSON sub-structure. Code-in-JSON measurably degrades the code itself (Aider: 61%→20% on GPT-4 Turbo). The execution decision and the serialization decision are two separate gates; pass both.
A three-step gate. Run it per step, not per task — one task can have computable steps and judgment steps interleaved.
Ask: "Is there a single correct answer that a program could verify?"
A useful tiebreaker for the "trivial computable" gray zone: if the model would be embarrassed to get it wrong and you'd reach for a calculator yourself, emit code. If you'd do it in your head without a second thought, prose is fine.
print/return the result. No network, no filesystem unless the task is I/O.
ProgramOfThought uses a Python interpreter (Deno/PythonInterpreter sandbox in recent versions); OpenAI Code Interpreter runs in a managed container; Anthropic code-execution tool runs in a sandboxed VM; LangChain PythonREPLTool runs in-process and is unsandboxed (treat as untrusted-input-hostile).
ProgramOfThought defaults to max_iters ≈ 3). After the bound, fall back to prose reasoning or surface the failure; do not loop forever.
Exit criterion: the step produces either (a) a code-derived value the LM has consumed, or (b) a prose judgment, with the gate decision recorded so a reviewer can audit why code was or wasn't used.
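The bounded retry described above can be sketched as follows. This is an illustration under stated assumptions, not DSPy's implementation: `run_with_bounded_retry` and `fake_generator` are hypothetical names, the convention that generated code assigns a variable called `result` is invented for the example, and `exec()` stands in for a real sandbox.

```python
import traceback
from typing import Callable, Optional

def run_with_bounded_retry(
    generate_code: Callable[[Optional[str]], str],
    max_iters: int = 3,
):
    """Run generated code, feeding each traceback back to the generator.

    Returns (result, attempts) on success, or (None, max_iters) once the
    bound is hit so the caller can fall back to prose or surface the failure.
    """
    last_error: Optional[str] = None
    for attempt in range(1, max_iters + 1):
        code = generate_code(last_error)
        namespace: dict = {}
        try:
            # WARNING: exec() is in-process and unsandboxed; a real agent
            # would hand `code` to a sandboxed interpreter instead.
            exec(code, namespace)
            return namespace["result"], attempt
        except Exception:
            last_error = traceback.format_exc()
    return None, max_iters

# Hypothetical generator: the first draft has a bug, the retry fixes it.
def fake_generator(error: Optional[str]) -> str:
    if error is None:
        return "result = sum([1, 2, 3]) / count"  # NameError: count undefined
    return "items = [1, 2, 3]\nresult = sum(items) / len(items)"

print(run_with_bounded_retry(fake_generator))  # (2.0, 2)
```

Returning `None` after the bound, rather than raising, leaves the fallback choice (prose reasoning or surfacing the failure) to the caller.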
Seven operations. Each row is a reusable move.
| # | Op | Trigger | Action | Output | Evidence |
|---|---|---|---|---|---|
| 1 | Computable-vs-judgment gate | Any step about to be reasoned | Apply §3 Step 1: single verifiable answer? | Route to code or prose | PoT premise: decouple compute from reasoning [arXiv 2211.12588] |
| 2 | Decompose mixed steps | Step has both a number and a narrative | Split into computable sub-step (code) + judgment sub-step (prose) | Two routed sub-steps | Mirrors mixed-content split in [[agentsop-output-format-by-model]] §5 Case B |
| 3 | Sandbox choice | Decided to emit code | Pick interpreter by trust + capability: DSPy PoT (Python sandbox), OpenAI Code Interpreter (managed container), Anthropic code-exec (VM), LangChain PythonREPLTool (unsandboxed, in-process) | Chosen runtime | DSPy modules [dspy.ai/learn/programming/modules/]; LangChain PythonREPLTool docs |
| 4 | Result-back-into-LM | Code produced a value | Inject stdout/return value into the next LM turn so the model narrates/uses it | LM-consumed result | PoT design: code computes, LM contextualizes [arXiv 2211.12588] |
| 5 | Error retry (bounded) | Code raised a traceback | Feed error to LM, regenerate, re-run; cap at max_iters (~3) then fall back | Fixed code or graceful fallback | DSPy ProgramOfThought max_iters; see [[agentsop-test-fix-loop]] |
| 6 | Precision escalation | Prose answer involves multi-step arithmetic | Re-route the arithmetic to code even if prose started it | Code-verified number | "LMs are unreliable calculators" — PAL [arXiv 2211.10435] |
| 7 | Over-coding veto | About to sandbox a judgment task | Stop: no deterministic core → reasoning, not code | Prose reasoning, no sandbox call | §5 Case B; avoids wasted round-trip |
In DSPy terms, op #1 is exactly the choice between dspy.ChainOfThought (prose reasoning) and
dspy.ProgramOfThought (emit+run) for a signature — the dspy skill lists the modules but this
overlay supplies the when.
Trigger: A finance-summary agent step: "Given these 14 line items, compute the total,
the 8.25% tax, and the grand total." The agent is a dspy.ChainOfThought module emitting prose.
Constraints:
Decision steps:
ChainOfThought to ProgramOfThought. The model now emits
subtotal = sum([...]); tax = round(subtotal * 0.0825, 2); total = subtotal + tax and the
interpreter computes it exactly.
total = 4,217.93 and narrates the invoice line. Code computed; LM contextualized.
program field; code-in-JSON would degrade it.
Outcome: The arithmetic is now deterministic and auditable. PoT-style code execution is the documented fix for exactly this class of error [arXiv 2211.12588, arXiv 2211.10435].
Extractable operation: Multi-step arithmetic in prose is a smell. Re-route it to code (op #6).
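For concreteness, here is roughly what the emitted program for this case could look like. The line-item amounts are illustrative only, since the source does not list the 14 items. Using `Decimal` with half-up rounding instead of the float `round()` in the snippet above is a choice made for this sketch, not something the source prescribes.

```python
from decimal import Decimal, ROUND_HALF_UP

TAX_RATE = Decimal("0.0825")  # the 8.25% rate from the task
CENT = Decimal("0.01")

def invoice_totals(line_items):
    """Return (subtotal, tax, grand_total), with tax rounded to the cent."""
    subtotal = sum(line_items, Decimal("0"))
    tax = (subtotal * TAX_RATE).quantize(CENT, rounding=ROUND_HALF_UP)
    return subtotal, tax, subtotal + tax

# Illustrative amounts only; the source does not list the 14 line items.
items = [Decimal("19.99"), Decimal("5.00"), Decimal("120.50")]
subtotal, tax, total = invoice_totals(items)
print(subtotal, tax, total)  # 145.49 12.00 157.49
```

The three printed values are what gets fed back to the LM, which then narrates the invoice line without redoing any arithmetic.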
Trigger: A support-triage agent step: "Read this customer message and decide whether the
tone is hostile, neutral, or warm." An over-eager engineer wires it through ProgramOfThought
because "code is more reliable."
Constraints:
if "!!!" in msg: tone = "hostile", a brittle
rule that encodes a subjective rubric as if it were a formula, and is worse than the model's
native judgment.
Decision steps:
ChainOfThought (or Predict). The LM is the right engine for judgment.
Outcome: No sandbox call. The judgment stays where judgment belongs. Over-coding is a real and common anti-pattern: not everything benefits from an interpreter, only things with a verifiable deterministic core.
Extractable operation: No verifiable answer → no code. Veto the sandbox round-trip (op #7).
Trigger: "Summarize this quarter's sales narrative and give me the exact total revenue."
Constraints: One sentence contains both a judgment (summary) and a computation (total).
Decision steps:
ProgramOfThought: code sums the figures, interpreter verifies.
ChainOfThought: prose synthesis, no oracle exists.
Outcome: Each sub-step uses its correct engine. This is the execution-decision analogue of the mixed-content two-pass pattern in [[agentsop-output-format-by-model]] §5 Case B.
Extractable operation: One task ≠ one strategy. Gate per step, decompose mixed steps.
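The split can be sketched as below, assuming a caller-supplied `reason_in_prose` callable that wraps whatever LM call the agent uses. `GateRecord`, `summarize_with_total`, and `fake_lm` are hypothetical names for illustration, and the revenue figures are invented. The code sub-step runs first, and its result is injected into the prompt for the prose sub-step.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class GateRecord:
    step: str
    route: str   # "code" or "prose"
    reason: str  # kept so a reviewer can audit the decision

def summarize_with_total(
    figures: List[float],
    narrative: str,
    reason_in_prose: Callable[[str], str],
):
    """Compute the total in code, then hand it to the prose step."""
    log = []
    total = round(sum(figures), 2)  # computable sub-step
    log.append(GateRecord("total revenue", "code", "single verifiable answer"))
    prompt = f"{narrative}\n\nExact total revenue (computed): {total:.2f}"
    summary = reason_in_prose(prompt)  # judgment sub-step
    log.append(GateRecord("summary", "prose", "no oracle for synthesis"))
    return summary, total, log

# Stand-in for an LM call; it only echoes the last line of the prompt.
fake_lm = lambda prompt: "Summary. " + prompt.splitlines()[-1]
summary, total, log = summarize_with_total(
    [1200.50, 830.25, 99.25], "Q3 notes...", fake_lm
)
print(total, [r.route for r in log])  # 2130.0 ['code', 'prose']
```

Keeping a `GateRecord` per sub-step is one way to satisfy the audit requirement in the exit criterion: the log shows which engine handled which part and why.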
max_iters ≈ 3) and fall back to prose or surface the failure. Infinite code-repair loops
burn cost. See [[agentsop-test-fix-loop]].
PythonREPLTool as a safe sandbox. It runs in-process, unsandboxed.
Fine for trusted self-authored code; hostile to untrusted input. Use a real sandbox
(OpenAI Code Interpreter container, Anthropic code-exec VM, DSPy's interpreter) when inputs are untrusted.
How major frameworks expose the emit-code-vs-reason mechanism. This skill operates at the decision layer; each framework supplies the mechanism.
ProgramOfThought [dspy.ai/learn/programming/modules/]
The canonical declarative version. A signature compiled with dspy.ProgramOfThought(Sig)
makes the LM emit Python, runs it in an interpreter sandbox, and feeds the result back —
with bounded retry on error (max_iters). The sibling dspy skill lists ChainOfThought
vs ProgramOfThought as module choices but does not give the decision rubric; this overlay
is that rubric. Use ProgramOfThought exactly when §3 Step 1 returns "computable."
OpenAI Code Interpreter is a managed container that the model can write Python into and execute, with files and state persisted across turns. It is heavier and stateful, which suits data-analysis sessions (load CSV, compute, plot). The same gate applies: route computable steps in, keep judgment in chat.
The Anthropic code-execution tool is a sandboxed VM exposed as a tool; the model emits code, it runs, and results return to the conversation. It is first-party and sandboxed, so it is safe for untrusted inputs. The execution-decision gate maps directly: offer the tool, but the model/agent should only invoke it for computable steps.
LangChain's PythonREPLTool is a tool wrapping a Python REPL. It runs in-process and is unsandboxed: powerful and dangerous. Use it only for trusted, self-authored computation; never expose it to untrusted input without an external sandbox. The decision rubric is identical; the safety profile is worst-in-class.
| Framework | Mechanism | Sandbox | Result-back |
|---|---|---|---|
| DSPy PoT | ProgramOfThought module | Python interpreter | automatic (max_iters) |
| OpenAI Code Interpreter | Assistants code tool | container | persisted state |
| Anthropic code-exec | code-execution tool | VM (safe) | into conversation |
| LangChain | PythonREPLTool | none (in-process) | manual wiring |
Every framework can be mis-invoked — pointed at a judgment task (over-code) or skipped for a computation (under-code). This overlay is the gate that decides invocation, regardless of which mechanism is underneath.
┌──────────────────────────────────────────────────────────────────────┐
│ EMIT-CODE-VS-REASON DECISION CARD │
├──────────────────────────────────────────────────────────────────────┤
│ Single verifiable answer a program could check? │
│ YES → EMIT CODE → sandbox → run → feed result back into LM │
│ NO → REASON IN PROSE (no oracle exists for judgment) │
│ MIXED → decompose; route each sub-step independently │
├──────────────────────────────────────────────────────────────────────┤
│ Arithmetic / parse / sort / count / regex / symbolic → CODE │
│ Tone / quality / summary / design / synthesis → PROSE │
│ "summary AND total" → SPLIT │
├──────────────────────────────────────────────────────────────────────┤
│ LMs are unreliable calculators but reliable coders. │
│ Under-coding fails SILENTLY (wrong plausible number). │
│ Over-coding fails LOUDLY+CHEAPLY (wasted sandbox round-trip). │
│ When genuinely uncertain AND a verifiable core exists → lean CODE. │
├──────────────────────────────────────────────────────────────────────┤
│ NEVER: │
│ • code a judgment task (over-coding veto) │
│ • do multi-step arithmetic in prose (under-coding) │
│ • forget to feed the code result back into the LM │
│ • nest generated code in JSON (see [[agentsop-output-format-by-model]]) │
│ • loop code-repair unbounded (see [[agentsop-test-fix-loop]]) │
│ • trust LangChain PythonREPLTool on untrusted input (unsandboxed) │
└──────────────────────────────────────────────────────────────────────┘
Primary anchors:
ProgramOfThought module: [dspy.ai/learn/programming/modules/], the emit+run+retry mechanism.
Framework / API docs:
PythonREPLTool: [python.langchain.com/docs/integrations/tools/python] (unsandboxed; in-process).
Companion / overlaid skills:
dspy-sop-skill/SKILL.md: ships ProgramOfThought but not this decision rubric (the gap this overlay fills).
d-output-format-by-model-skill/SKILL.md: sibling; once you emit code, how to serialize it (PoT for math/parse; code never nested in JSON). Cross-linked as [[agentsop-output-format-by-model]].
test-fix-loop: the execute → error → retry loop this skill defers to for bounded code repair. Cross-linked as [[agentsop-test-fix-loop]].