Agentsop Bounded Loop

Universal discipline for any LM-driven loop: agent retries, plan-act-observe, multi-agent handoffs, optimiser passes, test-fix cycles. Encodes the one rule every framework documents quietly and every team relearns expensively: the LM in the loop is NEVER a reliable terminator. Termination must be provided by an explicit counter + exit predicate + stagnation signal + escalation path that live OUTSIDE the LM's control. This is a tool-level, framework-agnostic skill. It maps onto LangGraph (recursion_limit + state counter + interrupt), CrewAI (max_iter + max_rpm + human_input), Claude / OpenAI SDKs (max_iterations + tool_use_budget), DSPy (declared evaluation budget), Aider (REPL + explicit retry cap), and AutoGen (max_consecutive_auto_reply). Search keywords: infinite loop, recursion limit, recursion_limit, GraphRecursionError, max iterations, max_iter, agent stuck, agent won't stop, runaway agent, ReAct loop not terminating, agent repeating itself.

Install

openclaw skills install agentsop-bounded-loop

bounded-loop · O7

Source posture: every load-bearing claim is cited inline with a short tag resolved against references/R1-source-evidence.md and references/R2-cross-framework.md. Examples cite the real GitHub issues they're distilled from.


1. Activation Rules

Activate this skill when any of the following is true:

  • The task involves a workflow that contains a cycle — tool-call → reflect → retry, plan → act → observe → re-plan, draft → critique → revise, test → fix → re-test.
  • The user is hitting a framework's "loop too deep" error: GRAPH_RECURSION_LIMIT (LangGraph), MaxIterationsExceeded (LangChain AgentExecutor), "agent exceeded max_iter" (CrewAI), max_turns reached (OpenAI Agents SDK), stop_reason="max_tokens" mid-tool-use (Anthropic).
  • The user proposes "let's just raise the limit" / "set max_iter to 100" / recursion_limit=200 — this is the canonical anti-pattern this skill exists to prevent.
  • The user is building a multi-agent system with delegation, handoff, or supervisor patterns — these are exposure-multipliers for unbounded loops (see [gh/crewai-330]).
  • The user is building an optimiser / evaluator loop (DSPy, AutoEval, RLHF, self-refining agent) where "stop when good enough" is the termination criterion — this is never sufficient on its own.
  • The user wants a test-fix loop, self-healing code agent, or iterative refinement workflow — every code-agent in production (Cursor, Aider, Devin, Claude Code) ships with an explicit step budget.

Do not activate for: single LLM calls, one-shot RAG queries, stateless tool pipelines, or flows where the cycle is provably bounded by data (e.g., "iterate once per row in this fixed list").


2. Core Mental Model

Every loop body must produce a state change that proves progress — and the proof must be checkable without calling another LM.

Read that twice. It contains four claims:

  1. The body must change state. A no-op iteration (same input → same output) is the definition of a stuck loop. If your body might return the same value twice, the loop is already broken; the safety net just hasn't fired yet.

  2. The change must be progress, not just diff. A retry that says "I tried again, same error" is a change but not progress. The witness has to be monotone: counter strictly increasing, error list strictly shrinking, confidence strictly rising, or a new fact added to the plan.

  3. The proof must be checkable. Pure Python. A dict.get("retries") < N, not await llm.ainvoke("are we done?"). If you ask the LM to evaluate termination, you've recreated the problem one level up — now that loop needs bounding.

  4. The LM is not allowed to vote. It can suggest finality (stop_reason="end_turn", final_answer tool, etc.) but the framework must verify against the predicate before terminating. Otherwise an LM that always says "let me try once more" runs forever.

Why the framework's default safety net is not enough

Every framework ships a default cap:

  • LangGraph: recursion_limit=25 [lc-docs/errors]
  • CrewAI: Agent.max_iter=20, Crew.max_rpm [crewai-docs/agents]
  • LangChain AgentExecutor: max_iterations=15 (deprecated default)
  • OpenAI Agents: Run.max_turns
  • Anthropic Messages: max_tokens per call (per-call, not per-loop)

These are billing safety nets, not control flow. The LangGraph docs say so explicitly:

"If you are not expecting your graph to go through many iterations, you likely have a cycle. Check your logic for infinite loops." — [lc-docs/errors] https://docs.langchain.com/oss/python/langgraph/errors/GRAPH_RECURSION_LIMIT

And the cheatsheet adds:

"Hitting the limit typically indicates an underlying design flaw. The recursion limit is a safety net for runaway code, not a primary control flow mechanism." — [cheatsheet/gotchas]

When you raise the limit to "fix" the error, you've moved the bug further away, not removed it. The text-to-SQL agent in [gh/6731] would still have hit recursion_limit=100, only after burning roughly 4× the Databricks quota (100 steps instead of 25).

The three-axis termination model

A bounded loop has three independent termination axes; you need at least two firing in series:

                ┌─── (a) success predicate met → exit success
                │
[loop body] ────┼─── (b) counter / budget exhausted → exit escalation
                │
                └─── (c) stagnation detected → exit escalation

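If you only have (a), the LM controls termination, and the LM is not a reliable terminator. If you only have (b), you'll burn the budget on N identical iterations. If you only have (c), a one-off flake that changes the witness looks like progress, and a loop that keeps changing without converging never exits.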

Compose all three.
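A framework-agnostic sketch of the three axes composed in one driver follows. The function name `run_bounded` and the `attempt` key are illustrative assumptions; the caller supplies `body`, `success` and `witness`, and all three checks are pure Python.

```python
from typing import Any, Callable, Tuple


def run_bounded(body: Callable[[dict], dict],
                success: Callable[[dict], bool],
                witness: Callable[[dict], Any],
                state: dict,
                max_iters: int = 3) -> Tuple[str, dict]:
    """Run `body` until one of the three exits fires; return (reason, state)."""
    last = witness(state)
    for attempt in range(1, max_iters + 1):
        state = body(state)
        state["attempt"] = attempt          # counter stays visible in state
        if success(state):                  # (a) success predicate met
            return "success", state
        current = witness(state)
        if current == last:                 # (c) stagnation: witness unchanged
            return "stagnant", state
        last = current
    return "budget_exhausted", state        # (b) counter exhausted


def always_fails(state: dict) -> dict:
    # Simulated body that produces the same error on every iteration.
    return {**state, "last_error": "syntax error near FROM"}


reason, final = run_bounded(always_fails, lambda s: False,
                            lambda s: s.get("last_error"), {})
print(reason, final["attempt"])  # stagnant 2
```

Success is checked first so that a win on the last allowed attempt is not reported as a failure. Both non-success reasons are meant to route to the escalation branch.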


3. Standard Operating Procedure

A coder agent walks this top-down. Each step has a decision gate — answer "no" and you go back, not forward.

Step 1 · Identify the loop body and the cycle invariant

Before adding any bound, write down on paper:

  • What is the loop body? (one function / one node / one task)
  • What input does it read? What output does it write?
  • What state field MUST be different on iteration N+1 vs iteration N for this to be progress? That field is your progress witness.

Gate: if you can't name the witness, you don't yet understand the loop well enough to bound it. Don't add a counter — go think.

Common witnesses by workflow shape:

| Workflow | Witness |
|---|---|
| Tool-call → error → retry | last_error text must change (or counter increments) |
| Plan → act → observe | plan_revision: int strictly increases |
| Draft → critique → revise | critique length shrinks OR revision_count increments with non-empty diff |
| Test → fix → re-test | failing_tests set strictly shrinks |
| Optimiser sweep | best_metric strictly improves (with patience) |
| Multi-agent handoff | task_status transitions through a state machine, not "in_progress → in_progress → ..." |
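As one concrete reading of the test-fix witness, "strictly shrinks" can be checked with a proper-subset comparison. This is a sketch; `made_progress` is a hypothetical helper, and the proper-subset rule is a deliberately strict choice.

```python
from typing import Set


def made_progress(prev_failing: Set[str], now_failing: Set[str]) -> bool:
    """True only if the failing set strictly shrank (proper subset)."""
    # `<` on sets is the proper-subset test: nothing new fails, something was fixed.
    return now_failing < prev_failing


print(made_progress({"test_a", "test_b"}, {"test_a"}))            # True
print(made_progress({"test_a", "test_b"}, {"test_a", "test_b"}))  # False
print(made_progress({"test_a", "test_b"}, {"test_a", "test_c"}))  # False
```

The third case fixes one test but breaks another, so it does not count as progress under this rule. A looser witness could compare set sizes instead.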

Step 2 · Add the iteration counter

Counter discipline:

  • One counter per loop, not per agent. In multi-agent systems where agents can call each other (CrewAI delegation, LangGraph subgraphs), the counter must live in the shared state, not per-agent max_iter — that is the CrewAI ping-pong bug [gh/crewai-330].
  • Counter is monotonic: Annotated[int, operator.add] in LangGraph, not a state replace.
  • Counter is visible — log it; surface it in traces. A counter you can't see in LangSmith / Maxim / Datadog is a counter you'll forget is there.

Pseudocode (framework-agnostic):

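from typing import Literal

MAX_RETRIES = 3  # illustrative cap; tune per workflow

def loop_body(state: dict) -> dict:
    new_state = do_one_iteration(state)  # placeholder for your LM/tool step
    new_state["retries"] = state.get("retries", 0) + 1  # counter lives in state
    return new_state

def should_continue(state: dict) -> Literal["continue", "give_up", "end"]:
    # Check success first so a win on the final attempt is not reported as give_up.
    if success_predicate(state):  # placeholder: pure-Python check, no LM call
        return "end"
    if state["retries"] >= MAX_RETRIES:
        return "give_up"
    return "continue"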
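To make "monotonic, not a state replace" concrete, the sketch below emulates in plain Python what a reducer annotation means: a partial update is combined with the old value instead of overwriting it. This is an illustration of the semantics only, not LangGraph's implementation; `merge` is a hypothetical helper.

```python
import operator
from typing import Annotated, TypedDict, get_args, get_origin, get_type_hints


class S(TypedDict):
    retries: Annotated[int, operator.add]  # reducer: new value is added to the old
    last_error: str                        # no reducer: new value replaces the old


def merge(state: dict, update: dict, schema: type = S) -> dict:
    """Apply a partial update, using the annotated reducer where one is declared."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        hint = hints.get(key)
        if get_origin(hint) is Annotated and key in merged:
            reducer = get_args(hint)[1]    # first metadata item after the base type
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"retries": 0, "last_error": ""}
state = merge(state, {"retries": 1, "last_error": "timeout"})
state = merge(state, {"retries": 1, "last_error": "syntax error"})
print(state)  # {'retries': 2, 'last_error': 'syntax error'}
```

With this semantics a node that returns `{"retries": 1}` can only ever push the counter up, so no node can reset it by accident.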

Step 3 · Add the stagnation detector

The counter alone wastes (N-1) iterations on identical work. Add a progress witness comparison:

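def should_continue(state):
    if success_predicate(state):
        return "end"
    # Compare only once a witness exists; on the first pass both fields are unset.
    witness = state.get("witness")
    if witness is not None and witness == state.get("last_witness"):
        return "give_up_stagnant"
    if state["retries"] >= MAX_RETRIES:
        return "give_up_budget"
    return "continue"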

Stagnation signals worth detecting:

  • Same last_error two iterations running.
  • Same tool_calls hash (same tool, same args) two iterations.
  • plan_revision did not increment.
  • failing_tests did not shrink (test-fix loop).

When stagnation fires, always escalate — don't retry.
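The "same tool, same args" signal can be sketched with a stable fingerprint of each call. The helper names are hypothetical, and the sketch assumes tool arguments are JSON-serialisable.

```python
import hashlib
import json
from typing import List


def tool_call_fingerprint(name: str, args: dict) -> str:
    """Stable hash of a tool call; key order in `args` does not matter."""
    canonical = json.dumps({"tool": name, "args": args}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_stagnant(history: List[str], fingerprint: str) -> bool:
    """True when this call repeats the immediately preceding one."""
    return bool(history) and history[-1] == fingerprint


history: List[str] = []
first = tool_call_fingerprint("run_sql", {"query": "SELECT 1", "db": "main"})
print(is_stagnant(history, first))   # False: nothing to compare against yet
history.append(first)

# Same tool and same arguments, only the key order differs.
second = tool_call_fingerprint("run_sql", {"db": "main", "query": "SELECT 1"})
print(is_stagnant(history, second))  # True: identical call two iterations running
```

Sorting keys before hashing keeps the comparison from being fooled by argument ordering, which an LM may vary between otherwise identical calls.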

Step 4 · Pick the LM's view of the loop state

The LM must see the loop counter and the last error / last witness. If it doesn't, it will happily repeat. Concretely:

  • LangGraph: include retries and last_error in the messages passed to the LLM node — or render them into the system prompt at each iteration.
  • CrewAI: surface the previous task's failure in the next task's context=[...], not in Crew.memory (which is muddier).
  • Claude / OpenAI SDKs: in the next user/tool-result message, include "Attempt {n} of {N}. Previous error: {err}. If you cannot fix it on this attempt, return final_answer with status=failed."

Without this, the LM thinks it's on iteration 1 forever. The framework's counter is in your code; the behavioural counter must be in the prompt.
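A small sketch of rendering the loop state into text for the next prompt, using the wording quoted above. `attempt_header` is a hypothetical helper; where the string is placed (system prompt, user message, tool result) depends on the framework.

```python
from typing import Optional


def attempt_header(attempt: int, max_attempts: int,
                   last_error: Optional[str] = None) -> str:
    """Render counter and last error as text the LM will actually see."""
    parts = [f"Attempt {attempt} of {max_attempts}."]
    if last_error:
        # Strip a trailing period so the sentence does not end in "..".
        parts.append(f"Previous error: {last_error.rstrip('.')}.")
    parts.append("If you cannot fix it on this attempt, "
                 "return final_answer with status=failed.")
    return " ".join(parts)


print(attempt_header(2, 3, "near 'FROM': syntax error"))
```

On the first attempt there is no previous error, so the middle sentence is simply omitted.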

Step 5 · Build the escalation branch before you remove the safety net

The framework's safety net exists for a reason — runaway billing. Don't disable it. Instead, build the graceful give-up that catches the counter/stagnation exit:

  • LangGraph: a give_up node that calls interrupt({"reason": ...}), preserving the full state for a human or outer agent to inspect.
  • CrewAI: a fallback Task with human_input=True that fires when the main task fails the validation in expected_output.
  • Claude SDK: a human_escalation tool that the model is forced to call when attempt == N.
  • OpenAI Agents: handle Run.status == "incomplete" / incomplete_reason == "max_turns" in the caller and surface to user.

Rule of thumb: a loop without a give-up branch is a loop that fails to a stack trace. That's not graceful.
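One way to keep the give-up branch from degrading into a stack trace is to have it build a structured payload instead of raising. The sketch below is framework-agnostic; `Escalation` and `give_up` are hypothetical names, and the `retries` / `last_error` keys are assumed state fields. In LangGraph, a payload of this kind is what the give_up node would hand to interrupt(...).

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Escalation:
    reason: str                     # e.g. "stagnant" or "budget_exhausted"
    attempts: int                   # how many iterations ran before giving up
    last_error: Optional[str]       # most recent failure, for the human to read
    state: dict = field(default_factory=dict)  # snapshot for diagnosis


def give_up(reason: str, state: dict) -> dict:
    """Build the payload for a human or outer agent instead of raising."""
    return asdict(Escalation(
        reason=reason,
        attempts=state.get("retries", 0),
        last_error=state.get("last_error"),
        state=dict(state),
    ))


payload = give_up("stagnant", {"retries": 2, "last_error": "PARSE_SYNTAX_ERROR"})
print(payload["reason"], payload["attempts"], payload["last_error"])
# stagnant 2 PARSE_SYNTAX_ERROR
```

Carrying the full state snapshot lets whoever picks up the escalation diagnose it without re-running the loop.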

Step 6 · Layer the outer safety bound

Even with counter + witness + escalation, each individual iteration can be expensive (one tool call doing a 200k-token web search). Add:

  • Token budget: sum input + output tokens across iterations; cap.
  • Wall-clock timeout: asyncio.wait_for(loop, timeout=T) at the outermost caller.
  • Rate limit: CrewAI max_rpm, OpenAI tier limits, Anthropic requests_per_minute. Hit these before you hit the model's rate-limit error which adds backoff + more retries.

These are not redundant with the counter — they're orthogonal axes. A 3-iteration loop where one iteration runs for 4 hours still wedges your system.
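The token and wall-clock axes can be sketched with the standard library alone. `TokenBudget`, `BudgetExceeded` and `run_with_deadline` are hypothetical helpers; the token counts passed to `charge` are assumed to come from whatever usage numbers the provider reports per call.

```python
import asyncio


class BudgetExceeded(Exception):
    """Raised when the cumulative token budget is spent."""


class TokenBudget:
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        # Sum input + output across iterations and stop once past the cap.
        self.used += input_tokens + output_tokens
        if self.used > self.limit:
            raise BudgetExceeded(f"{self.used} tokens used, limit {self.limit}")


async def run_with_deadline(coro, timeout_s: float):
    """Wall-clock cap at the outermost caller: ('ok', result) or ('timeout', None)."""
    try:
        return "ok", await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return "timeout", None


async def slow_iteration() -> str:
    await asyncio.sleep(1.0)  # stands in for one iteration that hangs
    return "done"


print(asyncio.run(run_with_deadline(slow_iteration(), timeout_s=0.05)))
# ('timeout', None)
```

Both exits would normally route into the same escalation branch as the counter and the stagnation detector, so the caller sees one uniform give-up shape.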

Step 7 · Test the bound

Write a regression test that injects a permanent failure and asserts:

  • The loop exits within N iterations (counter works).
  • The loop exits before N if the same error recurs (stagnation works).
  • The final state is captured in the escalation branch (escalation works).
  • The trace shows the counter visible (observability works).

This is the same shape as [gh/6731]'s recommended fix: "Add a regression test that injects a permanent SQL error and asserts the graph terminates within 3 iterations." Steal that pattern.
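A sketch of such a regression test follows. `run_loop` is a tiny hypothetical stand-in for the loop under test, so the example stays self-contained; in a real project the test would drive the actual graph or crew with a stubbed tool that always fails.

```python
def run_loop(execute, max_retries: int = 3) -> dict:
    """Hypothetical loop under test: counter + stagnation + escalation."""
    trace, last_error = [], None
    for attempt in range(1, max_retries + 1):
        error = execute()
        trace.append({"attempt": attempt, "error": error})
        if error is None:
            return {"status": "success", "trace": trace}
        if error == last_error:  # same error twice in a row: stagnant
            return {"status": "escalated", "reason": "stagnant",
                    "last_error": error, "trace": trace}
        last_error = error
    return {"status": "escalated", "reason": "budget",
            "last_error": last_error, "trace": trace}


def test_permanent_failure_terminates():
    calls = []

    def always_broken():
        calls.append(1)
        return "PARSE_SYNTAX_ERROR"  # injected permanent failure

    result = run_loop(always_broken, max_retries=3)
    assert len(calls) <= 3                                  # counter works
    assert len(calls) < 3                                   # stagnation fires first
    assert result["status"] == "escalated"                  # escalation reached
    assert result["last_error"] == "PARSE_SYNTAX_ERROR"     # state captured
    assert [t["attempt"] for t in result["trace"]] == [1, 2]  # counter visible


def test_changing_errors_hit_budget():
    errors = iter(["error 1", "error 2", "error 3", "error 4"])
    result = run_loop(lambda: next(errors), max_retries=3)
    assert result["reason"] == "budget"
    assert len(result["trace"]) == 3


test_permanent_failure_terminates()
test_changing_errors_hit_budget()
```

The second test covers the case stagnation cannot catch: the error changes every time, so only the counter ends the loop.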


4. Operation Models

Each operation is a primitive a coder agent can invoke. Format: Trigger → Action → Output → Evidence.

OP-1 · Retry counter in state (the foundational operation)

  • Trigger: Any cyclic LM workflow exists — even one cycle.
  • Action: Add a typed integer field to the workflow's persistent state with a monotonic semantics (LangGraph: Annotated[int, operator.add]; CrewAI: shared dict in Crew context or Flow state; Claude SDK: app variable). Increment inside the loop body; check in the exit predicate.
  • Output: A deterministic upper bound on iterations regardless of LM behaviour.
  • Evidence: [gh/6731] (text-to-SQL fix), [lc-docs/errors] ("explicit termination conditions are the right answer").

OP-2 · Stagnation detection via progress witness

  • Trigger: The counter alone keeps firing — you're wasting N iterations on identical retries.
  • Action: Store the previous iteration's distinguishing artifact (last_error, last_tool_call_hash, last_witness) in state. In the exit predicate, compare current vs previous before checking the counter. Same → exit stagnant, don't increment.
  • Output: Fast-fail on truly stuck loops; reserves the counter for cases where iterations are progressing slowly.
  • Evidence: [gh/6731] (LLM was retrying identical query 20 times — stagnation would have fired after 1).

OP-3 · Human / outer-loop escalation

  • Trigger: Counter exhausted OR stagnation detected.
  • Action: Route to a designated escalation node that does not crash. LangGraph: give_up node → interrupt({"reason", "state"}). CrewAI: fallback Task(human_input=True) or Flow @listen("failed") branch. Claude SDK: human_escalation tool injection. OpenAI Agents: catch incomplete_reason == "max_turns" in caller.
  • Output: A clean give-up branch carrying enough state for a human or outer agent to diagnose.
  • Evidence: [lc-blog/interrupt] four-pattern table; Aider's REPL return-to-human on failed test.

OP-4 · Token & wall-clock budget (orthogonal safety net)

  • Trigger: Iterations are themselves expensive (long context, web search, code execution).
  • Action: Track cumulative tokens in state; enforce wall-clock asyncio.wait_for at outer caller; enforce per-model requests_per_minute. Any one tripping → escalation (OP-3).
  • Output: Defense-in-depth — neither "3 cheap iterations" nor "1 expensive iteration" can run away.
  • Evidence: Anthropic Computer Use "step budget"; CrewAI max_rpm; OpenAI max_completion_tokens / max_prompt_tokens on Run.

OP-5 · Progress witness declared up front

  • Trigger: Designing a new cyclic flow.
  • Action: Identify the state field that MUST change between iterations to prove progress. Declare it as a typed field. The exit predicate verifies it changed; the LM's prompt is told to set it.
  • Output: A loop body that is structurally incapable of being a no-op.
  • Evidence: [cheatsheet/gotchas] "Treat each node like a pure function — return a partial state update"; LangChain best practices.

OP-6 · Refuse-to-raise-the-net (diagnostic)

  • Trigger: User asks "just raise recursion_limit / max_iter."
  • Action: Refuse and diagnose. Walk: (a) is there a state counter? (b) is there a witness? (c) does the LM see the previous error? Add what's missing. Leave the framework default in place — it's a circuit breaker, not a control knob.
  • Output: A diagnostic conversation that ends with OP-1+OP-2 instead of papering over the failure.
  • Evidence: [gh/6731] maintainer marked "not planned" — i.e., this is by design. [cheatsheet/gotchas] "indicates an underlying design flaw."

5. Dilemma Cases

Case 1 · "Text-to-SQL agent loops 20× to GRAPH_RECURSION_LIMIT"

  • Source: [gh/6731] (https://github.com/langchain-ai/langgraph/issues/6731), maintainer labelled "not planned."

  • Dilemma: A team built a text-to-SQL agent on LangGraph 1.0.6. When the Databricks query returned an error, the agent retried the same broken SQL 20 times until the default recursion_limit=25 fired. It had worked on 0.6.x; the upgrade exposed the missing exit condition.

  • Constraints:

    • Can't pin to 0.6.x: security fixes land only in 1.x.
    • Maintainer won't ship a fix; the issue is explicitly "not planned."
    • Databricks quota is bleeding; business needs the agent live.
  • Decision steps:

    1. Reject "just raise recursion_limit to 100." That makes the bleeding worse and confirms the cheatsheet's diagnosis [cheatsheet/gotchas].
    2. Add state counter (OP-1):
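      import operator
      from typing import Annotated, TypedDict
      from langchain_core.messages import AnyMessage
      from langgraph.graph.message import add_messages

      class S(TypedDict):
          messages: Annotated[list[AnyMessage], add_messages]
          retries: Annotated[int, operator.add]
          last_error: str | None
      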
    3. Add stagnation detector (OP-2): if state["last_error"] is the same as the new error, route to give-up — don't burn 2 more attempts on the identical broken query.
    4. Surface last_error into the next LLM prompt — the content of the SQL error usually tells the LLM whether to retry or abandon.
    5. Add give-up node (OP-3) that uses interrupt() to ask the user: "Tried 3 times, got: {last_error}. Should I rewrite the query differently or stop?"
    6. Add regression test (Step 7) injecting a permanent SQL error and asserting termination within 3 iterations.
  • Outcome: Loop bounded at 3 iterations. Stagnation typically fires on iteration 2 (same query, same error). Quota cost capped. Failure mode observable in LangSmith. Maintainer-labelled-"not-planned" issue becomes a non-issue without an upstream patch.

  • Extractable operations: OP-1 + OP-2 + OP-3 are mandatory for any cyclic graph. The LM is never the terminator.


Case 2 · "CrewAI delegation ping-pong burns 10× the token budget"

  • Source: github.com/crewAIInc/crewAI/issues/330

  • Dilemma: A team running CrewAI in hierarchical mode set allow_delegation=True on all 3 worker agents. Agent A delegated to B; B delegated back to A; A delegated to C; C delegated back to A. The per-agent max_iter=20 did NOT propagate across the handoffs; every delegation reset the count. Token bill 10× expected; the loop only ended when OpenAI rate-limited them.

  • Constraints:

    • Can't trivially flatten to sequential; the workflow really does need different specialists.
    • CrewAI's maintainer documentation acknowledges this as a known limitation but recommends "design your agents not to delegate circularly"; in other words, the framework's safety net is genuinely bypassed.
    • Compliance needs an audit trail of which agent ran when.
  • Decision steps:

    1. Reject "raise max_iter per agent to 100." The bug is that max_iter doesn't cross handoffs — raising it does nothing [gh/crewai-330].
    2. Set allow_delegation=False on all worker agents. Only the manager agent gets delegation. This kills the cycle structurally — the CrewAI canonical advice from [azguards.com].
    3. Add a Crew-level handoff counter in Crew context (Flow state if using Flows). Each delegation increments; manager checks before dispatching.
    4. Stagnation detector: hash (from_agent, to_agent, task_id) — if the same triple recurs, route to the manager's "I-can't-resolve-this" fallback Task with human_input=True.
    5. Outer timeout: asyncio.wait_for(crew.kickoff_async(...), timeout=600) — wall-clock cap regardless of token budget.
    6. Switch to CrewAI Flow + small Crews if delegation logic really needs to be conditional — @listen gives explicit routing [crewai-docs/flows], eliminating the LM-driven handoff.
  • Outcome: Cycle eliminated structurally (no worker delegates back). Token bill returns to expected level. Audit trail preserved (the manager owns dispatch; the handoff counter logs each).

  • Extractable operations:

    • In multi-agent systems, counter and witness must live in shared state, not per-agent config. Per-agent max_iter is a useful inner net but not a sufficient outer net.
    • When the framework's primitive is structurally insufficient (CrewAI max_iter not crossing handoffs), the right move is to remove the primitive's source of failure (allow_delegation=False), not raise its limit.
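The shared handoff counter and the (from_agent, to_agent, task_id) stagnation check from the decision steps can be sketched as a plain Python object that the manager consults before dispatching. `HandoffLedger` is a hypothetical helper, not a CrewAI API; how it is wired into Crew context or Flow state is left open.

```python
from typing import List, Tuple


class HandoffLedger:
    """Shared, crew-level record of delegations (hypothetical helper)."""

    def __init__(self, max_handoffs: int) -> None:
        self.max_handoffs = max_handoffs
        self.log: List[Tuple[str, str, str]] = []  # doubles as the audit trail

    def request(self, from_agent: str, to_agent: str, task_id: str) -> str:
        """Return 'ok', 'stagnant' (repeated triple) or 'budget' (cap reached)."""
        triple = (from_agent, to_agent, task_id)
        if triple in self.log:
            return "stagnant"      # same handoff seen before: ping-pong
        if len(self.log) >= self.max_handoffs:
            return "budget"        # shared counter exhausted
        self.log.append(triple)
        return "ok"


ledger = HandoffLedger(max_handoffs=5)
print(ledger.request("A", "B", "task-1"))  # ok
print(ledger.request("B", "A", "task-1"))  # ok
print(ledger.request("A", "B", "task-1"))  # stagnant
```

Because the ledger lives outside any single agent, a delegation cannot reset it, which is exactly the property per-agent max_iter lacks.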

6. Anti-patterns & Boundaries

Concrete don'ts. Each has a real-world example.

  • ❌ Raise recursion_limit / max_iter / max_turns to "fix" a loop. This is the anti-pattern this skill exists to name. The framework defaults are circuit breakers; raising them moves the failure further away while doubling the cost. Source: [gh/6731] maintainer "not planned"; [cheatsheet/gotchas] "indicates an underlying design flaw."

  • ❌ Let the LM vote on termination via reflection. Calling llm.invoke("are we done?") to decide whether to exit recreates the problem one level up — and the answer is biased ("let me just check one more thing"). The LM may suggest finality (a final_answer tool, stop_reason="end_turn"); the framework code must verify.

  • ❌ Bound only by tokens. A slow loop with cheap iterations never hits the token cap and runs for hours. Token budget is one of three axes (OP-4), not the whole bound.

  • ❌ Per-agent max_iter in multi-agent systems with delegation. CrewAI's Agent.max_iter does not propagate across delegation handoffs [gh/crewai-330]. The counter must be shared.

  • ❌ Bound without an escalation path. A loop that hits recursion_limit and raises is not "bounded" in any useful sense — it's "crashed with stacktrace." A bounded loop has a clean give-up branch (OP-3).

  • ❌ Counter that the LM cannot see. If you increment a counter in Python state but never surface "attempt N of M, previous error: X" in the LM's prompt, the LM thinks it's on attempt 1 forever and emits the same plan. Counter must be in the behaviour, not just the control plane.

  • ❌ "Stop when the metric stops improving" with no patience or cap. Classic optimiser footgun. The metric can plateau and resume; the loop should be bounded by both a max-step count and a patience counter — DSPy and W&B sweeps document this. Source: optimiser docs across DSPy / Optuna / W&B.

  • ❌ Bound the framework's max_iter but not the outer caller. If the LM raises an exception inside the loop body and your retry decorator wraps the whole call, the bounded loop becomes an unbounded retry. Bound at every layer the framework gives you.

  • ❌ Use interrupt() / human_input=True only on success. The give-up branch is the most important place for human-in-the-loop — that's where the agent is admitting it's stuck. Routing the failure to a stack trace instead of a human wastes the diagnostic moment.
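For the optimiser anti-pattern above, a bound that combines a max-step cap with a patience counter can be sketched as follows. `optimise` is a hypothetical driver over a stream of metric values, not the API of DSPy, Optuna or W&B.

```python
from typing import Iterable, Tuple


def optimise(scores: Iterable[float], max_steps: int = 50,
             patience: int = 3) -> Tuple[str, int, float]:
    """Stop on the step cap or after `patience` steps without a new best."""
    best, since_best, step = float("-inf"), 0, 0
    for step, score in enumerate(scores, start=1):
        if score > best:
            best, since_best = score, 0   # strict improvement resets patience
        else:
            since_best += 1
        if since_best >= patience:
            return "patience_exhausted", step, best
        if step >= max_steps:
            return "max_steps", step, best
    return "exhausted_input", step, best


# The metric plateaus at 0.6, then resumes to 0.7, then plateaus for good.
history = [0.5, 0.6, 0.6, 0.6, 0.7, 0.6, 0.6, 0.6]
print(optimise(history, patience=3))  # ('patience_exhausted', 8, 0.7)
print(optimise(history, patience=2))  # ('patience_exhausted', 4, 0.6)
```

With patience=2 the loop stops during the first plateau and misses the later improvement, which is why patience alone is a tuning knob and the step cap is the hard bound.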

Hard boundaries (when this skill does NOT apply)

  • One-shot LLM calls (no loop to bound).
  • Data-bounded loops where the cardinality is fixed at design time ("iterate once per row in this 500-row CSV"). The bound is the data size; counters/witnesses are over-engineering.
  • Pure-deterministic loops that don't involve LM calls.

7. Cross-Framework Mapping

The same termination contract expressed in each framework's vocabulary. Use this table when porting a bounded loop between frameworks — the shape is identical, only the names change.

| Concept | LangGraph | CrewAI | Claude SDK | OpenAI Agents | DSPy |
|---|---|---|---|---|---|
| Iteration counter | state["retries"]: Annotated[int, operator.add] | shared dict in Crew context or Flow state | app-side for i in range(max_iter): | Run(max_turns=N) config | optimiser max_bootstrapped_demos |
| Exit predicate | conditional edge function returning "END" | manager Task validating expected_output | if stop_reason == "end_turn": break | Run.status == "completed" | metric early-stopping (with explicit patience) |
| Safety-net default | recursion_limit=25 | Agent.max_iter=20 + Crew.max_rpm | max_tokens per call | max_turns (no default) | none (dataset size) |
| Progress witness | typed state field updated by node | Task expected_output mandates delta | validator on tool output | function schema enforces non-empty delta | metric must strictly improve |
| Stagnation signal | compare state["last_X"] == state["X"] in conditional edge | task callback hashes output → stored in context | compare tool_use blocks across iterations | compare tool_calls in Run steps | patience counter (steps since best) |
| Escalation | interrupt({"reason": ...}) node | fallback Task with human_input=True | tool call to human channel | incomplete_reason == "max_turns" handler | terminate optimiser + log |
| Resume after escalation | Command(resume=...) | re-kickoff() with appended human input | re-invoke with new user message | submit_tool_outputs(...) | re-run with adjusted config |
| Outer safety bounds | wrap graph.ainvoke in asyncio.wait_for | Crew.max_rpm + outer wait_for | sum usage.input_tokens + usage.output_tokens across calls; wall-clock | Run(max_completion_tokens, max_prompt_tokens) | num_threads budget + wall-clock |

Translation example. "Text-to-SQL agent retries the broken query 20 times" expressed in three frameworks:

| Concept | LangGraph | CrewAI | Claude SDK |
|---|---|---|---|
| Counter | retries: Annotated[int, operator.add] in TypedDict | Crew(memory=False, context={"retries": 0}) updated in callback | attempt = 0 in caller |
| Increment | LLM node returns {"retries": 1} | Task on_complete callback updates dict | attempt += 1 after each Messages call |
| Exit | conditional edge → END when retries >= 3 | manager Task aborts when context counter ≥ 3 | if attempt >= 3: break |
| Witness | last_error: str field updated by tool node | last_error key in Crew context | track in caller variable |
| Stagnation | edge function compares last_error | callback compares stored vs new | caller compares strings |
| Escalation | interrupt({"err": state["last_error"]}) node | fallback Task(human_input=True) | tool call human_escalate(err) |
| Outer net | recursion_limit=10 (leave default low) | Crew(max_rpm=30) + wait_for(60s) | max_tokens=2048 + wait_for(60s) |

The translation is mechanical because the contract is universal. That is the entire point of this skill.


8. Appendix · Citation Index

Short tags resolved against references/R1-source-evidence.md and references/R2-cross-framework.md:

  • [gh/6731] = github.com/langchain-ai/langgraph/issues/6731 — text-to-SQL recursion_limit, maintainer "not planned"
  • [gh/crewai-330] = github.com/crewAIInc/crewAI/issues/330 — delegation ping-pong; related #4783, #2606
  • [lc-docs/errors] = docs.langchain.com/oss/python/langgraph/errors/GRAPH_RECURSION_LIMIT
  • [lc-blog/interrupt] = www.langchain.com/blog/making-it-easier-to-build-human-in-the-loop-agents-with-interrupt
  • [cheatsheet/gotchas] = sumanmichael.github.io/langgraph-cheatsheet/cheatsheet/faqs-gotchas/
  • [crewai-docs/agents] = docs.crewai.com/en/concepts/agents
  • [crewai-docs/flows] = docs.crewai.com/en/concepts/flows
  • [azguards.com] = azguards.com/technical/the-delegation-ping-pong-breaking-infinite-handoff-loops-in-crewai-hierarchical-topologies/
  • [anthropic-docs] = docs.anthropic.com/en/api/messages — Messages API + Agent SDK "step budget" pattern
  • [dspy-docs] = dspy.ai/docs/building-blocks/optimizers — declared evaluation budget
  • [openai-agents-docs] = platform.openai.com/docs/assistants — Run config max_turns, max_completion_tokens, max_prompt_tokens

TL;DR (one-paragraph version for the impatient)

Every LM loop must carry an explicit counter in state, a progress witness the loop body must update, a stagnation detector that compares the witness across iterations, and a graceful escalation branch when either fires. Never raise the framework's recursion_limit / max_iter / max_turns to "fix" a loop — that limit is a billing safety net, not control flow, and raising it moves the failure further away while burning more tokens. The LM in the loop is never a reliable terminator. Source-of-truth case: LangGraph issue #6731 marked "not planned" — the framework will not save you; the discipline must.