Durable Workflow

v1.0.1

Patterns and procedures for building AI agent workflows that survive real-world failures. Use when asked to build a multi-step automation, pipeline, or agent...

License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal: Benign
OpenClaw: Benign (medium confidence)
Purpose & Capability
The name/description match the included files and patterns (lock.js and workflow-template.js implement checkpointing, retries, DLQ, exit handling). Minor inconsistency: the package is instruction-only and declares no required binaries, but the shipped scripts are Node.js programs — the skill does not declare that Node is required.
Instruction Scope
SKILL.md stays on topic: it instructs copying and running the provided scripts and documents expected behavior (checkpointing, locks, retries). It explicitly tells the agent to read/write local state and lock files and to run node scripts; it does not instruct reading unrelated system secrets or contacting external endpoints.
Install Mechanism
There is no install spec (instruction-only), so nothing will be downloaded or written by an installer. The risk is low. Note: runtime execution will write files to the filesystem when you run the scripts.
Credentials
The code references environment variables (WORKFLOW_STATE_PATH, WORKFLOW_DLQ_PATH, STEP_TIMEOUT_MS) and defaults to local paths, but the skill metadata lists no required env vars. These env vars are non-sensitive, but SKILL.md and the code read environment configuration that is not declared in the registry metadata, a small coherence gap you should be aware of.
Persistence & Privilege
always:false and default model invocation are set (normal). The skill does not request persistent platform-level privileges or try to modify other skills' config. It writes/reads local state and lock files only, which is appropriate for its purpose.
Assessment
This skill appears to do what it says: it provides reusable Node.js workflow patterns and helper scripts. Before installing or running:

  1. Ensure you have Node.js available (the registry metadata does not declare it).
  2. Review the two scripts (lock.js and workflow-template.js). They run locally and will read/write files (defaults: workflow-state.json, workflow-dlq.json, /tmp locks).
  3. Avoid running them as root, and set WORKFLOW_STATE_PATH and WORKFLOW_DLQ_PATH to paths in directories you control to prevent accidental writes to sensitive locations.
  4. Confirm you are comfortable with process.kill-based PID checks (used by the lock).
  5. Fill in the TODO steps and review the notification hooks. Do not plug in credentials or external endpoints without auditing how they are used.

If you need higher assurance, ask the publisher to declare Node as a required binary and list the env vars the skill reads.

SKILL.md

Durable Workflow Patterns

Build automations that survive API failures, timeouts, and unexpected state — without rebuilding from scratch every time something breaks.

Core Principle

Every step in a multi-step workflow must answer three questions:

  1. What did I finish? (checkpoint)
  2. What do I do if this step fails? (recovery)
  3. Who finds out if something goes wrong? (alerting)

Skip any of these and the workflow will eventually fail silently.

Scripts

Ready-to-use implementations in scripts/:

Script                 Purpose
workflow-template.js   Complete workflow skeleton with checkpoints, retry, DLQ, exit handler
lock.js                File-based process lock; prevents concurrent runs

workflow-template.js

Copy and fill in the step TODOs:

cp scripts/workflow-template.js my-workflow.js
node my-workflow.js           # Run (or re-run — resumes from last checkpoint)
WORKFLOW_STATE_PATH=/tmp/state.json node my-workflow.js   # Custom state path

Features: atomic state saves, exponential backoff, timeout wrapper, DLQ, abnormal-exit logging.

lock.js

Prevent two instances of the same workflow from running at once:

const { withLock, LockError } = require('./lock');

withLock('/tmp/my-workflow.lock', async () => {
  // Only one process runs this block at a time
  await runWorkflow();
}).catch(e => {
  if (e.name === 'LockError') {
    console.error('Already running:', e.message);
  } else {
    throw e;
  }
});

Pattern 1: Checkpoint State

Save progress after every meaningful step. Never trust in-memory state across network calls.

// checkpoint.js pattern
const state = loadState('workflow-id') || { step: 0, results: [] };

if (state.step < 1) {
  state.results.push(await fetchData());
  state.step = 1;
  saveState('workflow-id', state);
}
if (state.step < 2) {
  state.results.push(await processData(state.results[0]));
  state.step = 2;
  saveState('workflow-id', state);
}
// Restart from any step — already-done steps are skipped
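
The pattern above leaves loadState and saveState undefined. A minimal sketch of what they might look like follows, assuming one JSON file per workflow id in a temp directory (the STATE_DIR location and the statePath helper are illustrative assumptions, not taken from workflow-template.js). The save writes to a temporary file and renames it over the target, so a crash mid-write should not leave a half-written checkpoint behind.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Assumption: one JSON file per workflow id, stored under STATE_DIR.
const STATE_DIR = path.join(os.tmpdir(), 'workflow-state');

function statePath(id) {
  return path.join(STATE_DIR, `${id}.json`);
}

// Returns the saved state object, or null if no checkpoint exists yet.
function loadState(id) {
  try {
    return JSON.parse(fs.readFileSync(statePath(id), 'utf8'));
  } catch (e) {
    if (e.code === 'ENOENT') return null;
    throw e; // corrupt or unreadable state should fail loudly, not reset silently
  }
}

// Write to a temp file in the same directory, then rename over the target.
function saveState(id, state) {
  fs.mkdirSync(STATE_DIR, { recursive: true });
  const target = statePath(id);
  const tmp = `${target}.${process.pid}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(state, null, 2));
  fs.renameSync(tmp, target);
}
```

Returning null for a missing file lets the `loadState(...) || { step: 0, results: [] }` idiom work unchanged, while any other read error still surfaces.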

Pattern 2: Circuit Breaker

Stop hammering a failing service. Open the circuit after N failures, half-open after a cooldown.

class CircuitBreaker {
  constructor(threshold = 3, cooldownMs = 30000) {
    this.failures = 0; this.threshold = threshold;
    this.cooldownMs = cooldownMs; // how long the circuit stays open before a half-open probe
    this.state = 'closed'; this.nextRetry = 0;
  }
  async call(fn) {
    if (this.state === 'open') {
      if (Date.now() < this.nextRetry) throw new Error('Circuit open');
      this.state = 'half-open';
    }
    try {
      const result = await fn();
      this.failures = 0; this.state = 'closed';
      return result;
    } catch (e) {
      this.failures++;
      if (this.failures >= this.threshold) {
        this.state = 'open';
        this.nextRetry = Date.now() + this.cooldownMs;
      }
      throw e;
    }
  }
}

Pattern 3: Exponential Backoff with Jitter

async function withRetry(fn, maxAttempts = 4, baseDelayMs = 1000) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try { return await fn(); }
    catch (e) {
      if (attempt === maxAttempts - 1) throw e;
      const delay = baseDelayMs * Math.pow(2, attempt) + Math.random() * 500;
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
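
To see what the defaults mean in practice, the delay formula from withRetry can be pulled out and evaluated on its own. In this sketch, backoffDelay is a hypothetical helper, and its injectable random parameter exists only so the schedule can be inspected without jitter.

```javascript
// Same formula as withRetry: base * 2^attempt + jitter in [0, jitterMs).
// `random` is injectable only so the schedule can be inspected deterministically.
function backoffDelay(attempt, baseDelayMs = 1000, jitterMs = 500, random = Math.random) {
  return baseDelayMs * Math.pow(2, attempt) + random() * jitterMs;
}

// With maxAttempts = 4 there are three waits (after attempts 0, 1, and 2).
const noJitter = () => 0;
const schedule = [0, 1, 2].map(a => backoffDelay(a, 1000, 500, noJitter));
console.log(schedule); // [ 1000, 2000, 4000 ]
```

With the default arguments, a call that fails every time therefore sleeps for roughly 7 to 8.5 seconds in total before the last error is rethrown, not counting the time spent in the calls themselves.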

Pattern 4: Dead Letter Queue

When a step fails after all retries, don't silently drop it. Route it somewhere reviewable.

const fs = require('fs');

// Relies on withRetry from Pattern 3.
async function processWithDLQ(items, processFn, dlqPath) {
  const failed = [];
  for (const item of items) {
    try { await withRetry(() => processFn(item)); }
    catch (e) { failed.push({ item, error: e.message, failedAt: new Date().toISOString() }); }
  }
  if (failed.length) {
    // Append to any existing DLQ entries rather than overwriting them
    const existing = fs.existsSync(dlqPath) ? JSON.parse(fs.readFileSync(dlqPath, 'utf8')) : [];
    fs.writeFileSync(dlqPath, JSON.stringify([...existing, ...failed], null, 2));
  }
}
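
A dead letter file is only useful if something eventually drains it. One way to do that is sketched below, assuming the entry shape written by processWithDLQ (item, error, failedAt). The replayDLQ helper is hypothetical and is not part of the shipped scripts.

```javascript
const fs = require('fs');

// Hypothetical helper: retry every DLQ entry once and keep only those that fail again.
async function replayDLQ(dlqPath, processFn) {
  if (!fs.existsSync(dlqPath)) return { replayed: 0, remaining: 0 };
  const entries = JSON.parse(fs.readFileSync(dlqPath, 'utf8'));
  const stillFailing = [];
  for (const entry of entries) {
    try {
      await processFn(entry.item);
    } catch (e) {
      // Keep the item, but record the most recent error and timestamp
      stillFailing.push({ ...entry, error: e.message, failedAt: new Date().toISOString() });
    }
  }
  fs.writeFileSync(dlqPath, JSON.stringify(stillFailing, null, 2));
  return { replayed: entries.length - stillFailing.length, remaining: stillFailing.length };
}
```

Because a replay runs each item a second time, processFn should be idempotent; otherwise an item that partially succeeded on the first run may be applied twice.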

Pattern 5: Idempotent Operations

Design every step so running it twice produces the same result as running it once.

// BAD: running twice creates two records
await db.insert({ id: uuid(), data });

// GOOD: upsert on natural key
await db.upsert({ id: deterministicId(data), data }, { onConflict: 'update' });
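
The example above does not define deterministicId. One possible shape is sketched here, assuming the record carries fields that identify it in the outside world (the field names source and externalId are placeholders). It hashes only those key fields, so a retried record maps to the same id even if its other fields differ.

```javascript
const crypto = require('crypto');

// Hypothetical helper: derive the id from the natural-key fields only.
// The default field names are placeholders; pass the ones that identify your records.
function deterministicId(data, keyFields = ['source', 'externalId']) {
  const key = keyFields.map(f => `${f}=${JSON.stringify(data[f])}`).join('&');
  return crypto.createHash('sha256').update(key).digest('hex').slice(0, 32);
}

const first = deterministicId({ source: 'crm', externalId: 42, note: 'first attempt' });
const retry = deterministicId({ source: 'crm', externalId: 42, note: 'retry' });
console.log(first === retry); // true
```

Fields outside the key are deliberately left out of the hash. Including volatile fields such as timestamps would give every retry a fresh id and bring the duplicate-record problem back.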

Pattern 6: Instance Lock

Prevent duplicate runs (e.g. cron overlap, manual re-trigger while running).

const { withLock, LockError } = require('./scripts/lock');

const LOCK_PATH = '/tmp/my-workflow.lock';

async function main() {
  await withLock(LOCK_PATH, async () => {
    // Safe: only one instance reaches here at a time
    await runWorkflow();
  });
}

main().catch(e => {
  if (e.name === 'LockError') {
    // Not an error — just another instance running
    console.log(`Skipping: ${e.message}`);
    process.exit(0);
  }
  console.error('Fatal:', e.message);
  process.exit(1);
});

The lock uses PID detection — stale locks from crashed processes are automatically reclaimed.
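
For readers who want to know what PID-based stale-lock detection can look like before trusting it, here is a rough sketch. It is not the contents of lock.js: the function names are hypothetical, and the only detail taken from this page is the use of process.kill for the liveness check.

```javascript
const fs = require('fs');

// Signal 0 checks whether the process exists without delivering a signal.
function isProcessAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (e) {
    return e.code === 'EPERM'; // exists, but owned by another user
  }
}

// Returns true if this process now holds the lock, false if a live process has it.
function tryAcquire(lockPath) {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      // 'wx' fails with EEXIST if the file already exists, so creation is exclusive
      fs.writeFileSync(lockPath, String(process.pid), { flag: 'wx' });
      return true;
    } catch (e) {
      if (e.code !== 'EEXIST') throw e;
    }
    let holder = NaN;
    try {
      holder = parseInt(fs.readFileSync(lockPath, 'utf8'), 10);
    } catch (e) {
      if (e.code !== 'ENOENT') throw e;
      continue; // lock was released between the write and the read; try again
    }
    if (holder > 0 && isProcessAlive(holder)) return false;
    fs.rmSync(lockPath, { force: true }); // stale: recorded PID is gone or unreadable
  }
  return false;
}

function release(lockPath) {
  fs.rmSync(lockPath, { force: true });
}
```

Two caveats apply to any scheme like this. The operating system can reuse PIDs, so a stale lock may occasionally look live. Also, two processes that detect the same stale lock at the same moment can race on the removal step. For cron-style overlap protection these are usually acceptable, but they are worth knowing about.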

Workflow Design Checklist

Before shipping any multi-step automation:

  • Each step saves state before moving to the next
  • External API calls wrapped in retry + backoff
  • Circuit breaker on services called more than once per run
  • Failed items go to a dead letter file/queue, not /dev/null
  • The workflow can restart from any step without duplicating completed work
  • Alerting fires when the workflow exits abnormally (not just on exception)
  • Timeouts set on all external calls (never await fetch() without a deadline)
  • Instance lock in place if triggered by cron or multiple callers
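
The timeout item in the checklist can be sketched in two ways: a generic wrapper that stops waiting on any promise, and a fetch call that carries its own deadline. The names withTimeout and fetchWithDeadline are illustrative, and the 30000 ms fallback is an assumption. STEP_TIMEOUT_MS is the variable named in the security summary, but how the template uses it is not shown on this page.

```javascript
// Assumption: fall back to 30 s when STEP_TIMEOUT_MS is unset or not a number.
const DEFAULT_TIMEOUT_MS = Number(process.env.STEP_TIMEOUT_MS) || 30000;

// Reject if `promise` has not settled within `ms`.
// This only stops waiting; it does not cancel the underlying work.
function withTimeout(promise, ms = DEFAULT_TIMEOUT_MS, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so it cannot keep the process alive
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// For fetch (global in Node 18+), an abort signal cancels the request itself.
async function fetchWithDeadline(url, ms = DEFAULT_TIMEOUT_MS) {
  return fetch(url, { signal: AbortSignal.timeout(ms) });
}
```

The distinction matters for retries. A step abandoned by withTimeout may still complete in the background, so the retried step needs to be idempotent.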

Alerting

Send a Telegram message when the workflow fails, so you find out without having to check. Uses only Node's built-in https module.

Set env vars: ALERT_TELEGRAM_TOKEN and ALERT_CHAT_ID.

const https = require('https');

function sendTelegramAlert(message) {
  const token  = process.env.ALERT_TELEGRAM_TOKEN;
  const chatId = process.env.ALERT_CHAT_ID;
  if (!token || !chatId) return Promise.resolve(); // alerting not configured, skip silently

  const body = JSON.stringify({ chat_id: chatId, text: message, parse_mode: 'Markdown' });
  return new Promise((resolve) => {
    const req = https.request(
      {
        hostname: 'api.telegram.org',
        path: `/bot${token}/sendMessage`,
        method: 'POST',
        headers: { 'Content-Type': 'application/json', 'Content-Length': Buffer.byteLength(body) },
      },
      res => { res.resume(); res.on('end', resolve); }
    );
    req.on('error', () => resolve()); // don't let alert failure crash the workflow
    req.setTimeout(5000, () => { req.destroy(); resolve(); });
    req.write(body);
    req.end();
  });
}

// Usage — in your main() catch block:
main().catch(async e => {
  console.error('Fatal:', e.message);
  await sendTelegramAlert(`❌ *Workflow failed*\n\`${e.message}\``);
  process.exit(1);
});
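
The catch block above only covers errors that reject main(). The checklist also asks for alerts on abnormal exits, and one way to approach that is to route uncaught exceptions, unhandled rejections, and termination signals through the same alert function. The installExitAlerts helper below is a hypothetical sketch. Its second parameter defaults to the real process object and exists only so the wiring can be exercised with a stand-in.

```javascript
// Hypothetical helper: send one alert, then exit non-zero, for the abnormal
// ways a Node process can end that a main().catch() handler never sees.
function installExitAlerts(alertFn, proc = process) {
  const fail = async (reason) => {
    try {
      await alertFn(`Workflow exited abnormally: ${reason}`);
    } catch (e) {
      // An alert failure must not block the exit
    }
    proc.exit(1);
  };
  proc.on('uncaughtException', e => fail(`uncaught exception: ${e.message}`));
  proc.on('unhandledRejection', r =>
    fail(`unhandled rejection: ${r instanceof Error ? r.message : String(r)}`));
  proc.on('SIGTERM', () => fail('received SIGTERM'));
  proc.on('SIGINT', () => fail('received SIGINT'));
}
```

Calling installExitAlerts(sendTelegramAlert) once at startup would connect it to the function above. Signals that cannot be caught, such as SIGKILL, still end the process without any alert.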

Common Failure Modes

See references/failure-taxonomy.md for a full catalog of agent workflow failures with diagnosis and fix patterns.
