Install

```
openclaw skills install high-performance-coding
```

Use when writing or optimizing performance-critical code — batch processing, concurrent/parallel systems, async pipelines, GPU computing, scientific computing, or any code where throughput, latency, or resource efficiency matters. Also trigger when the user mentions "performance", "optimize", "scale", "concurrency", "make it faster", "speed up", "throughput", "latency", "GPU", "memory bound", "CPU bound", "checkpoint", "resume", "断点续传" (resumable transfer), "中断恢复" (recovery after interruption), "idempotent", or asks about resource usage or making long-running tasks resumable. This skill encodes universal performance principles distilled from real systems — resource-aware parallelism, async pipeline design, GPU acceleration, interruption-tolerant computation, lock-free data structures, and progressive validation.

Universal principles for writing fast, resource-efficient code. These patterns apply across languages and domains — systems programming, scientific computing, data pipelines, and web services.
Before optimizing, answer one question: is the program compute-bound or bandwidth-bound?
A useful mental model is the roofline: every operation has an arithmetic intensity (FLOPs per byte loaded). Plot that against your hardware's peak compute and peak bandwidth. If your intensity is below the ridge point, you're bandwidth-bound — optimizing compute won't help. If above, you're compute-bound — better caching won't help.
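To make the ridge-point test concrete, here is a minimal Python sketch. The peak compute and bandwidth figures are illustrative placeholders (use your hardware's measured or datasheet values), and SAXPY is just a classic low-intensity example:

```python
# Roofline classifier. PEAK_FLOPS and PEAK_BW are placeholder numbers;
# substitute measured values for your machine.
PEAK_FLOPS = 2.0e12   # 2 TFLOP/s sustained compute (assumed)
PEAK_BW = 100.0e9     # 100 GB/s DRAM bandwidth (assumed)

def bound_kind(flops: float, bytes_moved: float) -> str:
    """Classify a kernel by arithmetic intensity (FLOPs per byte moved)."""
    intensity = flops / bytes_moved
    ridge = PEAK_FLOPS / PEAK_BW  # intensity where the two roofs meet
    return "compute-bound" if intensity >= ridge else "bandwidth-bound"

# SAXPY (y = a*x + y): 2 FLOPs per element, 12 bytes moved per element
# (load x, load y, store y as float32) → intensity ≈ 0.17, ridge = 20.
print(bound_kind(flops=2.0, bytes_moved=12.0))  # bandwidth-bound
```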
In practice:
- If perf stat shows high IPC (> 2) and a low cache-miss rate, you're likely compute-bound.
- If cache-misses or cache-references are high, you're bandwidth-bound.

Amdahl's Law sets the theoretical ceiling on parallelism. If a fraction S of your program is serial, the maximum speedup with N cores is 1 / (S + (1-S)/N). For S = 10%, infinite cores give at most 10× speedup. This is why finding and shrinking the serial fraction matters more than adding cores — and why the bottleneck (compute, bandwidth, or serial code) dictates the optimization strategy.
Little's Law connects concurrency to throughput. concurrency = throughput × latency. If each task takes 1 second and you need 100 tasks/s, you need ≥ 100 concurrent workers. Use this to pick max_workers from requirements, not guesswork. Combined with Amdahl: the serial fraction limits how much concurrency can actually help.
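Both laws are back-of-envelope arithmetic; a small sketch with example inputs:

```python
import math

def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    """Upper bound on speedup when serial_fraction of the work cannot parallelize."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

def workers_needed(throughput_per_s: float, latency_s: float) -> int:
    """Little's Law: concurrency = throughput × latency."""
    return math.ceil(throughput_per_s * latency_s)

print(f"{amdahl_speedup(0.10, 64):.2f}x")  # 8.77x, already near the 10x ceiling
print(workers_needed(100, 1.0))            # 100 workers for 100 tasks/s at 1 s each
```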
This determines everything that follows. Memory-bound optimizations applied to a compute-bound program add overhead with no benefit. Compute optimizations applied to a bandwidth-bound program don't move the needle. Adding parallelism beyond Amdahl's limit wastes resources. Answer the bottleneck question first.
Before anything: check what the server actually has available right now. cpu_count tells you total cores, not free cores. total_ram tells you installed memory, not available memory. Other users, background services, and yesterday's forgotten processes all consume resources. Use htop / free -h / nvidia-smi to see current state, not theoretical capacity. Then pick a strategy:
CPU-bound workloads:
- Python: ProcessPoolExecutor for CPU work (the GIL serializes threads on CPU tasks), ThreadPoolExecutor for IO waits.
- C++: std::thread::hardware_concurrency() returns available cores. std::execution::par_unseq (C++17) or OpenMP #pragma omp parallel for for loop parallelism.
- Rust: rayon with par_iter() automatically sizes the thread pool. Use par_bridge() for sequential iterators.

Memory-bound workloads:

- Budget against available memory (free -h, not total). Estimate per-worker memory: a worker that loads 2 GB means you can't run 16 of them on 32 GB.
- Stream instead of materializing: generators, Iterator traits, and lazy pipelines keep memory constant regardless of dataset size.

IO-bound workloads:

- Gate concurrency with a Semaphore(max_concurrent) rather than spawning unlimited tasks. Tune the limit by observing actual resource pressure.

GPU-bound workloads:

- When the batch doesn't fit, accumulate gradients: loss / n_accumulate then backward(), then optimizer.step() after every N micro-batches.
- Overlap transfers with compute: tensor.to(device, non_blocking=True) overlaps the CPU→GPU copy with kernel execution when data is pinned.

Container-bound workloads (Docker/batch evaluation): Each container is a resource consumer — memory, disk, CPU. Parallelism strategies:

- Reuse containers and reset state between runs (docker exec git checkout HEAD && git clean -fd). An agent task that shares one container across N modes is N× faster than rebuilding each time.
- Clean up leftovers: stale sweb.eval.* containers sit around consuming memory. Always run docker ps -a --filter | xargs docker rm -f before starting a new batch.
- Split worker pools and tune each max_workers. Instance workers generate predictions (CPU/network-bound), eval workers run Docker containers (memory/IO-bound). They have different bottleneck profiles — tuning them independently avoids contention.
- Threads suffice: container work is mostly waiting (for docker exec or docker build to complete). Threads work fine here — the GIL doesn't block waiting on subprocess output.
- Size the pool to the work: max_workers = min(cpu_count, len(tasks)) — spawning 32 workers for 5 tasks wastes thread creation overhead. See the sizing sketch after this list.
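A worker-sizing sketch that combines these rules: read what is free right now, then bound by memory and task count. It assumes psutil is installed; per_worker_gb is your own estimate from a dry run:

```python
import os
import psutil  # assumed dependency for querying available (not installed) memory

def pick_max_workers(n_tasks: int, per_worker_gb: float) -> int:
    cores = os.cpu_count() or 1
    avail_gb = psutil.virtual_memory().available / 2**30  # free right now
    by_memory = max(1, int(avail_gb / per_worker_gb))
    return min(cores, by_memory, n_tasks)  # never spawn workers you can't feed

# 2 GB per worker on a 32-core box with 20 GB currently free → 10 workers,
# no matter how many cores are idle.
print(pick_max_workers(n_tasks=500, per_worker_gb=2.0))
```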
Cross-cutting patterns for async code that apply regardless of framework:

The event loop must not block:
Synchronous SDKs (database drivers, HTTP clients, file I/O) stall the entire event loop. Offload them: asyncio.to_thread() in Python, tokio::task::spawn_blocking() in Rust. One blocked coroutine starves all others.
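A minimal sketch of the offload pattern; fetch_sync stands in for any blocking SDK call:

```python
import asyncio
import time

def fetch_sync(url: str) -> str:
    time.sleep(1.0)  # blocking call; run inline it would stall every coroutine
    return f"body of {url}"

async def heartbeat() -> None:
    for _ in range(4):
        print("loop is alive")       # keeps printing while fetch_sync sleeps
        await asyncio.sleep(0.3)

async def main() -> None:
    body, _ = await asyncio.gather(
        asyncio.to_thread(fetch_sync, "https://example.com"),  # off the loop
        heartbeat(),
    )
    print(body)

asyncio.run(main())
```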
Know your framework's concurrency model: Not all async primitives are concurrency-safe. Before spawning concurrent tasks, understand what your framework actually supports:
- SQLAlchemy's AsyncSession rejects concurrent queries — asyncio.gather() on the same session errors. Use separate sessions or serialize.
- tokio::spawn() is cheap, but spawning 100K tasks that each hold a connection exhausts file descriptors.
- Size connection pools from measurement: set pool_size and max_overflow based on your measured concurrency, not guesswork.

Test the failure paths: What happens when the database is unreachable? When the external API times out? When the semaphore is at capacity? Async error handling often silently swallows exceptions — explicitly test these paths.
Decouple stages with bounded queues:
Producer-consumer patterns with asyncio.Queue(maxsize=N) or tokio::sync::mpsc::channel(N) let stages run at different speeds. Backpressure is automatic — the producer blocks when the queue is full.
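A minimal producer-consumer sketch; the queue bound, stage timing, and None sentinel are illustrative choices:

```python
import asyncio

async def producer(q: asyncio.Queue) -> None:
    for i in range(20):
        await q.put(i)      # blocks while the queue already holds maxsize items
    await q.put(None)       # sentinel: no more work

async def consumer(q: asyncio.Queue) -> None:
    while (item := await q.get()) is not None:
        await asyncio.sleep(0.05)  # slow stage, so backpressure reaches the producer
        print("processed", item)

async def main() -> None:
    q: asyncio.Queue = asyncio.Queue(maxsize=4)  # the bound IS the backpressure
    await asyncio.gather(producer(q), consumer(q))

asyncio.run(main())
```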
LLM API and embedding throughput: When the pipeline calls external AI APIs, throughput is gated by rate limits, latency, and token cost:
embed(["text1", "text2", ...]) instead of N separate calls can reduce wall time by 10-50×. Test your provider's max batch size — some cap at 96, others at 2048.cache_control blocks. When cache hits, those tokens cost 90% less and skip the model's prefill phase entirely. Structure prompts so the cached prefix stays identical across calls.asyncio.Semaphore(max_concurrent) to stay under the limit, and catch 429 responses with exponential backoff. Many SDKs have built-in retry — make sure it's enabled.Universal GPU programming patterns, applicable whether you use PyTorch, CUDA C++, or JAX:
Universal GPU programming patterns, applicable whether you use PyTorch, CUDA C++, or JAX:

Benchmark methodology is not optional:
- Synchronize before reading timers: cudaDeviceSynchronize() (CUDA C++) / torch.cuda.synchronize() (PyTorch). CUDA ops are asynchronous — without a sync you measure launch overhead, not compute time. See the sketch after this list.
- Measure peak memory, not current: cudaMemGetInfo() (CUDA C++) / reset_peak_memory_stats() → max_memory_allocated() (PyTorch). Subtract baseline memory.
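A PyTorch sketch of that methodology; it assumes a CUDA device and uses an arbitrary matmul as the kernel under test:

```python
import time
import torch

device = torch.device("cuda")
x = torch.randn(4096, 4096, device=device)

torch.cuda.reset_peak_memory_stats(device)
baseline = torch.cuda.max_memory_allocated(device)

torch.cuda.synchronize(device)   # drain pending async work before the timer starts
t0 = time.perf_counter()
y = x @ x                        # kernel under test
torch.cuda.synchronize(device)   # wait for the kernel itself, not just its launch
elapsed_ms = (time.perf_counter() - t0) * 1e3

peak_mib = (torch.cuda.max_memory_allocated(device) - baseline) / 2**20
print(f"{elapsed_ms:.2f} ms, peak {peak_mib:.1f} MiB above baseline")
```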
Memory allocation patterns (CUDA C++):

- cudaMalloc once, reuse buffers. Runtime allocation inside hot loops is expensive.
- Use pinned host memory (cudaMallocHost) for DMA transfers. It enables overlap with cudaMemcpyAsync on different streams.
- Use streams: cudaStream_t stream; cudaStreamCreate(&stream); kernel<<<grid, block, 0, stream>>>(); — multiple streams can overlap kernel execution with data transfer.

Reproducibility vs speed:
- For reproducibility: cudnn.deterministic = True, cudnn.benchmark = False, manual_seed(seed), cuda.manual_seed_all(seed). Slower, but outputs are identical run-to-run.
- For speed: cudnn.benchmark = True (the cuDNN auto-tuner picks the fastest algorithm). Only works when input shapes are consistent.

Sparse vs dense is a data property:
The crossover point where sparse beats dense depends on density, hardware, and operation. Don't guess — benchmark a grid of sizes × densities. For graphs, edge_index (sparse) is almost always right when average degree << total nodes.
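A benchmark-grid sketch with scipy.sparse; the sizes, densities, and rep count are example choices, and real decisions deserve medians over more reps:

```python
import time
import numpy as np
import scipy.sparse as sp

def time_matvec(a, v, reps: int = 10) -> float:
    t0 = time.perf_counter()
    for _ in range(reps):
        _ = a @ v
    return (time.perf_counter() - t0) / reps

for n in (1000, 4000):
    for density in (0.001, 0.01, 0.1):
        a_sparse = sp.random(n, n, density=density, format="csr")
        v = np.random.rand(n)
        print(f"n={n} density={density}: "
              f"sparse {time_matvec(a_sparse, v) * 1e3:.3f} ms, "
              f"dense {time_matvec(a_sparse.toarray(), v) * 1e3:.3f} ms")
```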
Gradient accumulation is the universal solution to "batch doesn't fit":
Zero gradients once → loop over micro-batches → loss = loss / n_batches → backward() each → optimizer.step() once at the end. The effective batch size is the sum of micro-batches, but peak memory is that of a single micro-batch.
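A runnable PyTorch sketch; the toy model, fake micro-batches, and hyperparameters are placeholders:

```python
import torch
from torch import nn

model = nn.Linear(16, 1)                                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
loader = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(32)]  # fake data

n_accum = 8  # effective batch = 8 micro-batches; peak memory = one micro-batch
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets) / n_accum  # scale so gradients average
    loss.backward()                                   # accumulates into .grad
    if (i + 1) % n_accum == 0:
        optimizer.step()        # one optimizer update per n_accum micro-batches
        optimizer.zero_grad()
```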
The most reliable way to avoid wasting resources:
- Dry-run before the full job: wrap it in timeout 10m, run 1-2 iterations of the heavy loop, measure the time, and extrapolate (see the sketch after this list). If one slice takes 150 ms and you have 119 slices per epoch × 2000 epochs, that's ~10 hours — not 10 minutes. A 5-second dry run can prevent a multi-hour timeout.
- Validate progressively: max_items=5 before max_items=5000. Catch code bugs, config errors, and logic issues on a tiny, fast run. This also catches "this will take all night" before it starts.
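A sketch of the extrapolation step, with time.sleep standing in for one real slice; the counts mirror the example above:

```python
import time

def process_slice(i: int) -> None:
    time.sleep(0.15)  # stand-in for one 150 ms slice of real work

n_slices, n_epochs, n_sample = 119, 2000, 2

t0 = time.perf_counter()
for i in range(n_sample):
    process_slice(i)
per_iter = (time.perf_counter() - t0) / n_sample

total_h = per_iter * n_slices * n_epochs / 3600
print(f"~{per_iter * 1e3:.0f} ms/slice → estimated {total_h:.1f} h for the full run")
```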
When profiling shows a specific code path dominates, handle it without the general-case overhead:

Check before locking. An atomic counter can tell you "no work to do" without ever touching the lock:
```c
// Wait-queue wake: check the atomic counter before acquiring the spinlock.
// If num_wakers == 0, there's nothing to wake — skip the lock entirely.
if (atomic_load_explicit(&wq->num_wakers, memory_order_relaxed) == 0)
    return;
spin_lock(&wq->lock);
// ... actual wake logic
```
Single-owner skip. When a reference count is 1, you own the data exclusively:
```rust
// RwArc::get() — if we're the only handle, return &T with no further synchronization.
if self.num_rw.load(Ordering::Relaxed) == 1 {
    return unsafe { &*self.ptr }; // no other thread can observe
}
// Otherwise, fall back to the general (slower) path...
```
Power-of-two capacities. Ring buffer size = 2^n turns modulo into a bitmask:
```c
// idx % capacity → idx & (capacity - 1)
// Compiles to a single AND instruction instead of a DIV.
buf->items[tail & (buf->capacity - 1)] = item;
```
Relaxed memory ordering when safe.
If your context already provides ordering (preemption disabled, lock held, single-threaded phase), Ordering::Relaxed is cheaper than SeqCst:
```rust
// Inside a spinlock — the lock's acquire/release already provides ordering,
// and no other writer can race, so a relaxed load + store compiles to plain
// MOV/ADD instead of a LOCK XADD read-modify-write.
let n = self.counter.load(Ordering::Relaxed);
self.counter.store(n + 1, Ordering::Relaxed);
```
Data layout often matters more than instruction count. The gap between CPU speed and DRAM latency has been growing for 40 years — a main memory access costs ~100ns, during which a modern core could execute ~400 instructions.
Cache lines and false sharing: A cache line is 64 bytes on x86/ARM. When any core writes to a cache line, every other core's copy is invalidated. If two threads write to different variables that happen to share a cache line, they destroy each other's cache despite never touching the same data:
```c
// BAD: two adjacent ints share one cache line — threads fight invisibly.
struct counters_bad {
    int counter_a; // thread A writes here
    int counter_b; // thread B writes here — SAME cache line
};

// GOOD: align each counter to its own cache line (C11 alignas, <stdalign.h>).
struct counters_good {
    alignas(64) int counter_a;
    alignas(64) int counter_b;
};
```
In Rust: #[repr(align(64))]. Symptoms: high cache-miss rate on writes, poor scaling beyond 2 threads despite no obvious contention.
AoS vs SoA: When iterating over one field of many objects, Structure-of-Arrays keeps the accessed data contiguous:
```c
// AoS: iterating all x coordinates loads x, y, z into cache for every element.
struct Particle { float x, y, z; } particles[N];

// SoA: three separate arrays. Walking all x touches only x — 3× fewer cache lines.
struct Particles { float *x, *y, *z; };
```
Rule of thumb: if you access all fields together, use AoS (good locality). If you access one field across many objects, SoA wins. GPUs especially punish AoS — coalesced access requires threads in a warp to hit consecutive addresses, and AoS interleaves unused data.
NUMA awareness:
On multi-socket machines, memory is attached to specific CPUs. Accessing "remote" memory costs 1.3-2× more latency than "local". When allocating large buffers, allocate them on the NUMA node where they'll be used. In Linux, numactl --cpunodebind=N --membind=N pins both. In code, libnuma or numa_alloc_onnode().
Practical rules:
- perf stat -e cache-misses,cache-references tells you if layout optimization is worth doing at all.

Don't let one failure cascade:
Long-running work should survive crashes, restarts, and partial failures. Five levels of investment, from trivial to production-grade:
Level 1 — Skip if output exists. if os.path.exists(out): return. One line, makes any script idempotent.
Level 2 — Track processed IDs. Save a processed.json set, updated after each item. Re-run skips already-done work. For batch processing that takes minutes to hours.
Level 3 — Structured checkpoints at boundaries. After each pipeline stage, before external jobs (API calls, Docker, GPU epochs), and before human gates. A JSON row recording: current step, produced artifacts, pending jobs, how to continue.
Level 4 — Automatic recovery strategy. Escalate by failure count and type: auto-retry (transient, < 3 attempts) → checkpoint-resume (persistent, roll back) → skip-and-continue (non-critical, log and move on) → manual intervention (critical, pause for human).
Level 5 — Survive process death. DB-backed queue (not in-memory) + lease per work item (expired lease = another worker claims it) + graceful shutdown (catch SIGTERM, finish current item, save checkpoint, release locks).
Core principle: make the unit of work small enough that restarting from the last checkpoint is cheap. If recomputing from scratch costs 3 hours, create checkpoints more often than every 3 hours. The overhead of saving state is negligible compared to the cost of redoing work.
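A minimal sketch of Levels 1-2; the file name and process_item are placeholders:

```python
import json
import os

PROCESSED = "processed.json"   # placeholder path for the processed-ID set

def process_item(item: str) -> None:
    print("processing", item)  # placeholder for the real work

def load_processed() -> set:
    if os.path.exists(PROCESSED):
        with open(PROCESSED) as f:
            return set(json.load(f))
    return set()

def run(items: list) -> None:
    done = load_processed()
    for item in items:
        if item in done:       # re-runs skip finished work (idempotent)
            continue
        process_item(item)
        done.add(item)
        with open(PROCESSED, "w") as f:
            json.dump(sorted(done), f)  # persist after each item, not at the end

run(["a", "b", "c"])
```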
For compiled languages:
- Rust: lto = true + codegen-units = 1 enables cross-crate inlining. lto = "thin" for dev builds (most of the benefit, much faster). panic = "abort" removes landing pads for smaller binaries and better inlining. Typical: 2-10% speed gain, 2-3× compile time.
- C/C++: -flto -fwhole-program -fuse-linker-plugin for GCC/Clang. -ffast-math when IEEE compliance isn't needed. -march=native to enable instruction sets available on the build machine (AVX2, etc.).
- Go: -ldflags="-s -w" strips debug info and the symbol table. CGO_ENABLED=0 produces static binaries and avoids cgo overhead. Build with the target's GOARCH for best results — cross-compilation is cheap in Go.

The right data structure often beats micro-optimization by orders of magnitude:
The ⚠️ structures are correct only under specific invariants. Without those invariants, they produce silently wrong results — worse than a slow lock.