Install
openclaw skills install @superior-ai/backtesting

Use when running, interpreting, or designing backtests on Superior Trade: anything about backtest windows, trade-count thresholds, exit-reason mix, parameter sweeps, walk-forward validation, zero-trade diagnosis, compute-cost estimation, or "is this backtest result trustworthy?". Pair with the relevant strategy template from `strategies/`.
The mechanics of submitting a backtest are in the main SKILL.md under "Backtest Workflow". This page is about the judgment calls: picking a window that means something, telling signal from noise in the result, and knowing when to give up vs. iterate.
Trade count is the single most important number on a result page. Look at it before PnL, before Sharpe, before win rate.
| Trade count | Verdict |
|---|---|
| < 30 | Coincidence, not a strategy. Don't promise anything; widen entries or extend window. |
| 30-50 | Marginal. Sharpe is noisy. Treat results as directional, not numeric. |
| 50-200 | Useful. Sharpe / profit factor start to mean something. |
| 200+ | Statistical confidence. Now you can compare variants on micro-differences. |
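The tiers above can be sketched as a small lookup. The function name and the short verdict labels are illustrative, and because the table's ranges overlap at 50 and 200, this sketch assumes a boundary value falls into the higher tier.

```python
def trade_count_verdict(trades: int) -> str:
    """Map a backtest's trade count to the verdict tiers in the table.

    Assumption: boundary values (50, 200) belong to the higher tier,
    since the table's ranges overlap at those points.
    """
    if trades < 0:
        raise ValueError("trade count cannot be negative")
    if trades < 30:
        return "coincidence"
    if trades < 50:
        return "marginal"
    if trades < 200:
        return "useful"
    return "statistical confidence"
```

Checking this before reading PnL or Sharpe keeps a 5-10 trade run with a perfect win rate from being presented as an edge.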
Watch for the trap: backtests with 5-10 trades and a 100% win rate. They look like world-beaters and almost always disintegrate live. The strategy is too selective — every signal is a coin flip you've cherry-picked, not a repeatable edge. Widen the entry threshold, lengthen the window, or accept that there's no statistical signal here.
A great backtest over the wrong window is a great fiction.
The window should answer: "if I had deployed this strategy on day one of this window, what would have happened?" — not "what's the prettiest curve I can fit?"
Pure bull, pure bear, sideways chop — your window should include at least two of the three. A 90-day backtest in a one-direction market is a 90-day cherry-pick. A momentum strategy that prints +50% over a +60% trending window has told you nothing about itself; it's just measured beta.
Less than that and you're really looking at noise. More than that and Hyperliquid's history may not cover the pair (HL was launched in 2023; many alts have < 12mo of data).
Some HIP3 / new-listing pairs only have a few weeks of data. Two pragmatic responses:
- … BTC if BTC-AAPL is too new), then assume the result transfers ±20%.

If you tune parameters and validate on the same window, your backtest is a souvenir, not a forecast. This is the single most common reason strategies that look brilliant on paper die in the first week of live trading.
The honest workflow:
If you do sweep, walk forward to a different period or pair before promoting the winner. A parameter that wins on Q1 BTC and also wins on Q1 ETH (out-of-sample for the second pair) is real. A parameter that wins on Q1 BTC and gets re-validated on Q1 BTC is theatre.
If the user's "strategy" started as a 5-variant sweep and they now want to deploy the winner, the result on the test window is overstated. Tell them to either run an out-of-sample check or accept that they're deploying on optimistic numbers.
When the user wants to find the right parameter (e.g. RSI threshold, ATR multiplier, lookback length), run 3 variants in one batch, not iterative single backtests. The shape of the 3-result table tells you what to do:
| Variant | Config | PnL% | Trades | Sharpe | Max DD |
|---|---|---|---|---|---|
| A | RSI < 25 | … | … | … | … |
| B | RSI < 30 | … | … | … | … |
| C | RSI < 35 | … | … | … | … |
Read the shape:
After the backtest completes, look at the breakdown of how trades ended. A healthy strategy ends through a mix of:
- take-profit targets (`minimal_roi`)
- exit signals (`populate_exit_trend`)
- custom exits (`custom_exit`)

Each one tells a different story. Distortions diagnose specific bugs:
| Distortion | Meaning |
|---|---|
| 90% stoploss hits | Stop is too tight relative to the strategy's natural noise. Widen, or accept this strategy is in the wrong volatility regime for this pair. |
| 90% time-based timeouts | Exit signal does nothing. Either the exit conditions are too restrictive, or there's no real edge — you're just holding until the timer rings. |
| 100% take-profit hits | minimal_roi is the strategy. The exit signal isn't earning its keep — could be removed or tightened. |
| 50/50 stop vs profit | Healthy. Strategy is choosing actively in both directions. |
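A minimal sketch of how the distortions in the table could be flagged automatically. The exit-reason keys (`stoploss`, `timeout`, `roi`) and the shape of the input dictionary are assumptions for illustration; the real result payload may name them differently.

```python
from __future__ import annotations


def diagnose_exit_mix(exit_counts: dict[str, int]) -> str:
    """Flag the exit-reason distortions described in the table.

    exit_counts maps an exit reason to the number of trades that ended
    that way. The keys used here are hypothetical names.
    """
    total = sum(exit_counts.values())
    if total == 0:
        return "no trades: nothing to diagnose"

    share = {reason: count / total for reason, count in exit_counts.items()}

    if share.get("stoploss", 0.0) >= 0.90:
        return "stop too tight for this pair's noise"
    if share.get("timeout", 0.0) >= 0.90:
        return "exit signal does nothing"
    if share.get("roi", 0.0) >= 1.0:
        return "minimal_roi is the strategy"
    return "no single exit reason dominates"
```

The thresholds mirror the table (90% stoploss, 90% timeouts, 100% take-profit); anything in between still needs a human read.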
Zero trades from a backtest means the entry condition never fired. Five common causes, in order of likelihood:
- … `RSI < 30` rarely triggers on BTC 15m).
- `startup_candle_count` larger than available data: the indicator stays NaN forever.
- … `BTC/USDC` for futures will silently produce no fills).
- `false AND false`: each condition individually rare, joined by AND rarer still.

Backtest run time scales with candle count. Estimate before submitting so you can set the user's expectations:
candles ≈ (days × 24) / timeframe_hours × number_of_pairs
| Candle count | Behavior |
|---|---|
| < 100K | Submit normally, poll every 10s. |
| 100K–500K | Warn user "may take a few minutes", use exponential polling. |
| > 500K | Warn user "could take 10+ minutes", longer poll intervals. |
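The estimate and the tiers above can be sketched in a few lines. The function names are illustrative, and the sketch assumes a count of exactly 100K or 500K falls in the middle tier.

```python
def estimate_candles(days: float, timeframe_hours: float, number_of_pairs: int) -> int:
    """candles ≈ (days × 24) / timeframe_hours × number_of_pairs"""
    if timeframe_hours <= 0:
        raise ValueError("timeframe_hours must be positive")
    return round(days * 24 / timeframe_hours * number_of_pairs)


def polling_advice(candles: int) -> str:
    """Return the behavior tier for a given candle count."""
    if candles < 100_000:
        return "submit normally, poll every 10s"
    if candles <= 500_000:
        return 'warn "may take a few minutes", use exponential polling'
    return 'warn "could take 10+ minutes", use longer poll intervals'


# 90 days of 15m candles (0.25 hours) on a single pair
candles = estimate_candles(days=90, timeframe_hours=0.25, number_of_pairs=1)
print(candles, "->", polling_advice(candles))
```

Here 90 days of 15m candles on one pair works out to 8,640 candles, well inside the first tier.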
Reference points:
Always allow the user to proceed. Just set expectations. Don't block on size.
When polling backtest_status on long runs:
This prevents hitting the 20-step tool limit on million-candle backtests.
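One way to picture exponential polling inside a fixed step budget is sketched here. The 10-second first interval comes from the table above and the 20-poll budget from the tool limit; the doubling factor, the 120-second cap, and the idea that each poll costs one step are assumptions, not documented values.

```python
def poll_intervals(first: float = 10.0, factor: float = 2.0,
                   cap: float = 120.0, max_polls: int = 20) -> list:
    """Build a capped exponential schedule of waits (seconds) between polls.

    Assumptions: doubling factor, 120s cap, one tool step per poll.
    """
    intervals = []
    delay = first
    for _ in range(max_polls):
        intervals.append(min(delay, cap))
        delay = min(delay * factor, cap)
    return intervals


schedule = poll_intervals()
print(schedule[:5])   # first few waits
print(sum(schedule))  # total seconds covered by the whole budget
```

Under these assumptions, 20 polls cover 2,070 seconds (roughly 34 minutes), against 200 seconds at a fixed 10-second interval.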
| Metric | Trust at | Notes |
|---|---|---|
| Total trades | Always look first. | Below 30 = ignore everything else. |
| Total profit % | Useful for ranking, weak for forecasting. | Large windows + small per-trade edge can produce big PnL from luck. |
| Win rate | OK above 50 trades. | A 60% WR on 8 trades is one good week, not an edge. |
| Sharpe ratio | Above 1.0 = good, above 2.0 = excellent. | But fragile under 50 trades; don't quote "Sharpe 2.0" off a 12-trade run. |
| Profit factor | Gross gains / gross losses. > 1.3 is the practical floor. | Penalizes hidden tail losses better than Sharpe. |
| Max drawdown | > 20% is risky for retail-sized accounts. | A 5% Sharpe-1 strategy with 30% DD is unrunnable for most users. |
| Avg holding | Sanity check — does it match the strategy's intent? | A "scalp" with 12h avg holding is misnamed; a "swing" with 5min holding likewise. |
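For reference, a minimal sketch of how three of these metrics can be computed from a list of per-trade profits. The input format (profits in account currency, in chronological order) and the function names are assumptions; the platform's own figures may be computed differently.

```python
def profit_factor(profits):
    """Gross gains divided by gross losses."""
    gains = sum(p for p in profits if p > 0)
    losses = -sum(p for p in profits if p < 0)
    if losses == 0:
        # No losing trades: undefined ratio, reported here as infinity.
        return float("inf") if gains > 0 else 0.0
    return gains / losses


def win_rate(profits):
    """Fraction of trades that closed with a positive profit."""
    if not profits:
        return 0.0
    return sum(1 for p in profits if p > 0) / len(profits)


def max_drawdown(profits, starting_balance):
    """Largest peak-to-trough drop of the equity curve, as a fraction of the peak."""
    equity = peak = starting_balance
    worst = 0.0
    for p in profits:
        equity += p
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


trades = [100, -50, 200, -100, 50]
print(profit_factor(trades), win_rate(trades), max_drawdown(trades, 1000))
```

With only five trades, none of these numbers would be trustworthy by the table's own standard; the example only shows the arithmetic.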
Before recommending live deployment, run the strategy on out-of-sample data at least once:
- … ETH if you tuned on BTC). Allow ±30% degradation; anything beyond that means the parameters were pair-specific, not regime-specific.

If both walk-forwards survive, you have a defensible recommendation. If either falls apart, you have a backtest, not a strategy.
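The ±30% tolerance can be expressed as a simple check. This sketch assumes the comparison is made on a single headline metric (for example PnL %) and that degradation is measured relative to the in-sample value; both choices, and the function names, are assumptions.

```python
def within_tolerance(in_sample: float, out_of_sample: float,
                     tolerance: float = 0.30) -> bool:
    """True if the out-of-sample metric is within ±tolerance of the in-sample one."""
    if in_sample == 0:
        raise ValueError("in-sample metric must be non-zero")
    relative_change = (out_of_sample - in_sample) / abs(in_sample)
    return abs(relative_change) <= tolerance


def defensible(checks, tolerance: float = 0.30) -> bool:
    """checks is a list of (in_sample, out_of_sample) pairs, one per walk-forward."""
    return all(within_tolerance(i, o, tolerance) for i, o in checks)


# Tuned result of 12% PnL, re-run on a different period and a different pair
print(defensible([(12.0, 10.0), (12.0, 9.5)]))
print(defensible([(12.0, 10.0), (12.0, 6.0)]))
```

The first call passes because both re-runs stay within 30% of the tuned result; the second fails because one re-run loses half of it.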