Install
openclaw skills install algo-builderBuild and test systematic daily-to-weekly trading algorithms through hypothesis, signal validation, backtesting, and strategy development steps.
openclaw skills install algo-builderUse this skill when the user wants to build a trading algorithm, develop a trading strategy, test a signal, backtest a strategy, research alpha, validate a trading hypothesis, or structure a systematic trading process.
This pipeline is optimized for daily-to-weekly frequency strategies (swing trading, factor investing, systematic macro, crypto position trading).
Does NOT apply to HFT (millisecond to minute): HFT requires order book simulation, tick data, latency modeling, and queue position analytics. If the user is building HFT, redirect them to those specialized tools.
For intraday strategies (1min to 4hr): use intraday IC horizons, model bid-ask explicitly, consider volume or dollar bars instead of time bars.
Most people start at backtesting. That is wrong. Backtesting is a confirmation tool, not a discovery tool — as Marcos Lopez de Prado puts it: "Backtesting is not a research tool. Feature importance is." Starting at backtesting causes overfitting, wasted time, and false confidence.
The correct pipeline:
HYPOTHESIS → SIGNAL TEST → STRATEGY CONSTRUCTION → IN-SAMPLE BACKTEST → WALK-FORWARD → PAPER TRADE → LIVE
Never skip steps. Never go backward except intentionally.
Before any code, write the hypothesis in plain English.
Prompt the user to answer:
If the user cannot answer all five, the hypothesis is not ready. Do not proceed.
Write the hypothesis into a file: algo-builder/hypotheses/<strategy-name>.md
Template:
## Hypothesis: <name>
**Edge:** <one sentence>
**Mechanism:** <why the edge exists>
**Counterparty:** <who is wrong and why>
**Regime dependency:** <when it breaks>
**Holding period:** <timeframe>
**Trials so far:** 0 (increment each time you test a variation — needed for DSR in Step 5)
**Status:** [HYPOTHESIS | SIGNAL_TESTED | BACKTESTED | LIVE]
Test each signal component IN ISOLATION before combining them into a strategy. This is the most important and most skipped step.
Information Coefficient (IC) IC = Spearman rank correlation between signal value and forward return over N periods. Use Spearman (not Pearson) — it is robust to outliers and works for non-linear rank relationships.
Calibrated thresholds (per Grinold & Kahn, academic literature, and practitioner consensus):
Note: In crypto (less efficient than equities), IC thresholds can be higher. In highly liquid large-cap equities, IC 0.03 is genuinely good. Calibrate to your asset class.
IC Information Ratio (ICIR) — equally important as IC itself ICIR = mean(IC) / std(IC) over rolling windows. Measures consistency of the signal over time.
A signal with IC 0.08 but ICIR 0.3 is more dangerous than a signal with IC 0.04 and ICIR 0.9. Consistency matters as much as magnitude.
IC Stability (Rolling) Plot rolling IC over time. Consistent positive IC across multiple market regimes is required. IC that only appears in one regime has conditional validity — you must then condition entry on that regime.
Decay Analysis Test IC at 1-bar, 3-bar, 5-bar, 10-bar, 20-bar forward horizons. Does the signal's predictive power decay gracefully or cliff-edge? This reveals the natural holding period.
T-stat on IC IC must be statistically significant: t-stat > 2.0 (p < 0.05) over the test period.
Regime Conditioning Test IC separately in different market regimes (bull/bear, high-vol/low-vol). A signal with IC 0.06 in all regimes is far better than IC 0.12 only in bull markets. Document regime dependency clearly.
Simple regime proxies:
# Price above/below 200-day MA
def bull_bear_regime(prices):
ma200 = prices.rolling(200).mean()
return (prices > ma200).astype(int) # 1=bull, 0=bear
# VIX-based volatility regime
def vol_regime(vix):
return pd.cut(vix, bins=[0, 15, 25, 100], labels=['low', 'medium', 'high'])
Ask the user for:
Write scripts/test_signal.py:
import pandas as pd
import numpy as np
from scipy import stats
def compute_ic(signal: pd.Series, forward_returns: pd.Series) -> dict:
"""Compute Information Coefficient using Spearman rank correlation."""
combined = pd.concat([signal, forward_returns], axis=1).dropna()
signal_col, return_col = combined.columns
ic, pvalue = stats.spearmanr(combined[signal_col], combined[return_col])
t_stat = ic * np.sqrt((len(combined) - 2) / (1 - ic**2))
return {
"IC": round(ic, 4),
"p_value": round(pvalue, 4),
"t_stat": round(t_stat, 4),
"n_obs": len(combined),
"verdict": _verdict(ic, pvalue)
}
def _verdict(ic: float, pvalue: float) -> str:
if pvalue > 0.05:
return "REJECT: Not statistically significant"
if ic < 0.01:
return "REJECT: IC too low, no edge"
if ic > 0.15:
return "FLAG: Exceptional IC — verify data quality before proceeding"
if 0.01 <= ic < 0.02:
return "WEAK: Viable only at high breadth/scale"
if 0.02 <= ic < 0.05:
return "MODERATE: Viable in ensemble"
if 0.05 <= ic < 0.15:
return "STRONG: Pursue this signal"
return "STRONG: Pursue this signal"
def compute_icir(rolling_ic: pd.Series) -> dict:
"""Compute IC Information Ratio from rolling IC series."""
icir = rolling_ic.mean() / rolling_ic.std()
return {
"ICIR": round(icir, 4),
"mean_IC": round(rolling_ic.mean(), 4),
"std_IC": round(rolling_ic.std(), 4),
"verdict": "STABLE" if icir > 0.5 else "UNSTABLE — do not proceed even with good IC"
}
def test_ic_decay(signal: pd.Series, prices: pd.Series, horizons=[1,3,5,10,20]) -> pd.DataFrame:
"""Test IC at multiple forward horizons to find natural holding period."""
results = []
for h in horizons:
fwd_ret = prices.pct_change(h).shift(-h)
result = compute_ic(signal, fwd_ret)
result["horizon"] = h
results.append(result)
return pd.DataFrame(results).set_index("horizon")
def test_ic_stability(signal: pd.Series, forward_returns: pd.Series, window=52) -> pd.Series:
"""Rolling IC to check stability over time."""
rolling_ic = []
for i in range(window, len(signal)):
s = signal.iloc[i-window:i]
r = forward_returns.iloc[i-window:i]
ic, _ = stats.spearmanr(s.dropna(), r.dropna())
rolling_ic.append(ic)
return pd.Series(rolling_ic, index=signal.index[window:])
def test_ic_by_regime(signal, forward_returns, regime_indicator, regime_labels):
"""Test IC conditional on market regime."""
results = {}
for regime_id, regime_name in regime_labels.items():
mask = regime_indicator == regime_id
s = signal[mask]
r = forward_returns[mask]
if len(s) > 20:
ic, pval = stats.spearmanr(s.dropna(), r[s.notna()])
results[regime_name] = {
"IC": round(ic, 4),
"p_value": round(pval, 4),
"n_obs": len(s.dropna())
}
return results
A signal passes to Step 3 ONLY IF ALL of the following:
Document results in algo-builder/signals/<signal-name>.md.
Multiple testing note: If this is the Nth signal tested, flag it. The more signals tested, the higher the probability that a passing signal is a false positive. Track trial count in the hypothesis file. This feeds into DSR in Step 5.
Combine validated signals into a complete strategy. Define each component explicitly before writing backtest code.
Entry logic
Exit logic (define ALL of these — do not leave any as "figure it out later")
Position sizing
Risk controls (separate from signal logic, always on)
Transaction Cost Budget (quantify before backtesting — not a checkbox)
Specify:
8bps × (position_size / ADV)^0.5 as a rough ruleRed flags:
Capacity check
Write the full strategy specification into algo-builder/strategies/<name>.md before writing any backtest code.
Split your data: 70% train, 30% test. Seal the test set. Do not touch it.
Run the backtest on the TRAIN set only.
If the strategy fails these, go back to Step 3. Max 3 parameter adjustment cycles before reconsidering the hypothesis. Each adjustment cycle increments your trial count — this matters for DSR in Step 5.
This is what separates a real strategy from a backtest artifact.
Acceptance criteria:
If walk-forward fails: the strategy is overfit. Return to Step 3.
Walk-forward gives you ONE path through history. The DSR asks: given how many variations you tested to get here, how likely is this Sharpe to be real?
After only 1,000 independent backtests, the expected maximum Sharpe is ~3.26 even if the true SR of every strategy is zero (Bailey & Lopez de Prado, 2013). This is the #1 cause of false discoveries in systematic trading.
def deflated_sharpe_ratio(sharpe_obs, n_trials, obs_per_year=252):
"""
Adjust observed Sharpe for selection bias from multiple testing.
Based on Bailey & Lopez de Prado (2013), Journal of Portfolio Management.
Args:
sharpe_obs: Observed annualized Sharpe ratio
n_trials: Total strategy/parameter variations tested (be honest)
obs_per_year: Trading days per year
Returns:
dict with DSR assessment
"""
import scipy.stats as ss
import numpy as np
euler_gamma = 0.5772
expected_max_sr = (1 - euler_gamma) * ss.norm.ppf(1 - 1/max(n_trials, 2)) + \
euler_gamma * ss.norm.ppf(1 - 1/(max(n_trials, 2) * np.e))
sr_benchmark = expected_max_sr / np.sqrt(obs_per_year)
return {
"observed_sharpe": sharpe_obs,
"expected_max_sr_if_all_noise": round(expected_max_sr, 3),
"n_trials": n_trials,
"passes_dsr": sharpe_obs > sr_benchmark,
"verdict": "PASSES" if sharpe_obs > sr_benchmark else "FAIL: Sharpe may be a false positive given number of trials"
}
Track n_trials from the start (in the hypothesis file). Include every parameter variation, every signal variant, and every strategy reformulation you tested.
Combinatorial Purged Cross-Validation generates a distribution of OOS paths rather than one, making overfitting detection more robust than walk-forward alone. Implement from Lopez de Prado's AFML Chapter 12 or use the mlfinlab library.
Write walk-forward + DSR results into algo-builder/results/<name>_walkforward.md.
Before live capital, run on live data with simulated fills for a minimum of:
Compare paper trade metrics to backtest expectations:
If paper trade significantly underperforms backtest: find out why before going live. Common culprits: execution assumptions, data latency, bid-ask spread not modeled, regime shift since backtest period.
Start at 10-25% of intended position size. Scale up only after:
Maintain this folder structure:
algo-builder/
hypotheses/ # Step 1 - one file per strategy idea (includes trial count)
signals/ # Step 2 - IC, ICIR, decay, regime results per signal
strategies/ # Step 3 - full strategy specs including TC budget
results/ # Steps 4+5 - backtest + walkforward + DSR results
scripts/ # reusable Python scripts
test_signal.py
backtest.py
walkforward.py
fetch_data.py
When the user describes a strategy idea:
When the user presents backtest results for review:
When the user wants to just "write the strategy code":
US Equities (daily): IC 0.03–0.05 is genuinely good in efficient large-cap markets. Breadth (number of independent bets) matters as much as IC magnitude. Include delisted stocks to avoid survivorship bias.
Crypto: Less efficient — IC thresholds can be higher (0.05–0.10 achievable). Regime instability is extreme (bull/bear cycles are fast and severe). IC by regime is essential. Filter wash trading from volume signals. Shorter valid history (2017–present for liquid multi-exchange data).
Futures/Macro: Trend-following factors have longer holding periods — IC decay analysis is especially important. Model roll yield and funding costs in TC budget.
HFT: This pipeline does not apply. See Scope section above.