# Machine Learning Reference

## Overview

The FinLab machine learning module provides tools for creating ML-based trading strategies. It includes feature engineering, label generation, and integration with popular ML libraries such as scikit-learn, XGBoost, and LightGBM.

**Import:**

```python
from finlab.ml import feature, label
```

---

## Table of Contents

1. [Feature Engineering](#feature-engineering)
2. [Label Generation](#label-generation)
3. [Complete ML Workflow](#complete-ml-workflow)
4. [Best Practices](#best-practices)

---

## Feature Engineering

### feature.ta_names

Generate a list of technical indicator feature names with randomized parameters.

**Signature:**

```python
feature.ta_names(
    lb: int = 1,
    ub: int = 10,
    n: int = 1,
    factory: Optional[Factory] = None
) -> List[str]
```

**Parameters:**

- `lb` (int, optional, default=1): Lower bound of the multiplier applied to each indicator's default parameters
- `ub` (int, optional, default=10): Upper bound of the multiplier applied to each indicator's default parameters
- `n` (int, optional, default=1): Number of random samples for each technical indicator
- `factory` (optional, default=None): Factory object used to generate technical indicators. Defaults to `TalibIndicatorFactory`

**Returns:**

- `List[str]`: A list of technical indicator feature names

**Example:**

```python
import finlab.ml.feature as f

# Generate 5 random variations for each TA-Lib indicator
feature_names = f.ta_names(n=5)
print(feature_names[:10])
```

---

### feature.ta

Calculate technical indicator values for a list of feature names.

**Signature:**

```python
feature.ta(
    feature_names: Optional[List[str]],
    factories=None,
    resample=None,
    start_time=None,
    end_time=None,
    adj=False,
    cpu=-1,
    **kwargs
) -> pd.DataFrame
```

**Parameters:**

- `feature_names` (list, optional, default=None): List of technical indicator feature names.
  Defaults to None (generates default names)
- `factories` (dict, optional, default=None): Dictionary of factories used to generate technical indicators. Defaults to `{'talib': TalibIndicatorFactory()}`
- `resample` (str, optional, default=None): Frequency to resample data (e.g., 'W', 'M')
- `start_time` (str, optional, default=None): Start time of the data
- `end_time` (str, optional, default=None): End time of the data
- `adj` (bool, optional, default=False): Whether to use adjusted prices
- `cpu` (int, optional, default=-1): Number of CPU cores for parallel processing; -1 uses all available cores
- `**kwargs`: Additional keyword arguments passed to the resampler function

**Returns:**

- `pd.DataFrame`: Technical indicator feature names and values, indexed by ('datetime', 'instrument')

**Example:**

```python
import finlab.ml.feature as f

# Method 1: generate default indicators with random parameters
features1 = f.ta()
print(features1.head())

# Method 2: calculate a specific indicator with defined parameters, resampled weekly
feature_names = ['talib.MACD__macdhist__fastperiod__52__slowperiod__212__signalperiod__75__']
features2 = f.ta(feature_names, resample='W')
print(features2.head())
```

**Important Notes:**

- `feature.ta` can only calculate values for feature names in the format produced by `feature.ta_names`
- Do NOT reach for `feature.ta` and `feature.ta_names` by default; for static, reproducible feature names, use `data.indicator` instead
- For example, `data.indicator('SMA', timeperiod=20)` supports all TA-Lib indicators with fixed parameters

---

### feature.combine

Combine multiple feature DataFrames into a single DataFrame.
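For intuition about the shape the combined result takes, here is a rough pandas-only sketch with toy data. The ticker symbols and values are made up, and this is not the FinLab implementation, only an illustration of wide frames being stacked into one long-format table:

```python
import pandas as pd

# Two toy feature frames in FinLab's wide layout (index=datetime, columns=instrument)
dates = pd.to_datetime(["2023-01-31", "2023-02-28"])
pb = pd.DataFrame({"2330": [5.1, 5.3], "2317": [1.2, 1.1]}, index=dates)
rsi = pd.DataFrame({"2330": [60.0, 55.0], "2317": [40.0, 48.0]}, index=dates)

# Stacking each frame and aligning on ('datetime', 'instrument') yields
# one column per feature -- the same shape feature.combine produces
X = pd.concat({"pb": pb.stack(), "rsi": rsi.stack()}, axis=1)
print(X)
```

The real function additionally handles resampling, misaligned indices, and sample filtering, as described below.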
**Signature:**

```python
feature.combine(
    features: Dict[str, pd.DataFrame],
    resample=None,
    sample_filter=None,
    **kwargs
)
```

**Parameters:**

- `features` (dict, required): Dictionary where keys are feature names and values are DataFrames (index=datetime, columns=instrument)
- `resample` (str, optional, default=None): Frequency to resample the data in `features` (e.g., 'W', 'M')
- `sample_filter` (pd.DataFrame, optional, default=None): Boolean DataFrame (index=datetime, columns=instrument) used to filter the feature rows
- `**kwargs`: Additional keyword arguments passed to the resampler function

**Returns:**

- `pd.DataFrame`: All input features combined, indexed by ('datetime', 'instrument')

**Example:**

```python
from finlab import data
import finlab.ml.feature as f

features_dict = {
    'pb': data.get('price_earning_ratio:股價淨值比'),
    'rsi': data.indicator('RSI')
}

combined_features = f.combine(features_dict, resample='M')
print(combined_features.head())
```

**Important Notes:**

- `feature.combine` handles misaligned indices and missing data automatically
- Set `resample` to 'W', '2W', 'ME', 'QE', `data.get("monthly_revenue:當月營收").index`, `data.get("fundamental_features:ROE稅後").deadline().index`, etc. to avoid excessive data points and high RAM consumption
- Use the `sample_filter` parameter to drop unwanted data points (e.g., `sample_filter = data.get('price:成交股數') > 200_000`)

---

## Label Generation

### label.return_percentage

Calculate the percentage change of market prices over a given period.
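Conceptually, this label is each instrument's forward percentage change. A minimal pandas sketch with made-up prices (not the library's internals, which also handle resampling and `trade_at_price`):

```python
import pandas as pd

# Toy close prices (index=datetime, columns=instrument)
close = pd.DataFrame(
    {"2330": [100.0, 110.0, 121.0], "2317": [50.0, 45.0, 54.0]},
    index=pd.to_datetime(["2023-01-31", "2023-02-28", "2023-03-31"]),
)

# Forward one-period return: price(t+1) / price(t) - 1
fwd_ret = close.shift(-1) / close - 1

# Stack into the ('datetime', 'instrument') shape used by finlab.ml labels
y = fwd_ret.stack().dropna()
print(y)
```

Note that the last date has no forward price, so its label is missing, which is why NaN handling before training matters.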
**Signature:**

```python
label.return_percentage(
    index: pd.Index,
    resample=None,
    period=1,
    trade_at_price='close',
    bfill=False,
    **kwargs
)
```

**Parameters:**

- `index` (pd.Index, required): Multi-level index of datetime and instrument
- `resample` (str, optional, default=None): Resample frequency for the output data (e.g., 'W', 'M')
- `period` (int, optional, default=1): Number of periods over which to calculate the percentage change
- `trade_at_price` (str, optional, default='close'): Price used for execution ('open', 'high', 'low', 'close')
- `bfill` (bool, optional, default=False): Whether to backfill missing price data before the calculation
- `**kwargs`: Additional arguments passed to the resampler function

**Returns:**

- `pd.Series`: Percentage change of stock prices, aligned to the input index

**Example:**

```python
from finlab.ml import label

# Assume `features` is your feature DataFrame
y = label.return_percentage(features.index, resample='M', period=1)
```

---

### label.excess_over_mean

Calculate the excess return over the cross-sectional mean return for a given period.

**Signature:**

```python
label.excess_over_mean(
    index: pd.Index,
    resample=None,
    period=1,
    trade_at_price='close',
    **kwargs
)
```

**Parameters:**

- `index` (pd.Index, required): Multi-level index of datetime and instrument
- `resample` (str, optional, default=None): Resample frequency (e.g., 'W', 'M')
- `period` (int, optional, default=1): Number of periods for the return calculation
- `trade_at_price` (str, optional, default='close'): Price used for the return calculation ('open', 'high', 'low', 'close')
- `**kwargs`: Additional arguments passed to the resampler function

**Returns:**

- `pd.Series`: Excess return over the mean, aligned to the input index

**Example:**

```python
from finlab.ml import label

# Excess return over the market mean
y = label.excess_over_mean(features.index, resample='M', period=1)
```

---

### label.excess_over_median

Calculate the excess return over the cross-sectional median return for a given period.
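For intuition, the median variant subtracts each date's cross-sectional median return, which is more robust to outliers than the mean. A hedged pandas sketch with toy numbers (illustrative only, not the library code):

```python
import pandas as pd

# Toy one-period forward returns (index=datetime, columns=instrument)
returns = pd.DataFrame(
    {"A": [0.10, -0.02], "B": [0.02, 0.04], "C": [-0.06, 0.01]},
    index=pd.to_datetime(["2023-01-31", "2023-02-28"]),
)

# Subtract each date's median return across instruments
excess = returns.sub(returns.median(axis=1), axis=0)
print(excess)
```

A positive label therefore means the stock beat the median stock on that date, regardless of overall market direction.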
**Signature:**

```python
label.excess_over_median(
    index: pd.Index,
    resample=None,
    period=1,
    trade_at_price='close',
    **kwargs
)
```

**Parameters:**

- `index` (pd.Index, required): Multi-level index of datetime and instrument
- `resample` (str, optional, default=None): Resample frequency (e.g., 'W', 'M')
- `period` (int, optional, default=1): Number of periods for the return calculation
- `trade_at_price` (str, optional, default='close'): Price used for the return calculation
- `**kwargs`: Additional arguments passed to the resampler function

**Returns:**

- `pd.Series`: Excess return over the median, aligned to the input index

---

### label.daytrading_percentage

Calculate the intraday percentage change (close / open - 1).

**Signature:**

```python
label.daytrading_percentage(
    index: pd.Index,
    **kwargs
)
```

**Parameters:**

- `index` (pd.Index, required): Multi-level index of datetime and instrument, typically taken from a feature DataFrame
- `**kwargs`: Additional arguments passed to the internal resampler function

**Returns:**

- `pd.Series`: Intraday percentage change, aligned to the input index

---

### label.maximum_adverse_excursion

Calculate the maximum adverse excursion (lowest price relative to entry) over a given period.

**Signature:**

```python
label.maximum_adverse_excursion(
    index: pd.Index,
    period=1,
    trade_at_price='close'
)
```

**Parameters:**

- `index` (pd.Index, required): Multi-level index of datetime and instrument
- `period` (int, optional, default=1): Number of periods to look forward for the minimum price
- `trade_at_price` (str, optional, default='close'): Entry price to compare against ('open', 'high', 'low', 'close')

**Returns:**

- `pd.Series`: Maximum adverse excursion, aligned to the input index

---

### label.maximum_favorable_excursion

Calculate the maximum favorable excursion (highest price relative to entry) over a given period.
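For intuition about what the two excursion labels measure, here is a toy pandas sketch of the adverse case: the worst price reached relative to the entry price over the next `period` bars. This is illustrative only, not the library implementation; the favorable case swaps the minimum of the lows for the maximum of the highs:

```python
import pandas as pd

# Toy entry prices and intraperiod lows for one instrument
close = pd.Series([100.0, 97.0, 95.0, 102.0, 99.0])
low = pd.Series([99.0, 96.0, 94.0, 101.0, 98.0])

period = 3
# For each bar t, the lowest low over the next `period` bars, relative to entry at close[t]
fwd_min = pd.Series(
    [low.iloc[t + 1 : t + 1 + period].min() if t + 1 < len(low) else float("nan")
     for t in range(len(low))]
)
mae = fwd_min / close - 1
print(mae)
```

Such labels are useful for training models that predict downside risk (for stop-loss placement) rather than expected return.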
**Signature:**

```python
label.maximum_favorable_excursion(
    index: pd.Index,
    period=1,
    trade_at_price='close'
)
```

**Parameters:**

- `index` (pd.Index, required): Multi-level index of datetime and instrument
- `period` (int, optional, default=1): Number of periods to look forward for the maximum price
- `trade_at_price` (str, optional, default='close'): Entry price to compare against

**Returns:**

- `pd.Series`: Maximum favorable excursion, aligned to the input index

---

## Complete ML Workflow

### Basic LightGBM Regression Example

```python
from finlab import data
from finlab.ml import feature, label
from finlab.backtest import sim
from finlab.dataframe import FinlabDataFrame
from lightgbm import LGBMRegressor
import pandas as pd

# Step 1: Feature engineering
features_dict = {
    'pb': data.get('price_earning_ratio:股價淨值比'),
    'pe': data.get('price_earning_ratio:本益比'),
    'rsi': data.indicator('RSI', timeperiod=14),
    'revenue_growth': data.get('monthly_revenue:去年同月增減(%)'),
    'roe': data.get('fundamental_features:ROE稅後'),
}

# Combine features with filtering
sample_filter = data.get('price:成交股數') > 200_000
X = feature.combine(features_dict, resample='M', sample_filter=sample_filter)

# Step 2: Label generation
y = label.excess_over_mean(X.index, resample='M', period=1)

# Step 3: Train/test split
train_mask = X.index.get_level_values('datetime') < '2020-01-01'
test_mask = X.index.get_level_values('datetime') >= '2020-01-01'

X_train = X[train_mask]
y_train = y[train_mask]
X_test = X[test_mask]
y_test = y[test_mask]

# Step 4: Model training
model = LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Step 5: Prediction and position construction
y_pred = model.predict(X_test)
df_y = FinlabDataFrame(y_pred, index=X_test.index).unstack().T

# Step 6: Create the trading position
position = df_y.is_largest(10)  # Select the top 10 stocks

# Step 7: Backtest
report = sim(position)
print(report.get_metrics())
```

---

### Advanced Feature Selection Example

```python
from finlab import data
from finlab.ml import feature, label
from lightgbm import LGBMRegressor
from sklearn.feature_selection import SelectKBest, f_regression
import pandas as pd

# Generate many candidate features
features_dict = {
    'pb': data.get('price_earning_ratio:股價淨值比'),
    'pe': data.get('price_earning_ratio:本益比'),
    'ps': data.get('price_earning_ratio:股價淨值比') * data.get('fundamental_features:每股營業額'),
    'rsi_14': data.indicator('RSI', timeperiod=14),
    'rsi_28': data.indicator('RSI', timeperiod=28),
    'macd': data.indicator('MACD')[0],  # MACD line
    'revenue_growth': data.get('monthly_revenue:去年同月增減(%)'),
    'revenue_ma3': data.get('monthly_revenue:當月營收').average(3),
    'roe': data.get('fundamental_features:ROE稅後'),
    'roa': data.get('fundamental_features:ROA綜合損益'),
    'gross_margin': data.get('fundamental_features:營業毛利率'),
    'debt_ratio': data.get('fundamental_features:負債比率'),
}

# Combine and clean
sample_filter = data.get('price:成交股數') > 200_000
X = feature.combine(features_dict, resample='M', sample_filter=sample_filter)
y = label.excess_over_mean(X.index, resample='M', period=1)

# Remove rows with missing values
mask = ~(X.isna().any(axis=1) | y.isna())
X = X[mask]
y = y[mask]

# Feature selection
selector = SelectKBest(score_func=f_regression, k=6)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()].tolist()
print("Selected Features:", selected_features)

# Rebuild a DataFrame with the selected features
X_selected_df = pd.DataFrame(X_selected, index=X.index, columns=selected_features)

# Split
train_mask = X_selected_df.index.get_level_values('datetime') < '2020-01-01'
test_mask = X_selected_df.index.get_level_values('datetime') >= '2020-01-01'
X_train = X_selected_df[train_mask]
y_train = y[train_mask]
X_test = X_selected_df[test_mask]

# Train and predict
model = LGBMRegressor(n_estimators=200, learning_rate=0.05, max_depth=6)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Feature importance
importance_df = pd.DataFrame({
    'feature':
        selected_features,
    'importance': model.feature_importances_,
}).sort_values('importance', ascending=False)

print("\nFeature Importance:")
print(importance_df)
```

---

### Classification Example

```python
from finlab import data
from finlab.ml import feature, label
from finlab.dataframe import FinlabDataFrame
from finlab.backtest import sim
from lightgbm import LGBMClassifier
import pandas as pd

# Feature engineering
features_dict = {
    'pb': data.get('price_earning_ratio:股價淨值比'),
    'rsi': data.indicator('RSI'),
    'revenue_growth': data.get('monthly_revenue:去年同月增減(%)'),
}

sample_filter = data.get('price:成交股數') > 200_000
X = feature.combine(features_dict, resample='M', sample_filter=sample_filter)

# Binary classification label (outperform the market or not)
y_continuous = label.excess_over_mean(X.index, resample='M', period=1)

# Remove rows with missing features or labels, then binarize
mask = ~(X.isna().any(axis=1) | y_continuous.isna())
X = X[mask]
y = (y_continuous[mask] > 0).astype(int)  # 1 if the stock outperforms, 0 otherwise

# Split
train_mask = X.index.get_level_values('datetime') < '2020-01-01'
test_mask = X.index.get_level_values('datetime') >= '2020-01-01'
X_train = X[train_mask]
y_train = y[train_mask]
X_test = X[test_mask]

# Train the classifier
model = LGBMClassifier(n_estimators=100, learning_rate=0.1, max_depth=5)
model.fit(X_train, y_train)

# Predict probabilities
y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probability of class 1
df_y = FinlabDataFrame(y_pred_proba, index=X_test.index).unstack().T

# Create a position from the predicted probabilities
position = df_y.is_largest(10)  # Top 10 by probability

# Backtest
report = sim(position)
print(report.get_metrics())
```

---

### Time Series Cross-Validation

```python
from finlab import data
from finlab.ml import feature, label
from lightgbm import LGBMRegressor
from sklearn.model_selection import TimeSeriesSplit
import pandas as pd
import numpy as np

# Prepare features and labels
features_dict = {
    'pb': data.get('price_earning_ratio:股價淨值比'),
    'rsi': data.indicator('RSI'),
}
X = \
feature.combine(features_dict, resample='M')
y = label.excess_over_mean(X.index, resample='M', period=1)

# Remove rows with missing values
mask = ~(X.isna().any(axis=1) | y.isna())
X = X[mask]
y = y[mask]

# Split on unique dates so each fold is a contiguous block of time
dates = X.index.get_level_values('datetime').unique().sort_values()

tscv = TimeSeriesSplit(n_splits=5)
scores = []

for train_idx, val_idx in tscv.split(dates):
    train_dates = dates[train_idx]
    val_dates = dates[val_idx]

    X_train = X[X.index.get_level_values('datetime').isin(train_dates)]
    y_train = y.loc[X_train.index]
    X_val = X[X.index.get_level_values('datetime').isin(val_dates)]
    y_val = y.loc[X_val.index]

    model = LGBMRegressor(n_estimators=100)
    model.fit(X_train, y_train)

    score = model.score(X_val, y_val)
    scores.append(score)
    print(f"Fold R²: {score:.4f}")

print(f"\nAverage R²: {np.mean(scores):.4f} (+/- {np.std(scores):.4f})")
```

---

## Best Practices

### Feature Engineering

1. **Start with static indicators** - Use `data.indicator()` instead of `feature.ta()` for reproducibility
2. **Set an appropriate resample frequency** - Use 'W', 'ME', 'QE', or `revenue.index` to control data density
3. **Apply sample filters** - Filter out low-volume or special-status stocks with `sample_filter`
4. **Handle missing data** - Always check for and handle NaN values before training
5. **Normalize features** - Consider scaling features for better model performance
6. **Ensure proper alignment** - `feature.combine` handles this automatically

### Label Generation

1. **Match resample frequencies** - Ensure the label `resample` matches the feature `resample`
2. **Use excess returns** - Prefer `excess_over_mean` or `excess_over_median` for a cleaner signal
3. **Consider the prediction horizon** - Match `period` to your trading frequency
4.
   **Align indices** - Use `features.index` when generating labels

### Model Training

1. **Time-based splits** - Use date masks such as `X.index.get_level_values('datetime') > '2020-01-01'` for a proper train/test split
2. **Avoid look-ahead bias** - Never use future information in features
3. **Cross-validate** - Use time series cross-validation, not random splits
4. **Tune hyperparameters** - Use a validation set for hyperparameter optimization
5. **Monitor overfitting** - Compare train and test performance regularly

### Position Construction

1. **Use FinlabDataFrame** - Convert predictions with `FinlabDataFrame(y_pred, index=X_test.index).unstack().T`
2. **Limit positions** - Use `is_largest(n)` or `is_smallest(n)` for position sizing
3. **Apply filters** - Combine ML predictions with fundamental or technical filters
4. **Set stop-loss/take-profit** - Protect against large losses in backtesting

### Backtesting

1. **Use realistic assumptions** - Include transaction costs and slippage
2. **Test out-of-sample** - Always backtest on unseen data
3. **Monitor metrics** - Check the Sharpe ratio, max drawdown, and win rate
4. **Avoid overfitting** - Be wary of perfect backtest results

---

## Related References

- [FinlabDataFrame Reference](dataframe-reference.md) - Enhanced DataFrame methods
- [Data Reference](data-reference.md) - Available data sources
- [Factor Examples](factor-examples.md) - Factor-based strategies
- [Factor Analysis Reference](factor-analysis-reference.md) - Analyze factor performance
- [Backtesting Reference](backtesting-reference.md) - Backtest ML strategies