Evaluation & monitoring

Methodology

How StockPredict AI evaluates models, avoids lookahead bias, and monitors performance over time. Current model: v10.1.0, a LightGBM + LSTM hybrid.

1. Walk-forward validation

Instead of a single static train/test split, StockPredict AI uses walk-forward validation with 4 rolling folds (starting at 50% of data, stepping 10% each fold). Historical data is split into sequential windows:

  • Train on an initial history window (~92K samples across 75 tickers).
  • Evaluate on the following out-of-sample period (a final holdout covers the last 15% of the data).
  • Roll the window forward and repeat across the timeline.
  • Apply a 7-day purge gap plus a 3-day embargo between the train, validation, and holdout windows to prevent label leakage.

This mirrors production behavior (you always train on the past and predict the future) and reduces overly optimistic backtests.
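The rolling-fold scheme above can be sketched in a few lines. This is a simplified, index-based illustration (the function name and the sample-count purge are assumptions; the actual pipeline works on calendar dates and also applies the 3-day embargo):

```python
import numpy as np

def walk_forward_folds(n_samples, n_folds=4, start_frac=0.5, step_frac=0.1,
                       purge_gap=7):
    """Yield (train_idx, test_idx) arrays for rolling walk-forward folds.

    Each fold trains on an expanding history window and evaluates on the
    following step, leaving `purge_gap` samples between the two windows
    as a simplified stand-in for the purge gap.
    """
    idx = np.arange(n_samples)
    for k in range(n_folds):
        train_end = int(n_samples * (start_frac + k * step_frac))
        test_end = min(int(n_samples * (start_frac + (k + 1) * step_frac)),
                       n_samples)
        # Training data ends strictly before the purged evaluation window.
        yield idx[:train_end], idx[train_end + purge_gap:test_end]

for train_idx, test_idx in walk_forward_folds(1000):
    print(f"train [0, {train_idx[-1]}], test [{test_idx[0]}, {test_idx[-1]}]")
```

With 1,000 samples and the defaults above, the first fold trains on indices 0-499 and evaluates on 507-599; each later fold extends training by 10% of the data and slides the evaluation window forward.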

2. Avoiding lookahead bias

Lookahead bias happens when a model accidentally uses future information during training. StockPredict AI avoids this by:

  • Building features only from data available up to each prediction timestamp (all features use shift(1)).
  • Separating training and evaluation periods in time (no shuffling across the timeline).
  • Using holdout windows that simulate “live” deployment conditions.
  • Predicting market-neutral alpha (stock return minus SPY return) instead of absolute price, eliminating market-direction bias.

The goal is not to “fit the past” perfectly, but to estimate how the model might behave on unseen data.
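A minimal feature builder illustrates the two key rules above: every feature is lagged with shift(1), and the target is alpha over SPY rather than absolute return. The column names, window lengths, and horizon here are illustrative assumptions, not the platform's actual feature set:

```python
import pandas as pd

def make_features(prices: pd.DataFrame, spy: pd.Series, horizon: int = 7):
    """Build leakage-safe features and a market-neutral target for one ticker.

    `prices` has a 'close' column indexed by date; `spy` holds SPY closes on
    the same index. Illustrative only: real features/windows will differ.
    """
    df = pd.DataFrame(index=prices.index)
    ret = prices["close"].pct_change()

    # Every feature ends in shift(1): only data available strictly before
    # the prediction timestamp can enter the feature row for that date.
    df["ret_1d"] = ret.shift(1)
    df["ret_5d_mean"] = ret.rolling(5).mean().shift(1)
    df["vol_20d"] = ret.rolling(20).std().shift(1)

    # Target: forward stock return minus forward SPY return (alpha), so the
    # model predicts market-neutral out/under-performance, not direction.
    fwd_stock = prices["close"].pct_change(horizon).shift(-horizon)
    fwd_spy = spy.pct_change(horizon).shift(-horizon)
    df["target_alpha"] = fwd_stock - fwd_spy

    return df.dropna()
```

Because the forward-looking quantities appear only in the target (via shift(-horizon)) and never in a feature column, shuffling or misaligning rows cannot smuggle future information into training.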

3. Evaluation metrics

The platform tracks multiple metrics to understand performance from different angles:

  • Sharpe Ratio: return per unit of risk (annualized). Above 1.0 is good, above 2.0 is excellent.
  • Win Rate: percentage of profitable trades. Above 50% is decent, above 55% is good.
  • Max Drawdown: largest peak-to-trough decline. Below 10% is typical of a cautious strategy.
  • Directional accuracy: how often the model predicts the sign correctly (up vs. down).
  • Rank correlation: Spearman correlation between predicted and actual returns — how well the model orders stocks.
  • Calibration / Brier score: how well predicted probabilities match observed frequencies.

No single metric tells the full story; a dashboard of metrics helps catch drift and overfitting.
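Several of these metrics reduce to a few lines of NumPy. A sketch under stated assumptions (252 trading days per year for annualization, a risk-free rate of roughly zero, and no tied return values for the rank correlation):

```python
import numpy as np

def sharpe_ratio(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio of a daily return series (risk-free rate ~ 0)."""
    r = np.asarray(daily_returns, dtype=float)
    return np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)

def max_drawdown(daily_returns):
    """Largest peak-to-trough decline of the compounded equity curve."""
    equity = np.cumprod(1.0 + np.asarray(daily_returns, dtype=float))
    running_peak = np.maximum.accumulate(equity)
    return (equity / running_peak - 1.0).min()

def win_rate(trade_pnl):
    """Fraction of trades with strictly positive P&L."""
    return (np.asarray(trade_pnl, dtype=float) > 0).mean()

def rank_correlation(predicted, actual):
    """Spearman rank correlation: Pearson correlation of the two rank vectors.

    Assumes no ties, which is reasonable for continuous return values.
    """
    def ranks(x):
        r = np.empty(len(x))
        r[np.argsort(x)] = np.arange(len(x))
        return r
    return np.corrcoef(ranks(np.asarray(predicted)),
                       ranks(np.asarray(actual)))[0, 1]
```

For example, trade P&Ls of [+1, -1, +2, +3] give a 75% win rate, and a return path of [+10%, -50%, +20%] gives a max drawdown of -50% (the trough after the first peak).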

4. Drift monitoring

Markets change, so the pipeline includes drift checks to detect when the relationship between features and targets may be breaking down:

  • Population Stability Index (PSI) between historical and recent feature distributions.
  • Rolling directional accuracy and error metrics over time.
  • Monitoring hit-rates across segments and regimes.

When drift is detected, models may need retraining, retuning, or in some cases, feature redesign.
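A minimal PSI check might look like the following. The bin count and the 0.1 / 0.25 alert thresholds in the docstring are common rules of thumb, not values taken from the platform:

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a historical ('expected') and a
    recent ('actual') sample of one feature.

    Rule-of-thumb interpretation (assumed, not platform-specific):
    PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
    """
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Bin edges are quantiles of the historical sample, so each historical
    # bin starts with roughly equal mass; open-ended outer bins catch
    # recent values outside the historical range.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    e_frac = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

Running this per feature on a recent window versus the training window flags exactly the features whose input distribution has moved, which narrows down whether retraining or feature redesign is the right response.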

5. Interpreting results

Even with careful validation, ML forecasts are noisy. StockPredict AI emphasizes:

  • Using predictions as probabilistic signals, not guarantees.
  • Combining model outputs with human judgment and risk management.
  • Being transparent about limitations and assumptions.

The platform is designed as an educational and research tool, not a plug-and-play trading system.

Example evaluation snapshot

The monitoring layer tracks multiple signals together so regressions are visible quickly.

Latest production backtest (v10.1.0, Sep 2025 – Mar 2026, market-neutral):

  • 30-day horizon: +13.71% (Sharpe 2.68, 64.3% win rate, 28 trades)
  • 7-day horizon: +8.05% (Sharpe 1.61, 50.4% win rate, 129 trades)
  • Next-day horizon: 0 trades (the model correctly abstained: no edge)

SPY returned +1.90% over the same period. All returns are alpha (excess over SPY).
For end-to-end pipeline details, see How it works.