Pipeline overview

How StockPredict AI Works

End-to-end pipeline from raw market data to ML forecasts and plain-English explanations for the S&P 100.

Contents

Architecture diagram
1. Data sources
2. Feature engineering
3. Model training (LightGBM)
4. Daily automated pipeline
5. Explainability (SHAP + Gemini)
Example prediction explanation
Limitations

Architecture diagram

High-level flow from ingestion to UI. This mirrors the daily automated pipeline that runs in CI/CD.

1. Data sources

StockPredict AI aggregates data from multiple providers to build a rich feature set:

Prices & fundamentals: OHLCV, fundamentals, and analyst data from providers like Finnhub and FMP.
Macro & economics: FRED macro indicators (rates, yields, unemployment, CPI, GDP) to capture the macro regime.
News & filings: SEC filings, RSS feeds, and headlines from sources such as Reddit and FinViz.
Real-time feed: Live prices via Finnhub WebSocket and TradingView widgets in the UI.

2. Feature engineering

For each ticker, the ML backend builds a feature vector with 40+ signals, including:

Price-based features: rolling returns, volatility, gaps, volume spikes, and moving averages.
Technical indicators: RSI, MACD, Bollinger Bands, trend strength, and overbought/oversold flags.
Sentiment features: aggregated scores from FinBERT, RoBERTa, and VADER across news, Reddit, and filings.
Macro & cross-asset signals: yields, spreads, index levels, and sector ETFs.

Features are aligned on a daily timeline, normalized, and stored in MongoDB for training and analysis.
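
As a rough illustration of the price-based features listed above, the sketch below computes daily log-returns, a trailing multi-day return, trailing volatility, and a simple moving average from a list of closing prices. It is a minimal standard-library sketch, not the project's actual feature code: the function names, the feature keys (ret_1d, vol_5d, and so on), and the 5-day window are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def log_returns(closes):
    """Daily log-returns; the first day has no prior close, so it is None."""
    out = [None]
    for prev, cur in zip(closes, closes[1:]):
        out.append(math.log(cur / prev))
    return out


def rolling(values, window, fn):
    """Apply fn to each trailing window; None until a full window of real values exists."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        if len(chunk) < window or any(v is None for v in chunk):
            out.append(None)
        else:
            out.append(fn(chunk))
    return out


def build_price_features(closes, window=5):
    """Hypothetical feature builder; keys and window length are illustrative only."""
    rets = log_returns(closes)
    return {
        "ret_1d": rets,
        f"ret_{window}d": rolling(rets, window, sum),     # log-returns add up over time
        f"vol_{window}d": rolling(rets, window, pstdev),  # trailing volatility of returns
        f"sma_{window}d": rolling(closes, window, mean),  # simple moving average of price
    }


feats = build_price_features([100, 101, 102, 103, 104, 105], window=5)
```

Every window here looks backwards only, so the value for a given day never depends on later prices. That property matters for the walk-forward evaluation described in the next section.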

3. Model training (LightGBM)

The core predictor is a LightGBM gradient-boosted tree model. It learns to forecast log-returns for three horizons:

Next day (1-day horizon)
1 week (7 trading days)
1 month (30 calendar days)
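
A forward log-return target for a given horizon can be sketched as follows. This assumes the horizon is expressed as a number of rows in a daily close series; the function name is hypothetical, and a calendar-day horizon such as the 1-month one would need a date-based lookup rather than a fixed row offset.

```python
import math


def forward_log_returns(closes, horizon):
    """Target for day t is log(close[t + horizon] / close[t]).

    The last `horizon` days have no known future price, so their target is None.
    """
    n = len(closes)
    return [
        math.log(closes[t + horizon] / closes[t]) if t + horizon < n else None
        for t in range(n)
    ]


closes = [100.0, 110.0, 121.0]
next_day = forward_log_returns(closes, horizon=1)
two_day = forward_log_returns(closes, horizon=2)
```

Rows whose target is None would normally be dropped from training, since their outcome has not happened yet.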

Models are trained in a walk-forward fashion: each training window uses historical data only up to that point and is evaluated on future periods, which reduces lookahead bias and gives a more realistic picture of performance.
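
The walk-forward idea can be sketched with a small index generator. The version below assumes an expanding training window (the text does not say whether the window expands or slides) and adds an optional gap between training and test rows, which is one common way to keep multi-day targets at the end of the training set from overlapping the test period. The function and parameter names are illustrative.

```python
def walk_forward_splits(n_samples, initial_train, test_size, gap=0):
    """Yield (train_indices, test_indices) pairs in time order.

    Training always ends before the test block starts, so no fold is
    trained on data from its own evaluation period.
    """
    test_start = initial_train + gap
    while test_start + test_size <= n_samples:
        train = range(0, test_start - gap)
        test = range(test_start, test_start + test_size)
        yield train, test
        test_start += test_size


folds = list(walk_forward_splits(n_samples=10, initial_train=4, test_size=2))
gapped = list(walk_forward_splits(n_samples=10, initial_train=4, test_size=2, gap=1))
```

With 10 samples, the first call produces three folds whose training sets grow from 4 to 8 rows; the gapped call leaves one unused row between each training set and its test block.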

4. Daily automated pipeline

A GitHub Actions workflow runs the full ML pipeline every trading day:

Fetch latest prices, news, sentiment, macro, and insider data.
Update features and retrain pooled LightGBM models where needed.
Generate predictions for all S&P 100 stocks across three horizons.
Run SHAP analysis to understand which features drove each prediction.
Store predictions, explanations, and monitoring metrics in MongoDB.

The Next.js frontend reads from the Node.js API, which serves predictions and explanations to the UI.

5. SHAP explainability & Gemini AI

For each prediction, SHAP decomposes the model output into feature contributions (what pushed the forecast up or down). These numbers, along with sentiment and technical context, are passed to Google Gemini to generate a plain-English narrative.
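
To make "decomposes the model output into feature contributions" concrete, the sketch below computes exact Shapley values by brute force for a tiny hand-written model. It is not the SHAP library and not the tree-specific algorithm a LightGBM pipeline would use in practice; the toy model, the feature names, and the all-zero baseline are assumptions chosen so the arithmetic is easy to follow. The point it illustrates is additivity: the contributions sum to the difference between the prediction and the baseline prediction.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one input; 'absent' features take their baseline value."""
    names = list(x)
    n = len(names)

    def value(coalition):
        mixed = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


def toy_model(f):
    """Hypothetical stand-in for a trained model, with one interaction term."""
    return 0.5 * f["sentiment"] - 0.2 * f["volatility"] + 0.1 * f["sentiment"] * f["momentum"]


x = {"sentiment": 1.0, "volatility": 2.0, "momentum": 3.0}
baseline = {"sentiment": 0.0, "volatility": 0.0, "momentum": 0.0}
phi = shapley_values(toy_model, x, baseline)
```

In this toy case the interaction term is split evenly between sentiment and momentum, while volatility receives its own negative contribution. Brute-force enumeration grows exponentially with the number of features, which is why tree-specific methods are used on real models with 40+ features.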

This is what powers the AI explanation sections in the UI: instead of raw forecasts and SHAP values, users see a concise summary of the model's reasoning.

Example prediction explanation

The platform surfaces a concise “what changed and why” summary based on SHAP feature attributions.

Example (illustrative)

AAPL 7-day forecast: +1.2%

Top positive drivers

Improving aggregated sentiment across recent headlines
Uptrend signal from momentum + moving averages
Favorable macro trend signals

Top negative drivers

Elevated short-term volatility
Recent drawdown pressure vs. prior highs
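
A summary like the one above could be assembled from signed feature contributions roughly as follows. This is a sketch under assumptions: the function names, feature names, and contribution values are invented for illustration and do not come from the platform, and the real system passes this kind of context to Gemini rather than printing it directly.

```python
def top_drivers(contributions, k=3):
    """Split signed contributions into the k largest positive and k most negative drivers."""
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    positive = [(name, v) for name, v in ranked if v > 0][:k]
    negative = [(name, v) for name, v in reversed(ranked) if v < 0][:k]
    return positive, negative


def format_summary(ticker, horizon, forecast_pct, contributions, k=3):
    """Render a plain-text block that could serve as context for a narrative generator."""
    positive, negative = top_drivers(contributions, k)
    lines = [f"{ticker} {horizon} forecast: {forecast_pct:+.1f}%", "Top positive drivers"]
    lines += [f"  {name} ({v:+.4f})" for name, v in positive]
    lines.append("Top negative drivers")
    lines += [f"  {name} ({v:+.4f})" for name, v in negative]
    return "\n".join(lines)


# Made-up contribution values, for illustration only.
contributions = {
    "sentiment_7d": 0.006,
    "momentum": 0.004,
    "macro_trend": 0.003,
    "volatility_5d": -0.005,
    "drawdown": -0.002,
}
positive, negative = top_drivers(contributions, k=2)
summary = format_summary("AAPL", "7-day", 1.2, contributions)
```

Ranking by signed value rather than absolute value keeps the "pushed up" and "pushed down" groups separate, matching the two lists shown in the example.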

6. Limitations & risk disclaimer

StockPredict AI is an educational research project. It is not a trading signal service and does not provide investment advice.

Models are trained on historical data and can fail under new market regimes.
Predictions are probabilistic estimates, not guarantees of future prices.
Real-world execution costs, slippage, and liquidity are not modeled in detail.

Always do your own research and consult a licensed financial advisor before making investment decisions.

For evaluation details, see Methodology.