Beyond Sharpe: A Research Framework for Evaluating ML Trading Strategies

1. The Limits of Sharpe

The Sharpe ratio implicitly assumes that returns are i.i.d. and Gaussian (or close enough), and that variance is a good proxy for risk.

ML strategies often have:

highly skewed return distributions
clustered losses
exposure to hidden factors
complex dependence on market regimes

Two strategies with identical Sharpe can have radically different risk and robustness.
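As a small illustration of this point, the sketch below builds two hypothetical return series with the same mean and standard deviation, and therefore the same Sharpe, but opposite skew. The numbers are invented for illustration, the risk-free rate is assumed to be zero, and the helper names `sharpe` and `skewness` are not taken from any particular library.

```python
import statistics

def sharpe(returns):
    """Per-period Sharpe ratio (zero risk-free rate assumed, no annualisation)."""
    return statistics.mean(returns) / statistics.stdev(returns)

def skewness(returns):
    """Population skewness: third central moment divided by the cubed std."""
    mu = statistics.mean(returns)
    sd = statistics.pstdev(returns)
    return sum((r - mu) ** 3 for r in returns) / (len(returns) * sd ** 3)

# Hypothetical strategy A: frequent small losses and one large gain.
a = [-0.002] * 9 + [0.038]

# Strategy B mirrors A around its mean, so it keeps the same mean and
# standard deviation but flips the sign of the skew.
mu_a = statistics.mean(a)
b = [2 * mu_a - r for r in a]

print("Sharpe A:", sharpe(a), "Sharpe B:", sharpe(b))
print("Skew A:", skewness(a), "Skew B:", skewness(b))
print("Worst period A:", min(a), "Worst period B:", min(b))
```

Strategy B earns the same Sharpe as A while carrying a single loss far larger than anything A ever experiences, which is exactly the kind of difference Sharpe alone cannot show.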

2. A Multi-Dimensional Metric Set

We propose evaluating ML strategies along at least these dimensions:

Risk-adjusted return (Sharpe, Sortino, Information ratio)
Drawdown profile (max DD, average DD, recovery time)
Tail behaviour (CVaR, tail Sharpe, skew, kurtosis)
Turnover and capacity (turnover, market impact proxies)
Stability and robustness (across time, regimes, cross-validation folds)
Implementation risk (signal noise, fill ratios, slippage sensitivity)
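The drawdown dimension can be sketched in a few lines. The version below is a minimal example under stated assumptions: returns are simple per-period returns compounded into an equity curve, "average DD" is taken as the mean drawdown depth across all periods, and "recovery time" is taken as the longest run of consecutive periods below the previous peak. Other definitions are in common use, and the function name `drawdown_profile` is hypothetical.

```python
def drawdown_profile(returns):
    """Return (max drawdown, average drawdown, longest underwater run in periods).

    Drawdowns are fractions of the running peak of a compounded equity curve.
    """
    equity, peak = 1.0, 1.0
    drawdowns = []
    underwater = 0   # consecutive periods spent below the running peak
    longest = 0      # longest such run seen so far
    for r in returns:
        equity *= 1.0 + r
        if equity >= peak:
            peak = equity
            underwater = 0
        else:
            underwater += 1
            longest = max(longest, underwater)
        drawdowns.append(1.0 - equity / peak)
    max_dd = max(drawdowns)
    avg_dd = sum(drawdowns) / len(drawdowns)
    return max_dd, avg_dd, longest

# Hypothetical return path: a gain, a two-period slide, then a recovery.
max_dd, avg_dd, longest = drawdown_profile([0.10, -0.10, -0.10, 0.05, 0.20, 0.02])
print(max_dd, avg_dd, longest)
```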

3. Tail Metrics for ML Strategies

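Given a series of strategy returns r_t, define:

CVaR_α: the average loss over the worst α% of periods (also known as expected shortfall)
Tail Sharpe: the Sharpe ratio computed only on the left tail of the return distribution (e.g. the worst 20% of returns)

ML strategies tuned to minimise an average loss can accidentally produce very fat left tails, because an average says little about how the worst outcomes are distributed. CVaR and tail Sharpe expose this directly.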
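Both tail metrics can be estimated directly from the empirical return series. The sketch below is one possible reading of the definitions above, not a reference implementation: it uses the historical (non-parametric) estimator, reports CVaR as a positive loss, and reads tail Sharpe as mean over standard deviation of the worst α fraction of returns. The helper names are hypothetical.

```python
import math
import statistics

def worst_tail(returns, alpha):
    """Return the worst ceil(alpha * n) returns, sorted from worst to best."""
    # round() guards against float noise such as 0.1 * 30 == 3.0000000000000004
    k = max(1, math.ceil(round(alpha * len(returns), 9)))
    return sorted(returns)[:k]

def cvar(returns, alpha=0.05):
    """Historical CVaR: average loss (as a positive number) in the worst alpha fraction."""
    return -statistics.mean(worst_tail(returns, alpha))

def tail_sharpe(returns, alpha=0.20):
    """Mean over sample std of the worst alpha fraction (needs at least 2 tail points)."""
    tail = worst_tail(returns, alpha)
    return statistics.mean(tail) / statistics.stdev(tail)

# Hypothetical returns, for illustration only.
rets = [-0.05, -0.03, 0.01, 0.02, 0.01, 0.0, 0.015, -0.01, 0.02, 0.005]
print("CVaR 20%:", cvar(rets, 0.20))
print("Tail Sharpe 20%:", tail_sharpe(rets, 0.20))
```

With only a handful of observations in the tail, both estimates are noisy, so in practice they are more informative on long histories or when compared across runs on the same dataset.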

4. Stability Measures

4.1 Time-Based Stability

Compute Sharpe in rolling windows, then analyse the mean, median, dispersion, and worst decile of the resulting series. A strategy with a full-sample Sharpe of 1.0 but frequent windows with Sharpe below -1.0 is not robust.
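A minimal sketch of the rolling-window analysis follows. It assumes daily returns, a zero risk-free rate, and annualisation by the square root of 252; the window length, the function names, and the choice of population standard deviation as the dispersion measure are illustrative assumptions.

```python
import math
import statistics

def rolling_sharpe(returns, window, periods_per_year=252):
    """Annualised Sharpe for every full window (zero risk-free rate assumed)."""
    scale = math.sqrt(periods_per_year)
    out = []
    for end in range(window, len(returns) + 1):
        chunk = returns[end - window:end]
        sd = statistics.stdev(chunk)
        if sd > 0:  # skip flat windows, where Sharpe is undefined
            out.append(scale * statistics.mean(chunk) / sd)
    return out

def stability_summary(sharpes, floor=-1.0):
    """Summarise the distribution of rolling Sharpe values."""
    ordered = sorted(sharpes)
    k = max(1, len(ordered) // 10)  # number of windows in the worst decile
    return {
        "mean": statistics.mean(sharpes),
        "median": statistics.median(sharpes),
        "dispersion": statistics.pstdev(sharpes),
        "worst_decile_mean": statistics.mean(ordered[:k]),
        "share_below_floor": sum(s < floor for s in sharpes) / len(sharpes),
    }

# Hypothetical daily returns and a deliberately short window.
rets = [0.01, 0.02, -0.01, 0.03, 0.0, -0.02, 0.01, 0.02]
windows = rolling_sharpe(rets, window=4)
print(stability_summary(windows))
```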

4.2 Regime-Based Stability

Combine this with regime conditioning: compute Sharpe, CVaR, and hit ratio separately for each regime. Then define a Stability Index that penalises dispersion of performance across regimes.
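The text does not fix a formula for the Stability Index, so the sketch below uses one hypothetical choice: the mean of the per-regime Sharpe ratios minus a penalty times their standard deviation. The regime labels, the `penalty` parameter, and the function names are all assumptions made for illustration. CVaR by regime would follow the same grouping pattern.

```python
import statistics
from collections import defaultdict

def by_regime(returns, regimes):
    """Group returns by their regime label."""
    groups = defaultdict(list)
    for r, label in zip(returns, regimes):
        groups[label].append(r)
    return dict(groups)

def regime_report(returns, regimes):
    """Per-regime Sharpe and hit ratio (each regime needs at least 2 observations)."""
    report = {}
    for label, rs in by_regime(returns, regimes).items():
        report[label] = {
            "sharpe": statistics.mean(rs) / statistics.stdev(rs),
            "hit_ratio": sum(r > 0 for r in rs) / len(rs),
        }
    return report

def stability_index(report, penalty=1.0):
    """Hypothetical index: mean regime Sharpe minus penalty * dispersion across regimes."""
    sharpes = [m["sharpe"] for m in report.values()]
    return statistics.mean(sharpes) - penalty * statistics.pstdev(sharpes)

# Hypothetical returns with hand-assigned regime labels.
rets = [0.01, 0.02, 0.015, -0.01, -0.02, 0.005]
labels = ["calm"] * 3 + ["stress"] * 3
report = regime_report(rets, labels)
print(report)
print("Stability Index:", stability_index(report))
```

Under this definition, two strategies with the same average regime Sharpe are ranked by how evenly that performance is spread, and a larger `penalty` makes the ranking more conservative.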

5. Turnover, Capacity, and Market Impact

ML strategies often trade too frequently. Key quantities:

Turnover: sum of absolute position changes
Estimated market impact: using square-root models or simplified cost functions

Compute Net Sharpe after costs, then compare Net Sharpe against Gross Sharpe and track the cost per unit of alpha.
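The sketch below ties these quantities together under simplifying assumptions: positions are fractions of capital, `adv` is average daily volume in the same units, `sigma` is daily volatility, and the impact coefficient `k` is a free parameter that would need calibration. The square-root form used here is one common simplified cost function, and "cost per unit of alpha" is read as total cost divided by total gross return. All names and numbers are hypothetical.

```python
import math
import statistics

def sharpe(returns):
    """Per-period Sharpe ratio (zero risk-free rate assumed)."""
    return statistics.mean(returns) / statistics.stdev(returns)

def turnover(positions):
    """Sum of absolute position changes, starting from a flat book."""
    prev, total = 0.0, 0.0
    for p in positions:
        total += abs(p - prev)
        prev = p
    return total

def sqrt_impact_cost(trade, adv, sigma, k=1.0):
    """Cost of one trade: size times an impact of k * sigma * sqrt(size / adv)."""
    size = abs(trade)
    return size * k * sigma * math.sqrt(size / adv)

def net_returns(gross, positions, adv, sigma, k=1.0):
    """Subtract the estimated impact cost of each period's trade from gross returns."""
    net, prev = [], 0.0
    for g, p in zip(gross, positions):
        net.append(g - sqrt_impact_cost(p - prev, adv, sigma, k))
        prev = p
    return net

# Hypothetical gross returns and end-of-period positions.
gross = [0.01, 0.02, -0.005, 0.015]
positions = [1.0, -1.0, 1.0, 0.0]
net = net_returns(gross, positions, adv=100.0, sigma=0.02)
total_cost = sum(g - n for g, n in zip(gross, net))

print("Turnover:", turnover(positions))
print("Gross Sharpe:", sharpe(gross), "Net Sharpe:", sharpe(net))
print("Cost per unit of alpha:", total_cost / sum(gross))
```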

6. ML-Specific Diagnostics

For ML-based strategies, we also want:

Performance by prediction confidence bucket
Performance by prediction sign agreement between models
Calibration plots: predicted vs realised returns/volatility
Feature importance stability over time

These diagnostics answer: "Do larger predicted returns actually correspond to larger realised returns?", "Is the model overconfident?", "Are signals coming from a stable feature set or from shifting noise?"
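The first of these questions can be checked with a simple bucketing exercise. The sketch below treats the absolute size of the prediction as the confidence measure, which is an assumption: a model that outputs explicit probabilities or uncertainty estimates would bucket on those instead. The function name and the equal-count bucketing scheme are illustrative.

```python
import math
import statistics

def confidence_buckets(predictions, realised, n_buckets=3):
    """Split observations into equal-count buckets by |prediction| (low to high)
    and report the realised return earned when trading in the predicted direction."""
    if len(predictions) < n_buckets:
        raise ValueError("need at least one observation per bucket")
    pairs = sorted(zip(predictions, realised), key=lambda pr: abs(pr[0]))
    size = len(pairs) // n_buckets
    report = []
    for b in range(n_buckets):
        start = b * size
        # the last bucket absorbs any remainder
        end = len(pairs) if b == n_buckets - 1 else start + size
        chunk = pairs[start:end]
        signed = [math.copysign(1.0, p) * r for p, r in chunk]
        report.append({
            "mean_abs_prediction": statistics.mean([abs(p) for p, _ in chunk]),
            "mean_signed_return": statistics.mean(signed),
            "hit_ratio": sum(x > 0 for x in signed) / len(signed),
        })
    return report

# Hypothetical predictions and realised returns.
preds = [0.001, -0.002, 0.01, -0.012, 0.03, -0.04]
real = [-0.001, 0.001, 0.005, -0.004, 0.02, -0.03]
for row in confidence_buckets(preds, real):
    print(row)
```

If the model is informative, mean signed return and hit ratio should rise from the lowest-confidence bucket to the highest; a flat or inverted profile suggests the prediction magnitude carries no usable information.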

7. How volarixs Implements This Framework

In volarixs, each experiment/run stores:

full time series of: predictions, realised returns, positions, PnL
metadata: model, features, regimes, configuration, transaction cost settings

Evaluation pipelines then compute:

the full metric set listed above
breakdowns by time, regime, asset, and confidence bucket
comparisons across runs on the same dataset/universe

The UI exposes metric dashboards, distribution visualisations, and regime/stability diagnostics. This turns the question from "Is my model good?" into "Is my model good where it matters, robust when conditions change, and still good after costs?" That's a research-level evaluation standard.

Evaluation

Metrics

Sharpe

CVaR

Stability
