Beyond Sharpe: A Research Framework for Evaluating ML Trading Strategies
The Sharpe ratio is the default metric in systematic finance. For ML-driven strategies, it is also dangerously incomplete.
1. The Limits of Sharpe
The Sharpe ratio, S = (E[r] - r_f) / σ(r), implicitly assumes that returns are i.i.d., approximately Gaussian, and that variance is a good proxy for risk.
ML strategies often have:
- highly skewed return distributions
- clustered losses
- exposure to hidden factors
- complex dependence on market regimes
Two strategies with identical Sharpe can have radically different risk and robustness.
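To make this concrete, here is a minimal simulation, illustrative only and not drawn from any real strategy: two return series calibrated to similar means and volatilities (hence similar Sharpe estimates), but with very different left tails.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 252 * 5  # five years of daily returns

# Strategy A: roughly Gaussian daily returns.
a = rng.normal(loc=0.0005, scale=0.01, size=n)

# Strategy B: steady small gains punctuated by rare large losses (a crude
# short-volatility profile), calibrated so mean and volatility roughly match A.
b = rng.normal(loc=0.0012, scale=0.003, size=n)
crash = rng.random(n) < 0.01
b[crash] -= rng.exponential(scale=0.0675, size=crash.sum())

def sharpe(r, periods=252):
    return np.sqrt(periods) * r.mean() / r.std(ddof=1)

def skew(r):
    z = (r - r.mean()) / r.std()
    return (z ** 3).mean()

for name, r in (("A", a), ("B", b)):
    print(f"{name}: Sharpe={sharpe(r):.2f}  skew={skew(r):+.1f}  worst day={r.min():+.3f}")
```

The two Sharpe estimates land in the same ballpark, while B's skew and worst day reveal a left tail that Sharpe alone never shows.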
2. A Multi-Dimensional Metric Set
We propose evaluating ML strategies along at least these dimensions (a single-pass sketch follows the list):
- Risk-adjusted return (Sharpe, Sortino, Information ratio)
- Drawdown profile (max DD, average DD, recovery time)
- Tail behaviour (CVaR, tail Sharpe, skew, kurtosis)
- Turnover and capacity (turnover, market impact proxies)
- Stability and robustness (across time, regimes, cross-validation folds)
- Implementation risk (signal noise, fill ratios, slippage sensitivity)
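As a sketch of what one evaluation pass over this metric set might look like; the function and key names are ours, not a fixed API:

```python
import numpy as np

def evaluate(returns: np.ndarray, positions: np.ndarray, periods: int = 252) -> dict:
    equity = np.cumsum(returns)                         # additive PnL curve
    drawdown = np.maximum.accumulate(equity) - equity
    downside = np.sqrt(np.mean(np.minimum(returns, 0.0) ** 2))  # downside deviation, target = 0
    cutoff = np.quantile(returns, 0.05)
    return {
        "sharpe":   np.sqrt(periods) * returns.mean() / returns.std(ddof=1),
        "sortino":  np.sqrt(periods) * returns.mean() / downside,
        "max_dd":   drawdown.max(),
        "cvar_5":   -returns[returns <= cutoff].mean(),
        "turnover": np.abs(np.diff(positions, prepend=0.0)).sum(),
    }
```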
3. Tail Metrics for ML Strategies
Given a return series r_t, define:
- CVaR_α: the average loss in the worst α fraction (e.g. 5%) of outcomes, i.e. E[-r | r ≤ VaR_α]
- Tail Sharpe: Sharpe computed on the left-tail-truncated distribution (e.g. the worst 20% of returns)
ML strategies tuned to minimise average loss can accidentally produce very fat left tails. CVaR and tail Sharpe expose this directly.
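A straightforward numpy sketch of both tail metrics (the parameter defaults are ours):

```python
import numpy as np

def cvar(returns: np.ndarray, alpha: float = 0.05) -> float:
    """Average loss in the worst alpha fraction of outcomes (positive = loss)."""
    cutoff = np.quantile(returns, alpha)
    return -returns[returns <= cutoff].mean()

def tail_sharpe(returns: np.ndarray, tail: float = 0.20, periods: int = 252) -> float:
    """Sharpe computed only on the worst `tail` fraction of returns.
    Typically negative; when comparing strategies, less negative is better."""
    worst = returns[returns <= np.quantile(returns, tail)]
    return np.sqrt(periods) * worst.mean() / worst.std(ddof=1)
```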
4. Stability Measures
4.1 Time-Based Stability
Compute Sharpe in rolling windows, then analyse the mean, median, dispersion, and worst decile of the window estimates. A strategy with a full-sample Sharpe of 1.0 but frequent windows with Sharpe < -1.0 is not robust.
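In pandas, this analysis might look as follows; the window length and the -1.0 flag follow the text, other choices are equally reasonable:

```python
import numpy as np
import pandas as pd

def rolling_sharpe(returns: pd.Series, window: int = 126, periods: int = 252) -> pd.Series:
    mu = returns.rolling(window).mean()
    sigma = returns.rolling(window).std()
    return np.sqrt(periods) * mu / sigma

def time_stability(returns: pd.Series) -> pd.Series:
    rs = rolling_sharpe(returns).dropna()
    return pd.Series({
        "mean": rs.mean(),
        "median": rs.median(),
        "dispersion": rs.std(),
        "worst_decile": rs.quantile(0.10),
        "share_below_-1": (rs < -1.0).mean(),  # the robustness red flag above
    })
```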
4.2 Regime-Based Stability
Combine time-based stability with regime conditioning: Sharpe by regime, CVaR by regime, hit ratio by regime. Define a Stability Index that penalises the dispersion of performance across regimes.
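One possible construction; the penalty form is our assumption, since the text does not fix a formula. Compute per-regime metrics, then subtract a multiple of their dispersion:

```python
import numpy as np
import pandas as pd

def sharpe(r: pd.Series, periods: int = 252) -> float:
    return np.sqrt(periods) * r.mean() / r.std(ddof=1)

def regime_table(returns: pd.Series, regimes: pd.Series) -> pd.DataFrame:
    g = returns.groupby(regimes)
    return pd.DataFrame({
        "sharpe":    g.apply(sharpe),
        "cvar_5":    g.apply(lambda r: -r[r <= r.quantile(0.05)].mean()),
        "hit_ratio": g.apply(lambda r: (r > 0).mean()),
    })

def stability_index(returns: pd.Series, regimes: pd.Series, lam: float = 1.0) -> float:
    s = returns.groupby(regimes).apply(sharpe)
    return s.mean() - lam * s.std()   # penalise cross-regime dispersion
```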
5. Turnover, Capacity, and Market Impact
ML strategies often trade more frequently than their alpha can pay for. Key quantities:
- Turnover: sum of absolute position changes
- Estimated market impact: using square-root models or simplified cost functions
Compute the net Sharpe after costs and compare it with the gross Sharpe; the gap between them, and the cost per unit of alpha, show how much of the edge survives implementation.
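A sketch of the gross-to-net comparison under a square-root impact model; the impact coefficient, volatility, and ADV inputs are placeholders, not calibrated values:

```python
import numpy as np

def sharpe(r: np.ndarray, periods: int = 252) -> float:
    return np.sqrt(periods) * r.mean() / r.std(ddof=1)

def gross_vs_net(gross: np.ndarray, positions: np.ndarray, capital: float,
                 adv: float, coeff: float = 0.1, daily_vol: float = 0.02) -> dict:
    # Trades as a fraction of capital; square-root impact per unit traded:
    # cost_fraction ~ coeff * vol * sqrt(participation), participation = notional / ADV.
    trades = np.abs(np.diff(positions, prepend=0.0))
    participation = trades * capital / adv
    costs = trades * coeff * daily_vol * np.sqrt(participation)
    net = gross - costs
    cost_per_alpha = costs.sum() / max(gross.sum(), 1e-12)  # share of gross alpha paid away
    return {"gross_sharpe": sharpe(gross), "net_sharpe": sharpe(net),
            "cost_per_unit_alpha": cost_per_alpha}
```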
6. ML-Specific Diagnostics
For ML-based strategies, we also want:
- Performance by prediction confidence bucket
- Performance by prediction sign agreement between models
- Calibration plots: predicted vs realised returns/volatility
- Feature importance stability over time
These diagnostics answer questions such as: "Do larger predicted returns actually correspond to larger realised returns?", "Is the model overconfident?", and "Are signals coming from a stable feature set or from shifting noise?"
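A sketch of the confidence-bucket diagnostic (column and function names are ours): bucket predictions by absolute size, then check that realised performance and hit ratios rise with the bucket.

```python
import numpy as np
import pandas as pd

def bucket_diagnostics(preds: pd.Series, realised: pd.Series, n_buckets: int = 5) -> pd.DataFrame:
    df = pd.DataFrame({"pred": preds, "real": realised})
    df["bucket"] = pd.qcut(df["pred"].abs(), n_buckets, labels=False)   # 0 = least confident
    df["hit"] = np.sign(df["pred"]) == np.sign(df["real"])
    df["signal_pnl"] = df["real"] * np.sign(df["pred"])  # return if traded in the predicted direction
    table = df.groupby("bucket").agg(
        mean_abs_pred=("pred", lambda p: p.abs().mean()),
        mean_signal_pnl=("signal_pnl", "mean"),
        hit_ratio=("hit", "mean"),
        n=("real", "size"),
    )
    # For a calibrated, monotone signal, mean_signal_pnl and hit_ratio should
    # increase with the bucket index; flat rows suggest overconfidence.
    return table
```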
7. How volarixs Implements This Framework
In volarixs, each experiment/run stores:
- full time series of: predictions, realised returns, positions, PnL
- metadata: model, features, regimes, configuration, transaction cost settings
Evaluation pipelines then compute (a minimal sketch follows the list):
- the full metric set listed above
- breakdowns by time, regime, asset, and confidence bucket
- comparisons across runs on the same dataset/universe
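As an illustration only, this is our sketch and not volarixs's actual schema or API: a run record of the kind described above, compared across runs by reusing the evaluate() sketch from section 2.

```python
from dataclasses import dataclass, field

import pandas as pd

@dataclass
class RunRecord:
    predictions: pd.Series
    realised_returns: pd.Series
    positions: pd.Series
    pnl: pd.Series
    metadata: dict = field(default_factory=dict)  # model, features, regimes, cost settings

def compare_runs(runs: dict) -> pd.DataFrame:
    """One row of the metric set per run, on a shared dataset/universe."""
    rows = {name: evaluate(run.realised_returns.to_numpy(),   # evaluate() from section 2's sketch
                           run.positions.to_numpy())
            for name, run in runs.items()}
    return pd.DataFrame(rows).T
```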
The UI exposes metric dashboards, distribution visualisations, and regime/stability diagnostics. This turns the question from "Is my model good?" into "Is my model good where it matters, robust when conditions change, and still good after costs?" That's a research-level evaluation standard.