Beyond Sharpe: A Research Framework for Evaluating ML Trading Strategies
The Sharpe ratio is the default metric in systematic finance. For ML-driven strategies, it is also dangerously incomplete.
1. The Limits of Sharpe
The Sharpe ratio is a complete summary of performance only when returns are i.i.d. and close to Gaussian, so that variance is a good proxy for risk.
ML strategies often have:
- highly skewed return distributions
- clustered losses
- exposure to hidden factors
- complex dependence on market regimes
Two strategies with identical Sharpe can have radically different risk and robustness.
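To make the point above concrete, here is a minimal sketch using only the Python standard library. It builds two hypothetical return series with the same mean and variance (and therefore the same Sharpe) but opposite skew, by reflecting one series about its own mean. The series values, the function names `sharpe` and `skewness`, and the zero risk-free rate are all assumptions chosen for illustration.

```python
import statistics


def sharpe(returns):
    """Per-period Sharpe with a zero risk-free rate (an assumption of this sketch)."""
    return statistics.fmean(returns) / statistics.stdev(returns)


def skewness(returns):
    """Population skewness: third central moment over the cubed standard deviation."""
    mu = statistics.fmean(returns)
    sd = statistics.pstdev(returns)
    return statistics.fmean([(r - mu) ** 3 for r in returns]) / sd ** 3


# Strategy A (hypothetical): small steady gains with one large loss.
strategy_a = [0.01] * 9 + [-0.05]

# Strategy B (hypothetical): A reflected about its own mean, so the mean and
# variance are unchanged while the shape of the distribution is mirrored.
mean_a = statistics.fmean(strategy_a)
strategy_b = [2 * mean_a - r for r in strategy_a]

print("Sharpe:", sharpe(strategy_a), sharpe(strategy_b))      # equal up to rounding
print("Skew:  ", skewness(strategy_a), skewness(strategy_b))  # opposite signs
print("Worst: ", min(strategy_a), min(strategy_b))            # very different worst period
```

Sharpe alone cannot separate these two series; skew and the worst single period can.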
2. A Multi-Dimensional Metric Set
We propose evaluating ML strategies along at least these dimensions:
- Risk-adjusted return (Sharpe, Sortino, Information ratio)
- Drawdown profile (max DD, average DD, recovery time)
- Tail behaviour (CVaR, tail Sharpe, skew, kurtosis)
- Turnover and capacity (turnover, market impact proxies)
- Stability and robustness (across time, regimes, cross-validation folds)
- Implementation risk (signal noise, fill ratios, slippage sensitivity)
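As one illustration of the drawdown dimension in the list above, the following is a minimal, dependency-free sketch that computes maximum drawdown, average drawdown, and the longest stretch spent below a prior peak (used here as a simple proxy for recovery time). The function name `drawdown_profile`, the use of simple per-period returns, and the compounding convention are assumptions made for this example rather than definitions from the framework.

```python
def drawdown_profile(returns):
    """Summarise drawdowns of a compounded equity curve built from simple returns."""
    equity = 1.0
    peak = 1.0
    drawdowns = []            # drawdown depth at each step (0.0 at a new high)
    longest_underwater = 0    # longest run of periods below the previous peak
    current_underwater = 0
    for r in returns:
        equity *= 1.0 + r
        if equity >= peak:
            peak = equity
            current_underwater = 0
        else:
            current_underwater += 1
            longest_underwater = max(longest_underwater, current_underwater)
        drawdowns.append(1.0 - equity / peak)

    # Average drawdown is taken over periods that are actually below the peak.
    in_drawdown = [d for d in drawdowns if d > 0]
    return {
        "max_drawdown": max(drawdowns) if drawdowns else 0.0,
        "avg_drawdown": sum(in_drawdown) / len(in_drawdown) if in_drawdown else 0.0,
        "longest_underwater": longest_underwater,
    }


# Hypothetical example: a gain, two losses, then a recovery to a new high.
print(drawdown_profile([0.10, -0.10, -0.10, 0.30]))
```

Other conventions (log returns, averaging over distinct drawdown episodes rather than periods) are equally defensible. What matters is applying one convention consistently across strategies.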
3. Tail Metrics for ML Strategies
Given a series of per-period returns r_t, define:
- CVaR_α: the average loss over the worst α fraction of periods (e.g. α = 5%)
- Tail Sharpe: the Sharpe ratio computed on the left tail only (e.g. the worst 20% of returns)
ML models trained to minimise an average loss can accidentally produce strategies with very fat left tails. CVaR and tail Sharpe expose this directly.
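The two tail metrics defined above can be sketched in a few lines of standard-library Python. This is one possible reading of the definitions, not a canonical one: CVaR is reported as a positive loss, the tail is taken as the worst `ceil(alpha * n)` observations, and tail Sharpe is the mean over the sample standard deviation within that tail. The function names and default values of `alpha` are assumptions for illustration.

```python
import math
import statistics


def tail_slice(returns, alpha):
    """Return the worst alpha fraction of returns (at least one observation)."""
    k = max(1, math.ceil(alpha * len(returns)))
    return sorted(returns)[:k]


def cvar(returns, alpha=0.05):
    """CVaR as a positive loss: minus the mean return over the worst alpha fraction."""
    return -statistics.fmean(tail_slice(returns, alpha))


def tail_sharpe(returns, alpha=0.20):
    """Mean over standard deviation, restricted to the worst alpha fraction."""
    tail = tail_slice(returns, alpha)
    if len(tail) < 2:
        return float("nan")  # a single observation has no dispersion
    sd = statistics.stdev(tail)
    return statistics.fmean(tail) / sd if sd > 0 else float("nan")


# Hypothetical daily returns.
sample = [-0.05, -0.03, -0.01, 0.0, 0.01, 0.01, 0.02, 0.02, 0.03, 0.04]
print("CVaR (20%):       ", cvar(sample, alpha=0.20))
print("Tail Sharpe (20%):", tail_sharpe(sample, alpha=0.20))
```

With short samples the tail contains very few points, so both numbers are noisy. In practice they are worth reading alongside the sample size used to compute them.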
4. Stability Measures
4.1 Time-Based Stability
Compute Sharpe in rolling windows, then analyse the mean, median, dispersion, and worst decile of the resulting values. A strategy with a full-sample Sharpe of 1.0 but frequent windows with Sharpe below -1.0 is not robust.
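A minimal sketch of this rolling analysis, using only the standard library, might look like the following. The window length, the annualisation factor of 252, the zero risk-free rate, and the names `rolling_sharpe` and `summarise_windows` are assumptions for the example.

```python
import math
import statistics


def rolling_sharpe(returns, window, periods_per_year=252):
    """Annualised Sharpe (zero risk-free rate assumed) for each full rolling window."""
    out = []
    for end in range(window, len(returns) + 1):
        chunk = returns[end - window:end]
        sd = statistics.stdev(chunk)
        if sd > 0:  # skip degenerate windows with no variation
            out.append(statistics.fmean(chunk) / sd * math.sqrt(periods_per_year))
    return out


def summarise_windows(sharpes, floor=-1.0):
    """Mean, median, dispersion, worst-decile mean, and share of windows below `floor`."""
    ordered = sorted(sharpes)
    k = max(1, len(ordered) // 10)  # size of the worst decile
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "worst_decile_mean": statistics.fmean(ordered[:k]),
        "share_below_floor": sum(s < floor for s in ordered) / len(ordered),
    }


# Hypothetical usage with a short return series and a 3-period window.
windows = rolling_sharpe([0.01, 0.02, -0.01, 0.03, -0.02, 0.01], window=3)
print(summarise_windows(windows))
```

Overlapping windows are strongly autocorrelated, so the dispersion reported here describes the path the strategy took and is not an independent-sample standard error.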
4.2 Regime-Based Stability
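Condition the same metrics on market regime: Sharpe by regime, CVaR by regime, and hit ratio by regime. Then define a Stability Index that penalises dispersion of performance across regimes, so that a strategy earning all of its return in one regime scores lower than one that performs evenly across them.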
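The text leaves the exact form of the Stability Index open. The sketch below shows one hypothetical choice: the mean of the per-regime Sharpe ratios minus their population standard deviation. That formula, the regime labels, and the function names are assumptions made for illustration. Any index that decreases as cross-regime dispersion increases would serve the same purpose.

```python
import math
import statistics
from collections import defaultdict


def sharpe(returns, periods_per_year=252):
    """Annualised Sharpe with a zero risk-free rate (an assumption of this sketch)."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return float("nan")
    return statistics.fmean(returns) / sd * math.sqrt(periods_per_year)


def sharpe_by_regime(returns, regimes, periods_per_year=252):
    """Group returns by regime label and compute Sharpe within each group."""
    grouped = defaultdict(list)
    for r, label in zip(returns, regimes):
        grouped[label].append(r)
    # Regimes with fewer than two observations have no defined standard deviation.
    return {
        label: sharpe(rs, periods_per_year)
        for label, rs in grouped.items()
        if len(rs) >= 2
    }


def stability_index(per_regime):
    """Hypothetical index: mean regime Sharpe minus dispersion across regimes."""
    values = list(per_regime.values())
    return statistics.fmean(values) - statistics.pstdev(values)


# Hypothetical returns with hand-assigned regime labels.
rets = [0.01, 0.03, 0.02, -0.01, -0.03, 0.01]
labels = ["calm", "calm", "calm", "stress", "stress", "stress"]
per_regime = sharpe_by_regime(rets, labels)
print(per_regime)
print("Stability index:", stability_index(per_regime))
```

The same grouping step applies unchanged to CVaR or hit ratio by regime: swap the per-group function and keep the rest.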
5. Turnover, Capacity, and Market Impact
ML strategies often trade too frequently, so gross performance can overstate what survives trading costs. Key quantities:
- Turnover: sum of absolute position changes
- Estimated market impact: using square-root models or simplified cost functions
Compute the Sharpe ratio net of costs, then compare net Sharpe against gross Sharpe and track the cost per unit of alpha (for example, total trading cost divided by gross P&L).
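The comparison can be sketched as follows. For brevity this example uses a linear cost per unit traded, not a square-root impact model, and it assumes the position for a period is established at the start of that period from a flat starting book. The cost level, the function names, and the definition of cost per unit of alpha as total cost over gross P&L are all assumptions for illustration.

```python
import statistics


def turnover(positions):
    """Sum of absolute position changes, starting from a flat (zero) position."""
    prev, total = 0.0, 0.0
    for p in positions:
        total += abs(p - prev)
        prev = p
    return total


def net_returns(positions, asset_returns, cost_per_unit):
    """Per-period P&L after a linear cost charged on each position change."""
    out, prev = [], 0.0
    for p, r in zip(positions, asset_returns):
        out.append(p * r - cost_per_unit * abs(p - prev))
        prev = p
    return out


def sharpe(returns):
    """Per-period Sharpe with a zero risk-free rate (an assumption of this sketch)."""
    return statistics.fmean(returns) / statistics.stdev(returns)


def cost_per_unit_alpha(positions, asset_returns, cost_per_unit):
    """Total trading cost divided by gross P&L (undefined when gross P&L <= 0)."""
    gross = sum(net_returns(positions, asset_returns, 0.0))
    cost = cost_per_unit * turnover(positions)
    return cost / gross if gross > 0 else float("nan")


# Hypothetical strategy that flips its position every period.
positions = [1, -1, 1]
asset_returns = [0.01, -0.02, 0.01]
cost = 0.001  # assumed cost per unit of position traded

print("Turnover:    ", turnover(positions))
print("Gross Sharpe:", sharpe(net_returns(positions, asset_returns, 0.0)))
print("Net Sharpe:  ", sharpe(net_returns(positions, asset_returns, cost)))
print("Cost / alpha:", cost_per_unit_alpha(positions, asset_returns, cost))
```

Replacing the linear term with a size-dependent impact estimate changes only the cost line inside `net_returns`. The gross-versus-net comparison stays the same.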
6. ML-Specific Diagnostics
For ML-based strategies, we also want:
- Performance by prediction confidence bucket
- Performance by prediction sign agreement between models
- Calibration plots: predicted vs realised returns/volatility
- Feature importance stability over time
These diagnostics answer: "Do larger predicted returns actually correspond to larger realised returns?", "Is the model overconfident?", "Are signals coming from a stable feature set or from shifting noise?"
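The first of these diagnostics, performance by confidence bucket, can be sketched in standard-library Python. The example sorts signals by confidence, splits them into equal-count buckets, and reports the hit ratio and mean signed return per bucket. The input layout (parallel lists of confidence, predicted direction, and realised return), the bucket count, and the function name `bucket_performance` are assumptions for illustration and do not describe any particular platform's schema.

```python
import statistics


def bucket_performance(confidences, predictions, realised, n_buckets=3):
    """Hit ratio and mean signed return per confidence bucket, lowest bucket first.

    `predictions` are assumed to be non-zero; only their sign is used.
    """
    if len(confidences) < n_buckets:
        raise ValueError("need at least one signal per bucket")
    rows = sorted(zip(confidences, predictions, realised), key=lambda row: row[0])
    size = len(rows) // n_buckets
    report = []
    for b in range(n_buckets):
        # The last bucket absorbs any remainder from the integer division.
        stop = (b + 1) * size if b < n_buckets - 1 else len(rows)
        chunk = rows[b * size:stop]
        # Return earned by trading in the predicted direction.
        signed = [y if p > 0 else -y for _, p, y in chunk]
        report.append({
            "min_confidence": chunk[0][0],
            "hit_ratio": sum(s > 0 for s in signed) / len(chunk),
            "mean_signed_return": statistics.fmean(signed),
        })
    return report


# Hypothetical signals: confidence, predicted direction (+1 / -1), realised return.
conf = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
pred = [1, -1, 1, 1, -1, 1]
real = [-0.01, 0.01, 0.01, 0.01, -0.02, 0.03]
for row in bucket_performance(conf, pred, real):
    print(row)
```

If the model's confidence is informative, hit ratio and mean signed return should rise from the lowest bucket to the highest. A flat or inverted pattern suggests the confidence score is not calibrated.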
7. How This Maps to volarixs
A framework like this is only as good as the data underneath it, and the work volarixs does up front is to retain that data. Each experiment records the raw material these metrics are computed from:
- the model's prediction history, with direction and confidence, across horizons (1d / 5d / 21d / 63d and beyond)
- the regime context each run was generated under — inflation, policy and liquidity state
- factor exposures for the run: betas, R², alpha and residual volatility
- run results: the model, datasets, targets and train/test R²
That history is what the metrics above are built on. Confidence buckets are recorded with every signal, so the diagnostics in section 6 — does a higher predicted return actually pay off, is the model overconfident — can be measured rather than assumed. Regime labels travel with every run, so stability can be read by regime instead of as a single blended number.
The metric set in this post is the lens volarixs is built around, not a one-click dashboard you toggle on today. The aim is to move the question from “Is my model good?” to “Is it good where it matters, robust when conditions change, and still good after costs?” — a research-level evaluation standard, with the prediction and regime history needed to answer it kept run by run.