Regime-Conditioned Performance: Measuring ML Robustness Across Market States

This post outlines a research framework for regime-conditioned performance analysis that we are building directly into volarixs.

1. Why Average Metrics Are Misleading

Suppose a strategy produces daily returns r_t over T days. Standard practice is to compute a single annualised Sharpe ratio over the whole sample:

Sharpe = E[r_t] / √Var(r_t) × √252
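As a minimal sketch, the formula above can be written in a few lines of Python. The function name `annualised_sharpe` is made up for illustration, the sample standard deviation (ddof=1) is one possible choice, and the returns are assumed to be already in excess of the risk-free rate.

```python
import numpy as np

def annualised_sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio of a 1-D sequence of per-period returns.

    Assumes returns are already excess returns (zero risk-free rate).
    """
    r = np.asarray(returns, dtype=float)
    if r.size < 2:
        return float("nan")  # not enough data to estimate a variance
    sd = r.std(ddof=1)
    if sd == 0.0:
        return float("nan")  # Sharpe is undefined for constant returns
    return float(r.mean() / sd * np.sqrt(periods_per_year))
```

The same function can later be applied to the subset of days belonging to a single regime.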

Suppose we segment the sample into regimes k ∈ {1, ..., K}, e.g.:

Regime 1: Low-vol bull
Regime 2: High-vol sideways
Regime 3: Crisis

Two models can have the same overall Sharpe, but very different regime profiles:

Model A: Sharpe₁ = 3.0, Sharpe₂ = -1.5, Sharpe₃ = -2.0
Model B: Sharpe₁ = 1.2, Sharpe₂ = 0.8, Sharpe₃ = 0.3

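If you're allocating real capital, Model B is far more attractive, because its performance does not depend on the market staying in Regime 1. A single pooled Sharpe from the usual backtest won't tell you this.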
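One way to see why the pooled figure hides the regime profile is to decompose it by regime. The symbols below are notation introduced for this sketch and do not appear in the original post: π_k is the fraction of days spent in regime k, and μ_k and s_k are the mean and standard deviation of daily returns within regime k. The identities hold for population moments.

```latex
\mu = \sum_{k=1}^{K} \pi_k \mu_k, \qquad
s^2 = \sum_{k=1}^{K} \pi_k \left( s_k^2 + (\mu_k - \mu)^2 \right), \qquad
\text{Sharpe} = \frac{\mu}{s}\,\sqrt{252}
```

The pooled Sharpe therefore depends on the regime mix π_k in the sample: a backtest dominated by Regime 1 days can look strong even when μ_k is negative in the other regimes.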

2. Defining Market Regimes

We need objective regime labels that:

are computed without peeking into the future
can be reproduced across experiments
can be applied across asset classes

2.1 Volatility-Based Regimes

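Compute rolling realised volatility σ_t over a window of w days, then bucket each day by the quantile of σ_t. To respect the no-peeking requirement, the quantile cutoffs should be estimated from past data only (for example, over an expanding window). An example scheme: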

0–20%: Low-vol
20–60%: Normal
60–85%: Elevated
85–100%: Crisis
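The bucketing above can be sketched with pandas as follows. This is one possible implementation under stated assumptions, not the volarixs code: the function name `vol_regimes`, the one-year minimum history, and the use of expanding (past-only) quantiles are choices made for this example.

```python
import numpy as np
import pandas as pd

LABELS = ["Low-vol", "Normal", "Elevated", "Crisis"]

def vol_regimes(returns, window=21, cutoffs=(0.20, 0.60, 0.85), min_history=252):
    """Label each day by where its rolling volatility sits relative to
    quantiles of *past* volatility (expanding window, no look-ahead)."""
    vol = returns.rolling(window).std() * np.sqrt(252)
    # Shift by one day so today's cutoffs use only volatility seen up to yesterday.
    past = vol.shift(1).expanding(min_periods=min_history)
    thresholds = [past.quantile(q) for q in cutoffs]
    # Count how many cutoffs today's volatility exceeds: bucket index 0..3.
    bucket = sum((vol > th).astype(int) for th in thresholds)
    valid = vol.notna() & thresholds[0].notna()
    # Days without enough history stay unlabelled (NaN).
    return bucket.map(dict(enumerate(LABELS))).where(valid)
```

The label for day t uses returns up to and including day t. If the label is meant to drive a trading decision rather than an after-the-fact evaluation, it would need to be lagged by one more day.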

[Interactive widget: Regime Segmentation Playground. Example settings: low-vol cutoff at the 20th percentile, high-vol cutoff at the 85th percentile, 21-day rolling window. Resulting regime timeline: Normal 332 days, Low-Vol 96 days, Crisis 72 days.]

Changing the volatility thresholds or the rolling window changes the regime labels, and with them any regime-conditioned evaluation of a model.

2.2 Hidden Markov Model (HMM) Regimes

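Fit a 2–3 state Gaussian HMM and infer the most likely state sequence. This is more flexible than fixed quantile buckets but heavier to compute. Note that decoding the most likely sequence over the full sample uses future observations, so labels that must avoid look-ahead should be based on filtered state probabilities instead.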
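To illustrate regime labels from an HMM that use no future data, here is a minimal forward filter in NumPy. It is a sketch under stated assumptions: the model parameters are taken as given (in practice they would be estimated on a training window, for example by EM), and the function name `hmm_filter` and all parameter values are illustrative.

```python
import numpy as np

def hmm_filter(x, means, stds, trans, init):
    """Filtered probabilities P(state_t = k | x_1..x_t) for a Gaussian HMM
    with known parameters. Row t depends only on observations up to t."""
    x = np.asarray(x, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    trans = np.asarray(trans, dtype=float)  # trans[i, j] = P(next = j | current = i)
    prior = np.asarray(init, dtype=float)
    probs = np.empty((x.size, means.size))
    for t, xt in enumerate(x):
        # Gaussian likelihood per state; the 1/sqrt(2*pi) constant cancels on normalising.
        lik = np.exp(-0.5 * ((xt - means) / stds) ** 2) / stds
        lik = np.maximum(lik, 1e-300)  # guard against underflow to all zeros
        post = prior * lik
        post /= post.sum()
        probs[t] = post
        prior = post @ trans  # one-step-ahead state prediction for t + 1
    return probs

# Illustrative two-state example: a calm state and a turbulent state.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 0.005, 200), rng.normal(0, 0.025, 200)])
probs = hmm_filter(
    x,
    means=[0.0, 0.0],
    stds=[0.005, 0.025],
    trans=[[0.98, 0.02], [0.05, 0.95]],
    init=[0.5, 0.5],
)
regime = probs.argmax(axis=1)  # hard label: most probable state given data so far
```

Keeping the full probability vector rather than only the hard label is useful when a day sits between regimes.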

2.3 Cluster-Based Regimes

Define a feature vector (vol, correlation, dispersion, etc.) and cluster using k-means or Gaussian mixtures.
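A cluster-based scheme could look like the sketch below, which uses scikit-learn's `KMeans` and `StandardScaler`. The helper names, the choice of three clusters, and the convention of ordering clusters by the centroid of the first feature (assumed here to be volatility) are assumptions for illustration. Fitting on a training window and only predicting afterwards is one way to keep the labels free of look-ahead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def fit_cluster_regimes(train_features, n_regimes=3, seed=0):
    """Fit scaler and k-means on a training window of shape (T, d)."""
    scaler = StandardScaler().fit(train_features)
    km = KMeans(n_clusters=n_regimes, n_init=10, random_state=seed)
    km.fit(scaler.transform(train_features))
    return scaler, km

def assign_regimes(scaler, km, features):
    """Assign regime ids to new rows using the already-fitted model.

    Raw cluster ids are arbitrary, so they are re-ranked by the centroid of
    the first feature: 0 = lowest, n_regimes - 1 = highest.
    """
    order = np.argsort(km.cluster_centers_[:, 0])
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)
    return rank[km.predict(scaler.transform(features))]
```

Re-ranking the cluster ids makes the labels comparable across experiments, which a raw k-means fit does not guarantee.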

3. Regime-Conditioned Metrics

Once we have regime labels, we compute for each model:

Regime Sharpe: Sharpe_k for each regime
Regime Max Drawdown
Regime Hit Ratio (share of days in regime k with r_t > 0)
Regime Turnover
Regime PnL contribution

We can then define robustness scores that penalize dispersion of performance across regimes.
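The metrics and a simple robustness score can be sketched as below. Several choices here are assumptions rather than the post's definitions: drawdown within a regime is computed on that regime's days stitched together, turnover is omitted because it needs position data, and the score "mean regime Sharpe minus a penalty times the dispersion of regime Sharpes" is just one possible form. All function names are illustrative.

```python
import numpy as np
import pandas as pd

def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve (0 or negative)."""
    equity = (1.0 + returns).cumprod()
    peak = equity.cummax().clip(lower=1.0)  # starting capital of 1.0 counts as a peak
    return float((equity / peak - 1.0).min())

def regime_metrics(returns, regimes, periods_per_year=252):
    """Per-regime metrics; `returns` and `regimes` are Series on the same index."""
    labelled = returns[regimes.notna()]
    total = labelled.sum()
    rows = {}
    for k, r in returns.groupby(regimes):
        sd = r.std(ddof=1)
        rows[k] = {
            "days": len(r),
            "sharpe": r.mean() / sd * np.sqrt(periods_per_year) if sd > 0 else np.nan,
            "max_drawdown": max_drawdown(r),
            "hit_ratio": float((r > 0).mean()),
            "pnl_share": r.sum() / total if total != 0 else np.nan,
        }
    return pd.DataFrame.from_dict(rows, orient="index")

def robustness_score(regime_sharpes, penalty=1.0):
    """Mean regime Sharpe minus `penalty` times the dispersion across regimes."""
    s = np.asarray(regime_sharpes, dtype=float)
    return float(np.nanmean(s) - penalty * np.nanstd(s))

# The two models from Section 1:
score_a = robustness_score([3.0, -1.5, -2.0])
score_b = robustness_score([1.2, 0.8, 0.3])
```

Under this scoring, Model B ranks above Model A, because Model A's Sharpe ratios are both lower on average and far more dispersed across regimes.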

4. Research Use Cases

4.1 Model Comparisons

Compare LSTM vs Ridge vs Random Forest by regime, not just overall. Analyze which models collapse during crisis states.

4.2 Portfolio Construction

Build ensembles where constituent models dominate in different regimes. Allocate capital dynamically based on current regime probability.
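A minimal sketch of regime-probability-based allocation: each regime has a preferred weight vector over the constituent models, and the live allocation is the probability-weighted blend. The function name `blend_weights` and the numbers in the example are hypothetical.

```python
import numpy as np

def blend_weights(regime_probs, weights_by_regime):
    """Blend per-regime model weights by the current regime probabilities.

    regime_probs:      shape (K,), probability of each regime today
    weights_by_regime: shape (K, M), row k = capital weights over M models
                       preferred in regime k (each row sums to 1)
    Returns an array of shape (M,) that also sums to 1.
    """
    p = np.asarray(regime_probs, dtype=float)
    w = np.asarray(weights_by_regime, dtype=float)
    p = p / p.sum()  # renormalise in case the probabilities are unnormalised
    return p @ w

# Three regimes, two models: model 0 is preferred in regime 0, model 1 in regime 2.
alloc = blend_weights([0.7, 0.2, 0.1], [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
```

Blending by probability avoids abrupt switches in allocation when the regime estimate is uncertain.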

4.3 Production Monitoring

If a live model suddenly deteriorates in only one regime, that calls for diagnosis, not panic. A rolling regime-conditioned Sharpe becomes part of your health dashboard.
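A rolling regime-conditioned Sharpe could be computed along the following lines. This is a sketch with assumed conventions: the window counts days spent in each regime (not calendar days), and on days when the market is in another regime the last known value is carried forward. The function name and default parameters are illustrative.

```python
import numpy as np
import pandas as pd

def rolling_regime_sharpe(returns, regimes, window=63, min_obs=20, periods_per_year=252):
    """For each regime, the annualised Sharpe over the last `window` days
    spent in that regime. Returns a DataFrame with one column per regime."""
    out = {}
    for k, r in returns.groupby(regimes):
        roll = r.rolling(window, min_periods=min_obs)
        out[k] = roll.mean() / roll.std() * np.sqrt(periods_per_year)
    # Align to the full calendar; carry the last value forward on days
    # when the market is in a different regime.
    return pd.DataFrame(out).reindex(returns.index).ffill()
```

A sharp drop in one column while the others stay stable points to a regime-specific problem rather than a general model failure.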

5. How volarixs Implements This

In volarixs, every experiment is linked to:

time series of features
time series of predictions
time series of realised returns
time series of regime labels (one or more schemes)

Regime-conditioned metrics are:

computed automatically for each run
stored in the experiment metadata
visualised as "Performance by Regime" plots and tables
available for alpha factory ranking and diagnostics

This shifts the conversation from "What's the Sharpe?" to "How does this thing behave when the world changes?" That's where robustness lives.
