Draft:Purged Cross-Validation

From Wikipedia, the free encyclopedia


Purged and Embargoed Cross-Validation is a time-series-aware model validation technique designed to address information leakage in financial machine learning. It provides a statistically rigorous alternative to conventional cross-validation and walk-forward backtesting methods, which often yield overly optimistic performance estimates due to information leakage and overfitting.[1][2] It is especially applicable when labels span time intervals rather than individual points in time. The method modifies standard k-fold cross-validation by incorporating two critical mechanisms: purging and embargoing.

Why K-Fold Cross-Validation Fails in Finance


Traditional cross-validation assumes independent and identically distributed (i.i.d.) observations, which is violated in financial time series. In many financial labeling schemes, each observation represents an event with a starting time and an ending time (e.g., a position held over several days). If a model is trained on data whose label intervals overlap with those in the test set, then future information is inadvertently included in training—this is known as look-ahead bias or information leakage.[3][2][1]

Purged cross-validation was introduced to ensure that the training set is uncontaminated by test information.[4][5]

The figure below illustrates standard 5-fold cross-validation.[6]

Visualization of KFold Cross-Validation

Purging


Purging removes from the training set any observation whose label interval overlaps in time with the label intervals of the test set. See the figure below for an illustration of purging.[4]

Purging Overlapping Samples in Finance


To handle overlapping labels in financial time series, the following notation is used:

  • t1: A pandas.Series that maps each observation to the end time of its label.
  • [i, j): The index range of the test set in a given fold.
  • t0: The start time of the test set, i.e., the timestamp at index i.
  • test_max: The latest end time reached by any label in the test set, defined as max(t1[k]) for k in [i, j).

To prevent information leakage, training samples must satisfy:

t1[k] ≤ t0   OR   k > index.searchsorted(test_max) + embargo

This ensures two things (a code sketch follows the list below):

  1. No training label ends after the test set starts (purging).
  2. No training sample falls within an embargo period after the test set ends.
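
A minimal sketch of the rule above in Python, assuming the label end times are held in a pandas Series; the function name and signature are illustrative rather than taken from the cited sources:

    import numpy as np
    import pandas as pd

    def purged_train_positions(t1: pd.Series, test_start, test_max, embargo: int = 0):
        """Return positions of observations that may be used for training.

        t1         -- maps each observation (index) to the end time of its label
        test_start -- t0, the start time of the test set
        test_max   -- latest label end time reached by the test set
        embargo    -- number of observations skipped after the test influence ends
        """
        index = t1.index
        positions = np.arange(len(index))
        # Keep observations whose labels end no later than the test start,
        # or whose position lies beyond the embargoed region after test_max.
        first_allowed = index.searchsorted(test_max) + embargo
        keep = (t1.values <= test_start) | (positions > first_allowed)
        return positions[keep]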

Embargoing


Embargoing addresses a more subtle form of leakage: even if an observation does not directly overlap the test set, it may still be affected by test events due to market reaction lag or downstream dependencies. To guard against this, a percentage-based embargo is imposed after each test fold. For example, with a 5% embargo and 1000 observations, the 50 observations following each test fold are excluded from training.
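
For instance, the embargo width in the scenario above follows directly from the embargo fraction (the numbers below restate the 5% example; the test-fold position is hypothetical):

    n_obs = 1000
    embargo_pct = 0.05

    # Width of the embargo in observations: 5% of 1,000 = 50.
    embargo_width = int(n_obs * embargo_pct)

    # If a test fold ends at some position test_end (hypothetical value below),
    # the next embargo_width observations are excluded from training.
    test_end = 599
    embargoed_positions = range(test_end + 1, test_end + 1 + embargo_width)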

The figure below illustrates the application of the embargo:[4]

Embargo of post-test training observations

Applications


Purged and embargoed cross-validation is especially useful in:

  • Backtesting of trading strategies[1]
  • Validation of classifiers on labeled event-driven returns[5]
  • Any machine learning task with overlapping label horizons[3][7]

Example


To illustrate the effect of purging and embargoing, consider the figures below. Both diagrams show the structure of 5-fold cross-validation over a 20-day period. In each row, blue squares indicate training samples and red squares denote test samples. Each label is defined by the values of the next two observations, so consecutive labels overlap in time. If this overlap is left untreated, test-set information leaks into the training set.

Standard K-Fold Cross-Validation: test samples are randomly partitioned with no attention to label overlap or time ordering. This can lead to contamination of the training set with future information.

The second figure applies the Purged CV procedure. Notice how purging removes overlapping observations from the training set and the embargo widens the gap between test and training data. This approach ensures that the evaluation more closely resembles a true out-of-sample test and reduces the risk of backtest overfitting.

Purged K-Fold Cross-Validation: training samples that overlap with the test label horizon are removed. Embargoing is applied to prevent leakage from immediately adjacent samples.
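
The setting of these figures can be reconstructed in a few lines; the sketch below is illustrative (the dates and the choice of fold are arbitrary) and applies the purging rule by hand to one fold of the 20-day example:

    import pandas as pd

    # 20 daily observations; each label depends on the next two observations,
    # so the label of day i ends on day i + 2.
    dates = pd.date_range("2024-01-01", periods=20, freq="D")
    t1 = pd.Series(dates, index=dates).shift(-2).ffill()

    # Take the second of five contiguous folds (days 5 to 8) as the test set.
    test_days = dates[4:8]
    test_start, test_max = test_days[0], t1[test_days].max()

    # Purge: keep training days whose labels end no later than the test start,
    # or which begin only after the last test label has ended.
    train_days = dates.difference(test_days)
    purged_train = train_days[[(t1[d] <= test_start) or (d > test_max) for d in train_days]]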

Comparison to Standard K-Fold

Feature                              | Standard K-Fold | Purged & Embargoed CV
Assumes i.i.d.                       | Yes             | No
Handles overlapping labels           | No              | Yes
Prevents information leakage         | No              | Yes
Suitable for financial time series   | Poorly          | Well

Combinatorial Purged Cross-Validation


Walk-forward backtesting, another common validation approach in finance, preserves temporal order but evaluates the model on a single sequence of test sets. This leads to high variance in performance estimation, as results are contingent on a specific historical path.[1]

Combinatorial Purged Cross-Validation (CPCV) addresses this limitation by systematically constructing multiple train-test splits, purging overlapping samples, and enforcing an embargo period to prevent information leakage. The result is a distribution of out-of-sample performance estimates, enabling robust statistical inference and more realistic assessment of a model's predictive power.[4]

Methodology


CPCV divides a time-series dataset into N sequential, non-overlapping groups. These groups preserve the temporal order of observations. Then, all combinations of k groups (where k < N) are selected as test sets, with the remaining N − k groups used for training. For each combination, the model is trained and evaluated under strict controls to prevent leakage.[4]

To eliminate potential contamination between training and test sets, CPCV introduces two additional mechanisms:

  • Purging: Any training observations whose label horizon overlaps with the test period are excluded. This ensures that future information does not influence model training.
  • Embargoing: After the end of each test period, a fixed number of observations (typically a small percentage) are removed from the training set. This prevents leakage due to delayed market reactions or auto-correlated features.

Each data point appears in multiple test sets across different combinations. Because test groups are drawn combinatorially, this process produces multiple backtest "paths," each of which simulates a plausible market scenario. From these paths, practitioners can compute a distribution of performance statistics such as the Sharpe ratio, drawdown, or classification accuracy.
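
The split-generation step described above can be sketched with standard-library tools. The function below is illustrative and omits purging and embargoing, which would further trim the training positions of each split:

    from itertools import combinations

    import numpy as np

    def cpcv_splits(n_obs: int, n_groups: int, k_test: int):
        """Yield (train_positions, test_positions) for every choice of
        k_test test groups out of n_groups sequential, non-overlapping groups."""
        groups = np.array_split(np.arange(n_obs), n_groups)  # preserves temporal order
        for test_groups in combinations(range(n_groups), k_test):
            test_pos = np.concatenate([groups[g] for g in test_groups])
            train_pos = np.concatenate(
                [groups[g] for g in range(n_groups) if g not in test_groups]
            )
            yield train_pos, test_pos

With n_groups = 6 and k_test = 2, for example, the generator produces the 15 splits used in the illustrative example below.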

Formal definition


Let N be the number of sequential groups into which the dataset is divided, and let k be the number of groups selected as the test set for each split. Then:

  • The number of unique train-test combinations is given by the binomial coefficient C(N, k) = N! / (k! (N − k)!).
  • Each observation is used in C(N − 1, k − 1) test sets and contributes to (k / N) · C(N, k) = C(N − 1, k − 1) unique backtest paths.

This yields a distribution of performance metrics rather than a single point estimate, making it possible to apply Monte Carlo-based or probabilistic techniques to assess model robustness.

Illustrative example


Consider the case where N = 6 and k = 2. The number of possible test set combinations is C(6, 2) = 15. Each of the six groups appears in five test splits. Consequently, five distinct backtest paths can be constructed, each incorporating one appearance from every group.
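
These counts follow directly from the binomial formulas above and can be checked in a few lines (illustrative):

    from math import comb

    N, k = 6, 2

    n_splits = comb(N, k)            # C(6, 2) = 15 train-test combinations
    per_group = comb(N - 1, k - 1)   # each group appears in 5 test splits
    n_paths = k * comb(N, k) // N    # (k / N) * C(N, k) = 5 backtest paths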

Test group assignment matrix


This table shows the 15 test combinations. An "x" indicates that the corresponding group is included in the test set for that split.

Paths generated for N = 6, k = 2
Group | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 | S11 | S12 | S13 | S14 | S15
G1    | x  | x  | x  | x  | x  |    |    |    |    |     |     |     |     |     |
G2    | x  |    |    |    |    | x  | x  | x  | x  |     |     |     |     |     |
G3    |    | x  |    |    |    | x  |    |    |    | x   | x   | x   |     |     |
G4    |    |    | x  |    |    |    | x  |    |    | x   |     |     | x   | x   |
G5    |    |    |    | x  |    |    |    | x  |    |     | x   |     | x   |     | x
G6    |    |    |    |    | x  |    |    |    | x  |     |     | x   |     | x   | x

Backtest path assignment


Each group contributes to five different backtest paths. The number in each cell indicates the path to which the group's result is assigned for that split.

Path assignments for each group
Group | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 | S11 | S12 | S13 | S14 | S15
G1    | 1  | 2  | 3  | 4  | 5  |    |    |    |    |     |     |     |     |     |
G2    | 1  |    |    |    |    | 2  | 3  | 4  | 5  |     |     |     |     |     |
G3    |    | 1  |    |    |    | 2  |    |    |    | 3   | 4   | 5   |     |     |
G4    |    |    | 1  |    |    |    | 2  |    |    | 3   |     |     | 4   | 5   |
G5    |    |    |    | 1  |    |    |    | 2  |    |     | 3   |     | 4   |     | 5
G6    |    |    |    |    | 1  |    |    |    | 2  |     |     | 3   |     | 4   | 5
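
The assignment shown above can be reproduced by numbering each group's test appearances in the order of the splits; the sketch below is illustrative:

    from itertools import combinations

    N, k = 6, 2
    splits = list(combinations(range(N), k))  # the 15 test-group pairs S1..S15, in order

    path = {}               # (group, split) -> backtest path number
    appearances = [0] * N
    for s, test_groups in enumerate(splits):
        for g in test_groups:
            appearances[g] += 1
            path[(g, s)] = appearances[g]  # a group's j-th appearance joins path j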

Advantages


Combinatorial Purged Cross-Validation offers several key benefits over conventional methods:

  • It produces a distribution of performance metrics, enabling more rigorous statistical inference.
  • The method systematically eliminates look-ahead bias through purging and embargoing.
  • By simulating multiple historical scenarios, it reduces the dependence on any single market regime or realization.
  • It supports high-confidence comparisons between competing models or strategies.

CPCV is commonly used in quantitative strategy research, especially for evaluating predictive models such as classifiers, regressors, and portfolio optimizers.[3] It has been applied to estimate realistic Sharpe ratios, assess the risk of overfitting, and support the use of statistical tools such as the Deflated Sharpe Ratio.[7][5]

Limitations


The main limitation of CPCV is its high computational cost. This cost can be managed by evaluating only a subset of the possible train-test combinations rather than all of them.
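
One way to cap this cost is to evaluate only a random subset of the possible test-group combinations; the sketch below is illustrative, and note that sampled splits no longer guarantee that every group contributes equally to every backtest path:

    import random
    from itertools import combinations

    N, k, max_splits = 10, 2, 20

    all_splits = list(combinations(range(N), k))  # C(10, 2) = 45 possible splits
    sampled = random.sample(all_splits, min(max_splits, len(all_splits)))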

Python Notebooks with Examples


The following GitHub repositories link to open-source code to experiment with:

See also


References

  1. ^ a b c Joubert, J., Sestovic, D., Barziy, I., Distaso, W., & López de Prado, M. (2024). "Enhanced Backtesting for Practitioners." The Journal of Portfolio Management, Quantitative Tools, 51(2), pp. 12–27. DOI: 10.3905/jpm.2024.1.637
  2. ^ a b Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2014). "The Probability of Backtest Overfitting." Journal of Computational Finance, 20(4).
  3. ^ a b c López de Prado, M. (2018). "The 10 Reasons Most Machine Learning Funds Fail." The Journal of Portfolio Management, 44(6), pp. 120–133. DOI: 10.3905/jpm.2018.44.6.120
  4. ^ a b c d e f g López de Prado, M. (2018). Advances in Financial Machine Learning. John Wiley & Sons. ISBN 978-1-119-48208-6.
  5. ^ a b c López de Prado, M. (2020). Machine Learning for Asset Managers. Cambridge University Press.
  6. ^ "KFold CV Illustration by Scikit-Learn". Scikit-Learn. Retrieved 20 May 2025.
  7. ^ a b López de Prado, M. & Zoonekynd, V. (2025). "Correcting the Factor Mirage: A Research Protocol for Causal Factor Investing." Available at SSRN: https://ssrn.com/abstract=4697929 or http://dx.doi.org/10.2139/ssrn.4697929