PLS Predict

Abstract

The PLS predict algorithm was developed by Shmueli et al. (2016). The method uses training and holdout samples to generate and evaluate predictions from PLS path model estimations.

Description

The research by Shmueli et al. (2016) proposes a set of procedures for prediction with PLS path models and for the evaluation of their predictive performance. These procedures are combined in the PLSpredict package (https://github.com/ISS-Analytics/pls-predict) for the statistical software R. They allow researchers to generate different out-of-sample and in-sample predictions (e.g., case-wise and average predictions), which facilitates the evaluation of predictive performance on new data (i.e., data that were not used to estimate the PLS path model). The analysis serves as a diagnostic for possible overfitting of the PLS path model to the training data.

Based on the procedures suggested by Shmueli et al. (2016), the current implementation of the PLS predict algorithm in the SmartPLS software allows researchers to obtain k-fold cross-validated prediction errors and prediction error summary statistics (e.g., RMSE and MAPE) to assess the predictive performance of their PLS path model.
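To make these summary statistics concrete, the following minimal Python sketch shows how RMSE and MAPE are typically computed from observed values and cross-validated predictions. It illustrates the standard formulas only; it is not SmartPLS's internal code, and the function names rmse and mape are our own.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared prediction error."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean absolute percentage error (in percent); assumes y_true has no zeros."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Example: observed values and cross-validated predictions for one indicator
y_true = np.array([4.0, 5.0, 3.0, 6.0])
y_pred = np.array([4.2, 4.7, 3.5, 5.4])
print(rmse(y_true, y_pred))  # about 0.43
print(mape(y_true, y_pred))  # about 9.4 (percent)
```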

Additional procedures and extensions are under development and may become part of future SmartPLS releases.

PLS Predict Settings in SmartPLS

Number of Folds

Default: 10

In k-fold cross-validation, the algorithm splits the full dataset into k equally sized subsets (folds). It then predicts each fold (the hold-out sample) with the remaining k-1 subsets, which, in combination, become the training sample. For example, when k equals 10 (i.e., 10 folds), a dataset of 200 observations is split into 10 subsets with 20 observations each. The algorithm then runs ten rounds of predictions, each time predicting one fold with the nine remaining subsets.
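The following Python sketch illustrates this splitting-and-prediction scheme using scikit-learn's KFold. The LinearRegression model is only a hypothetical stand-in for the PLS path model estimation that SmartPLS performs internally, and the synthetic data are for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression  # stand-in for the PLS path model

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))          # 200 observations, as in the example above
y = X @ np.array([0.5, -0.3, 0.8]) + rng.normal(scale=0.5, size=200)

kf = KFold(n_splits=10, shuffle=True, random_state=42)  # 10 folds of 20 observations
errors = np.empty_like(y)

for train_idx, test_idx in kf.split(X):
    # Estimate the model on the k-1 training folds ...
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # ... and predict the held-out fold.
    errors[test_idx] = y[test_idx] - model.predict(X[test_idx])

print(np.sqrt(np.mean(errors ** 2)))  # cross-validated RMSE
```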

Number of Repetitions

Default: 10

The number of repetitions indicates how often the PLS predict algorithm runs the k-fold cross-validation on random splits of the full dataset into k folds.

Traditionally, cross-validation uses only a single random split into k folds. However, a single random split can make the predictions strongly dependent on the random assignment of observations to the folds. Due to this random partitioning of the data, executions of the algorithm at different points in time may vary in their predictive performance results (e.g., RMSE and MAPE).

Repeating the k-fold cross-validation with different random data partitions and computing the average across the repetitions ensures a more stable estimate of the predictive performance of the PLS path model.
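A minimal Python sketch of this repetition logic, building on the k-fold example above (again with a simple regression as a stand-in for the PLS path model, and with repeated_cv_rmse as an assumed, illustrative function name):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression  # stand-in for the PLS path model

def repeated_cv_rmse(X, y, n_folds=10, n_repetitions=10, seed=0):
    """Average the cross-validated RMSE over repeated random k-fold partitions."""
    rmses = []
    for rep in range(n_repetitions):
        # A different random_state per repetition yields a new random partition.
        kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed + rep)
        errors = np.empty_like(y, dtype=float)
        for train_idx, test_idx in kf.split(X):
            model = LinearRegression().fit(X[train_idx], y[train_idx])
            errors[test_idx] = y[test_idx] - model.predict(X[test_idx])
        rmses.append(np.sqrt(np.mean(errors ** 2)))
    # Averaging across repetitions stabilizes the performance estimate.
    return float(np.mean(rmses))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.3, 0.8]) + rng.normal(scale=0.5, size=200)
print(repeated_cv_rmse(X, y))  # averaged over 10 repetitions of 10-fold CV
```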

References

Shmueli, G., Ray, S., Velasquez Estrada, J. M., & Chatla, S. B. (2016). The Elephant in the Room: Predictive Performance of PLS Models. Journal of Business Research, 69(10), 4552-4564.