PLS Predict

Abstract

The PLS Predict algorithm was developed by Shmueli et al. (2016). The method uses training and holdout samples to generate and evaluate predictions from PLS path model estimations.

Description

The research by Shmueli et al. (2016) proposes a set of procedures for prediction with PLS path models and for the evaluation of their predictive performance. These procedures are combined in the PLSpredict package (https://github.com/ISS-Analytics/pls-predict) for the statistical software R. They allow generating different out-of-sample and in-sample predictions (e.g., case-wise and average predictions), which facilitate the evaluation of the predictive performance when analyzing new data (i.e., data that were not used to estimate the PLS path model). The analysis serves as a diagnostic for possible overfitting of the PLS path model to the training data.

Based on the procedures suggested by Shmueli et al. (2016), the current PLS Predict algorithm implementation in the SmartPLS software allows researchers to obtain k-fold cross-validated prediction errors and prediction error summary statistics, such as the root mean squared error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE), to assess the predictive performance of their PLS path model for the manifest variables (MVs or indicators) and the latent variables (LVs or constructs). Note that all three criteria are available for the MV results, while only the RMSE and MAE can be computed for the LV results. These criteria allow researchers to compare the predictive performance of alternative PLS path models.
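To make the definitions of these summary statistics concrete, the following minimal Python sketch computes RMSE, MAE, and MAPE from a vector of observed values and a vector of cross-validated predictions for a single indicator. The function name and example values are illustrative assumptions, not part of the SmartPLS implementation.

import numpy as np

def prediction_error_summaries(y_true, y_pred):
    # Cross-validated prediction errors for one manifest variable (indicator)
    errors = y_true - y_pred
    rmse = np.sqrt(np.mean(errors ** 2))            # root mean squared error
    mae = np.mean(np.abs(errors))                   # mean absolute error
    mape = 100 * np.mean(np.abs(errors / y_true))   # mean absolute percentage error
    return rmse, mae, mape

# Hypothetical observed values and cross-validated predictions
y_true = np.array([3.0, 4.0, 5.0, 4.0])
y_pred = np.array([2.8, 4.3, 4.6, 4.2])
print(prediction_error_summaries(y_true, y_pred))

Lower values on all three statistics indicate a smaller prediction error. Because MAPE divides by the observed values, it is only meaningful for indicators that cannot take the value zero.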

In addition, to assess the results of a specific PLS path model, its predictive performance can be compared against two naïve benchmarks:

(1) The Q² value in PLS Predict compares the prediction errors of the PLS path model against simple mean predictions. For this purpose, it uses the mean value of the training sample to predict the outcomes of the holdout sample. The interpretation of the Q² value is similar to the assessment of Q² values obtained by the blindfolding procedure in PLS-SEM. If the Q² value is positive, the prediction error of the PLS-SEM results is smaller than the prediction error of simply using the mean values; in that case, the PLS-SEM model offers better predictive performance (see the sketch following this list).

(2) The linear regression model (LM) benchmark offers prediction errors and summary statistics that ignore the specified PLS path model. Instead, the LM approach regresses each endogenous indicator variable on all exogenous indicator variables to generate predictions. A comparison with the PLS-SEM results thereby shows whether using a theoretically established path model improves (or at least does not worsen) the predictive performance relative to using the available indicator data alone. The PLS-SEM results should have a lower prediction error (e.g., in terms of RMSE or MAE) than the LM. Note that the LM prediction errors are only available for the manifest variables, not for the latent variables.
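The following Python sketch illustrates both benchmarks under simplifying assumptions: q2_predict computes the Q² value from holdout predictions and the training-sample mean, and lm_benchmark fits an ordinary least squares regression of each endogenous indicator on all exogenous indicators. Function names and the data layout are illustrative, not part of SmartPLS or the PLSpredict package.

import numpy as np

def q2_predict(y_holdout, y_pred, y_train_mean):
    # Q² = 1 - SSE(model) / SSE(naive mean prediction); a positive value
    # means the model predicts better than the training-sample mean
    sse_model = np.sum((y_holdout - y_pred) ** 2)
    sse_mean = np.sum((y_holdout - y_train_mean) ** 2)
    return 1.0 - sse_model / sse_mean

def lm_benchmark(X_train, Y_train, X_holdout):
    # Regress each endogenous indicator (column of Y_train) on all exogenous
    # indicators (columns of X_train), with intercept, and predict the holdout
    X1 = np.column_stack([np.ones(len(X_train)), X_train])
    coefs, *_ = np.linalg.lstsq(X1, Y_train, rcond=None)
    X1_holdout = np.column_stack([np.ones(len(X_holdout)), X_holdout])
    return X1_holdout @ coefs

# Hypothetical data: 3 exogenous indicators predicting 2 endogenous indicators
rng = np.random.default_rng(1)
X_tr, Y_tr = rng.normal(size=(80, 3)), rng.normal(size=(80, 2))
X_ho, Y_ho = rng.normal(size=(20, 3)), rng.normal(size=(20, 2))
Y_lm = lm_benchmark(X_tr, Y_tr, X_ho)
print(q2_predict(Y_ho[:, 0], Y_lm[:, 0], Y_tr[:, 0].mean()))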

Additional procedures and extensions are under development and may become part of future SmartPLS releases.

PLS Predict Settings in SmartPLS

Number of Folds

Default: 10

In k-fold cross-validation, the algorithm splits the full dataset into k equally sized subsets (folds). The algorithm then predicts each fold (the holdout sample) with the remaining k-1 folds, which, in combination, become the training sample. For example, when k equals 10 (i.e., 10 folds), a dataset of 200 observations will be split into 10 subsets of 20 observations each. The algorithm then predicts each fold in turn, using the remaining nine subsets as the training sample, so that ten prediction rounds are carried out in total. The splitting logic is sketched below.
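This Python sketch shows only the fold assignment with hypothetical data; the actual fold handling and PLS path model estimation are performed internally by SmartPLS.

import numpy as np

rng = np.random.default_rng(seed=42)  # hypothetical seed for reproducibility
n_obs, k = 200, 10

# Randomly assign the 200 observations to 10 folds of 20 observations each
indices = rng.permutation(n_obs)
folds = np.array_split(indices, k)

for i, holdout in enumerate(folds):
    # The remaining k-1 folds combine into the training sample for this round
    training = np.concatenate([fold for j, fold in enumerate(folds) if j != i])
    # ... estimate the PLS path model on `training`, then predict `holdout` ...
    print(f"Fold {i + 1}: {len(training)} training / {len(holdout)} holdout observations")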

Number of Repetitions

Default: 10

The number of repetitions indicates how often the PLS Predict algorithm runs the k-fold cross-validation on random splits of the full dataset into k folds.

Traditionally, cross-validation uses only a single random split into k folds. However, a single random split can make the predictions strongly dependent on the random assignment of observations to the k folds. Due to this random partitioning of the data, executions of the algorithm at different points in time may vary in their predictive performance results (e.g., RMSE, MAPE, etc.).

Repeating the k-fold cross-validation with different random data partitions and computing the average across the repetitions ensures a more stable estimate of the predictive performance of the PLS path model.
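A Python sketch of this repetition logic, assuming a hypothetical callback predict(training, holdout) that estimates the model on the training observations and returns predictions for the holdout observations:

import numpy as np

def repeated_cv_rmse(y, predict, k=10, repetitions=10, seed=0):
    # Average the cross-validated RMSE over several random k-fold partitions
    rng = np.random.default_rng(seed)
    rmse_per_repetition = []
    for _ in range(repetitions):
        indices = rng.permutation(len(y))
        folds = np.array_split(indices, k)
        errors = np.empty(len(y))
        for i, holdout in enumerate(folds):
            training = np.concatenate([f for j, f in enumerate(folds) if j != i])
            errors[holdout] = y[holdout] - predict(training, holdout)
        rmse_per_repetition.append(np.sqrt(np.mean(errors ** 2)))
    return float(np.mean(rmse_per_repetition))

# Hypothetical usage with a naive mean-prediction "model" as the callback
y = np.random.default_rng(2).normal(loc=4.0, size=200)
print(repeated_cv_rmse(y, lambda training, holdout: np.full(len(holdout), y[training].mean())))

Reporting the average across repetitions reduces the dependence of the reported statistics on any single random partition.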

References

Shmueli, G., Ray, S., Velasquez Estrada, J. M., & Chatla, S. B. (2016). The Elephant in the Room: Predictive Performance of PLS Models. Journal of Business Research, 69(10), 4552-4564.