A primary goal when creating a Fit is to achieve high predictive accuracy.
HyperStudy provides several metrics that can be used to quantitatively judge
the quality of a Fit. Selecting a Fit based only on how the metrics perform on
the input data is simple, but may result in overfitting the model.
Tip: These metrics are presented in the Post Processing step, Diagnostic tab
of the Fit. For more information, see Diagnostics Post Processing.
Overfitting describes the phenomenon of a Fit that scores very well on the
input data diagnostics, but produces inaccurate predictions when presented
with new data. Essentially, the model has been tuned too specifically to the
exact input data.
Figure 1. Difference between Two Curves Fitting the Same Data Points
In Figure 1, the blue curve reproduces the exact values of the green data
points, while the red curve follows the data trend without chasing small
deviations in the original data. In most cases
the red curve will generalize to new data better than the overfit blue curve.
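The effect can be reproduced with a simple polynomial example. This is an illustrative sketch in Python with NumPy, not HyperStudy functionality: a degree-5 polynomial passes exactly through six perturbed samples of a quadratic trend (the "blue curve"), while a degree-2 fit captures only the trend (the "red curve") and predicts a new point far more accurately.

```python
import numpy as np

# Six samples of a quadratic trend with a small, fixed perturbation.
x = np.arange(6.0)                          # 0, 1, 2, 3, 4, 5
noise = np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1])
y = x**2 + noise

overfit = np.polyfit(x, y, 5)   # passes through every point exactly
trend = np.polyfit(x, y, 2)     # captures only the overall trend

# The overfit curve has (near) zero error on the input data, but the
# trend fit predicts a new point (x = 6, true value 36) far better.
err_overfit = abs(np.polyval(overfit, 6.0) - 36.0)
err_trend = abs(np.polyval(trend, 6.0) - 36.0)
```

Judged only on the input data, the degree-5 fit looks perfect; judged on the new point, the degree-2 fit wins, which is exactly the trade-off the figure illustrates.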
To avoid overfitting, a Fit is trained with three
conceptually distinct sets of data: input data is used to build the Fit,
validation data is used to tune and compare different Fit options, and
testing data is used in a final step to quantify the predictive ability on
unseen data.
Note: Test data is never used in
the construction and tuning of the Fit.
In HyperStudy, testing data is optional, and the validation data is
automatically constructed from the input data using a technique known as
k-fold cross-validation.
This technique begins with the input data and segments it into multiple folds
(or groups). For example, with 10 data points and 3 folds, the folding may
look like:

Fold 1: Runs 1, 4, 7, 10
Fold 2: Runs 2, 5, 8
Fold 3: Runs 3, 6, 9
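The folding in the table above can be reproduced with a simple round-robin assignment. This is a sketch in Python; HyperStudy's actual assignment scheme is internal and may differ.

```python
def assign_folds(n_runs, n_folds):
    """Assign runs 1..n_runs to folds 1..n_folds in round-robin order."""
    folds = {f: [] for f in range(1, n_folds + 1)}
    for run in range(1, n_runs + 1):
        folds[(run - 1) % n_folds + 1].append(run)
    return folds

# 10 runs across 3 folds reproduces the table above:
# {1: [1, 4, 7, 10], 2: [2, 5, 8], 3: [3, 6, 9]}
```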
A fold is first withheld and a response surface is built using the remaining
data. The predictions are then tested against the data from the withheld fold.
In this example, a Fit is first built using folds 2 and 3 and tested on
fold 1. Next, it is built using data from folds 1 and 3 and tested on fold 2.
This process continues for each fold. When the process is complete, the
predictions on the folded data sets are compared to their known values, and
traditional diagnostic measures
can be evaluated. Selecting a Fit based on cross-validation metrics is good practice to
ensure a balance between predictive accuracy and avoiding overfitting. The
size of the cross-validation folds can be set via the Cross-Validation option,
accessed in the Evaluate step of the Fit; when the method Fit Automatically
Selected by Training is used, an internal fold size is calculated to ensure
efficiency.
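The complete procedure can be sketched as follows. This is illustrative Python, using a least-squares quadratic as a stand-in for the response surface; the function name, the round-robin fold membership, and the RMSE diagnostic are assumptions for the sketch, not HyperStudy's internal implementation.

```python
import numpy as np

def k_fold_rmse(x, y, n_folds=3, degree=2):
    """Cross-validated RMSE: each fold is withheld in turn, a surrogate
    model is built on the remaining folds, and the withheld points are
    then predicted by that model."""
    idx = np.arange(len(x))
    preds = np.empty(len(y))
    for f in range(n_folds):
        test = (idx % n_folds) == f                    # withheld fold
        coef = np.polyfit(x[~test], y[~test], degree)  # build on the rest
        preds[test] = np.polyval(coef, x[test])        # predict withheld fold
    # Compare the folded predictions to their known values.
    return float(np.sqrt(np.mean((preds - y) ** 2)))
```

Because every prediction in the diagnostic comes from a model that never saw that point during construction, comparing candidate Fits on this score penalizes overfit models in a way that input-data diagnostics cannot.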