Benign Overfit (Surprises in High-Dimensional Ridgeless Least Squares Interpolation by Hastie et al.)

In the past, I used the word overfit as synonymous with bad predictive power. This was wrong. It turns out that there are models that fit the training data perfectly and still predict well out-of-sample.

For a couple of years now, one could read about the finding that overfitted models can generalize well out-of-sample, but it seemed unclear whether this was a general phenomenon. Luckily, benign overfit can be explained even in the context of simple linear regression. The main takeaway is that, counterintuitively, increasing the number of parameters beyond the number of data points acts as regularisation and stabilizes estimates. With more parameters than data points, there are many models to choose from that fit the training data perfectly. The ridgeless regression estimator studied in the paper and, more importantly, gradient descent in neural networks tend to choose the model with the least variability.

Setup

Hastie et al. study the estimator $$ \hat{\beta}=\arg\min \{||b||_2: b \text{ minimizes } ||y-Xb||_2\}, $$ in the model $$ y_i = \beta'x_i + \epsilon_i,\quad x_i\sim\mathcal{N}(0,\Sigma), \: \epsilon_i\sim\mathcal{N}(0,\sigma^2), \: i = 1,\ldots, n, $$ with $y$ and $X$ the stacked observations $y_i$ and predictors $x_i$. The main parameter is the number of columns of $X$, $p$, compared to the number of rows, or data points, $n$. When $p \leq n$, $\hat{\beta}$ is just the Ordinary Least Squares (OLS) estimator. When $p>n$, $\hat{\beta}$ is the OLS estimator that inverts the covariance matrix with the Moore-Penrose pseudoinverse, $\hat{\beta} = (X'X)^+ X'y$. It corresponds to the limit of a ridge regression as the regularisation parameter goes to zero, hence the term ridgeless regression. The paper studies the prediction risk of the estimator at a new test point $x_0$, conditional on a realization of the training predictors, i.e., $\mathbb{E}[(x_0'\hat{\beta} - x_0'\beta)^2|X]$, which naturally decomposes into a variance and a squared bias component. They compare it to the null risk $||\beta||_2^2$ of the trivial estimator $\hat{\beta}=0$.

The overparametrisation ratio is $\gamma=p/n$; the paper focuses on the case $\gamma > 1$, in which all models under comparison fit the training data perfectly.

Results

It is useful to define the signal-to-noise ratio as the squared norm of the true coefficients relative to the irreducible noise variance, $SNR = \frac{||\beta||_2^2}{\sigma^2}$.

No factor structure

For an identity $\Sigma$, the ridgeless regression beats the null risk if $SNR > 1$ and if $\gamma > SNR / (SNR - 1)$. In that case, the squared bias is $||\beta||_2^2(1-1/\gamma)$ and the variance is $\sigma^2 / (\gamma-1)$. So the squared bias increases in the number of parameters, while the variance decreases. As a result, $\hat{\beta}$ has minimum risk at $\gamma = \sqrt{SNR} / (\sqrt{SNR} - 1)$. The lower the signal-to-noise ratio, the higher this optimal $\gamma$, because more noise calls for more heavily regularised estimators. For an identity $\Sigma$, however, the risk of the ridgeless regression is dominated by that of an optimally tuned ridge regression. For correlated features without factor structure, qualitatively similar results hold, but with less neat analytical expressions.
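To see where the optimal $\gamma$ comes from, set the derivative of the risk with respect to $\gamma$ to zero: $$ \frac{d}{d\gamma}\left[||\beta||_2^2\left(1-\frac{1}{\gamma}\right) + \frac{\sigma^2}{\gamma-1}\right] = \frac{||\beta||_2^2}{\gamma^2} - \frac{\sigma^2}{(\gamma-1)^2} = 0. $$ Dividing by $\sigma^2$ and taking square roots gives $\sqrt{SNR}\,(\gamma - 1) = \gamma$, i.e., $\gamma = \sqrt{SNR}/(\sqrt{SNR}-1)$.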

Factor structure

They also investigate a model with a factor structure in which the signal is due to a few latent features that are correlated with $x$, $$ y_i = \theta'z_i + \xi_i,\quad x_{ij} = w_j'z_i + u_{ij}, \: z_i\sim\mathcal{N}(0,I_d), \: \xi_i\sim\mathcal{N}(0,\sigma_{\xi}^2), \: u_{ij} \sim\mathcal{N}(0,1), $$ where $d$ is much smaller than $p$. In this model, the risk is monotonically decreasing in $\gamma$. The intuition is that, as the number of predictors increases, the model represents $z_i$ better, averaging out the noise $u_{ij}$ in the approximation of $x$.

Simulation

Let us check these results with simulated data. We run two sets of simulations:

  1. Vary $\gamma$ from 1 to 3 while keeping SNR fixed, generate training data and fit the ridgeless regression. Evaluate the R-squared on independent realizations out-of-sample (the test set).
  2. Fixing $\gamma$ at the optimal value $\sqrt{SNR} / (\sqrt{SNR} - 1)$, fit both the ridgeless regression and a ridge regression with hyperparameter tuned via cross-validation. Compare the two models' R-squared.

All simulations are for identity $\Sigma$ (isotropic model).

Preparation

First, load some packages and define the necessary functions and classes.
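The original code is not shown here; a minimal Python setup, assuming NumPy and scikit-learn (hypothetical choices, since the original environment is unknown), might be:

```python
import numpy as np
from sklearn.linear_model import RidgeCV  # used later for the tuned ridge benchmark

# one seeded generator keeps both simulations reproducible
rng = np.random.default_rng(42)
```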

Define the ridgeless regression estimator:
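A sketch of what such an estimator could look like (class name is my own): `np.linalg.lstsq` already returns the minimum-$\ell_2$-norm solution, which is exactly the ridgeless estimator $(X'X)^+ X'y$.

```python
import numpy as np

class RidgelessRegression:
    """Minimum-norm least squares, (X'X)^+ X'y.

    Coincides with OLS when p <= n; when p > n it picks, among all
    interpolating solutions, the one with the smallest l2 norm.
    """

    def fit(self, X, y):
        # lstsq returns the minimum-norm solution for rank-deficient X
        self.coef_, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def predict(self, X):
        return X @ self.coef_
```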

Functions for data generation:
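A plausible version of these helpers (function names are my own), drawing from the isotropic model with a coefficient vector scaled to a target SNR:

```python
import numpy as np

def make_beta(p, snr, sigma, rng):
    """Random direction scaled so that ||beta||_2^2 / sigma^2 = snr."""
    b = rng.standard_normal(p)
    return b * np.sqrt(snr) * sigma / np.linalg.norm(b)

def generate_data(n, p, beta, sigma, rng):
    """Draw (X, y) from y_i = beta'x_i + eps_i with Sigma = identity."""
    X = rng.standard_normal((n, p))
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y
```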

Utilities:
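The original utilities are not shown; an out-of-sample $R^2$ helper is presumably among them. A minimal version:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Share of variance explained; can be negative out-of-sample."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot
```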

Simulation 1: varying $\gamma$

First, we define the parameter settings:
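Since the original settings are not shown, the following values are assumptions chosen to match the description (fixed SNR, $\gamma$ from 1 to 3):

```python
import numpy as np

n = 100                              # training sample size
sigma = 1.0                          # noise standard deviation
snr = 4.0                            # ||beta||^2 / sigma^2
n_test = 10_000                      # test-set size
gammas = np.linspace(1.0, 3.0, 21)   # overparametrisation ratios p/n
ps = (gammas * n).round().astype(int)
```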

Now, we simulate:
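A self-contained sketch of this loop (parameter values are my own assumptions): for each $\gamma$, draw training data, fit the minimum-norm estimator, and evaluate on an independent test draw.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_test, sigma, snr = 100, 10_000, 1.0, 4.0

gammas = np.linspace(1.0, 3.0, 21)
r2s, coef_norms = [], []
for gamma in gammas:
    p = int(round(gamma * n))
    beta = rng.standard_normal(p)
    beta *= np.sqrt(snr) * sigma / np.linalg.norm(beta)   # fix the SNR
    X = rng.standard_normal((n, p))
    y = X @ beta + sigma * rng.standard_normal(n)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)          # ridgeless fit
    # independent out-of-sample realization
    X_te = rng.standard_normal((n_test, p))
    y_te = X_te @ beta + sigma * rng.standard_normal(n_test)
    resid = y_te - X_te @ coef
    r2s.append(1 - np.sum(resid ** 2) / np.sum((y_te - y_te.mean()) ** 2))
    coef_norms.append(np.linalg.norm(coef))
```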

We evaluate the out-of-sample $R^2$ of the ridgeless regression for different values of $p$, keeping the signal-to-noise ratio fixed. For values of $p$ equal to or slightly larger than the sample size $n$, this out-of-sample $R^2$ is hugely negative, corresponding to overfit that generalizes badly. The first dotted line is at $\gamma = SNR / (SNR-1)$, the point beyond which the ridgeless regression should beat the null risk. Indeed, the $R^2$ becomes positive around that point. The dashed line marks the value of $\gamma$ at which the risk should be minimized.

To elucidate the mechanism, we also plot the norm of the estimated coefficients. As expected, it decreases with $p$ as the minimum norm estimator can be chosen from a larger class of models.
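The shrinking norm can also be checked in isolation with a quick standalone sketch (settings are made up; the signal strength is held fixed while $p$ grows):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
norms = []
for p in [60, 120, 240, 480]:
    beta = np.zeros(p)
    beta[0] = 2.0                              # same signal strength at every p
    X = rng.standard_normal((n, p))
    y = X @ beta + rng.standard_normal(n)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    norms.append(np.linalg.norm(coef))
# norms shrinks as p grows: the minimum-norm interpolator is chosen
# from an ever larger set of perfectly fitting models
```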

Simulation 2: fix $\gamma$, ridgeless vs. ridge

We simulate with the same parameter settings and seed, except that $\gamma$ is fixed at $\sqrt{SNR} / (\sqrt{SNR} - 1)$. In each simulation, we train and predict with the ridgeless regression and with a ridge regression tuned via cross-validation.
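A sketch of this comparison, using scikit-learn's `RidgeCV` for the tuned benchmark (the alpha grid and simulation count are my own assumptions):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(42)
n, n_test, sigma, snr, n_sim = 100, 2_000, 1.0, 4.0, 20
gamma = np.sqrt(snr) / (np.sqrt(snr) - 1)   # optimal ratio; equals 2 for SNR = 4
p = int(round(gamma * n))

r2_ridgeless, r2_ridge = [], []
for _ in range(n_sim):
    beta = rng.standard_normal(p)
    beta *= np.sqrt(snr) * sigma / np.linalg.norm(beta)
    X = rng.standard_normal((n, p))
    y = X @ beta + sigma * rng.standard_normal(n)
    X_te = rng.standard_normal((n_test, p))
    y_te = X_te @ beta + sigma * rng.standard_normal(n_test)

    coef, *_ = np.linalg.lstsq(X, y, rcond=None)            # ridgeless fit
    ridge = RidgeCV(alphas=np.logspace(-2, 3, 20),
                    fit_intercept=False).fit(X, y)           # tuned ridge fit

    def oos_r2(pred):
        return 1 - np.sum((y_te - pred) ** 2) / np.sum((y_te - y_te.mean()) ** 2)

    r2_ridgeless.append(oos_r2(X_te @ coef))
    r2_ridge.append(oos_r2(ridge.predict(X_te)))
```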

The following plot shows the $R^2$ for each simulation, split by model (ridgeless vs. ridge) and by sample (in-sample vs. out-of-sample).

Contrary to the ridgeless regression, the tuned ridge regression does not fit the data perfectly in-sample. Out-of-sample, the ridge regression has a slightly higher average $R^2$, in line with the theoretical prediction. Perhaps more importantly, the out-of-sample $R^2$ of the ridge regression is less variable than that of the ridgeless regression. Still, it is impressive that the non-cross-validated, perfectly overfit ridgeless regression performs almost on par with the tuned ridge regression.

Reference

Hastie, T., Montanari, A., Rosset, S. and Tibshirani, R.J., 2019. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560. https://arxiv.org/abs/1903.08560.