In the past, I used the word overfit as synonymous with bad predictive power. This was wrong. It turns out that there are models that fit the training data perfectly and still predict well out-of-sample.
For a couple of years now, one could read about findings that overfit models can generalize well out-of-sample, but it seemed unclear whether this was a general phenomenon. Luckily, benign overfit can be explained even in the context of simple linear regression. The main takeaway is that, counterintuitively, increasing the number of parameters beyond the number of data points acts as regularisation and stabilizes estimates. With more parameters than data points, there are many models to choose from that fit the data perfectly. The ridgeless regression estimator studied in the paper and, more importantly, gradient descent in neural networks tend to choose the model with the least variability.
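The claim about gradient descent can be illustrated with a minimal sketch (the Gaussian data, zero initialization, and plain full-batch gradient descent are my assumptions, not from the paper): started at zero, gradient descent on the squared error never leaves the row space of $X$, so it converges to the same minimum-norm interpolator that the pseudoinverse computes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100  # more parameters than data points
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Minimum-norm interpolator via the Moore-Penrose pseudoinverse.
beta_pinv = np.linalg.pinv(X) @ y

# Plain gradient descent on the squared error, starting from zero.
beta_gd = np.zeros(p)
lr = 0.01
for _ in range(20_000):
    grad = X.T @ (X @ beta_gd - y) / n
    beta_gd -= lr * grad

# Both interpolate the training data and agree with each other.
print(np.allclose(X @ beta_gd, y, atol=1e-6))   # True
print(np.allclose(beta_gd, beta_pinv, atol=1e-6))  # True
```

The agreement is exact in the limit: any component of $\hat{\beta}$ outside the row space of $X$ is never touched by the gradient, and it starts at zero.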
Hastie et al. study the estimator $$ \hat{\beta}=\arg\min \{||b||_2: b \text{ minimizes } ||y-Xb||_2\}, $$ in the model $$ y_i = \beta'x_i + \epsilon_i,\quad x_i\sim\mathcal{N}(0,\Sigma), \: \epsilon_i\sim\mathcal{N}(0,\sigma^2), \: i \in 1,\ldots, n, $$ with $y$ and $X$ the stacked observations $y_i$ and predictors $x_i$. The main parameter is the number of columns of $X$, $p$, compared to the number of rows, or data points, $n$. When $p \leq n$, $\hat{\beta}$ is just the Ordinary Least Squares (OLS) estimator. When $p>n$, $\hat{\beta}$ is the OLS estimator that inverts the covariance matrix with the Moore-Penrose pseudoinverse, $\hat{\beta} = (X'X)^+ X'y$. It corresponds to the limit of a ridge regression as the regularisation parameter goes to zero, hence the term ridgeless regression. The paper studies the prediction risk of the estimator conditional on a realization of the predictors, i.e., $\mathbb{E}[(x_0'\hat{\beta} - x_0'\beta)^2 \mid X]$ for a new draw $x_0$, which naturally decomposes into a variance and a squared bias component. They compare it to the null risk $||\beta||_2^2$, the risk of the trivial estimator $\hat{\beta}=0$.
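The ridge limit can be checked numerically. A small sketch (assuming nothing beyond NumPy): as the regularisation parameter shrinks, the ridge solution $(X'X + \lambda I)^{-1} X'y$ approaches the pseudoinverse solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 40  # overparametrised: p > n
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Minimum-norm least squares via the pseudoinverse.
beta_ridgeless = np.linalg.pinv(X.T @ X) @ X.T @ y

# Ridge solutions for a decreasing regularisation parameter:
# the distance to the ridgeless solution shrinks towards zero.
for lam in [1.0, 1e-3, 1e-8]:
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(lam, np.linalg.norm(beta_ridge - beta_ridgeless))
```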
The overparametrisation ratio $\gamma=p/n$ is positive, and they focus on the case where it is larger than one, i.e., where only models that fit the data perfectly are compared.
It is useful to define the signal-to-noise ratio as the squared norm of the true coefficients relative to the irreducible error variance, $SNR = \frac{||\beta||_2^2}{\sigma^2}$.
For an identity $\Sigma$, the ridgeless regression beats the null risk if $SNR > 1$ and $\gamma > SNR / (SNR - 1)$. In that case, the squared bias is $||\beta||_2^2(1-1/\gamma)$ and the variance is $\sigma^2 / (\gamma-1)$. So the squared bias increases in the number of parameters, while the variance decreases. As a result, $\hat{\beta}$ has minimum risk at $\gamma = \sqrt{SNR} / (\sqrt{SNR} - 1)$. The lower the signal-to-noise ratio, the higher this optimal $\gamma$, because more noise calls for more heavily regularised estimators. For an identity $\Sigma$, though, the risk of the ridgeless regression is dominated by the risk of an optimally tuned ridge regression. For correlated features without factor structure, qualitatively similar results hold, but with less neat analytical expressions.
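These expressions can be evaluated directly as a quick check (only the formulas as stated above are assumed): with $SNR = 3$ and $\sigma^2 = 1$, the risk $||\beta||_2^2(1-1/\gamma) + \sigma^2/(\gamma-1)$ should be minimised near $\gamma = \sqrt{SNR}/(\sqrt{SNR}-1) \approx 2.37$ and drop below the null risk $||\beta||_2^2$ just above $\gamma = SNR/(SNR-1) = 1.5$.

```python
import numpy as np

snr, sigma2 = 3.0, 1.0
beta_norm2 = snr * sigma2  # ||beta||^2 such that SNR = ||beta||^2 / sigma^2

gammas = np.linspace(1.01, 6, 2000)
bias2 = beta_norm2 * (1 - 1 / gammas)
variance = sigma2 / (gammas - 1)
risk = bias2 + variance

gamma_min = gammas[np.argmin(risk)]
print(gamma_min)                        # close to sqrt(3)/(sqrt(3)-1) ≈ 2.37
print(gammas[risk < beta_norm2].min())  # close to 3/(3-1) = 1.5
```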
They also investigate a model with a factor structure in which the signal is due to a few latent features that are correlated with $x$, $$ y_i = \theta'z_i + \xi_i,\quad x_{ij} = w_j'z_i + u_{ij}, \: z_i\sim\mathcal{N}(0,I_d), \: \xi_i\sim\mathcal{N}(0,\sigma_{\xi}^2), \: u_{ij} \sim\mathcal{N}(0,1), $$ where $d$ is much smaller than $p$. In this model, the risk is monotonically decreasing in $\gamma$. The intuition is that, as the number of predictors increases, the model represents $z_i$ better, averaging out the idiosyncratic noise $u_{ij}$ in the predictors.
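A data-generating function for this latent-factor model might look as follows (a sketch; the function name, the choice $\theta = \mathbf{1}$, and the default $\sigma_{\xi} = 1$ are mine):

```python
import numpy as np

def gen_dat_factor(n, p, d, sigma_xi=1.0, rng=None):
    """Simulate from the latent-factor model: y depends on d latent
    features z, and each of the p observed predictors is a noisy
    linear readout of z."""
    rng = np.random.default_rng(rng)
    theta = np.ones(d)                       # latent coefficients
    w = rng.normal(size=(d, p))              # loadings of x on z
    z = rng.normal(size=(n, d))              # latent features
    u = rng.normal(size=(n, p))              # idiosyncratic noise in x
    xi = rng.normal(scale=sigma_xi, size=n)  # noise in y
    x = z @ w + u
    y = z @ theta + xi
    return x, y

x, y = gen_dat_factor(n=50, p=500, d=5)
print(x.shape, y.shape)  # (50, 500) (50,)
```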
Let us check these results with simulated data. We run two sets of simulations: first, varying the number of parameters $p$ at a fixed signal-to-noise ratio; second, comparing the ridgeless regression to a cross-validated ridge regression at the risk-minimising $\gamma$.
All simulations are for an identity $\Sigma$ (isotropic model).
First, load some packages and define the necessary functions and classes.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from glum import GeneralizedLinearRegressorCV
from sklearn.metrics import r2_score
from sklearn.base import RegressorMixin
from tqdm.notebook import tqdm
plt.rcParams["figure.figsize"] = [16.0, 10.0]
sns.set_palette(sns.color_palette("hls", 2))
Define the ridgeless regression estimator:
class RidgelessLinearRegressor(RegressorMixin):
    """Ridgeless linear regression using the Moore-Penrose pseudoinverse."""

    def __init__(self):
        self.beta = None

    def fit(self, x, y):
        xxinv = np.linalg.pinv(np.matmul(x.T, x))
        xy = np.matmul(x.T, y)
        self.beta = np.matmul(xxinv, xy)
        return self

    def predict(self, x):
        return np.matmul(x, self.beta)
Functions for data generation:
def gen_dat(n, snr, p):
    """Simulate from model with isotropic features."""
    bet = np.ones(p) * np.sqrt(snr / p)
    x = np.random.normal(size=(n, p))
    eps = np.random.normal(size=n, scale=1)
    y = np.matmul(x, bet) + eps
    return x, y, bet
def simulate(ns, gamma_max, n, snr, m, seed):
    """Generate data over a range of gamma, in-sample and out-of-sample, fit and evaluate a ridgeless regression."""
    np.random.seed(seed)
    res = []
    p_max = int(np.round(gamma_max * n))  # cast to int so p is a valid dimension
    for sim in tqdm(range(ns)):
        for p in np.arange(n, p_max + 1):
            x, y, bet = gen_dat(n, snr, p)
            x_oos, y_oos, _ = gen_dat(m, snr, p)
            ridgeless = RidgelessLinearRegressor()
            ridgeless = ridgeless.fit(x, y)
            preds = ridgeless.predict(x)
            preds_oos = ridgeless.predict(x_oos)
            _res = pd.DataFrame(
                {
                    "norm_bet": l2(bet),
                    "mean_b": np.mean(ridgeless.beta),
                    "norm_b": l2(ridgeless.beta),
                    "abs_b": np.sum(np.abs(ridgeless.beta)),
                    "r2": r2_score(y, preds),
                    "r2_oos": r2_score(y_oos, preds_oos),
                    "p": p,
                    "sim": sim,
                },
                index=[0],
            )
            res += [_res]
    return pd.concat(res, ignore_index=True)
def simulate_ridge_vs_ridgeless(ns, gamma, n, snr, m, seed, **kwargs):
    """Generate data in-sample and out-of-sample, fit and evaluate ridgeless and ridge regression."""
    np.random.seed(seed)
    res = []
    p = round(gamma * n)
    for sim in tqdm(range(ns)):
        x, y, bet = gen_dat(n, snr, p)
        x_oos, y_oos, _ = gen_dat(m, snr, p)
        # tuned ridge
        ridge = GeneralizedLinearRegressorCV(
            l1_ratio=0, fit_intercept=False, family="normal", **kwargs
        )
        ridge = ridge.fit(x, y)
        preds_ridge = ridge.predict(x)
        preds_ridge_oos = ridge.predict(x_oos)
        # ridgeless
        ridgeless = RidgelessLinearRegressor()
        ridgeless = ridgeless.fit(x, y)
        preds_ridgeless = ridgeless.predict(x)
        preds_ridgeless_oos = ridgeless.predict(x_oos)
        _res = pd.DataFrame(
            {
                "ridgeless_r2": r2_score(y, preds_ridgeless),
                "ridgeless_r2_oos": r2_score(y_oos, preds_ridgeless_oos),
                "ridge_r2": r2_score(y, preds_ridge),
                "ridge_r2_oos": r2_score(y_oos, preds_ridge_oos),
                "ridge_alph": ridge.alpha_,
                "sim": sim,
            },
            index=[0],
        )
        res += [_res]
    return pd.concat(res, ignore_index=True)
Utilities:
def l2(vec):
    """L2-norm."""
    return np.sqrt(np.inner(vec, vec))

def gamma_opt(snr):
    """Optimal gamma as a function of the signal-to-noise ratio."""
    return np.sqrt(snr) / (np.sqrt(snr) - 1)

def gamma_null(snr):
    """Value of gamma at which ridgeless should become better than the null model."""
    return snr / (snr - 1)
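For the signal-to-noise ratio used below, $SNR = 3$, these thresholds evaluate as follows (a quick sanity check; the functions are repeated here so the snippet stands alone):

```python
import numpy as np

def gamma_opt(snr):
    """Risk-minimising overparametrisation ratio."""
    return np.sqrt(snr) / (np.sqrt(snr) - 1)

def gamma_null(snr):
    """Ratio beyond which ridgeless beats the null model."""
    return snr / (snr - 1)

print(gamma_opt(3))   # ≈ 2.37
print(gamma_null(3))  # 1.5
```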
First, we define the parameter settings:
ns = 50
gamma_max = 3
seed = 42
n = 50
snr = 3
m = 500 # observations out of sample
Now, we simulate:
df_res = simulate(ns=ns, gamma_max=gamma_max, n=n, snr=snr, m=m, seed=seed)
We evaluate the out-of-sample $R^2$ of the ridgeless regression for different values of $p$, keeping the signal-to-noise ratio fixed. For values of $p$ equal to or slightly larger than the sample size $n$, this out-of-sample $R^2$ is hugely negative, corresponding to overfit that generalizes badly. The dotted line is at $p = n \cdot SNR / (SNR-1)$, the point at which the ridgeless regression should start to beat the null model. Indeed, the $R^2$ becomes positive around that point. The dashed line marks the value of $\gamma$ at which risk should be minimized.
ax = sns.lineplot(data=df_res, x="p", y="r2_oos")
ax.set_ylim(-1, 1)
ax.axhline(y=0, color="k", linewidth=0.5)
ax.axvline(x=gamma_opt(snr) * n, linestyle="--")
ax.axvline(x=gamma_null(snr) * n, linestyle=":")
ax.set_title(
"Out-of-sample R^2 for perfectly fit estimator increasing in number of parameters"
)
ax.set_ylabel("Out-of-sample R^2")
ax.set_xlabel("Number of parameters (number of data points: 50)")
To elucidate the mechanism, we also plot the norm of the estimated coefficients. As expected, it decreases with $p$ as the minimum norm estimator can be chosen from a larger class of models.
ax = sns.lineplot(data=df_res, x="p", y="norm_b")
ax.set_ylim(0, 5)
ax.axvline(x=gamma_opt(snr) * n, linestyle="--")
ax.axvline(x=gamma_null(snr) * n, linestyle=":")
ax.set_title("Norm of estimated coefficients decreasing in number of parameters")
ax.set_ylabel("L2-norm of beta-hat")
ax.set_xlabel("Number of parameters (number of data points: 50)")
We simulate with the same parameter settings and seed except that $\gamma$ is fixed at $\sqrt{SNR} / (\sqrt{SNR} - 1)$. In each simulation, we train and predict from the ridgeless regression and from a ridge regression that was tuned via cross validation.
df_res2 = simulate_ridge_vs_ridgeless(
ns=ns, gamma=gamma_opt(snr), n=n, snr=snr, m=m, seed=seed
)
The following plot shows the $R^2$ for each simulation, split by sample (train vs. test) and by estimator (ridge vs. ridgeless).
Contrary to the ridgeless regression, the tuned ridge regression no longer fits the training data perfectly. Out-of-sample, ridge regression has a slightly higher average $R^2$, in line with the theoretical prediction. Perhaps more importantly, the out-of-sample $R^2$ of the ridge regression is less variable than that of the ridgeless regression. Still, it is impressive that the non-cross-validated, perfectly overfit ridgeless regression performs almost on par with the tuned ridge regression.
df_r2 = pd.melt(df_res2.drop(columns="ridge_alph"), id_vars="sim", value_name="r2")
df_r2["sample"] = np.where(df_r2["variable"].str.contains("oos"), "test", "train")
df_r2["estimator"] = np.where(
df_r2["variable"].str.contains("ridgeless"), "ridgeless", "ridge"
)
ax = sns.swarmplot(x="sample", y="r2", hue="estimator", data=df_r2, dodge=True)
Hastie, T., Montanari, A., Rosset, S. and Tibshirani, R.J., 2019. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560. https://arxiv.org/abs/1903.08560.