The material here is largely paraphrased from Hastie, Tibshirani & Friedman (2017), Chapters 5, 7 & 8.
We want to take baby steps towards response surface methods based on non-parametric regression models.
We'll start by considering linear methods via basis expansion, generically, but they don't scale well and are wily when extrapolating.
A desire for something that scales better and extrapolates more gracefully will motivate our transition to GP regression.
We will ultimately see how spline-based methods and GPs are examples of one and the same thing.
Generically, the regression problem involves estimating \(f(x) = \mathbb{E}\{Y \mid x\}\), e.g., via
\[ Y = f(x) + \varepsilon. \]
However, it is extremely unlikely that the true function \(f(x)\) is actually linear in \(x\).
Nonlinear fits, by augmenting/replacing \(x\) with additional variables that are transformations of \(x\), represent an attractive option.
Modern basis expansion methods take the linear modeling apparatus to the extreme by augmenting or replacing the original inputs with a (potentially large) collection of transformed variables.
Denote by \(h_m(x): \mathbb{R}^p \rightarrow \mathbb{R}\), the \(m^\mathrm{th}\) transformation of \(x\), for \(m=1,\dots,M\).
We then model
\[ f(x) = \sum_{m=1}^M \beta_m h_m(x), \]
a linear basis expansion in \(x\).
Once the basis is determined, the models are linear in these new variables.
Examples include the original coordinates \(h_m(x) = x_m\) (recovering the ordinary linear model), polynomial terms such as \(x_j^2\) and \(x_j x_k\), nonlinear transformations such as \(\log(x_j)\) or \(\sqrt{x_j}\), and indicators of regions, e.g., \(h_m(x) = I(L_m \leq x_k < U_m)\).
That last one seems silly at first glance, but ultimately we will be fitting piecewise polynomials, which will lead us to so-called splines.
For now, we assume that \(x\) is one-dimensional.
A piecewise polynomial \(f(x)\) is obtained by dividing the domain of \(x\) into contiguous intervals, delimited by knots, and representing \(f\) by a separate polynomial in each interval.
Common special cases include piecewise constant, linear, quadratic, and cubic models.
Consider two knots \(\xi_1\) and \(\xi_2\) and \(f(x) = \sum_{m=1}^3 \beta_m h_m(x)\) with
\[ \begin{aligned} h_1(x) &= I(x < \xi_1) & h_2(x) &= I(\xi_1 \leq x < \xi_2) & h_3(x) &= I(\xi_2 \leq x). \end{aligned} \]
Consider some simple sinusoidal data.
x <- seq(0,10,length=50)
ytrue <- (sin(pi*x/5) + 0.2*cos(4*pi*x/5))
y <- ytrue + rnorm(length(ytrue), sd=0.1)
plot(x,y); lines(x, ytrue)
xi <- c(3, 7)
hc1 <- function(x) { as.numeric(x < xi[1]) }
hc2 <- function(x) { as.numeric(x >= xi[1] & x < xi[2]) }
hc3 <- function(x) { as.numeric(x >= xi[2]) }
H.c <- data.frame(cbind(hc1(x), hc2(x), hc3(x)))
names(H.c) <- paste("hc", 1:3, sep="")
Performing the fit
pc.fit <- lm(y~.-1, data=H.c)
Creating a predictive grid, and the true output values for comparison.
xx <- seq(-1,11,length=199)
HH.c <- data.frame(cbind(hc1(xx), hc2(xx), hc3(xx)))
names(HH.c) <- paste("hc", 1:3, sep="")
yytrue <- (sin(pi*xx/5) + 0.2*cos(4*pi*xx/5))
Not a fantastic fit.
plot(x,y); lines(xx, yytrue, col="blue")
pc <- predict(pc.fit, newdata=HH.c, interval="prediction")
lines(xx, pc[,1]); lines(xx, pc[,2], col=2, lty=2); lines(xx, pc[,3], col=2, lty=2)
abline(v=xi, col="gray", lty=2)
However, if we augment with slope terms \(h_{m+3}(x) = h_m(x)\,x\), for \(m=1,2,3\), we obtain a piecewise linear fit: OLS amounts to \(\hat{\beta}\) via MLE (slope and intercept) for the \(y\)-values in the \(m^\mathrm{th}\) region.
Creating the basis design and performing the fit.
H.l <- cbind(H.c, hc1(x)*x, hc2(x)*x, hc3(x)*x)
names(H.l)[4:6] <- paste("hl", 1:3, sep="")
pl.fit <- lm(y~.-1, data=H.l)
Creating the predictive design and predicting.
HH.l <- cbind(HH.c, hc1(xx)*xx, hc2(xx)*xx, hc3(xx)*xx)
names(HH.l)[4:6] <- paste("hl", 1:3, sep="")
pl <- predict(pl.fit, newdata=HH.l, interval="prediction")
A better fit.
plot(x,y); lines(xx, yytrue, col="blue")
lines(xx, pl[,1])
lines(xx, pl[,2], col=2, lty=2); lines(xx, pl[,3], col=2, lty=2)
abline(v=xi, col="gray", lty=2)
That fit is better, but there is nothing that forces the line segments to line up.
Can we add some constraints to fix it?
\[ \begin{aligned} f(\xi_1^-) &= f(\xi_1^+) & f(\xi_2^-) &= f(\xi_2^+) \\ \mbox{i.e. } \quad \beta_1 + \xi_1 \beta_4 &= \beta_2 + \xi_1 \beta_5 & \beta_2 + \xi_2 \beta_5 &= \beta_3 + \xi_2 \beta_6. \end{aligned} \]
Adding those constraints makes it so that there are only really four free parameters (not six).
Here is a basis for \(f(x) = \sum_{m=1}^4 \beta_m h_m(x)\) that incorporates those constraints.
\[ \begin{aligned} h_1(x) &= 1 & h_2(x) &= x & h_3(x) &= (x - \xi_1)_+ & h_4(x) &= (x - \xi_2)_+ \end{aligned} \]
In R the hockey stick function could be coded up generically as follows.
hs <- function(x) {
x[x < 0] <- 0
return(x)
}
First create the design from the basis, and then obtain the fit.
H.lc <- rep(1,length(x))
H.lc <- cbind(H.lc, x)
H.lc <- cbind(H.lc, hs(x-xi[1]))
H.lc <- data.frame(cbind(H.lc, hs(x-xi[2])))
names(H.lc) <- paste("hlc", 1:4, sep="")
plc.fit <- lm(y~.-1, data=H.lc)
Then create the predictive design and gather the predictions.
HH.lc <- rep(1,length(xx))
HH.lc <- cbind(HH.lc, xx)
HH.lc <- cbind(HH.lc, hs(xx-xi[1]))
HH.lc <- data.frame(cbind(HH.lc, hs(xx-xi[2])))
names(HH.lc) <- paste("hlc", 1:4, sep="")
plc <- predict(plc.fit, newdata=HH.lc, interval="prediction")
Even better (because it is less “jumpy”).
plot(x,y); lines(xx, yytrue, col="blue")
lines(xx, plc[,1])
lines(xx, plc[,2], col=2, lty=2); lines(xx, plc[,3], col=2, lty=2)
abline(v=xi, col="gray", lty=2)
Smoother fits can be obtained by increasing the order of the local polynomials and the number of continuity constraints imposed at the knots.
A piecewise cubic that is continuous, and has continuous 1st & 2nd derivatives at the knots, is called a cubic spline.
The following basis represents a cubic spline with knots at \(\xi_1\) and \(\xi_2\):
\[ \begin{aligned} h_1(x) &= 1 & h_3(x) &= x^2 & h_5(x) &= (x-\xi_1)_+^3 \\ h_2(x) &= x & h_4(x) &= x^3 & h_6(x) &= (x-\xi_2)_+^3 \\ \end{aligned} \]
\[ \begin{aligned} (3 \mbox{ regions}) &\times (4 \mbox{ parameters per region}) \\ - (2 \mbox{ knots}) &\times (3 \mbox{ constraints per knot})\\ &= 6. \end{aligned} \]
Building the design in the cubic spline basis and obtaining the fit.
H.cs <- rep(1,length(x))
H.cs <- cbind(H.cs, x)
H.cs <- cbind(H.cs, x^2)
H.cs <- cbind(H.cs, x^3)
H.cs <- cbind(H.cs, hs((x-xi[1])^3))
H.cs <- data.frame(cbind(H.cs, hs((x-xi[2])^3)))
names(H.cs) <- paste("x", 1:6, sep="")
fit.cs <- lm(y~.-1, data=H.cs)
Now build the corresponding predictive design and gather predictions.
HH.cs <- rep(1,length(xx))
HH.cs <- cbind(HH.cs, xx)
HH.cs <- cbind(HH.cs, xx^2)
HH.cs <- cbind(HH.cs, xx^3)
HH.cs <- cbind(HH.cs, hs((xx-xi[1])^3))
HH.cs <- data.frame(cbind(HH.cs, hs((xx-xi[2])^3)))
names(HH.cs) <- paste("x", 1:6, sep="")
suppressWarnings(pcs <- predict(fit.cs, newdata=HH.cs, interval="prediction"))
Pretty amazing, but not perfect (yet).
plot(x,y); lines(xx, yytrue, col="blue")
lines(xx, pcs[,1])
lines(xx, pcs[,2], col=2, lty=2); lines(xx, pcs[,3], col=2, lty=2)
abline(v=xi, col="gray", lty=2)
An order-\(M\) spline with knots \(\xi_j\), \(j=1,\dots,K\), is a piecewise polynomial of order \(M\) (i.e., degree \(M-1\)) with continuous derivatives up to order \(M-2\) at the knots.
The basis set is
\[ \begin{aligned} h_j(x) &= x^{j-1}, & j&=1,\dots,M \\ h_{M+\ell}(x) &= (x - \xi_\ell)_+^{M-1}, & \ell &= 1,\dots,K. \end{aligned} \]
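As a concrete illustration, here is one way to build that truncated power basis generically in R, re-using the hs function from above; the tpb name and its default M=4 are my own choices, not from the text.
tpb <- function(x, knots, M=4)
{
  ## global polynomial part: 1, x, ..., x^(M-1)
  H <- sapply(1:M, function(j) x^(j-1))
  ## one truncated power ("hockey stick") term per knot
  Hk <- sapply(knots, function(k) hs(x - k)^(M-1))
  cbind(H, Hk)
}
With M=4 and knots xi, tpb(x, xi) reproduces (up to numerical equivalence) the H.cs design built by hand above.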
The most widely used special cases are \(M=1\), \(M=2\), and \(M=4\): piecewise constant, continuous piecewise linear, and cubic splines, respectively.
It is claimed that cubic splines are the lowest order spline for which the knot-discontinuity (in derivatives) is not visible to the human eye.
The piecewise constant and (discontinuous) linear cases are common too, especially for higher dimensional inputs (\(p > 1\)).
These fixed knot splines are also known as regression splines.
One needs to specify the order of the spline, the number of knots, and their placement.
The bs function in the splines library for R can aid in generating bases.
E.g.,
bs(x, degree=1, knots=c(0.2, 0.4, 0.6), intercept=TRUE)
generates a basis for linear splines with three knots and, because intercept=TRUE is specified, returns an \(n \times 5\) matrix.
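A quick sanity check of that column count (length(knots) + degree + intercept):
library(splines)
dim(bs(x, degree=1, knots=c(0.2, 0.4, 0.6), intercept=TRUE))  ## 50 rows and 3 + 1 + 1 = 5 columns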
Illustrating a fit via bs.
library(splines); fit.cs2 <- lm(y ~ bs(x, knots=xi, degree=3))
suppressWarnings(pcs2 <- predict(fit.cs2,data.frame(x=xx),interval="prediction"))
plot(x,y); lines(xx, yytrue, col="blue"); lines(xx, pcs2[,1]);
abline(v=xi, col="gray", lty=2)
lines(xx, pcs2[,2], col=2, lty=2); lines(xx, pcs2[,3], col=2, lty=2)
Polynomials fit to data tend to be erratic near the boundaries, and extrapolation can be dangerous.
This is exacerbated with splines.
Consider a comparison to a global cubic fit.
fit.cub <- lm(y~x + I(x^2) + I(x^3))
suppressWarnings({pcub <- predict(fit.cub, newdata=data.frame(x=xx),se.fit=TRUE)
pcs3 <- predict(fit.cs2, data.frame(x=xx), se.fit=TRUE)})
plot(xx, sqrt(pcs3$se.fit^2 + summary(fit.cs2)$sigma^2), type="l",
  xlab="x", ylab="pred sd")
lines(xx, sqrt(pcub$se.fit^2 + summary(fit.cub)$sigma^2), col=2)
legend("top", c("cubic spline", "global cubic"), lty=1, col=1:2)
A natural cubic spline is a cubic spline with the additional “constraint” of linearity beyond the boundary knots.
With knots \(\xi_1 < \cdots < \xi_K\), one such basis is
\[ \begin{aligned} N_1(x) &= 1, & N_2(x) &= x, & N_{k+2}(x) &= d_k(x) - d_{K-1}(x), \quad k = 1,\dots,K-2, \end{aligned} \]
where \[ d_k(x) = \frac{(x - \xi_k)_+^3 - (x- \xi_K)_+^3}{\xi_K - \xi_k} \]
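For concreteness, here is a hypothetical hand-rolled version of that basis (the nsb name is mine), re-using hs from above. Note that the knots supplied here should include the boundary knots, whereas R's ns below takes interior knots and adds boundary knots at the range of the data.
nsb <- function(x, knots)
{
  K <- length(knots)
  ## d_k(x) from the display above
  d <- function(k) (hs(x - knots[k])^3 - hs(x - knots[K])^3) / (knots[K] - knots[k])
  N <- cbind(1, x)
  if(K > 2) for(k in 1:(K-2)) N <- cbind(N, d(k) - d(K-1))
  return(N)
}
fit.ns0 <- lm(y ~ nsb(x, c(0, xi, 10)) - 1)  ## should essentially match the ns() fit below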
The ns function in the splines library for R implements such a basis. Paradoxically, this is a simpler model: the linearity constraints beyond the boundary knots free up four degrees of freedom.
fit.ns <- lm(y ~ ns(x, knots=xi))
suppressWarnings(pns <- predict(fit.ns, data.frame(x=xx), se.fit=TRUE))
plot(xx, sqrt(pcs3$se.fit^2 + summary(fit.cs2)$sigma^2), type="l",
  xlab="x", ylab="pred sd")
lines(xx, sqrt(pcub$se.fit^2 + summary(fit.cub)$sigma^2), col=2)
lines(xx, sqrt(pns$se.fit^2 + summary(fit.ns)$sigma^2), col=3)
legend("top", c("cubic spline", "global cubic", "natural cubic spline"),
lty=1, col=1:3)
Now with seven knots rather than the original two.
xi2 <- c(0.5,2,3.5,5,6.5,8,9.5); fit.ns2 <- lm(y ~ ns(x, knots=xi2))
pns2 <- predict(fit.ns2, data.frame(x=xx), interval="prediction")
plot(x,y); lines(xx, yytrue, col="blue"); lines(xx, pns2[,1]);
abline(v=xi2, col="gray", lty=2)
lines(xx, pns2[,2], col=2, lty=2); lines(xx, pns2[,3], col=2, lty=2)
Choosing knots is cumbersome.
Consider instead finding, among all functions \(f(x)\) with two continuous derivatives, the one that minimizes the penalized residual sum of squares:
\[ \mathrm{RSS}(f, \lambda) = \sum_{i=1}^n \{y_i - f(x_i)\}^2 + \lambda \int \{f''(t)\}^2 dt, \]
where \(\lambda\) is a fixed smoothing parameter: \(\lambda = 0\) permits any function that interpolates the data, while \(\lambda = \infty\) tolerates no curvature at all, yielding the simple least squares line.
The solution is a so-called smoothing spline (SS).
Remarkably, it can be shown that there is an explicit, finite-dimensional, unique minimizer: a natural cubic spline with knots at the unique values of the \(x_i\), \(i=1,\dots,n\).
At first glance that seems over-parameterized: as many as \(n\) knots means as many as \(n\) coefficients.
However, the regularization term implies a penalty on the spline coefficients \(\beta_j\), shrinking the overall \(\hat{f}\) towards a linear fit.
Since the SS solution is a natural spline, we can write
\[ f(x) = \sum_{j=1}^n N_j(x) \beta_j \equiv N(x) \beta \equiv N\beta, \]
so the criterion reduces to
\[ \mathrm{RSS}(\beta, \lambda) = ||Y - N\beta||^2 + \lambda \beta^\top \Omega_n \beta, \]
where \(\{\Omega_n\}_{jk} = \int N_j''(t)N_k''(t) dt\).
This is a generalized ridge regression, so \[ \hat{\beta} = (N^\top N + \lambda \Omega_n)^{-1} N^\top Y. \]
A smoothing spline with pre-chosen \(\lambda\) is an example of a linear smoother:
\[ \begin{aligned} \hat{f} &= N(N^\top N + \lambda \Omega_n)^{-1} N^\top Y\\ &= S_\lambda Y. \end{aligned} \]
\(S_\lambda\) is similar to the hat matrix \(H \equiv S_0\) in an OLS regression.
Recall that in that context \(\hat{\beta} = (X^\top X)^{-1} X^\top y\), so that \(\hat{Y} = X\hat{\beta} = Hy\) with \(H = X(X^\top X)^{-1} X^\top\).
Both \(H\) and \(S_\lambda\) are symmetric, positive semidefinite matrices.
\[ \begin{aligned} \mbox{Recall that } \quad \text{Rank}(H) &= \text{tr}(H) = \text{tr}(X(X^\top X)^{-1}X^\top) \\ &= \text{tr}((X^\top X)^{-1}X^\top X) = \text{tr}(I_p) = p \end{aligned} \]
gives the dimension of the projection space, and hence the number of parameters involved in the fit (of the mean \(X\beta\)).
\(S_\lambda\) is of full rank, but since it is not idempotent its rank does not equal its trace.
By analogy then, the effective degrees of freedom (df) of a SS is defined to be
\[ \mathrm{df}_\lambda = \mathrm{trace}(S_\lambda). \]
This allows a more intuitive parameterization.
In R we can use the smooth.spline function (in the base stats library), which has a df argument.
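As a quick check of the trace identity above: the object returned by smooth.spline includes the leverage values (the diagonal of the smoother matrix), which should sum to (approximately) the requested df. The fit.chk name is just for illustration.
fit.chk <- smooth.spline(x, y, df=10)
c(trace=sum(fit.chk$lev), df=fit.chk$df)  ## the two should (essentially) agree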
This encourages a more traditional mode of model selection: entertain several values of df and compare.
fit.ss5 <- smooth.spline(x, y, df=5)
pss5 <- as.matrix(predict(fit.ss5, data.frame(x=xx))$y)
fit.ss10 <- smooth.spline(x, y, df=10)
pss10 <- as.matrix(predict(fit.ss10, data.frame(x=xx))$y)
fit.ss20 <- smooth.spline(x, y, df=20)
pss20 <- as.matrix(predict(fit.ss20, data.frame(x=xx))$y)
Note that the predict method for smooth.spline does not provide an interval calculation.
Nice that it's automatic; not sure I love the fit(s) though.
plot(x,y); lines(xx, yytrue, col="blue")
lines(xx, pss5, lwd=2); lines(xx, pss10, col=2, lwd=2);
lines(xx, pss20, col=3, lwd=2)
legend("topright", c("df = 5", "df = 10", "df = 20"), lty=1, lwd=2, col=1:3)
Cross validation (CV) and the bootstrap are frequentist procedures for estimating out-of-sample prediction error and for quantifying uncertainty in fitted quantities, where doing so analytically or via asymptotics is otherwise difficult.
Some would say CV and the bootstrap are a cheap alternative to MCMC, and there is some truth to that.
Suppose we have a target variable \(Y\), a vector of inputs \(X\), and a prediction \(\hat{f}(X)\) estimated from a training sample.
Quantifying the goodness of fit requires specifying a loss function \(L(Y, \hat{f}(X))\).
Typical choices are
\[ L(y, \hat{f}(x)) = \left \{ \begin{array}{ll} (y - \hat{f}(x))^2 & \mbox{squared error}\\ |y - \hat{f}(x)| & \mbox{absolute error}. \end{array} \right. \]
The training error is the average loss over the training sample:
\[ \overline{\mathrm{err}} = \frac{1}{n} \sum_{i=1}^n L(y_i, \hat{f}(x_i)). \]
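For example, under squared-error loss the training error of the earlier natural spline fit is simply
mean((y - fitted(fit.ns))^2)  ## training error of fit.ns from the natural spline section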
Depending on how \(\hat{f}\) is obtained, we can usually make the training error arbitrarily small, but that says little about how well we will predict new data.
The test error or generalization error is the expected prediction error over an independent test sample
\[ \mathrm{Err} = \mathbb{E}\{L(Y, \hat{f}(X))\}, \]
where both \(X\) and \(Y\) are drawn randomly from their joint distribution (population).
Whereas the training error is available explicitly, we would like to estimate the test error for a particular \(\hat{f}\).
As a (good) model becomes more and more complex, it is better able to adapt to the underlying structure (lower bias), but at the cost of increased estimation variance.
In between there is an optimal complexity that gives the minimum testing error.
So our goals are twofold: model selection, choosing the appropriate level of complexity; and model assessment, estimating the generalization error of the chosen model.
Usually there is a single tuning parameter \(\lambda\) that varies the complexity of the model.
The presumption of a single such \(\lambda\) is a crutch of CV relative to Bayesian methods for the inference of unknown parameters.
If data is overly abundant, the best approach is to randomly divide it into three quarantined parts: a training set for fitting, a validation set for model selection, and a test set for assessing the generalization error of the final choice.
When data is not abundant, we have to be more clever.
But first it will be helpful to have a bit of a digression in order to better understand the bias–variance trade-off as model complexity is varied.
Assume that \(Y = f(x) + \varepsilon\) where \(\mathbb{E}\{\varepsilon\} = 0\) and \(\mathbb{V}\mathrm{ar}(\varepsilon) = \sigma_\varepsilon^2\).
Then the expected prediction error of \(\hat{f}(x)\), under squared-error loss, is
\[ \begin{aligned} \mathrm{Err}(x) &= \mathbb{E}\{(Y(x) - \hat{f}(x))^2\} \\ &= \sigma_\varepsilon^2 + [ \mathbb{E}\{\hat{f}(x)\} - f(x)]^2 + \mathbb{E}\{(\hat{f}(x) - \mathbb{E}\{\hat{f}(x)\})^2\}\\ &= \sigma_\varepsilon^2 + \mathrm{Bias}^2[\hat{f}(x)] + \mathbb{V}\mathrm{ar}(\hat{f}(x)) \\ &= \mbox{Irreducible Error} + \mathrm{Bias}^2 + \mathrm{Variance}. \end{aligned} \]
The more complex \(\hat{f}\) is, the lower the (squared) bias but the higher the variance.
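Here is a Monte Carlo sketch of that trade-off on the sinusoidal example: re-generate noisy responses from the known truth many times, re-fit low- and high-df smoothing splines, and average squared bias and variance over the training inputs. The df values 5 and 20, and the helper variable names, are just illustrative choices.
mc <- 500
p5 <- p20 <- matrix(NA, nrow=mc, ncol=length(x))
for(r in 1:mc) {
  yr <- ytrue + rnorm(length(ytrue), sd=0.1)
  p5[r,] <- predict(smooth.spline(x, yr, df=5), x)$y
  p20[r,] <- predict(smooth.spline(x, yr, df=20), x)$y
}
c(bias2.df5=mean((colMeans(p5) - ytrue)^2), var.df5=mean(apply(p5, 2, var)),
  bias2.df20=mean((colMeans(p20) - ytrue)^2), var.df20=mean(apply(p20, 2, var)))
The low-df fit should show larger squared bias and smaller variance, and vice versa for the high-df fit.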
OLS for a \(p\)-vector of predictors \(x\), i.e., \(\hat{f}_p(x) = x^\top \hat{\beta}\), gives:
\[ \begin{aligned} \mathrm{Err}(x) &= \mathbb{E}\{ (Y(x) - \hat{f}_p(x))^2 \} \\ &= \sigma_\varepsilon^2 + [f(x) - \mathbb{E}\{\hat{f}_p(x)\}]^2 + || h(x)||^2 \sigma_\varepsilon^2, \end{aligned} \]
where \(h(x) = x^\top (X^\top X)^{-1} X^\top\), an \(n\)-vector, describing a projection.
Therefore we can read off that \(\mathbb{V}\mathrm{ar}(\hat{f}_p(x)) = || h(x)||^2 \sigma_\varepsilon^2\).
\[ \frac{1}{n} \sum_{i=1}^n \text{Err}(x_i) = \sigma_\varepsilon^2 + \frac{1}{n} \sum_{i=1}^n[ f(x_i) - \mathbb{E}\{\hat{f}_p(x_i)\}]^2 + \frac{p}{n}\sigma_\varepsilon^2 \]
reveals how model complexity is directly related to \(p\).
The form of the error for a ridge regression \(\hat{f}_\lambda\) is similar.
For the bias, let \(\beta_*\) denote the parameters of the best linear fit \(f\):
\[ \beta_* = \text{argmin}_\beta \; \mathbb{E}\{ (f(x) - x^\top \beta)^2 \}, \]
where the expectation is over the (random) \(x\)-vector of predictors. Then,
\[ \begin{aligned} \mathbb{E}_x&\{ (f(x) - \mathbb{E}\{\hat{f}_\lambda(x)\})^2\} \\ & = \mathbb{E}_x \{ (f(x) - x^\top \beta_*)^2\} + \mathbb{E}_x \{ (x^\top \beta_* - \mathbb{E}\{x^\top \hat{\beta}_\lambda\})^2\} \\ &= \mbox{Av[Model Bias]}^2 + \mbox{Av[Estimation Bias]}^2. \end{aligned} \]
Estimation bias: zero for OLS, positive for ridge and other shrinkage methods, traded off against the benefit of reduced variance.
Model bias: can only be reduced by enlarging the class of models, e.g., by including interactions and transformations of the inputs.
Consider a popular synthetic data-generating mechanism that came from the original MARS paper (Friedman, 1991).
fried <- function (n=50, p=6)
{
if(p < 5) stop("must have at least 5 cols")
X <- matrix(runif(n * p), nrow = n)
Ytrue <- 10*sin(pi*X[,1]*X[,2]) + 20*(X[,3]-0.5)^2 + 10*X[,4] + 5*X[,5]
Y <- Ytrue + rnorm(n, 0, 1)
return(data.frame(X, Y, Ytrue))
}
Now generate some random training and testing data.
n <- 100
p <- 7
train <- fried(n, p); test <- fried(1000, p)
Xorig <- as.matrix(train[,1:p]); XXorig <- as.matrix(test[,1:p])
yq <- drop(train$Y); yyq <- drop(test$Y)
yyqtrue <- drop(test$Ytrue)
Then expand out the design matrix for a global second-order model.
X <- cbind(Xorig, Xorig^2); XX <- cbind(XXorig, XXorig^2)
for(i in 1:(ncol(Xorig)-1)) for(j in (i+1):ncol(Xorig)) {
X <- cbind(X, Xorig[,i]*Xorig[,j])
XX <- cbind(XX, XXorig[,i]*XXorig[,j])
}
Consider the following function, which fits a ridge regression to X-y training data with penalty lambda, predicts at the XX locations, and returns squared errors against the held-out values supplied as yytrue (either noisy test observations or the true mean).
library(MASS)
ridge.Err <- function(X, y, lambda, XX, yytrue)
{
beta <- as.numeric(coef(lm.ridge(y~X, lambda=lambda)))
yypred <- beta[1] + XX %*% beta[-1]
Err <- (yypred - yytrue)^2
return(Err)
}
Consider a range of lambda values.
lambda <- seq(0,2,length=100)
Err.lam <- Err.lam.true <- rep(NA, length(lambda))
for(i in 1:length(lambda)) {
Err.lam[i] <- mean(ridge.Err(X, yq, lambda[i], XX, yyq))
Err.lam.true[i] <- mean(ridge.Err(X, yq, lambda[i], XX, yyqtrue))
}
Err.lam corresponds to a "data rich" situation where we have a big validation set. Err.lam.true is a fantasy where we get to compare to true values.
plot(lambda, Err.lam, type="l", col=2, ylim=range(c(Err.lam, Err.lam.true)))
lines(lambda, Err.lam.true, col=3)
legend("topleft", c("Data Rich", "Truth"), lty=1, col=c(2,3))
Numerically solve for the \(\lambda\) implied by a specified degrees of freedom, df:
lambda.from.df <- function(X, df)
{
  ## absolute difference between the ridge effective df (trace of the smoother) and the target
  fn <- function(lambda, df, X) {
    Htrace <- sum(diag(X %*% solve(t(X)%*%X + diag(lambda, ncol(X))) %*% t(X)))
    return(abs(Htrace - df))
  }
  ## note: the upper end of the search interval uses yq from the calling environment
  lambda <- optimize(fn, interval=c(0, sum(yq^2)), df=df, X=X)$minimum
  return(lambda)
}
Now evaluate that function on a grid of df values.
df <- seq(1,ncol(X), by=2)
Err.df.true <- Err.df <- rep(NA, length(df))
for(i in 1:length(df)) {
lambda.df <- lambda.from.df(X, df[i])
Err.df.true[i] <- mean(ridge.Err(X, yq, lambda.df, XX, yyqtrue))
Err.df[i] <- mean(ridge.Err(X, yq, lambda.df, XX, yyq))
}
plot(df, Err.df, type="b", lwd=2, col=2, ylim=range(c(Err.df, Err.df.true)))
points(df, Err.df.true, type="b", lwd=2, col=3, pch=19)
legend("top", c("Data Rich", "Truth"), pch=c(18,21), col=c(2,3))
Probably the simplest and most widely used method for estimating the prediction error is cross-validation (CV).
It provides estimates of the generalization error \(\mathrm{Err} = \mathbb{E}\{L(Y, \hat{f}(X))\}\)
Randomly partition the data into \(K\) roughly equal-sized folds. For fold \(k=1,\dots,K\), fit the model to the other \(K-1\) folds and record the prediction error on the held-out \(k^\mathrm{th}\) fold.
Let
\[ \kappa: \{1,\dots, n\} \rightarrow \{1, \dots, K\} \]
be a random indexing function that indicates which fold (\(k\)) each observation (\(i\)) belongs to.
Here is a simple way to do that in R:
cv.folds <- function (n, folds = 10)
{
split(sample(1:n), rep(1:folds, length = n))
}
Let \(\hat{f}^{-k}(x)\) denote the fitted function for the \(k^{\mathrm{th}}\) fold.
Then the CV estimate of prediction error is
\[ \mathrm{CV} = \frac{1}{n} \sum_{i=1}^n L(y_i, \hat{f}^{-\kappa(i)}(x_i)). \]
Given a set of models \(f(x,\lambda)\) indexed by a tuning parameter \(\lambda\), we have
\[ \text{CV}(\lambda) = \frac{1}{n} \sum_{i=1}^n L(y_i, \hat{f}^{-\kappa(i)}(x_i, \lambda)), \]
and then CV\((\lambda)\) traces out a generalization error curve.
Here is an implementation on our Friedman example.
folds <- cv.folds(n, 5)
Err.cv <- rep(NA, length(df))
for(j in 1:length(df)) {
Err.folds <- rep(NA, n)
for(i in 1:length(folds)) {
o <- folds[[i]]
lambda.df <- lambda.from.df(X[-o,], df[j])
Err.folds[o] <- ridge.Err(X[-o,], yq[-o], lambda.df, X[o,], yq[o])
}
Err.cv[j] <- mean(Err.folds)
}
A nice approximation to the unrealistic data-rich setting.
plot(df, Err.df, type="b", lwd=2, col=2, ylim=range(c(Err.df,Err.df.true,Err.cv)))
points(df, Err.cv, type="b", pch=18)
points(df, Err.df.true, type="b", lwd=2, col=3, pch=19)
legend("top", c("CV", "Data Rich", "Truth"), pch=c(18,21,19), col=c(1,2,3))
Now repeat the above idea 100 times:
reps <- 100
Err.cvs <- matrix(NA, nrow=reps, ncol=length(df))
for(r in 1:reps) {
folds <- cv.folds(n, 5)
for(j in 1:length(df)) {
Err.folds <- rep(NA, n)
for(i in 1:length(folds)) {
o <- folds[[i]]
lambda.df <- lambda.from.df(X[-o,], df[j])
Err.folds[o] <- ridge.Err(X[-o,], yq[-o], lambda.df, X[o,], yq[o])
}
Err.cvs[r,j] <- mean(Err.folds)
}
}
… and summarize the distribution of the CV error:
m <- apply(Err.cvs, 2, mean);
q2 <- apply(Err.cvs, 2, quantile, 0.75)
q1 <- apply(Err.cvs, 2, quantile, 0.25)
plot(df, Err.df, type="b", lwd=2, col=2, ylim=range(c(Err.df,Err.df.true,Err.cv)))
points(df, m, type="b", pch=18); lines(df, q1, lty=2); lines(df, q2, lty=2)
points(df, Err.df.true, type="b", lwd=2, col=3, pch=19)
legend("top", c("CV", "Data Rich", "Truth"), pch=c(18,21,19), col=c(1,2,3))
That distribution can be useful for finding the minimum while accounting for the variability of the CV estimate.
The one-standard-error rule suggests choosing the “most parsimonious” model whose error is no more than one standard error above the error of the best model.
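Here is a rough sketch of that rule applied to the replicated CV errors above, using the spread across the 100 replicates as a crude stand-in for the usual fold-wise standard error.
se <- apply(Err.cvs, 2, sd)/sqrt(reps)  ## crude standard errors across replicates
thresh <- min(m) + se[which.min(m)]     ## best mean CV error plus one standard error
df[min(which(m <= thresh))]             ## smallest (most parsimonious) df within the threshold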
Alternatively, choose the smallest df whose mean CV error is smaller than the lowest upper quantile.
Typical choices for \(K\) include 5, 10, and \(n\).
Compared to \(K=5\), \(K=10\) trains on more of the data in each fold, giving less biased estimates of prediction error at the cost of more computation.
\(K=n\) is called leave-one-out cross validation (LOO-CV). Advantages include approximately unbiased estimates of prediction error, since each training set is nearly the full data set.
But disadvantages include high variance, because the \(n\) training sets are so similar to one another, and a potentially heavy computational burden.
Generalized cross-validation (GCV) provides a convenient approximation to LOO-CV when the model is linear (\(\hat{f} = S y\)) and under squared-error loss.
Now for many LMs, \[ \frac{1}{n} \sum_{i=1}^n [y_i - \hat{f}^{-i}(x_i)]^2 = \frac{1}{n} \sum_{i=1}^n \left[ \frac{y_i - \hat{f}(x_i)}{1 - S_{ii}} \right]^2. \]
The GCV approximation is \[ \mathrm{GCV} = \frac{1}{n} \sum_{i=1}^n \left[\frac{y_i - \hat{f}(x_i)}{1 - \text{trace}(S)/n } \right]^2. \]
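For illustration, here is a hypothetical gcv.ridge function applying that formula to the same unscaled, intercept-free ridge smoother used in lambda.from.df above; because lm.ridge centers and scales internally, the selected lambda need not match the earlier plots exactly.
gcv.ridge <- function(X, y, lambda)
{
  S <- X %*% solve(t(X) %*% X + diag(lambda, ncol(X))) %*% t(X)  ## linear smoother
  yhat <- S %*% y
  mean(((y - yhat)/(1 - sum(diag(S))/length(y)))^2)              ## GCV criterion
}
gcvs <- rep(NA, length(lambda))
for(i in 1:length(lambda)) gcvs[i] <- gcv.ridge(X, yq, lambda[i])
lambda[which.min(gcvs)]  ## GCV-selected penalty on the grid from earlier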
Fast CV and GCV routines are built into many software packages.
In particular, the smooth.spline function has a cv argument allowing LOO-CV or GCV criteria to be output.
E.g., revisiting our sinusoidal data from before, and evaluating for all degrees of freedom:
df2 <- 2:(length(x)-1)
cv <- rep(NA, length(df2))
for(i in 1:length(df2)) {
cv[i] <- smooth.spline(x, y, df=df2[i])$cv.crit
}
plot(df2, cv)
It looks like df \(\in \{11,12,13\}\) or so is best.
The bootstrap is a general tool for assessing statistical accuracy.
At first glance it seems like voodoo, or like you get something for nothing. Hence the name …
The phrase “pull oneself over a fence by one’s bootstraps” goes back to the early 19th century United States.
In statistics, it's all about (re-)sampling with replacement.
Suppose we have a model and a fitting method for a particular set of training data.
Denote the training set by \[ Z = (z_1, z_2, \dots, z_n) \;\;\; \mbox{where} \;\;\; z_i = (x_i, y_i). \]
Suppose that you were interested in some statistic \(S(Z)\) about an aspect of the fit \(\hat{f} \mid Z\).
The idea is to randomly draw datasets with replacement from the training data (\(Z\)), each of which gives a different realization of \(S\).
A bootstrap cartoon …
Based on the bootstrap we can estimate any aspect of the distribution of \(S(Z)\).
As a simple example, consider estimating uncertainty in the mean of a population.
alpha <- 4; ibeta <- 1/4
n <- 30
z <- rgamma(n, alpha, ibeta)
zbar <- mean(z)
B <- 199
Z.star <- matrix(NA, nrow=B, ncol=n)
for(b in 1:B) {
Z.star[b,] <- sample(z, n, replace=TRUE)
}
hist(z, freq=FALSE, main="", ylim=c(0,0.3))
zbar.boot <- apply(Z.star, 1, mean)
hist(zbar.boot, freq=FALSE, col=2, add=TRUE)
points(zbar, 0, col=3, cex=1.5)
q.boot <- quantile(zbar.boot, c(0.025, 0.975))
text(q.boot, c(0,0), c("[", "]"), col=4, cex=1.5)
Consider a bootstrapped uncertainty calculation for our smoothing splines predictor.
B <- 1000
YYpred.boot <- matrix(NA, nrow=B, ncol=length(xx))
for(b in 1:B) {
indices <- sample(1:length(x), length(x), replace=TRUE)
xb <- x[indices]  ## renamed to avoid clobbering the knot vector xi from earlier
fit <- smooth.spline(xb, y[indices], df=11)
YYpred.boot[b,] <- as.matrix(predict(fit, data.frame(xb=xx))$y)
}
Now calculate means and quantiles for plotting.
m <- apply(YYpred.boot, 2, mean)
q1 <- apply(YYpred.boot, 2, quantile, p=0.05)
q2 <- apply(YYpred.boot, 2, quantile, p=0.95)
Accounting for mean uncertainty.
matplot(xx, t(YYpred.boot), type="l", lty=1, col="gray", xlab="x", ylab="y")
points(x, y); lines(xx, m, lwd=2)
lines(xx, q1, col=2, lty=2); lines(xx, q2, col=2, lty=2)
Can we account for uncertainty in the complexity of the fit? Below, df is re-selected by GCV (via cv.crit) within each bootstrap resample.
YYpred.boot.cv <- matrix(NA, nrow=B, ncol=length(xx))
for(b in 1:B) {
cv2 <- rep(NA, length(df2))
indices <- sample(1:length(x), length(x), replace=TRUE)
df3 <- 2:length(unique(indices))
for(i in 1:length(df3))
cv2[i] <- smooth.spline(x[indices], y[indices], df=df3[i])$cv.crit
xb <- x[indices]  ## again avoiding the knot-vector name xi
fit <- smooth.spline(xb, y[indices], df=df3[which.min(cv2)])
YYpred.boot.cv[b,] <- as.matrix(predict(fit, data.frame(xb=xx))$y)
}
Again, getting means and quantiles for plotting.
m <- apply(YYpred.boot.cv, 2, mean)
q1 <- apply(YYpred.boot.cv, 2, quantile, p=0.05)
q2 <- apply(YYpred.boot.cv, 2, quantile, p=0.95)
Accounting for full uncertainty.
matplot(xx, t(YYpred.boot.cv), type="l", lty=1, col="gray", xlab="x", ylab="y")
points(x, y); lines(xx, m, lwd=2)
lines(xx, q1, col=2, lty=2); lines(xx, q2, col=2, lty=2)
That was quite a digression!
Everything presented so far has a multidimensional analog (i.e., for \(p > 1\)), technically.
Consider \(\mathbb{R}^2\). Given a set of \(M_1\) basis functions \(h_{1j}(x)\) and \(M_2\) ones \(h_{2k}(x)\), then the \(M_1 \times M_2\) tensor product basis defined by
\[ g_{jk}(x) = h_{1j}(x_1) h_{2k}(x_2), \quad j=1,\dots,M_1,\; k=1,\dots, M_2 \]
can be used to represent a two-dimensional function
\[ g(x) = \sum_{j=1}^{M_1} \sum_{k=1}^{M_2} \beta_{jk} g_{jk}(x). \]
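A minimal sketch in R, with made-up inputs on the unit square and bs bases in each coordinate, shows how the column count multiplies:
library(splines)
x1 <- runif(100); x2 <- runif(100)      ## made-up 2-d inputs, purely for illustration
H1 <- bs(x1, df=4); H2 <- bs(x2, df=4)  ## M1 = M2 = 4 basis functions per coordinate
G <- matrix(NA, nrow=100, ncol=ncol(H1)*ncol(H2))
for(j in 1:ncol(H1)) for(k in 1:ncol(H2)) {
  G[, (j-1)*ncol(H2) + k] <- H1[,j] * H2[,k]
}
dim(G)  ## 100 x 16: the number of columns grows multiplicatively with dimension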
Instead, one usually simplifies the basis, e.g., as in additive spline models, or via greedy methods such as multivariate adaptive regression splines (MARS), implemented in the mda package for R.
Basically, it turns out you mostly want to do something else!