The material here is largely paraphrased from Hastie, Tibshirani & Friedman (2017), Chapters 5, 7 & 8.

Goals

We want to take baby steps towards response surface methods based on non-parametric regression models,

We’ll start by considering linear methods via basis expansion, generically,

But they don’t scale well, and can behave wildly when extrapolating.

Making the transition

A desire for

will motivate our transition to GP regression.

We will ultimately see how spline-based methods and GPs are examples of one and the same thing.

Basis expansion

Nonlinear regression?

Generically, the regression problem involves estimating \(f(x) = \mathbb{E}\{Y \mid x\}\), e.g., via

\[ Y = f(x) + \varepsilon. \]

  • The linear model/OLS assumes \(f(x) = x^\top \beta\).

However, it is extremely unlikely that the true function \(f(x)\) is actually linear in \(x\).

  • \(f(x)\) will typically be nonlinear and non-additive in \(x\).
  • Still, representing \(f(x)\) by a linear model is usually convenient, and sometimes a necessary tactic by way of (local) approximation.

Nonlinear fits, by augmenting/replacing \(x\) with additional variables that are transformations of \(x\), represent an attractive option.

  • Of course the first-order model with interactions, and the second-order model are examples of this, but they barely scratch the surface.

Basis expansion

Modern basis expansion methods take the linear modeling apparatus to the extreme by

  • setting up appropriate (and large) bases for expanding the nonlinear capability of OLS regression,
  • pairing with appropriate regularization to encourage parsimony and separate signal from noise.

Denote by \(h_m(x): \mathbb{R}^p \rightarrow \mathbb{R}\), the \(m^\mathrm{th}\) transformation of \(x\), for \(m=1,\dots,M\).

We then model

\[ f(x) = \sum_{m=1}^M \beta_m h_m(x), \]

a linear basis expansion in \(x\).

Simple transformations

Once the basis is determined, the models are linear in these new variables.

  • In the simplest version (small \(M\)), the fitting proceeds via OLS/MLE calculations.

Examples include:

  • \(h_m(x) = 1\), an intercept term
  • \(h_m(x) = x_j\), for coordinates \(j=1,\dots, p\); i.e., the original LM
  • \(h_m(x) = x_j^2\) or \(h_m(x) = x_j x_k\), i.e., higher order Taylor expansion(s)
  • \(h_m(x) = \log(x_j)\), \(\sqrt{x_j}, \dots\) or other nonlinear transformations
  • \(h_m(x) = ||x||^2\), or other function of several inputs
  • \(h_m(x) = I(L_m < x_j < U_m)\), i.e., piecewise constant

This last one seems silly at first glance,

  • but it is actually the most interesting from a flexibility perspective.

Piecewise polynomials and splines

Ultimately, we will be fitting piecewise polynomials which will lead us to so-called splines.

For now, we assume that \(x\) is one-dimensional.

A piecewise polynomial \(f(x)\) is obtained by

  • dividing the domain of \(x\) into contiguous intervals.
  • The points in the input space that delineate the partition boundaries are called knots: \(\xi_1, \xi_2, \dots\).
  • Each region, an interval between contiguous knots \([\xi_i, \xi_{i+1})\), is represented by a polynomial fit to the local data
  • possibly under other (more global) constraints.

Common special cases include piecewise constant, linear, quadratic, and cubic models.

Piecewise constant

Consider two knots \(\xi_1\) and \(\xi_2\) and \(f(x) = \sum_{m=1}^3 \beta_m h_m(x)\) with

\[ \begin{aligned} h_1(x) &= I(x < \xi_1) & h_2(x) &= I(\xi_1 \leq x < \xi_2) & h_3(x) &= I(\xi_2 \leq x). \end{aligned} \]

  • OLS amounts to \(\hat{\beta}_m = \bar{y}_m\), an estimate of the mean of \(Y\) in the \(m^\mathrm{th}\) region.

Consider some simple sinusoidal data.

x <- seq(0,10,length=50)
ytrue <- (sin(pi*x/5) + 0.2*cos(4*pi*x/5))
y <- ytrue + rnorm(length(ytrue), sd=0.1)
plot(x,y); lines(x, ytrue)

Building the constant basis in R

xi <- c(3, 7)
hc1 <- function(x) { as.numeric(x < xi[1]) }
hc2 <- function(x) { as.numeric(x >= xi[1] & x < xi[2]) }
hc3 <- function(x) { as.numeric(x >= xi[2]) }
H.c <- data.frame(cbind(hc1(x), hc2(x), hc3(x)))
names(H.c) <- paste("hc", 1:3, sep="")

Performing the fit

pc.fit <- lm(y~.-1, data=H.c)

Creating a predictive grid, and the true output values for comparison.

xx <- seq(-1,11,length=199)
HH.c <- data.frame(cbind(hc1(xx), hc2(xx), hc3(xx)))
names(HH.c) <- paste("hc", 1:3, sep="")
yytrue <- (sin(pi*xx/5) + 0.2*cos(4*pi*xx/5))

Not a fantastic fit.

plot(x,y); lines(xx, yytrue, col="blue")
pc <- predict(pc.fit, newdata=HH.c, interval="prediction")
lines(xx, pc[,1]); lines(xx, pc[,2], col=2, lty=2); lines(xx, pc[,3], col=2, lty=2)
abline(v=xi, col="gray", lty=2)

Piecewise linear

However, if we augment with

  • \(h_{m+3}(x) = h_m(x)x\), for \(m=1,\dots,3\), and fit \(f(x) = \sum_{m=1}^6 \beta_m h_m(x)\),

OLS amounts to \(\hat{\beta}\) from separate simple linear regressions (slope and intercept), fit via OLS/MLE to the \(y\)-values in each of the three regions.

Piecewise linear in R

Creating the basis design and performing the fit.

H.l <- cbind(H.c, hc1(x)*x, hc2(x)*x, hc3(x)*x)
names(H.l)[4:6] <- paste("hl", 1:3, sep="")
pl.fit <- lm(y~.-1, data=H.l)

Creating the predictive design and predicting.

HH.l <- cbind(HH.c, hc1(xx)*xx, hc2(xx)*xx, hc3(xx)*xx)
names(HH.l)[4:6] <- paste("hl", 1:3, sep="")
pl <- predict(pl.fit, newdata=HH.l, interval="prediction")

A better fit.

plot(x,y); lines(xx, yytrue, col="blue")
lines(xx, pl[,1])
lines(xx, pl[,2], col=2, lty=2); lines(xx, pl[,3], col=2, lty=2)
abline(v=xi, col="gray", lty=2)

Enforcing continuity

That fit is better, but there is nothing that forces the line segments to line up.

Can we add some constraints to fix it?

\[ \begin{aligned} f(\xi_1^-) &= f(\xi_1^+) & f(\xi_2^-) &= f(\xi_2^+) \\ \mbox{i.e. } \quad \beta_1 + \xi_1 \beta_4 &= \beta_2 + \xi_1 \beta_5 & \beta_2 + \xi_2 \beta_5 &= \beta_3 + \xi_2 \beta_6. \end{aligned} \]

Adding those constraints makes it so that there are only really four free parameters (not six).

Here is a basis for \(f(x) = \sum_{m=1}^4 \beta_m h_m(x)\) that incorporates those constraints.

\[ \begin{aligned} h_1(x) &= 1 & h_2(x) &= x & h_3(x) &= (x - \xi_1)_+ & h_4(x) &= (x - \xi_2)_+ \end{aligned} \]

  • \(t_+\) denotes the positive part of \(t\): equal to \(t\) when \(t > 0\), and zero otherwise. It is sometimes called the “hockey stick” function.

Hockey sticks and constraints

In R the hockey stick function could be coded up generically as follows.

hs <- function(x) {
  x[x < 0] <- 0
  return(x)
}


On the sinusoid example

First create the design from the basis, and then obtain the fit.

H.lc <- rep(1,length(x))
H.lc <- cbind(H.lc, x)
H.lc <- cbind(H.lc, hs(x-xi[1]))
H.lc <- data.frame(cbind(H.lc, hs(x-xi[2])))
names(H.lc) <- paste("hlc", 1:4, sep="")
plc.fit <- lm(y~.-1, data=H.lc)

Then create the predictive design and gather the predictions.

HH.lc <- rep(1,length(xx))
HH.lc <- cbind(HH.lc, xx)
HH.lc <- cbind(HH.lc, hs(xx-xi[1]))
HH.lc <- data.frame(cbind(HH.lc, hs(xx-xi[2])))
names(HH.lc) <- paste("hlc", 1:4, sep="")
plc <- predict(plc.fit, newdata=HH.lc, interval="prediction")

Even better (because it is less “jumpy”).

plot(x,y); lines(xx, yytrue, col="blue")
lines(xx, plc[,1])
lines(xx, plc[,2], col=2, lty=2); lines(xx, plc[,3], col=2, lty=2)
abline(v=xi, col="gray", lty=2)

Smoother fits

Smoother fits can be obtained by

  • increasing the order of the polynomial (e.g., cubic), and
  • increasing the order of continuity at the knots (e.g., continuous first derivative).


Cubic splines

A piecewise cubic that is continuous, and has continuous 1st & 2nd derivatives

  • is called a cubic spline.
  • Enforcing one more order of continuity would lead to a global cubic polynomial.

Cubic spline basis

The following basis represents a cubic spline with knots at \(\xi_1\) and \(\xi_2\):

\[ \begin{aligned} h_1(x) &= 1 & h_3(x) &= x^2 & h_5(x) &= (x-\xi_1)_+^3 \\ h_2(x) &= x & h_4(x) &= x^3 & h_6(x) &= (x-\xi_2)_+^3 \\ \end{aligned} \]

  • There are six basis functions corresponding to a six-dimensional linear space of functions.
  • A quick check confirms the parameter count:

\[ \begin{aligned} (3 \mbox{ regions}) &\times (4 \mbox{ parameters per region}) \\ - (2 \mbox{ knots}) &\times (3 \mbox{ constraints per knot})\\ &= 6. \end{aligned} \]

Continuing our sinusoidal example

Building the design in the cubic spline basis and obtaining the fit.

H.cs <- rep(1,length(x))
H.cs <- cbind(H.cs, x)
H.cs <- cbind(H.cs, x^2)
H.cs <- cbind(H.cs, x^3)
H.cs <- cbind(H.cs, hs((x-xi[1])^3))
H.cs <- data.frame(cbind(H.cs, hs((x-xi[2])^3)))
names(H.cs) <- paste("x", 1:6, sep="")
fit.cs <- lm(y~., data=H.cs)

Now build the corresponding predictive design and gather predictions.

HH.cs <- rep(1,length(xx))
HH.cs <- cbind(HH.cs, xx)
HH.cs <- cbind(HH.cs, xx^2)
HH.cs <- cbind(HH.cs, xx^3)
HH.cs <- cbind(HH.cs, hs((xx-xi[1])^3))
HH.cs <- data.frame(cbind(HH.cs, hs((xx-xi[2])^3)))
names(HH.cs) <- paste("x", 1:6, sep="")
suppressWarnings(pcs <- predict(fit.cs, newdata=HH.cs, interval="prediction"))

Pretty amazing, but not perfect (yet).

plot(x,y); lines(xx, yytrue, col="blue")
lines(xx, pcs[,1])
lines(xx, pcs[,2], col=2, lty=2); lines(xx, pcs[,3], col=2, lty=2)
abline(v=xi, col="gray", lty=2)

Order-\(M\) splines

An order-M spline with knots \(\xi_j\), \(j=1,\dots,K\) is

  • a piecewise polynomial of order \(M\)
  • and has continuous derivatives up to order \(M-2\).

The basis set is

\[ \begin{aligned} h_j(x) &= x^{j-1}, & j&=1,\dots,M \\ h_{M+\ell}(x) &= (x - \xi_\ell)_+^{M-1}, & \ell &= 1,\dots,K. \end{aligned} \]
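
To make this concrete, here is a small sketch (not from the text) of a function building the order-\(M\) truncated power basis for an arbitrary knot vector, reusing the hs hockey-stick function from earlier; with \(M=4\) and the knots xi from our running example it reproduces the cubic spline basis above.

tp.basis <- function(x, M, knots)   ## truncated power basis of order M >= 2
  {
    H <- sapply(1:M, function(j) x^(j-1))                    ## 1, x, ..., x^(M-1)
    H <- cbind(H, sapply(knots, function(k) hs(x - k)^(M-1)))
    colnames(H) <- paste("h", 1:ncol(H), sep="")
    return(H)
  }
H.tp <- data.frame(tp.basis(x, M=4, knots=xi))
fit.tp <- lm(y~.-1, data=H.tp)      ## should reproduce the cubic spline fit above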

Most common special cases

The most widely used special cases:

  • \(M=1\), piecewise constant
  • \(M=2\), continuous piecewise linear
  • \(M=4\), cubic spline

It is claimed that cubic splines are the lowest order spline for which the knot-discontinuity (in derivatives) is not visible to the human eye.

  • There is seldom much reason to consider \(M > 4\).

The piecewise constant and (discontinuous) linear cases are common too, especially for higher dimensional inputs (\(p > 1\)).

Regression splines

These fixed knot splines are also known as regression splines.

One needs to specify:

  • the order of the spline, \(M\), and
  • the number of knots, \(K\), and their placement.

The bs function in the splines library for R can aid in generating bases.

E.g.,

bs(x, degree=1, knots=c(0.2, 0.4, 0.6), intercept=TRUE)

generates a basis for linear splines with three knots, and returns an \(n \times 5\) matrix.

Illustrating bs.

library(splines); fit.cs2 <- lm(y ~ bs(x, knots=xi, degree=3))
suppressWarnings(pcs2 <- predict(fit.cs2,data.frame(x=xx),interval="prediction"))
plot(x,y); lines(xx, yytrue, col="blue"); lines(xx, pcs2[,1]); 
abline(v=xi, col="gray", lty=2)
lines(xx, pcs2[,2], col=2, lty=2); lines(xx, pcs2[,3], col=2, lty=2)

Boundary behavior

Polynomials fit to data tend to be erratic near the boundaries,

  • so extrapolation can be dangerous.

This is exacerbated with splines.

  • Polynomials fit beyond the boundary knots behave even more wildly than global polynomials.

Consider a comparison to a global cubic fit.

fit.cub <- lm(y~x + I(x^2) + I(x^3))
suppressWarnings({pcub <- predict(fit.cub, newdata=data.frame(x=xx),se.fit=TRUE)
pcs3 <- predict(fit.cs2, data.frame(x=xx), se.fit=TRUE)})

plot(xx, sqrt(pcs3$se.fit^2 + summary(fit.cs2)$sigma^2), type="l", 
  xlab="x", ylab="pred sd")
lines(xx, sqrt(pcub$se.fit^2 + summary(fit.cub)$sigma^2), col=2)
legend("top", c("cubic spline", "global cubic"), lty=1, col=1:2)

  • Although predictions are more accurate in the interior of the design space,
  • they are far less accurate outside.

Natural cubic splines

A natural cubic spline is a cubic spline with the additional “constraint” of linearity beyond the boundary knots.

\[ \begin{aligned} N_1(x) &= 1, & N_2(x) &= x, & N_{k+2}(x) &= d_k(x) - d_{K-1}(x), \end{aligned} \]

where \[ d_k(x) = \frac{(x - \xi_k)_+^3 - (x- \xi_K)_+^3}{\xi_K - \xi_k} \]

  • See ns in the splines library for R.

Paradoxically, this is a simpler model:

  • it has four fewer parameters (a quadratic and cubic term need not be fit to the two boundary regions),
  • which means that we can add more knots (i.e., interior modeling), if desired, and retain the same complexity of fit.
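
As a sanity check on the formula above, here is a sketch (an illustration only; ns uses a different, better-conditioned parameterization) that builds the \(N_k\) basis directly from the \(d_k\)'s and verifies that it spans the same space as ns. The five knots below are arbitrary illustrative choices; at least three knots are needed for the cubic part to survive the constraints.

ncs.basis <- function(x, knots)     ## natural cubic spline basis via the d_k's; needs K >= 3
  {
    K <- length(knots)
    d <- function(k) (hs(x - knots[k])^3 - hs(x - knots[K])^3)/(knots[K] - knots[k])
    N <- cbind(1, x)
    for(k in 1:(K-2)) N <- cbind(N, d(k) - d(K-1))
    colnames(N) <- paste("N", 1:K, sep="")
    return(N)
  }
kn <- seq(1, 9, by=2)
fit.man <- lm(y~.-1, data=data.frame(ncs.basis(x, kn)))
fit.Rns <- lm(y ~ ns(x, knots=kn[2:4], Boundary.knots=range(kn)))
max(abs(fitted(fit.man) - fitted(fit.Rns)))   ## numerically zero: same space, different bases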

fit.ns <- lm(y ~ ns(x, knots=xi))
suppressWarnings(pns <- predict(fit.ns, data.frame(x=xx), se.fit=TRUE))
plot(xx, sqrt(pcs3$se.fit^2 + summary(fit.cs2)$sigma^2), type="l", 
  xlab="x", ylab="pred sd")
lines(xx, sqrt(pcub$se.fit^2 + summary(fit.cub)$sigma^2), col=2)
lines(xx, sqrt(pns$se.fit^2 + summary(fit.ns)$sigma^2), col=3)
legend("top", c("cubic spline", "global cubic", "natural cubic spline"), 
  lty=1, col=1:3)

Adding four more knots.

xi2 <- c(0.5,2,3.5,5,6.5,8,9.5); fit.ns2 <- lm(y ~ ns(x, knots=xi2))
pns2 <- predict(fit.ns2, data.frame(x=xx), interval="prediction")
plot(x,y); lines(xx, yytrue, col="blue"); lines(xx, pns2[,1]); 
abline(v=xi2, col="gray", lty=2)
lines(xx, pns2[,2], col=2, lty=2); lines(xx, pns2[,3], col=2, lty=2)

Smoothing splines

Choosing knots is cumbersome.

Consider instead finding, among all functions \(f(x)\) with two continuous derivatives, the one that minimizes the penalized residual sum of squares:

\[ \mathrm{RSS}(f, \lambda) = \sum_{i=1}^n \{y_i - f(x_i)\}^2 + \lambda \int \{f''(t)\}^2 dt, \]

where \(\lambda\) is a fixed smoothing parameter:

  • \(\lambda = 0\): \(f\) can be any function that interpolates the data
  • \(\lambda = \infty\): OLS fit from the linear model
  • \(\lambda \in (0,\infty)\): trades off roughness and smoothness

The solution is a so-called smoothing spline (SS).

Smoothing splines are natural splines

Remarkably, it can be shown that there is an explicit, finite-dimensional, unique minimizer:

  • a natural cubic spline with knots at the unique \(x_i\)’s.

But it seems over-parameterized at first glance:

  • \(K = n\) knots \(\rightarrow O(n)\) free parameters \(\rightarrow\) overfit!

However, the regularization term implies a penalty on the spline coefficients \(\beta_j\), which shrinks the overall \(\hat{f}\) towards a linear fit.

Generalized ridge regression

Since the SS solution is a natural spline

\[ f(x) = \sum_{j=1}^n N_j(x) \beta_j \equiv N(x) \beta \equiv N\beta, \]

so the criterion reduces to

\[ \mathrm{RSS}(\beta, \lambda) = ||Y - N\beta||^2 + \lambda \beta^\top \Omega_n \beta, \]

where \(\{\Omega_n\}_{jk} = \int N_j''(t)N_k''(t) dt\).

This is a generalized ridge regression, so \[ \hat{\beta} = (N^\top N + \lambda \Omega_n)^{-1} N^\top Y. \]
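
Here is a rough numerical sketch of that formula (an illustration only; the smooth.spline function coming up uses a smarter B-spline representation and better numerics). It builds \(N\) with knots at the interior \(x_i\)'s via ns, approximates \(\Omega_n\) by numerically integrating second differences on a fine grid, and plugs into the generalized ridge solution; \(\lambda = 0.001\) is an arbitrary value chosen for illustration.

N <- ns(x, knots=x[2:(length(x)-1)], intercept=TRUE)   ## n x n natural spline basis
tg <- seq(min(x), max(x), length=1000)                 ## fine grid for integrating N''_j N''_k
Ng <- ns(tg, knots=x[2:(length(x)-1)], intercept=TRUE)
h <- diff(tg)[1]
d2 <- apply(Ng, 2, function(nj) diff(diff(nj))/h^2)    ## second differences approximate N''_j
Omega <- t(d2) %*% d2 * h                              ## Riemann sum for the penalty matrix
lambda <- 0.001
A <- t(N) %*% N + lambda*Omega
beta <- solve(A, t(N) %*% y)
S <- N %*% solve(A, t(N))                              ## the smoother matrix S_lambda (next slide)
sum(diag(S))                                           ## its trace: effective df (see below)
plot(x, y); lines(x, N %*% beta)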

Smoothing matrix

A smoothing spline with pre-chosen \(\lambda\) is an example of a linear smoother:

\[ \begin{aligned} \hat{f} &= N(N^\top N + \lambda \Omega_n)^{-1} N^\top Y\\ &= S_\lambda Y. \end{aligned} \]

\(S_\lambda\) is similar to the hat matrix \(H \equiv S_0\) (i.e., \(\lambda = 0\)) in an OLS regression.

Recall that in that context \(\hat{\beta} = (X^\top X)^{-1} X^\top y\),

  • so fitted values are \(\hat{y} = X (X^\top X)^{-1} X^\top y = H y\) where \(H = X (X^\top X)^{-1} X^\top\).
  • \(H\) “puts the hat” on \(y\).

Both \(H\) and \(S_\lambda\) are symmetric, positive semidefinite matrices.

  • \(H\) is idempotent (\(H^2=H\)), while \(S_\lambda^2 \preceq S_\lambda\), meaning that the RHS exceeds the LHS by a positive semidefinite matrix due to the shrinking nature of \(S_\lambda\).

Effective degrees of freedom

\[ \begin{aligned} \mbox{Recall that } \quad \text{Rank}(H) &= \text{tr}(H) = \text{tr}(X(X^\top X)^{-1}X^\top) \\ &= \text{tr}((X^\top X)^{-1}X^\top X) = \text{tr}(I_p) = p \end{aligned} \]

gives the dimension of the projection space, and hence the number of parameters involved in the fit (of the mean \(X\beta\)).

\(S_\lambda\) is of full rank, but since it is not idempotent its rank does not equal its trace.

  • Actually, \(\mathrm{trace}(S_\lambda) \leq \mathrm{trace}(H) = n\).

By analogy then, the effective degrees of freedom (df) of a SS is defined to be

\[ \mathrm{df}_\lambda = \mathrm{trace}(S_\lambda). \]

This allows a more intuitive parameterization.

  • Numerically solve for \(\lambda\) in \(\text{trace}(S_\lambda) = d\), given a particular df \(d\) to control the roughness of the fit.
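
As a quick sanity check (a sketch using the smooth.spline function introduced on the next slide, and the sinusoid data from earlier): since a smoothing spline with a fixed smoothing parameter is a linear smoother, the columns of \(S_\lambda\) can be recovered by smoothing unit vectors, and the trace compared to the requested df.

fit0 <- smooth.spline(x, y, df=10)   ## fix the amount of smoothing via df=10
nx <- length(x)
S <- matrix(NA, nx, nx)
for(j in 1:nx) {                     ## column j of S_lambda is the smooth of e_j
  ej <- rep(0, nx); ej[j] <- 1
  S[,j] <- smooth.spline(x, ej, spar=fit0$spar)$y
}
sum(diag(S))                         ## should be close to the requested df of 10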

smooth.spline in R

In R we can use the smooth.spline function (in the built-in stats package), which has a df argument.

This encourages a more traditional mode of model selection:

  • try a couple different values of df,
  • select one based on approximate \(F\)-tests,
  • and use residual plots and other subjective criteria to judge goodness of fit.
  • Although cross validation (CV) probably represents a better option.
fit.ss5 <- smooth.spline(x, y, df=5)
pss5 <- as.matrix(predict(fit.ss5, data.frame(x=xx))$y)
fit.ss10 <- smooth.spline(x, y, df=10)
pss10 <- as.matrix(predict(fit.ss10, data.frame(x=xx))$y)
fit.ss20 <- smooth.spline(x, y, df=20)
pss20 <- as.matrix(predict(fit.ss20, data.frame(x=xx))$y)
  • Unfortunately, the predict method does not provide an interval calculation.
  • For that we will need a bootstrap.

Nice that it’s automatic; not sure I love the fit(s) though.

plot(x,y); lines(xx, yytrue, col="blue") 
lines(xx, pss5, lwd=2); lines(xx, pss10, col=2, lwd=2); 
lines(xx, pss20, col=3, lwd=2)
legend("topright", c("df = 5", "df = 10", "df = 20"), lty=1, lwd=2, col=1:3)

CV & Bootstrap

Data re-use

Cross validation (CV) and the bootstrap are frequentist procedures for:

  • setting/tuning a single parameter – usually one controlling model complexity (CV)
  • choosing amongst a discrete set of models/methods (CV)
  • assessing the uncertainty in fits, predictions, etc., (B)

where it is otherwise difficult to do analytically or via asymptotics.

Some would say CV&B are a cheap alternative to MCMC

  • without the crutch of choosing a prior.

There is some truth to that,

  • although in the settings where these methods are most often applied, a de-facto prior is essentially already in place, providing “regularization”.

Loss

Suppose we have

  • a target variable \(y\),
  • measured at a vector of inputs \(x\),
  • and a prediction model \(\hat{f}(x)\) estimated from a training sample, i.e., pairs \((x_1, y_1), \dots, (x_n, y_n)\).

Quantifying the goodness of fit requires specifying a loss function,

  • saying what is a good fit and what’s not.

Typical choices are

\[ L(y, \hat{f}(x)) = \left \{ \begin{array}{ll} (y - \hat{f}(x))^2 & \mbox{squared error}\\ |y - \hat{f}(x)| & \mbox{absolute error}. \end{array} \right. \]

Training error

The training error is the average loss over the training sample:

\[ \overline{\mathrm{err}} = \frac{1}{n} \sum_{i=1}^n L(y_i, \hat{f}(x_i)). \]

Depending on how \(\hat{f}\) is obtained, we can usually make the training error arbitrarily small,

  • but we won’t [necessarily] end up with a good predictor.
  • It may be helpful to think of the training set as random.
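
For instance, revisiting the smoothing spline fits of the sinusoid data from earlier, the (squared-error) training error drops as df grows, regardless of whether the extra flexibility is actually helping prediction.

mean((y - predict(fit.ss5, x)$y)^2)    ## df = 5
mean((y - predict(fit.ss10, x)$y)^2)   ## df = 10
mean((y - predict(fit.ss20, x)$y)^2)   ## df = 20: lowest training error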

Generalization error

The test error or generalization error is the expected prediction error over an independent test sample

\[ \mathrm{Err} = \mathbb{E}\{L(Y, \hat{f}(X))\}, \]

where both \(X\) and \(Y\) are drawn randomly from their joint distribution (population).

  • Expectation is over all that’s random, including the training sample yielding \(\hat{f}\).

Whereas the training error is available explicitly, we would like to estimate the testing error for a particular \(\hat{f}\).

As a (good) model becomes more and more complex,

  • it is able to adapt to more complicated underlying structures (decrease bias),
  • but the estimation error increases (increase in variance).

In between there is an optimal complexity that gives the minimum testing error.

Generalization v. model complexity

Goals

So our goals are twofold:

  • Model selection: estimating the performance of a suite of models (\(\hat{f}\)) in order to choose the (approximately) best one. (CV)
  • Model assessment: having chosen a final model, estimating its error structure on new data. (B)

Usually there is a single tuning parameter \(\lambda\) that varies the complexity of the model.

  • Our goal is to find \(\lambda\) minimizing the error of \(\hat{f}_\lambda(x)\).
  • Sometimes \(\lambda\) is suppressed in the notation.


The presumption of a single tuning parameter \(\lambda\) is the crutch CV leans on, relative to Bayesian methods for the inference of unknown parameters.

An idealization

If data is overly abundant, the best approach is to randomly divide it into three quarantined parts.


  • Use the training set [50%] to fit (several) models: the \(\hat{f}\)s
  • Use the validation set [25%] to choose a winner.
  • Use the test set [25%] to assess any uncertainties in the chosen model fit.

When it is not abundant, we have to be more clever

  • and develop methods of data re-use.

But first it will be helpful to have a bit of a digression in order to better understand the bias–variance trade-off as model complexity is varied.

Bias-variance decomposition

Assume that \(Y = f(x) + \varepsilon\) where \(\mathbb{E}\{\varepsilon\} = 0\) and \(\mathbb{V}\mathrm{ar}(\varepsilon) = \sigma_\varepsilon^2\).

Then the expected prediction error of \(\hat{f}(x)\), under squared-error loss, is

\[ \begin{aligned} \mathrm{Err}(x) &= \mathbb{E}\{(Y(x) - \hat{f}(x))^2\} \\ &= \sigma_\varepsilon^2 + [ \mathbb{E}\{\hat{f}(x)\} - f(x)]^2 + \mathbb{E}\{(\hat{f}(x) - \mathbb{E}\{\hat{f}(x)\})^2\}\\ &= \sigma_\varepsilon^2 + \mathrm{Bias}^2[\hat{f}(x)] + \mathbb{V}\mathrm{ar}(\hat{f}(x)) \\ &= \mbox{Irreducible Error} + \mathrm{Bias}^2 + \mathrm{Variance}. \end{aligned} \]

  • The first term is the variance of the target around its true mean, and cannot be avoided no matter how well we estimate the function, unless \(\sigma_\varepsilon = 0\).
  • The second term is the squared bias, the amount by which the average of our estimate differs from the true mean;
  • the last term is the variance: the expected deviation of our fit around its mean.

The more complex \(\hat{f}\) is, the lower the (squared) bias but the higher the variance.
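
The following Monte Carlo sketch illustrates the decomposition on the sinusoid example from earlier: re-generate training responses from ytrue many times, fit global polynomials of increasing degree (a simple stand-in for increasing complexity), and average squared bias and variance over the training inputs.

degrees <- c(1, 3, 5, 7, 9, 11)
mc <- 500; sde <- 0.1
fhat <- array(NA, dim=c(mc, length(x), length(degrees)))
for(r in 1:mc) {
  yr <- ytrue + rnorm(length(x), sd=sde)                ## a fresh training sample
  for(d in 1:length(degrees))
    fhat[r,,d] <- fitted(lm(yr ~ poly(x, degrees[d])))
}
bias2 <- apply((apply(fhat, c(2,3), mean) - ytrue)^2, 2, mean)
vars <- apply(apply(fhat, c(2,3), var), 2, mean)
plot(degrees, bias2 + vars + sde^2, type="b", xlab="polynomial degree", ylab="Err",
  ylim=c(0, max(bias2 + vars + sde^2)))
lines(degrees, bias2, type="b", col=2); lines(degrees, vars, type="b", col=3)
legend("topright", c("Err", "squared bias", "variance"), lty=1, col=1:3)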

For ordinary least squares

OLS for a \(p\)-vector of predictors \(x\), i.e., \(\hat{f}_p(x) = x^\top \hat{\beta}\), gives:

\[ \begin{aligned} \mathrm{Err}(x) &= \mathbb{E}\{ (Y(x) - \hat{f}_p(x))^2 \} \\ &= \sigma_\varepsilon^2 + [f(x) - \mathbb{E}\{\hat{f}_p(x)\}]^2 + || h(x)||^2 \sigma_\varepsilon^2, \end{aligned} \]

where \(h(x) = x^\top (X^\top X)^{-1} X^\top\), an \(n\)-vector, describing a projection.

Therefore we can read off that \(\mathbb{V}\mathrm{ar}(\hat{f}_p(x)) = || h(x)||^2 \sigma_\varepsilon^2\).

  • Although this depends on \(x\), the in-sample error

\[ \frac{1}{n} \sum_{i=1}^n \text{Err}(x_i) = \sigma_\varepsilon^2 + \frac{1}{n} \sum_{i=1}^n[ f(x_i) - \mathbb{E}\{\hat{f}_p(x_i)\}]^2 + \frac{p}{n}\sigma_\varepsilon^2 \]

reveals how model complexity is directly related to \(p\).

For ridge regression

The form of the error for a ridge regression \(\hat{f}_\lambda\) is similar.

  • The variance term involves \(h(x) = x^\top (X^\top X + \lambda I_p)^{-1} X^\top\).

For the bias, let \(\beta_*\) denote the parameters of the best linear fit \(f\):

\[ \beta_* = \text{argmin}_\beta \; \mathbb{E}\{ (f(x) - x^\top \beta)^2 \}, \]

where the expectation is over the (random) \(x\)-vector of predictors. Then,

\[ \begin{aligned} \mathbb{E}_x&\{ (f(x) - \mathbb{E}\{\hat{f}_\lambda(x)\})^2\} \\ & = \mathbb{E}_x \{ (f(x) - x^\top \beta_*)^2\} + \mathbb{E}_x \{ (x^\top \beta_* - \mathbb{E}\{x^\top \hat{\beta}_\lambda\})^2\} \\ &= \mbox{Av[Model Bias]}^2 + \mbox{Av[Estimation Bias]}^2. \end{aligned} \]

  • model bias: error between the best linear fit and the true function
  • estimation bias: error between the average estimate and the best linear fit

Estimation v. model bias

Estimation bias:

  • For standard LMs, it is zero.
  • For restricted fits like ridge regression it is positive, and we trade it off with the benefits of reduced variance.


Model bias:

  • Always present in practice.
  • But we can reduce it by enriching the model, say by enlarging the predictor set to include interactions and transformations.
  • This may increase the variance, which can be controlled by using a larger penalty in the ridge regression.
  • That may increase the estimation bias.

A “Friedman” example

Consider a popular synthetic data-generating mechanism that came from the original MARS paper.

fried <- function (n=50, p=6)
{
  if(p < 5) stop("must have at least 5 cols")
  X <- matrix(runif(n * p), nrow = n)
  Ytrue <- 10*sin(pi*X[,1]*X[,2]) + 20*(X[,3]-0.5)^2 + 10*X[,4] + 5*X[,5]
  Y <- Ytrue + rnorm(n, 0, 1)
  return(data.frame(X, Y, Ytrue))
}
  • The surface is nonlinear in five input coordinates,
  • however you can ask for as many additional (useless) coordinates as you want.
  • Note that the function generates both the \(X\) values, uniform in \([0,1]^p\) and \(Y\)-values.

Quadratic fit

Now generate some random training and testing data.

n <- 100
p <- 7
train <- fried(n, p); test <- fried(1000, p)
Xorig <- as.matrix(train[,1:p]); XXorig <- as.matrix(test[,1:p])
yq <- drop(train$Y); yyq <- drop(test$Y)
yyqtrue <- drop(test$Ytrue)

Then expand out the design matrix for a global second-order model.

X <- cbind(Xorig, Xorig^2); XX <- cbind(XXorig, XXorig^2)
for(i in 1:(ncol(Xorig)-1)) for(j in (i+1):ncol(Xorig)) {
  X <- cbind(X, Xorig[,i]*Xorig[,j])
  XX <- cbind(XX, XXorig[,i]*XXorig[,j])
}

Measuring error

Consider the following function which

  • performs a ridge regression fit on X-y with penalty lambda,
  • predicts at the XX locations,
  • and measures squared error between the predictions and yy values provided,
  • which may be yytrue.
library(MASS)
ridge.Err <- function(X, y, lambda, XX, yytrue)
  {
    beta <- as.numeric(coef(lm.ridge(y~X, lambda=lambda)))
    yypred <- beta[1] + XX %*% beta[-1]
    Err <-  (yypred - yytrue)^2
    return(Err)
  }

Evaluation for many penalties \(\lambda\)

Consider a range of lambda values.

lambda <- seq(0,2,length=100)
Err.lam <- Err.lam.true <- rep(NA, length(lambda))
for(i in 1:length(lambda)) {
  Err.lam[i] <- mean(ridge.Err(X, yq, lambda[i], XX, yyq))
  Err.lam.true[i] <- mean(ridge.Err(X, yq, lambda[i], XX, yyqtrue))
}
  • Err.lam corresponds to a “data rich” situation where we have a big validation set.
  • Err.lam.true is a fantasy where we get to compare to true values.
  • (But still interesting in this illustration.)

plot(lambda, Err.lam, type="l", col=2, ylim=range(c(Err.lam, Err.lam.true)))
lines(lambda, Err.lam.true, col=3)
legend("topleft", c("Data Rich", "Truth"), lty=1, col=c(2,3))

  • Super hard to predict shape of the curve from one run to the next.

Converting from \(\lambda\) to df

Numerically solve for the \(\lambda\) implied by a specified degrees of freedom, df:

lambda.from.df <- function(X, df)
  {
    fn <- function(lambda, df, X) {
      Htrace <- sum(diag(X %*% solve(t(X)%*%X + diag(lambda, ncol(X))) %*% t(X)))
      return(abs(Htrace - df))
    }
    lambda <- optimize(fn, interval=c(0, sum(yq^2)), df=df, X=X)$minimum
  }

Now evaluate that function on a grid of dfs.

df <- seq(1,ncol(X), by=2)
Err.df.true <- Err.df <- rep(NA, length(df))
for(i in 1:length(df)) {
  lambda.df <- lambda.from.df(X, df[i])
  Err.df.true[i] <- mean(ridge.Err(X, yq, lambda.df, XX, yyqtrue))
  Err.df[i] <- mean(ridge.Err(X, yq, lambda.df, XX, yyq))
}

plot(df, Err.df, type="b", lwd=2, col=2, ylim=range(c(Err.df, Err.df.true)))
points(df, Err.df.true, type="b", lwd=2, col=3, pch=19)
legend("top", c("Data Rich", "Truth"), pch=c(1,19), col=c(2,3))

  • Much tamer: but what do we do when we’re not in a data-rich situation?

Cross validation

Probably the simplest and most widely used method for estimating the prediction error is cross-validation (CV).

It provides estimates of the generalization error \(\mathrm{Err} = \mathbb{E}\{L(Y, \hat{f}(X))\}\)

  • where each observation gets to play a part in both training and testing,
  • by splitting the data into \(K\) roughly equal-sized parts.


For fold \(k=1,\dots,K\) (\(k=3\) of \(K=5\) above),

  • train on the other \(K-1\) folds,
  • and test (or validate) on the \(k^\mathrm{th}\) fold.

CV partition into folds

Let

\[ \kappa: \{1,\dots, n\} \rightarrow \{1, \dots, K\} \]

be a random indexing function that indicates which fold (\(k\)) each observation (\(i\)) belongs to.

  • \(\kappa\) should also ensure that the fold sizes are balanced.

Here is a simple way to do that in R:

cv.folds <- function (n, folds = 10)
  {
    split(sample(1:n), rep(1:folds, length = n))
  }

CV error calculation

Let \(\hat{f}^{-k}(x)\) denote the fitted function for the \(k^{\mathrm{th}}\) fold.

Then the CV estimate of prediction error is

\[ \mathrm{CV} = \frac{1}{n} \sum_{i=1}^n L(y_i, \hat{f}^{-\kappa(i)}(x_i)). \]

Given a set of models \(f(x,\lambda)\) indexed by a tuning parameter \(\lambda\), we have

\[ \text{CV}(\lambda) = \frac{1}{n} \sum_{i=1}^n L(y_i, \hat{f}^{-\kappa(i)}(x_i, \lambda)), \]

and then CV\((\lambda)\) traces out a generalization error curve.

  • We can search for the \(\hat{\lambda}\) that minimizes it.

Back on the Friedman data

Here is an implementation on our Friedman example.

folds <- cv.folds(n, 5)
Err.cv <- rep(NA, length(df))
for(j in 1:length(df)) {
  Err.folds <- rep(NA, n)
  for(i in 1:length(folds)) {
    o <- folds[[i]]
    lambda.df <- lambda.from.df(X[-o,], df[j])
    Err.folds[o] <- ridge.Err(X[-o,], yq[-o], lambda.df, X[o,], yq[o])
  }
  Err.cv[j] <- mean(Err.folds)
}

A nice approximation to the unrealistic data-rich setting.

plot(df, Err.df, type="b", lwd=2, col=2, ylim=range(c(Err.df,Err.df.true,Err.cv)))
points(df, Err.cv, type="b", pch=18)
points(df, Err.df.true, type="b", lwd=2, col=3, pch=19)
legend("top", c("CV", "Data Rich", "Truth"), pch=c(18,21,19), col=c(1,2,3))

Now repeat the above idea 100 times:

reps <- 100
Err.cvs <- matrix(NA, nrow=reps, ncol=length(df))
for(r in 1:reps) {
  folds <- cv.folds(n, 5)
  for(j in 1:length(df)) {
    Err.folds <- rep(NA, n)
    for(i in 1:length(folds)) {
      o <- folds[[i]]
      lambda.df <- lambda.from.df(X[-o,], df[j])
      Err.folds[o] <- ridge.Err(X[-o,], yq[-o], lambda.df, X[o,], yq[o])
    }
    Err.cvs[r,j] <- mean(Err.folds)
  }
}

… and summarize the distribution of the CV error:

m <- apply(Err.cvs, 2, mean);
q2 <- apply(Err.cvs, 2, quantile, 0.75)
q1 <- apply(Err.cvs, 2, quantile, 0.25)

plot(df, Err.df, type="b", lwd=2, col=2, ylim=range(c(Err.df,Err.df.true,Err.cv)))
points(df, m, type="b", pch=18); lines(df, q1, lty=2); lines(df, q2, lty=2)
points(df, Err.df.true, type="b", lwd=2, col=3, pch=19)
legend("top", c("CV", "Data Rich", "Truth"), pch=c(18,21,19), col=c(1,2,3))

One-standard-error rule

That distribution can be useful for finding the minimum.

  • Standard practice is to choose the model \(\hat{f}\) via the CV error bars, rather than from the mean directly or from a single realization.

The one-standard-error rule suggests choosing the “most parsimonious” model whose error is no more than one standard error above the error of the best model.

  • For ridge regression, say, we choose the model with the smallest df whose mean CV error is smaller than the lowest upper quantile.
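
In code, using the quantities computed above (and treating the spread across CV replicates as a stand-in for a standard error), the rule might look as follows.

min(df[m <= min(q2)])         ## smallest df whose mean CV error beats the lowest upper quartile
se <- apply(Err.cvs, 2, sd)   ## or, closer to the textbook rule, via (replicate-based) standard errors
min(df[m <= min(m) + se[which.min(m)]])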

Choosing K

Typical choices for \(K\) include 5, 10, and \(n\).

Compared to \(K=5\), \(K=10\)

  • involves slightly larger training sets (90% v. 80% of the data), testing sets half the size, and 2x more computation.

\(K=n\) is called leave-one-out cross validation (LOO-CV). Advantages include

  • not random;
  • approximately unbiased for the true prediction error.

But disadvantages include

  • high variance because the \(n\) training sets are so similar to one another, with singleton testing sets;
  • a large computational burden, requiring \(n\) applications of the fitting method.

Generalized CV

Generalized cross-validation (GCV) provides a convenient approximation to LOO-CV when the model is linear and under squared error loss,

  • in which case we have that \(\hat{y} = S y\).

Now, for many linear fitting methods, \[ \frac{1}{n} \sum_{i=1}^n [y_i - \hat{f}^{-i}(x_i)]^2 = \frac{1}{n} \sum_{i=1}^n \left[ \frac{y_i - \hat{f}(x_i)}{1 - S_{ii}} \right]^2. \]

The GCV approximation is \[ \mathrm{GCV} = \frac{1}{n} \sum_{i=1}^n \left[\frac{y_i - \hat{f}(x_i)}{1 - \text{trace}(S)/n } \right]^2. \]

  • GCV can have a computational advantage in some settings—where the trace of \(S\) (i.e., df) can be computed more easily than the individual elements \(S_{ii}\).
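
To see the identity and the approximation in action, here is a small check using the global cubic OLS fit (fit.cub) from earlier, for which the leave-one-out shortcut holds exactly: brute-force LOO residuals versus the hat-value shortcut, and then the GCV version.

h <- hatvalues(fit.cub)
loo <- rep(NA, length(x))
for(i in 1:length(x)) {                                ## brute-force leave-one-out
  fit.i <- lm(y[-i] ~ x[-i] + I(x[-i]^2) + I(x[-i]^3))
  loo[i] <- y[i] - drop(cbind(1, x[i], x[i]^2, x[i]^3) %*% coef(fit.i))
}
mean(loo^2)
mean((resid(fit.cub)/(1 - h))^2)                       ## matches, up to round-off
mean((resid(fit.cub)/(1 - sum(h)/length(x)))^2)        ## the GCV approximation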

Software

Fast CV and GCV routines are built into many software packages.

  • For example, the smooth.spline function has a cv argument allowing either ordinary (LOO) CV or GCV values to be output.

E.g., revisiting our sinusoidal data from before, and evaluating for all degrees of freedom,

df2 <- 2:(length(x)-1)
cv <- rep(NA, length(df2))
for(i in 1:length(df2)) {
  cv[i] <- smooth.spline(x, y, df=df2[i])$cv.crit
}

CV for smoothing splines

plot(df2, cv)

  • Looks like df \(\in \{11,12,13\}\) or so is best.

Bootstrap

The bootstrap is a general tool for assessing statistical accuracy.

At first glance it seems like voodoo, or like you get something for nothing. Hence the name …

The phrase “pull oneself over a fence by one’s bootstraps” goes back to the early 19th century United States.

  • It means that one has initiated a self-sustaining process without external help—something absurdly impossible.
  • Its most common use is in the business world.

In statistics, it’s all about (re-) sampling with replacement.

Suppose we have a model and a fitting method for a particular set of training data.

Denote the training set by \[ Z = (z_1, z_2, \dots, z_n) \;\;\; \mbox{where} \;\;\; z_i = (x_i, y_i). \]

Suppose that you were interested in some statistic \(S(Z)\) about an aspect of the fit \(\hat{f} \mid Z\).

  • \(S\) may be the variance of some parameter \(\theta\) used by \(f\),
  • or an aspect of the predictive equations \(\hat{f}(x)\), etc.

The idea is to randomly draw datasets with replacement from the training data (\(Z\)), each of which gives a different realization of \(S\).

  • Each sample is typically the same size as the original set (\(n\)).
  • This is done \(B\) times (\(B = 100\), say), producing \(B\) bootstrap datasets.
  • In particular, \(B\) samples of \(S\) are obtained, which define a nonparametric distribution.

A bootstrap cartoon …

Estimate anything

Based on the bootstrap we can estimate any aspect of the distribution of \(S(Z)\).

  • variance: \[ \widehat{\mathbb{V}\mathrm{ar}}(S(Z)) = \frac{1}{B-1} \sum_{b=1}^B (S(Z^{*b}) - \bar{S}^*)^2, \quad \mbox{ where } \quad \bar{S}^* = \frac{1}{B} \sum_b S(Z^{*b}) \]
  • confidence intervals via the \((1-\frac{\alpha}{2})(B+1)\) and \(\frac{\alpha}{2}(B+1)\) order statistics of \(S(Z^{*1}), \dots, S(Z^{*B})\).
  • bias: \[ \widehat{\mathrm{Bias}}(S(Z)) = \frac{1}{B} \sum_{b=1}^B S(Z^{*b}) - \hat{S} \quad \mbox{ where } \quad \hat{S} = S(Z), \] i.e., based on the fit \(\hat{f}\) obtained on the original (full) training set.

Simple example

As a simple example, consider estimating uncertainty in the mean of a population.

  • Some synthetic data, but we don’t presume to know the distribution.
alpha <- 4; ibeta <- 1/4
n <- 30
z <- rgamma(n, alpha, ibeta)
  • An estimate of the mean.
zbar <- mean(z)
  • Bootstrap samples.
B <- 199
Z.star <- matrix(NA, nrow=B, ncol=n)
for(b in 1:B) {
  Z.star[b,] <- sample(z, n, replace=TRUE)
}

hist(z, freq=FALSE, main="", ylim=c(0,0.3))
zbar.boot <- apply(Z.star, 1, mean)
hist(zbar.boot, freq=FALSE, col=2, add=TRUE)
points(zbar, 0, col=3, cex=1.5)
q.boot <- quantile(zbar.boot, c(0.025, 0.975))
text(q.boot, c(0,0), c("[", "]"), col=4, cex=1.5)
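
The variance and bias formulas above are similarly one-liners on the bootstrap draws.

sum((zbar.boot - mean(zbar.boot))^2)/(B-1)   ## bootstrap variance estimate, i.e., var(zbar.boot)
mean(zbar.boot) - zbar                       ## bootstrap bias estimate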

Bootstrapped smoothing spline

Consider a bootstrapped uncertainty calculation for our smoothing splines predictor.

B <- 1000
YYpred.boot <- matrix(NA, nrow=B, ncol=length(xx))
for(b in 1:B) {
  indices <- sample(1:length(x), length(x), replace=TRUE)
  xi <- x[indices]
  fit <- smooth.spline(xi, y[indices], df=11)
  YYpred.boot[b,] <- as.matrix(predict(fit, data.frame(xi=xx))$y)
}

Now calculate means and quantiles for plotting.

m <- apply(YYpred.boot, 2, mean)
q1 <- apply(YYpred.boot, 2, quantile, p=0.05)
q2 <- apply(YYpred.boot, 2, quantile, p=0.95)

Accounting for mean uncertainty.

matplot(xx, t(YYpred.boot), type="l", lty=1, col="gray", xlab="x", ylab="y")
points(x, y); lines(xx, m, lwd=2)
lines(xx, q1, col=2, lty=2); lines(xx, q2, col=2, lty=2)

Uncertainty in complexity?

Can we account for uncertainty in the complexity of the fit?

  • Yes, just add an inner-loop of CV to choose the df.
YYpred.boot.cv <- matrix(NA, nrow=B, ncol=length(xx))
for(b in 1:B) {
  indices <- sample(1:length(x), length(x), replace=TRUE)
  df3 <- 2:length(unique(indices))
  cv2 <- rep(NA, length(df3))
  for(i in 1:length(df3)) 
    cv2[i] <- smooth.spline(x[indices], y[indices], df=df3[i])$cv.crit
  xi <- x[indices]
  fit <- smooth.spline(xi, y[indices], df=df3[which.min(cv2)])
  YYpred.boot.cv[b,] <- as.matrix(predict(fit, data.frame(xi=xx))$y)
}

Again, getting means and quantiles for plotting.

m <- apply(YYpred.boot.cv, 2, mean)
q1 <- apply(YYpred.boot.cv, 2, quantile, p=0.05)
q2 <- apply(YYpred.boot.cv, 2, quantile, p=0.95)

Accounting for full uncertainty.

matplot(xx, t(YYpred.boot.cv), type="l", lty=1, col="gray", xlab="x", ylab="y")
points(x, y); lines(xx, m, lwd=2)
lines(xx, q1, col=2, lty=2); lines(xx, q2, col=2, lty=2)

Higher dimension?

Multidimensional splines

That was quite a digression!

Everything presented so far has a multidimensional analog (i.e., for \(p > 1\)), technically.

Consider \(\mathbb{R}^2\). Given a set of \(M_1\) basis functions \(h_{1j}(x)\) and \(M_2\) ones \(h_{2k}(x)\), then the \(M_1 \times M_2\) tensor product basis defined by

\[ g_{jk}(x) = h_{1j}(x_1) h_{2k}(x_2), \quad j=1,\dots,M_1,\; k=1,\dots, M_2 \]

can be used to represent a two-dimensional function

\[ g(x) = \sum_{j=1}^{M_1} \sum_{k=1}^{M_2} \beta_{jk} g_{jk}(x). \]

  • The coefficients \(\beta_{jk}\) can be fit by OLS as before.
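
Here is a minimal sketch of a tensor product fit in \(\mathbb{R}^2\), crossing two univariate bs bases (the splines library is already loaded from earlier) column-by-column and estimating the \(\beta_{jk}\) by OLS; the data-generating function and basis sizes below are arbitrary illustrative choices.

X2 <- matrix(runif(200*2), ncol=2)                 ## illustrative 2-d inputs
y2 <- sin(2*pi*X2[,1])*cos(pi*X2[,2]) + rnorm(200, sd=0.1)
H1 <- bs(X2[,1], df=6); H2 <- bs(X2[,2], df=6)     ## M1 = M2 = 6 univariate bases
G <- matrix(NA, nrow=nrow(X2), ncol=ncol(H1)*ncol(H2))
for(j in 1:ncol(H1)) for(k in 1:ncol(H2))          ## g_jk(x) = h_1j(x_1) * h_2k(x_2)
  G[, (j-1)*ncol(H2) + k] <- H1[,j]*H2[,k]
fit.tensor <- lm(y2 ~ G)                           ## 36 tensor product coefficients plus intercept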

Alternatives

Instead, one usually simplifies the basis, e.g., as in

  • thin plate splines
  • radial bases, which are related to Gaussian processes
  • using a coarse lattice of (i.e., fewer) knots.

Or by greedy methods,

  • like MARS: Multivariate Adaptive Regression Splines; see the mda package for R.


Basically, it turns out you mostly want to do something else!