To understand the Gaussian Process
We’ll see that, almost in spite of the technical (over-)analysis of its properties and the sometimes strange vocabulary used to describe its features, Gaussian process (GP) is a very generic term.
All it means is that any finite collection of realizations (or observations) has a multivariate normal (MVN) distribution.
That, in turn, means that the characteristics of those realizations are completely described by their mean and covariance structure.
You’ll hear people talk about function spaces, reproducing kernel Hilbert spaces, and so on, but mostly that makes things seem fancier than they really are.
Consider a covariance function defined by Euclidean distance:
\[ \mathbb{C}\mathrm{ov}(Y(x), Y(x')) = \Sigma(x, x') = \exp\{ - || x - x '||^2 \} \]
Finally, note that if we define a covariance matrix \(\Sigma_n\), based on evaluating \(\Sigma(x_i, x_j)\) at all pairs of the \(n\) \(x\)-values \(x_1, \dots, x_n\), it must be positive definite:
\[ x^\top \Sigma_n x > 0 \quad \mbox{ for all } x \ne 0. \]
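As a quick sanity check (not part of the original development; the inputs are made up), we can build such a \(\Sigma_n\) from a handful of 1d locations and confirm that its eigenvalues are strictly positive. Base R’s dist returns Euclidean distances, so we square them to match the formula above.
x <- c(0.1, 0.8, 1.7, 3.2) ## a few arbitrary 1d inputs
Sig <- exp(-as.matrix(dist(x))^2) ## Sigma(x_i, x_j) from the formula above
eigen(Sig, symmetric=TRUE, only.values=TRUE)$values ## all strictly positive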
To see how GPs can be used to perform regression, let’s first see how they can be used to generate random data following a smooth functional relationship.
Suppose we take a collection of input locations \(x_1, \dots, x_n\), build the \(n \times n\) covariance matrix \(\Sigma_n\) from \(\Sigma(x, x')\) above, and draw \(Y \sim \mathcal{N}_n(0, \Sigma_n)\).
Note that inputs close together yield highly correlated outputs, so the realization traces out a smooth function of the inputs.
Here is a version of that with \(x\)-values on a 1d grid.
n <- 100
X <- matrix(seq(0, 10, length=n), ncol=1)
library(plgp)
D <- distance(X)
eps <- sqrt(.Machine$double.eps) ## defining a small number
Sigma <- exp(-D) + diag(eps, n) ## add jitter for numerical stability
library(mvtnorm)
Y <- rmvnorm(1, sigma=Sigma)
That’s it!
plot(X, Y, type="l")
What are the properties of this function?
Several are super easy to deduce from the form of the covariance structure: the marginal variance is 1 (so realizations have an amplitude of roughly 2), only distance between inputs matters (stationarity), and nearby outputs are highly correlated, which makes realizations smooth. For example, outputs whose inputs are one unit apart have correlation \(\exp\{-1^2\} \approx 0.37\), whereas inputs four units apart have correlation \(\exp\{-4^2\} \approx 10^{-7}\), i.e., essentially uncorrelated:
c(exp(-1^2), exp(-4^2))
## [1] 3.678794e-01 1.125352e-07
But we can’t anticipate much else about the nature of a particular realization.
Y <- rmvnorm(3, sigma=Sigma)
matplot(X, t(Y), type="l", ylab="Y")
Of course, we’re not in the business of generating random functions.
Instead, we want to ask: given data \(D_n = (X_n, Y_n)\) observed from such a process, what values \(Y(x)\) are likely at a new input \(x\)?
I.e., we want to know about the conditional distribution of \(Y(x) \mid D_n\).
That may sound Bayesian, but we don’t need to get all Bayesian about it.
That conditional distribution, the predictive distribution, is the cornerstone of regression analysis.
Forget, for the moment, about interpretation (Bayesian updating, or a twist on simple regression).
The curious, and most noteworthy, thing is that so far there are no parameters in the current setup!
Let’s just focus on deriving that conditional distribution.
Deriving that predictive distribution is a simple application of deducing conditional distributions from a (joint) MVN.
From Wikipedia, if an \(N\) dimensional random vector \(x\) is partitioned as
\[ x = \left( \begin{array}{c} x_1 \\ x_2 \end{array} \right) \quad \mbox{ with sizes } \quad \left( \begin{array}{c} q \times 1 \\ (N-q) \times 1 \end{array} \right), \]
and accordingly \(\mu\) and \(\Sigma\) are partitioned as,
\[ \mu = \left( \begin{array}{c} \mu_1 \\ \mu_2 \end{array} \right) \quad \mbox{ with sizes } \quad \left( \begin{array}{c} q \times 1 \\ (N-q) \times 1 \end{array} \right) \] and \[ \Sigma = \left(\begin{array}{cc} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{array} \right) \ \mbox{ with sizes } \ \left(\begin{array}{cc} q \times q & q \times (N-q) \\ (N-q)\times q & (N - q)\times (N-q) \end{array} \right), \]
… then, the distribution of \(x_1\) conditional on \(x_2\) is MVN \(x_1 \mid x_2 \sim \mathcal{N}_q (\bar{\mu}, \bar{\Sigma})\), where
\[ \begin{aligned} \bar{\mu} &= \mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(x_2 - \mu_2) \\ \mbox{and } \quad \bar{\Sigma} &= \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}. \end{aligned} \]
An interesting feature of this result is that conditioning upon \(x_2\) alters (decreases) the variance of \(x_1\) compared to its marginal variance \(\Sigma_{11}\), and that the amount of the decrease does not depend on the value of \(x_2\) observed.
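As a small sanity check of those formulas, here is a made-up 3d example (not from the original text): conditioning the first coordinate on the other two shifts its mean and drops its variance below the marginal value of 1.
Sigma3 <- matrix(c(1.0, 0.7, 0.2,
0.7, 1.0, 0.5,
0.2, 0.5, 1.0), ncol=3) ## a valid 3x3 covariance; mu = 0
x2 <- c(1, -1) ## observed values of coordinates 2 and 3
S12 <- Sigma3[1, 2:3, drop=FALSE]
S22 <- Sigma3[2:3, 2:3]
mub <- S12 %*% solve(S22) %*% x2 ## conditional mean
Sb <- Sigma3[1, 1] - S12 %*% solve(S22) %*% t(S12) ## conditional variance
c(mean=drop(mub), var=drop(Sb)) ## variance is below the marginal 1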
How do we deploy that to derive \(Y(x) \mid D_n\)?
Consider an \(n+1^\mathrm{st}\) observation \(Y(x)\). Now, let \(Y(x)\) and \(Y_n\) have a joint MVN distribution with mean zero and covariance function \(\Sigma(x,x')\).
That is, stack
\[ \left( \begin{array}{c} Y(x) \\ Y_n \end{array} \right) \quad \mbox{ with sizes } \quad \left( \begin{array}{c} 1 \times 1 \\ n \times 1 \end{array} \right), \]
and if \(\Sigma(X_n,x)\) is the \(n \times 1\) matrix comprised of \(\Sigma(x_1, x), \dots, \Sigma(x_n, x)\), its covariance structure can be partitioned as follows.
\[ \left(\begin{array}{cc} \Sigma(x,x) & \Sigma(x,X_n) \\ \Sigma(X_n,x) & \Sigma_n \end{array} \right) \ \mbox{ with sizes } \ \left(\begin{array}{cc} 1 \times 1 & 1 \times n \\ n\times 1 & n \times n \end{array} \right) \]
Our conditional results for the MVN give us the following predictive distribution
\[ Y(x) \mid D_n \sim \mathcal{N}(\mu(x), \sigma^2(x)) \]
with
\[ \begin{aligned} \mbox{mean } \quad \mu(x) &= \Sigma(x, X_n) \Sigma_n^{-1} Y_n \\ \mbox{and variance } \quad \sigma^2(x) &= 1 - \Sigma(x, X_n) \Sigma_n^{-1} \Sigma(x, X_n)^\top. \end{aligned} \]
\(\mu(x)\) is linear in the observations \(Y_n\).
\(\sigma^2(x)\) is lower than the marginal variance \(\Sigma(x,x) = 1\).
Those were “pointwise” predictive calculations; they say nothing about the joint distribution of predictions at several new locations.
However, that’s easily fixed by considering a bunch of \(x\) locations, in a predictive design \(\mathcal{X}\) of \(n'\) rows, say, all at once:
\[ Y(\mathcal{X}) \mid D_n \sim \mathcal{N}_{n'}(\mu(\mathcal{X}), \Sigma(\mathcal{X})) \]
with
\[ \begin{aligned} \mbox{mean } \quad \mu(\mathcal{X}) &= \Sigma(\mathcal{X}, X_n) \Sigma_n^{-1} Y_n \\ \mbox{and variance } \quad \Sigma(\mathcal{X}) &= \Sigma(\mathcal{X},\mathcal{X}) - \Sigma(\mathcal{X}, X_n) \Sigma_n^{-1} \Sigma(\mathcal{X}, X_n)^\top. \end{aligned} \]
Consider some super simple data in 1d.
Here are the relevant data quantities.
n <- 8
X <- matrix(seq(0,2*pi,length=n), ncol=1)
y <- sin(X)
D <- distance(X)
Sigma <- exp(-D)
Here are the relevant predictive quantities.
XX <- matrix(seq(-0.5, 2*pi+0.5, length=100), ncol=1)
DXX <- distance(XX)
SXX <- exp(-DXX) + diag(eps, ncol(DXX))
DX <- distance(XX, X)
SX <- exp(-DX)
Now we just follow the formulas.
Si <- solve(Sigma)
mup <- SX %*% Si %*% y
Sigmap <- SXX - SX %*% Si %*% t(SX)
Joint draws are obtained from the predictive distribution as follows.
YY <- rmvnorm(100, mup, Sigmap)
We can plot those \(Y(\mathcal{X}) =\) YY samples as a function of the input \(\mathcal{X} =\) XX locations.
q1 <- mup + qnorm(0.05, 0, sqrt(diag(Sigmap)))
q2 <- mup + qnorm(0.95, 0, sqrt(diag(Sigmap)))
matplot(XX, t(YY), type="l", col="gray", lty=1, xlab="x", ylab="y")
points(X, y, pch=20, cex=2)
lines(XX, mup, lwd=2); lines(XX, sin(XX), col="blue")
lines(XX, q1, lwd=2, lty=2, col=2); lines(XX, q2, lwd=2, lty=2, col=2)
What do we observe? The predictive mean interpolates the training data, and the predictive variance shrinks to zero at those locations, growing as we move away from them, so the sample paths and quantile lines bulge out between data points and flare out at the edges of the input space. These features, but especially the football shape, are what make GPs popular for computer experiments.
The same machinery applies in higher input dimension; nothing special is required, except that visualization is lots simpler in 1d or 2d.
Consider a random function in 2d sampled from the GP prior.
nx <- 20
x <- seq(0,2,length=nx)
X <- expand.grid(x, x)
D <- distance(X)
Sigma <- exp(-D) + diag(eps, nrow(X))
Two realizations:
Y <- rmvnorm(2, sigma=Sigma)
par(mfrow=c(1,2), mar=c(1,0.5,0.5,0.5))
persp(x,x, matrix(Y[1,], ncol=nx), theta=-30,phi=30,xlab="x1",ylab="x2",zlab="y")
persp(x,x, matrix(Y[2,], ncol=nx), theta=-30,phi=30,xlab="x1",ylab="x2",zlab="y")
Consider some 2d synthetic data and a 2d predictive grid.
library(lhs)
X <- randomLHS(40, 2)
X[,1] <- (X[,1] - 0.5)*6 + 1
X[,2] <- (X[,2] - 0.5)*6 + 1
y <- X[,1] * exp(-X[,1]^2 - X[,2]^2)
xx <- seq(-2,4, length=40); XX <- expand.grid(xx, xx)
Here are the relevant data quantities, essentially cut-and-paste from above.
D <- distance(X)
Sigma <- exp(-D)
Here are the relevant predictive quantities.
DXX <- distance(XX)
SXX <- exp(-DXX) + diag(eps, ncol(DXX))
DX <- distance(XX, X)
SX <- exp(-DX)
Now we just follow the formulas: these are identical.
Si <- solve(Sigma)
mup <- SX %*% Si %*% y
Sigmap <- SXX - SX %*% Si %*% t(SX)
It is hard to visualize a multitude of sample paths in 2d, although we could generate them with the rmvnorm command if we’d like. Instead, we’ll focus on plotting pointwise summaries, namely the predictive mean mup from above, and the predictive standard deviation:
sdp <- sqrt(diag(Sigmap))
Beautiful:
par(mfrow=c(1,2)); cols <- heat.colors(128)
image(xx,xx, matrix(mup, ncol=length(xx)), xlab="x1",ylab="x2", col=cols)
points(X[,1], X[,2])
image(xx,xx, matrix(sdp, ncol=length(xx)), xlab="x1",ylab="x2", col=cols)
points(X[,1], X[,2])
What do we observe? Pretty much the same thing as in the 1d case.
Here is another look at what we predicted.
par(mar=c(1,0.5,0,0.5))
persp(xx,xx, matrix(mup, ncol=40), theta=-30,phi=30,xlab="x1",ylab="x2",zlab="y")
Hopefully you’re starting to be convinced that GPs represent a powerful nonparametric regression tool.
It’s kind of cool that they do so well without really having to learn anything.
But when you think about it a little bit, there are lots of (hidden) assumptions which are going to be violated by most real-data contexts.
Let’s suppose you wanted your prior to generate random functions which had an amplitude larger than two. We can introduce a scale hyperparameter \(\tau^2\) and take \(\Sigma(x, x') = \tau^2 C(x, x')\).
Here, \(C\) is basically the same as our \(\Sigma\) before: a positive-definite correlation function for which \(C(x,x) = 1\) and \(C(x,x') < 1\) for \(x \ne x'\). E.g.,
\[ C(x, x') = \exp \{ -|| x - x' ||^2 \} \]
Now our MVN generator looks like
\[ Y \sim \mathcal{N}_n(0, \tau^2 C_n). \]
That ought to do the trick. E.g., for an amplitude of 10, choose \(\tau^2 = 5^2 = 25\).
n <- 100; X <- matrix(seq(0, 10, length=n), ncol=1)
D <- distance(X); C <- exp(-D) + diag(eps, n)
tau2 <- 25; Y <- rmvnorm(1, sigma=tau2 * C)
plot(X, Y, type="l")
Again, who cares about generating data?
What would happen if we had some data with an amplitude of 5, but used a GP with a built-in scale of 1 [amplitude of 2]?
n <- 8
X <- matrix(seq(0,2*pi,length=n), ncol=1)
y <- 5*sin(X) ## this is the only difference
D <- distance(X)
Sigma <- exp(-D)
XX <- matrix(seq(-0.5, 2*pi+0.5, length=100), ncol=1)
DXX <- distance(XX)
SXX <- exp(-DXX) + diag(eps, ncol(DXX))
DX <- distance(XX, X)
SX <- exp(-DX)
Si <- solve(Sigma);
mup <- SX %*% Si %*% y
Sigmap <- SXX - SX %*% Si %*% t(SX)
YY <- rmvnorm(100, mup, Sigmap)
q1 <- mup + qnorm(0.05, 0, sqrt(diag(Sigmap)))
q2 <- mup + qnorm(0.95, 0, sqrt(diag(Sigmap)))
matplot(XX, t(YY), type="l", col="gray", lty=1, xlab="x", ylab="y")
points(X, y, pch=20, cex=2)
lines(XX, mup, lwd=2); lines(XX, 5*sin(XX), col="blue")
lines(XX, q1, lwd=2, lty=2, col=2); lines(XX, q2, lwd=2, lty=2, col=2)
Actually, the “scale 1” GP was pretty robust: the predictive mean is unaffected, because \(\tau^2\) would cancel in that formula.
But it is overconfident.
We are under-estimating the predictive uncertainty, and if we look closely, we can see that the true function goes well outside our predictive interval at the edges of the input space.
How do we estimate the scale?
As with any “parameter”, we have many choices when it comes to estimation.
I have a strong preference for likelihood-based methods (MLE/Bayes) because they are relatively hands-off, and nicely generalize to higher dimensional parameter spaces.
(Strange that we’ve been talking priors and posteriors without likelihoods.)
Our data-generating process is \(Y \sim \mathcal{N}_n(0, \tau^2 C_n)\), so
\[ L \equiv L(\tau^2, C) = (2\pi \tau^2)^{-\frac{n}{2}} | C_n |^{-\frac{1}{2}} \exp\left\{- \frac{1}{2\tau^2} Y_n^\top C_n^{-1} Y_n \right\}. \]
\[ \ell = \log L = -\frac{n}{2} \log 2\pi - \frac{n}{2} \log \tau^2 - \frac{1}{2} \log |C_n| - \frac{1}{2\tau^2} Y_n^\top C_n^{-1} Y_n. \]
To maximize that likelihood with respect to \(\tau^2\), say, just differentiate and solve.
\[ \begin{aligned} 0 \stackrel{\mathrm{set}}{=} \ell' &= - \frac{n}{2 \tau^2} + \frac{1}{2 (\tau^2)^2} Y_n^\top C_n^{-1} Y_n. \\ \mbox{so } \hat{\tau}^2 &= \frac{Y_n^\top C_n^{-1} Y_n}{n}. \end{aligned} \]
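Here is a quick numerical confirmation of that solution, purely as an illustration with made-up quantities (xt, Cn, and Yt are not from the original): the \(\tau^2\)-dependent part of the log likelihood, evaluated on a grid, peaks at \(Y_n^\top C_n^{-1} Y_n / n\).
xt <- matrix(seq(0, 3, length=10), ncol=1) ## toy inputs
Cn <- exp(-distance(xt)) + diag(eps, 10) ## toy correlation matrix
Yt <- drop(rmvnorm(1, sigma=4*Cn)) ## toy data generated with tau^2 = 4
llt <- function(tau2) { ## only the terms involving tau^2
-(10/2)*log(tau2) - drop(t(Yt) %*% solve(Cn) %*% Yt)/(2*tau2)
}
tau2s <- seq(0.1, 20, length=1000)
c(grid.max=tau2s[which.max(sapply(tau2s, llt))],
formula=drop(t(Yt) %*% solve(Cn) %*% Yt)/10) ## these should (roughly) agree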
Now when we plug \(\hat{\tau}^2\) into the predictive equations, we’re technically turning a (multivariate) normal into a (multivariate) Student-\(t\) with \(n\) degrees of freedom.
We have
\[ \begin{aligned} Y(\mathcal{X}) \mid D_n & \sim \mathcal{N}_{n'}(\mu(\mathcal{X}), \Sigma(\mathcal{X})) \\ \mbox{with mean } \quad \mu(\mathcal{X}) &= C(\mathcal{X}, X_n) C_n^{-1} Y_n \\ \mbox{and variance } \quad \Sigma(\mathcal{X}) &= \hat{\tau}^2[C(\mathcal{X},\mathcal{X}) - C(\mathcal{X}, X_n) C_n^{-1} C(\mathcal{X}, X_n)^\top]. \end{aligned} \]
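The code below nevertheless sticks with Gaussian quantiles for simplicity. As a small aside (not in the original), swapping qt in for qnorm would acknowledge the Student-\(t\); with only \(n = 8\) observations the difference is noticeable.
c(normal=qnorm(0.95), student.t=qt(0.95, df=8)) ## the t quantile is wider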
First estimate \(\tau^2\).
CX <- SX; Ci <- Si; CXX <- SXX ## correlations are unchanged; only the scale is new
tau2hat <- drop(t(y) %*% Ci %*% y / length(y))
2*sqrt(tau2hat) ## implied amplitude estimate
## [1] 5.486648
Now estimate the predictive mean vector and covariance matrix …
mup2 <- CX %*% Ci %*% y
Sigmap2 <- tau2hat * (CXX - CX %*% Ci %*% t(CX))
YY <- rmvnorm(100, mup2, Sigmap2)
q1 <- mup2 + qnorm(0.05, 0, sqrt(diag(Sigmap2)))
q2 <- mup2 + qnorm(0.95, 0, sqrt(diag(Sigmap2)))
… and visualize …
Much better.
matplot(XX, t(YY), type="l", col="gray", lty=1, xlab="x", ylab="y")
points(X, y, pch=20, cex=2)
lines(XX, mup2, lwd=2); lines(XX, 5*sin(XX), col="blue")
lines(XX, q1, lwd=2, lty=2, col=2); lines(XX, q2, lwd=2, lty=2, col=2)
To “break” interpolation, we need to “break” the continuity of the correlation structure on the diagonal.
The simplest way to “break it” is to take
\[ K(x, x') = C(x, x') + g \delta_{x, x'} \]
What does that mean?
We only add in \(g\) when indices of \(x\) are the same, not simply for identical values.
So the previous slide abused notation a little.
This leads to the following representation of the data-generating mechanism.
\[ Y \sim \mathcal{N}_n(0, \tau^2 K_n) \]
I.e., the covariance matrix \(\Sigma_n\) is comprised of entries
\[ \Sigma_n^{ij} = \tau^2 (C(x_i, x_j) + g \delta_{ij}) \]
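A tiny illustration of the \(\delta_{ij}\) semantics may help (Xtoy and gtoy are made up for this purpose): two copies of the same input are perfectly correlated off the diagonal, but the nugget only appears on the diagonal, where the indices match.
Xtoy <- matrix(c(1, 1), ncol=1) ## the same input location, twice
gtoy <- 0.1
exp(-distance(Xtoy)) + diag(gtoy, 2) ## off-diagonal 1; diagonal 1 + g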
How, then, do we estimate the hyperparameter \(g\)?
The MLE \(\hat{\tau}^2\), given \(g\) (the only other hyperparameter), is
\[ \hat{\tau}^2 = \frac{Y_n^\top K_n^{-1} Y_n}{n} = \frac{Y_n^\top (C_n + g \mathbb{I}_n)^{-1} Y_n}{n}. \]
So let’s plug that back into our log likelihood, to get a concentrated log likelihood (or profile likelihood) involving just the remaining parameter \(g\).
\[ \begin{aligned} \ell(g) &= -\frac{n}{2} \log 2\pi - \frac{n}{2} \log \hat{\tau}^2 - \frac{1}{2} \log |K_n| - \frac{1}{2\hat{\tau}^2} Y_n^\top K_n^{-1} Y_n \\ &= c - \frac{n}{2} \log Y_n^\top K_n^{-1} Y_n - \frac{1}{2} \log |K_n|, \end{aligned} \]
where \(c\) collects constants that don’t depend on \(g\) (after substituting \(\hat{\tau}^2\), the final term of the first line reduces to the constant \(-n/2\)).
Unfortunately, maximizing \(\ell(g)\) requires numerical methods.
The simplest thing to do is to throw it into optimize.
counter <- 0
nlg <- function(g, D, Y)
{
n <- length(Y)
K <- exp(-D) + diag(g, n)
Ki <- solve(K)
ldetK <- determinant(K, logarithm=TRUE)$modulus
ll <- - (n/2) * log(t(Y) %*% Ki %*% Y) - (1/2) * ldetK
counter <<- counter + 1
return(-ll)
}
A sensible search interval for the nugget runs from eps up to var(y). Defining some noisy data quantities:
X <- rbind(X, X); n <- nrow(X); D <- distance(X)
y <- 5*sin(X) + rnorm(n, sd=1)
Optimizing to estimate the nugget.
print(g <- optimize(nlg, interval=c(eps, var(y)), D=D, Y=y)$minimum)
## [1] 0.1028227
And that gives …
K <- exp(-D) + diag(g, n)
Ki <- solve(K)
print(tau2hat <- drop(t(y) %*% Ki %*% y / n))
## [1] 7.949731
Now prediction using the estimated hyperparameters …
DX <- distance(XX, X); KX <- exp(-DX)
KXX <- exp(-DXX) + diag(g, nrow(DXX))
Derive the predictive mean vector and covariance matrix.
mup <- KX %*% Ki %*% y
Sigmap <- tau2hat * (KXX - KX %*% Ki %*% t(KX))
q1 <- mup + qnorm(0.05, 0, sqrt(diag(Sigmap)))
q2 <- mup + qnorm(0.95, 0, sqrt(diag(Sigmap)))
If we want to show sample predictive realizations, and want them to look pretty, we should “subtract” out the extrinsic noise, i.e., use only eps (rather than the nugget g) on the diagonal of the predictive covariance.
Sigma.int <- tau2hat * (exp(-DXX) + diag(eps, nrow(DXX)) - KX %*% Ki %*% t(KX))
YY <- rmvnorm(100, mup, Sigma.int)
matplot(XX, t(YY), type="l", lty=1, col="gray", xlab="x", ylab="y")
points(X, y, pch=20, cex=2)
lines(XX, mup, lwd=2); lines(XX, 5*sin(XX), col="blue")
lines(XX, q1, lwd=2, lty=2, col=2); lines(XX, q2, lwd=2, lty=2, col=2)
It can be unsatisfying to “brute-force” an optimization for a hyperparameter like \(g\), but optimize is often superior to cleverer methods. Can we improve on the number of evaluations?
print(nlg.count <- counter)
## [1] 19
How about using derivative information? For that, the following is useful.
\[ \frac{\partial K_n^{-1}}{\partial \phi} = - K_n^{-1} \frac{\partial K_n}{\partial \phi} K_n^{-1} \quad \mbox{ and } \quad \frac{\partial \log | K_n | }{\partial \phi} = \mathrm{tr} \left \{ K_n^{-1} \frac{\partial K_n}{\partial \phi} \right\} \]
By the chain rule,
\[ \begin{aligned} \ell'(g) &= - \frac{n}{2} \frac{Y_n^\top \frac{\partial K_n^{-1}}{\partial g} Y_n}{Y_n^\top K_n^{-1} Y_n} - \frac{1}{2} \frac{\partial \log |K_n|}{\partial g} \\ &= \frac{n}{2} \frac{Y_n^\top K_n^{-1} \frac{\partial K_n}{\partial g} K_n^{-1} Y_n}{Y_n^\top K_n^{-1} Y_n} - \frac{1}{2} \mathrm{tr} \left \{ K_n^{-1} \frac{\partial K_n}{\partial g} \right\} \end{aligned} \]
Therefore, since \(\partial K_n / \partial g = \mathbb{I}_n\), we have
\[ \ell'(g) = \frac{n}{2} \frac{Y_n^\top (K_n^{-1})^{2} Y_n}{Y_n^\top K_n^{-1} Y_n} - \frac{1}{2} \mathrm{tr} \left \{ K_n^{-1} \right\}. \]
Here is an implementation of the gradient of our (negative) log likelihood in R.
gnlg <- function(g, D, Y)
{
n <- length(Y)
K <- exp(-D) + diag(g, n)
Ki <- solve(K)
KiY <- Ki %*% Y
dll <- (n/2) * t(KiY) %*% KiY / (t(Y) %*% KiY) - (1/2) * sum(diag(Ki))
return(-dll)
}
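Before using it, a quick finite-difference spot check (an optional aside, not in the original) confirms that gnlg agrees with a numerical derivative of nlg at an arbitrary nugget value.
g0 <- 0.1; h <- 1e-6
c(analytic=drop(gnlg(g0, D, y)),
finite.diff=drop(nlg(g0 + h, D, y) - nlg(g0, D, y))/h) ## should nearly match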
Let’s throw that into optim and see what happens. We use method="L-BFGS-B" because it supports derivatives and bounds.
out <- optim(0.1*var(y), nlg, gnlg, method="L-BFGS-B",
lower=eps, upper=var(y), D=D, Y=y)
c(g, out$par)
## [1] 0.1028227 0.1028132
Essentially the same estimate as the optimize output. How many iterations?
out$counts
## function gradient
## 14 14
Fewer than optimize’s 19.
How about the rate of decay of correlation in terms of distance?
Consider the following generalization of the covariance function.
\[ C_\theta(x, x') = \exp\left\{ - \frac{||x - x'||^2}{\theta} \right\}. \]
This (hyper-)parameterized family of correlation functions is called the isotropic Gaussian family.
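To get a feel for \(\theta\), here is the correlation between outputs whose inputs are one unit apart, for a few lengthscales (a quick illustration, not in the original): small \(\theta\) gives wiggly functions, large \(\theta\) gives slowly-varying ones.
thetas <- c(0.1, 1, 10)
round(exp(-1/thetas), 3) ## correlation at unit distance grows with theta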
This is no different than our inference for \(g\): concentrate out \(\hat{\tau}^2\) and optimize over \(\theta\) and \(g\) jointly.
Let’s first go the brute-force route for maximizing the likelihood.
counter <- 0
nl <- function(par, D, Y)
{
theta <- par[1] ## change 1
g <- par[2]
n <- length(Y)
K <- exp(-D/theta) + diag(g, n) ## change 2
Ki <- solve(K)
ldetK <- determinant(K, logarithm=TRUE)$modulus
ll <- - (n/2) * log(t(Y) %*% Ki %*% Y) - (1/2) * ldetK
counter <<- counter + 1
return(-ll)
}
That’s ready to hand off to optim. For fun, let’s switch back to our 2d example.
library(lhs)
X <- randomLHS(40, 2)
X <- rbind(X,X)
X[,1] <- (X[,1] - 0.5)*6 + 1
X[,2] <- (X[,2] - 0.5)*6 + 1
y <- X[,1] * exp(-X[,1]^2 - X[,2]^2) + rnorm(nrow(X), sd=0.01)
Estimating a lengthscale and the nugget is an attempt at resolving a tension between signal and noise.
It helps to think a little about starting values and search ranges.
D <- distance(X)
out <- optim(c(0.1, 0.1*var(y)), nl, method="L-BFGS-B",
lower=eps, upper=c(10, var(y)), D=D, Y=y)
out$par
## [1] 1.064695805 0.008877664
Since "L-BFGS-B" is calculating a gradient numerically, the reported count of evaluations in the output doesn’t match the number of actual evaluations:
print(brute <- c(out$counts, actual=counter))
## function gradient actual
## 24 24 120
Re-building the data quantities.
K <- exp(- D/out$par[1]) + diag(out$par[2], nrow(X))
Ki <- solve(K)
tau2hat <- drop(t(y) %*% Ki %*% y / nrow(X))
And then the predictive quantities.
xx <- seq(-2,4, length=40)
XX <- expand.grid(xx, xx)
DXX <- distance(XX)
KXX <- exp(-DXX/out$par[1]) + diag(out$par[2], ncol(DXX))
DX <- distance(XX, X)
KX <- exp(-DX/out$par[1])
Pretty much the same as before.
mup <- KX %*% Ki %*% y; Sigmap <- tau2hat * (KXX - KX %*% Ki %*% t(KX))
sdp <- sqrt(diag(Sigmap))
par(mfrow=c(1,2))
image(xx,xx, matrix(mup, ncol=length(xx)), xlab="x1",ylab="x2", col=cols)
points(X[,1], X[,2])
image(xx,xx, matrix(sdp, ncol=length(xx)), xlab="x1",ylab="x2", col=cols)
points(X[,1], X[,2])
What if we take derivatives with respect to \(\theta\), and combine them with those for \(g\) to form a gradient?
We’ll need \(\dot{K}_n \equiv \frac{\partial K_n}{\partial \theta}\).
Since \(\frac{\partial}{\partial \theta} \exp\{-d/\theta\} = \exp\{-d/\theta\}\, d/\theta^2\), and the nugget doesn’t depend on \(\theta\), we have \(\dot{K}_n = K_n \cdot \mathrm{Dist}_n/\theta^2\), where the product is component-wise (Hadamard) and \(\mathrm{Dist}_n\) is the matrix of squared Euclidean distances (whose zero diagonal knocks out the nugget’s contribution).
\[ \ell'(\theta) = \frac{n}{2} \frac{Y_n^\top K_n^{-1} \dot{K}_n K_n^{-1} Y_n}{Y_n^\top K_n^{-1} Y_n} - \frac{1}{2} \mathrm{tr} \left \{ K_n^{-1} \dot{K}_n \right\} \]
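A finite-difference check of that \(\dot{K}_n\) expression on toy inputs (an optional aside with made-up xt and th0) can catch algebra slips before they end up in the gradient code.
xt <- matrix(c(0.2, 1.1, 2.3), ncol=1)
Dt <- distance(xt); th0 <- 0.7; h <- 1e-6
dotK.analytic <- exp(-Dt/th0) * Dt / th0^2 ## K * Dist / theta^2 (no nugget)
dotK.fd <- (exp(-Dt/(th0 + h)) - exp(-Dt/th0)) / h
max(abs(dotK.analytic - dotK.fd)) ## should be tiny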
gradnl <- function(par, D, Y)
{
theta <- par[1]; g <- par[2]
n <- length(Y)
K <- exp(-D/theta) + diag(g, n)
Ki <- solve(K)
dotK <- K * D / theta^2
KiY <- Ki %*% Y
## for theta then g
dlltheta <- (n/2) * t(KiY) %*% dotK %*% KiY / (t(Y) %*% KiY) -
(1/2) * sum(diag(Ki %*% dotK))
dllg <- (n/2) * t(KiY) %*% KiY / (t(Y) %*% KiY) - (1/2) * sum(diag(Ki))
return(-c(dlltheta, dllg))
}
How does it work?
counter <- 0
outg <- optim(c(0.1, 0.1*var(y)), nl, gradnl, method="L-BFGS-B",
lower=eps, upper=c(10, var(y)), D=D, Y=y)
rbind(grad=outg$par, brute=out$par)
## [,1] [,2]
## grad 1.066092 0.008854119
## brute 1.064696 0.008877664
What about number of evaluations?
rbind(grad=c(outg$counts, actual=counter), brute)
## function gradient actual
## grad 12 12 12
## brute 24 24 120
Let’s expand the dimension a bit, and get ambitious. Visualization will be hard, but we have other (relative) progress metrics.
Consider the so-called Friedman function from the MARS paper.
fried <- function (n=50, p=6)
{
if(p < 5) stop("must have at least 5 cols")
X <- matrix(runif(n * p), nrow = n)
Ytrue <- 10*sin(pi*X[,1]*X[,2]) + 20*(X[,3]-0.5)^2 + 10*X[,4] + 5*X[,5]
Y <- Ytrue + rnorm(n, 0, 1)
return(data.frame(X, Y, Ytrue))
}
Let’s generate training and testing sets.
n <- 200; p <- 7
train <- fried(n, p); test <- fried(1000, p)
X <- as.matrix(train[,1:p]); XX <- as.matrix(test[,1:p])
y <- drop(train$Y); yy <- drop(test$Y); yytrue <- drop(test$Ytrue)
Let’s learn the isotropic Gaussian lengthscale \(\theta\), and the nugget \(g\).
D <- distance(X)
print(out <- optim(c(0.1, 0.1*var(y)), nl, gradnl, method="L-BFGS-B",
lower=eps, upper=c(10, var(y)), D=D, Y=y))
## $par
## [1] 2.004663156 0.005202063
##
## $value
## [1] 668.8463
##
## $counts
## function gradient
## 30 30
##
## $convergence
## [1] 0
##
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
Re-building the data quantities.
K <- exp(- D/out$par[1]) + diag(out$par[2], nrow(D))
Ki <- solve(K)
tau2hat <- drop(t(y) %*% Ki %*% y / nrow(D))
And then the predictive quantities.
DXX <- distance(XX)
KXX <- exp(-DXX/out$par[1]) + diag(out$par[2], ncol(DXX))
DX <- distance(XX, X)
KX <- exp(-DX/out$par[1])
Predicting …
mup <- KX %*% Ki %*% y
Sigmap <- tau2hat * (KXX - KX %*% Ki %*% t(KX))
How about MARS?
library(mda)
fit.mars <- mars(X, y)
p.mars <- predict(fit.mars, XX)
Which wins on RMSE to the truth?
rmse <- c(gp=sqrt(mean((yytrue-mup)^2)),mars=sqrt(mean((yytrue-p.mars)^2)))
rmse
## gp mars
## 1.106706 1.412837
How can we improve upon our GP results?
Is it reasonable for correlation to decay uniformly in each input direction?
How about the following generalization?
\[ C_\theta(x, x') = \exp\left\{ - \sum_{k=1}^m \frac{(x_k - x'_k)^2}{\theta_k} \right\} \]
This correlation function is called the separable or anisotropic Gaussian.
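A quick numerical illustration with two made-up points shows how per-coordinate lengthscales work: with a very large \(\theta_2\), the second coordinate barely influences the correlation.
x1 <- c(0.2, 0.5); x2 <- c(0.9, 0.1)
theta.aniso <- c(0.5, 50) ## second coordinate nearly switched off
exp(-sum((x1 - x2)^2 / theta.aniso)) ## approx exp(-(0.9 - 0.2)^2 / 0.5)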
How should we do inference for such a vectorized parameter?
Simple; just expand the log likelihood and derivative functions to work with a vectorized \(\theta\).
A for loop in the gradient function can iterate over coordinates. Each coordinate has a different \(\theta_k\), so pre-computing a single distance matrix isn’t as helpful; instead we’ll use the covar.sep function from the plgp package, which takes d \(= \theta\) and g arguments. Before going derivative crazy, let’s focus on the likelihood.
nlsep <- function(par, X, Y)
{
theta <- par[1:ncol(X)]
g <- par[ncol(X)+1]
n <- length(Y)
K <- covar.sep(X, d=theta, g=g)
Ki <- solve(K)
ldetK <- determinant(K, logarithm=TRUE)$modulus
ll <- - (n/2) * log(t(Y) %*% Ki %*% Y) - (1/2) * ldetK
counter <<- counter + 1
return(-ll)
}
Here we go.
tic <- proc.time()[3]; counter <- 0
out <- optim(c(rep(0.1, ncol(X)), 0.1*var(y)), nlsep, method="L-BFGS-B", X=X, Y=y,
lower=eps, upper=c(rep(10, ncol(X)), var(y)))
out$par
## [1] 1.040053979 1.436516616 3.700269057 5.831383028 5.991777483 5.817489045
## [7] 5.609181014 0.005047608
And how about the number of evaluations?
brute <- c(out$counts, actual=counter, time=proc.time()[3]-tic)
brute
## function gradient actual time.elapsed
## 82.000 82.000 1394.000 12.257
gradnlsep <- function(par, X, Y)
{
theta <- par[1:ncol(X)]
g <- par[ncol(X)+1]
n <- length(Y)
K <- covar.sep(X, d=theta, g=g)
Ki <- solve(K)
KiY <- Ki %*% Y
## loop over theta components
dlltheta <- rep(NA, length(theta))
for(k in 1:length(dlltheta)) {
dotK <- K * distance(X[,k])/(theta[k]^2)
dlltheta[k] <- (n/2) * t(KiY) %*% dotK %*% KiY / (t(Y) %*% KiY) -
(1/2) * sum(diag(Ki %*% dotK))
}
## for g
dllg <- (n/2) * t(KiY) %*% KiY / (t(Y) %*% KiY) - (1/2) * sum(diag(Ki))
return(-c(dlltheta, dllg))
}
Now with closed form gradients.
tic <- proc.time()[3]; counter <- 0
outg <- optim(c(rep(0.1, ncol(X)), 0.1*var(y)), nlsep, gradnlsep,
method="L-BFGS-B", lower=eps, upper=c(rep(10, ncol(X)), var(y)), X=X, Y=y)
round(rbind(grad=outg$par, brute=out$par), 5)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## grad 1.13209 1.04383 1.87069 10.00000 10.00000 10.00000 10.00000 0.00673
## brute 1.04005 1.43652 3.70027 5.83138 5.99178 5.81749 5.60918 0.00505
What about number of evaluations?
rbind(grad=c(outg$counts, actual=counter, time=proc.time()[3]-tic), brute)
## function gradient actual time.elapsed
## grad 105 105 105 5.424
## brute 82 82 1394 12.257
How does the separable GP compare against the isotropic one and MARS?
K <- covar.sep(X, d=out$par[1:ncol(X)], g=out$par[ncol(X)+1]); Ki <- solve(K)
tau2hat <- drop(t(y) %*% Ki %*% y / nrow(X))
KXX <- covar.sep(XX, d=out$par[1:ncol(X)], g=out$par[ncol(X)+1])
KX <- covar.sep(XX, X, d=out$par[1:ncol(X)], g=0)
mup <- KX %*% Ki %*% y
Sigmap <- tau2hat * (KXX - KX %*% Ki %*% t(KX))
Evaluating by RMSE.
rmse <- c(rmse, gpsep=sqrt(mean((yytrue - mup)^2)))
rmse
## gp mars gpsep
## 1.1067063 1.4128369 0.8422309
library(laGP)
tic <- proc.time()[3]
gpi <- newGPsep(X, y, d=0.1, g=0.1*var(y), dK=TRUE)
## the MLE calculation is (Bayes) integrated rather than concentrated
mle <- mleGPsep(gpi, param="both", tmin=c(eps, eps), tmax=c(10, var(y)))
elapsed <- as.numeric(proc.time()[3] - tic)
p <- predGPsep(gpi, XX)
deleteGPsep(gpi)
rmse <- c(rmse, bobby=sqrt(mean((yytrue - p$mean)^2)))
rmse
## gp mars gpsep bobby
## 1.1067063 1.4128369 0.8422309 0.7740061
laGP’s optimized C backend is the fastest.
elapsed
## [1] 0.999