Department of Statistics, Virginia Tech

This homework is **due on Tuesday, October 31st at 12:30pm** (the start of class). Please turn in all your work. The primary purpose of this homework is to hone debugging and profiling skills and to explore parallelization with Monte Carlo experiments. This description may change at any time; however, notices about substantial changes (requiring more/less work) will additionally be noted on the class web page. Note that there are two prongs to submission: via Canvas and Bitbucket (in `asc-repo/hwk/hw4`). You don't need to use `Rmarkdown`, but your work should be just as pretty if you want full marks.

*Below are two, very compact, versions of functions for calculating powers of a matrix.*

```
powers3 <- function(x, dg) outer(x, 1:dg, "^")

powers4 <- function(x, dg)
{
  repx <- matrix(rep(x, dg), nrow=length(x))
  return(t(apply(repx, 1, cumprod)))
}
```

*First, briefly explain the thought process behind these two new methods.*

The `powers3` function calculates the outer "product" of `x` with the `Y`-argument taking on the desired powers. The word "product" is put in quotes because, in this application, the `FUN` argument is not the default `*`; rather it is the power operator `^`. One reason we can expect this method to be slow is that it doesn't re-use work done with lower powers when calculating the higher ones.
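
As a tiny illustration (not from the lecture code), here is what `outer` with `"^"` produces for a short vector:

```
## each element of x raised to each power in 1:3
outer(c(2, 3), 1:3, "^")
##      [,1] [,2] [,3]
## [1,]    2    4    8
## [2,]    3    9   27
```

Each column is computed with a fresh `^` call, so the `[,2]` column does not benefit from the work already done for `[,1]`.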

The `powers4` function is a more compact version of our `powers2` version from earlier, which is quoted below. Rather than using a `for` loop, it uses `apply` (which we know is slower). However, it does utilize a vectorized subroutine, `cumprod`, so perhaps that will help "make up" some of the time lost to `apply`.
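
To see why `cumprod` fits here, note that the cumulative products of a repeated value are exactly its successive powers:

```
cumprod(rep(2, 4))  ## 2 4 8 16, i.e., 2^1 through 2^4
```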

*Then provide a summary of the computation time for all four versions (two from lecture and the two above). Use the same `x` from lecture.*

`x <- runif(10000000)`

First, let's paste in the two earlier versions so we can compare those as well.

```
powers1 <- function(x, dg)
{
  pw <- matrix(x, nrow=length(x))
  prod <- x
  for(i in 2:dg) {
    prod <- prod * x
    pw <- cbind(pw, prod)
  }
  return(pw)
}

powers2 <- function(x, dg)
{
  pw <- matrix(nrow=length(x), ncol=dg)
  prod <- x
  pw[,1] <- prod
  for(i in 2:dg) {
    prod <- prod * x
    pw[,i] <- prod
  }
  return(pw)
}
```

Now consider the following summary of execution times.

```
c(p1=system.time(p1 <- powers1(x, 16))[3],
  p2=system.time(p2 <- powers2(x, 16))[3],
  p3=system.time(p3 <- powers3(x, 16))[3],
  p4=system.time(p4 <- powers4(x, 16))[3])
```

```
## p1.elapsed p2.elapsed p3.elapsed p4.elapsed
## 2.892 1.125 8.412 30.899
```

- Woah, the fourth one is particularly bad! I guess having a bit of vectorization doesn't help in the face of a slow `apply`. But let's profile to get more information.

*Profile the code to explain why the two new versions disappoint relative to the original two. Cite particular subroutines which cause the slowdowns with reference to the profile summaries. Are these subroutines creating memory or computational bottlenecks, or both?*

First consider version 3, with memory profiling turned on.

```
Rprof(memory.profiling=TRUE)
p3 <- powers3(x, 16)
Rprof(NULL)
Rp3 <- summaryRprof(memory="both")
Rp3
```

```
## $by.self
## self.time self.pct total.time total.pct mem.total
## "outer" 8.4 100 8.4 100 3051.7
##
## $by.total
## total.time total.pct mem.total self.time self.pct
## "outer" 8.4 100 3051.7 8.4 100
## "block_exec" 8.4 100 3051.7 0.0 0
## "call_block" 8.4 100 3051.7 0.0 0
## "eval" 8.4 100 3051.7 0.0 0
## "evaluate" 8.4 100 3051.7 0.0 0
## "evaluate_call" 8.4 100 3051.7 0.0 0
## "handle" 8.4 100 3051.7 0.0 0
## "in_dir" 8.4 100 3051.7 0.0 0
## "knitr::knit" 8.4 100 3051.7 0.0 0
## "powers3" 8.4 100 3051.7 0.0 0
## "process_file" 8.4 100 3051.7 0.0 0
## "process_group" 8.4 100 3051.7 0.0 0
## "process_group.block" 8.4 100 3051.7 0.0 0
## "rmarkdown::render" 8.4 100 3051.7 0.0 0
## "timing_fn" 8.4 100 3051.7 0.0 0
## "withCallingHandlers" 8.4 100 3051.7 0.0 0
## "withVisible" 8.4 100 3051.7 0.0 0
##
## $sample.interval
## [1] 0.02
##
## $sampling.time
## [1] 8.4
```

This is a particularly boring case. All of the work is happening inside the `outer` command, and can't be attributed to any of its subroutines. Notice that there are some extra calls to `knitr` and `rmarkdown` (and their subroutines) which wouldn't be present in a typical R session. We can conclude that our instinct above, that work from lower-power calculations is not being re-used when calculating the higher powers, is what causes `powers3` to be sluggish.

- I.e., it is much faster to perform one more multiplication to get \(3^3\) from \(3^2\) than it is to recalculate it from scratch.
- Moreover, a single `^` call is much slower than a single (or even a small number of) `*` call(s), since products are implemented natively by fast machine instructions.
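
This is easy to check directly. The sketch below (an addition, not part of the original solution; results will vary by machine) times one vectorized `^` call against building the same sixteenth power via fifteen `*` multiplications:

```
## compare one vectorized "^" call against 15 repeated "*" multiplications
x <- runif(10000000)
t.pow  <- system.time(p <- x^16)[3]
t.mult <- system.time({q <- x; for(i in 2:16) q <- q * x})[3]
c(pow=t.pow, mult=t.mult)
```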

Now consider version 4.

```
Rprof(memory.profiling=TRUE)
p4 <- powers4(x, 16)
Rprof(NULL)
Rp4 <- summaryRprof(memory="both")
Rp4
```

```
## $by.self
## self.time self.pct total.time total.pct mem.total
## "apply" 21.96 75.26 26.18 89.72 12024.8
## "t.default" 2.48 8.50 2.48 8.50 1220.7
## "FUN" 2.06 7.06 2.06 7.06 1230.7
## "unlist" 1.12 3.84 1.12 3.84 1220.7
## "aperm.default" 0.58 1.99 0.58 1.99 1220.7
## "matrix" 0.52 1.78 0.52 1.78 1220.7
## "array" 0.24 0.82 0.24 0.82 1220.7
## "lengths" 0.22 0.75 0.22 0.75 43.8
##
## $by.total
## total.time total.pct mem.total self.time self.pct
## "block_exec" 29.18 100.00 14466.2 0.00 0.00
## "call_block" 29.18 100.00 14466.2 0.00 0.00
## "eval" 29.18 100.00 14466.2 0.00 0.00
## "evaluate" 29.18 100.00 14466.2 0.00 0.00
## "evaluate_call" 29.18 100.00 14466.2 0.00 0.00
## "handle" 29.18 100.00 14466.2 0.00 0.00
## "in_dir" 29.18 100.00 14466.2 0.00 0.00
## "knitr::knit" 29.18 100.00 14466.2 0.00 0.00
## "powers4" 29.18 100.00 14466.2 0.00 0.00
## "process_file" 29.18 100.00 14466.2 0.00 0.00
## "process_group" 29.18 100.00 14466.2 0.00 0.00
## "process_group.block" 29.18 100.00 14466.2 0.00 0.00
## "rmarkdown::render" 29.18 100.00 14466.2 0.00 0.00
## "timing_fn" 29.18 100.00 14466.2 0.00 0.00
## "withCallingHandlers" 29.18 100.00 14466.2 0.00 0.00
## "withVisible" 29.18 100.00 14466.2 0.00 0.00
## "t" 28.66 98.22 13245.5 0.00 0.00
## "apply" 26.18 89.72 12024.8 21.96 75.26
## "t.default" 2.48 8.50 1220.7 2.48 8.50
## "FUN" 2.06 7.06 1230.7 2.06 7.06
## "unlist" 1.12 3.84 1220.7 1.12 3.84
## "aperm.default" 0.58 1.99 1220.7 0.58 1.99
## "aperm" 0.58 1.99 1220.7 0.00 0.00
## "matrix" 0.52 1.78 1220.7 0.52 1.78
## "array" 0.24 0.82 1220.7 0.24 0.82
## "lengths" 0.22 0.75 43.8 0.22 0.75
##
## $sample.interval
## [1] 0.02
##
## $sampling.time
## [1] 29.18
```

This is rather more interesting. Ignoring the `rmarkdown` and other such commands, clearly `apply` is creating a bottleneck. It is high on both the `$by.self` and `$by.total` lists. It is also surprising how much time is spent transposing: more than two seconds. That is probably not just due to the final `t()` command in the code above; there must be some transposes being called under the hood as well.

- However, the real story has to do with memory. This version uses an order of magnitude more memory than `powers3`, for example. Working with memory is much slower than computing.
- Notice that `cumprod` doesn't even show up! Actually calculating the products takes no time at all.
- All of the time is spent (inefficiently) navigating the loop and memory management within the `apply` statement.

The take-home message is that `apply` makes for clean code, but that can come at the expense of big slowdowns.

*How would you adjust the code for the Annie & Sam example(s) to accommodate other distributions? E.g., if \(S \sim \mathcal{N}(10.5, 1)\) and \(A \sim \mathcal{N}(11, 1.5)\)?*

The code below summarizes what we had in lecture, based on uniform distributions.

```
n <- 1000
S <- runif(n, 10, 11.5)
A <- runif(n, 10.5, 12)
```

Then, using the normal approximation to the binomial, we may obtain the following estimate and interval for the probability that Annie arrives before Sam.

```
prob <- mean(A < S)
se <- sqrt(prob*(1-prob)/n)
c(lower=prob-1.96*se, est=prob, upper=prob+1.96*se)
```

```
## lower est upper
## 0.1876234 0.2130000 0.2383766
```

Similarly for the difference in arrival times …

```
diff <- A - S
d <- c(mean(diff), mean(abs(diff)))
se <- c(sd(diff), sd(abs(diff)))/sqrt(n)
```

- In terms of mean differences:

`c(lower=d[1] - 1.96*se[1], est=d[1], upper=d[1] + 1.96*se[1])`

```
## lower est upper
## 0.4807015 0.5187167 0.5567319
```

- and in terms of mean absolute differences:

`c(lower=d[2] -1*1.96*se[2], est=d[2], upper=d[2] + 1.96*se[2])`

```
## lower est upper
## 0.6315966 0.6599678 0.6883390
```

The only modifications required to accommodate the Gaussian distributions, compared to `runif`, are the following.

```
S <- rnorm(n, 10.5, 1)
A <- rnorm(n, 11, 1.5)
```

Then, cut-and-pasting from above gives the following estimates and intervals.

```
prob <- mean(A < S)
se <- sqrt(prob*(1-prob)/n)
results <- c(lower=prob-1.96*se, est=prob, upper=prob+1.96*se)
diff <- A - S
d <- c(mean(diff), mean(abs(diff)))
se <- c(sd(diff), sd(abs(diff)))/sqrt(n)
rbind(prob=results,
      md=c(lower=d[1] - 1.96*se[1], est=d[1], upper=d[1] + 1.96*se[1]),
      mad=c(lower=d[2] - 1.96*se[2], est=d[2], upper=d[2] + 1.96*se[2]))
```

```
## lower est upper
## prob 0.3430260 0.3730000 0.4029740
## md 0.4256972 0.5342263 0.6427553
## mad 1.3912087 1.4596419 1.5280751
```

In fact, we could probably write a simple function which allows any distribution for arrival times for Annie and Sam.

```
anniesam <- function(n, rA, rS)
{
  ## simulate
  S <- rA(n)
  A <- rS(n)
  ## estimate
  prob <- mean(A < S)
  se <- sqrt(prob*(1-prob)/n)
  results <- c(lower=prob-1.96*se, est=prob, upper=prob+1.96*se)
  diff <- A - S
  d <- c(mean(diff), mean(abs(diff)))
  se <- c(sd(diff), sd(abs(diff)))/sqrt(n)
  results <- rbind(prob=results,
                   md=c(lower=d[1] - 1.96*se[1], est=d[1], upper=d[1] + 1.96*se[1]),
                   mad=c(lower=d[2] - 1.96*se[2], est=d[2], upper=d[2] + 1.96*se[2]))
  ## report results
  return(results)
}
```

Now let's try it with our two distributions. First uniform:

```
rS <- rA <- runif
formals(rA)$min <- 10; formals(rA)$max <- 11.5
formals(rS)$min <- 10.5; formals(rS)$max <- 12
anniesam(n, rA, rS)
```

```
## lower est upper
## prob 0.1914514 0.2170000 0.2425486
## md 0.4686009 0.5054547 0.5423084
## mad 0.6151313 0.6425700 0.6700086
```

Then Gaussian:

```
rS <- rA <- rnorm
formals(rA)$mean <- 10.5; formals(rA)$sd <- 1
formals(rS)$mean <- 11; formals(rS)$sd <- 1.5
anniesam(n, rA, rS)
```

```
## lower est upper
## prob 0.3449938 0.3750000 0.4050062
## md 0.4869160 0.5997114 0.7125068
## mad 1.4492789 1.5214223 1.5935658
```

- Bingo.

**`boot` (15 pts)**

*Re-write the least-squares regression bootstrap from lecture using the `boot` library. Briefly compare and contrast to the results we obtained in lecture.*

First we need some data. The code below duplicates the data-generating mechanism from lecture.

```
n <- 200
X <- 1/cbind(rt(n, df=1), rt(n, df=1), rt(n, df=1))
beta <- c(1, 2, 3, 0)
Y <- beta[1] + X %*% beta[-1] + rnorm(n, sd=3)
fit <- lm(Y ~ X)
```

Now, the `boot` function needs (at the very least) a `data` argument and a `statistic` argument. For the `data` argument, let's put together a `matrix` where the last column is `Y` and the first columns are `X`.

`data <- cbind(X,Y)`

Now, for the `statistic`, we simply need to "wrap" `lm` in a function that takes a `data` argument and an `index` argument.

```
stat <- function(data, index)
{
  Y <- data[index, ncol(data)]
  X <- data[index, 1:(ncol(data)-1)]
  coef(lm(Y ~ X))
}
```

OK, let's try it.

```
library(boot)
beta.hat.boot <- boot(data, stat, R=1000)$t
```

The visualization code below is identical to what we used in lecture.

```
par(mfrow=c(2,2))
for(i in 1:4) {
  m <- coef(fit)[i]
  s <- sqrt(cov(beta.hat.boot)[i,i])
  x <- m + seq(-4*s, 4*s, length=1000)
  xlim <- range(c(x, beta.hat.boot[,i]))
  hist(beta.hat.boot[,i], freq=FALSE, xlim=xlim,
       xlab=paste("beta_", i, sep=""), main="posterior", breaks=20)
  lines(x, dnorm(x, m, s), col=2, lwd=2)
}
```

- These results are very similar to the plots we obtained from that code in lecture, at least up to Monte Carlo error.

The code effort (using `boot` as compared to our own "by hand" version) is very similar, in my opinion. The extra scaffolding implied by building the `stat` function is comparable to the burden of writing your own `for` loop. However, the `boot` library does offer some nice auxiliary features which would be hard to duplicate "by hand", such as the potential for parallelization.
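
For instance, parallelization only requires extra arguments to `boot` (a sketch, not part of the original solution; `ncpus=4` is an arbitrary choice, and `"multicore"` is only available on Unix-alikes):

```
library(boot)
## same bootstrap as above, but with resampling farmed out to 4 cores
beta.hat.boot.par <- boot(data, stat, R=1000, parallel="multicore", ncpus=4)$t
```

The library also provides `boot.ci` for computing fancier confidence intervals (percentile, BCa, etc.) from a fitted `boot` object.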

*Design a bootstrap to assess the predictive uncertainty in our fits from slides 49–51 from stats.pdf. Rather than specifying df = 11, use the CV option to fit the degrees of freedom. You may code the routine by hand, or within the boot library. Provide a visualization of the bootstrapped average predictive mean and central 90% quantiles.*

First, let's duplicate the data generation code from class to set up the problem.

```
X <- seq(0,10,length=50)
Ytrue <- (sin(pi*X/5) + 0.2*cos(4*pi*X/5))
Y <- Ytrue + rnorm(length(Ytrue), sd=0.1)
```

The code below visualizes the fit from class (using `df=11`), overlaying the truth.

```
plot(X, Y, main="Smooth Spline Fit and Truth")
XX <- seq(0, 10, length=199)
YYtrue <- (sin(pi*XX/5) + 0.2*cos(4*pi*XX/5))
lines(XX, YYtrue, col=1)
fit1 <- smooth.spline(X, Y, df=11)
YY1 <- predict(fit1, XX)$y
lines(XX, YY1, col=2, lty=2, lwd=2)
legend("topright", c("truth", "df=11"), col=c(1,2), lty=c(1,2))
```

*Now consider the following bootstrapped version, using CV to infer the degrees of freedom.*

Using `cv=TRUE`, which results in leave-one-out cross validation, can be problematic when there are duplicated rows in the data. The package reports a warning message, but does a sensible thing (leaving out the whole group of duplicates), so we can trust the answers it provides. To avoid seeing all the warning messages, the implementation below uses `suppressWarnings`. The code also saves the `df` from each bootstrap iteration.

```
B <- 1000
YYpred.boot <- matrix(NA, B, length(XX))
df <- rep(NA, B)
for(b in 1:B) {
  indices <- sample(1:length(X), length(X), replace=TRUE)
  suppressWarnings(fit <- smooth.spline(X[indices], Y[indices], cv=TRUE))
  df[b] <- fit$df
  YYpred.boot[b,] <- predict(fit, XX)$y
}
```
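
Since the `df` values are saved, it is worth a quick look at how much the CV-chosen degrees of freedom vary across resamples (an extra check, not required by the prompt; no particular output is claimed here):

```
## spread of the CV-selected degrees of freedom over the B resamples
summary(df)
hist(df, main="CV-chosen degrees of freedom", xlab="df")
```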

The plot below shows the bootstrapped predictive paths as gray lines, whose distribution is summarized by a solid black (mean) line and dashed red 90% quantiles.

```
matplot(XX, t(YYpred.boot), type="l", col="gray",
        main="Bootstrapped Smoothing Splines", xlab="x", ylab="y")
points(X, Y)
m <- apply(YYpred.boot, 2, mean)
q1 <- apply(YYpred.boot, 2, quantile, p=0.05)
q2 <- apply(YYpred.boot, 2, quantile, p=0.95)
lines(XX, m, lwd=2)
lines(XX, q1, col=2, lty=2)
lines(XX, q2, col=2, lty=2)
```