Instructions

This homework is due on Tuesday, October 31st at 12:30pm (the start of class). Please turn in all your work. The primary purpose of this homework is to hone debugging and profiling skills and explore parallelization with Monte Carlo experiments. This description may change at any time; however, notices about substantial changes (requiring more/less work) will be additionally noted on the class web page. Note that there are two prongs to submission, via Canvas and Bitbucket (in asc-repo/hwk/hw4). You don’t need to use Rmarkdown, but your work should be just as pretty if you want full marks.

Problem 1: Profiling (15 pts)

Below are two very compact versions of functions for calculating powers of a matrix.

powers3 <- function(x, dg) outer(x, 1:dg, "^")
powers4 <- function(x, dg) 
 {
   repx <- matrix(rep(x, dg), nrow=length(x))
   return(t(apply(repx, 1, cumprod)))
 }
  1. First, briefly explain the thought process behind these two new methods.

The powers3 function calculates the outer “product” of x with the Y argument taking on the desired powers. The word “product” is in quotes because in this application the FUN argument is not the default *, but rather the power operator ^. One reason we can expect this method to be slow is that it doesn’t re-use work done on lower powers when calculating the higher ones.
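As a tiny sanity check (x.small is a made-up toy vector, not part of the assignment), outer evaluates ^ on every pairing of an element of x with a power:

```r
## each row is one element of x.small raised to powers 1:3
x.small <- c(2, 3)
outer(x.small, 1:3, "^")
##      [,1] [,2] [,3]
## [1,]    2    4    8
## [2,]    3    9   27
```

Note that the entry for 2^3 is computed from scratch; nothing from the 2^2 column is re-used.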

The powers4 function is a more compact version of our powers2 from earlier, which is reproduced below. Rather than using a for loop, it uses apply (which we know is slower). However, it does utilize a vectorized subroutine, cumprod, so perhaps that will help “make up” some of the time lost to apply.
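The idea can be seen on a single row of repx: with toy values x = 2 and dg = 4, cumprod builds each power from the previous one rather than starting over:

```r
## one row of repx holds dg copies of a single x value; cumprod
## turns them into successive powers, re-using each prior product
cumprod(rep(2, 4))
## [1]  2  4  8 16
```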

  2. Then provide a summary of the computation time for all four versions (two from lecture and the two above). Use the same x from lecture.
x <- runif(10000000)

First, let’s paste the two earlier versions in so we can compare those as well.

powers1 <- function(x, dg)
  {
    pw <- matrix(x, nrow=length(x))
    prod <- x
    for(i in 2:dg) {
      prod <- prod * x
      pw <- cbind(pw, prod)
    }
    return(pw)
  }

powers2 <- function(x, dg)
  {
    pw <- matrix(nrow=length(x), ncol=dg)
    prod <- x
    pw[,1] <- prod
    for(i in 2:dg) {
      prod <- prod * x
      pw[,i] <- prod
    }
    return(pw)
  }

Now consider the following summary of execution times.

c(p1=system.time(p1 <- powers1(x, 16))[3],
    p2=system.time(p2 <- powers2(x, 16))[3],
    p3=system.time(p3 <- powers3(x, 16))[3],
    p4=system.time(p4 <- powers4(x, 16))[3])
## p1.elapsed p2.elapsed p3.elapsed p4.elapsed 
##      2.892      1.125      8.412     30.899
  3. Profile the code to explain why the two new versions disappoint relative to the original two. Cite particular subroutines which cause the slowdowns with reference to the profile summaries. Are these subroutines creating memory or computational bottlenecks, or both?

First consider version 3, with memory profiling turned on.

Rprof(memory.profiling=TRUE)
p3 <- powers3(x, 16)
Rprof(NULL)
Rp3 <- summaryRprof(memory="both")
Rp3
## $by.self
##         self.time self.pct total.time total.pct mem.total
## "outer"       8.4      100        8.4       100    3051.7
## 
## $by.total
##                       total.time total.pct mem.total self.time self.pct
## "outer"                      8.4       100    3051.7       8.4      100
## "block_exec"                 8.4       100    3051.7       0.0        0
## "call_block"                 8.4       100    3051.7       0.0        0
## "eval"                       8.4       100    3051.7       0.0        0
## "evaluate"                   8.4       100    3051.7       0.0        0
## "evaluate_call"              8.4       100    3051.7       0.0        0
## "handle"                     8.4       100    3051.7       0.0        0
## "in_dir"                     8.4       100    3051.7       0.0        0
## "knitr::knit"                8.4       100    3051.7       0.0        0
## "powers3"                    8.4       100    3051.7       0.0        0
## "process_file"               8.4       100    3051.7       0.0        0
## "process_group"              8.4       100    3051.7       0.0        0
## "process_group.block"        8.4       100    3051.7       0.0        0
## "rmarkdown::render"          8.4       100    3051.7       0.0        0
## "timing_fn"                  8.4       100    3051.7       0.0        0
## "withCallingHandlers"        8.4       100    3051.7       0.0        0
## "withVisible"                8.4       100    3051.7       0.0        0
## 
## $sample.interval
## [1] 0.02
## 
## $sampling.time
## [1] 8.4

This is a particularly boring case. All of the work is happening inside the outer command, and can’t be attributed to any of its subroutines. Notice that there are some extra calls to knitr and rmarkdown (and their subroutines) which wouldn’t be present in a typical R session. We can conclude that our instinct above, that work from lower-power calculations is not being re-used when calculating the higher ones, is what makes powers3 sluggish.

Now consider version 4.

Rprof(memory.profiling=TRUE)
p4 <- powers4(x, 16)
Rprof(NULL)
Rp4 <- summaryRprof(memory="both")
Rp4
## $by.self
##                 self.time self.pct total.time total.pct mem.total
## "apply"             21.96    75.26      26.18     89.72   12024.8
## "t.default"          2.48     8.50       2.48      8.50    1220.7
## "FUN"                2.06     7.06       2.06      7.06    1230.7
## "unlist"             1.12     3.84       1.12      3.84    1220.7
## "aperm.default"      0.58     1.99       0.58      1.99    1220.7
## "matrix"             0.52     1.78       0.52      1.78    1220.7
## "array"              0.24     0.82       0.24      0.82    1220.7
## "lengths"            0.22     0.75       0.22      0.75      43.8
## 
## $by.total
##                       total.time total.pct mem.total self.time self.pct
## "block_exec"               29.18    100.00   14466.2      0.00     0.00
## "call_block"               29.18    100.00   14466.2      0.00     0.00
## "eval"                     29.18    100.00   14466.2      0.00     0.00
## "evaluate"                 29.18    100.00   14466.2      0.00     0.00
## "evaluate_call"            29.18    100.00   14466.2      0.00     0.00
## "handle"                   29.18    100.00   14466.2      0.00     0.00
## "in_dir"                   29.18    100.00   14466.2      0.00     0.00
## "knitr::knit"              29.18    100.00   14466.2      0.00     0.00
## "powers4"                  29.18    100.00   14466.2      0.00     0.00
## "process_file"             29.18    100.00   14466.2      0.00     0.00
## "process_group"            29.18    100.00   14466.2      0.00     0.00
## "process_group.block"      29.18    100.00   14466.2      0.00     0.00
## "rmarkdown::render"        29.18    100.00   14466.2      0.00     0.00
## "timing_fn"                29.18    100.00   14466.2      0.00     0.00
## "withCallingHandlers"      29.18    100.00   14466.2      0.00     0.00
## "withVisible"              29.18    100.00   14466.2      0.00     0.00
## "t"                        28.66     98.22   13245.5      0.00     0.00
## "apply"                    26.18     89.72   12024.8     21.96    75.26
## "t.default"                 2.48      8.50    1220.7      2.48     8.50
## "FUN"                       2.06      7.06    1230.7      2.06     7.06
## "unlist"                    1.12      3.84    1220.7      1.12     3.84
## "aperm.default"             0.58      1.99    1220.7      0.58     1.99
## "aperm"                     0.58      1.99    1220.7      0.00     0.00
## "matrix"                    0.52      1.78    1220.7      0.52     1.78
## "array"                     0.24      0.82    1220.7      0.24     0.82
## "lengths"                   0.22      0.75      43.8      0.22     0.75
## 
## $sample.interval
## [1] 0.02
## 
## $sampling.time
## [1] 29.18

This is rather more interesting. Ignoring the rmarkdown and other knitr-related commands, clearly apply is creating a bottleneck: it is high on both the $by.self and $by.total lists. It is also surprising how much time is spent transposing: more than two seconds. That is probably not just due to the final t() command in the code above; there must be some transposes being called under the hood as well.
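Part of the transposition cost is baked into how apply works: results computed row-by-row come back as columns, which is why powers4 needs the explicit t() in the first place. A minimal illustration (m is a toy matrix, not from the homework):

```r
m <- matrix(1:6, nrow=2)    # 2 x 3 toy matrix
a <- apply(m, 1, cumprod)   # row-wise results are returned as columns ...
dim(a)                      # ... so a is 3 x 2
## [1] 3 2
dim(t(a))                   # the explicit t() restores the 2 x 3 shape
## [1] 2 3
```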

The take-home message is that apply makes for clean code, but that could come at the expense of big slowdowns.

Problem 2: Annie & Sam (10 pts)

How would you adjust the code for the Annie & Sam example(s) to accommodate other distributions? E.g., if \(S \sim \mathcal{N}(10.5, 1)\) and \(A\sim\mathcal{N}(11, 1.5)\)?

The code below summarizes what we had in lecture, based on uniform distributions.

n <- 1000
S <- runif(n, 10, 11.5)
A <- runif(n, 10.5, 12)

Then, using the normal approximation to the binomial, we may obtain the following estimate and interval for the probability that Annie arrives before Sam.

prob <- mean(A < S)
se <- sqrt(prob*(1-prob)/n)
c(lower=prob-1.96*se, est=prob, upper=prob+1.96*se)
##     lower       est     upper 
## 0.1876234 0.2130000 0.2383766

Similarly for the difference in arrival times …

diff <- A - S
d <- c(mean(diff), mean(abs(diff)))
se <- c(sd(diff), sd(abs(diff)))/sqrt(n)
c(lower=d[1] - 1.96*se[1], est=d[1], upper=d[1] + 1.96*se[1])
##     lower       est     upper 
## 0.4807015 0.5187167 0.5567319
c(lower=d[2] -1*1.96*se[2], est=d[2], upper=d[2] + 1.96*se[2])
##     lower       est     upper 
## 0.6315966 0.6599678 0.6883390

The only modifications required in order to accommodate the Gaussian distributions, compared to runif, involve the following.

S <- rnorm(n, 10.5, 1)
A <- rnorm(n, 11, 1.5)

Then, cut-and-pasting from above gives the following estimates and intervals.

prob <- mean(A < S)
se <- sqrt(prob*(1-prob)/n)
results <- c(lower=prob-1.96*se, est=prob, upper=prob+1.96*se)
diff <- A - S
d <- c(mean(diff), mean(abs(diff)))
se <- c(sd(diff), sd(abs(diff)))/sqrt(n)
rbind(prob=results, md=c(lower=d[1] - 1.96*se[1], est=d[1], upper=d[1] + 1.96*se[1]),
    mad=c(lower=d[2] -1*1.96*se[2], est=d[2], upper=d[2] + 1.96*se[2]))
##          lower       est     upper
## prob 0.3430260 0.3730000 0.4029740
## md   0.4256972 0.5342263 0.6427553
## mad  1.3912087 1.4596419 1.5280751

In fact, we could probably write a simple function which allows any distribution for arrival times for Annie and Sam.

anniesam <- function(n, rA, rS)
 {
    ## simulate
    S <- rS(n)
    A <- rA(n)

    ## estimate
    prob <- mean(A < S)
    se <- sqrt(prob*(1-prob)/n)
    results <- c(lower=prob-1.96*se, est=prob, upper=prob+1.96*se)
    diff <- A - S
    d <- c(mean(diff), mean(abs(diff)))
    se <- c(sd(diff), sd(abs(diff)))/sqrt(n)
    results <- rbind(prob=results, 
        md=c(lower=d[1] - 1.96*se[1], est=d[1], upper=d[1] + 1.96*se[1]),
        mad=c(lower=d[2] -1*1.96*se[2], est=d[2], upper=d[2] + 1.96*se[2]))

    ## report results
    return(results)
 }

Now let’s try it with our two distributions. First, uniform:

rS <- rA <- runif
formals(rS)$min <- 10; formals(rS)$max <- 11.5
formals(rA)$min <- 10.5; formals(rA)$max <- 12
anniesam(n, rA, rS)
##          lower       est     upper
## prob 0.1914514 0.2170000 0.2425486
## md   0.4686009 0.5054547 0.5423084
## mad  0.6151313 0.6425700 0.6700086

Then Gaussian:

rS <- rA <- rnorm
formals(rS)$mean <- 10.5; formals(rS)$sd <- 1
formals(rA)$mean <- 11; formals(rA)$sd <- 1.5
anniesam(n, rA, rS)
##          lower       est     upper
## prob 0.3449938 0.3750000 0.4050062
## md   0.4869160 0.5997114 0.7125068
## mad  1.4492789 1.5214223 1.5935658
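An alternative to manipulating formals, arguably more readable, is to wrap the samplers in small closures that hard-code the parameters. A sketch of the same idea, using the uniform case:

```r
## closures fix the distribution parameters without touching formals
rS <- function(n) runif(n, 10, 11.5)   # Sam ~ U(10, 11.5)
rA <- function(n) runif(n, 10.5, 12)   # Annie ~ U(10.5, 12)
## anniesam(n, rA, rS) can then be called exactly as before
```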

Problem 3: Bootstrap with boot (15 pts)

Re-write the least-squares regression bootstrap from lecture using the boot library.

First we need some data. The code below duplicates the data-generating mechanism from lecture.

n <- 200
X <- 1/cbind(rt(n, df=1), rt(n, df=1), rt(n,df=1))
beta <- c(1,2,3,0)
Y <- beta[1] + X %*% beta[-1] + rnorm(n, sd=3)
fit <- lm(Y~X)

Now, the boot function needs (at the very least) a data argument and a statistic argument. For the data argument, let’s put together a matrix where the last column is Y and the first columns are X.

data <- cbind(X,Y)

Now, for the statistic, we simply need to “wrap” lm in a function that takes a (data) argument and an index argument.

stat <- function(data, index)
 {
    Y <- data[index,ncol(data)]
    X <- data[index,1:(ncol(data)-1)]
    coef(lm(Y~X))
 }

Ok, let’s try it.

library(boot)
beta.hat.boot <- boot(data, stat, R=1000)$t

The visualization code below is identical to what we used in lecture.

par(mfrow=c(2,2))
for(i in 1:4) {
  m <- coef(fit)[i]
  s <- sqrt(cov(beta.hat.boot)[i,i])
  x <- m + seq(-4*s, 4*s, length=1000)
  xlim <- range(c(x, beta.hat.boot[,i]))
  hist(beta.hat.boot[,i], freq=FALSE, xlim=xlim,
       xlab=paste("beta_", i, sep=""), main="posterior", breaks=20)
  lines(x, dnorm(x, m, s), col=2, lwd=2)
}

The code effort (using boot as compared to our own “by hand” version) is very similar in my opinion. The extra scaffolding that building the stat function implies is comparable to the burden of writing your own for loop. However, the boot library does offer some nice auxiliary features which would be hard to duplicate “by hand”, such as the potential for parallelization.
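For instance, under the (platform-dependent) assumption that forking is available, the same stat can be bootstrapped on several cores just by flipping boot’s parallel switches. The data here is a stand-in, regenerated to keep the snippet self-contained:

```r
library(boot)

## stand-in data mimicking the problem above: three predictors plus response
n <- 200
X <- 1/cbind(rt(n, df=1), rt(n, df=1), rt(n, df=1))
Y <- 1 + X %*% c(2, 3, 0) + rnorm(n, sd=3)
data <- cbind(X, Y)

stat <- function(data, index)
 {
    Y <- data[index, ncol(data)]
    X <- data[index, 1:(ncol(data)-1)]
    coef(lm(Y~X))
 }

## "multicore" forks on Unix-alikes (use parallel="snow" on Windows);
## boot falls back to serial execution where forking is unavailable
beta.hat.boot <- boot(data, stat, R=1000, parallel="multicore", ncpus=2)$t
dim(beta.hat.boot)   # R rows, one column per coefficient
## [1] 1000    4
```

No change to stat or data is needed; the parallelization is entirely boot’s concern.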

Problem 4: Bootstrapped splines (15 pts)

Design a bootstrap to assess the predictive uncertainty in our fits from slides 49–51 from stats.pdf. Rather than specifying df = 11, use the CV option to fit the degrees of freedom. You may code the routine by hand, or within the boot library. Provide a visualization of the bootstrapped average predictive mean and central 90% quantiles.

First let’s duplicate the data generation code from class to set up the problem.

X <- seq(0,10,length=50)
Ytrue <- (sin(pi*X/5) + 0.2*cos(4*pi*X/5))
Y <- Ytrue + rnorm(length(Ytrue), sd=0.1)

The code below visualizes the fit from class (using df=11), overlaying the truth.

plot(X, Y, main="Smooth Spline Fit and Truth")
XX <- seq(0,10,length=199)
YYtrue <- (sin(pi*XX/5) + 0.2*cos(4*pi*XX/5))
lines(XX, YYtrue, col=1, type="l")
fit1 <- smooth.spline(X, Y, df=11)
YY1 <- predict(fit1, XX)$y
lines(XX, YY1, col=2, lty=2, lwd=2)
legend("topright", c("truth", "df=11"), col=c(1,2), lty=c(1,2))

Now consider the following bootstrapped version, using CV to infer the degrees of freedom.

B <- 1000
YYpred.boot <- matrix(NA, B, length(XX))
df <- rep(NA, B)
for(b in 1:B) {
  indices <- sample(1:length(X), length(X), replace=TRUE)
  suppressWarnings(fit <- smooth.spline(X[indices], Y[indices], cv=TRUE))
  df[b] <- fit$df
  YYpred.boot[b,] <- predict(fit, XX)$y
}

The plot below shows the bootstrapped predictive paths as gray lines, whose distribution is summarized by a solid black (mean) line and dashed red 90% quantiles.

matplot(XX, t(YYpred.boot), type="l", col="gray",
        main="Bootstrapped Smoothing Splines", xlab="x", ylab="y")
points(X, Y)
m <- apply(YYpred.boot, 2, mean)
q1 <- apply(YYpred.boot, 2, quantile, p=0.05)
q2 <- apply(YYpred.boot, 2, quantile, p=0.95)
lines(XX, m, lwd=2)
lines(XX, q1, col=2, lty=2)
lines(XX, q2, col=2, lty=2)
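The loop above also saved the CV-selected degrees of freedom in df, which is worth a quick summary since it shows how much the chosen smoothness varies across resamples. The snippet below regenerates a small standalone version (with an arbitrary seed, and only 100 resamples to keep it fast) rather than relying on the objects above:

```r
set.seed(99)  # arbitrary seed, for reproducibility only
X <- seq(0, 10, length=50)
Y <- sin(pi*X/5) + 0.2*cos(4*pi*X/5) + rnorm(length(X), sd=0.1)
df <- rep(NA, 100)
for(b in 1:100) {
  indices <- sample(1:length(X), length(X), replace=TRUE)
  suppressWarnings(fit <- smooth.spline(X[indices], Y[indices], cv=TRUE))
  df[b] <- fit$df
}
summary(df)   # spread of the CV-chosen degrees of freedom
```

A wide spread here would suggest that much of the variability in the gray paths comes from smoothness selection itself, not just resampling noise.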