---
title: "Homework 4 Solutions"
subtitle: "Advanced Statistical Computing (STAT 6984)"
author: "Robert B. Gramacy, Department of Statistics, Virginia Tech"
output: html_document
---

## Instructions

This homework is **due on Tuesday, October 31st at 12:30pm** (the start of class). Please turn in all your work. The primary purpose of this homework is to hone debugging and profiling skills and to explore parallelization with Monte Carlo experiments. This description may change at any time; however, notices about substantial changes (requiring more/less work) will additionally be noted on the class web page. Note that there are two prongs to submission, via Canvas and Bitbucket (in asc-repo/hwk/hw4). You don't need to use Rmarkdown, but your work should be just as pretty if you want full marks.

## Problem 1: Profiling (15 pts)

*Below are two, very compact, versions of functions for calculating powers of a matrix.*

```{r}
powers3 <- function(x, dg) outer(x, 1:dg, "^")

powers4 <- function(x, dg) {
  repx <- matrix(rep(x, dg), nrow=length(x))
  return(t(apply(repx, 1, cumprod)))
}
```

a. *First, briefly explain the thought process behind these two new methods.*

The powers3 function calculates the outer "product" of x with the `Y` argument taking on the desired powers. The word "product" is put in quotes because in this application the `FUN` argument is not the default `*`; rather, it is the power operator `^`. One reason we can expect this method to be slow is that it doesn't re-use work done on lower powers when calculating the higher ones.

The powers4 function is a more compact version of our powers2 from earlier, which is quoted below. Rather than using a for loop, it uses apply (which we know is slower). However, it does utilize a vectorized subroutine, cumprod, so perhaps that will help "make up" some of the time lost to apply.

b. *Then provide a summary of the computation time for all four versions (two from lecture and the two above). Use the same x from lecture.*

```{r}
x <- runif(10000000)
```

First, let's paste in the two earlier versions so we can compare those as well.
```{r}
powers1 <- function(x, dg) {
  pw <- matrix(x, nrow=length(x))
  prod <- x
  for(i in 2:dg) {
    prod <- prod * x
    pw <- cbind(pw, prod)
  }
  return(pw)
}

powers2 <- function(x, dg) {
  pw <- matrix(nrow=length(x), ncol=dg)
  prod <- x
  pw[,1] <- prod
  for(i in 2:dg) {
    prod <- prod * x
    pw[,i] <- prod
  }
  return(pw) ## without this, the for loop would cause an invisible NULL return
}
```

Now consider the following summary of execution times.

```{r cache=TRUE}
c(p1=system.time(p1 <- powers1(x, 16))[3],
  p2=system.time(p2 <- powers2(x, 16))[3],
  p3=system.time(p3 <- powers3(x, 16))[3],
  p4=system.time(p4 <- powers4(x, 16))[3])
```

- Whoa, the fourth one is particularly bad! I guess having a bit of vectorization doesn't help in the face of a slow apply. But let's profile to get more information.

c. *Profile the code to explain why the two new versions disappoint relative to the original two. Cite particular subroutines which cause the slowdowns with reference to the profile summaries. Are these subroutines creating memory or computational bottlenecks, or both?*

First consider version 3, with memory profiling turned on.

```{r cache=TRUE}
Rprof(memory.profiling=TRUE)
p3 <- powers3(x, 16)
Rprof(NULL)
Rp3 <- summaryRprof(memory="both")
Rp3
```

This is a particularly boring case. All of the work is happening inside the outer command, and can't be attributed to any of its subroutines. Notice that there are some extra calls to knitr and rmarkdown (and their subroutines) which wouldn't be present in a typical R session. We can conclude that our instinct above, that work from lower-power calculations is not being re-used when calculating the higher ones, is what makes powers3 sluggish.

- I.e., it is much faster to perform one more multiplication to get $3^3$ from $3^2$ than it is to re-calculate it from scratch.
- Moreover, a single `^` call is much slower than a single (or even a small number of) `*` call(s), as products are implemented natively by fast machine instructions.

Now consider version 4.
```{r cache=TRUE}
Rprof(memory.profiling=TRUE)
p4 <- powers4(x, 16)
Rprof(NULL)
Rp4 <- summaryRprof(memory="both")
Rp4
```

This is rather more interesting. Ignoring the rmarkdown and other commands, clearly apply is creating a bottleneck. It is high on both the `$by.self` and `$by.total` lists. It is also surprising how much time is spent transposing: more than two seconds. That is probably not just down to the final t() command in the code above; there must be some transposes being called under the hood as well.

- However, the real story has to do with memory. This version uses an order of magnitude more memory than powers3, for example, and working with memory is much slower than computing.
- Notice that cumprod doesn't even show up! Actually calculating the products takes almost no time at all.
- All of the time is spent (inefficiently) navigating the loop and managing memory within the apply statement.

The take-home message is that apply makes for clean code, but that can come at the expense of big slowdowns.

## Problem 2: Annie & Sam (10 pts)

*How would you adjust the code for the Annie \& Sam example(s) to accommodate other distributions? E.g., if $S \sim \mathcal{N}(10.5, 1)$ and $A \sim \mathcal{N}(11, 1.5)$?*

The code below summarizes what we had in lecture, based on uniform distributions.

```{r}
n <- 1000
S <- runif(n, 10, 11.5)
A <- runif(n, 10.5, 12)
```

Then, using the normal approximation to the binomial, we may obtain the following estimate and interval for the probability that Annie arrives before Sam.

```{r}
prob <- mean(A < S)
se <- sqrt(prob*(1-prob)/n)
c(lower=prob-1.96*se, est=prob, upper=prob+1.96*se)
```

Similarly for the difference in arrival times ...
```{r}
diff <- A - S
d <- c(mean(diff), mean(abs(diff)))
se <- c(sd(diff), sd(abs(diff)))/sqrt(n)
```

- In terms of mean differences:

```{r}
c(lower=d[1] - 1.96*se[1], est=d[1], upper=d[1] + 1.96*se[1])
```

- and in terms of mean absolute differences:

```{r}
c(lower=d[2] - 1.96*se[2], est=d[2], upper=d[2] + 1.96*se[2])
```

The only modifications required to accommodate the Gaussian distributions, compared to runif, involve the following.

```{r}
S <- rnorm(n, 10.5, 1)
A <- rnorm(n, 11, 1.5)
```

Then cutting-and-pasting from above gives the following estimates and intervals.

```{r}
prob <- mean(A < S)
se <- sqrt(prob*(1-prob)/n)
results <- c(lower=prob-1.96*se, est=prob, upper=prob+1.96*se)
diff <- A - S
d <- c(mean(diff), mean(abs(diff)))
se <- c(sd(diff), sd(abs(diff)))/sqrt(n)
rbind(prob=results,
  md=c(lower=d[1] - 1.96*se[1], est=d[1], upper=d[1] + 1.96*se[1]),
  mad=c(lower=d[2] - 1.96*se[2], est=d[2], upper=d[2] + 1.96*se[2]))
```

In fact, we can write a simple function which allows any distribution for the arrival times of Annie and Sam.

```{r}
anniesam <- function(n, rA, rS) {
  ## simulate
  A <- rA(n)
  S <- rS(n)

  ## estimate
  prob <- mean(A < S)
  se <- sqrt(prob*(1-prob)/n)
  results <- c(lower=prob-1.96*se, est=prob, upper=prob+1.96*se)
  diff <- A - S
  d <- c(mean(diff), mean(abs(diff)))
  se <- c(sd(diff), sd(abs(diff)))/sqrt(n)
  results <- rbind(prob=results,
    md=c(lower=d[1] - 1.96*se[1], est=d[1], upper=d[1] + 1.96*se[1]),
    mad=c(lower=d[2] - 1.96*se[2], est=d[2], upper=d[2] + 1.96*se[2]))

  ## report results
  return(results)
}
```

Now let's try it with our two distributions. First uniform:

```{r}
rS <- rA <- runif
formals(rA)$min <- 10.5; formals(rA)$max <- 12
formals(rS)$min <- 10; formals(rS)$max <- 11.5
anniesam(n, rA, rS)
```

Then Gaussian:

```{r}
rS <- rA <- rnorm
formals(rA)$mean <- 11; formals(rA)$sd <- 1.5
formals(rS)$mean <- 10.5; formals(rS)$sd <- 1
anniesam(n, rA, rS)
```

- Bingo.
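As a quick sanity check on the Monte Carlo error in these interval estimates, the width of the binomial-approximation interval should shrink like $1/\sqrt{n}$. The sketch below illustrates this using the Gaussian arrival model from above; the particular sample sizes (and the seed) are arbitrary choices for illustration, and the loop variable is named nn so as not to clobber the n used elsewhere.

```{r}
set.seed(1) ## for reproducibility of this illustration only
for(nn in c(1e3, 1e4, 1e5)) {
  S <- rnorm(nn, 10.5, 1)
  A <- rnorm(nn, 11, 1.5)
  prob <- mean(A < S)
  se <- sqrt(prob*(1-prob)/nn)
  cat("n=", nn, " est=", round(prob, 3), " +/- ", round(1.96*se, 3), "\n", sep="")
}
```

Each ten-fold increase in the Monte Carlo size narrows the interval by about a factor of $\sqrt{10} \approx 3.2$.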
## Problem 3: Bootstrap with boot (15 pts)

*Re-write the least-squares regression bootstrap from lecture using the boot library.*

- *Briefly compare and contrast to the results we obtained in lecture.*

First we need some data. The code below duplicates the data-generating mechanism from lecture.

```{r}
n <- 200
X <- 1/cbind(rt(n, df=1), rt(n, df=1), rt(n, df=1))
beta <- c(1,2,3,0)
Y <- beta[1] + X %*% beta[-1] + rnorm(n, sd=3)
fit <- lm(Y~X)
```

Now, the boot function needs (at the very least) a data argument and a statistic argument. For the data argument, let's put together a matrix where the last column is Y and the first columns are X.

```{r}
data <- cbind(X,Y)
```

Now, for the statistic, we simply need to "wrap" lm in a function that takes a data argument and an index argument.

```{r}
stat <- function(data, index) {
  Y <- data[index,ncol(data)]
  X <- data[index,1:(ncol(data)-1)]
  coef(lm(Y~X))
}
```

Ok, let's try it. Note that the matrix of bootstrapped coefficients lives in the `$t` slot of boot's output.

```{r, cache=TRUE}
library(boot)
beta.hat.boot <- boot(data, stat, R=1000)$t
```

The visualization code below is identical to what we used in lecture.

```{r, dev.args = list(bg = 'transparent'), fig.width=8, fig.height=7, fig.align="center"}
par(mfrow=c(2,2))
for(i in 1:4) {
  m <- coef(fit)[i]
  s <- sqrt(cov(beta.hat.boot)[i,i])
  x <- m + seq(-4*s, 4*s, length=1000)
  xlim <- range(c(x, beta.hat.boot[,i]))
  hist(beta.hat.boot[,i], freq=FALSE, xlim=xlim,
    xlab=paste("beta_", i, sep=""), main="posterior", breaks=20)
  lines(x, dnorm(x, m, s), col=2, lwd=2)
}
```

- These results are very similar to the plots we obtained from that code in lecture, at least up to Monte Carlo error. The coding effort (using boot as compared to our own "by hand" version) is very similar in my opinion: the extra scaffolding that building the stat function implies is comparable to the burden of writing your own for loop. However, the boot library does offer some nice auxiliary features which would be hard to duplicate "by hand", such as the potential for parallelization.
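To illustrate that parallelization feature, here is a minimal, self-contained sketch using boot's `parallel` and `ncpus` arguments. The toy regression data, the seed, and the two-worker choice are illustrative assumptions, not part of the lecture setup; `parallel="snow"` spins up a socket cluster internally, which also works on Windows.

```{r}
library(boot)

## toy data: a simple linear regression (illustrative only, not the lecture's)
set.seed(42)
nn <- 100
x1 <- rnorm(nn)
y1 <- 1 + 2*x1 + rnorm(nn)
dat <- cbind(x1, y1)

## statistic in the same (data, index) form as stat above
lmstat <- function(data, index)
  coef(lm(data[index,2] ~ data[index,1]))

## R=200 resamples, farmed out over two socket workers
b <- boot(dat, lmstat, R=200, parallel="snow", ncpus=2)
colMeans(b$t) ## bootstrap means, near the truth c(1, 2)
```

On Unix-alikes, `parallel="multicore"` forks instead of launching sockets and avoids the cluster start-up cost.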
## Problem 4: Bootstrapped splines (15 pts)

*Design a bootstrap to assess the predictive uncertainty in our smooth.spline() fits from slides 49--51 of [stats.pdf](stats.pdf). Rather than specifying df = 11, use the CV option to fit the degrees of freedom. You may code the routine by hand, or within the boot library. Provide a visualization of the bootstrapped average predictive mean and central 90% quantiles.*

First let's duplicate the data generation code from class to set up the problem.

```{r}
X <- seq(0,10,length=50)
Ytrue <- (sin(pi*X/5) + 0.2*cos(4*pi*X/5))
Y <- Ytrue + rnorm(length(Ytrue), sd=0.1)
```

The code below visualizes the fit from class (using df=11), overlaying the truth.

```{r, dev.args = list(bg = 'transparent'), fig.width=6, fig.height=5, fig.align="center"}
plot(X, Y, main="Smooth Spline Fit and Truth")
XX <- seq(0,10,length=199)
YYtrue <- (sin(pi*XX/5) + 0.2*cos(4*pi*XX/5))
lines(XX, YYtrue, col=1, type="l")
fit1 <- smooth.spline(X, Y, df=11)
YY1 <- predict(fit1, XX)$y
lines(XX, YY1, col=2, lty=2, lwd=2)
legend("topright", c("truth", "df=11"), col=c(1,2), lty=c(1,2))
```

Now consider the following bootstrapped version, using CV to infer the degrees of freedom.

- Using cv=TRUE, which results in leave-one-out cross validation, can be problematic when there are duplicated rows in the data. The package reports a warning message, but does a sensible thing (leaving out the whole group of duplicates), so we can trust the answers it provides. To avoid seeing all the warning messages, the implementation below uses suppressWarnings.
- The code also saves the df from each bootstrap iteration.

```{r, cache=TRUE, message=FALSE}
B <- 1000
YYpred.boot <- matrix(NA, B, length(XX))
df <- rep(NA, B)
for(b in 1:B) {
  indices <- sample(1:length(X), length(X), replace=TRUE)
  suppressWarnings(fit <- smooth.spline(X[indices], Y[indices], cv=TRUE))
  df[b] <- fit$df
  YYpred.boot[b,] <- predict(fit, XX)$y
}
```

The plot below shows the bootstrapped predictive paths as gray lines, whose distribution is summarized by a solid black (mean) line and dashed-red 90% quantiles.

```{r, dev.args = list(bg = 'transparent'), fig.width=6, fig.height=5, fig.align="center"}
matplot(XX, t(YYpred.boot), type="l", col="gray",
  main="Bootstrapped Smoothing Splines", xlab="x", ylab="y")
points(X, Y)
m <- apply(YYpred.boot, 2, mean)
q1 <- apply(YYpred.boot, 2, quantile, p=0.05)
q2 <- apply(YYpred.boot, 2, quantile, p=0.95)
lines(XX, m, lwd=2)
lines(XX, q1, col=2, lty=2)
lines(XX, q2, col=2, lty=2)
```

- Notice how the dashed red lines capture most of the points,
- and how the predictive uncertainty is greatest at the edges of the input space.

The plot below shows the bootstrap distribution of the CV-estimated degrees of freedom, df.

```{r, dev.args = list(bg = 'transparent'), fig.width=6, fig.height=5, fig.align="center"}
hist(df, main="bootstrapped degrees of freedom")
```

- Notice how df = 11 has high support under this distribution, but it is not the modal value, which is closer to df = 12 or 13.
- Low values, df < 8 or so, do not have high support, but high ones (df > 30) do, favoring smoothness.

## Problem 5: Spam MC shell script (15 pts)

*Design a shell script spam_mc.sh which automates a crop of batch Monte Carlo instances for our spam "bakeoff", which are to be run in parallel. It must take an integer argument specifying how many parallel instances to create.*
*For example, make it so one can execute*

```
./spam_mc.sh 16
```

*to get 16 batches in parallel.*

You may find the script [spam_mc.sh](spam_mc.sh) linked from the course web page, and pasted below for convenience. See the in-line comments for an explanation of each chunk.

```{r}
lines <- readLines("spam_mc.sh")
cat(paste(lines, collapse="\n"))
```

- Notice that this version takes two arguments, the second being an optional reps setting, with a default of five. This will be useful later when we transport this to ARC.

## Problem 6: Spam MC "bakeoff" in parallel (30 pts)

*Re-write the spam MC "bakeoff" from lecture with sockets (i.e., via the parallel package) rather than via "batch mode".*

- *One thing this means is that you need to collate the list of outputs clusterApply provides; i.e., not by reading `*.Rdata` files.*

Please see my [spam_snow.R](spam_snow.R), linked from the class web page. The highlights are provided below.

First, the function spamfold below encapsulates the (previously coded) "bakeoff", with the intention that it will be called (separately) for each fold. This might be a little different than what you did, possibly parallelizing only over repetitions. My version, as will be described in more detail below, combines repetitions and folds.
```{r}
spamfold <- function(fold, spam) {
  library(textir)
  library(rpart)
  library(mda)
  library(MASS)
  library(randomForest)

  ## training and testing set
  test <- spam[fold,]
  train <- spam[-fold,]

  ## get rid of columns that are all the same
  one <- rep(1, nrow(train))
  zo <- which(sqrt(drop(one %*% (train^2))) == 0)
  if(length(zo) > 0) { train <- train[,-zo]; test <- test[,-zo] }
  nt <- nrow(train)

  ## defining the universe of variables
  null <- glm(spam ~ 1, data=train, family="binomial")
  pnull <- round(predict(null, newdata=test, type="response"))
  hnull <- pnull == test$spam

  ## the full glm
  full <- suppressWarnings(glm(spam ~ ., data=train, family="binomial")) ## will warn
  pfull <- round(predict(full, newdata=test, type="response"))
  hfull <- pfull == test$spam

  ## go forward to get the best model greedily
  fwd <- suppressWarnings(step(null, scope=formula(full), k=log(nt), trace=0))
  pfwd <- round(predict(fwd, newdata=test, type="response"))
  hfwd <- pfwd == test$spam

  ## now try adding in interactions; this takes a long time
  ## (about an hour on a fast machine)
  fwdi <- suppressWarnings(step(fwd, scope=~.+.^2, k=log(nt), trace=0))
  pfwdi <- round(predict(fwdi, newdata=test, type="response"))
  hfwdi <- pfwdi == test$spam

  ## LDA
  spam.lda <- lda(spam~., data=train)
  plda <- as.numeric(predict(spam.lda, newdata=test)$class)-1
  hlda <- plda == test$spam

  ## QDA
  hqda <- rep(NA, nrow(test))
  try({ ## necessary because quadratic expansion may not work
    spam.qda <- qda(spam~., data=train)
    pqda <- as.numeric(predict(spam.qda, newdata=test)$class)-1
    hqda <- pqda == test$spam
  }, silent=TRUE)

  ## FDA
  spam.fda <- fda(spam~., data=train)
  pfda <- as.numeric(predict(spam.fda, newdata=test, type="class"))-1
  hfda <- pfda == test$spam

  ## rpart; convert spam to a factor, otherwise rpart will do regression
  train <- transform(train, spam=as.factor(spam))
  spam.rp <- rpart(spam~., data=train)
  prp <- as.numeric(predict(spam.rp, newdata=test, type="class"))-1
  hrp <- prp == test$spam

  ## mnlm
  Xspam <- model.matrix(spam ~ . + .^2, data=train)[,-1]
  spampen <- mnlm(counts=train$spam, covars=Xspam, verb=1, penalty=c(10,1/2))
  XXspam <- model.matrix(spam ~ . + .^2, data=test)[,-1]
  pmnlm <- round(predict(spampen, newdata=XXspam))
  hmnlm <- pmnlm == test$spam

  ## random forests
  rf <- randomForest(spam~., data=train)
  prf <- as.numeric(predict(rf, newdata=test))-1
  hrf <- prf == test$spam

  ## return hit indicators
  return(data.frame(null=hnull, full=hfull, fwd=hfwd, fwdi=hfwdi,
    lda=hlda, qda=hqda, fda=hfda, rp=hrp, mnlm=hmnlm, rf=hrf))
}
```

The next function, spamcv.snow, creates the folds for each repetition, calls clusterApply on each fold (from all repetitions), collates the results into an appropriate data frame of "hitrates", and returns the result.

```{r}
spamcv.snow <- function(cls, spam, folds=5, reps=2) {
  ## create folds for all reps
  all.folds <- list()
  for(r in 1:reps)
    all.folds <- c(all.folds, cv.folds(nrow(spam), folds))

  ## distributed computation for each fold
  hits <- clusterApply(cls, all.folds, spamfold, spam)

  ## allocate space for hitrate calculation
  hitrate <- data.frame(matrix(NA, nrow=reps, ncol=ncol(hits[[1]])))
  colnames(hitrate) <- colnames(hits[[1]])

  ## loop over reps and folds
  i <- 1
  for(r in 1:reps) { ## reps
    dfhit <- matrix(NA, nrow=nrow(spam), ncol=ncol(hits[[1]]))
    for(j in 1:folds) { ## folds
      dfhit[all.folds[[i]],] <- as.matrix(hits[[i]])
      i <- i + 1
    }
    ## aggregate
    hitrate[r,] <- apply(dfhit, 2, mean)
  }

  ## return the collected hitrates
  return(hitrate)
}
```

- The advantage of divvying up over folds *and* repetitions is that there are more tasks to distribute over the parallel instances, which means that fewer cores will be idle at the end (as inevitably, some tasks will take more time than others).

The final step is to read in the data, define cv.folds, establish the cluster, and "go". Below is my code for doing that over eight hyperthreaded cores (i.e., effectively 16). However, the code is not executed in this Rmarkdown document.
Rather, it is called offline using the shell script described below.

```{r, eval=FALSE}
spam <- read.csv("../data/spam.csv")
cv.folds <- function(n, folds=10)
  split(sample(1:n), rep(1:folds, length=n))
library(parallel)
cl <- makeCluster(type="PSOCK", rep("localhost", nth))
hitrate <- spamcv.snow(cl, spam, reps=reps)
stopCluster(cl)
save.image("spam_snow.RData")
```

- Notice that neither the reps nor the nth (number of threads) variable is defined here; both are supplied by the calling script.

The final ingredient is a shell script [spam_snow.sh](spam_snow.sh) that takes integer command-line arguments specifying the number of parallel instances (sockets) to create, the number of CV folds, and the number of CV repetitions. Although you may follow the link above, a readout is provided below.

```{r}
lines <- readLines("spam_snow.sh")
cat(paste(lines, collapse="\n"))
```

## Problem $\star$: Spam summary

*This isn't a real problem. It is just here as a place-holder to say that for Problems 5/6 you must demonstrate, in your PDF on Canvas, that you have been able to collect a substantial number of MC repetitions in order to re-visualize the results from lecture (which are only based on 5 repetitions). I expect at least thirty repetitions, which many would regard as a minimum number in order to "trust" the resulting accuracy distributions.*

I used the [spam_mc_collect.R](spam_mc_collect.R) script linked from the class web page to collate the results from my poor-man's batch parallel execution. I made fifteen parallel instances with two repetitions each, for thirty repetitions in total. Rather than simply source the file, I have copied the code chunks with some explanation below.

First read in the true data and the predictions.

```{r}
spam <- read.csv("../data/spam.csv")
files <- list.files("./", pattern="spam_[0-9]*.RData")
length(files)
```

Then, initialize empty matrices for collecting predicted probabilities, and loop over the files to extract those from each of the comparators.
```{r}
pfulls <- pfwds <- pfwdis <- pldas <- pqdas <- NULL
pnulls <- pfdas <- prps <- pmnlms <- prfs <- pfulls
for (i in 1:length(files)) {
  f <- files[i]
  load(f)
  cat(" ", f, sep="")
  pnulls <- rbind(pnulls, pnull)
  pfulls <- rbind(pfulls, pfull)
  pfwds <- rbind(pfwds, pfwd)
  pfwdis <- rbind(pfwdis, pfwdi)
  pldas <- rbind(pldas, plda)
  pqdas <- rbind(pqdas, pqda)
  pfdas <- rbind(pfdas, pfda)
  prps <- rbind(prps, prp)
  prfs <- rbind(prfs, prf)
}
```

Then, calculate hit rates comparing each of those predictions to the true spam labels.

```{r}
hit <- function(x, y=spam$spam) mean(x == y, na.rm=TRUE)
df <- data.frame(null=apply(pnulls, 1, hit), full=apply(pfulls, 1, hit),
  fwd=apply(pfwds, 1, hit), fwdi=apply(pfwdis, 1, hit),
  lda=apply(pldas, 1, hit), qda=apply(pqdas, 1, hit),
  fda=apply(pfdas, 1, hit), rp=apply(prps, 1, hit),
  rf=apply(prfs, 1, hit))
```

Finally, we can summarize the results. The boxplot below summarizes the distribution of the 30 hit rates obtained for each comparator.

```{r, dev.args = list(bg = 'transparent'), fig.width=6, fig.height=5.5, fig.align="center"}
boxplot(df[,-1], main="spam bakeoff", xlab="comparators", ylab="hit rate")
```

- Looks like randomForest is still the best; qda is the worst. Comparators lda and fda look very similar.

Pairwise $t$-tests can help sort out differences between methods with adjacent rankings. The code below sorts the comparators by mean hit rate, and reports $p$-values from such pairwise $t$-tests.

```{r}
dfmean <- apply(df, 2, mean)
o <- order(dfmean, decreasing=TRUE)
tt <- rep(NA, length(dfmean))
for(i in 1:(length(o)-1)) {
  tto <- t.test(df[,o[i]], df[,o[i+1]], alternative="greater", paired=TRUE)
  tt[o[i]] <- tto$p.value
}
cbind(data.frame(dfmean), data.frame(tt))[o,]
```

- All differences in hit rate are statistically significant except for lda and fda, which we can conclude are equally good (or bad).
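Since several adjacent-pair tests are being run at once, a multiplicity adjustment is a sensible extra check before declaring significance. The snippet below is a self-contained sketch using base R's p.adjust; the p-values here are made up, standing in for the tt vector computed above.

```{r}
## hypothetical p-values standing in for the pairwise-test results above
pvals <- c(0.001, 0.003, 0.04, 0.20)
p.adjust(pvals, method="bonferroni") ## conservative family-wise control
p.adjust(pvals, method="BH") ## false-discovery-rate control, less conservative
```

With only eight adjacent-pair comparisons, even the conservative Bonferroni correction inflates the p-values by at most a factor of eight, so strongly significant differences survive.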