Department of Statistics, Virginia Tech"
output: html_document
---

## Instructions

This homework is **due on Tuesday, October 31st at 12:30pm** (the start of class). Please turn in all your work. The primary purpose of this homework is to hone debugging and profiling skills and explore parallelization with Monte Carlo experiments. This description may change at any time, however notices about substantial changes (requiring more/less work) will be additionally noted on the class web page. Note that there are two prongs to submission, via Canvas and Bitbucket (in `asc-repo/hwk/hw4`). You don't need to use `Rmarkdown` but your work should be just as pretty if you want full marks.

## Problem 1: Profiling (15 pts)

*Below are two, very compact, versions of functions for calculating powers of a matrix.*

```{r}
powers3 <- function(x, dg) outer(x, 1:dg, "^")

powers4 <- function(x, dg) {
  repx <- matrix(rep(x, dg), nrow=length(x))
  return(t(apply(repx, 1, cumprod)))
}
```

a. *First, briefly explain the thought process behind these two new methods.*

The `powers3` function calculates the outer "product" of `x` with the `Y`-argument taking on the desired powers. The word "product" is put in quotes because in this application the `FUN` argument is not the default `*`; rather it is the power operator `^`. One reason we can expect this method to be slow is that it doesn't re-use work done with lower powers when calculating the higher ones.

The `powers4` function is a more compact version of our `powers2` version from earlier, which is quoted below. Rather than using a `for` loop, it uses `apply` (which we know is slower). However, it does utilize a vectorized subroutine, `cumprod`, so perhaps that will help "make up" some of the time lost to `apply`.

b. *Then provide a summary of the computation time for all four versions (two from lecture and the two above).
Use the same `x` from lecture.*

```{r}
x <- runif(10000000)
```

First, let's paste in the two earlier versions so we can compare those as well.

```{r}
powers1 <- function(x, dg) {
  pw <- matrix(x, nrow=length(x))
  prod <- x
  for(i in 2:dg) {
    prod <- prod * x
    pw <- cbind(pw, prod)
  }
  return(pw)
}

powers2 <- function(x, dg) {
  pw <- matrix(nrow=length(x), ncol=dg)
  prod <- x
  pw[,1] <- prod
  for(i in 2:dg) {
    prod <- prod * x
    pw[,i] <- prod
  }
  return(pw)
}
```

Now consider the following summary of execution times.

```{r cache=TRUE}
c(p1=system.time(p1 <- powers1(x, 16))[3],
  p2=system.time(p2 <- powers2(x, 16))[3],
  p3=system.time(p3 <- powers3(x, 16))[3],
  p4=system.time(p4 <- powers4(x, 16))[3])
```

- Whoa, the fourth one is particularly bad! I guess having a bit of vectorization doesn't help in the face of a slow `apply`. But let's profile to get more information.

c. *Profile the code to explain why the two new versions disappoint relative to the original two. Cite particular subroutines which cause the slowdowns with reference to the profile summaries. Are these subroutines creating memory or computational bottlenecks, or both?*

First consider version 3, with memory profiling turned on.

```{r cache=TRUE}
Rprof(memory.profiling=TRUE)
p3 <- powers3(x, 16)
Rprof(NULL)
Rp3 <- summaryRprof(memory="both")
Rp3
```

This is a particularly boring case. All of the work is happening inside the `outer` command, and can't be attributed to any of its subroutines. Notice that there are some extra calls to `knitr` and `rmarkdown` (and its subroutines) which wouldn't be present in a typical R session. We can conclude that our instinct above, that work from lower power calculations is not being re-used when calculating higher powers, is causing `powers3` to be sluggish.

- I.e., it is much faster to perform one more multiplication to get $3^3$ from $3^2$ than it is to recalculate it from scratch.
- Moreover, a single `^` call is much slower than a single (or even a small number of) `*` call(s), as products are implemented natively by fast machine instructions.

Now consider version 4.

```{r cache=TRUE}
Rprof(memory.profiling=TRUE)
p4 <- powers4(x, 16)
Rprof(NULL)
Rp4 <- summaryRprof(memory="both")
Rp4
```

This is rather more interesting. Ignoring the `rmarkdown` and other commands, clearly `apply` is creating a bottleneck. It is high on both the `$by.self` and `$by.total` lists. It is also surprising how much time is spent transposing: more than two seconds. That is probably not just down to the final `t()` command in the code above. There must be some transposes being called under the hood as well.

- However, the real story has to do with memory. This version uses an order of magnitude more memory than `powers3`, for example. Working with memory is much slower than computing.
- Notice that `cumprod` doesn't even show up! Actually calculating the products takes no time at all.
- All of the time is spent (inefficiently) navigating the loop and managing memory within the `apply` statement.

The take-home message is that `apply` makes for clean code, but that can come at the expense of big slowdowns.

## Problem 2: Annie & Sam (10 pts)

*How would you adjust the code for the Annie \& Sam example(s) to accommodate other distributions? E.g., if $S \sim \mathcal{N}(10.5, 1)$ and $A \sim \mathcal{N}(11, 1.5)$?*

The code below summarizes what we had in lecture, based on uniform distributions.

```{r}
n <- 1000
S <- runif(n, 10, 11.5)
A <- runif(n, 10.5, 12)
```

Then, using the normal approximation to the binomial, we may obtain the following estimate and interval for the probability that Annie arrives before Sam.

```{r}
prob <- mean(A < S)
se <- sqrt(prob*(1-prob)/n)
c(lower=prob-1.96*se, est=prob, upper=prob+1.96*se)
```

Similarly for the difference in arrival times ...
```{r}
diff <- A - S
d <- c(mean(diff), mean(abs(diff)))
se <- c(sd(diff), sd(abs(diff)))/sqrt(n)
```

- In terms of mean differences:

```{r}
c(lower=d[1] - 1.96*se[1], est=d[1], upper=d[1] + 1.96*se[1])
```

- and in terms of mean absolute differences:

```{r}
c(lower=d[2] - 1.96*se[2], est=d[2], upper=d[2] + 1.96*se[2])
```

The only modifications required in order to accommodate the Gaussian distributions, compared to `runif`, involve the following.

```{r}
S <- rnorm(n, 10.5, 1)
A <- rnorm(n, 11, 1.5)
```

Then, cutting-and-pasting from above gives the following estimates and intervals.

```{r}
prob <- mean(A < S)
se <- sqrt(prob*(1-prob)/n)
results <- c(lower=prob-1.96*se, est=prob, upper=prob+1.96*se)
diff <- A - S
d <- c(mean(diff), mean(abs(diff)))
se <- c(sd(diff), sd(abs(diff)))/sqrt(n)
rbind(prob=results,
      md=c(lower=d[1] - 1.96*se[1], est=d[1], upper=d[1] + 1.96*se[1]),
      mad=c(lower=d[2] - 1.96*se[2], est=d[2], upper=d[2] + 1.96*se[2]))
```

In fact, we could probably write a simple function which allows any distribution for arrival times for Annie and Sam.

```{r}
anniesam <- function(n, rA, rS) {
  ## simulate
  A <- rA(n)
  S <- rS(n)

  ## estimate the probability that Annie arrives first
  prob <- mean(A < S)
  se <- sqrt(prob*(1-prob)/n)
  results <- c(lower=prob-1.96*se, est=prob, upper=prob+1.96*se)

  ## estimate mean (absolute) differences in arrival times
  diff <- A - S
  d <- c(mean(diff), mean(abs(diff)))
  se <- c(sd(diff), sd(abs(diff)))/sqrt(n)
  results <- rbind(prob=results,
                   md=c(lower=d[1] - 1.96*se[1], est=d[1], upper=d[1] + 1.96*se[1]),
                   mad=c(lower=d[2] - 1.96*se[2], est=d[2], upper=d[2] + 1.96*se[2]))

  ## report results
  return(results)
}
```

Now let's try it with our two distributions. First uniform:

```{r}
rS <- rA <- runif
formals(rA)$min <- 10.5; formals(rA)$max <- 12
formals(rS)$min <- 10; formals(rS)$max <- 11.5
anniesam(n, rA, rS)
```

Then Gaussian:

```{r}
rS <- rA <- rnorm
formals(rA)$mean <- 11; formals(rA)$sd <- 1.5
formals(rS)$mean <- 10.5; formals(rS)$sd <- 1
anniesam(n, rA, rS)
```

- Bingo.
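As a quick sanity check (an addition here, not part of the lecture code), the Gaussian case also admits a closed-form answer to compare against: the difference $A - S$ is itself Gaussian, $A - S \sim \mathcal{N}(0.5, \sqrt{1^2 + 1.5^2})$, so $P(A < S) = P(A - S < 0)$ can be computed exactly with `pnorm`. The variable names `nn`, `S2`, and `A2` below are local choices to avoid clobbering objects used elsewhere.

```{r}
## Closed form: P(A < S) = P(A - S < 0), with A - S ~ N(11 - 10.5, sqrt(1^2 + 1.5^2))
p.exact <- pnorm(0, mean=11 - 10.5, sd=sqrt(1^2 + 1.5^2))
p.exact  ## about 0.39

## a large-n Monte Carlo estimate for comparison
set.seed(1)
nn <- 100000
S2 <- rnorm(nn, 10.5, 1)
A2 <- rnorm(nn, 11, 1.5)
p.mc <- mean(A2 < S2)
c(exact=p.exact, mc=p.mc)
```

The Monte Carlo estimate should land within a couple of standard errors (about $\sqrt{0.39 \cdot 0.61 / 10^5} \approx 0.0015$) of the exact value, which gives some confidence in the `anniesam` machinery above.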
## Problem 3: Bootstrap with `boot` (15 pts)

*Re-write the least-squares regression bootstrap from lecture using the `boot` library.*

- *Briefly compare and contrast to the results we obtained in lecture.*

First we need some data. The code below duplicates the data-generating mechanism from lecture.

```{r}
n <- 200
X <- 1/cbind(rt(n, df=1), rt(n, df=1), rt(n, df=1))
beta <- c(1,2,3,0)
Y <- beta[1] + X %*% beta[-1] + rnorm(n, sd=3)
fit <- lm(Y~X)
```

Now, the `boot` function needs (at the very least) a `data` argument and a `statistic` argument. For the `data` argument, let's put together a `matrix` where the last column is `Y` and the first columns are `X`.

```{r}
data <- cbind(X, Y)
```

Now, for the `statistic`, we simply need to "wrap" `lm` in a function that takes a `data` argument and an `index` argument.

```{r}
stat <- function(data, index) {
  Y <- data[index, ncol(data)]
  X <- data[index, 1:(ncol(data)-1)]
  coef(lm(Y~X))
}
```

OK, let's try it.

```{r, cache=TRUE}
library(boot)
beta.hat.boot <- boot(data, stat, R=1000)$t
```

The visualization code below is identical to what we used in lecture.

```{r, dev.args = list(bg = 'transparent'), fig.width=8, fig.height=7, fig.align="center"}
par(mfrow=c(2,2))
for(i in 1:4) {
  m <- coef(fit)[i]
  s <- sqrt(cov(beta.hat.boot)[i,i])
  x <- m + seq(-4*s, 4*s, length=1000)
  xlim <- range(c(x, beta.hat.boot[,i]))
  hist(beta.hat.boot[,i], freq=FALSE, xlim=xlim,
       xlab=paste("beta_", i, sep=""), main="posterior", breaks=20)
  lines(x, dnorm(x, m, s), col=2, lwd=2)
}
```

- These results are very similar to the plots we obtained from that code in lecture, at least up to Monte Carlo error. The code effort (using `boot` as compared to our own "by hand" version) is very similar in my opinion. The extra scaffolding that building the `stat` function implies is comparable to the burden of writing your own `for` loop.
However, the `boot` library does offer some nice auxiliary features which would be hard to duplicate "by hand", such as the potential for parallelization.

## Problem 4: Bootstrapped splines (15 pts)

*Design a bootstrap to assess the predictive uncertainty in our `smooth.spline()` fits from slides 49--51 from [stats.pdf](stats.pdf). Rather than specifying `df = 11`, use the CV option to fit the degrees of freedom. You may code the routine by hand, or within the `boot` library. Provide a visualization of the bootstrapped average predictive mean and central 90% quantiles.*

First let's duplicate the data generation code from class to set up the problem.

```{r}
X <- seq(0, 10, length=50)
Ytrue <- (sin(pi*X/5) + 0.2*cos(4*pi*X/5))
Y <- Ytrue + rnorm(length(Ytrue), sd=0.1)
```

The code below visualizes the fit from class (using `df=11`), overlaying the truth.

```{r, dev.args = list(bg = 'transparent'), fig.width=6, fig.height=5, fig.align="center"}
plot(X, Y, main="Smooth Spline Fit and Truth")
XX <- seq(0, 10, length=199)
YYtrue <- (sin(pi*XX/5) + 0.2*cos(4*pi*XX/5))
lines(XX, YYtrue, col=1, type="l")
fit1 <- smooth.spline(X, Y, df=11)
YY1 <- as.matrix(predict(fit1, data.frame(X=XX))$y)
lines(XX, YY1, col=2, lty=2, lwd=2)
legend("topright", c("truth", "df=11"), col=c(1,2), lty=c(1,2))
```

*Now consider the following bootstrapped version, using CV to infer the degrees of freedom.*

- *Using `cv=TRUE`, which results in leave-one-out cross validation, can be problematic when there are duplicated rows in the data. The package reports a warning message, but does a sensible thing (leaving out the whole group of duplicates), so we can trust the answers it provides.
To avoid seeing all the warning messages, the implementation below uses `suppressWarnings`.*
- *The code also saves the `df` from each bootstrap iteration.*

```{r, cache=TRUE, message=FALSE}
B <- 1000
YYpred.boot <- matrix(NA, B, length(XX))
df <- rep(NA, B)
for(b in 1:B) {
  indices <- sample(1:length(X), length(X), replace=TRUE)
  suppressWarnings(fit <- smooth.spline(X[indices], Y[indices], cv=TRUE))
  df[b] <- fit$df
  YYpred.boot[b,] <- predict(fit, XX)$y
}
```

The plot below shows the bootstrapped predictive paths as gray lines, whose distribution is summarized by a solid black (mean) line and dashed-red 90% quantiles.

```{r, dev.args = list(bg = 'transparent'), fig.width=6, fig.height=5, fig.align="center"}
matplot(XX, t(YYpred.boot), type="l", col="gray",
        main="Bootstrapped Smoothing Splines", xlab="x", ylab="y")
points(X, Y)
m <- apply(YYpred.boot, 2, mean)
q1 <- apply(YYpred.boot, 2, quantile, p=0.05)
q2 <- apply(YYpred.boot, 2, quantile, p=0.95)
lines(XX, m, lwd=2)
lines(XX, q1, col=2, lty=2)
lines(XX, q2, col=2, lty=2)
```

- Notice how the dashed red lines capture most of the points,
- and how the predictive uncertainty is greatest at the edges of the input space.

The plot below shows the bootstrap distribution of the CV-estimated degrees of freedom, `df`.

```{r, dev.args = list(bg = 'transparent'), fig.width=6, fig.height=5, fig.align="center"}
hist(df, main="bootstrapped degrees of freedom")
```

- Notice how `df = 11` has high support under this distribution, but it is not the modal value, which is closer to `df = 12` or `13`.
- Low values, `df < 8` or so, do not have high support, but high ones, `df > 30`, do, favoring wigglier (less smooth) fits.

## Problem 5: Spam MC shell script (15 pts)

*Design a shell script `spam_mc.sh` which automates a crop of batch Monte Carlo instances for our spam "bakeoff", which are to be run in parallel. It must take an integer argument specifying how many parallel instances to create.
For example, make it so one can execute*

```
./spam_mc.sh 16
```

*to get 16 batches in parallel.*

You may find the script [`spam_mc.sh`](spam_mc.sh) linked from the course web page, and pasted below for convenience. See the inline comments for an explanation of each chunk.

```{r}
lines <- readLines("spam_mc.sh")
cat(paste(lines, collapse="\n"))
```

- Notice that this version takes two arguments, the second being an optional `reps` setting, with the default being five. This will be useful later when we transport this to ARC.

## Problem 6: Spam MC "bakeoff" in `parallel` (30 pts)

*Re-write the spam MC "bakeoff" from lecture with sockets (i.e., via the `parallel` package) rather than via "batch mode".*

- *One thing this means is that you need to collate the list of outputs `clusterApply` provides; i.e., not by reading `*.Rdata` files.*

Please see my [`spam_snow.R`](spam_snow.R), linked from the class web page. The highlights are provided below.

First, the function `spamfold` below encapsulates the (previously coded) "bakeoff", with the intention that it will be called (separately) for each fold. This might be a little different than what you did, possibly parallelizing only over repetitions. My version, as will be described in more detail below, combines repetitions and folds.
```{r}
spamfold <- function(fold, spam) {
  library(textir)
  library(rpart)
  library(mda)
  library(MASS)
  library(randomForest)

  ## training and testing set
  test <- spam[fold,]
  train <- spam[-fold,]

  ## get rid of columns that are all the same
  one <- rep(1, nrow(train))
  zo <- which(sqrt(drop(one %*% (train^2))) == 0)
  if(length(zo) > 0) { train <- train[,-zo]; test <- test[,-zo] }
  nt <- nrow(train)

  ## defining the universe of variables: the null model
  null <- glm(spam ~ 1, data=train, family="binomial")
  pnull <- round(predict(null, newdata=test, type="response"))
  hnull <- pnull == test$spam

  ## the full glm
  full <- suppressWarnings(glm(spam ~ ., data=train, family="binomial")) ## will warn
  pfull <- round(predict(full, newdata=test, type="response"))
  hfull <- pfull == test$spam

  ## go forward to get the best model greedily
  fwd <- suppressWarnings(step(null, scope=formula(full), k=log(nt), trace=0))
  pfwd <- round(predict(fwd, newdata=test, type="response"))
  hfwd <- pfwd == test$spam

  ## now try adding in interactions; this takes a long time
  ## (about an hour on a fast machine)
  fwdi <- suppressWarnings(step(fwd, scope=~.+.^2, k=log(nt), trace=0))
  pfwdi <- round(predict(fwdi, newdata=test, type="response"))
  hfwdi <- pfwdi == test$spam

  ## LDA
  spam.lda <- lda(spam~., data=train)
  plda <- as.numeric(predict(spam.lda, newdata=test)$class)-1
  hlda <- plda == test$spam

  ## QDA; try() is necessary because the quadratic expansion may not work
  hqda <- rep(NA, nrow(test))
  try({
    spam.qda <- qda(spam~., data=train)
    pqda <- as.numeric(predict(spam.qda, newdata=test)$class)-1
    hqda <- pqda == test$spam
  }, silent=TRUE)

  ## FDA
  spam.fda <- fda(spam~., data=train)
  pfda <- as.numeric(predict(spam.fda, newdata=test, type="class"))-1
  hfda <- pfda == test$spam

  ## rpart; convert spam to a factor, otherwise rpart will do regression
  train <- transform(train, spam=as.factor(spam))
  spam.rp <- rpart(spam~., data=train)
  prp <- as.numeric(predict(spam.rp, newdata=test, type="class"))-1
  hrp <- prp == test$spam

  ## mnlm
  Xspam <- model.matrix(spam ~ .
                        + .^2, data=train)[,-1]
  spampen <- mnlm(counts=train$spam, covars=Xspam, verb=1, penalty=c(10, 1/2))
  XXspam <- model.matrix(spam ~ . + .^2, data=test)[,-1]
  pmnlm <- round(predict(spampen, newdata=XXspam))
  hmnlm <- pmnlm == test$spam

  ## random forests
  rf <- randomForest(spam~., data=train)
  prf <- as.numeric(predict(rf, newdata=test))-1
  hrf <- prf == test$spam

  ## return hit indicators for each comparator
  return(data.frame(null=hnull, full=hfull, fwd=hfwd, fwdi=hfwdi, lda=hlda,
                    qda=hqda, fda=hfda, rp=hrp, mnlm=hmnlm, rf=hrf))
}
```

The next function, `spamcv.snow`, creates the folds for each repetition, calls `clusterApply` on each fold (from all repetitions), collates the results into an appropriate data frame of "hit rates", and returns the result.

```{r}
spamcv.snow <- function(cls, spam, folds=5, reps=2) {
  ## create folds for all reps
  all.folds <- list()
  for(r in 1:reps) all.folds <- c(all.folds, cv.folds(nrow(spam), folds))

  ## distributed computation for each fold
  hits <- clusterApply(cls, all.folds, spamfold, spam)

  ## allocate space for hit-rate calculation
  hitrate <- data.frame(matrix(NA, nrow=reps, ncol=ncol(hits[[1]])))
  colnames(hitrate) <- colnames(hits[[1]])

  ## loop over reps and folds
  i <- 1
  for(r in 1:reps) { ## reps
    dfhit <- matrix(NA, nrow=nrow(spam), ncol=ncol(hits[[1]]))
    for(j in 1:folds) { ## folds
      dfhit[all.folds[[i]],] <- as.matrix(hits[[i]])
      i <- i + 1
    }
    ## aggregate
    hitrate[r,] <- apply(dfhit, 2, mean)
  }

  ## return the collected hit rates
  return(hitrate)
}
```

- The advantage to divvying up over folds *and* repetitions is that there are more tasks to distribute over the parallel instances, which means that fewer cores will be idle at the end (as inevitably, some tasks will take more time than others).

The final step is to read in the data, define `cv.folds`, establish the cluster, and "go". Below is my code for doing that over eight hyperthreaded cores (i.e., effectively 16). However, the code is not executed in this `Rmarkdown` document.
Rather, it is called offline using the shell script described below.

```{r, eval=FALSE}
spam <- read.csv("../data/spam.csv")
cv.folds <- function(n, folds=10)
  split(sample(1:n), rep(1:folds, length=n))
library(parallel)
cl <- makeCluster(type="PSOCK", rep("localhost", nth))
hitrate <- spamcv.snow(cl, spam, reps)
stopCluster(cl)
save.image("spam_snow.RData")
```

- Notice that neither the `reps` nor the `nth` (number of threads) variable is defined.

The final ingredient is a shell script [`spam_snow.sh`](spam_snow.sh) that takes integer command-line arguments specifying the number of parallel instances (sockets) to create, the number of CV folds, and the number of CV repetitions. Although you may follow the link above, a readout is provided below.

```{r}
lines <- readLines("spam_snow.sh")
cat(paste(lines, collapse="\n"))
```

## Problem $\star$: Spam summary

*This isn't a real problem. It is just here as a place-holder to say that for Problems 5/6 you must demonstrate, in your PDF on Canvas, that you have been able to collect a substantial number of MC repetitions in order to re-visualize the results from lecture (which are only based on 5 repetitions). I expect at least thirty repetitions, which many would regard as a minimum number in order to "trust" the resulting accuracy distributions.*

I used the [`spam_mc_collect.R`](spam_mc_collect.R) script linked from the class web page to collate the results from my poor-man's batch parallel execution. I made fifteen parallel instances with two repetitions each. Rather than simply source the file, I have copied the code chunks with some explanation below.

First read in the true data and the predictions.

```{r}
spam <- read.csv("../data/spam.csv")
files <- list.files("./", pattern="spam_[0-9]*.RData")
length(files)
```

Then, initialize empty matrices for collecting predicted probabilities, and loop over the files to extract those from each of the comparators.
```{r}
pfulls <- pfwds <- pfwdis <- pldas <- pqdas <- NULL
pnulls <- pfdas <- prps <- pmnlms <- prfs <- pfulls
for (i in 1:length(files)) {
  f <- files[i]
  load(f)
  cat(" ", f, sep="")
  pnulls <- rbind(pnulls, pnull)
  pfulls <- rbind(pfulls, pfull)
  pfwds <- rbind(pfwds, pfwd)
  pfwdis <- rbind(pfwdis, pfwdi)
  pldas <- rbind(pldas, plda)
  pqdas <- rbind(pqdas, pqda)
  pfdas <- rbind(pfdas, pfda)
  prps <- rbind(prps, prp)
  prfs <- rbind(prfs, prf)
}
```

Then, calculate hit rates comparing each of those predictions to the true `spam` labels.

```{r}
hit <- function(x, y=spam$spam) mean(x == y, na.rm=TRUE)
df <- data.frame(null=apply(pnulls, 1, hit),
                 full=apply(pfulls, 1, hit),
                 fwd=apply(pfwds, 1, hit),
                 fwdi=apply(pfwdis, 1, hit),
                 lda=apply(pldas, 1, hit),
                 qda=apply(pqdas, 1, hit),
                 fda=apply(pfdas, 1, hit),
                 rp=apply(prps, 1, hit),
                 rf=apply(prfs, 1, hit))
```

Finally, we can summarize the results. The boxplot below summarizes the distribution of 30 hit rates obtained for each comparator.

```{r, dev.args = list(bg = 'transparent'), fig.width=6, fig.height=5.5, fig.align="center"}
boxplot(df[,-1], main="spam bakeoff", xlab="comparators", ylab="hit rate")
```

- Looks like `randomForest` is still the best; `qda` is the worst. Comparators `lda` and `fda` look very similar.

Pairwise $t$-tests can help sort out differences between methods with adjacent rankings. The code below sorts the comparators by mean hit rate, and reports $p$-values from such pairwise $t$-tests.

```{r}
dfmean <- apply(df, 2, mean)
o <- order(dfmean, decreasing=TRUE)
tt <- rep(NA, length(dfmean))
for(i in 1:(length(o)-1)) {
  tto <- t.test(df[,o[i]], df[,o[i+1]], alternative="greater", paired=TRUE)
  tt[o[i]] <- tto$p.value
}
cbind(data.frame(dfmean), data.frame(tt))[o,]
```

- All differences in hit rate are statistically significant except for `lda` and `fda`, which we can conclude are equally good (or bad).
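Since several of these pairwise comparisons are performed at once, a natural follow-up (not required by the problem) would be to guard against multiple-testing effects by adjusting the $p$-values with base R's `p.adjust`. The snippet below is only an illustrative sketch: the `pvals` vector holds made-up placeholder numbers with hypothetical comparison names, not the actual bakeoff output.

```{r}
## Illustrative sketch only: the p-values and comparison names below are
## made-up placeholders, not the actual bakeoff output.
pvals <- c(rf.vs.full=0.001, full.vs.fwdi=0.004, fwdi.vs.fwd=0.02, lda.vs.fda=0.45)

## Bonferroni: multiply by the number of tests (capped at 1); conservative
p.adjust(pvals, method="bonferroni")

## Benjamini-Hochberg: controls the false discovery rate; less conservative
p.adjust(pvals, method="BH")
```

Under either adjustment the qualitative story from the table above would be unchanged as long as the significant raw $p$-values are small relative to the number of adjacent-rank comparisons being made.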