Advanced Statistical Computing
STAT 6984
This is a second course on statistical computing. Although basics will be revisited, the pace will be
swift so we can get to advanced computing and data management topics as quickly as possible. The main programming
language will be R, but by the end it will primarily act as
the "glue" binding together other languages, databases, computing architectures and interfaces, as appropriate for
the task(s) at hand. We will learn how statisticians can best leverage modern desktop computing (multiple cores),
cluster computing (multiple nodes), distributed computing (Hadoop/Amazon EC2), and the coming wave of exascale
computing (GPU/TPU/Xeon Phi). The goal is to make students marketable as postdocs at National Labs and similar
research facilities where statisticians are expected to have the same computing skills as other applied
scientists. A high bar of computing experience is required for graduating Ph.D.s to be competitive applicants for
those positions, and likewise at investment banks/hedge funds, semiconductor companies, industrial engineering
giants (Boeing, GE), etc. An aspect of that preparation will be “back to basics” with navigating the Unix shell,
manipulating data therein, compiling libraries with make, version control (e.g., Git), and good habits/best
practice with code development and data management.
- Final project statement was released on Tuesday 10 Oct.
- Class is canceled on Tuesday 3 Oct.
- Homework 2 due date changed to 26 Sept.
Regular Expressions: slides and tutorial courtesy of A Fadikar
Auxiliary files: awk_basics.sh, line_count.awk, regex.R, script.awk, sed_basics.sh, sed_script_arg.sh, sed_script.sh
Data files: homicides.txt, numbers.txt
Homework
Due at the start of lecture.
- Homework 1 (Rmd): Linux, Bitbucket and R, due 7 Sept 2017
- Homework 2 (Rmd): R functions, OOP and scripts, due 26 Sept 2017
Solutions (Rmd) including bisection2.R, rendpurl.R and build.sh
- Homework 3 (Rmd): data and plotting, due 10 Oct 2017
Data: teach.csv, nhlteams.csv
Solutions (Rmd) including nhlstandings.R, nhltables.R and nhlstandings.sh
- Homework 4 (Rmd): Monte Carlo and parallelization, due 31 Oct 2017
Solutions (Rmd) including spam_mc.sh, spam_snow.R and spam_snow.sh
- Homework 5 (Rmd): Compiled code, due 16 Nov 2017
Code files: mvnllik.R and mvnllik.c
Solutions (Rmd and MKL version) including mvnllik_sol.c, mvnllik_sol.cpp, mvnllik_arma.cpp, and bootreg_sol.c
- Homework 6 (Rmd): Compiled code, due 5 Dec 2017
Solution files: spam_mc_mult.qsub, spam_Rmpi.qsub, and spam_Rmpi.R
- Final homework, due 18 Dec 2017 11:59pm
You need to do five problems, not including your own.
GPUs (Rmd) with MetroH_PT.R
Distributed computing (Rmd)
plotly (Rmd), requiring baseball data
Regular Expressions (Rmd)
Deep Learning (Rmd)
Tim Warburton teaches a similar class to CMDA undergraduates, and this slide offers a nice snapshot of the toolchain that computational modelers (and statisticians) need to be effective researchers and collaborators. If you find it helpful, think of our class as catching you up with what undergraduates in other quantitative fields know about scientific computing, with a slight emphasis on statistics and data analytics.
The "home base" language for this course is R, which can be obtained from CRAN.
RStudio is an excellent multi-platform graphical interface for R, which you will likely prefer to the default one.
Throughout the course we will encounter several other helpful tools, platforms and languages. The (incomplete) list of resources below, blending tutorials and best-practice guides, may be helpful.
- A guide to Unix from VT ARC (Advanced Research Computing)
- A guide to the bash shell
- R style guides from Google and Hadley Wickham
- Some instructions on setting up Bitbucket/Git and integrating with RStudio.
- A PeerJ issue on Practical Data Science for Stats
- Codecademy offers a nice suite of tutorials on many computing tools. You may find the ones on the Unix command line and Git to be of interest.
- Free access to Intel MKL on Ubuntu for optimized and threaded linear algebra, and instructions for quick linking to R.
- Follow these instructions to use the Accelerate framework on the Mac. (Ignore the OpenBLAS suggestions. OpenBLAS is not thread safe. E.g., it doesn't work with OpenMP.)
- You may need to download gcc and gfortran to compile C code on your Mac. OS X's default compiler, Clang, is great, but it doesn't support OpenMP.
- Microsoft R Open is basically R compiled with Intel MKL for Windows and Linux platforms.
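After linking R against an optimized BLAS/LAPACK (MKL, Accelerate, or Microsoft R Open as listed above), a quick timing run can suggest whether the change took effect. The following is only a rough sketch: the matrix size `n` is an arbitrary illustration value, and the timings depend entirely on your machine and library, so no expected numbers are given. Run it before and after relinking and compare.

```r
# Rough check of linear algebra speed in the current R session.
# n is an arbitrary size chosen for illustration, not a course requirement.
set.seed(1)
n <- 500
X <- matrix(rnorm(n * n), nrow = n)

# crossprod(X) computes t(X) %*% X using the BLAS that R is linked against
elapsed_xtx <- system.time(XtX <- crossprod(X))["elapsed"]
cat("crossprod on a", n, "x", n, "matrix took", elapsed_xtx, "seconds\n")

# The Cholesky factorization is carried out by LAPACK
S <- XtX + diag(n)  # adding the identity keeps S safely positive definite
elapsed_chol <- system.time(U <- chol(S))["elapsed"]
cat("chol on the same size took", elapsed_chol, "seconds\n")

# Report the LAPACK version in use; in recent versions of R,
# sessionInfo() also prints the BLAS/LAPACK library paths.
print(La_version())
print(sessionInfo())
```

An optimized, threaded library should typically show noticeably lower elapsed times for larger `n` than R's reference BLAS, though small matrices may show little difference.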