Advanced Statistical Computing
STAT 6984
This is a second course on statistical computing. Although basics will be revisited, the pace will be
swift so we can get to advanced computing and data management topics as quickly as possible. The main programming
language will be R, but by the end it will primarily act as
the "glue" binding together other languages, databases, computing architectures and interfaces, as appropriate for
the task(s) at hand. We will learn how statisticians can best leverage modern desktop computing (multiple cores),
cluster computing (multiple nodes), distributed computing (Hadoop/Amazon EC2), and the coming wave of exascale
computing (GPU/TPU/Xeon Phi). The goal is to make students marketable as postdocs at national labs and similar
research facilities, where statisticians are expected to have the same computing skills as other applied
scientists. A high bar of computing experience is required for graduating Ph.D.s to be competitive applicants for
those positions, and likewise at investment banks/hedge funds, semiconductor companies, industrial engineering
giants (Boeing, GE), etc. An aspect of that preparation will be “back to basics” with navigating the Unix shell,
manipulating data therein, compiling libraries with make, version control (e.g., Git), and good habits/best
practices in code development and data management.
- Homework 2 due date changed to 26 Sept.
Tim Warburton teaches a similar class to CMDA undergraduates, and this slide offers a nice snapshot of the toolchain that computational modelers (and statisticians) need to be effective researchers and collaborators. If you find it helpful, think of our class as catching you up with what undergraduates in other quantitative fields know about scientific computing, with a slight emphasis on statistics and data analytics.
The "home base" language for this course is R, which can be obtained from CRAN.
RStudio is an excellent multi-platform graphical interface to R, which you will likely prefer to the default interface.
Throughout the course we will encounter several other helpful tools, platforms and languages. The (incomplete) list of resources below, blending tutorials and best-practice guides, may be helpful.
- A guide to the bash shell
- R style guides from Google and Hadley Wickham
- Some instructions on setting up Bitbucket/Git and integrating with RStudio.
- A PeerJ issue on Practical Data Science for Stats
- Codecademy offers a nice suite of tutorials on many computing tools. You may find the ones on the Unix command line and Git to be of interest.