Advanced Statistical Computing

Details

  • Course Number: STAT 6984
  • Institution: Virginia Tech
  • Term, Dates, Location: Fall Semester 2017, T/Th 12:30-1:45 SMYTH 232
  • Instructor: Robert B. Gramacy (rbg@vt.edu; bobby.gramacy.com)
    • Office Hours & Location: by appointment in HUTCH 403G
  • Prerequisites: A first-year graduate sequence in statistics, an introductory statistical computing course (ideally with R), and a willingness and capability to work in a Unix environment (more details under Personal Computing below).
  • Required text: None
  • Web: Course materials will be provided at bobby.gramacy.com/teaching/asc; Canvas will be used only for its grades and calendar features

About the course

Content and goals

This is a second course on statistical computing. Although basics will be revisited, the pace will be swift so we can get to advanced computing and data management topics as quickly as possible. The main programming language will be R, but by the end it will primarily act as the “glue” binding together other languages, databases, computing architectures and interfaces, as appropriate for the task(s) at hand. We will learn how statisticians can best leverage modern desktop computing (multiple cores), cluster computing (multiple nodes), distributed computing (Hadoop/Amazon EC2), and the coming wave of exascale computing (GPU/TPU/Xeon Phi). The goal is to make students marketable as postdocs at national labs and similar research facilities, where statisticians are expected to have the same computing skills as other applied scientists. A high bar of computing experience is required for graduating Ph.D.s to be competitive applicants for those positions, and likewise at investment banks/hedge funds, semiconductor companies, industrial engineering giants (Boeing, GE), etc. An aspect of that preparation will be going “back to basics”: navigating the Unix shell, manipulating data therein, compiling libraries with make, version control (e.g., Git), and good habits/best practices in code development and data management.

Software/List of topics

R will be reviewed and will serve as the main language for the class. Along the way we will be introduced to the following languages/libraries/platforms. No previous experience with these will be required. In most cases the material will be structured so that students will ultimately be versed in using these tools appropriately, rather than being required to build bespoke ones from scratch. (We will not attempt to teach C programming for example, but students will learn how to compile a C library and access it from within R.)
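As an illustration of the kind of C routine that can be compiled into a library and called from R, here is a minimal sketch. The function name sum_squares and the file name sum_squares.c are hypothetical, chosen only for this example; the pointer-only arguments and void return follow the calling convention of R's .C() interface.

```c
/* Hypothetical example routine: sum of squares of x[0..n-1], written to *out.
 * R's .C() interface passes every argument as a pointer and ignores the
 * return value, so the length and the result are pointers as well. */
void sum_squares(double *x, int *n, double *out)
{
    double total = 0.0;
    for (int i = 0; i < *n; i++) {
        total += x[i] * x[i];
    }
    *out = total;
}
```

Assuming the code is saved as sum_squares.c, it could be built with R CMD SHLIB sum_squares.c, loaded in R with dyn.load("sum_squares.so"), and called as .C("sum_squares", x = as.double(x), n = as.integer(length(x)), out = double(1))$out.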

The expectation is that there will be about a week for each of these, plus one week reviewing R basics.

  1. Unix: bash and common command-line tools: find, head, grep, make, nohup, screen
  2. Version control: git with Bitbucket and/or GitHub
  3. Advanced R topics: object oriented programming (S3/S4), pre-allocation and vectorization, modularization, environments, data structures, objects, functional programming
  4. R code correctness and efficiency: debugging (debug, error=recover) and profiling (Rprof)
  5. Data manipulation and data cleaning in Unix and R: regular expressions, sed, awk, merging, pivoting, subsetting, summarizing
  6. Data input: scraping (from web), formatted data (JSON/XML), databases (SQL), from other computing environments (Matlab, Excel, Stata, etc.)
  7. Visualization: advanced plotting and R graphics manipulation, ggplot
  8. Sharing: R packages, Rmarkdown, Shiny apps
  9. Compiled code: C/C++ (and Rcpp) and Fortran for fast for loops and linking external libraries, correctness (gdb/valgrind) and efficiency (gprof/OSX Instruments)
  10. Custom linear algebra libraries: MKL, Accelerate framework, ATLAS for faster R execution and customizing matrix-vector routines in C
  11. Symmetric multiprocessor computation: OpenMP
  12. Cluster computation: parallel in R, MPI (Rmpi), scheduler scripts/job queueing (SLURM)
  13. Distributed computation and storage: Amazon EC2/S3, map-reduce (Hadoop)
  14. Exascale computing: data-parallel computing on graphics cards (NVIDIA/CUDA) and their competitors (Intel Xeon Phi)
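To give a flavor of the compiled-code and OpenMP topics in the list above, the following is a minimal sketch of a C for loop parallelized across cores with an OpenMP reduction. The function name dot is hypothetical and the example is not taken from the course materials.

```c
/* Hypothetical example routine: dot product of a and b, each of length n. */
double dot(const double *a, const double *b, int n)
{
    double total = 0.0;
    /* With OpenMP enabled (e.g., gcc -fopenmp), iterations are split across
     * threads and each thread's partial sum is combined into total.
     * Without OpenMP the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += a[i] * b[i];
    }
    return total;
}
```

The reduction clause is what keeps the threads from racing on total: each thread accumulates a private copy, and the copies are summed when the loop ends.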

Personal computing

Students will need a Unix environment for this class. Examples include OSX, any Linux, BSD variant, etc. Students with PCs running Windows are encouraged to create a partition and install Ubuntu Linux. The instructor will provide support for this on a one-on-one basis. Alternatives to installing Ubuntu Linux on a partition include installing it on a (free) virtual machine, e.g., through VirtualBox, which the instructor will also support. Linux may also be accessed via the department servers. Although it is possible to create a Unix-like environment within Windows directly, and subsequently access all of the programs/software covered in the course, students are on their own if they choose this (more cumbersome and more expensive) option.

Texts

Although there is no required text (because you can generally Google and find quality help more quickly than with a book), the lecture notes are derived from the following texts, which each make nice desk references and are therefore highly recommended.

Grading details

Rubric

  • 75% Homework
  • 25% Final project

The graded work will be code-based and will be marked separately for correctness, efficiency, documentation and style. Students will be required to set up a private Git repository on Bitbucket in which all class work will be stored, and where the homework and final project will be submitted for grading. The maintenance of this repository will also be fair game for evaluation.

Exams

There will be no in-class exams. A final project will be assigned part-way through the semester, which students must complete on their own (in particular, without the help of classmates). It will be due during finals week. The final project will be submitted via the student’s private Git repository.

Homework

  • Homework will be assigned and due on a regular basis. Students are welcome to collaborate with one another, but are required to submit their own work as well as be able to reproduce it.
  • All work must be shown. Software must be used when appropriate, with its output attached.
  • Late homework will not be accepted unless previously approved by the instructor.
  • Homework will be submitted via the student’s private Git repository.
  • All homework grades will be kept; NO homework grade will be dropped.

Logistics

Honor code

The Virginia Tech Honor Code will be strictly enforced in this course. All graded assignments must be composed of your own work.

Services for students with disabilities

Any student who feels that he or she may need an accommodation because of a disability (learning disability, attention deficit disorder, psychological, physical, etc.) should make an appointment to see me during office hours.

Important dates