R), and a willingness and capability to work in a Unix environment (more details under Personal Computing below).
This is a second course on statistical computing. Although basics will be revisited, the pace will be swift so we can get to advanced computing and data management topics as quickly as possible. The main programming language will be R, but by the end it will primarily act as the “glue” binding together other languages, databases, computing architectures and interfaces, as appropriate for the task(s) at hand. We will learn how statisticians can best leverage modern desktop computing (multiple cores), cluster computing (multiple nodes), distributed computing (Hadoop/Amazon EC2) and the coming wave of exascale computing (GPU/TPU/Xeon Phi). The goal is to make students marketable as postdocs at national laboratories and similar research facilities, where statisticians are expected to have the same computing skills as other applied scientists. A high bar of computing experience is required for graduating Ph.D.s to be competitive applicants for those positions, and likewise at investment banks/hedge funds, semiconductor companies, industrial engineering giants (Boeing, GE), etc. An aspect of that preparation will be “back to basics” with navigating the Unix shell, manipulating data therein, compiling libraries with make, version control (e.g., Git), and good habits/best practice with code development and data management.
R will be reviewed and will serve as the main language for the class. Along the way we will be introduced to the following languages/libraries/platforms. No previous experience with these will be required. In most cases the material will be structured so that students will ultimately be versed in using these tools appropriately, rather than being required to build bespoke ones from scratch. (We will not attempt to teach C programming, for example, but students will learn how to compile a C library and access it from within R.)
The expectation is that there will be about a week for each of these, plus one week reviewing R basics.
bash and common built-ins: find, head, grep, make, nohup, screen
git with Bitbucket and/or GitHub
R topics: object oriented programming (S3/S4), pre-allocation and vectorization, modularization, environments, data structures, objects, functional programming
R code correctness and efficiency: debugging (debug, error=recover) and profiling (Rprof)
R: regular expressions, sed, awk, merging, pivoting, subsetting, summarizing
(Matlab, Excel, Stata, etc.)
R graphics manipulation, ggplot
R packages, Rmarkdown, Shiny apps
C/C++ (and Rcpp), Fortran for fast for loops and linking external libraries, correctness (gdb/valgrind) and efficiency (gprof/OSX Instruments)
R execution and customizing matrix-vector routines in C
OpenMP, parallel in R, MPI (Rmpi), scheduler scripts/job queueing (SLURM)
(hadoop)
(CUDA) and their competitors (Intel Xeon Phi)
Students will need a Unix environment for this class. Examples include OSX, any Linux or BSD variant, etc. Students with PCs running Windows are encouraged to create a partition and install Ubuntu Linux. The instructor will provide support for this on a one-on-one basis. Alternatives to installing Ubuntu Linux on a partition include installing it on a (free) virtual machine, e.g., through VirtualBox, which the instructor will also support. Linux may also be accessed via the department servers. Although it is possible to create a Unix-like environment in Windows directly, and subsequently access all of the programs/software covered in the course, students are on their own if they choose this (more cumbersome and more expensive) option.
Although there is no required text (because you can generally Google and find quality help more quickly than with a book), the lecture notes are derived from the following texts, which each make nice desk references and are therefore highly recommended.
R Programming by Norman Matloff
R by Hadley Wickham
R in a Nutshell by Joseph Adler
R by Jones, et al.
The graded work will be code-based and will be marked separately for correctness, efficiency, documentation and style. Students will be required to set up a private Git repository on Bitbucket in which all class work will be stored, and where the homework and final project will be submitted for grading. The maintenance of this repository will also be fair game for evaluation.
There will be no in-class exams. A final project will be assigned part-way through the semester, which students must complete on their own (in particular, without the help of classmates). It will be due during finals week. The final project will be submitted via the student’s private git repository.
The Virginia Tech Honor Code will be strictly enforced in this course. All graded assignments must be composed of your own work.
Any student who feels that he or she may need an accommodation because of a disability (learning disability, attention deficit disorder, psychological, physical, etc.) should make an appointment to see me during office hours.