Robert B. Gramacy, Professor of Statistics
Intermediate Data Analytics and Machine Learning
CMDA/CS/STAT 4654 is a technical analytics course that will teach supervised and unsupervised learning strategies,
including regression, generalized linear models, regularization, dimension reduction methods, tree-based methods for
classification, and clustering. Upper-level analytical methods are shown in practice: e.g., advanced naïve Bayes,
neural networks and Gaussian processes. It is targeted towards students who have completed (and remember the concepts
from) a course in introductory statistics and mathematical modeling.
We will make extensive use of calculus, linear algebra, and probability.
Computational tools, such as the R language for statistical computing, will be used for illustration in class and will be essential for completing homework problems.
Notices
Lectures

Part 1: Introduction & Overview (doc format)

Part 2: Least Squares (doc format)
Supplementary code: correlations, and conditional distributions
Data files: wages, and mutual funds
Supplemental lecture: maximum likelihood (doc format) 
Part 3: Linear Model (doc format)
Supplementary code: lmmc.R, and for MC sampling under the linear model 
Part 4: Diagnostics & Transformations (doc format)
Data files: Anscombe, rent, pickups, telemarketing, imports, and food sales

Part 5: Multiple Linear Regression (doc format)
Data files: census, boss, and grades

Part 6: Model Selection (with CV & Bootstrap) (doc format)
Data files: census
Supplementary code: prostate.R

Part 7: Time Series (doc format)
Data files: weather, beer, dja, and airlines 
Part 8: Classification & GLMs (doc format)
Data files: NBA spread and German credit
Supplementary code: lect8_germancredit.R 
Part 9: Clustering & EM (doc format)
Supplementary code: em.R 
Part 10: Trees (doc format)
Data files: NBC, and CA housing
Homework
Due at the start of lecture.

Homework 0 (Rmd): prerequisites, due 25 Jan 2017
Solutions (Rmd)

Homework 1 (Rmd): least squares and linear models, due 8 Feb 2017
Data files: tractors
Solutions (Rmd)

Homework 2 (Rmd): diagnostics and transformations, due 22 Feb 2017
Data files: tractors, transforms, and cheese
Solutions (Rmd)

Homework 3 (Rmd): multiple linear and stepwise regression, due 15 March 2017
Data files: nutrition, beef, and pollution
Solutions (Rmd)

Homework 4 (Rmd): model selection (CV & bootstrap), due 31 March 2017
Data files: pollution
Solutions (Rmd)
Supplementary file: fit_mse.R

Homework 5 (Rmd): time series and GLMs, due 12 Apr 2017
Data files: UK Gas, seatbelts, and adult income
Solutions (Rmd)

Homework 6 (Rmd): EM and clustering, due 28 April 2017
Computing
The recommended language for this course is R, which can be obtained from CRAN.
Other languages such as MATLAB are allowed but are not recommended.
Examples in lecture, and help in office hours, etc., will be exclusively in R.
Below are some helpful R resources:
 A quick R tutorial and accompanying code file
 Some helpful video tutorials and step-by-step guides
 RStudio is an excellent multiplatform graphical interface to R, which you will likely prefer to the default Windows/OSX GUI(s).
 If you must, MATLAB code supporting the book can be downloaded here.