Instructions

This homework is due on Tuesday, October 10th at 12:30pm (the start of class). Please turn in all your work. The primary purpose of this homework is to hone data and plotting skills. This description may change at any time, however notices about substantial changes (requiring more/less work) will be additionally noted on the class web page. Note that there are two prongs to submission, via Canvas and Bitbucket (in asc-repo/hwk/hw3). You don’t need to use Rmarkdown but your work should be just as pretty if you want full marks.

Problem 2: Asset returns (25 pts)

Return to the dow30 data from Yahoo! Finance that we collected in class.

  1. Consider the getSymbols function from the quantmod package for R.

How can it be used in a fashion similar to our get.quotes and get.multiple.quotes functions?

The getSymbols function is a bit of a hybrid between our two get.quotes and get.multiple.quotes routines. It allows you to specify several stocks, but it doesn’t collate the results. Instead, it defines variables in the calling environment matching the names of each stock history requested. Collating would have to be done “by hand”, as shown in iii. below. One nice thing about getSymbols is that it facilites the automatic creation of objects of particular, useful, classes like ts. One disadvantage is that it is a little more work (compared to our get.quotes) to get monthly or weekly data. See, e.g., this page, which post-processes the daily results which are downloaded.

By inspecting the source, describe how it handles the “crumb issue” we encountered?

The getSymbols function uses the special cookies features in the curl package for R. As a back-end, the curl R package uses the (Unix) curl library functions. In that (Unix) environment, curl is essentially a command-line browser. It is designed to do what other ordinary browsers do, like Google Chrome or Firefox, but without the human (readable) interface. Rather, it is design to automate how a human would use a browser in computer code. So it is able to automate some of the steps that we had to do by hand, like getting and setting the crumb.

For later, lets load in the library.

library(quantmod)
options("getSymbols.warning4.0"=FALSE)
options("getSymbols.yahoo.warning"=FALSE)
  • Those options quiet warnings that would come up later and make things look messy.

Provide the commands (using getSymbols) which mimic the result of our get.multiple.quotes function on the dow30.tickers from class.

Here are the dow30 tickers from lecture, augmented to include Dow-Dupont which was not listed on Wikipedia at that time.

dow30.tickers <-
  c("MMM", "AXP", "AAPL", "BA", "CAT", "CVX","CSCO", "DWDP", 
    "KO", "DIS", "XOM", "GE", "GS", "HD", "IBM", "INTC", "JNJ",
    "JPM", "MCD", "MRK", "MSFT", "NKE", "PFE", "PG", "TRV", 
    "UTX", "UNH", "VZ", "V", "WMT")

In order to avoid cluttering the environment with new symbols, we are going to write a new get.multiple.symbols like get.multiple.quotes that iterates over getSymbols in a similar fashion (with env=NULL)

  • Notice that we don’t need the crumb.
  • A more elegant set of commands can be used to get the date range and/or convert to lower frequency data.
  • We need to coerce the output of getSymbols to a data.frame or matrix for rbind to work.
get.multiple.symbols <- function(tkrs, from=(Sys.Date()-365), to=(Sys.Date()), interval=c("1d", "1wk", "1mo"))
  {
    interval <- match.arg(interval)
    tmp <- NULL
    for (tkr in tkrs) {

        ## manipulate data
        s <- getSymbols(tkr, env=NULL)                    ## download
        s <- s[paste(from, to, sep="::")]                 ## time subset
        if(interval == "1wk") s <- to.weekly(s)           ## period week
        else if(interval == "1mo") s <- to.monthly(s)     ## period month
        
        ## coerce to appropriate data frame
        mat <- as.data.frame(s)  
        mat <- cbind(tkr, row.names(mat), mat)
        row.names(mat) <- NULL
        colnames(mat) <- c("Symbol", "Date", "Open", "High", "Low", "Close", "Volume", "Adj Close")

        ## augment or create
        if (is.null(tmp)) tmp <- mat
        else tmp <- rbind(tmp, mat)
    }
    tmp
  }

Now we can get the price data to use in answering the question below.

dow30 <- get.multiple.symbols(dow30.tickers)
## Warning: JPM contains missing values. Some functions will not work if
## objects contain missing values in the middle of the series. Consider using
## na.omit(), na.approx(), na.fill(), etc to remove or replace them.
  1. With the data you downloaded, via the old get.quotes version or the new getSymbols (as you prefer), use the transform function to create a new "Mid" column tabulating the average between the "Low" and "High" quotes, and to convert the "Date" column into a Date class using a single call.

Following the in-class demonstration, we may use the following transform.

dow30 <- transform(dow30, Date=as.Date(Date), Mid=(High+Low)/2, Symbol=as.character(Symbol))
  • The final "Symbol" may be necessary if you use the original get.multiple.quotes version.
  1. Calculate returns \(r_{a,t}\) for each asset \(a\) based on the "Mid" price \(p_{a,t}\) on each day \(t\): \(r_{a,t} = (p_{a,t} - p_{a,t-1})/p_{a,t-1}\).

Having converted the dates makes them sort-able, which is useful for calculating returns in the following way.

dow30$Returns <- rep(NA, nrow(dow30))
for(t in dow30.tickers) {
  rows <- which(dow30$Symbol == t)
  Mid <- dow30[rows,]$Mid
  n <- length(rows)
  Returns <- (Mid[2:n] - Mid[1:(n-1)])/Mid[1:(n-1)]
  dow30[rows,]$Returns <- c(NA, Returns)
}
  1. Use those same "Mid" prices to calculate the Dow Jones Industrial Average (DJIA) using the single “divisor” provided on that page. Compare your calculation to a DJIA index downloaded from Yahoo!

Using the divisor downloaded from that Wikipedia page, the code below calculates the index DJI.

div <- 0.14523396877348
ud <- sort(unique(dow30$Date))
DJI <- data.frame(Date=ud, Mid=NA)
for(d in ud) {
  rows <- which(dow30$Date == d)
  DJI[DJI$Date==d,2] <- sum(dow30[rows,]$Mid)/div
}

To make a comparison, we dowload the DJIA from Yahoo and perform the same "Mid" calculation, etc..

DJIyah <- get.multiple.symbols("^DJI")
DJIyah <- transform(DJIyah, Date=as.Date(Date), Mid=(High+Low)/2, Symbol=as.character(Symbol))

As you can see from the plot below, the two series are essentialy identical.

plot(DJI, type="l", ylim=range(c(DJI, DJIyah$Mid)), lwd=2, main="Dow Jones Indices")
lines(DJIyah$Date, DJIyah$Mid, col=2, lty=2, lwd=2)
legend("bottomright", c("by hand", "index"), col=1:2, lty=1:2, lwd=2)

  1. The DJIA is often criticized for being a simple “price-weighted” index, meaning that the divisor is common to all assets. Create an alternative market-capitalization weighted version, more like the S&P500. For bonus points, write a script to scrape the “market caps” from Google or Yahoo. (Otherwise do it by hand.) Calculate returns. Finally, put a data.frame together with your all of your DJIA prices and returns and provide a visualization comparing the two.

The following market capitalizations were downloaded, by hand, from Yahoo as of today (12 Oct 2017) and are recorded in billions of dollars.

  • These are entered in the same order as dow30.tickers above.
caps <-  c(129, 81, 808, 154, 75, 226, 166, 167, 
    196, 152, 350, 199, 94, 195, 137, 184, 367,
    340, 132, 174, 174, 83, 217, 233, 35, 
    94, 189, 199, 248, 256)
names(caps) <- dow30.tickers
caps <- caps[order(names(caps))]
  • And then put in alphabetical order

Using a loop similar to the above for the DJIA calculation, the following performs the analog weighted average.

## now loop over each date to estimate the dow price
ud <- sort(unique(dow30$Date))
DJIw <- data.frame(Date=ud, Mid=NA)
for(d in ud) {
  rows <- which(dow30$Date == d)
  cs <- caps[names(caps) %in% dow30[rows,]$Symbol]
  if(any(names(cs) != dow30[rows,]$symbol)) stop("names mismatch")
  DJIw[DJIw$Date==d,2] <- drop((cs %*% dow30[rows,]$Mid)/sum(cs))
}

Now calculate returns for the new original and weighted indices.

n <- nrow(DJI)
DJIwRet <- DJIRet <- rep(NA, n)
DJIRet[-1] <- (DJI[-1,2] - DJI[1:(n-1),2])/DJI[1:(n-1),2]
DJIwRet[-1] <- (DJIw[-1,2] - DJIw[1:(n-1),2])/DJIw[1:(n-1),2]

The data frame below records all of our indices and their returns.

DJI <- data.frame(DJI, DJIRet, DJIw[,-1], DJIwRet)
names(DJI) <- c("Date", "p", "r", "wp", "wr")

The plots below separately compare prices and returns. Prices are normalized so that weighted and un-weighted versions are on the same scale.

par(mfrow=c(1,2))
plot(DJI$Date, DJI$p / max(DJI$p), type="l", lwd=2, main="prices", xlab="Date", ylab="Normalized Price")
lines(DJI$Date, DJI$wp / (max(DJI$wp)) , lty=2, col=2, lwd=2)
legend("bottomright", c("unweighted", "weighted"), col=1:2, lty=1:2, lwd=2)
plot(DJI$Date, DJI$r, type="l", lwd=2, main="returns", xlab="Date", ylab="Returns")
lines(DJI$Date, DJI$wr, lty=2, col=2, lwd=2)

  • Not very much difference between weighted and unweighted versions. Perhaps the criticism of the DJIA is overblown?
  1. Calculate the correlation of each stock’s returns to those of the market, as measured (separately) by your two DJIA return calculations. Create a new data set where the stock prices and returns are presented after sorting on these correlations.

The code below uses merge to match on "Date" in order to facilitate the pairings fed into cor for calculating correlations.

i <- 1
dow30$Cors <- NA
for(t in dow30.tickers) {
  rows <- which(dow30$Symbol == t)
  data <- merge(DJI[,c("Date","r")], dow30[rows,c("Date", "Returns")])
  dow30[rows,]$Cors <- cor(data$Returns, data$r, use="complete.obs")
  i <- i + 1
}
  • Above, the un-weighted index was used, but one could easily swap in $DJIwRet instead.

Now, the data frame can be re-order via correlation as follows.

dow30 <- dow30[order(dow30$Cors, dow30$Date, decreasing=TRUE),]
dow30[1:10,]
##      Symbol       Date   Open   High    Low  Close  Volume Adj.Close
## 3276     GS 2017-10-13 239.00 239.45 236.84 238.53 2481800    238.53
## 3275     GS 2017-10-12 242.33 243.42 238.75 239.80 2148200    239.80
## 3274     GS 2017-10-11 242.00 243.20 241.35 242.40 2175700    242.40
## 3273     GS 2017-10-10 242.80 243.73 241.70 242.60 2187100    242.60
## 3272     GS 2017-10-09 245.15 246.35 242.02 242.80 2165200    242.80
## 3271     GS 2017-10-06 246.30 247.08 244.61 246.02 2396100    246.02
## 3270     GS 2017-10-05 241.00 246.32 240.12 246.06 3521300    246.06
## 3269     GS 2017-10-04 241.62 242.88 240.12 240.31 1840400    240.31
## 3268     GS 2017-10-03 241.07 242.71 239.69 241.62 2010800    241.62
## 3267     GS 2017-10-02 237.20 241.10 237.10 240.65 2501300    240.65
##          Mid      Returns     Cors
## 3276 238.145 -0.012194879 0.698446
## 3275 241.085 -0.004911784 0.698446
## 3274 242.275 -0.001812805 0.698446
## 3273 242.715 -0.006020060 0.698446
## 3272 244.185 -0.006752208 0.698446
## 3271 245.845  0.010792700 0.698446
## 3270 243.220  0.007122157 0.698446
## 3269 241.500  0.001243762 0.698446
## 3268 241.200  0.008782930 0.698446
## 3267 239.100  0.012770822 0.698446
  • Apparently, Goldman Sachs gives the highest correlation.

Problem 3: Data visualization (30 pts)

Use charts to explore the 1970s UK teacher’s pay survey data in the teach.csv available on the course web page. What to show and how to show it is up to you.

First, lets read in the data.

teach <- read.csv("../data/teach.csv", comment.char="#")

Now consider the following representative set of visualizations.

Below is a plot of salary versus months of service, however rather than using points the code below makes an empty plot and then uses colored text to show gender simultaneously.

plot(teach$months, teach$salary, type="n", main="Gender Gap")
text(teach$months, teach$salary, col=as.numeric(teach$sex),
     labels=teach$sex)

There are several noteworthy observations here.

  • Pay increases with seniority, but so too does variability.
  • Men and women start out being paid about the same amount, and the increase is fairly uniform for the first several years.
  • However, after about 150 months or so we start to see some men being paid lots more than the rest of the cohort, including men and women. How can we explain that?

Lets poke around a little more. The set of boxplot below show how salary is distributed, marginally, for each of the categorical predictor variables.

par(mfrow=c(3,2), mar=c(4,3,1,1))
boxplot(salary ~ sex, data=teach, xlab="sex", col=7)
boxplot(salary ~ marry, data=teach, xlab="marry", col=7)
boxplot(salary ~ degree, data=teach, xlab="degree", col=7)
boxplot(salary ~ type, data=teach, xlab="type", col=7)
boxplot(salary ~ train, data=teach, xlab="train", col=7)
boxplot(salary ~ brk, data=teach, xlab="break", col=7)