Department of Statistics, Virginia Tech

In 2006 a firm called Nucleus Research conducted a study surrounding a claim by a business software company, SAP.

- SAP said its customers were 32% **more** profitable than their industry peers (who, presumably, were not using SAP).

Nucleus produced a report for Oracle, a competitor, finding that actually

- SAP customers were 20% **less** profitable than their peers, based on return on investment (ROI), a metric Nucleus specializes in.

Looking at the Nucleus web page, and the Wikipedia page for Nucleus, it is clear that this study is Nucleus’ claim to fame.

Oracle used that report as the basis for an advertising campaign.

What does Oracle have to do with it?

- Well, nothing really. For all we know Oracle customers are 50% less profitable than their industry peers. The point is just to have an interesting real-data example to get us started.

The data is included in a spreadsheet at the end of the Nucleus report.

- I’ve extracted that data and put it in a CSV file called `nucleus.csv`.
- Let’s read it in and see what it looks like.

```
nucleus <- read.csv("../data/nucleus.csv")
nucleus[22:30,]
```

```
## Company SAPSolution Industry ROE
## 22 Danone CRM Consumer+retail 6.7
## 23 Dow Corning PLM Chemicals 32.7
## 24 Dragerwerk AG CRM Engineering construction 11.6
## 25 Dupont SCM Chemicals+life sciences 17.1
## 26 Eastman Kodak SCM Chemicals+life sciences -47.5
## 27 EnBW CRM Utilities 23.8
## 28 Epson Korea SAM+CRM Consumer+high tech 1.1
## 29 Ericsson SRM Telecom 26.7
## 30 Fraport ag PLM+SCM+SAM+Mobile Aerospace+services 6.9
## IndustryROE
## 22 16.4
## 23 19.5
## 24 15.3
## 25 19.5
## 26 19.5
## 27 10.2
## 28 14.9
## 29 8.6
## 30 10.9
```

The data has two ROE (return on equity) columns: one for the SAP client in question (e.g., Dupont), and one for the industry at large that the company is “in” (e.g., Chemicals).

- It is sensible to ask how these compare.

How can we estimate ROE for these two groups?

- How about we just calculate the (sample) average values?

```
sap <- mean(nucleus$ROE)
ind <- mean(nucleus$IndustryROE)
roe <- data.frame(sap=sap, industry=ind)
roe
```

```
## sap industry
## 1 12.63704 15.69877
```

What do these numbers tell us about SAP versus Industry?

- Is the SAP number 20% lower than the Industry one?

`sap/ind`

`## [1] 0.8049701`

- Yes, it is!

But what is the strength of this evidence?

- How confident are we that this result didn’t come by chance?
- How big would the “gap” have to be before you thought it was noteworthy?
- 1%, 5%, 10%, 25%?

Consider for now just the SAP numbers.

```
sap.y <- nucleus$ROE
sap.y
```

```
## [1] 21.1 15.8 1.2 61.2 15.2 26.4 66.9 5.4 12.2 4.9 20.8
## [12] 23.3 13.2 14.5 27.3 17.0 -41.1 35.9 116.4 8.1 7.9 6.7
## [23] 32.7 11.6 17.1 -47.5 23.8 1.1 26.7 6.9 13.4 14.9 14.9
## [34] 14.6 -15.4 4.6 43.9 15.8 6.4 13.1 14.6 8.4 8.9 8.8
## [45] 23.1 -91.8 45.8 7.3 18.9 22.8 -7.7 6.8 -9.2 -62.6 18.7
## [56] 27.3 8.3 16.7 18.4 28.8 25.6 16.9 0.3 -0.7 -38.5 8.3
## [67] 8.3 8.3 8.3 6.2 6.2 -0.5 18.6 0.9 47.4 27.3 4.7
## [78] 2.8 41.4 26.0 14.6
```

Suppose the true population of SAP ROE’s is IID Gaussian with parameters \(\mu=12\) and \(\sigma^2=25^2\).

- We never presume to know these values, but the hypothetical is illustrative.

With \(n=81\) we have that \(\bar{Y} \sim \mathcal{N}(12, 25^2/81 \approx 7.72)\).

- Observe that our sample mean is well within the middle 95% of this distribution.

`sap`

`## [1] 12.63704`

`qnorm(c(0.025, 0.975), 12, sd=25/sqrt(81))`

`## [1] 6.555656 17.444344`
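As a quick sanity check (a small sketch, reusing the hypothetical \(\mu=12\), \(\sigma=25\), \(n=81\)), we can standardize the observed sample mean and see how many standard errors it sits from the hypothetical mean:

```
## Standardize the observed sample mean under the hypothetical truth;
## 12.63704 is the value of `sap` computed above from nucleus.csv
sap <- 12.63704
z <- (sap - 12) / (25 / sqrt(81))
z                     ## about 0.23 standard errors above the mean
2 * pnorm(-abs(z))    ## two-sided tail probability, about 0.82
```

A z-score this small is entirely unremarkable under the hypothetical.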

Here is what our actual data, and \(\bar{y}\), look like compared to the “truth”.

```
y <- seq(-63, 87, length=1000)
plot(y, dnorm(y, 12, 25), type="l", lwd=2, bty="n", main="Hypothetical Truth")
points(sap.y, rep(0, length(sap.y)))
abline(v=sap)
legend("topright", "ybar", lty=1)
```

Here is a little **Monte Carlo experiment** that numerically illustrates the frequency argument made implicitly in developing the sampling distribution for this data.

Let’s generate 1000 “data sets” that are like our actual data \(y_1, \dots, y_{81}\),

- in the sense that they are IID samples from the same, hypothetically known, distribution.

```
Y <- matrix(rnorm(81*1000, 12, 25), ncol=81)
Ybar <- rowMeans(Y)
c(mean(Ybar), var(Ybar))
```

`## [1] 11.990633 7.400345`

- pretty close to what we calculated, eh?
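For reference, the analytic values these Monte Carlo estimates are targeting come straight from the sampling distribution:

```
## Theoretical mean and variance of Ybar when Y_i ~iid N(12, 25^2), n = 81
c(mean = 12, var = 25^2 / 81)   ## variance is about 7.72
```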

Now we go whole hog and compare the empirical (Monte Carlo) sampling distribution to the one we derived analytically.

```
hist(Ybar, xlim=range(y), freq=FALSE, border=2)
lines(y, dnorm(y, 12, 25/sqrt(81)), col=2)
lines(y, dnorm(y, 12, 25), lwd=2)
abline(v=sap)
points(sap.y, rep(0, length(sap.y)))
legend("topright", c("ybar", "Ybar density"), lty=1, col=1:2)
```

While we’re on the subject of Monte Carlo experiments, how can we understand

- the purely random chance that SAP ROE was 20% lower?

Consider two identical populations, \(\mathcal{N}(12, 25^2)\) with 81 observations each:

- mimicking our “data-generating mechanism”, and our observed sample;
- but now, generate those two samples 1000 times.
- (We already did this for SAP, in `Y` on the previous slide.)

```
Yind <- matrix(rnorm(81*1000, 12, 25), ncol=81)
Yindbar <- rowMeans(Yind)
mean(Ybar/Yindbar < 0.8)
```

`## [1] 0.264`

Our little simulation showed that, in samples of size \(n=81\),

- even two identical populations would show a larger than 20% difference more than 25% of the time
**just due to random chance**!
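As a rough analytic cross-check (a sketch under the same Gaussian hypothetical, ignoring the rare event that the denominator \(\bar{Y}_{\mathrm{ind}}\) is negative), note that \(\bar{Y}_{\mathrm{sap}}/\bar{Y}_{\mathrm{ind}} < 0.8\) is essentially the event \(\bar{Y}_{\mathrm{sap}} - 0.8\,\bar{Y}_{\mathrm{ind}} < 0\), and that linear combination of independent Gaussians is itself Gaussian:

```
## P(Ybar.sap - 0.8 * Ybar.ind < 0) when both sample means are N(12, 25^2/81)
v <- 25^2 / 81                                   ## variance of each sample mean
pnorm(0, mean = 12 - 0.8 * 12, sd = sqrt(v * (1 + 0.8^2)))   ## about 0.25
```

which agrees with the Monte Carlo estimate of 0.264 up to simulation error.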

That’s hardly a sound basis for a marketing campaign for Oracle (or against SAP);

- it suggests the study is seriously flawed.
- The study would have to be \(\sim 6\)x larger to rule out “random chance”, with greater than 95% probability.

```
Ybar6 <- rowMeans(matrix(rnorm(81*6*1000, 12, 25), ncol=81*6))
Yindbar6 <- rowMeans(matrix(rnorm(81*6*1000, 12, 25), ncol=81*6))
mean(Ybar6/Yindbar6 < 0.8)
```

`## [1] 0.037`
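To see where the \(\sim 6\)x figure comes from, here is a sketch that scans multipliers of the original \(n = 81\) (assuming the same hypothetical truth and 1000 Monte Carlo repetitions as above; `gap.prob` is just a helper name introduced here):

```
## Estimated chance that two identical N(12, 25^2) populations show a >20%
## ratio gap, as a function of the sample-size multiplier m
gap.prob <- function(m, reps = 1000) {
  A <- rowMeans(matrix(rnorm(81 * m * reps, 12, 25), ncol = 81 * m))
  B <- rowMeans(matrix(rnorm(81 * m * reps, 12, 25), ncol = 81 * m))
  mean(A / B < 0.8)
}
set.seed(1)
sapply(1:6, gap.prob)   ## decreases with m; drops below 0.05 around m = 6
```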

But we’re getting a little ahead of ourselves…