## The Nucleus study

In 2006 a firm called Nucleus Research conducted a study surrounding a claim by a business software company, SAP.

• SAP said its customers were 32% more profitable than their industry peers (peers who, presumably, were not using SAP).

Nucleus produced a report for Oracle, a competitor, finding that actually

• SAP customers were 20% less profitable than their peers, as measured by return on equity (ROE), the metric recorded in the study’s data; this sort of return-on-investment analysis is Nucleus’ specialty.

Looking at the Nucleus web page, and the Wikipedia page for Nucleus, it is clear that this study is Nucleus’ claim to fame.

Oracle used that report as the basis for an advertising campaign.

What does Oracle have to do with it?

• Well, nothing really. For all we know Oracle customers are 50% less profitable than their industry peers. The point is just to have an interesting real-data example to get us started.

## Getting familiar with the data

The data is included in a spreadsheet at the end of the Nucleus report.

• I’ve extracted that data and put it in a CSV file called nucleus.csv.
• Let’s read it in and see what it looks like.
nucleus <- read.csv("../data/nucleus.csv")
nucleus[22:30,]
##          Company        SAPSolution                 Industry   ROE
## 22        Danone                CRM          Consumer+retail   6.7
## 23   Dow Corning                PLM                Chemicals  32.7
## 24 Dragerwerk AG                CRM Engineering construction  11.6
## 25        Dupont                SCM  Chemicals+life sciences  17.1
## 26 Eastman Kodak                SCM  Chemicals+life sciences -47.5
## 27          EnBW                CRM                Utilities  23.8
## 28   Epson Korea            SAM+CRM       Consumer+high tech   1.1
## 29      Ericsson                SRM                  Telecom  26.7
## 30    Fraport ag PLM+SCM+SAM+Mobile       Aerospace+services   6.9
##    IndustryROE
## 22        16.4
## 23        19.5
## 24        15.3
## 25        19.5
## 26        19.5
## 27        10.2
## 28        14.9
## 29         8.6
## 30        10.9

The data has two ROE (return on equity) columns, one for the SAP client in question (e.g., Dupont), and one for the industry-at-large that that company is “in” (e.g., Chemicals).

• It is sensible to ask how these compare.
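Since each row pairs a company with its own industry benchmark, one natural comparison is the per-company gap, ROE minus IndustryROE. A minimal sketch, re-entering by hand just the nine rows printed above (in practice you would use the full `nucleus` data frame):

```r
# Rows 22-30 from the printout above, keyed in for illustration
mini <- data.frame(
  Company = c("Danone", "Dow Corning", "Dragerwerk AG", "Dupont",
              "Eastman Kodak", "EnBW", "Epson Korea", "Ericsson", "Fraport ag"),
  ROE = c(6.7, 32.7, 11.6, 17.1, -47.5, 23.8, 1.1, 26.7, 6.9),
  IndustryROE = c(16.4, 19.5, 15.3, 19.5, 19.5, 10.2, 14.9, 8.6, 10.9))

# Per-company gap: positive means the SAP customer beat its industry
gap <- mini$ROE - mini$IndustryROE
mean(gap)
```

On these nine companies the average gap is about -6.2 percentage points; note also how a single company (Eastman Kodak, at -67) dominates the average, a hint that these data are noisy.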

How can we estimate ROE for these two groups?

• How about we just calculate the (sample) average values?
sap <- mean(nucleus$ROE)
ind <- mean(nucleus$IndustryROE)
roi <- data.frame(sap=sap, industry=ind)
roi
##        sap industry
## 1 12.63704 15.69877

What do these numbers tell us about SAP versus Industry?

• Is the SAP number 20% lower than the Industry one?
sap/ind
## [1] 0.8049701
• Yes, it is!

But what is the strength of this evidence?

• How confident are we that this result didn’t come by chance?
• How big would the “gap” have to be before you thought it was noteworthy?
• 1%, 5%, 10%, 25%?

## Modeling, estimation, and the sampling distribution

Consider for now just the SAP numbers.

sap.y <- nucleus$ROE
sap.y
##  [1]  21.1  15.8   1.2  61.2  15.2  26.4  66.9   5.4  12.2   4.9  20.8
## [12]  23.3  13.2  14.5  27.3  17.0 -41.1  35.9 116.4   8.1   7.9   6.7
## [23]  32.7  11.6  17.1 -47.5  23.8   1.1  26.7   6.9  13.4  14.9  14.9
## [34]  14.6 -15.4   4.6  43.9  15.8   6.4  13.1  14.6   8.4   8.9   8.8
## [45]  23.1 -91.8  45.8   7.3  18.9  22.8  -7.7   6.8  -9.2 -62.6  18.7
## [56]  27.3   8.3  16.7  18.4  28.8  25.6  16.9   0.3  -0.7 -38.5   8.3
## [67]   8.3   8.3   8.3   6.2   6.2  -0.5  18.6   0.9  47.4  27.3   4.7
## [78]   2.8  41.4  26.0  14.6

Suppose the SAP ROEs are IID draws from a Gaussian population with parameters $$\mu=12$$ and $$\sigma^2=25^2$$.

• We never presume to know these values, but the hypothetical is illustrative.

With $$n=81$$ we have that $$\bar{Y} \sim \mathcal{N}(12, 25^2/81 \approx 7.72)$$.

• Observe that our sample mean is well within the middle 95% of this distribution.
sap
## [1] 12.63704
qnorm(c(0.025, 0.975), 12, sd=25/sqrt(81))
## [1]  6.555656 17.444344
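Equivalently, we can ask what percentile of this hypothetical sampling distribution the observed mean falls at. A quick check, under the same assumed $$\mu=12$$, $$\sigma=25$$, $$n=81$$:

```r
# Percentile of the observed sample mean under N(12, 25^2/81)
sap <- 12.63704                      # observed mean, from above
pnorm(sap, mean=12, sd=25/sqrt(81))
```

The observed mean sits near the 59th percentile: utterly unremarkable if the hypothetical truth held.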

Here is what our actual data, and $$\bar{y}$$, look like compared to the “truth”.

y <- seq(-63, 87, length=1000)
plot(y, dnorm(y, 12, 25), type="l", lwd=2, bty="n", main="Hypothetical Truth")
points(sap.y, rep(0, length(sap.y)))
abline(v=sap)
legend("topright", "ybar", lty=1)

Here is a little Monte Carlo experiment: a numerical procedure illustrating the frequency argument made implicitly when we developed the sampling distribution for these data.

Let’s generate 1000 “data sets” that are like our actual data $$y_1, \dots, y_{81}$$,

• in the sense that they are IID samples from the same, hypothetically known, distribution.
Y <- matrix(rnorm(81*1000, 12, 25), ncol=81)
Ybar <- rowMeans(Y)
c(mean(Ybar), var(Ybar))
## [1] 11.990633  7.400345
• pretty close to what we calculated, eh?
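For reference, the analytic values these Monte Carlo estimates are chasing, mean $$\mu$$ and variance $$\sigma^2/n$$ under the same assumed truth:

```r
# Analytic mean and variance of Ybar when Y_i ~ N(12, 25^2) and n = 81
c(mean = 12, var = 25^2/81)
```

Since $$25^2/81 \approx 7.72$$, the simulated variance above is within ordinary Monte Carlo error of the target.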

Now we go whole hog and compare the empirical (Monte Carlo) sampling distribution to the one we derived analytically.

hist(Ybar, xlim=range(y), freq=FALSE, border=2)
lines(y, dnorm(y, 12, 25/sqrt(81)), col=2)
lines(y, dnorm(y, 12, 25), lwd=2)
abline(v=sap)
points(sap.y, rep(0, length(sap.y)))
legend("topright", c("ybar", "Ybar density"), lty=1, col=1:2)
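The agreement can be checked numerically as well as visually: the empirical quantiles of the Monte Carlo means should match the analytic Gaussian quantiles. A self-contained sketch, re-simulating with a seed (added here only for reproducibility):

```r
set.seed(1)  # any seed gives similar results
Ybar <- rowMeans(matrix(rnorm(81*1000, 12, 25), ncol=81))
rbind(monte.carlo = quantile(Ybar, c(0.025, 0.975)),
      analytic    = qnorm(c(0.025, 0.975), 12, 25/sqrt(81)))
```

The two rows agree to within a few tenths, which is all 1000 replicates can promise.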

While we’re on the subject of Monte Carlo experiments, how can we understand

• the purely random chance that SAP ROE was 20% lower?

Consider two identical populations, $$\mathcal{N}(12, 25^2)$$ with 81 observations each:

• mimicking our “data-generating mechanism”, and our observed sample;
• but now, generate those two samples 1000 times.
• (We already did this for SAP, in Y on the previous slide.)
Yind <- matrix(rnorm(81*1000, 12, 25), ncol=81)
Yindbar <- rowMeans(Yind)
mean(Ybar/Yindbar < 0.8)
## [1] 0.264

Our little simulation showed that, in samples of size $$n=81$$,

• even two identical populations would show a larger than 20% difference more than 25% of the time
• just due to random chance!
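That Monte Carlo frequency can be cross-checked analytically. When the denominator is positive (overwhelmingly likely here, since $$\bar{Y}_{\mathrm{ind}}$$ has mean 12 and standard deviation about 2.78), the event $$\bar{Y}/\bar{Y}_{\mathrm{ind}} < 0.8$$ is the event $$\bar{Y} - 0.8\,\bar{Y}_{\mathrm{ind}} < 0$$, and that linear combination of independent Gaussians is itself Gaussian. A sketch:

```r
# Ybar - 0.8*Yindbar ~ N(12 - 0.8*12, (1 + 0.8^2) * 25^2/81)
m <- 12 - 0.8*12                   # mean of the difference
v <- (1 + 0.8^2) * 25^2/81         # variance of the difference
pnorm(0, mean=m, sd=sqrt(v))
```

This gives about 0.25, in line with the simulated 0.264 (the Monte Carlo standard error with 1000 replicates is roughly 0.014).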

That’s no basis for a marketing campaign for Oracle (or against SAP);

• it suggests the study is seriously flawed.
• The study would have to be $$\sim 6$$x larger to rule out “random chance” with greater than 95% probability.
Ybar6 <- rowMeans(matrix(rnorm(81*6*1000, 12, 25), ncol=81*6))
Yindbar6 <- rowMeans(matrix(rnorm(81*6*1000, 12, 25), ncol=81*6))
mean(Ybar6/Yindbar6 < 0.8)
## [1] 0.037
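The $$\sim 6$$x figure can also be found without simulation by scanning multiples of $$n=81$$ with the same Gaussian cross-check as above (a sketch, keeping the hypothetical $$\mu=12$$, $$\sigma=25$$ throughout):

```r
# P(Ybar/Yindbar < 0.8) for identical populations, as n = 81*k grows
k <- 1:8
p <- pnorm(0, mean = 12 - 0.8*12, sd = sqrt((1 + 0.8^2) * 25^2 / (81*k)))
round(p, 3)
min(k[p < 0.05])    # smallest multiple that drops below 5%
```

The first multiplier to dip below 5% is 6, matching the simulation: at $$6 \times 81 = 486$$ observations per group the “random chance” explanation becomes implausible.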

But we’re getting a little ahead of ourselves…