The Nucleus study

In 2006 a firm called Nucleus Research conducted a study examining a profitability claim made by the business software company SAP.

Nucleus produced a report for Oracle, a competitor, finding that, in fact, the SAP customers it studied had a lower average return on equity (ROE) than their industry peers.

Looking at the Nucleus web page and the Wikipedia page for Nucleus, it is clear that this study is Nucleus’ claim to fame.

Oracle used that report as the basis for an advertising campaign.

What does Oracle have to do with it?

Getting familiar with the data

The data is included in a spreadsheet at the end of the Nucleus report.

nucleus <- read.csv("../data/nucleus.csv")
nucleus[22:30,]
##          Company        SAPSolution                 Industry   ROE
## 22        Danone                CRM          Consumer+retail   6.7
## 23   Dow Corning                PLM                Chemicals  32.7
## 24 Dragerwerk AG                CRM Engineering construction  11.6
## 25        Dupont                SCM  Chemicals+life sciences  17.1
## 26 Eastman Kodak                SCM  Chemicals+life sciences -47.5
## 27          EnBW                CRM                Utilities  23.8
## 28   Epson Korea            SAM+CRM       Consumer+high tech   1.1
## 29      Ericsson                SRM                  Telecom  26.7
## 30    Fraport ag PLM+SCM+SAM+Mobile       Aerospace+services   6.9
##    IndustryROE
## 22        16.4
## 23        19.5
## 24        15.3
## 25        19.5
## 26        19.5
## 27        10.2
## 28        14.9
## 29         8.6
## 30        10.9
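
A quick sanity check on the sample size, which we will lean on later (assuming, as the printout suggests, one row per company):

nrow(nucleus)
## [1] 81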

The data has two ROE (return on equity) columns: one for the SAP client in question (e.g., Dupont), and one for the industry at large that the company is “in” (e.g., Chemicals+life sciences).

How can we estimate the average ROE for each of these two groups?

sap <- mean(nucleus$ROE)
ind <- mean(nucleus$IndustryROE)
roi <- data.frame(sap=sap, industry=ind)
roi
##        sap industry
## 1 12.63704 15.69877

What do these numbers tell us about SAP customers versus the industry at large?

sap/ind
## [1] 0.8049701
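
Put differently, the average ROE of these SAP customers is roughly 20% below that of their industries:

1 - sap/ind
## [1] 0.1950299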

But what is the strength of this evidence?

Modeling, estimation, and the sampling distribution

Consider for now just the SAP numbers.

sap.y <- nucleus$ROE
sap.y
##  [1]  21.1  15.8   1.2  61.2  15.2  26.4  66.9   5.4  12.2   4.9  20.8
## [12]  23.3  13.2  14.5  27.3  17.0 -41.1  35.9 116.4   8.1   7.9   6.7
## [23]  32.7  11.6  17.1 -47.5  23.8   1.1  26.7   6.9  13.4  14.9  14.9
## [34]  14.6 -15.4   4.6  43.9  15.8   6.4  13.1  14.6   8.4   8.9   8.8
## [45]  23.1 -91.8  45.8   7.3  18.9  22.8  -7.7   6.8  -9.2 -62.6  18.7
## [56]  27.3   8.3  16.7  18.4  28.8  25.6  16.9   0.3  -0.7 -38.5   8.3
## [67]   8.3   8.3   8.3   6.2   6.2  -0.5  18.6   0.9  47.4  27.3   4.7
## [78]   2.8  41.4  26.0  14.6

Suppose the SAP ROEs are an IID sample from a Gaussian population with parameters \(\mu=12\) and \(\sigma^2=25^2\).

With \(n=81\) we have that \(\bar{Y} \sim \mathcal{N}(12, 25^2/81)\), i.e., a variance of about 7.72.

sap
## [1] 12.63704
qnorm(c(0.025, 0.975), 12, sd=25/sqrt(81))
## [1]  6.555656 17.444344
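
The variance used above is just \(\sigma^2/n\); as a quick check of that arithmetic:

25^2 / 81
## [1] 7.716049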

Here is what our actual data, and \(\bar{y}\), look like compared to the “truth”.

y <- seq(-63, 87, length=1000)        # grid of ROE values for the density curve
plot(y, dnorm(y, 12, 25), type="l", lwd=2, bty="n", main="Hypothetical Truth")
points(sap.y, rep(0, length(sap.y)))  # the observed ROEs, along the x-axis
abline(v=sap)                         # the sample mean, ybar
legend("topright", "ybar", lty=1)

Here is a little Monte Carlo experiment, a numerical procedure that illustrates the frequency argument being made implicitly in the development of the sampling distribution for these data.

Let’s generate 1000 “data sets” that are like our actual data \(y_1, \dots, y_{81}\):

Y <- matrix(rnorm(81*1000, 12, 25), ncol=81)  # 1000 data sets of size 81, one per row
Ybar <- rowMeans(Y)                           # 1000 sample means
c(mean(Ybar), var(Ybar))
## [1] 11.990633  7.400345
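
Those two numbers should be close to the analytic mean and variance of \(\bar{Y}\); here is a quick numeric side-by-side (the Monte Carlo row will vary from run to run, since no seed is fixed):

rbind(monte.carlo = c(mean(Ybar), var(Ybar)),
      analytic    = c(12, 25^2/81))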

Now we go whole hog and compare the empirical (Monte Carlo) sampling distribution to the one we derived analytically.

hist(Ybar, xlim=range(y), freq=FALSE, border=2)  # Monte Carlo sampling distribution
lines(y, dnorm(y, 12, 25/sqrt(81)), col=2)       # analytic sampling distribution of Ybar
lines(y, dnorm(y, 12, 25), lwd=2)                # population density
abline(v=sap)                                    # observed ybar
points(sap.y, rep(0, length(sap.y)))             # observed ROEs
legend("topright", c("ybar", "Ybar density"), lty=1, col=1:2)

While we’re on the subject of Monte Carlo experiments, how can we understand the strength of the evidence in that observed ratio of about 0.80?

Consider samples of 81 observations from each of two identical \(\mathcal{N}(12, 25^2)\) populations:

Yind <- matrix(rnorm(81*1000, 12, 25), ncol=81)  # a second, identical population
Yindbar <- rowMeans(Yind)
mean(Ybar/Yindbar < 0.8)                         # how often is the ratio below 0.8?
## [1] 0.264

Our little simulation showed that, in samples of size \(n=81\), a ratio as small as the one we observed (about 0.80) comes up by chance more than a quarter of the time, even when the two populations are identical.

That’s not the basis of a marketing campaign for Oracle (or against SAP); but what if both averages were based on six times as many observations?

Ybar6 <- rowMeans(matrix(rnorm(81*6*1000, 12, 25), ncol=81*6))     # samples six times larger
Yindbar6 <- rowMeans(matrix(rnorm(81*6*1000, 12, 25), ncol=81*6))
mean(Ybar6/Yindbar6 < 0.8)
## [1] 0.037
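
How much should we trust those Monte Carlo proportions themselves? A rough binomial standard error for each, based on the 1000 replications, is small relative to the gap between 0.264 and 0.037:

sqrt(0.264*(1 - 0.264)/1000)  # about 0.014
sqrt(0.037*(1 - 0.037)/1000)  # about 0.006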

But we’re getting a little ahead of ourselves…