Recall the mutual outlink example from lecture 8:

Let A be an adjacency matrix of a graph with n nodes: A is n×n and A[i,j] is either 1 or 0, depending on whether there is an edge from node i to j in the graph. The edges may represent links between websites or in a social network.

For any two nodes, say websites, we might be interested in mutual outlinks:

outbound links (edges) that are common to two sites.

If node i and node j both link to node k, we say there is one outbound link between i and j through k. Suppose we wanted to find the mean number of mutual outinks, averaged over all pairs of of websites in the data.

The initial implementation is:

```
A <- matrix(sample(0:1,(10)^2,replace=TRUE),nrow=10)
mutlinks <- function(A)
{
n <- ncol(A)
if(nrow(A) != n) stop("A must be square")
S <- 0
for(i in 1:(n-1))
for(j in (i+1):n)
for(k in 1:n) S <- S + A[i,k]*A[j,k]
S/choose(n,2)
}
mutlinks(A)
```

`## [1] 1.577778`

If we understand the problem more deeply, we can simplify the implementation as: (count all outbound links though each node, then sum them!)

```
out <- apply(A, 2, function(x) choose(sum(x),2))
sum(out)/choose(nrow(A),2)
```

`## [1] 1.577778`

Here we have two options for you to play with hadoop on this example:

- Set A as 1000*1000. Setup the single node hadoop. Refering to functions in
`rmr2`

and`rhdfs`

package, write the map and reduce function to calculate the mean number of mutual outinks.

- Set A as 1000*1000. Setup the single node hadoop. Refering to functions in
- Follow the corresponding sections of our tutorial to create a hadoop cluster on Amazon EMR . Use the same mapreduce function to deal with a much larger A. Report dimensions of A, the number of instances and the result. We recommend A to be at least 5000*5000.

If do both, you can have extra credit (maybe). The AWS EMR option is supposed to be much faster than your local single node hadoop.