Covariance


The mathematical formula for the covariance of two variables X and Y is
    Cov(X,Y) = E(XY) - E(X)E(Y)
where E denotes the expectation (the average).

This essay intends to explain why this is not an unreasonable definition.

The "covariance" of two variables is a measure of how much they vary together. For example, consider a room full of people. The heavy ones will tend to be taller, and the taller ones will tend to be heavier. So if we let one variable be the weights of the individuals in the room, and the other variable be their heights, a "covariance" is, appropriately, a measure of this tendency for the two variables to vary together.

To calculate the covariance using the formula above, first find the expectation of the product of height and weight. Start by multiplying each person's height by his or her weight. (While this may seem like mixing apples and oranges, the way that the entire result tracks covariance will be clear momentarily.) To get the expectation, E(XY), add up these products and divide the sum by the number of people. This yields the average (or expectation) of the height*weight product. Next, subtract the average of the heights, E(X), times the average of the weights, E(Y). The result is the covariance.
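The steps just described can be sketched in a few lines of Python. This is only an illustration: the function name covariance and the sample heights and weights are invented here, and the function divides by the number of people, exactly as the formula in this essay does.

```python
def covariance(xs, ys):
    """Return E(XY) - E(X)E(Y) for paired observations xs and ys."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and the same length")
    n = len(xs)
    # E(XY): multiply each pair, add up the products, divide by the count.
    mean_of_products = sum(x * y for x, y in zip(xs, ys)) / n
    # E(X) and E(Y): the plain averages of each variable.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return mean_of_products - mean_x * mean_y


# Hypothetical measurements, invented purely for illustration.
heights_cm = [150, 160, 170, 180]
weights_kg = [50, 60, 65, 85]
print(covariance(heights_cm, weights_kg))
```

The printed value is positive, since in this made-up room the taller people are also the heavier ones.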

To make this result seem reasonable, first suppose that there were no relationship between the people's heights and weights (say they were just entirely random to begin with). Then E(XY) - E(X)E(Y) would tend to be zero: there would be no reason to expect the average product to differ from the product of the averages. But suppose that there is indeed a relationship (imagine, say, an almost linear one). Then E(XY) exceeds E(X)E(Y), because the products of the larger people's heights and weights more than compensate for the small products contributed by the smaller people. As a silly example, suppose that four people's heights and weights were

            Height   Weight
 Person 1      1        1
 Person 2      2        2
 Person 3      3        3
 Person 4      4        4

Then E(XY) is (1*1 + 2*2 + 3*3 + 4*4)/4 = 30/4 = 7.5, whereas E(X)E(Y) is only 2.5 * 2.5 = 6.25. In this example, the covariance is 7.5 - 6.25 = 1.25.
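As a rough check of both cases, the sketch below recomputes the four-person example and then pairs the same heights with the same weights in a scrambled order. The scrambled pairing (2, 4, 1, 3) is a hypothetical arrangement chosen here because it shows no upward or downward trend; the variable and function names are likewise assumptions.

```python
def covariance(xs, ys):
    """Return E(XY) - E(X)E(Y), dividing by the number of observations."""
    n = len(xs)
    mean_of_products = sum(x * y for x, y in zip(xs, ys)) / n
    return mean_of_products - (sum(xs) / n) * (sum(ys) / n)


heights = [1, 2, 3, 4]
matched_weights = [1, 2, 3, 4]    # the four people in the example
scrambled_weights = [2, 4, 1, 3]  # hypothetical pairing with no trend

matched = covariance(heights, matched_weights)
scrambled = covariance(heights, scrambled_weights)
print(matched, scrambled)
```

With the matched weights the average product (7.5) exceeds the product of the averages (6.25). With the scrambled weights the average product is 25/4 = 6.25, the same as the product of the averages, so the covariance comes out to zero.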

Addendum: Norm Hardy (www.cap-lore.com) also points out that adding a constant to either X or Y does not change the covariance as defined. For example, adding 3 to X, and writing Sum for the total over all n people:
    E((X+3)Y) - E(X+3)E(Y)
= Sum((X+3)Y)/n  - Sum(X+3)Sum(Y)/n^2
= Sum(XY)/n + Sum(3Y)/n - (Sum(X)+3n)Sum(Y)/n^2
= E(XY) + 3Sum(Y)/n - Sum(X)Sum(Y)/n^2 - 3nSum(Y)/n^2
= E(XY) + 3E(Y) - E(X)E(Y) - 3Sum(Y)/n
= E(XY) + 3E(Y) - E(X)E(Y) - 3E(Y)
= E(XY) - E(X)E(Y)
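The algebra above can also be checked numerically. The following is a minimal sketch, assuming integer data and reusing the four-person example. It uses Python's fractions.Fraction so that the comparison is exact and not subject to floating-point rounding. The function name and the particular shifts (3 and -7) are illustrative choices, not part of the original argument.

```python
from fractions import Fraction


def covariance(xs, ys):
    """Return E(XY) - E(X)E(Y) as an exact fraction (integer inputs assumed)."""
    n = len(xs)
    mean_xy = Fraction(sum(x * y for x, y in zip(xs, ys)), n)
    mean_x = Fraction(sum(xs), n)
    mean_y = Fraction(sum(ys), n)
    return mean_xy - mean_x * mean_y


xs = [1, 2, 3, 4]
ys = [1, 2, 3, 4]

original = covariance(xs, ys)
shifted_x = covariance([x + 3 for x in xs], ys)  # add 3 to every X
shifted_y = covariance(xs, [y - 7 for y in ys])  # subtract 7 from every Y

print(original, shifted_x, shifted_y)
```

All three values agree, which is what the derivation predicts: the constant contributes the same amount to E(XY) as it does to E(X)E(Y), so it cancels.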