## Saturday, October 5, 2013

A common mistake students make when dealing with correlation is assuming that correlation equals causation. Although putting a regression line through points on a scatter plot may tempt one to say that the x-variable causes the y-variable, the line alone does not justify that conclusion. Some other factor, or several factors, may be driving both of the variables being observed. Such a hidden third variable is known as a lurking variable.

When data is merely observed, as opposed to being obtained from a designed experiment, there is no way to be certain that a lurking variable is not the cause of an apparent association between the variables.

For example, suppose the average life expectancy of men in 20 different countries is plotted against the number of doctors per person in each country. Note that we must check the conditions for correlation before interpreting it. First, both variables must be quantitative, and life expectancy and number of doctors per person are both quantitative variables. Second, the pattern of the scatter plot must be reasonably straight. Since there is no scatter plot shown, for the sake of argument, we will assume this condition is met. Finally, no outliers can be present. Once again, we will assume this condition is met.

Suppose the data shows a strong positive association between the variables, with r² = 0.81 (so r = 0.9). This seems to confirm what we'd expect: the more doctors per person a country has, the longer the life expectancy. So, we may think that the countries with lower life expectancies need more doctors. But could there be other factors that increase life expectancy? Yes, of course. One cannot say that an increase in doctors alone will cause life expectancy to increase. There are certainly lurking variables involved here.

Suppose more data was taken from these countries and this time we plot average life expectancy against the number of TVs per household in each country. The scatter plot shows a very strong association, with r² = 0.88. Fitting the linear model, we could use the number of TVs as a predictor of life expectancy. But reading this as cause and effect is absurd. If we used the r² value alone to determine causation, we would conclude that we should send TVs, rather than doctors, to the countries with lower life expectancies.

The most likely cause of higher life expectancy involves much more than doctors per person and TVs per household. Higher living standards may be the underlying reason: they may raise life expectancy, increase the number of TVs, and increase the number of doctors all at once.
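To make the idea concrete, here is a minimal simulation sketch in Python. Everything in it is invented for illustration: the variable names, the noise levels, and the assumption that a single "living standard" score drives both TVs and life expectancy, with no direct link between the two. It uses many more simulated countries than the 20 in the example so the pattern is stable. The raw correlation between TVs and life expectancy comes out strong, but once the living-standard score is accounted for (by correlating the residuals left over after regressing each variable on it), the association all but disappears.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing its least-squares line on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000                # hypothetical number of simulated countries

# Assumed lurking variable: a standardized "living standard" score.
living_standard = [rng.gauss(0, 1) for _ in range(n)]

# Both observed variables depend on living standard plus independent noise.
# Neither one appears in the formula for the other.
tvs = [ls + rng.gauss(0, 0.4) for ls in living_standard]
life_expectancy = [ls + rng.gauss(0, 0.4) for ls in living_standard]

r_raw = pearson_r(tvs, life_expectancy)
r_adjusted = pearson_r(residuals(tvs, living_standard),
                       residuals(life_expectancy, living_standard))

print(f"raw r = {r_raw:.2f}, raw r^2 = {r_raw ** 2:.2f}")
print(f"r after accounting for living standard = {r_adjusted:.2f}")
```

In real observational data the lurking variable is usually not measured, or not even known, so this kind of adjustment is often unavailable. That is exactly why a large r² by itself cannot establish causation.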

It's very easy to fit a regression and assume causation from it. But beware: lurking variables may be the cause of an apparent association, and regression on observed data can never be used by itself to show that one variable causes another.