A common mistake students make when dealing with correlation is assuming
that correlation equals causation. Although putting a regression line
through points on a scatter plot may tempt one to say that the
x-variable causes the y-variable, the fit alone does not justify that
conclusion. Some other factor or factors may be driving both of the
variables being observed. Such a hidden third variable is known as a
lurking variable.
When the data are observational,
as opposed to data obtained from a designed experiment, there is no way
to be certain that a lurking variable is not the cause of an apparent
association between the variables.
For example, suppose a scatter
plot shows the average life expectancy of men in 20 different
countries plotted against the number of doctors per person in each
country. Note that we must check the conditions for correlation before
interpreting it. First, both variables must be quantitative, and life
expectancy and number of doctors per person are both quantitative
variables. Second, the pattern in the scatter plot must be reasonably
straight. Since no scatter plot is shown, for the sake of argument we
will assume this condition is met. Finally, no outliers can be present.
Once again, we will assume this condition is met.
Suppose the data show a strong positive association, r² = 0.81,
between the variables. This seems to confirm what we'd expect: the more
doctors per person a country has, the longer the life expectancy. So we
may think that the countries with lower life expectancies need more
doctors. But could there be other factors that increase life
expectancy? Yes, of course. One cannot say that an increase in doctors
alone will cause life expectancy to increase. There are certainly
lurking variables involved here.
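As a small worked step on reading that number, assuming the value quoted above is the squared correlation coefficient: the sign of r cannot be recovered from the squared value itself, so it is taken here from the stated positive direction of the association.

```latex
r^2 = 0.81 \quad\Longrightarrow\quad r = +\sqrt{0.81} = 0.9
```

Under that reading, about 81% of the variation in life expectancy is accounted for by the linear model on doctors per person. That still says nothing about why the two variables move together.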
Suppose more data were
taken from these countries, and this time we plot average life expectancy
against the number of TVs per household in the country. The scatter
plot shows a very strong association, r² = 0.88. Having fit
the linear model, we could use the number of TVs as a predictor of life
expectancy. But reading that as causation is absurd. If we used just the
r² value to determine causation, we would conclude that we need to send
TVs, rather than doctors, to the countries with lower life expectancies.
What
most likely explains higher life expectancy is much more than
just doctors per person and TVs per household. Higher living standards
may be the real reason that life expectancy is higher, and they would
also increase the number of TVs and the number of doctors.
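The idea that one lurking variable can produce a strong association between two otherwise unrelated variables can be sketched with a small simulation. Everything here is made up for illustration: the `living` index, the coefficients, and the noise levels are assumptions, not data from any real country. In this toy model, TVs and life expectancy each depend only on living standards, never on each other.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_countries=20, seed=0):
    """Generate synthetic countries driven by one hypothetical lurking variable."""
    rng = random.Random(seed)
    # Hypothetical lurking variable: a living-standards index from 0 to 10.
    living = [rng.uniform(0, 10) for _ in range(n_countries)]
    # TVs per household depends on living standards plus noise (assumed numbers).
    tvs = [0.3 * s + rng.gauss(0, 0.3) for s in living]
    # Life expectancy depends on living standards plus noise, and NOT on TVs.
    life = [50 + 3 * s + rng.gauss(0, 2) for s in living]
    return living, tvs, life


living, tvs, life = simulate()
r = pearson_r(tvs, life)
print(f"TVs vs. life expectancy: r = {r:.2f}, r^2 = {r * r:.2f}")
```

Even though TVs never enter the formula for life expectancy, the two come out strongly correlated, because both follow the same underlying index. Changing `seed` or the noise levels changes the exact value of r, but not the point.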
It's very easy
to run a regression and assume causation from it. But beware that
lurking variables may be the cause of an apparent association, and that
regression can never be used to show that one variable causes another.