Friday, May 3, 2013

Scatter Diagrams and Linear Correlation


A scatter diagram is a graph in which the data pairs (x, y) are plotted as points on a rectangular coordinate system, with x on the horizontal axis and y on the vertical axis. Scatter diagrams are used to study correlation and regression between two variables.

After a scatter diagram is made for a set of data, a line can be drawn through the points. This line is known as the "line of best fit". But how do we determine which is the "best" line through a set of points? Informally, it is the line that comes closest to the points overall. More precisely, the "least squares line" is the line that minimizes the sum of the squared vertical distances between the points and the line. It can be computed by hand from the data values, or more easily by a computer. This line always passes through the point (x mean, y mean), whose coordinates are the mean of the x values and the mean of the y values.
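The idea that the least squares line passes through (x mean, y mean) can be checked with a short Python sketch. The data values and the function name least_squares_line are made up for illustration, and the slope and intercept use the standard textbook least squares formulas rather than any particular software package.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line for paired data."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = sum of (x - x mean)(y - y mean) divided by sum of (x - x mean)^2.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    # Choosing the intercept this way forces the line through (x mean, y mean).
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical data pairs, chosen only to keep the arithmetic small.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(xs, ys)
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# The height of the line at x mean should match y mean (up to rounding).
y_on_line = slope * x_mean + intercept
print(slope, intercept)
print(y_on_line, y_mean)
```

For this made-up data the slope comes out close to 0.6 and the intercept close to 2.2, and the line's height at x mean agrees with y mean.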

Sometimes the data are scattered in such a way that there is no "best" line. If the points are a poor fit to any line, it makes no sense to try to find a line of best fit. When the points do not fit any line well, there is little or no linear correlation between the x and y values. Picture many randomly scattered points, almost like looking up at a sky full of stars. There will be very little, if any, linear correlation between the x and y values. If you can see where a line would go, or the points almost form a line, then there is linear correlation, and the more closely the points follow a straight line, the stronger that correlation is.

The measurement of the strength and direction of the linear association between two variables is known as the "sample correlation coefficient r". It is also known as Pearson's correlation coefficient, named after the statistician Karl Pearson.

The correlation coefficient is a number between -1 and 1, inclusive. A correlation coefficient of -1 means there is perfect negative linear correlation between x and y. On the scatter diagram the points would form a perfect line with negative slope. A correlation coefficient of 1 means there is perfect positive linear correlation between x and y. On a scatter diagram the points would form a perfect line with positive slope. The closer r is to 1, the stronger the positive correlation, and the closer r is to -1, the stronger the negative correlation. A value of r near 0 means there is little or no linear correlation. Basically, this means that the closer r is to 1 or -1, the better a line describes the relationship between the variables.
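To see these values of r in practice, here is a minimal Python sketch. The function name pearson_r and all of the data sets are invented for illustration, and the calculation uses the standard definition of Pearson's r in terms of deviations from the means.

```python
import math


def pearson_r(xs, ys):
    """Return the sample correlation coefficient r for paired data."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sums of products and squares of the deviations from the means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Points exactly on the line y = 2x + 1 (positive slope): r should be 1.
r_up = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])

# Points exactly on a line with negative slope: r should be -1.
r_down = pearson_r([1, 2, 3, 4], [9, 7, 5, 3])

# Points that only roughly follow a rising line: r falls between 0 and 1.
r_rough = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

print(r_up, r_down, r_rough)
```

The third data set gives an r of roughly 0.77, which is positive but less than 1 because the points do not lie exactly on a line.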

In linear correlation, the explanatory variable is x and the response variable is y. These are also known as the independent and dependent variables, respectively. The value of r can be calculated by hand from the data pairs using a tedious formula, or easily using a computer.
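The post does not say which formula it has in mind for the hand calculation. One common computational form, written for n data pairs (x, y) with every sum taken over all of the pairs, is:

```latex
r = \frac{n \sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{n \sum x^{2} - \left(\sum x\right)^{2}}\;
          \sqrt{n \sum y^{2} - \left(\sum y\right)^{2}}}
```

By hand, this means building a table with columns for x, y, xy, x squared, and y squared, adding up each column, and substituting the totals into the formula.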

This guide should help students learning the basics of scatter diagrams and linear correlation.
