Saturday, October 26, 2013

When trying to find the roots of a polynomial equation, we can start by using the rational root theorem to list all the possible rational roots.  You can also use Descartes' rule of signs to narrow down how many positive or negative real roots there can be.

If you have a cubic equation, you can use synthetic division to test those candidates; once you find a root, the division leaves you with the coefficients of a quadratic equation.

From here you can use the quadratic formula, completing the square, or other methods of factoring to get the remaining roots.
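Here is a small Python sketch of the whole procedure (the helper names are my own, not standard library functions): list the rational root candidates, find one by synthetic division, and read off the quadratic's coefficients.

```python
from fractions import Fraction

def rational_root_candidates(coeffs):
    """Candidate rational roots p/q per the rational root theorem:
    p divides the constant term, q divides the leading coefficient.
    Assumes integer coefficients and a nonzero constant term."""
    def divisors(n):
        n = abs(n)
        return [d for d in range(1, n + 1) if n % d == 0]
    cands = {Fraction(p, q)
             for p in divisors(coeffs[-1]) for q in divisors(coeffs[0])}
    return sorted(cands | {-c for c in cands})

def synthetic_division(coeffs, r):
    """Divide the polynomial by (x - r); return (quotient coefficients, remainder)."""
    out = [coeffs[0]]
    for c in coeffs[1:]:
        out.append(c + r * out[-1])
    return out[:-1], out[-1]

# x^3 - 6x^2 + 11x - 6 = (x - 1)(x - 2)(x - 3)
coeffs = [1, -6, 11, -6]
root = next(r for r in rational_root_candidates(coeffs)
            if synthetic_division(coeffs, r)[1] == 0)
quotient, rem = synthetic_division(coeffs, root)
print(root, quotient, rem)  # a root, then the remaining quadratic's coefficients
```

A remainder of zero is exactly the signal that the candidate is a root; the quotient row then holds the quadratic you finish by factoring or the quadratic formula.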

Tuesday, October 22, 2013

Remember the important angles on the unit circle, which is a circle with radius 1.  The coordinates at these angles are found by using the Pythagorean theorem to get the sides of right triangles inscribed in the unit circle, together with the trig functions sine, cosine, and tangent.

0, 30, 45, 60, 90, 120, 135, 150, 180, 210, 225, 240, 270, 300, 315, 330, 360

These angle measures can be converted to radians by multiplying each by pi/180.
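That conversion is a one-liner to sketch in Python, multiplying each degree measure by pi/180:

```python
import math

degrees = [0, 30, 45, 60, 90, 120, 135, 150, 180,
           210, 225, 240, 270, 300, 315, 330, 360]

# Multiply each degree measure by pi/180 to get radians.
radians = [d * math.pi / 180 for d in degrees]

for d, r in zip(degrees, radians):
    # Show each as a multiple of pi, e.g. 90 deg = 0.5000 pi rad
    print(f"{d:>3} deg = {r / math.pi:.4f} pi rad")
```

Printing the results as multiples of pi matches how the unit circle is usually labeled (pi/6, pi/4, pi/3, and so on).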

Thursday, October 17, 2013

When graphing any absolute value function, it's important to know the graph of the parent function f(x) = |x|. This looks like a v with the vertex at the origin, slope of -1 from -infinity to 0, and slope of 1 from 0 to infinity.

From here we can graph any function of the form a|x +/- h| +/- k.

If a is positive, the shape is a v; if it's negative, it's an upside-down v. The size of a determines how steep the slopes are.

If  it's x + h, the graph shifts h units to the left and if it's x - h, it shifts h units to the right.

If it's + k, it shifts k units up and if it's -k, it shifts k units down.  The point we are shifting is the vertex.

For example,

f(x) = 4|x + 3| - 2:  the vertex moves 3 to the left and 2 down to (-3, -2), and the slope is 4 from -3 to infinity and -4 from -infinity to -3.

Sunday, October 13, 2013

Thursday, October 10, 2013

You can tell from a residual plot whether or not a linear model fits.  If the residuals are uniformly scattered about zero, then yes. If not, then the linear model does not fit.

For example, if you have 6 residuals and all of them are negative, that is not uniform. If all are positive, it is not uniform. If half are positive and half are negative, and they are about the same distance from the center line on the graph, then the linear model is appropriate.
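That sign check can be sketched crudely in Python (the function name and cutoff are my own, and a real diagnosis should look at the plot itself, not just the signs):

```python
def residual_signs_balanced(residuals):
    """Rough screen: a linear model is suspect if the residuals are
    nearly all one sign. This only checks sign balance, not patterns."""
    pos = sum(1 for r in residuals if r > 0)
    neg = sum(1 for r in residuals if r < 0)
    return min(pos, neg) > 0 and abs(pos - neg) <= len(residuals) // 3

print(residual_signs_balanced([0.5, -0.4, 0.6, -0.5, 0.4, -0.6]))  # True
print(residual_signs_balanced([0.5, 0.4, 0.6, 0.5, 0.4, 0.6]))     # False
```

Balanced signs are necessary but not sufficient: residuals that alternate in a curve or fan out still rule out the linear model even when half are positive and half negative.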

Sunday, October 6, 2013

Some of the most difficult concepts to understand in statistics are the various types of sampling. Specifically, it can be confusing to distinguish between stratified sampling and cluster sampling. I have used the following explanations of these sampling techniques during my 13 years' experience as a math tutor.

A simple random sample is the most common type of sampling technique used in statistics. However, designs that are used to sample populations across large areas are more complex than the simple random sample. In some instances, populations are divided into homogeneous groups, called strata. Then a simple random sample is selected from each stratum. This kind of sampling is known as stratified sampling.

The question that often arises is, "why would we want to make things more complicated by using a stratified sample?" Suppose we want to learn about fundraising for a high school baseball team. The school is 55% boys and 45% girls, and we expect that boys and girls have different ideas on the fundraising. If a simple random sample is used to choose 200 students, we could possibly get 130 boys and 70 girls, or 45 boys and 155 girls. Because of this, the amount of variability could be large. So to reduce the variability, you can sample 55% boys and 45% girls (110 boys and 90 girls). This kind of "forced representative balance" will ensure that the percentage of boys and girls in the sample is identical to that in the population. This is a better method than the simple random sample because it should give a more accurate representation of the opinion of all the students in the school.

Now suppose we want to find out what the high school freshmen think about the food served in the cafeteria. We could use a simple random sample or a stratified sample, but it's too time-consuming to track down every student selected in the sample. But the freshmen homerooms are all in one of ten rooms on the ground floor of the school. So, we could sample two or three homerooms and survey every student in those homerooms. The population was divided into representative clusters, and a few clusters were sampled in their entirety. This type of sampling is called cluster sampling.

What is the difference between stratified and cluster sampling? Clusters are heterogeneous and resemble the population in its entirety. Stratified sampling is done to make sure the sample represents different groups in the population, and samples are taken randomly within each stratum. Clusters are chosen to make sampling more affordable or practical for the given situation.
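Here is a small Python sketch of both designs, using a made-up school like the one in the fundraising example (all the numbers and names are my own illustration):

```python
import random

random.seed(1)
# Hypothetical school of 1000 students, 55% boys and 45% girls.
students = [("boy", i) for i in range(550)] + [("girl", i) for i in range(450)]

# Stratified sample: draw randomly *within* each group, so the sample's
# 55/45 split exactly matches the population's.
boys = [s for s in students if s[0] == "boy"]
girls = [s for s in students if s[0] == "girl"]
stratified = random.sample(boys, 110) + random.sample(girls, 90)

# Cluster sample: divide the school into homerooms (each one a small mixed
# cross-section of the population) and sample whole homerooms.
random.shuffle(students)
homerooms = [students[i:i + 100] for i in range(0, 1000, 100)]
cluster = [s for room in random.sample(homerooms, 2) for s in room]

print(len(stratified), len(cluster))  # both designs sample 200 students
```

The stratified sample forces the boy/girl balance; the cluster sample only promises it on average, but it is far cheaper when each homeroom can be surveyed in one visit.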

An example that more clearly displays the differences between the two types of sampling is examining a pizza. Suppose you have a professional taster whose job is to check each pizza for quality. Samples need to be eaten from selected pizzas, with the crust, sauce, cheese, and toppings tested.

You could taste a slice of pizza as a customer would eat a slice. When doing so, you'll learn about the pizza as a whole. The slice would be a cluster sample since it contains all the ingredients of the pizza.

If you select some tastes of the crust at random, of the cheese at random, of the sauce at random, and of the toppings at random, you will still get a pretty good judgment of the overall quality of the pizza. This kind of sampling would be stratified.

Cluster samples slice across the layers to obtain clusters, while stratified sampling represents the population by selecting some from each layer, which reduces the amount of variability.

This guide should help students better understand the differences between stratified sampling and cluster sampling.

Saturday, October 5, 2013

A common mistake students make when dealing with correlation is assuming that correlation equals causation. Although putting a regression line through points on a scatter plot may tempt one to say that the x-variable causes the y-variable, this is not necessarily the case. One or more other factors may be driving both of the variables being observed. Such a third variable is known as a lurking variable.

When data is observed, as opposed to data obtained from a designed experiment, there is no way to be certain that a lurking variable is not the cause of an apparent association between the variables.

For example, suppose a scatter plot shows the average life expectancy of men in 20 different countries plotted against the number of doctors per person in each country. Note that we must check the conditions for correlation before interpreting it. First, both variables must be quantitative, and life expectancy and number of doctors per person are both quantitative variables. Second, the pattern of the scatter plot must be quite straight. Since there is no scatter plot shown, for the sake of argument we will assume this condition is met. Finally, no outliers can be present. Once again, we will assume this condition is met.

Suppose the data shows a strong positive association between the variables, with r2 = 0.81. This seems to confirm what we'd expect: the more doctors per person a country has, the longer the life expectancy. So we may think that the countries with lower life expectancies need more doctors. But could there be other factors that increase life expectancy? Yes, of course. One cannot say that an increase in doctors alone will cause life expectancy to increase. There are certainly lurking variables involved here.

Suppose more data was taken from these countries, and this time we plot average life expectancy against the number of TVs per household in each country. The scatter plot shows a very strong association, with r2 = 0.88. Fitting the linear model, we could use the number of TVs as a predictor of life expectancy. But this is an absurd way of thinking. If we just used the r2 value to determine causation, we would conclude that we should send TVs, rather than doctors, to the countries with lower life expectancies.

What most likely drives higher life expectancy is much more than just doctors per person and TVs per household. Higher living standards may be the real reason life expectancy is higher, and they also increase both the number of TVs and the number of doctors.
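A small simulation makes the lurking-variable effect concrete. All the numbers below are made up for illustration: one hidden "living standards" variable drives doctors, TVs, and life expectancy, and the two observed variables end up strongly correlated with life expectancy even though neither causes it.

```python
import random

def corr(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return sxy / (sx * sy)

random.seed(0)
# The lurking variable: living standards, never plotted.
living_standard = [random.uniform(0, 10) for _ in range(200)]

# Doctors, TVs, and life expectancy all track living standards (plus noise).
doctors = [0.002 * s + random.gauss(0, 0.003) for s in living_standard]
tvs = [0.3 * s + random.gauss(0, 0.4) for s in living_standard]
life_exp = [60 + 2 * s + random.gauss(0, 2) for s in living_standard]

# Both observed correlations come out strongly positive, yet neither
# doctors nor TVs causes longer life in this simulation.
print(round(corr(doctors, life_exp), 2))
print(round(corr(tvs, life_exp), 2))
```

Squaring either printed r gives an r2 in the same ballpark as the 0.81 and 0.88 in the example above, which is exactly why r2 alone can never establish causation.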

It's very easy to run a regression and assume causation from it. But beware that lurking variables may be the cause of an apparent association, and that regression can never be used to show that one variable causes another.

Tuesday, October 1, 2013

Assumptions and Conditions of Linear Regression


Linear regression is one of the many courses I studied while earning my BS in Statistics from Lehigh University. I have been tutoring statistics for the past 13 years, and there is often confusion among students as to when the linear regression model can be used. Sometimes the model, although tempting to use, doesn't apply. Certain conditions and assumptions must be checked. In the following paragraphs I explain each condition and assumption as I would to any of my students over the years.

The linear regression model has two easily estimated parameters, a meaningful measure of how well the model fits the data, and the ability to predict new values. The first condition is the quantitative variables condition: when a measured variable with units answers questions about the amount or quantity of what is being measured, it is a quantitative variable. Examples of quantitative variables are cost, scores, temperature, height, and weight.

A linear regression model makes several assumptions. First, the relationship between the variables must be linear. This assumption cannot be checked directly, per se, but the corresponding straight enough condition can be checked by viewing the scatter plot: the data should be straight enough for a line to make sense. For example, if the data shows more of a curved relationship between the variables and you try to use a linear model, stop. You cannot use it; the model won't mean a thing.

To summarize the scatter of the data in the plot by using a single standard deviation, all of the residuals should have the same spread, or variance. Therefore, we need the equal variance assumption. Check for changing spread in the scatter plot. If you notice that the spread thickens at any part of the plot, then the "does the plot thicken?" condition does not hold and the linear model does not apply.

Finally, we have to check for outliers, which are points that lie drastically far above or below the rest of the data points. These points can dramatically change a regression model, such as its slope, which can mislead us about the relationship between the variables in the model. Therefore, be certain that the outlier condition is also met.
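To show why the straight enough condition matters, here is a Python sketch (the helper functions are my own) that fits a least-squares line to two data sets. The residuals from genuinely linear data look like patternless noise, while the residuals from curved data show a systematic U-shape even though a line can always be fit:

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = b*x + a."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return b, my - b * mx

def residuals(xs, ys):
    """Observed minus predicted values from the least-squares line."""
    b, a = fit_line(xs, ys)
    return [y - (b * x + a) for x, y in zip(xs, ys)]

xs = [1, 2, 3, 4, 5, 6]
linear_ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]   # roughly y = 2x
curved_ys = [1, 4, 9, 16, 25, 36]              # y = x^2 bends away from any line

print([round(r, 2) for r in residuals(xs, linear_ys)])
# Curved data: residuals are positive at both ends and negative in the
# middle, which is a curve showing through, not random scatter.
print([round(r, 2) for r in residuals(xs, curved_ys)])
```

A fitted line exists in both cases; only the residual pattern reveals that the second model is meaningless.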

Although the linear regression model is widely used and a very powerful statistical tool for predicting values, several assumptions and conditions must be checked to make sure the model is appropriate. If the model is inappropriate, do not use it. The results will be misleading.