## Sunday, October 6, 2013

Some of the most difficult concepts to understand in statistics are the various types of sampling. Specifically, it can be confusing to distinguish between stratified sampling and cluster sampling. I have used the following explanations of these sampling techniques during my 13 years of experience as a math tutor.

A simple random sample is the most common type of sampling technique used in statistics. However, designs that are used to sample populations across large areas are more complex than the simple random sample. In some instances, populations are divided into homogeneous groups, called strata. Then a simple random sample is selected from each stratum. This kind of sampling is known as stratified sampling.

The question that often arises is, "Why would we want to make things more complicated by using a stratified sample?" Suppose we want to learn about fundraising for a high school baseball team. The school is 55% boys and 45% girls, and we expect that boys and girls have different ideas on the fundraising. If a simple random sample is used to choose 200 students, we could possibly get 130 boys and 70 girls, or 45 boys and 155 girls. Because the split can vary this much from one sample to the next, the results can vary a lot too. To reduce that variability, we can sample 55% boys and 45% girls, which for 200 students means 110 boys and 90 girls. This kind of "forced representative balance" ensures that the percentage of boys and girls in the sample is identical to that in the population. This is a better method than the simple random sample because it should give a more accurate representation of the opinion of all the students in the school.
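To see the "forced representative balance" in action, here is a minimal Python sketch. The school size of 1,000 students, the `stratified_sample` helper, and the seed values are my own assumptions for illustration; only the 55%/45% split and the sample size of 200 come from the example above.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(population, stratum_of, n, seed=None):
    """Draw a proportionally allocated stratified sample of about n units."""
    rng = random.Random(seed)
    # Group the population into strata.
    strata = defaultdict(list)
    for unit in population:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        # Proportional allocation: each stratum gets its share of n.
        # Rounding can make the total differ slightly from n in general.
        k = round(n * len(members) / len(population))
        # Simple random sample within the stratum.
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical school of 1,000 students: 550 boys and 450 girls.
school = [("boy", i) for i in range(550)] + [("girl", i) for i in range(450)]

srs = random.Random(1).sample(school, 200)
strat = stratified_sample(school, lambda student: student[0], 200, seed=1)

print(Counter(sex for sex, _ in srs))    # the boy/girl split depends on the draw
print(Counter(sex for sex, _ in strat))  # always 110 boys and 90 girls
```

Changing the seed changes the boy/girl split of the simple random sample, while the stratified sample stays at 110 boys and 90 girls every time.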

Now suppose we want to find out what the high school freshmen think about the food served in the cafeteria. We could use a simple random sample or stratified sampling, but it would be too time-consuming to track down every student who was selected. However, the freshmen homerooms are all in one of ten rooms on the ground floor of the school. So, we could randomly select two or three homerooms and survey every student in those homerooms. The population was divided into representative clusters and a few clusters were sampled in their entirety. This type of sampling is called cluster sampling.
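The homeroom example can be sketched in a few lines of Python. The room names, the 25 students per room, and the `cluster_sample` helper are hypothetical; the text above only tells us there are ten homerooms and that we pick two or three of them.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick n_clusters at random and return every unit inside them."""
    rng = random.Random(seed)
    # Randomness applies to the choice of clusters, not to individuals.
    chosen = rng.sample(sorted(clusters), n_clusters)
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units

# Hypothetical: ten freshman homerooms with 25 students each.
homerooms = {
    f"Room {room}": [f"student_{room}_{i}" for i in range(25)]
    for room in range(101, 111)
}

chosen_rooms, surveyed = cluster_sample(homerooms, 3, seed=1)
print(chosen_rooms)   # three homerooms picked at random
print(len(surveyed))  # every student in those rooms: 3 * 25 = 75
```

Notice that nobody has to go looking for individual students: once the rooms are chosen, everyone in them is surveyed.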

What is the difference between stratified and cluster sampling? Strata are homogeneous: the members of each stratum are alike, but the strata differ from one another. Clusters are heterogeneous: each one resembles the population in its entirety. Stratified sampling is done to make sure the sample represents the different groups in the population, and samples are taken randomly within each stratum. Clusters are chosen to make sampling more affordable or practical for the given situation.

An example which will more clearly display the differences in the two types of sampling is examining a pizza. Suppose you have a professional taster whose job is to check each pizza for quality. Samples need to be eaten from selected pizzas, with the crust, sauce, cheese and toppings tested.

You could taste a slice of pizza as a customer would eat a slice. When doing so, you'll learn about the pizza as a whole. The slice would be a cluster sample since it contains all the ingredients of the pizza.

If you select some tastes of the crust at random, of the cheese at random, of the sauce at random, and of the toppings at random, you will still get a pretty good judgment of the overall quality of the pizza. This kind of sampling would be stratified.

Cluster samples slice across the layers to obtain clusters, while stratified sampling represents the population by selecting some from each layer, which reduces the amount of variability.
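The claim about reduced variability can be checked with a small simulation, going back to the fundraising example. This is only a sketch under made-up assumptions: a school of 550 boys and 450 girls, with 90% of boys and 10% of girls supporting the fundraiser. Those support rates are invented purely to make the effect easy to see.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical opinions: 1 = supports the fundraiser, 0 = does not.
# Assumed rates (illustration only): 90% of boys, 10% of girls.
boys = [1] * 495 + [0] * 55     # 550 boys
girls = [1] * 45 + [0] * 405    # 450 girls
school = boys + girls

def srs_estimate(n=200):
    """Estimated share of supporters from a simple random sample."""
    return statistics.mean(rng.sample(school, n))

def stratified_estimate(n_boys=110, n_girls=90):
    """Same estimate, but with the 55%/45% split forced."""
    picked = rng.sample(boys, n_boys) + rng.sample(girls, n_girls)
    return statistics.mean(picked)

# Repeat each design many times and see how much the estimates move around.
srs_runs = [srs_estimate() for _ in range(2000)]
strat_runs = [stratified_estimate() for _ in range(2000)]

print("SRS spread:       ", statistics.stdev(srs_runs))
print("Stratified spread:", statistics.stdev(strat_runs))
```

Under these assumptions the stratified estimates should cluster more tightly around the true share than the simple random sample estimates do. The more the strata differ from each other, the bigger the gain.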

This guide should help students better understand the differences between stratified sampling and cluster sampling.