Prashnam: The Story and the Science (Part 3)

Sampling

A key conundrum that needs to be addressed is the following: how can asking just a thousand people give an accurate indication of the thinking of a general population numbering tens or hundreds of millions? This is where science comes in. Let’s split the problem into two parts: who to sample and how many to sample.

Here is a short summary from the National Science Foundation on how to sample scientifically:

When conducting a survey, how a researcher selects participants is just as important as how many participate. Scientific surveys can include every member of the group to be studied, but this approach is usually impractical and/or expensive. Instead, researchers often draw conclusions about a target group using information gathered from a small representative sample of that group. Representative samples must be selected carefully and without bias.

The term “random” has a different meaning in statistics than in ordinary language. In everyday terms, a random event is one that is unpredictable, lacks purpose and/or has no discernible pattern. In statistical terms, a random event is one that occurs with a certain, measurable chance or probability of happening. For example, under the simplest circumstances, where each member of a population has one chance of being sampled, the probability of getting selected for a survey can be calculated just by knowing a population size and desired sample size. One would have a 10 percent chance of being selected for a 100-person sample out of a total population of 1000. But, researchers use several methods for randomly selecting samples. These include stratified, cluster and systematic sampling. Stratified and cluster sampling require prior knowledge about the survey population but can produce more representative samples than simpler “blind” sampling methods. Researchers often use stratified sampling to capture the diversity of large populations with distinctive, homogeneous subgroups—such as the U.S. population.
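To make the 10 percent figure concrete, here is a minimal Python sketch of that simplest case, where every member has an equal chance of being picked and the sample is drawn without repeats. The function name `inclusion_probability` is made up for this illustration; it is not part of Prashnam or of the NSF summary.

```python
from fractions import Fraction


def inclusion_probability(population_size: int, sample_size: int) -> Fraction:
    """Chance that any one member ends up in a simple random sample.

    Assumes every member is equally likely to be picked and that
    nobody is picked twice (sampling without replacement).
    """
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 <= sample_size <= population_size:
        raise ValueError("sample_size must be between 0 and population_size")
    return Fraction(sample_size, population_size)


# The example from the summary: a 100-person sample from 1000 people.
p = inclusion_probability(1000, 100)
print(f"{float(p):.0%}")  # prints 10%
```

The same calculation for a 1,000-person sample out of 100 million people gives a chance of 1 in 100,000 for any one individual.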

In Prashnam, we use a process called stratified random sampling. More from Wikipedia:

In statistics, stratified sampling is a method of sampling from a population which can be partitioned into subpopulations.

Assume that we need to estimate the average number of votes for each candidate in an election. Assume that a country has 3 towns: Town A has 1 million factory workers, Town B has 2 million office workers and Town C has 3 million retirees. We can choose to get a random sample of size 60 over the entire population but there is some chance that the resulting random sample is poorly balanced across these towns and hence is biased, causing a significant error in estimation. Instead if we choose to take a random sample of 10, 20 and 30 from Town A, B and C respectively, then we can produce a smaller error in estimation for the same total sample size. This method is generally used when a population is not a homogeneous group.

The next question: how many to sample? The answer will surprise you.

Tomorrow: Part 4

Published by

Rajesh Jain

An Entrepreneur based in Mumbai, India.