In an ideal world, we would examine every single member of the population we are studying. This complete examination is called a census. In reality, a census is almost always impractical — it costs too much money, takes too much time, and in some cases is physically impossible. This is why we sample: we select a subset of the population and use it to draw conclusions about the whole.
Understanding the distinction between populations and samples is critical because it affects the formulas we use, the conclusions we can draw, and the uncertainty inherent in our results.
The quality of our estimate depends entirely on how well our sample represents the population. A biased or too-small sample leads to unreliable conclusions. The science of sampling is about making this representation as accurate and efficient as possible.
LakeFront Retail Co. processes approximately 50,000 transactions per month across all stores. Management wants to estimate the average transaction value, but reviewing every single transaction would be extremely time-consuming. Instead, they decide to draw a random sample of transactions and use the sample mean to estimate the population mean.
Not all samples are created equal. The method used to select the sample determines whether the results can be generalized to the population. The gold standard is probability sampling, where every member of the population has a known, non-zero chance of being selected.
Simple Random Sampling: Every member of the population has an equal probability of being selected. This is the most basic probability sampling method and the foundation for classical statistical inference. In practice, you assign each member a number and use a random number generator to pick your sample.
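A minimal sketch of that number-and-draw procedure in Python (the ID range and sample size are illustrative, not LakeFront's actual database):

```python
import random

# Hypothetical population: transaction IDs numbered 1 through 50,000.
population_ids = list(range(1, 50_001))

# random.sample draws without replacement, giving every ID an equal
# chance of selection -- a simple random sample of 200.
srs = random.sample(population_ids, k=200)
```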
Systematic Sampling: Select every kth member from the population after a random starting point. For example, if you want a sample of 100 from a population of 5,000, you would select every 50th transaction starting from a randomly chosen position. Systematic sampling is faster than simple random sampling but can produce biased results if the population has a hidden pattern that aligns with the sampling interval.
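A sketch of the interval logic, using the same hypothetical ID list:

```python
import random

population_ids = list(range(1, 50_001))  # hypothetical transaction IDs
n = 200
k = len(population_ids) // n             # sampling interval: 50,000 / 200 = 250

# Random starting point within the first interval, then every k-th member.
start = random.randrange(k)
systematic_sample = population_ids[start::k]  # exactly 200 IDs
```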
Stratified Sampling: Divide the population into strata (subgroups) based on a shared characteristic, then take a random sample from each stratum. This ensures every important subgroup is represented. Stratified sampling typically produces more precise estimates than simple random sampling because it controls for known sources of variation.
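One way this might look in code, assuming an invented (transaction ID, store) layout for 12 stores and an assumed total sample of 240:

```python
import random
from collections import defaultdict

# Invented records: (transaction_id, store) pairs spread across 12 stores.
records = [(i, f"store_{i % 12}") for i in range(1, 50_001)]

# Group the population into strata by store.
strata = defaultdict(list)
for record in records:
    strata[record[1]].append(record)

# Proportional allocation: each stratum contributes in proportion to its size.
# (Rounding can shift the total by a member or two in general.)
total_n = 240
stratified_sample = []
for store, members in strata.items():
    share = round(total_n * len(members) / len(records))
    stratified_sample.extend(random.sample(members, share))
```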
Cluster Sampling: Divide the population into clusters (often geographic), randomly select some clusters, and then sample all (or some) members within the chosen clusters. Cluster sampling is practical when the population is geographically dispersed and a complete list of all members is unavailable.
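A two-stage sketch with invented store and transaction names:

```python
import random

# Invented clusters: 12 stores, each holding its own transaction list.
stores = {f"store_{s}": [f"txn_{s}_{t}" for t in range(100)] for s in range(12)}

# Stage 1: randomly select 4 clusters (stores).
chosen_stores = random.sample(list(stores), k=4)

# Stage 2: take every transaction within the chosen clusters.
cluster_sample = [txn for store in chosen_stores for txn in stores[store]]
```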
Convenience Sampling: Select whoever is easiest to reach. While cheap and fast, convenience samples are not representative and should not be used to generalize to the population. Results from convenience samples are prone to significant bias.
Simple Random: Use Excel's RANDBETWEEN to randomly pick 200 transaction IDs from the monthly database of 50,000.
Systematic: Pick every 250th transaction from the database (50,000 / 200 = 250), starting from a randomly chosen position within the first interval.
Stratified: Divide transactions by store, then randomly sample proportionally from each store to ensure all 12 locations are represented.
Cluster: Randomly select 4 of the 12 stores, then sample all transactions from those 4 stores.
Convenience: Only analyze transactions from the store nearest to headquarters — easy, but potentially very misleading.
If you drew one sample of 50 transactions from LakeFront's database and calculated the mean, then drew another sample of 50 and calculated the mean again, would you get the exact same number? Almost certainly not. Sample means vary from sample to sample — this is a fundamental fact of statistics called sampling variability.
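A quick simulation makes sampling variability concrete. The uniform population below is invented purely for illustration; any two samples of 50 will almost never share a mean:

```python
import random

random.seed(1)  # reproducible illustration

# Made-up population of 50,000 transaction values (not LakeFront data).
population = [random.uniform(5, 130) for _ in range(50_000)]

sample1 = random.sample(population, 50)
sample2 = random.sample(population, 50)

# Two random samples, two (almost certainly) different means.
print(sum(sample1) / 50)
print(sum(sample2) / 50)
```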
The sampling distribution is the probability distribution of a statistic (like the sample mean) over all possible samples of the same size. It tells us how much the statistic would fluctuate if we repeated the sampling process many times.
The standard error (SE) measures the typical amount by which a sample statistic differs from the true population parameter. For the sample mean, the standard error depends on two things: the population standard deviation (σ) and the sample size (n). Larger samples produce smaller standard errors, meaning more precise estimates.
Standard error of the mean: SE = σ / √n, where SE is the standard error, σ is the population standard deviation, and n is the sample size. (Excel: =std_dev/SQRT(n).)
Standard error of a proportion: SE = √(p(1 − p) / n), where p is the sample proportion and n is the sample size. (Excel: =SQRT(p*(1-p)/n).)
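Both formulas translate directly into one-line functions. This is a sketch: the proportion p = 0.5 is a placeholder, while σ = 28 and n = 50 anticipate the LakeFront example below:

```python
import math

def se_mean(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

print(se_mean(28, 50))          # ~3.96 (see the worked example below)
print(se_proportion(0.5, 200))  # ~0.035 (p = 0.5 is an assumed placeholder)
```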
LakeFront's transaction values have a population mean of $67 and population standard deviation of $28. If we draw a sample of n = 50 transactions:
SE = $28 / √50 = $28 / 7.071 ≈ $3.96
This means sample means from repeated samples of size 50 would typically differ from the true mean by about $3.96. To cut the standard error in half, we would need to quadruple the sample size to n = 200.
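Both claims can be checked empirically by simulating repeated sampling. The normal population below, with μ = $67 and σ = $28, is an assumption standing in for LakeFront's actual transaction data:

```python
import random
import statistics

random.seed(42)

# Assumed stand-in for the transaction population: normal(67, 28).
population = [random.gauss(67, 28) for _ in range(50_000)]

def empirical_se(n, trials=2_000):
    """Standard deviation of sample means across many samples of size n."""
    means = [statistics.mean(random.sample(population, n)) for _ in range(trials)]
    return statistics.stdev(means)

print(empirical_se(50))   # close to 28 / sqrt(50)  ~ 3.96
print(empirical_se(200))  # close to 28 / sqrt(200) ~ 1.98, half as large
```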
[Interactive demo: adjust the sample size to see how the sampling distribution of the mean becomes narrower (more precise) as n increases.]
The Central Limit Theorem (CLT) is arguably the most important theorem in all of statistics. It states that regardless of the shape of the original population distribution, the sampling distribution of the sample mean will be approximately normal as long as the sample size is sufficiently large.
This is a remarkable result. Even if the population is heavily skewed, bimodal, or uniform, the distribution of sample means will still look like a bell curve when the sample size is large enough. The general rule of thumb is that n ≥ 30 is sufficient for the CLT to apply, though more skewed populations may need larger samples.
The CLT is the foundation that allows us to use normal probability methods — Z-scores, confidence intervals, hypothesis tests — for sample means, even when the underlying population is not normal. Without the CLT, we would be limited to making inferences only when we could verify the population is normally distributed, which is rarely the case in practice.
If X1, X2, …, Xn are independent random variables drawn from any population with mean μ and finite standard deviation σ, then for large n, the sampling distribution of the sample mean X̄ is approximately normal:
X̄ ~ N(μ, σ/√n), i.e., approximately normal with mean μ and standard deviation (the standard error) σ/√n.
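A simulation sketch shows the theorem in action; the exponential population here is an arbitrary right-skewed choice, not data from the chapter:

```python
import random
import statistics

random.seed(7)

# Heavily right-skewed population: exponential with mean 67.
population = [random.expovariate(1 / 67) for _ in range(50_000)]

# Sampling distribution of the mean for n = 30, built from 5,000 samples.
means = [statistics.mean(random.sample(population, 30)) for _ in range(5_000)]

# CLT prediction: center ~ 67, spread ~ sigma / sqrt(30).
sigma = statistics.pstdev(population)
print(statistics.mean(means))   # close to the population mean, 67
print(statistics.stdev(means))  # close to sigma / sqrt(30)
print(sigma / 30 ** 0.5)
```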
The left chart shows a right-skewed population. The right chart shows the sampling distribution of means for n = 30 — notice how it is approximately normal despite the skewed population.
The Central Limit Theorem is the bridge between the messy reality of real-world data and the elegant mathematics of normal distributions. No matter how oddly shaped the population, the distribution of sample means converges to normal as sample size grows. This single theorem justifies nearly every inferential method you will encounter in the rest of your statistics education — from confidence intervals to hypothesis tests to regression analysis.
In this final chapter, we covered the essential concepts of sampling and discovered the Central Limit Theorem — the theoretical foundation for inferential statistics. Here is what you should take away:
Sampling: We sample because a census is usually impractical. A sample statistic estimates the corresponding population parameter, and the quality of that estimate depends on the sampling method and sample size.
Sampling Methods: Probability methods (simple random, systematic, stratified, cluster) allow generalization to the population. Convenience sampling does not.
Standard Error: SE measures how much sample statistics vary from sample to sample. SE decreases as sample size increases; specifically, SE = σ / √n. Quadrupling the sample size cuts the standard error in half.
Central Limit Theorem: For sufficiently large samples (n ≥ 30), the sampling distribution of the mean is approximately normal regardless of the population shape. This is what enables us to use Z-scores and normal probability methods for inference.
Congratulations! You have finished all five chapters of STATS100. You now have a solid foundation in descriptive statistics, probability, distributions, Z-scores, sampling, and the Central Limit Theorem. These concepts form the building blocks for every advanced statistical method.