In an ideal world, we would examine every single member of the population we are studying. This complete examination is called a census. In reality, a census is almost always impractical — it costs too much money, takes too much time, and in some cases is physically impossible. This is why we sample: we select a subset of the population and use it to draw conclusions about the whole.
Understanding the distinction between populations and samples is critical because it affects the formulas we use, the conclusions we can draw, and the uncertainty inherent in our results.
The quality of our estimate depends entirely on how well our sample represents the population. A biased or too-small sample leads to unreliable conclusions. The science of sampling is about making this representation as accurate and efficient as possible.
LakeFront Retail Co. processes approximately 50,000 transactions per month across all stores. Management wants to estimate the average transaction value, but reviewing every single transaction would be extremely time-consuming. Instead, they decide to draw a random sample of transactions and use the sample mean to estimate the population mean.
Not all samples are created equal. The method used to select the sample determines whether the results can be generalized to the population. The gold standard is probability sampling, where every member of the population has a known, non-zero chance of being selected.
Simple Random Sampling: Every member of the population has an equal probability of being selected. This is the most basic probability sampling method and the foundation for classical statistical inference. In practice, you assign each member a number and use a random number generator to pick your sample.
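A minimal sketch of that number-and-draw procedure in Python (the ID range and sample size are illustrative, not LakeFront's actual database):

```python
import random

# Hypothetical population: transaction IDs numbered 1 through 50,000.
population_ids = list(range(1, 50_001))

# random.sample draws without replacement, giving every ID an equal
# chance of selection -- a simple random sample of 200.
srs = random.sample(population_ids, k=200)
```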
Systematic Sampling: Select every kth member from the population after a random starting point. For example, if you want a sample of 100 from a population of 5,000, you would select every 50th transaction starting from a randomly chosen position. Systematic sampling is faster than simple random sampling but can produce biased results if the population has a hidden pattern that aligns with the sampling interval.
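A sketch of the interval logic, using the same hypothetical ID list:

```python
import random

population_ids = list(range(1, 50_001))  # hypothetical transaction IDs
n = 200
k = len(population_ids) // n             # sampling interval: 50,000 / 200 = 250

# Random starting point within the first interval, then every k-th member.
start = random.randrange(k)
systematic_sample = population_ids[start::k]  # exactly 200 IDs
```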
Stratified Sampling: Divide the population into strata (subgroups) based on a shared characteristic, then take a random sample from each stratum. This ensures every important subgroup is represented. Stratified sampling typically produces more precise estimates than simple random sampling because it controls for known sources of variation.
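One way this might look in code, assuming an invented (transaction ID, store) layout for 12 stores and an assumed total sample of 240:

```python
import random
from collections import defaultdict

# Invented records: (transaction_id, store) pairs spread across 12 stores.
records = [(i, f"store_{i % 12}") for i in range(1, 50_001)]

# Group the population into strata by store.
strata = defaultdict(list)
for record in records:
    strata[record[1]].append(record)

# Proportional allocation: each stratum contributes in proportion to its size.
# (Rounding can shift the total by a member or two in general.)
total_n = 240
stratified_sample = []
for store, members in strata.items():
    share = round(total_n * len(members) / len(records))
    stratified_sample.extend(random.sample(members, share))
```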
Cluster Sampling: Divide the population into clusters (often geographic), randomly select some clusters, and then sample all (or some) members within the chosen clusters. Cluster sampling is practical when the population is geographically dispersed and a complete list of all members is unavailable.
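A two-stage sketch with invented store and transaction names:

```python
import random

# Invented clusters: 12 stores, each holding its own transaction list.
stores = {f"store_{s}": [f"txn_{s}_{t}" for t in range(100)] for s in range(12)}

# Stage 1: randomly select 4 clusters (stores).
chosen_stores = random.sample(list(stores), k=4)

# Stage 2: take every transaction within the chosen clusters.
cluster_sample = [txn for store in chosen_stores for txn in stores[store]]
```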
Convenience Sampling: Select whoever is easiest to reach. While cheap and fast, convenience samples are not representative and should not be used to generalize to the population. Results from convenience samples are prone to significant bias.
Simple Random: Use Excel's RANDBETWEEN to randomly pick 200 transaction IDs from the monthly database of 50,000.
Systematic: Pick every 250th transaction from the database (50,000 / 200 = 250), starting from a randomly chosen position within the first interval.
Stratified: Divide transactions by store, then randomly sample proportionally from each store to ensure all 12 locations are represented.
Cluster: Randomly select 4 of the 12 stores, then sample all transactions from those 4 stores.
Convenience: Only analyze transactions from the store nearest to headquarters — easy, but potentially very misleading.
If you drew one sample of 50 transactions from LakeFront's database and calculated the mean, then drew another sample of 50 and calculated the mean again, would you get the exact same number? Almost certainly not. Sample means vary from sample to sample — this is a fundamental fact of statistics called sampling variability.
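A quick simulation makes sampling variability concrete. The uniform population below is invented purely for illustration; any two samples of 50 will almost never share a mean:

```python
import random

random.seed(1)  # reproducible illustration

# Made-up population of 50,000 transaction values (not LakeFront data).
population = [random.uniform(5, 130) for _ in range(50_000)]

sample1 = random.sample(population, 50)
sample2 = random.sample(population, 50)

# Two random samples, two (almost certainly) different means.
print(sum(sample1) / 50)
print(sum(sample2) / 50)
```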
The sampling distribution is the probability distribution of a statistic (like the sample mean) over all possible samples of the same size. It tells us how much the statistic would fluctuate if we repeated the sampling process many times.
The standard error (SE) measures the typical amount by which a sample statistic differs from the true population parameter. For the sample mean, the standard error depends on two things: the population standard deviation (σ) and the sample size (n). Larger samples produce smaller standard errors, meaning more precise estimates.
Standard error of the mean: SE = σ / √n, where SE is the standard error, σ is the population standard deviation, and n is the sample size. (Excel: =std_dev/SQRT(n).)
Standard error of a proportion: SE = √(p(1 − p) / n), where p is the sample proportion and n is the sample size. (Excel: =SQRT(p*(1-p)/n).)
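Both formulas translate directly into one-line functions. This is a sketch: the proportion p = 0.5 is a placeholder, while σ = 28 and n = 50 anticipate the LakeFront example below:

```python
import math

def se_mean(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

print(se_mean(28, 50))          # ~3.96 (see the worked example below)
print(se_proportion(0.5, 200))  # ~0.035 (p = 0.5 is an assumed placeholder)
```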
LakeFront's transaction values have a population mean of $67 and population standard deviation of $28. If we draw a sample of n = 50 transactions:
SE = $28 / √50 = $28 / 7.071 ≈ $3.96
This means sample means from repeated samples of size 50 would typically differ from the true mean by about $3.96. To cut the standard error in half, we would need to quadruple the sample size to n = 200.
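Both claims can be checked empirically by simulating repeated sampling. The normal population below, with μ = $67 and σ = $28, is an assumption standing in for LakeFront's actual transaction data:

```python
import random
import statistics

random.seed(42)

# Assumed stand-in for the transaction population: normal(67, 28).
population = [random.gauss(67, 28) for _ in range(50_000)]

def empirical_se(n, trials=2_000):
    """Standard deviation of sample means across many samples of size n."""
    means = [statistics.mean(random.sample(population, n)) for _ in range(trials)]
    return statistics.stdev(means)

print(empirical_se(50))   # close to 28 / sqrt(50)  ~ 3.96
print(empirical_se(200))  # close to 28 / sqrt(200) ~ 1.98, half as large
```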
[Interactive demo: adjust the sample size to see how the sampling distribution of the mean becomes narrower (more precise) as n increases.]
The Central Limit Theorem (CLT) is arguably the most important theorem in all of statistics. It states that regardless of the shape of the original population distribution, the sampling distribution of the sample mean will be approximately normal as long as the sample size is sufficiently large.
This is a remarkable result. Even if the population is heavily skewed, bimodal, or uniform, the distribution of sample means will still look like a bell curve when the sample size is large enough. The general rule of thumb is that n ≥ 30 is sufficient for the CLT to apply, though more skewed populations may need larger samples.
The CLT is the foundation that allows us to use normal probability methods — Z-scores, confidence intervals, hypothesis tests — for sample means, even when the underlying population is not normal. Without the CLT, we would be limited to making inferences only when we could verify the population is normally distributed, which is rarely the case in practice.
If X1, X2, …, Xn are independent random variables drawn from any population with mean μ and finite standard deviation σ, then for large n, the sampling distribution of the sample mean X̄ is approximately normal:
X̄ ~ N(μ, σ/√n), i.e., approximately normal with mean μ and standard deviation (the standard error) σ/√n.
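A simulation sketch shows the theorem in action; the exponential population here is an arbitrary right-skewed choice, not data from the chapter:

```python
import random
import statistics

random.seed(7)

# Heavily right-skewed population: exponential with mean 67.
population = [random.expovariate(1 / 67) for _ in range(50_000)]

# Sampling distribution of the mean for n = 30, built from 5,000 samples.
means = [statistics.mean(random.sample(population, 30)) for _ in range(5_000)]

# CLT prediction: center ~ 67, spread ~ sigma / sqrt(30).
sigma = statistics.pstdev(population)
print(statistics.mean(means))   # close to the population mean, 67
print(statistics.stdev(means))  # close to sigma / sqrt(30)
print(sigma / 30 ** 0.5)
```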
The left chart shows a right-skewed population. The right chart shows the sampling distribution of means for n = 30 — notice how it is approximately normal despite the skewed population.
The Central Limit Theorem is the bridge between the messy reality of real-world data and the elegant mathematics of normal distributions. No matter how oddly shaped the population, the distribution of sample means converges to normal as sample size grows. This single theorem justifies nearly every inferential method you will encounter in the rest of your statistics education — from confidence intervals to hypothesis tests to regression analysis.
In this final chapter, we covered the essential concepts of sampling and discovered the Central Limit Theorem — the theoretical foundation for inferential statistics. Here is what you should take away:
Sampling: We sample because a census is usually impractical. A sample statistic estimates the corresponding population parameter, and the quality of that estimate depends on the sampling method and sample size.
Sampling Methods: Probability methods (simple random, systematic, stratified, cluster) allow generalization to the population. Convenience sampling does not.
Standard Error: SE measures how much sample statistics vary from sample to sample. SE decreases as sample size increases; specifically, SE = σ / √n. Quadrupling the sample size cuts the standard error in half.
Central Limit Theorem: For sufficiently large samples (n ≥ 30), the sampling distribution of the mean is approximately normal regardless of the population shape. This is what enables us to use Z-scores and normal probability methods for inference.
Congratulations! You have finished all five chapters of STATS100. You now have a solid foundation in descriptive statistics, probability, distributions, Z-scores, sampling, and the Central Limit Theorem. These concepts form the building blocks for every advanced statistical method.