Every business decision begins with understanding the data at hand. Before building predictive models or running hypothesis tests, analysts must first describe what the data looks like. This is the domain of descriptive statistics — a collection of methods for summarizing, organizing, and visualizing raw data so that meaningful patterns emerge.
Descriptive statistics answer fundamental questions: What is a typical value in our dataset? How much do the observations spread out? Are there unusually high or low values that deserve attention? These questions may seem simple, but the answers form the foundation of every advanced analysis that follows.
In a business context, descriptive statistics are the first tool managers reach for. A regional sales director looking at quarterly revenue figures across 50 territories does not start with regression analysis — they start by computing averages, sorting from high to low, and plotting bar charts. Descriptive statistics turn a sprawling spreadsheet into a concise story.
Consider the difference between receiving a list of 10,000 daily transactions versus receiving a single summary: average transaction value is $47, with a standard deviation of $12, and a median of $44. The summary instantly communicates what would take hours to absorb from the raw data. Descriptive statistics compress information without losing the essential shape of the data.
Throughout this chapter, we will follow a single running example — LakeFront Retail Co. — to see how each concept applies to a real business scenario.
LakeFront Retail Co. is a mid-size retail chain operating 12 stores around the Great Lakes region. Management wants to understand weekly revenue patterns across all stores to decide where to invest in expansion, which locations need support, and how consistent performance is across the chain.
The dataset below shows weekly revenue (in thousands of dollars) for each of the 12 stores. This data will serve as our working example throughout the entire chapter.
| Store | Revenue ($K) |
|---|---|
| Duluth | 48 |
| Thunder Bay | 52 |
| Marquette | 41 |
| Traverse City | 63 |
| Green Bay | 55 |
| Milwaukee | 71 |
| Chicago | 89 |
| Gary | 44 |
| Toledo | 38 |
| Cleveland | 61 |
| Erie | 47 |
| Buffalo | 57 |
The first question we typically ask about a dataset is: what is a typical value? Measures of central tendency give us a single number that represents the “center” or “middle” of the data. The three most common measures are the mean, median, and mode.
The mean is the most widely used measure of central tendency. It is computed by adding all values in the dataset and dividing by the number of observations. The mean takes every data point into account, which makes it a comprehensive summary — but also makes it sensitive to extreme values.
In formula form, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where $\bar{x}$ is the sample mean, $x_i$ is each individual observation, and $n$ is the total number of observations. In Excel, the mean is computed with =AVERAGE(range).
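To make the arithmetic concrete, here is a minimal Python sketch of the same calculation using the store revenues from the table above. Python's statistics module is simply one convenient tool; the Excel formula gives the same result.

```python
from statistics import mean

# Weekly revenue ($K) for the 12 LakeFront stores, in table order
revenues = [48, 52, 41, 63, 55, 71, 89, 44, 38, 61, 47, 57]

# Mean: sum of all observations divided by the number of observations
manual_mean = sum(revenues) / len(revenues)
print(manual_mean)       # 55.5
print(mean(revenues))    # 55.5, the same result via the statistics module
```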
The median is the middle value when data are arranged in order from smallest to largest. If the dataset has an odd number of observations, the median is the single middle value. If the dataset has an even number, the median is the average of the two middle values. Because the median depends only on position — not magnitude — it is resistant to outliers.
In Excel, the median is computed with =MEDIAN(range). By hand, for an even number of observations $n$, sort the data and average the values at positions $n/2$ and $(n/2)+1$.
The mode is the value that occurs most frequently in a dataset. In continuous data (like revenue figures), it is common for no value to repeat, in which case we say the data has no mode. The mode is most useful for categorical or discrete data, such as the most popular product category sold across stores.
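A similar sketch, under the same assumptions as the one above, computes the median and checks for a mode. The product-category list at the end is purely hypothetical, included only to show the kind of data for which the mode is genuinely useful.

```python
from statistics import median, multimode

revenues = [48, 52, 41, 63, 55, 71, 89, 44, 38, 61, 47, 57]

# Median: middle of the sorted data; with n = 12 (even), it is the
# average of the two middle values (52 and 55)
print(median(revenues))       # 53.5

# Mode: most frequent value(s). Every revenue figure appears exactly once,
# so multimode() returns all 12 values, i.e. there is no single mode.
print(multimode(revenues))

# The mode is more useful for categorical data, e.g. a (hypothetical)
# list of each store's best-selling product category
categories = ["Apparel", "Grocery", "Apparel", "Electronics", "Apparel",
              "Grocery", "Electronics", "Apparel", "Grocery", "Apparel",
              "Grocery", "Apparel"]
print(multimode(categories))  # ['Apparel']
```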
LakeFront's mean weekly revenue is $55.5K, but the median is $53.5K. The mean is pulled higher by the Chicago store, which generates $89K per week — significantly more than any other location. This $2K gap between mean and median tells management that the revenue distribution is right-skewed: a few high-performing stores push the average above the typical store's revenue.
For planning purposes, the median ($53.5K) may be a better representation of what a “typical” LakeFront store earns each week.
The mean uses every data point in its calculation, making it sensitive to outliers. The median depends only on the middle position, making it robust. When data is skewed, the median often provides a more representative measure of the “typical” value.
Knowing the center of a dataset is only half the picture. Two datasets can share the same mean yet look entirely different. Imagine two retail chains that both average $55K in weekly revenue per store — one might have stores clustered between $50K and $60K, while the other has stores ranging from $20K to $90K. The spread (or variability) of data tells us how much individual values differ from one another and from the center.
The most common measures of spread are the range, variance, standard deviation, and interquartile range (IQR).
The range is the simplest measure of spread: the difference between the maximum and minimum values. It is easy to compute but only considers two data points, making it highly sensitive to outliers.
In Excel, the range is computed as =MAX(range)-MIN(range).

While the range only uses two values, the variance considers how far every data point deviates from the mean. It is computed by finding each observation's deviation from the mean, squaring those deviations (to eliminate negative signs), summing them, and dividing by n − 1 (for a sample). The standard deviation is the square root of the variance, bringing the measure back to the original units.
We divide by n − 1 rather than n because we are working with a sample, not the entire population. This correction (called Bessel's correction) produces an unbiased estimate of the population variance.
In formula form, $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$, where $s^2$ is the sample variance, $x_i$ is each observation, $\bar{x}$ is the sample mean, and $n$ is the sample size. In Excel: =VAR.S(range).
The standard deviation $s = \sqrt{s^2}$ is the square root of the variance, expressed in the same units as the data. In Excel: =STDEV.S(range).
| Store | $x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$ |
|---|---|---|---|
| Duluth | 48 | −7.5 | 56.25 |
| Thunder Bay | 52 | −3.5 | 12.25 |
| Marquette | 41 | −14.5 | 210.25 |
| Traverse City | 63 | 7.5 | 56.25 |
| Green Bay | 55 | −0.5 | 0.25 |
| Milwaukee | 71 | 15.5 | 240.25 |
| Chicago | 89 | 33.5 | 1122.25 |
| Gary | 44 | −11.5 | 132.25 |
| Toledo | 38 | −17.5 | 306.25 |
| Cleveland | 61 | 5.5 | 30.25 |
| Erie | 47 | −8.5 | 72.25 |
| Buffalo | 57 | 1.5 | 2.25 |
| **Sum of squared deviations** | | | 2,241.00 |

Dividing by n − 1 = 11 gives a sample variance of $s^2 \approx 203.7$, and taking the square root yields a sample standard deviation of $s \approx 14.3$, or about $14.3K.
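The short Python sketch below, again using only the standard statistics module, reproduces this hand calculation; both variance() and stdev() apply the same n − 1 correction as Excel's =VAR.S and =STDEV.S.

```python
from statistics import mean, variance, stdev

revenues = [48, 52, 41, 63, 55, 71, 89, 44, 38, 61, 47, 57]

# Reproduce the table: squared deviations from the mean, then their sum
xbar = mean(revenues)                                  # 55.5
sum_sq_dev = sum((x - xbar) ** 2 for x in revenues)    # 2241.0

# Sample variance divides by n - 1 (Bessel's correction), matching =VAR.S
n = len(revenues)
print(sum_sq_dev / (n - 1))     # ~203.7
print(variance(revenues))       # same value from the statistics module

# Sample standard deviation is the square root, matching =STDEV.S
print(stdev(revenues))          # ~14.3
```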
A standard deviation of approximately $14.3K tells LakeFront's management that most stores generate weekly revenue within roughly one standard deviation of the mean — between about $41K and $70K. Stores outside this range (Chicago at $89K, Toledo at $38K) deserve special attention: Chicago as a model for success, and Toledo as a candidate for operational review.
Standard deviation measures how much individual data points typically differ from the mean. Adding a constant to every value shifts the center but does not change the spread — the distances between points remain the same. Standard deviation changes only under operations that stretch or shrink those distances, such as multiplying every value by a constant, which scales the standard deviation by the absolute value of that constant.
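As a quick check of this property, here is a small sketch under the same hypothetical setup as the earlier snippets: a constant shift leaves the standard deviation unchanged, while a constant multiplier scales it.

```python
from statistics import stdev

revenues = [48, 52, 41, 63, 55, 71, 89, 44, 38, 61, 47, 57]
s = stdev(revenues)                              # ~14.27

# Adding a constant (say, a flat $5K boost to every store) shifts the center
# but leaves all pairwise distances, and therefore the spread, unchanged
shifted = [x + 5 for x in revenues]
print(round(s, 2), round(stdev(shifted), 2))     # 14.27 14.27

# Multiplying every value by a constant scales the spread by that constant:
# converting from $K to dollars multiplies the standard deviation by 1,000
scaled = [x * 1000 for x in revenues]
print(round(stdev(scaled) / s))                  # 1000
```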
Numbers alone rarely tell the full story. A well-designed visualization can reveal patterns, outliers, and distributions that summary statistics might obscure. In business settings, charts and graphs are often the primary way analysis gets communicated to stakeholders who may not be comfortable interpreting raw numbers.
The type of chart you choose depends on what you want to communicate:

- **Bar charts** compare values across categories, such as revenue by store.
- **Histograms** show the shape of a distribution, revealing skew and clustering.
- **Scatter plots** display the relationship between two variables.
For LakeFront's data, a bar chart sorted by revenue is the most immediate way to compare store performance. Let us look at one now.
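A minimal sketch of how such a sorted bar chart could be built is shown below, assuming Python with matplotlib (one convenient charting tool, not one prescribed by the chapter; the same chart is straightforward to produce in Excel).

```python
import matplotlib.pyplot as plt

# Store names and weekly revenue ($K), from the LakeFront table
stores = ["Duluth", "Thunder Bay", "Marquette", "Traverse City",
          "Green Bay", "Milwaukee", "Chicago", "Gary", "Toledo",
          "Cleveland", "Erie", "Buffalo"]
revenues = [48, 52, 41, 63, 55, 71, 89, 44, 38, 61, 47, 57]

# Sort stores from highest to lowest revenue so top and bottom
# performers are immediately visible
pairs = sorted(zip(stores, revenues), key=lambda p: p[1], reverse=True)
labels = [store for store, _ in pairs]
values = [rev for _, rev in pairs]

plt.figure(figsize=(10, 5))
plt.bar(labels, values)
plt.ylabel("Weekly revenue ($K)")
plt.title("LakeFront Retail Co.: weekly revenue by store (sorted)")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```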
LakeFront's regional VP uses this sorted bar chart to quickly identify the top and bottom performers. The visualization instantly reveals that Chicago is a clear outlier at the top, with revenue more than double that of Toledo. The chart also shows a natural grouping: three stores above $60K (Chicago, Milwaukee, Traverse City), six stores between $44K and $57K forming the core, and three stores below $48K that may need attention.
Individual summary statistics are useful, but they become powerful when combined into a complete descriptive profile. Reporting only the mean without a measure of spread can be misleading — it hides how much variation exists in the data. Similarly, reporting only the range tells you about the extremes but nothing about the center.
A best-practice descriptive summary includes:

- the sample size,
- a measure of center (mean and median),
- a measure of spread (standard deviation, and often the range or IQR),
- the minimum and maximum values, and
- a supporting visualization.
Let us compile LakeFront's complete summary dashboard.

| Statistic | Value |
|---|---|
| Number of stores (n) | 12 |
| Mean weekly revenue | $55.5K |
| Median weekly revenue | $53.5K |
| Standard deviation | $14.3K |
| Minimum (Toledo) | $38K |
| Maximum (Chicago) | $89K |
| Range | $51K |
This dashboard tells a complete story: LakeFront's typical store earns mid-$50Ks weekly, but there is meaningful variation ($14.3K std dev) across the chain. The gap between mean and median suggests right skew driven by top performers like Chicago. Management should investigate both the high performers (to replicate success) and the lowest performers (to identify improvement opportunities).
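For analysts who prefer to script the profile rather than build it by hand, a short sketch along these lines could work. Note that the exact IQR figure depends on which quartile convention is used; the "inclusive" method passed to quantiles() here is an assumption, not a requirement of the chapter.

```python
from statistics import mean, median, stdev, quantiles

revenues = [48, 52, 41, 63, 55, 71, 89, 44, 38, 61, 47, 57]

# Quartiles; the "inclusive" method is one of several common conventions
q1, _, q3 = quantiles(revenues, n=4, method="inclusive")

profile = {
    "Number of stores (n)": len(revenues),          # 12
    "Mean ($K)": round(mean(revenues), 1),          # 55.5
    "Median ($K)": round(median(revenues), 1),      # 53.5
    "Std dev ($K)": round(stdev(revenues), 1),      # 14.3
    "Min ($K)": min(revenues),                      # 38 (Toledo)
    "Max ($K)": max(revenues),                      # 89 (Chicago)
    "Range ($K)": max(revenues) - min(revenues),    # 51
    "IQR ($K)": round(q3 - q1, 2),                  # depends on quartile method
}

for label, value in profile.items():
    print(f"{label:>20}: {value}")
```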
Always report both a measure of center and a measure of spread. The mean without context is incomplete — you need standard deviation or IQR to understand how representative that mean actually is. A complete descriptive profile, paired with a well-chosen visualization, gives stakeholders the information they need to make sound decisions.
In this chapter, we covered the foundational tools of descriptive statistics — the methods that transform raw data into meaningful summaries. Here is what you should take away:
Central Tendency: The mean, median, and mode each capture a different sense of the “center.” The mean is comprehensive but sensitive to outliers; the median is robust to skewed data; the mode is best for categorical variables.
Spread: Range, variance, and standard deviation quantify how much data values differ from one another. Standard deviation is the most commonly reported measure because it is in the same units as the data.
Visualization: Charts translate numbers into visual patterns. Choose the chart type based on what you want to communicate — comparisons (bar chart), distributions (histogram), or relationships (scatter plot).
The Big Picture: Descriptive statistics are not an end in themselves — they are the essential first step that guides every subsequent analysis. A thorough descriptive profile combines center, spread, and visualization to tell a complete, honest story about your data.