Chapter 6

Simple Linear Regression

📖 ~55 min read 📈 2 interactive charts ✍️ 2 practice questions

6.1 The Regression Model

While correlation tells us whether two variables are linearly related, regression tells us how they are related and lets us make predictions. Simple linear regression fits a straight line through the data that best describes the relationship between a predictor variable (x) and a response variable (y).

The regression equation takes the form:

Simple Linear Regression Equation
ŷ = b0 + b1x
📊 Excel: =FORECAST(x, y_range, x_range)
where ŷ is the predicted value of y, b0 is the y-intercept, and b1 is the slope.
Slope (b1)
b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
📊 Excel: =SLOPE(y_range, x_range)
The slope measures the change in y for each one-unit increase in x.
Intercept (b0)
b0 = ȳ − b1x̄
📊 Excel: =INTERCEPT(y_range, x_range)
where ȳ is the mean of y and x̄ is the mean of x.
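As a sketch of how the slope and intercept formulas work numerically, the Python below computes b1 and b0 by hand for a small made-up dataset (illustrative values, not the GreatLakes data):

```python
# Least-squares slope and intercept computed from the formulas above.
# The (x, y) pairs are illustrative, not the GreatLakes dataset.
x = [1, 2, 3, 4, 5]
y = [1500, 1900, 2200, 2600, 2900]

n = len(x)
x_bar = sum(x) / n          # mean of x
y_bar = sum(y) / n          # mean of y

# b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))

# b0 = ȳ − b1·x̄
b0 = y_bar - b1 * x_bar

print(f"ŷ = {b0:.0f} + {b1:.0f}x")  # → ŷ = 1170 + 350x
```

For real work, Excel's =SLOPE and =INTERCEPT (or any statistics library) return the same values.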
🏪 GreatLakes Manufacturing

Continuing from Chapter 5, GreatLakes Manufacturing wants to predict annual maintenance cost from machine age. Using the same n = 20 data points, the regression analysis yields: ŷ = 1200 + 340x, where x is machine age in years and ŷ is predicted annual maintenance cost in dollars.

📈 Interactive Chart: Machine Age vs. Maintenance Cost with Regression Line
✓ Check Your Understanding
In the equation ŷ = 1200 + 340x, the slope of 340 means:
A) The cost starts at $340 when the machine is new
B) Each additional year of machine age adds $340 to the predicted maintenance cost
C) There is a $340 fixed cost regardless of age
D) Machine age explains 34% of the variation in cost

6.2 Evaluating the Regression

Fitting a line is easy — evaluating whether it fits well is the harder and more important task. The key metric for assessing a regression model is R-squared (R²), also called the coefficient of determination.

R-Squared

R² measures the proportion of variation in the response variable (y) that is explained by the predictor variable (x). It ranges from 0 to 1:

  • R² = 0: The model explains none of the variation — the regression line is no better than using the mean of y.
  • R² = 1: The model explains all variation — every data point falls exactly on the regression line.
Coefficient of Determination
R² = SSR / SST = 1 − SSE / SST
📊 Excel: =RSQ(y_range, x_range)
where SSR is the regression sum of squares, SST is the total sum of squares, and SSE is the error sum of squares (SST = SSR + SSE).
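A small numeric sketch of this decomposition (the data are made up; the least-squares line for these particular points works out to ŷ = 1170 + 350x):

```python
# Verify R² = SSR/SST = 1 − SSE/SST on illustrative data.
x = [1, 2, 3, 4, 5]
y = [1500, 1900, 2200, 2600, 2900]

# Least-squares fit for these particular points (worked out by hand).
y_hat = [1170 + 350 * xi for xi in x]
y_bar = sum(y) / len(y)

sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # error sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # regression sum of squares

r_squared = ssr / sst
print(round(r_squared, 4), round(1 - sse / sst, 4))  # → 0.9976 0.9976
```

Both forms of the formula agree because SST = SSR + SSE for a least-squares fit.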
✎ Worked Example: Interpreting R²
Step 1. From the GreatLakes regression output: R² = 0.533.
Step 2. Interpretation: Machine age explains 53.3% of the variation in annual maintenance costs across the 20 machines. The remaining 46.7% is due to other factors not in the model (e.g., machine type, usage intensity, maintenance history).
Step 3. Note: For simple linear regression, R² = r². Since R² = 0.533 and the slope is positive, the correlation r = √0.533 ≈ +0.73.
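Step 3's square-root conversion, checked in Python (the sign factor is spelled out explicitly because √R² alone loses the slope's sign):

```python
import math

# For simple linear regression, r = ±√R², taking the sign of the slope.
r_squared = 0.533
slope_sign = 1              # the GreatLakes slope (340) is positive
r = slope_sign * math.sqrt(r_squared)
print(round(r, 2))  # → 0.73
```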
✓ Check Your Understanding
An R² of 0.72 means:
A) The slope of the regression line is 0.72
B) 72% of the variation in y is explained by x
C) The correlation between x and y is 0.72
D) The model is 72% accurate

6.3 Predictions and Residuals

Making Predictions

Once we have a regression equation, we can predict y for any given value of x by substituting into the equation. However, predictions are only reliable within the range of the observed x-values. Predicting beyond the data range is called extrapolation and is dangerous because the linear relationship may not hold outside the observed range.

✎ Worked Example: Predicting Maintenance Cost
Step 1. Predict the annual maintenance cost for a 7-year-old machine using ŷ = 1200 + 340x.
Step 2. ŷ = 1200 + 340(7) = 1200 + 2380 = $3,580
Step 3. Result: The predicted annual maintenance cost for a 7-year-old machine is $3,580. Since 7 years falls within the observed range (1–10 years), this prediction is reasonable.
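The prediction step, plus the caution about extrapolation, might be packaged as a small helper; the `predict_cost` function and its range guard are illustrative, not from the chapter:

```python
def predict_cost(age, b0=1200.0, b1=340.0, x_min=1.0, x_max=10.0):
    """Predict annual maintenance cost from machine age (ŷ = 1200 + 340x).

    Prints a warning when `age` lies outside the observed 1-10 year
    range, since any prediction there is extrapolation.
    """
    if not (x_min <= age <= x_max):
        print(f"Warning: age {age} is outside the observed range "
              f"[{x_min}, {x_max}]; this prediction is extrapolation.")
    return b0 + b1 * age

print(predict_cost(7))   # → 3580.0 (within range, no warning)
```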

Residuals

A residual is the difference between the observed value and the predicted value: e = y − ŷ. Residuals tell us how far off our predictions are for each observation. A well-fitting model has residuals that are small, randomly scattered, and show no pattern.

Analyzing residuals helps us detect problems with the model. If residuals show a curved pattern, the relationship may not be linear. If residuals fan out (increase in magnitude), the variance may not be constant.
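A minimal sketch of such a residual check; the data, the assumed fitted line, and the sign-change heuristic are all illustrative assumptions, not the chapter's method:

```python
# Compute residuals e = y − ŷ and count sign changes as a crude
# randomness check: long runs of same-signed residuals (few sign
# changes) suggest a curved, non-random pattern.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
b0, b1 = 0.0, 2.0   # assumed fitted line for these made-up data

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sign_changes = sum(1 for e1, e2 in zip(residuals, residuals[1:])
                   if e1 * e2 < 0)
print(sign_changes)  # → 5 (residuals alternate sign: no obvious pattern)
```

In practice, a residual plot (as below) conveys the same information visually and is the standard diagnostic.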

📈 Interactive Chart: Residual Plot (Predicted Values vs. Residuals)
💡 Key Takeaway

Never extrapolate beyond the range of observed data. A regression model describes the relationship within the data; outside that range, the relationship may change dramatically. Always check residual plots to verify model assumptions before relying on predictions.

6.4 Chapter Summary

This chapter covered the fundamentals of simple linear regression: fitting a line, evaluating its quality, and using it for predictions.

💡 Chapter 6 Summary

The Model: ŷ = b0 + b1x fits a straight line through paired data. The slope b1 tells us how much y changes per unit increase in x.

R-Squared: Measures the proportion of variation in y explained by x. Higher R² means a better fit, but does not guarantee the model is correct.

Predictions: Substitute x into the equation to predict y. Only predict within the range of observed data.

Residuals: Check residual plots for patterns. Random scatter indicates a good model; patterns indicate violations of assumptions.

📋 Chapter 6 — Formula Reference
Measure               Formula                               Excel Function
Regression Equation   ŷ = b0 + b1x                          =FORECAST(x, y_range, x_range)
Slope                 b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²      =SLOPE(y_range, x_range)
Intercept             b0 = ȳ − b1x̄                          =INTERCEPT(y_range, x_range)
R-Squared             R² = SSR / SST                        =RSQ(y_range, x_range)
Residual              e = y − ŷ                             =y - FORECAST(x, y_range, x_range)
Up Next
Chapter 7: Chi-Square Tests