While correlation tells us whether two variables are linearly related, regression tells us how they are related and lets us make predictions. Simple linear regression fits a straight line through the data that best describes the relationship between a predictor variable (x) and a response variable (y).
The regression equation takes the form:
ŷ = b0 + b1x, where ŷ is the predicted value of y, b0 is the y-intercept, and b1 is the slope. In Excel, a predicted value can be computed directly with =FORECAST(x, y_range, x_range).
The least-squares estimates are b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b0 = ȳ − b1x̄, where ȳ is the mean of y and x̄ is the mean of x. In Excel, these are computed with =SLOPE(y_range, x_range) and =INTERCEPT(y_range, x_range).
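The slope and intercept formulas can be sketched in a few lines of Python. The (x, y) data here are made up for illustration, not taken from the chapter:

```python
# Illustrative sketch: least-squares slope and intercept from the formulas
# b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  b0 = y_bar - b1 * x_bar.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.9]   # made-up sample data

n = len(x)
x_bar = sum(x) / n               # mean of x
y_bar = sum(y) / n               # mean of y

b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

print(b0, b1)
```

This mirrors what Excel's SLOPE and INTERCEPT return for the same ranges.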
Continuing from Chapter 5, GreatLakes Manufacturing wants to predict annual maintenance cost from machine age. Using the same n = 20 data points, the regression analysis yields: ŷ = 1200 + 340x, where x is machine age in years and ŷ is predicted annual maintenance cost in dollars.
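The fitted GreatLakes equation can be used directly for prediction. A minimal sketch (the function name is ours, not the chapter's):

```python
# GreatLakes Manufacturing model from the text: y_hat = 1200 + 340x,
# where x is machine age in years and y_hat is predicted annual cost in dollars.
b0, b1 = 1200, 340

def predicted_cost(age_years):
    """Predicted annual maintenance cost in dollars for a machine of a given age."""
    return b0 + b1 * age_years

print(predicted_cost(5))  # cost prediction for a 5-year-old machine → 2900
```

So a 5-year-old machine is predicted to cost 1200 + 340(5) = $2,900 per year to maintain.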
Fitting a line is easy — evaluating whether it fits well is the harder and more important task. The key metric for assessing a regression model is R-squared (R²), also called the coefficient of determination.
R² measures the proportion of variation in the response variable (y) that is explained by the predictor variable (x). It ranges from 0 to 1: a value near 0 means x explains almost none of the variation in y, while a value near 1 means x explains nearly all of it.
In formula form, R² = SSR / SST = 1 − SSE / SST, where SSR is the regression sum of squares, SST is the total sum of squares, and SSE is the error sum of squares. In Excel: =RSQ(y_range, x_range).
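The sums-of-squares decomposition can be verified numerically. In this sketch the data and the fitted coefficients are assumptions chosen for illustration:

```python
# Illustrative sketch: R^2 = SSR/SST = 1 - SSE/SST, using made-up data
# and a hypothetical fitted line y_hat = 0 + 2x.
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]    # made-up sample data
b0, b1 = 0.0, 2.0                # hypothetical fitted coefficients

y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # error sum of squares
ssr = sst - sse                                         # regression sum of squares

r_squared = ssr / sst            # equivalently: 1 - sse / sst
print(r_squared)
```

Because the points lie very close to the line, R² here comes out just below 1, which is what Excel's RSQ would report for the same ranges.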
Once we have a regression equation, we can predict y for any given value of x by substituting into the equation. However, predictions are only reliable within the range of the observed x-values. Predicting beyond the data range is called extrapolation and is dangerous because the linear relationship may not hold outside the observed range.
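One way to enforce the no-extrapolation rule in practice is to check the input against the observed x-range before predicting. The range check and the observed ages below are assumptions added for illustration, not part of the chapter's model:

```python
# Sketch: a guarded prediction function that refuses to extrapolate.
observed_ages = [1, 2, 4, 6, 8, 10]   # hypothetical observed machine ages
b0, b1 = 1200, 340                    # GreatLakes model from the text

def predict(x):
    # Only predict within the range of observed x-values (no extrapolation).
    if not (min(observed_ages) <= x <= max(observed_ages)):
        raise ValueError(f"x={x} is outside the observed range; refusing to extrapolate")
    return b0 + b1 * x
```

A call like predict(5) succeeds, while predict(15) raises an error rather than silently extrapolating.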
A residual is the difference between the observed value and the predicted value: e = y − ŷ. Residuals tell us how far off our predictions are for each observation. A well-fitting model has residuals that are small, randomly scattered, and show no pattern.
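Residuals are straightforward to compute once the model is fitted. In this sketch the observed costs are hypothetical values invented for illustration:

```python
# Sketch: residuals e = y - y_hat for the GreatLakes model.
b0, b1 = 1200, 340
ages  = [1, 3, 5]
costs = [1550, 2300, 2850]    # hypothetical observed annual maintenance costs

# e = observed - predicted, one residual per observation
residuals = [y - (b0 + b1 * x) for x, y in zip(ages, costs)]
print(residuals)  # → [10, 80, -50]
```

Small residuals of mixed sign, like these, are what a well-fitting model should produce.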
Analyzing residuals helps us detect problems with the model. If residuals show a curved pattern, the relationship may not be linear. If residuals fan out (their magnitude grows as x increases), the variance may not be constant.
Never extrapolate beyond the range of observed data. A regression model describes the relationship within the data; outside that range, the relationship may change dramatically. Always check residual plots to verify model assumptions before relying on predictions.
This chapter covered the fundamentals of simple linear regression: fitting a line, evaluating its quality, and using it for predictions.
The Model: ŷ = b0 + b1x fits a straight line through paired data. The slope b1 tells us how much y changes per unit increase in x.
R-Squared: Measures the proportion of variation in y explained by x. Higher R² means a better fit, but does not guarantee the model is correct.
Predictions: Substitute x into the equation to predict y. Only predict within the range of observed data.
Residuals: Check residual plots for patterns. Random scatter indicates a good model; patterns indicate violations of assumptions.