A regression equation may look impressive on paper, but the real test of a model lies in its residuals — the differences between the observed values and the values the model predicts. Residuals are the voice of the data telling you what the model missed. Examining them carefully is the single most important diagnostic step after fitting any regression.
Every residual is defined as the observed value minus the predicted value:
The residual for observation i is e_i = y_i − ŷ_i, or in spreadsheet form =Actual - Predicted, where y_i is the observed value and ŷ_i is the predicted value from the regression model.
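As a minimal spreadsheet sketch, assume observed values in column B and predicted values in column C, with data starting in row 2 (the layout is illustrative, not prescribed by the model):
=B2-C2 entered in D2 and filled down gives the residual for each observation.
=SUM(D2:D41) should be essentially zero for a least-squares fit that includes an intercept; a clearly nonzero sum usually points to a calculation error.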
Residual plots reveal whether the four key assumptions of linear regression hold: linearity, independence of errors, homoscedasticity (constant error variance), and normality of errors.
Three plots form the core of residual analysis: the plot of residuals against fitted values (to check linearity and constant variance), the normal Q-Q plot of residuals (to check normality), and the plot of residuals in observation order (to check independence).
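Building these plots in a spreadsheet requires fitted values first. A hedged sketch, assuming revenue in B2:B41 and three predictors in C2:E41 (ranges are illustrative):
=TREND($B$2:$B$41, $C$2:$E$41) returns the fitted value for each observation (entered as an array formula in older Excel; it spills automatically in current versions).
=B2-F2, with the fitted values in column F, gives the residuals, which you then chart against the fitted values in a scatter plot.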
NorthStar's analytics team built a regression model predicting quarterly revenue from advertising spend, headcount, and market index. When they plotted residuals against fitted values, they noticed a clear fan shape — residuals spread wider as fitted values increased. This is the classic signature of heteroscedasticity: the model's prediction errors grow larger for higher-revenue quarters.
The team addressed this by applying a log transformation to revenue before fitting the model, which stabilized the variance and produced a random residual scatter.
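A minimal spreadsheet version of that step, assuming quarterly revenue sits in column B: add a helper column with =LN(B2) filled down, then refit the regression using the log-revenue column as the dependent variable. Predictions come back on the log scale, so convert them with =EXP(prediction) when revenue is needed in original units.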
Always plot your residuals before trusting a regression model. A good-looking R-squared can hide serious assumption violations. Fan-shaped residuals indicate heteroscedasticity, curved patterns indicate non-linearity, and clusters may indicate missing variables or interaction effects.
In practice, analysts rarely know in advance which predictors belong in a model. With many candidate variables available, the challenge is selecting a subset that balances predictive power against complexity. Including too few variables produces a model that misses important relationships; including too many creates noise, instability, and overfitting.
Forward selection: Start with no predictors. At each step, add the variable that most improves the model (typically the one with the lowest p-value or greatest reduction in AIC). Continue adding variables until no remaining candidate meets the entry criterion.
Backward elimination: Start with all candidate predictors. At each step, remove the variable that contributes the least (typically the one with the highest p-value or whose removal most reduces AIC). Continue removing variables until every remaining predictor meets the retention criterion.
Stepwise selection: A hybrid of the forward and backward approaches. At each step, the algorithm can both add and remove variables. A predictor that was useful early on may become redundant once other variables enter the model, so stepwise selection allows the model to correct itself as it builds.
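A hedged sketch of a single forward-selection step in a spreadsheet, assuming revenue in B2:B41 and one candidate predictor in C2:C41 (ranges are illustrative):
=INDEX(LINEST(B2:B41, C2:C41, TRUE, TRUE), 5, 2) returns the residual sum of squares (SSE) of the one-predictor fit; LINEST's additional statistics place SSE in row 5, column 2 of its output.
=40*LN(SSE/40) + 2*2 then converts that SSE to an AIC for this model, with n = 40 observations and k = 2 parameters (slope and intercept). Repeat for each candidate, add the predictor with the lowest AIC, and continue with two-predictor fits, and so on.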
Rather than relying on p-values, information criteria provide a single score that balances goodness-of-fit against model complexity. Lower values are better for both AIC and BIC.
AIC = n·ln(SSE/n) + 2k, computed in a spreadsheet as =n*LN(SSE/n) + 2*k, where n is the number of observations, SSE is the residual sum of squares, and k is the number of estimated parameters (including the intercept). This is the least-squares form of the general definition AIC = 2k − 2·ln(L), where L is the maximized likelihood. Lower AIC indicates a better balance of fit and parsimony.
BIC = n·ln(SSE/n) + k·ln(n), computed as =n*LN(SSE/n) + k*LN(n). BIC penalizes each additional parameter more heavily than AIC (by ln(n) instead of 2), so it tends to select simpler models when the sample size is large.
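As a worked illustration with assumed numbers: for a model fit to n = 40 quarters with k = 5 parameters (four predictors plus the intercept),
=40*LN(SSE/40) + 2*5 gives the AIC and =40*LN(SSE/40) + 5*LN(40) gives the BIC. Because LN(40) is about 3.7, each parameter costs roughly 3.7 points under BIC versus 2 under AIC, which is why BIC tends to favor the smaller model when the two criteria disagree.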
NorthStar's data science team had 8 candidate predictors for their quarterly revenue model: advertising spend, headcount, market index, customer satisfaction score, web traffic, number of product lines, average deal size, and region. Using backward elimination with AIC, the team narrowed this down to a 4-predictor model (advertising spend, headcount, market index, and average deal size) that achieved the lowest AIC score of all models evaluated.
The four excluded variables, while individually correlated with revenue, added complexity without meaningfully improving the model once the core four were included.
There is no single correct method for model selection. Forward, backward, and stepwise approaches can yield different models from the same data. Always combine algorithmic selection with domain knowledge — a statistically selected variable that makes no business sense may be capturing noise, and an excluded variable that theory says matters may need a different functional form.
A model that performs brilliantly on the data used to build it but fails on new data is overfit. Overfitting occurs when a model learns the noise and peculiarities of the training sample rather than the true underlying pattern. The result is a model that appears highly accurate but cannot generalize.
The simplest defense against overfitting is to split your data into two parts before building the model. A common split is 70-80% for training and 20-30% for testing. The model is fit only on the training set, and then its performance is evaluated on the held-out test set. If performance drops substantially on the test data, overfitting is likely.
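One simple way to make the split in a spreadsheet, assuming the data occupy rows 2 through 41 (layout assumed for illustration): add a helper column with =IF(RAND()<=0.75, "train", "test") and fill down, then copy and paste the column as values so the labels do not change on recalculation. Fit the model using only the rows labeled "train" and reserve the "test" rows for evaluation.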
Cross-validation extends the train/test idea by repeating the split multiple times. In k-fold cross-validation, the data is divided into k equal subsets (folds). The model is trained on k − 1 folds and tested on the remaining fold, rotating through all k possibilities. The average test performance across all folds provides a more reliable estimate of how the model will perform on new data.
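A minimal spreadsheet sketch for assigning folds in 5-fold cross-validation, assuming the data sit in rows 2 through 41 and have already been put in random order (for example by sorting on a temporary =RAND() column):
=MOD(ROW()-2, 5) + 1 in a helper column labels each row with a fold number from 1 to 5. The model is then fit five times, each time holding out one fold, computing test R² on that fold, and averaging the five test values.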
=RSQ(known_y, known_x) returns R², which measures the proportion of variance explained by the model. Compare training R² to test R² to detect overfitting.
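A hedged example of the comparison, assuming actual values in column B, the model's predictions in column D, training data in rows 2-31, and held-out test data in rows 32-41 (all ranges illustrative):
=RSQ(B2:B31, D2:D31) gives the training R² and =RSQ(B32:B41, D32:D41) gives the test R². A large drop from the first value to the second is the warning sign discussed below.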
NorthStar initially built a model with all 8 candidate predictors. On the training data, this model achieved an impressive R² = 0.95. However, when evaluated on held-out test data, the R² plummeted to just 0.61. This dramatic drop is a textbook sign of overfitting — the 8-predictor model memorized patterns specific to the training sample.
The leaner 4-predictor model selected earlier achieved R² = 0.82 on training data and R² = 0.79 on test data. The small gap between training and test performance confirms this model generalizes well. Sometimes less is more.
A large gap between training and test R² is the hallmark of overfitting. Always evaluate model performance on data the model has never seen. Cross-validation provides a more robust estimate than a single train/test split because it averages performance across multiple holdout samples.
Model selection and diagnostics are the quality-control steps of regression analysis. A model is only as trustworthy as its residuals are well-behaved, and only as useful as its performance on new data.
Residual Analysis: Plot residuals against fitted values, use Q-Q plots to assess normality, and look for patterns that signal violations of linearity, homoscedasticity, normality, or independence.
Model Selection: Forward selection, backward elimination, and stepwise procedures provide algorithmic approaches. AIC and BIC balance fit against complexity. Always combine statistical methods with domain expertise.
Overfitting: A model that fits training data too closely will fail on new data. Train/test splits and cross-validation detect overfitting by evaluating out-of-sample performance. A small gap between training and test metrics signals a generalizable model.