A regression equation may look impressive on paper, but the real test of a model lies in its residuals — the differences between the observed values and the values the model predicts. Residuals are the voice of the data telling you what the model missed. Examining them carefully is the single most important diagnostic step after fitting any regression.
Every residual is defined as the observed value minus the predicted value:
The residual for observation i is e_i = y_i − ŷ_i, or in spreadsheet form =Actual - Predicted, where y_i is the observed value and ŷ_i is the predicted value from the regression model.
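As a minimal spreadsheet sketch, assume observed values in column B and predicted values in column C, with data starting in row 2 (the layout is illustrative, not prescribed by the model):
=B2-C2 entered in D2 and filled down gives the residual for each observation.
=SUM(D2:D41) should be essentially zero for a least-squares fit that includes an intercept; a clearly nonzero sum usually points to a calculation error.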
Residual plots reveal whether the four key assumptions of linear regression hold: linearity, independence of errors, homoscedasticity (constant error variance), and normality of errors.
Three plots form the core of residual analysis: the plot of residuals against fitted values (to check linearity and constant variance), the normal Q-Q plot of residuals (to check normality), and the plot of residuals in observation order (to check independence).
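Building these plots in a spreadsheet requires fitted values first. A hedged sketch, assuming revenue in B2:B41 and three predictors in C2:E41 (ranges are illustrative):
=TREND($B$2:$B$41, $C$2:$E$41) returns the fitted value for each observation (entered as an array formula in older Excel; it spills automatically in current versions).
=B2-F2, with the fitted values in column F, gives the residuals, which you then chart against the fitted values in a scatter plot.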
NorthStar's analytics team built a regression model predicting quarterly revenue from advertising spend, headcount, and market index. When they plotted residuals against fitted values, they noticed a clear fan shape — residuals spread wider as fitted values increased. This is the classic signature of heteroscedasticity: the model's prediction errors grow larger for higher-revenue quarters.
The team addressed this by applying a log transformation to revenue before fitting the model, which stabilized the variance and produced a random residual scatter.
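A minimal spreadsheet version of that step, assuming quarterly revenue sits in column B: add a helper column with =LN(B2) filled down, then refit the regression using the log-revenue column as the dependent variable. Predictions come back on the log scale, so convert them with =EXP(prediction) when revenue is needed in original units.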
Always plot your residuals before trusting a regression model. A good-looking R-squared can hide serious assumption violations. Fan-shaped residuals indicate heteroscedasticity, curved patterns indicate non-linearity, and clusters may indicate missing variables or interaction effects.
In practice, analysts rarely know in advance which predictors belong in a model. With many candidate variables available, the challenge is selecting a subset that balances predictive power against complexity. Including too few variables produces a model that misses important relationships; including too many creates noise, instability, and overfitting.
Forward selection: Start with no predictors. At each step, add the variable that most improves the model (typically the one with the lowest p-value or greatest reduction in AIC). Continue adding variables until no remaining candidate meets the entry criterion.
Backward elimination: Start with all candidate predictors. At each step, remove the variable that contributes the least (typically the one with the highest p-value or whose removal most reduces AIC). Continue removing variables until every remaining predictor meets the retention criterion.
Stepwise selection: A hybrid of the forward and backward approaches. At each step, the algorithm can both add and remove variables. A predictor that was useful early on may become redundant once other variables enter the model, so stepwise selection allows the model to correct itself as it builds.
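A hedged sketch of a single forward-selection step in a spreadsheet, assuming revenue in B2:B41 and one candidate predictor in C2:C41 (ranges are illustrative):
=INDEX(LINEST(B2:B41, C2:C41, TRUE, TRUE), 5, 2) returns the residual sum of squares (SSE) of the one-predictor fit; LINEST's additional statistics place SSE in row 5, column 2 of its output.
=40*LN(SSE/40) + 2*2 then converts that SSE to an AIC for this model, with n = 40 observations and k = 2 parameters (slope and intercept). Repeat for each candidate, add the predictor with the lowest AIC, and continue with two-predictor fits, and so on.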
Rather than relying on p-values, information criteria provide a single score that balances goodness-of-fit against model complexity. Lower values are better for both AIC and BIC.
AIC = n·ln(SSE/n) + 2k, computed in a spreadsheet as =n*LN(SSE/n) + 2*k, where n is the number of observations, SSE is the residual sum of squares, and k is the number of estimated parameters (including the intercept). This is the least-squares form of the general definition AIC = 2k − 2·ln(L), where L is the maximized likelihood. Lower AIC indicates a better balance of fit and parsimony.
BIC = n·ln(SSE/n) + k·ln(n), computed as =n*LN(SSE/n) + k*LN(n). BIC penalizes each additional parameter more heavily than AIC (by ln(n) instead of 2), so it tends to select simpler models when the sample size is large.
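As a worked illustration with assumed numbers: for a model fit to n = 40 quarters with k = 5 parameters (four predictors plus the intercept),
=40*LN(SSE/40) + 2*5 gives the AIC and =40*LN(SSE/40) + 5*LN(40) gives the BIC. Because LN(40) is about 3.7, each parameter costs roughly 3.7 points under BIC versus 2 under AIC, which is why BIC tends to favor the smaller model when the two criteria disagree.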
NorthStar's data science team had 8 candidate predictors for their quarterly revenue model: advertising spend, headcount, market index, customer satisfaction score, web traffic, number of product lines, average deal size, and region. Using backward elimination with AIC, the team narrowed this down to a 4-predictor model (advertising spend, headcount, market index, and average deal size) that achieved the lowest AIC score of all models evaluated.
The four excluded variables, while individually correlated with revenue, added complexity without meaningfully improving the model once the core four were included.
There is no single correct method for model selection. Forward, backward, and stepwise approaches can yield different models from the same data. Always combine algorithmic selection with domain knowledge — a statistically selected variable that makes no business sense may be capturing noise, and an excluded variable that theory says matters may need a different functional form.
A model that performs brilliantly on the data used to build it but fails on new data is overfit. Overfitting occurs when a model learns the noise and peculiarities of the training sample rather than the true underlying pattern. The result is a model that appears highly accurate but cannot generalize.
The simplest defense against overfitting is to split your data into two parts before building the model. A common split is 70-80% for training and 20-30% for testing. The model is fit only on the training set, and then its performance is evaluated on the held-out test set. If performance drops substantially on the test data, overfitting is likely.
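One simple way to make the split in a spreadsheet, assuming the data occupy rows 2 through 41 (layout assumed for illustration): add a helper column with =IF(RAND()<=0.75, "train", "test") and fill down, then copy and paste the column as values so the labels do not change on recalculation. Fit the model using only the rows labeled "train" and reserve the "test" rows for evaluation.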
Cross-validation extends the train/test idea by repeating the split multiple times. In k-fold cross-validation, the data is divided into k equal subsets (folds). The model is trained on k − 1 folds and tested on the remaining fold, rotating through all k possibilities. The average test performance across all folds provides a more reliable estimate of how the model will perform on new data.
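A minimal spreadsheet sketch for assigning folds in 5-fold cross-validation, assuming the data sit in rows 2 through 41 and have already been put in random order (for example by sorting on a temporary =RAND() column):
=MOD(ROW()-2, 5) + 1 in a helper column labels each row with a fold number from 1 to 5. The model is then fit five times, each time holding out one fold, computing test R² on that fold, and averaging the five test values.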
=RSQ(known_y, known_x) returns R², which measures the proportion of variance explained by the model. Compare training R² to test R² to detect overfitting.
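A hedged example of the comparison, assuming actual values in column B, the model's predictions in column D, training data in rows 2-31, and held-out test data in rows 32-41 (all ranges illustrative):
=RSQ(B2:B31, D2:D31) gives the training R² and =RSQ(B32:B41, D32:D41) gives the test R². A large drop from the first value to the second is the warning sign discussed below.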
NorthStar initially built a model with all 8 candidate predictors. On the training data, this model achieved an impressive R² = 0.95. However, when evaluated on held-out test data, the R² plummeted to just 0.61. This dramatic drop is a textbook sign of overfitting — the 8-predictor model memorized patterns specific to the training sample.
The leaner 4-predictor model selected earlier achieved R² = 0.82 on training data and R² = 0.79 on test data. The small gap between training and test performance confirms this model generalizes well. Sometimes less is more.
A large gap between training and test R² is the hallmark of overfitting. Always evaluate model performance on data the model has never seen. Cross-validation provides a more robust estimate than a single train/test split because it averages performance across multiple holdout samples.
Model selection and diagnostics are the quality-control steps of regression analysis. A model is only as trustworthy as its residuals are well-behaved, and only as useful as its performance on new data.
Residual Analysis: Plot residuals against fitted values, use Q-Q plots to assess normality, and look for patterns that signal violations of linearity, homoscedasticity, normality, or independence.
Model Selection: Forward selection, backward elimination, and stepwise procedures provide algorithmic approaches. AIC and BIC balance fit against complexity. Always combine statistical methods with domain expertise.
Overfitting: A model that fits training data too closely will fail on new data. Train/test splits and cross-validation detect overfitting by evaluating out-of-sample performance. A small gap between training and test metrics signals a generalizable model.