Assessing the Linear Modelling Assumption of Normality



Julia Piaskowski

February 24, 2026



“My data is not normal. I did a Shapiro-Wilk test on the data and it failed that test. What transformation should I use?”

Linear Model Review

\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

\(Y_i\) = dependent variable (there is only 1)

\(X_i\) = independent variable(s) (there may be many)

\(\beta_0\) = model intercept, the overall mean of \(Y\)

\(\beta_1\) = how \(Y\) changes with \(X\)

\(\epsilon_i\) = model residual, the gap between the predicted value for \(Y_i\) and its observed value

The Residual Is Important

\[ \epsilon_i \sim N(0, \sigma^2)\]

  • The residuals are normally distributed with a mean of zero and some variance that we will estimate during the model-fitting process.

  • All model residuals share this same distribution and distributional properties! That is, they are identically distributed.

  • The residuals are independent of each other (i.e. they are uncorrelated with each other)

  • The residuals should have no correlation with other model parameters (e.g. treatment effects)
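These assumptions can be checked directly from a fitted model's residuals. Below is a minimal sketch in base R; the data set, seed, and effect sizes are all hypothetical, chosen only for illustration:

```r
# Simulate a small hypothetical experiment: 3 treatments, 20 reps each
set.seed(42)
dat <- data.frame(trt = rep(c("A", "B", "C"), each = 20))
dat$y <- 10 + 2 * (dat$trt == "B") + 4 * (dat$trt == "C") + rnorm(60)

mod <- lm(y ~ trt, data = dat)
res <- residuals(mod)

mean(res)                 # essentially zero for a model with an intercept
qqnorm(res); qqline(res)  # points near the line suggest normality
hist(res)                 # roughly bell-shaped
```

A quantile-quantile plot of the residuals is usually more informative than any single test statistic.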

The Distribution of \(Y\)

Technically in a linear model, \(\bar{y}_{i \cdot}\), the mean value for a population, follows a normal distribution for a set of known parameters:

\[ \bar{y}_{i \cdot} \sim N(x_i \beta, \sigma^2/r_i) \]

We often cannot observe this distribution well with our limited number of reps. But if we repeated a study an infinite number of times, a normal distribution for \(\bar{y}_{i \cdot}\) would likely be observed.

Example Analyses 1

Example Analyses 2

Example Analyses 3

Early Lessons

  • Often, the residuals are normally distributed, and hence model expectations are met!

  • Linear models are considered robust to modest departures from normality; that is, the estimates will be fine and the results from ANOVA will be correct.

  • The Shapiro-Wilk test is not great for assessing normality; it is very sensitive to the size of the data set. Large data sets may fail based on sampling variation alone, while small data sets may pass because there are too few observations to adequately test.

Sampling Variation & Normality

The function rnorm() can be used to generate random values that follow a normal distribution, which we can then test for normality using shapiro.test().


    Shapiro-Wilk normality test

data:  rnorm(60)
W = 0.98675, p-value = 0.7603
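The output above comes from a single random sample, so the exact numbers will differ from run to run. As a sketch of how much the test result varies by chance, the test can be repeated on many freshly drawn normal samples (the seed and sample sizes below are arbitrary choices):

```r
# Repeat the Shapiro-Wilk test on 1000 samples that are normal by construction
set.seed(101)
pvals <- replicate(1000, shapiro.test(rnorm(60))$p.value)

# About 5% of truly-normal samples "fail" at alpha = 0.05 by chance alone
mean(pvals < 0.05)
```

Even with perfectly normal data, roughly one sample in twenty is flagged as non-normal at the 0.05 level.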

Sampling Variation & Normality

The Central Limit Theorem


In probability theory, the central limit theorem states that, under appropriate conditions, the distribution of a normalized version of the sample mean converges to a standard normal distribution.
–Wikipedia
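The theorem can be sketched with simulated data: draws from a strongly right-skewed distribution are clearly non-normal, but means of repeated samples from that same distribution are nearly symmetric. The distribution, seed, and sample sizes here are arbitrary choices for illustration:

```r
# CLT sketch: exponential draws are skewed; their sample means are not
set.seed(7)
raw   <- rexp(5000, rate = 1)                       # skewed population
means <- replicate(2000, mean(rexp(30, rate = 1)))  # sample means, n = 30

# Simple moment-based skewness (0 for a symmetric distribution)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(raw)    # near 2, the theoretical skewness of an exponential
skew(means)  # much closer to 0: the means are nearly symmetric
```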

What Data Are Truly Not Normal?

  • Count data (unless the values are very high, e.g. hundreds)
  • Data that lack variability; most data points are a single value (emergence and crop stand data)
  • Binomial data (yes/no)
  • Data with categorical outcomes (rating scales)
  • Data that have many zeros
  • Extremely skewed data

Count data

  • Things you count: insects, calves born, trips to the grocery store (requires a generalized linear model)
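A minimal sketch of such a model, using a hypothetical insect-count data set (the treatment names, seed, and rates are invented for illustration); `glm()` with `family = poisson` is one standard choice:

```r
# Hypothetical insect counts under two treatments
set.seed(11)
counts <- data.frame(trt = rep(c("control", "sprayed"), each = 15))
counts$insects <- rpois(30, lambda = ifelse(counts$trt == "control", 12, 5))

# Poisson GLM instead of a normal-errors linear model
pois_mod <- glm(insects ~ trt, data = counts, family = poisson)

exp(coef(pois_mod))   # back-transformed: expected count and rate ratio
```

The coefficients are on the log scale, so exponentiating returns them to the count scale.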

No Data Variability

Binomial data

  • These are a set number of evaluations where only two possible outcomes are possible for each
  • The results might be expressed as the proportion of trials with one outcome: the number of plants that survived after a cold shock, or the number of insects that chose a volatile compound.
  • Requires a generalized linear model (logistic regression)
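A sketch of a logistic regression fit for this situation, using invented survival data (the treatments, seed, and probabilities are hypothetical); supplying successes and failures via `cbind()` is one common way to fit binomial counts:

```r
# Hypothetical cold-shock survival: 20 plants per plot, 10 plots per treatment
set.seed(3)
surv <- data.frame(trt = rep(c("control", "primed"), each = 10),
                   n_plants = 20)
p <- ifelse(surv$trt == "control", 0.4, 0.7)
surv$survived <- rbinom(20, size = surv$n_plants, prob = p)

# Logistic regression: successes and failures on the left-hand side
bin_mod <- glm(cbind(survived, n_plants - survived) ~ trt,
               data = surv, family = binomial)

exp(coef(bin_mod))   # odds for the reference group and the odds ratio
```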

Binomial data

Alternatives for Non-normal data

Categorical

  • Likert scales (strong disagree to agree)
  • disease rating (very resistant to very tolerant)
  • special models: multinomial, ordinal (very complicated)

Many zeros

  • common in count data
  • common in relative abundance data from metagenomics
  • zero-inflated models (complicated)

Skewed Distributions

  • extreme tails on the left or more commonly, the right side of histograms
  • no single solution; many other distributions accommodate skew: beta (for percentages), lognormal, exponential, Tweedie, … (complicated)
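As one sketch of this approach, a Gamma GLM with a log link can handle a right-skewed, strictly positive response. The data below are simulated and hypothetical; note that the factor reference level is "high" because factor levels default to alphabetical order:

```r
# Hypothetical right-skewed response under two treatments
set.seed(13)
dat <- data.frame(trt = rep(c("low", "high"), each = 20))
mu  <- ifelse(dat$trt == "low", 5, 12)
dat$y <- rgamma(40, shape = 2, rate = 2 / mu)   # mean = shape/rate = mu

# Gamma GLM with a log link, one alternative to forcing normal errors
gam_mod <- glm(y ~ trt, data = dat, family = Gamma(link = "log"))

exp(coef(gam_mod))   # mean for "high" and the "low"/"high" ratio
```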

Skewed Distributions


Alternatives for Non-normal Data

Non-parametric hypothesis tests (No distributional assumptions)

  • Mann-Whitney \(U\) test/Wilcoxon rank-sum test for 2 samples: wilcox.test()
  • Kruskal-Wallis for 3+ samples: kruskal.test()
  • Wilcoxon signed-rank test for paired samples: wilcox.test(..., paired = TRUE)
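The first two of these can be sketched on simulated skewed data (the groups, seed, and rates below are hypothetical):

```r
# Hypothetical skewed samples for rank-based tests
set.seed(5)
y1 <- rexp(20, rate = 1)     # group 1
y2 <- rexp(20, rate = 0.5)   # group 2, larger values on average

# Mann-Whitney / Wilcoxon rank-sum test for two samples
wt <- wilcox.test(y1, y2)

# Kruskal-Wallis for three groups
g  <- factor(rep(c("a", "b", "c"), each = 15))
y  <- rexp(45, rate = c(1, 0.7, 0.4)[as.integer(g)])
kt <- kruskal.test(y ~ g)
```

Both return `htest` objects with a p-value but, as noted below, no effect estimates.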

Drawbacks

  • only conducting a hypothesis test, no estimation

Alternatives for Non-normal Data

Permutations

  1. Calculate the test statistic from your data (e.g. difference in treatment means).

  2. Simulate data in which the test statistic is null (e.g. the difference in treatment means is zero).

  3. Find out how often a test statistic as extreme as yours occurs in the simulated data; that proportion is your p-value.
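The three steps can be sketched with a simple two-group comparison; the data, seed, and group means below are hypothetical:

```r
# Hypothetical two-group data with a true difference in means
set.seed(9)
trt <- rep(c("A", "B"), each = 12)
y   <- c(rnorm(12, mean = 10), rnorm(12, mean = 12))

# Step 1: the observed test statistic (difference in treatment means)
obs_diff <- diff(tapply(y, trt, mean))

# Step 2: shuffle the labels so the true difference is zero
perm_diff <- replicate(5000, {
  y_perm <- sample(y)
  diff(tapply(y_perm, trt, mean))
})

# Step 3: how often is a shuffled difference as extreme as the observed one?
pval <- mean(abs(perm_diff) >= abs(obs_diff))   # two-sided permutation p-value
```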

Permutations


Andrew Heiss’ Null Worlds

Permutation Example

Alfalfa data set

  • split plot: main plot is date of harvest (4 levels), split plot is variety (3 levels)
library(nlme)

data("Alfalfa", package = "nlme")

alf_mod <- lme(Yield ~ Variety + Date,
               random = ~ 1 | Block/Date,
               data = Alfalfa)

Alfalfa Permutation Example

Permutations

sim_func <- function(){
  N <- nrow(Alfalfa)
  # permute the response, breaking any treatment-yield association
  Alfalfa$yield2 <- sample(Alfalfa$Yield, size = N, replace = FALSE)
  # return NULL if the model fails on a permuted data set
  tryCatch(lme(yield2 ~ Variety + Date,
               random = ~ 1 | Block/Date,
               data = Alfalfa),
           error = function(e) NULL)
}

perms <- replicate(500, sim_func(), simplify = FALSE)
perms <- Filter(Negate(is.null), perms)   # drop any failed fits

Permuted ANOVA

perm_anova <- lapply(perms, anova, type = "marginal") |>  # conduct ANOVA on each permuted fit
  dplyr::bind_rows() |>                                   # some data formatting
  tibble::rownames_to_column(var = "variable") |> 
  dplyr::mutate(variable = stringr::str_extract(variable, "[A-Za-z]+")) |> 
  dplyr::filter(variable != "Intercept")                  # remove the intercept (optional step)

  variable numDF denDF  F-value   p-value
1  Variety     2    46 1.076281 0.3492884
2     Date     3    15 1.607674 0.2294616
3  Variety     2    46 1.369906 0.2643028
4     Date     3    15 1.373157 0.2891074

(alf_aov <- anova(alf_mod, type = "marginal"))

            numDF denDF   F-value p-value
(Intercept)     1    46 209.03665  <.0001
Variety         2    46   1.91760  0.1585
Date            3    15  14.09297  0.0001

Permuted ANOVA

The R package ‘infer’

  • Provides functions for permutation, automatic calculation of target statistics and visualization
  • Very limited in functionality: can only calculate ANOVA statistics for one independent variable at a time
  • Can only compare means when there are exactly two levels
  • This package works fine for one-way ANOVA, simple linear regression and multiple linear regression.
  • It is ill-suited for multiple-way factorial studies

Outliers

  • Do you know what led to the outlying data point? (Did you drop a sample bag? Did ducks eat your plot? Did your boss insist on something ill advised?). Decide if this is variation that reflects your target inferential conditions or not.
  • If you do not know what caused an extreme value, run an analysis (e.g. a linear model) and look at the absolute value of the residuals. Create a decision criterion for data removal.
  • Data removal is a tricky process that may introduce bias or accusation of bias. In general, be very judicious when excluding outlying data points from an analysis.
  • If a single extreme data point is driving an analytical conclusion, it is definitely worth conducting a sensitivity analysis to see what happens when the extreme data point is excluded.

Outliers

Alfalfa$alf_res <- residuals(alf_mod, type = "normalized")

My cut-off is usually an absolute value of 4 for studentized residuals (99.99% of normally distributed data falls within 4 standard deviations of the mean). This removes very large values and, usually, not too many of them. Only very large data sets need a larger cut-off.
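A sketch of this cut-off using base R's `rstudent()` on a simple `lm` fit (the data and the planted outlier below are hypothetical; the same idea applies to the normalized `lme` residuals above):

```r
# Hypothetical regression data with one planted extreme value
set.seed(21)
dat <- data.frame(x = runif(50))
dat$y <- 3 + 2 * dat$x + rnorm(50)
dat$y[50] <- 25                      # plant an outlier

fit <- lm(y ~ x, data = dat)

# Flag studentized residuals beyond an absolute value of 4
out <- abs(rstudent(fit)) > 4
which(out)                           # the planted point is flagged

# Sensitivity analysis: re-fit without the flagged point
fit2 <- update(fit, data = dat[!out, ])
```

Comparing `fit` and `fit2` shows how much the extreme point drives the conclusions.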

Final Thoughts

  • The assumption of normality is largely for the model residuals.
  • It’s good to check the distribution of the dependent variable, but don’t expect normality at that stage.
  • Most of the time, data are normal enough.
  • Modest departures from normality can be ignored; ANOVA and estimates are robust to this.
  • Non-parametric tests are an option, but they are overall a low-information solution.
  • Manual permutations are a (cumbersome) option.
  • The best choice is always to find the correct distribution for your data.
  • The optimal approach is to use science, not statistics, to justify data removal.

Additional Resources