  x1 x2 x3 x4   y1   y2    y3   y4
1 10 10 10  8 8.04 9.14  7.46 6.58
2  8  8  8  8 6.95 8.14  6.77 5.76
3 13 13 13  8 7.58 8.74 12.74 7.71
4  9  9  9  8 8.81 8.77  7.11 8.84
Julia Piaskowski
March 10, 2026
\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
\(Y_i\) = dependent variable (there is only 1)
\(X_i\) = independent variable(s) (there may be many)
\(\beta_0\) = model intercept, the expected value of \(Y\) when \(X = 0\)
\(\beta_1\) = slope, how \(Y\) changes with each unit change in \(X\)
\(\epsilon_i\) = model residual, the gap between the predicted value for \(Y_i\) and its observed value
\[ \epsilon_i \sim N(0, \sigma)\]
The residuals are normally distributed with a mean of zero and some standard deviation that we will estimate during the model-fitting process.
All model residuals are independent and identically distributed ("i.i.d."): they are uncorrelated with each other and drawn from the same distribution, with a single shared variance
\[\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k\]
\[\hat{Y} = \mathbf{XB} \]
\(\hat{Y}\) = predicted/fitted value
\(\beta_0\) = model intercept
\(\beta_1, \beta_2,...\) = model coefficients
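The matrix form \(\mathbf{XB}\) can be seen directly in R; a minimal sketch using the built-in anscombe data (used later in these slides):

```r
# the design matrix X for a simple regression: a column of ones (the intercept)
# followed by the predictor values
head(model.matrix(~ x1, data = anscombe))
```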
…a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.
[A confidence interval percentage] is the frequency with which other unobserved intervals will contain the true effect… if all the assumptions used to compute the intervals were correct.
– Greenland et al., 2016
The confidence level instead reflects the long-run reliability of the method used to generate the interval… if the same sampling procedure were repeated 100 times from the same population, approximately 95 of the resulting intervals would be expected to contain the true population mean. The frequentist approach sees the true population mean as a fixed unknown constant, while the confidence interval is calculated using data from a random sample.
Simple linear regression: one continuous independent variable
- print(m1) is a pretty printing of some output
- summary(m1) is a function with pre-selected and formatted output
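The output below comes from fitting the first Anscombe x-y pair; a sketch of the calls (the model name m1 is assumed from the surrounding slides):

```r
m1 <- lm(y1 ~ x1, data = anscombe)  # first x-y pair of the Anscombe quartet

print(m1)    # brief: the call and the coefficients
summary(m1)  # coefficients with SEs and t-tests, R-squared, F-statistic
```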
Call:
lm(formula = y1 ~ x1, data = anscombe)
Residuals:
Min 1Q Median 3Q Max
-1.92127 -0.45577 -0.04136 0.70941 1.83882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0001 1.1247 2.667 0.02573 *
x1 0.5001 0.1179 4.241 0.00217 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
Model coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0000909 1.1247468 2.667348 0.025734051
x1 0.5000909 0.1179055 4.241455 0.002169629
Other valuable output
[1] 1.236603
[1] 0.6665425
[1] 0.6294916
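The three values above (residual standard error, \(R^2\), adjusted \(R^2\)) can be extracted directly from the fitted model; a sketch, assuming the model object is named m1:

```r
m1 <- lm(y1 ~ x1, data = anscombe)

sigma(m1)                  # residual standard error: 1.2366
summary(m1)$r.squared      # multiple R-squared: 0.6665
summary(m1)$adj.r.squared  # adjusted R-squared: 0.6295
```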
\(R^2_{Adj} = 1 - \frac {(1-R^2)(n-1)} {n - k - 1}\)
Other valuable output: F-statistic
- overall indicator of model fit
- comparing to a null model: y ~ 1 (intercept-only model)
(I tend not to need these unless doing something very specific)
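The comparison to the intercept-only model can be run explicitly with anova(); a sketch (the names m0 and m1 are assumptions):

```r
m0 <- lm(y1 ~ 1, data = anscombe)   # null model: intercept only
m1 <- lm(y1 ~ x1, data = anscombe)

anova(m0, m1)  # F = 17.99 on 1 and 9 df, matching the summary() output
```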
From the model object
[1] "lm"
[1] add1 alias anova case.names coerce
[6] confint cooks.distance deviance dfbeta dfbetas
[11] drop1 dummy.coef effects extractAIC family
[16] formula fortify hatvalues influence initialize
[21] kappa labels logLik model.frame model.matrix
[26] nobs plot predict print proj
[31] qqnorm qr residuals rstandard rstudent
[36] show simulate slotsFromS3 summary variable.names
[41] vcov
see '?methods' for accessing help and source code
These are built-in methods for this class of object; which methods are available depends on the packages loaded
- confint() provides confidence intervals of the parameters
- extractAIC() and logLik() return fit statistics
- hatvalues(), dfbetas(), cooks.distance(), and influence() are model diagnostics tools
- drop1() and add1() drop/add terms sequentially and evaluate model fit statistics (not applicable for simple linear regression)
- plot() provides diagnostic plots (= plot.lm())
- simulate() simulates data under the estimated model parameters
- residuals(), rstandard(), rstudent() produce raw, Pearson and studentized residuals, respectively
- predict() is for predicting a new data set

Confidence Intervals
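For example, confidence intervals for the intercept and slope (m1 assumed from the earlier slides):

```r
m1 <- lm(y1 ~ x1, data = anscombe)

confint(m1, level = 0.95)  # 95% CIs for (Intercept) and x1
```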
Influential data points
$hat
1 2 3 4 5 6 7
0.10000000 0.10000000 0.23636364 0.09090909 0.12727273 0.31818182 0.17272727
8 9 10 11
0.31818182 0.17272727 0.12727273 0.23636364
$coefficients
(Intercept) x1
1 0.0003939394 3.939394e-04
2 -0.0097529844 5.133150e-04
3 0.5946796537 -9.148918e-02
4 0.1309090909 -4.763503e-19
5 0.0142575758 -3.564394e-03
6 0.0193030303 -2.757576e-03
7 0.5039170829 -4.085814e-02
8 -0.5430000000 4.936364e-02
9 -0.3435154845 6.062038e-02
10 -0.4902121212 3.501515e-02
11 0.0982727273 -8.545455e-03
$sigma
1 2 3 4 5 6 7 8
1.311535 1.311479 1.056460 1.218483 1.310017 1.311496 1.219936 1.272721
9 10 11
1.099742 1.147055 1.309605
$wt.res
1 2 3 4 5 6
0.03900000 -0.05081818 -1.92127273 1.30909091 -0.17109091 -0.04136364
7 8 9 10 11
1.23936364 -0.74045455 1.83881818 -1.68072727 0.17945455
How much single data points influence model parameters
(DF refers to difference)
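The $hat, $coefficients, $sigma, and $wt.res components shown above are the return value of influence(); a sketch of the diagnostic calls (m1 assumed):

```r
m1 <- lm(y1 ~ x1, data = anscombe)

influence(m1)       # returns $hat, $coefficients, $sigma, and $wt.res, as above
cooks.distance(m1)  # overall influence of each point on the fitted values
dfbetas(m1)         # standardized change in each coefficient if a point is dropped
```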
\(R^2\) coefficient of determination \[R^2 = 1 - \frac {SS_{error}}{SS_{reg} + SS_{error}} = \frac {SS_{reg}}{SS_{reg} + SS_{error}}\]
\[0 \leq R^2 \leq 1\]
For measuring the strength of a regression
\(R^2\) exists, \(R\) does not
\(r\) coefficient of correlation:
\[r_{xy} = \frac{s_{xy}}{s_x s_y}\]
\[-1 \leq r \leq 1\]
For understanding pairwise relationships
\(r^2\) = \(R^2\) only in the case of simple linear regression
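This equivalence is easy to check on the Anscombe data from earlier:

```r
r <- cor(anscombe$x1, anscombe$y1)  # pairwise correlation
r^2                                 # equals the Multiple R-squared (0.6665) above

summary(lm(y1 ~ x1, data = anscombe))$r.squared
```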
Several continuous variables affecting differences in a continuous dependent variable
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Call:
lm(formula = mpg ~ disp + hp + drat, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.1225 -1.8454 -0.4456 1.1342 6.4958
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.344293 6.370882 3.036 0.00513 **
disp -0.019232 0.009371 -2.052 0.04960 *
hp -0.031229 0.013345 -2.340 0.02663 *
drat 2.714975 1.487366 1.825 0.07863 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.008 on 28 degrees of freedom
Multiple R-squared: 0.775, Adjusted R-squared: 0.7509
F-statistic: 32.15 on 3 and 28 DF, p-value: 3.28e-09
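A sketch of the call that produces the summary above (the model name m2 is an assumption):

```r
m2 <- lm(mpg ~ disp + hp + drat, data = mtcars)
summary(m2)
```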
(Regression with categorical variables)
(ignoring block)
N: 2 levels, 0 and 1
P: 2 levels, 0 and 1
K: 2 levels, 0 and 1
\[ Y_{ijk} = \beta_0 + \beta_1 N + \beta_2 P + \beta_3 K \]
(with block)
6 levels of block
block 1: 0 only
block 2: 0 and 1
block 3: 0 and 1
….(blocks 4, 5, 6): 0 and 1
(set one level as level zero or reference level)
| ID | block1 | block2 | block3 | block4 | block5 | block6 |
|---|---|---|---|---|---|---|
| A | 0 | 1 | 0 | 0 | 0 | 0 |
| B | 0 | 0 | 1 | 0 | 0 | 0 |
| C | 0 | 0 | 0 | 1 | 0 | 0 |
| D | 0 | 0 | 0 | 0 | 1 | 0 |
| E | 0 | 0 | 0 | 0 | 0 | 1 |
\[ Y_{npk} = \beta_0 +...+ \beta_4 Bl2 + \beta_5 Bl3 + \beta_6 Bl4 + \beta_7 Bl5 + \beta_8 Bl6\]
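R builds this dummy coding automatically; model.matrix() shows it for the npk data used in this example:

```r
# block has 6 levels; block1 is the reference level, absorbed into the intercept
head(model.matrix(yield ~ N + P + K + block, data = npk))
```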
Call:
lm(formula = yield ~ N + P + K + block, data = npk)
Residuals:
Min 1Q Median 3Q Max
-7.0000 -1.7083 -0.0833 2.2458 6.4833
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 53.800 2.450 21.955 8.13e-13 ***
N1 5.617 1.634 3.438 0.00366 **
P1 -1.183 1.634 -0.724 0.47999
K1 -3.983 1.634 -2.438 0.02767 *
block2 3.425 2.830 1.210 0.24483
block3 6.750 2.830 2.386 0.03068 *
block4 -3.900 2.830 -1.378 0.18831
block5 -3.500 2.830 -1.237 0.23512
block6 2.325 2.830 0.822 0.42412
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.002 on 15 degrees of freedom
Multiple R-squared: 0.7259, Adjusted R-squared: 0.5798
F-statistic: 4.966 on 8 and 15 DF, p-value: 0.003761
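A sketch of the call behind this output (the model name m3 is an assumption):

```r
m3 <- lm(yield ~ N + P + K + block, data = npk)
summary(m3)
```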
Type 1 sums of squares
Analysis of Variance Table
Response: yield
Df Sum Sq Mean Sq F value Pr(>F)
N 1 189.28 189.282 11.8210 0.00366 **
P 1 8.40 8.402 0.5247 0.47999
K 1 95.20 95.202 5.9455 0.02767 *
block 5 343.29 68.659 4.2879 0.01272 *
Residuals 15 240.19 16.012
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Type 3 sums of squares
Single term deletions
Model:
yield ~ N + P + K + block
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 240.19 73.281
N 1 189.28 429.47 85.228 11.8210 0.00366 **
P 1 8.40 248.59 72.106 0.5247 0.47999
K 1 95.20 335.39 79.294 5.9455 0.02767 *
block 5 343.29 583.48 84.583 4.2879 0.01272 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Type 1: depends on order of independent variables
Type 3: best for factorials
Type 2: use if you are sure there are no interactions
These only matter if the data are unbalanced
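The two tables above come from anova() (Type 1, sequential) and drop1() (Type 3, marginal); a sketch (m3 assumed):

```r
m3 <- lm(yield ~ N + P + K + block, data = npk)

anova(m3)              # Type 1: each term tested sequentially, in formula order
drop1(m3, test = "F")  # Type 3: each term tested after all other terms
```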
Linear mixed model fit by REML ['lmerModLmerTest']
Formula: yield ~ N + P + K + (1 | block)
Data: npk
REML criterion at convergence: 128.057
Random effects:
Groups Name Std.Dev.
block (Intercept) 3.628
Residual 4.002
Number of obs: 24, groups: block, 6
Fixed Effects:
(Intercept) N1 P1 K1
54.650 5.617 -1.183 -3.983
[1] 4.001541
Groups Name Variance Std.Dev.
block (Intercept) 13.162 3.6279
Residual 16.012 4.0015
2.5 % 97.5 %
.sig01 1.201048 7.3624368
.sigma 2.720672 5.2696730
(Intercept) 50.396060 58.9039408
N1 2.530694 8.7026395
P1 -4.269306 1.9026395
K1 -7.069306 -0.8973605
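A sketch of the mixed-model fit behind this output (the model name m4 is an assumption; use of the lmerTest package is inferred from the 'lmerModLmerTest' class shown above):

```r
library(lmerTest)  # extends lme4::lmer; fitted models get class 'lmerModLmerTest'

m4 <- lmer(yield ~ N + P + K + (1 | block), data = npk)
sigma(m4)    # residual standard deviation
VarCorr(m4)  # variance components for block and residual
confint(m4)  # profile CIs: .sig01 = block SD, .sigma = residual SD
```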
Many of these functions report conditions of the model fit (e.g. isREML(), isSingular()) rather than estimated quantities.
So many methods
[1] anova as.function coef confint cooks.distance
[6] deviance df.residual drop1 extractAIC family
[11] fitted fixef formula fortify getData
[16] getL getME hatvalues influence isGLMM
[21] isLMM isNLMM isREML isSingular logLik
[26] model.frame model.matrix na.action ngrps nobs
[31] plot predict print profile ranef
[36] refit refitML rePCA residuals rstudent
[41] show sigma simulate summary terms
[46] update VarCorr vcov weights
see '?methods' for accessing help and source code
[1] anova coerce coerce<- contest contest1D contestMD
[7] difflsmeans drop1 getL isSingular ls_means lsmeansLT
[13] show step summary update
see '?methods' for accessing help and source code
methods() to explore this content

An R Companion to Applied Regression (2019), \(3^{rd}\) Ed., by John Fox and Sanford Weisberg.
Learn more about functions for detecting influential data
Linear Mixed Model Guide: details on implementation of linear mixed models in R