This document is an introduction to estimating and diagnosing simple and multiple linear regression fixed effect models in SAS. Some additional knowledge of SAS may be useful for understanding certain aspects of this tutorial; readers are referred to the tutorials Data Step and Graphics and Plotting. Application of mixed model methodology to linear regression is also possible. While much of the theoretical background and diagnostics still apply to those models, the details and considerations of setting them up in SAS are covered elsewhere. Likewise, readers interested in fixed or mixed nonlinear modeling are referred to the specific tutorials on those subjects.
In simple linear regression, we define a linear relationship between a continuous dependent response, \(y_i\), and an independent continuous regressor variable, \(x_1\), as:
\[y_{i} = \beta_0 + \beta_1 \cdot x_{1i} + \epsilon_{i}\] where \(\beta_0\) and \(\beta_1\) are an intercept term and regression coefficient, respectively. The last term, \(\epsilon_i\), is the regression error, assumed to be distributed as \(\epsilon_i \sim N(0, \sigma^2)\). The index \(i\) refers to the \(i^{th}\) observation. With some additional considerations, this model can also be theoretically extended to \(k\) multiple regressor variables, e.g. \(x_2, x_3, \ldots, x_k\), and their corresponding coefficients, \(\beta_2, \beta_3, \ldots, \beta_k\).
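Written out in full, the multiple regression model takes the form:
\[y_{i} = \beta_0 + \beta_1 \cdot x_{1i} + \beta_2 \cdot x_{2i} + \cdots + \beta_k \cdot x_{ki} + \epsilon_{i}\]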
Regression models have various uses including prediction of the response \(y_i\), description and exploration of potential relationships between variables, and structural comparisons of estimated relationships between groups.
Two data examples will be used here. The first is adapted from Steel and Torrie, Principles and Procedures of Statistics: A Biometrical Approach, McGraw-Hill Book Co., 1980, pg 244. These data describe a study investigating the changes in the response, photosynthetic rate, that correspond to changes in two regressor variables: irradiation level and leaf resistance to water vapor. A data step to read in these data would be:
data photo;
input photo irrad resist;
cards;
348 294 990
131 190 968
402 294 1868
731 550 1814
526 550 2521
1346 2000 1516
655 550 1935
360 550 4675
618 550 2234
1385 2000 1158
1550 2000 985
1415 2000 1697
1467 2000 646
842 800 1086
927 800 998
1099 1200 911
1086 1200 765
910 1200 1284
1255 1600 915
1137 1600 1410
349 400 4111
498 400 1802
989 800 801
829 800 983
;
run;
The second set of data is taken from Allen and Cady, Analyzing Experimental Data by Regression, Lifetime Learning Pub., 1982, pg 234, and describes soybean yields as they relate to plant height, with an additional qualitative factor describing the controlled lighting conditions: 1) Ambient light, 2) Additional artificial light, and 3) Partial shading. The SAS code for reading these data is:
data soybean;
input yield height treatment;
cards;
12.2 48 1
12.4 52 1
11.9 42 1
11.3 35 1
11.8 40 1
12.1 48 1
13.1 60 1
12.7 61 1
12.4 50 1
11.4 33 1
12.3 48 1
12.2 51 1
12.6 56 1
13.2 65 1
12.3 51 1
16.6 63 2
15.8 50 2
16.5 63 2
15.0 33 2
15.4 38 2
15.6 45 2
15.8 50 2
15.8 48 2
16.0 50 2
15.0 35 2
16.2 50 2
16.7 62 2
15.8 49 2
15.9 52 2
9.5 52 3
9.5 54 3
9.6 58 3
8.8 45 3
9.5 57 3
9.8 62 3
9.1 52 3
10.3 67 3
9.5 55 3
8.5 40 3
8.6 41 3
10.4 67 3
9.4 55 3
10.2 66 3
9.3 56 3
;
run;
A first step in regression analysis should always be an examination of the data. Graphical methods in simple linear regression are particularly useful. Here we use the first data set to produce a scatter plot of photosynthesis rate versus irradiation with the general plotting procedure PROC SGPLOT.
title1 'Least Squares Regression';
proc sgplot data=photo;
scatter y=photo x=irrad;
xaxis label='Irradiation' LABELATTRS=(Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
yaxis label="Photosynthesis Rate" LABELATTRS=(Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
title2 'Observed Data';
run;
This initial plot is useful for discerning characteristics of potential relationships. Here we see that the relationship between photosynthesis and irradiation is positive and approximately linear. Hence, a linear regression is a plausible model for these data, and we expect any estimated model to reflect the positive direction of the relationship.
A common means of estimating linear regression models in SAS is the procedure PROC REG. While there are other procedures that can accomplish this task, the REG procedure has more diagnostic functionality built into it, making it the best choice for a fixed linear regression. As with all SAS procedures, there are required and optional statements. A summary of these is shown below.
MODEL: The only required statement under PROC REG is the MODEL statement. This statement defines the model as well as the order in which variables are included. The general form of the MODEL statement is: dependent variable = regressor variables / option1 option2 … Here, the dependent variable is a numeric response (the variable to be modeled) and the regressor variables are a list of one or more numeric variables paired with the dependent variable. In most applications, both the dependent variable and the regressor variables are continuous, numeric values. Options are keywords that specify various details of the analysis and, as usual in SAS, they are listed at the end of a statement after a slash (/). More than one MODEL statement can be used in each call to PROC REG, if needed.
Unlike the MODEL statement of other procedures, PROC REG does not allow the use of crossed or nested terms. Thus, the symbols *, |, and () should not appear in the MODEL statement. If a crossed term is required in the model, it must be explicitly defined in a preceding DATA STEP or read in with the original data.
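For example, a minimal sketch of creating a crossed (product) term in a data step before calling PROC REG; the data set name photo2 and variable name irrad_x_resist are illustrative, not part of the original data:
data photo2;
set photo;
irrad_x_resist = irrad*resist; * explicit product (interaction) term for use in a PROC REG MODEL statement;
run;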
MODEL Statement Options: PROC REG provides a wide array of options for regression analysis. All of these, however, fall into three broad categories. Printing options determine what details of the analysis should be printed in the output. Examples would be P, CLM, or I for predicted values, confidence limits on the mean response, and the \((X' X)^{-1}\) matrix, respectively. The second category contains diagnostic options. Examples include R, DW, and INFLUENCE for a preliminary residual analysis, the Durbin-Watson statistic, and influence statistics, respectively. The last category of options is used to specify the model itself. Examples of these might be NOINT, SELECTION, and INCLUDE. These, respectively, fit a model with no intercept, request that REG use various model selection procedures when multiple regressor variables are present, and provide a list of regressor variables that must be included in all models tested.
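As an illustration (the particular options chosen here are arbitrary), a MODEL statement combining printing and diagnostic options might look like:
proc reg data=photo;
model photo=irrad/ p clm r dw; * P and CLM print predicted values and mean confidence limits; R and DW request residual analysis and the Durbin-Watson statistic;
run;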
OUTPUT: The OUTPUT statement will save requested regression statistics to a new SAS data set. Examples of these would be predicted values, raw or studentized residuals, confidence intervals, etc. The format of the OUTPUT statement is: OUTPUT OUT=(data set name) option=name1 option=name2 …, where the options might be P for predicted values, RSTUDENT or R for studentized or raw residuals, respectively, or L95M and U95M for lower and upper 95% confidence limits on the estimated mean response. The name1, name2, etc. give the SAS variable names for the respective output statistics. Although PROC REG produces many diagnostic statistics and plots by default, outputting these statistics to a separate data set allows SAS to utilize them in other procedures, for example in further computation or more sophisticated plotting.
The code below demonstrates a call to PROC REG for the photosynthesis data set. The predicted values and residuals are saved to the output data set, STATS.
proc reg data=photo;
model photo=irrad ;
output out=stats p=pred_photo r=resid rstudent = student;
run;
Number of Observations Read | 24 |
---|---|
Number of Observations Used | 24 |
Analysis of Variance

Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 1 | 3549556 | 3549556 | 221.17 | <.0001 |
Error | 22 | 353079 | 16049 | | |
Corrected Total | 23 | 3902635 | | | |
Root MSE | 126.68489 | R-Square | 0.9095 |
---|---|---|---|
Dependent Mean | 868.95833 | Adj R-Sq | 0.9054 |
Coeff Var | 14.57894 |
Parameter Estimates

Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | 1 | 245.42764 | 49.26053 | 4.98 | <.0001 |
irrad | 1 | 0.61512 | 0.04136 | 14.87 | <.0001 |
From the resulting output, we can see that there were n=24 observations used. In the ANOVA table, there were 23 total degrees of freedom (DF = n-1), 1 DF for the model, and 23-1=22 DF for the error term. The model F test (model mean square/error mean square) is large (F=221.17; p<0.0001), indicating a detectable relationship. Summary information for the CV, \(r^2\), and other statistics is given in the next table. The final table gives the estimated parameter values for the intercept and the regression coefficient. The intercept, or predicted photosynthetic rate when irradiation = 0, is 245.4, and the rate of change in photosynthesis with increasing irradiation is 0.62. This coefficient is positive, as expected, has a small p-value, and estimates that photosynthetic rate will increase 0.62 units for every unit increase in irradiation.
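As a quick worked example of using these estimates (an irradiation level of 1000 is chosen arbitrarily), the predicted photosynthetic rate at that level would be:
\[\hat{y} = 245.42764 + 0.61512 \cdot 1000 \approx 860.5\]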
The final section contains default graphics displaying various residual and other diagnostics. For example, the studentized residual plots indicate the residuals were of reasonable magnitude (under normality assumptions we expect them to fall within 2 to 3 units of the mean of 0) and have a somewhat random distribution (there may be some curvature in the residual pattern, indicating a linear model may not completely capture this response relationship). The Cook's D plot suggests that the second data point may be an influential point affecting the regression because it is above the reference line. The final two plots display a larger version of the raw residuals and the predicted line with 95% confidence bounds on the model (shaded area) and on individual points (dashed lines). Overall, this model seems to provide an adequate, simple description of the relationship between photosynthetic rate and irradiation. For more information on regression and regression diagnostics, see Applied Linear Regression Models (Kutner, Nachtsheim, and Neter) and Regression Diagnostics: Identifying Influential Data and Sources of Collinearity (Belsley, Kuh, and Welsch).
We can also produce some of these plots using the output data set STATS. This data set contains all the original data from data set PHOTO with the additional statistics we requested. The code below uses this output data set to produce a more customizable version of these plots.
proc print data=stats;
run;
proc sgplot data=stats;
scatter y=photo x=irrad/FILLEDOUTLINEDMARKERS markerattrs=(symbol=circlefilled size=14) MARKERFILLATTRS= (color=blue)
MARKEROUTLINEATTRS=(color=black);
series y=pred_photo x=irrad;
xaxis label='Irradiation' LABELATTRS=( Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
yaxis label="Photosynthesis Rate" LABELATTRS=( Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
title2 'Observed vs Predicted';
run;
proc sgplot data=stats;
scatter y=resid x=pred_photo/FILLEDOUTLINEDMARKERS markerattrs=(symbol=circlefilled size=14) MARKERFILLATTRS= (color=green)
MARKEROUTLINEATTRS=(color=black);
refline 0/axis=y;
xaxis label='Predicted Value' LABELATTRS=( Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
yaxis label="Residual" LABELATTRS=( Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
title2 'Raw residuals vs Predicted';
run;
proc sgplot data=stats;
scatter y=student x=pred_photo/FILLEDOUTLINEDMARKERS markerattrs=(symbol=circlefilled size=14) MARKERFILLATTRS= (color=orange)
MARKEROUTLINEATTRS=(color=black);
refline 0/axis=y;
xaxis label='Predicted Value' LABELATTRS=( Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
yaxis label="Studentized Residual" LABELATTRS=( Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
title2 'Studentized Residuals vs Predicted';
run;
Obs | photo | irrad | resist | pred_photo | resid | student |
---|---|---|---|---|---|---|
1 | 348 | 294 | 990 | 426.27 | -78.274 | -0.64140 |
2 | 131 | 190 | 968 | 362.30 | -231.301 | -2.08141 |
3 | 402 | 294 | 1868 | 426.27 | -24.274 | -0.19717 |
4 | 731 | 550 | 1814 | 583.75 | 147.254 | 1.21474 |
5 | 526 | 550 | 2521 | 583.75 | -57.746 | -0.46280 |
6 | 1346 | 2000 | 1516 | 1475.68 | -129.676 | -1.11325 |
7 | 655 | 550 | 1935 | 583.75 | 71.254 | 0.57259 |
8 | 360 | 550 | 4675 | 583.75 | -223.746 | -1.93696 |
9 | 618 | 550 | 2234 | 583.75 | 34.254 | 0.27363 |
10 | 1385 | 2000 | 1158 | 1475.68 | -90.676 | -0.76696 |
11 | 1550 | 2000 | 985 | 1475.68 | 74.324 | 0.62579 |
12 | 1415 | 2000 | 1697 | 1475.68 | -60.676 | -0.50929 |
13 | 1467 | 2000 | 646 | 1475.68 | -8.676 | -0.07238 |
14 | 842 | 800 | 1086 | 737.53 | 104.473 | 0.83885 |
15 | 927 | 800 | 998 | 737.53 | 189.473 | 1.58327 |
16 | 1099 | 1200 | 911 | 983.58 | 115.424 | 0.92962 |
17 | 1086 | 1200 | 765 | 983.58 | 102.424 | 0.82134 |
18 | 910 | 1200 | 1284 | 983.58 | -73.576 | -0.58548 |
19 | 1255 | 1600 | 915 | 1229.63 | 25.374 | 0.20403 |
20 | 1137 | 1600 | 1410 | 1229.63 | -92.626 | -0.75408 |
21 | 349 | 400 | 4111 | 491.48 | -142.477 | -1.18439 |
22 | 498 | 400 | 1802 | 491.48 | 6.523 | 0.05250 |
23 | 989 | 800 | 801 | 737.53 | 251.473 | 2.20391 |
24 | 829 | 800 | 983 | 737.53 | 91.473 | 0.73161 |
In this example, we extend the analysis above to multiple regression by examining the relationship of photosynthesis rate with both irradiation and resistance to water vapor. As with simple linear regression, a good place to start is to examine the potential individual relationships graphically. The resulting plots may not be as clear as in a simple linear model, because the two variables can influence the response simultaneously; however, general patterns can often be seen.
title1 'Multiple Regression';
proc sgplot data=photo;
scatter y=photo x=irrad;
xaxis label='Irradiation' LABELATTRS=(Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
yaxis label="Photosynthesis Rate" LABELATTRS=(Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
title2 'Observed Data: Irradiation';
run;
proc sgplot data=photo;
scatter y=photo x=resist;
xaxis label='Water Vapor Resistance' LABELATTRS=(Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
yaxis label="Photosynthesis Rate" LABELATTRS=(Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
title2 'Observed Data: Resistance';
run;
The plot for irradiation is unchanged from before. The new plot for resistance indicates a possible negative relationship with photosynthesis rate.
While the modeling process and many diagnostics are the same for multiple regression, introducing two or more regressor variables into the model requires additional considerations and diagnostics. Specifically, we must now consider whether the regressors are related to one another. This is referred to as collinearity and can cause problems by inflating the variances of the parameter estimates. SAS provides the means to assess this through the COLLIN and VIF options on the MODEL statement, as well as through the default and additional plotting options.
proc reg data=photo plots = DFBetas(label);
model photo=irrad resist/r influence collinoint vif;
output out=stats p=pred_photo r=resid rstudent = student;
run;
Number of Observations Read | 24 |
---|---|
Number of Observations Used | 24 |
Analysis of Variance

Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 2 | 3648208 | 1824104 | 150.56 | <.0001 |
Error | 21 | 254427 | 12116 | | |
Corrected Total | 23 | 3902635 | | | |
Root MSE | 110.07085 | R-Square | 0.9348 |
---|---|---|---|
Dependent Mean | 868.95833 | Adj R-Sq | 0.9286 |
Coeff Var | 12.66699 |
Parameter Estimates

Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| | Variance Inflation |
---|---|---|---|---|---|---|
Intercept | 1 | 400.60562 | 69.20397 | 5.79 | <.0001 | 0 |
irrad | 1 | 0.57294 | 0.03886 | 14.74 | <.0001 | 1.16925 |
resist | 1 | -0.07086 | 0.02483 | -2.85 | 0.0095 | 1.16925 |
Collinearity Diagnostics (intercept adjusted)

Number | Eigenvalue | Condition Index | Proportion of Variation: irrad | Proportion of Variation: resist |
---|---|---|---|---|
1 | 1.38046 | 1.00000 | 0.30977 | 0.30977 |
2 | 0.61954 | 1.49272 | 0.69023 | 0.69023 |
Output Statistics

Obs | Dependent Variable | Predicted Value | Std Error Mean Predict | Residual | Std Error Residual | Student Residual | Cook's D | RStudent | Hat Diag H | Cov Ratio | DFFITS | DFBETAS: Intercept | DFBETAS: irrad | DFBETAS: resist |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 348 | 498.8958 | 42.6780 | -150.8958 | 101.5 | -1.487 | 0.130 | -1.5345 | 0.1503 | 0.9757 | -0.6454 | -0.6187 | 0.5082 | 0.3849 |
2 | 131 | 440.8694 | 46.2506 | -309.8694 | 99.882 | -3.102 | 0.688 | -4.1136 | 0.1766 | 0.2235 | -1.9048 | -1.8332 | 1.5588 | 1.1340 |
3 | 402 | 436.6793 | 34.4530 | -34.6793 | 104.5 | -0.332 | 0.004 | -0.3246 | 0.0980 | 1.2633 | -0.1070 | -0.0738 | 0.0786 | 0.0113 |
4 | 731 | 587.1774 | 27.9986 | 143.8226 | 106.5 | 1.351 | 0.042 | 1.3799 | 0.0647 | 0.9422 | 0.3629 | 0.2205 | -0.2057 | -0.0156 |
5 | 526 | 537.0783 | 32.4028 | -11.0783 | 105.2 | -0.105 | 0.000 | -0.1028 | 0.0867 | 1.2655 | -0.0317 | -0.0031 | 0.0090 | -0.0160 |
6 | 1346 | 1439 | 43.8862 | -93.0506 | 100.9 | -0.922 | 0.054 | -0.9184 | 0.1590 | 1.2160 | -0.3993 | 0.1951 | -0.3427 | -0.1168 |
7 | 655 | 578.6031 | 28.0307 | 76.3969 | 106.4 | 0.718 | 0.012 | 0.7092 | 0.0649 | 1.1490 | 0.1868 | 0.0976 | -0.0981 | 0.0120 |
8 | 360 | 384.4425 | 75.2381 | -24.4425 | 80.342 | -0.304 | 0.027 | -0.2976 | 0.4672 | 2.1442 | -0.2787 | 0.1438 | -0.0413 | -0.2587 |
9 | 618 | 557.4155 | 29.4553 | 60.5845 | 106.1 | 0.571 | 0.008 | 0.5619 | 0.0716 | 1.1897 | 0.1560 | 0.0467 | -0.0630 | 0.0489 |
10 | 1385 | 1464 | 42.1524 | -79.4191 | 101.7 | -0.781 | 0.035 | -0.7736 | 0.1467 | 1.2417 | -0.3207 | 0.1100 | -0.2608 | -0.0300 |
11 | 1550 | 1477 | 41.9689 | 73.3219 | 101.8 | 0.721 | 0.029 | 0.7121 | 0.1454 | 1.2565 | 0.2937 | -0.0776 | 0.2285 | -0.0025 |
12 | 1415 | 1426 | 45.4047 | -11.2247 | 100.3 | -0.112 | 0.001 | -0.1093 | 0.1702 | 1.3925 | -0.0495 | 0.0272 | -0.0429 | -0.0189 |
13 | 1467 | 1501 | 42.8739 | -33.7002 | 101.4 | -0.332 | 0.007 | -0.3253 | 0.1517 | 1.3432 | -0.1376 | 0.0144 | -0.0945 | 0.0281 |
14 | 842 | 781.9984 | 28.4019 | 60.0016 | 106.3 | 0.564 | 0.008 | 0.5548 | 0.0666 | 1.1846 | 0.1482 | 0.1230 | -0.0680 | -0.0813 |
15 | 927 | 788.2343 | 29.6573 | 138.7657 | 106.0 | 1.309 | 0.045 | 1.3331 | 0.0726 | 0.9669 | 0.3730 | 0.3182 | -0.1743 | -0.2235 |
16 | 1099 | 1024 | 27.3153 | 75.4266 | 106.6 | 0.707 | 0.011 | 0.6987 | 0.0616 | 1.1475 | 0.1790 | 0.0969 | 0.0056 | -0.0918 |
17 | 1086 | 1034 | 29.3413 | 52.0808 | 106.1 | 0.491 | 0.006 | 0.4819 | 0.0711 | 1.2038 | 0.1333 | 0.0801 | -0.0024 | -0.0801 |
18 | 910 | 997.1421 | 23.9219 | -87.1421 | 107.4 | -0.811 | 0.011 | -0.8042 | 0.0472 | 1.1044 | -0.1791 | -0.0562 | -0.0328 | 0.0356 |
19 | 1255 | 1252 | 31.8257 | 2.5358 | 105.4 | 0.024 | 0.000 | 0.0235 | 0.0836 | 1.2631 | 0.0071 | 0.0006 | 0.0037 | -0.0018 |
20 | 1137 | 1217 | 31.1000 | -80.3877 | 105.6 | -0.761 | 0.017 | -0.7535 | 0.0798 | 1.1567 | -0.2219 | 0.0511 | -0.1507 | -0.0306 |
21 | 349 | 338.4681 | 62.1805 | 10.5319 | 90.825 | 0.116 | 0.002 | 0.1132 | 0.3191 | 1.6969 | 0.0775 | -0.0290 | 0.0000 | 0.0668 |
22 | 498 | 502.0874 | 31.7018 | -4.0874 | 105.4 | -0.039 | 0.000 | -0.0378 | 0.0830 | 1.2621 | -0.0114 | -0.0078 | 0.0078 | 0.0013 |
23 | 989 | 802.1940 | 32.8231 | 186.8060 | 105.1 | 1.778 | 0.103 | 1.8827 | 0.0889 | 0.7788 | 0.5882 | 0.5223 | -0.2818 | -0.4061 |
24 | 829 | 789.2972 | 29.8820 | 39.7028 | 105.9 | 0.375 | 0.004 | 0.3670 | 0.0737 | 1.2248 | 0.1035 | 0.0887 | -0.0485 | -0.0628 |
Sum of Residuals | 0 |
---|---|
Sum of Squared Residuals | 254427 |
Predicted Residual SS (PRESS) | 337749 |
In this model there are now two model degrees of freedom, one for each regressor, and the model has a large F value (150.56, p<0.0001). The \(r^2\) has increased to \(r^2=0.93\) from the simple linear example above (\(r^2=0.91\)); however, some caution is warranted, as this measure will inherently increase as regressors are added to the model. The parameter estimates table gives more information. Both regressors have large t values and correspondingly small p-values, suggesting they are relevant to the relationship with photosynthetic rate. As before, the coefficient for irradiation is positive, although a bit smaller than the previous estimate, and, as suspected from the initial data plots, the coefficient for water vapor resistance is negative. The last column in this table gives the VIF, or variance inflation factor, for each regressor. This statistic is computed from the squared multiple correlation, \(r^2_i\), of each regressor with the other regressors, i.e. \[VIF = 1/(1 - r^2_i)\] where \(i\) refers to the \(i^{th}\) regressor. For 'large' sample sizes, values larger than 1 may indicate collinearity. In this case, both values are larger than 1; however, the sample size of n=24 is not that large, so this cutoff may not be a good measure.
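As a quick numeric check, a small data step sketch back-calculates the reported VIF; the value of \(r^2_i\) used here (0.1447) is inferred from the output rather than reported directly:
data _null_;
r2_i = 0.1447; * squared multiple correlation of irrad with resist, implied by VIF = 1.16925;
vif = 1/(1 - r2_i); * reproduces the reported VIF of approximately 1.169;
put vif=;
run;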
The next table gives further metrics for assessing collinearity. The condition index is computed from a variance decomposition routine. Values above 10 are usually considered problematic; both regressors fall well below this value. The last measures are the proportion of variation values. Starting with the smallest eigenvalue (0.6195), the proportion for each variable is 0.69. Again, for large sample sizes, a cutoff above 0.5 is often suggested as indicating collinearity problems; however, given the lower sample size here, these probably are not an issue even though they are greater than 0.5.
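For reference, each condition index is the square root of the ratio of the largest eigenvalue to the eigenvalue in question; for the second dimension above:
\[\sqrt{1.38046 / 0.61954} = 1.49272\]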
The final table gives influence statistics for each observation. Some of these are repeated from the plots. The DFFITS and DFBETAS statistics, however, are new. These measure how much each observation influences the model fit and the respective parameter estimates. Larger values may point to observations that have more influence on the regression fit compared to the other data. Here, the second observation appears to have the most influence. One way to see this graphically is the plots=dfbetas(label) option in the PROC REG call (the plot labeled 'Influence Diagnostics for photo'). Although some other points (1 and 23) are slightly beyond the reference lines, this plot shows the second observation to be quite different from the other observations. Notably, this observation was also flagged in the Cook's D, residual, and leverage plots. A good rule of thumb when looking at any statistical diagnostic is not to rely on one measure or strict cutoff values for decision making, but rather to make a decision across several metrics. In this case we would want to go back and reassess the second data point. Was it recorded correctly? Are there any field notes relevant to that point that may call its validity into question? That said, you typically should NOT delete observations unless you have a defensible reason for doing so. Many times in scientific investigations, it is the data that do not fit the model that are the most informative.
One common use of regression is the incorporation of qualitative treatment structures. Doing so allows the comparison of model relationships based on different subsets of the data. This, however, requires that the models be of the same mathematical form, i.e. parameter estimates from linear models with different regressors are not comparable. All the models used should also fit their respective data well. In this technique, the models act as a proxy for the data, so they should represent the data in a reliable fashion. Given these constraints, two or more models can be compared by combining them into a single full model via a special class of regressors: dummy variables. Dummy variables are binary variables, usually coded as 0 or 1, that act like an on/off switch corresponding to the different qualitative or treatment variable levels. A full dummy variable model allows the user to estimate all treatment regression models simultaneously and to compare combinations of parameters across the models as needed to test hypotheses of interest.
Suppose, for example, we have two treatments and have linear regression models for each:
\[y_{i1} = \beta_{01} + \beta_{11} \cdot x_{i} + \epsilon_{i1} \hspace{5cm} (1)\]
\[y_{i2} = \beta_{02} + \beta_{12} \cdot x_{i} + \epsilon_{i2} \hspace{5cm} (2)\]
where the second subscript refers to treatment 1 or 2. Together, we have four parameters, two intercepts and two slopes. It may be of interest, for example, to compare the two slope parameters, \(\beta_{11}\) and \(\beta_{12}\). To do this we would need a common pooled estimate of the error, however, our models each have their own error terms, \(\epsilon_{i1}\) and \(\epsilon_{i2}\). In order to build a single full model with one common error term, we define new binary dummy variables that encode the treatment structure:
\[ D_1 = \begin{cases} 1 & \text{if treatment} = 1\\ 0 & \text{if treatment} = 2 \end{cases} \hspace{2cm} D_2 = \begin{cases} 0 & \text{if treatment} = 1\\ 1 & \text{if treatment} = 2 \end{cases}\] We can now write a full model as:
\[ y_i = (D_1 \cdot \beta_{01} + D_2 \cdot \beta_{02}) \hspace{.2cm} + \hspace{.2cm} (D_1 \cdot \beta_{11} + D_2 \cdot \beta_{12}) \cdot x_i \hspace{.2cm} + \hspace{.2cm} \epsilon_i \hspace{5cm} (3) \]
If we take an observation from treatment 1, then \(D_1=1\) and \(D_2=0\) and the full model reduces to model (1) above. If we have an observation from treatment 2, the full model returns model (2). Most importantly, we now have one common pooled error term, \(\epsilon_i\), and can proceed to define contrasts to test hypotheses involving comparison of the regression parameters. This concept can be extended to any number of treatments by defining additional dummy variables; however, some caution is advised, as the number of estimated parameters will increase with each treatment we add.
Some may recognize this model as Analysis of Covariance (ANCOVA), where the model contains both continuous variables and qualitative factors. The two models are, in fact, mathematically equivalent; however, they differ in their objectives. ANCOVA seeks to adjust mean comparisons of qualitative treatment effects using continuous covariates, while dummy variable regression (DVR) focuses on comparing the continuous relationships across qualitative factors. Beware that DVR is sometimes incorrectly referred to as ANCOVA even though the analytical goals differ.
In SAS, either PROC REG or more general linear model procedures such as PROC GLM, PROC MIXED, or PROC GLIMMIX can be used to fit a full model. Although it is technically possible in PROC REG, using that method requires manually defining dummy variables and their interactions separately for each treatment in a data step. A more expedient method uses the general linear model procedures which define dummy variables internally. In this example we will concentrate on the procedure PROC GLM for the fixed model case. Extension to mixed models through PROC MIXED or PROC GLIMMIX is also possible and similar enough that users can simply apply these principles to those procedures.
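For completeness, a minimal sketch of the manual dummy-variable approach in PROC REG; the data set soy2, the dummy variables d1-d3, and the product terms d1h-d3h are illustrative names, not part of the original data:
data soy2;
set soybean;
d1 = (treatment = 1); * dummy variables: 1 if the condition holds, 0 otherwise;
d2 = (treatment = 2);
d3 = (treatment = 3);
d1h = d1*height; * dummy-height products code the treatment-specific slopes;
d2h = d2*height;
d3h = d3*height;
run;
proc reg data=soy2;
model yield = d1 d2 d3 d1h d2h d3h/ noint; * noint because the dummies supply the intercepts;
run;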
The following code estimates the simple linear regression of soybean yield as a function of plant height, separately for each treatment. The procedure PROC REG is used for the initial estimation, and a BY statement is used to process each treatment independently. Note that the data must be sorted by treatment before doing this. Diagnostic plots are omitted here for brevity. In a real analysis, however, this step would include a full diagnostic evaluation, as shown in the previous examples, to ensure that each treatment regression is adequate and representative of the data.
proc sort data= soybean;
by treatment;
run;
proc reg data=soybean plots=none;
model yield = height;
by treatment;
run;
treatment=1

Number of Observations Read | 15 |
---|---|
Number of Observations Used | 15 |
Analysis of Variance

Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 1 | 3.77540 | 3.77540 | 244.66 | <.0001 |
Error | 13 | 0.20060 | 0.01543 | | |
Corrected Total | 14 | 3.97600 | | | |
Root MSE | 0.12422 | R-Square | 0.9495 |
---|---|---|---|
Dependent Mean | 12.26000 | Adj R-Sq | 0.9457 |
Coeff Var | 1.01322 |
Parameter Estimates

Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | 1 | 9.45920 | 0.18191 | 52.00 | <.0001 |
height | 1 | 0.05677 | 0.00363 | 15.64 | <.0001 |
treatment=2

Number of Observations Read | 14 |
---|---|
Number of Observations Used | 14 |
Analysis of Variance

Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 1 | 3.37515 | 3.37515 | 205.60 | <.0001 |
Error | 12 | 0.19699 | 0.01642 | | |
Corrected Total | 13 | 3.57214 | | | |
Root MSE | 0.12813 | R-Square | 0.9449 |
---|---|---|---|
Dependent Mean | 15.86429 | Adj R-Sq | 0.9403 |
Coeff Var | 0.80764 |
Parameter Estimates

Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | 1 | 13.21771 | 0.18773 | 70.41 | <.0001 |
height | 1 | 0.05385 | 0.00376 | 14.34 | <.0001 |
treatment=3

Number of Observations Read | 15 |
---|---|
Number of Observations Used | 15 |
Analysis of Variance

Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 1 | 4.32725 | 4.32725 | 272.97 | <.0001 |
Error | 13 | 0.20608 | 0.01585 | | |
Corrected Total | 14 | 4.53333 | | | |
Root MSE | 0.12591 | R-Square | 0.9545 |
---|---|---|---|
Dependent Mean | 9.46667 | Adj R-Sq | 0.9510 |
Coeff Var | 1.32999 |
Parameter Estimates

Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | 1 | 5.86098 | 0.22064 | 26.56 | <.0001 |
height | 1 | 0.06540 | 0.00396 | 16.52 | <.0001 |
Assuming the previous regressions fit the data well, the next step is to form and estimate the full model using PROC GLM. We use this procedure because it can utilize the CLASS statement for the qualitative variable treatment. Internally, this causes SAS to define dummy variables for each treatment level, which can then be entered into the GLM MODEL statement. In the model, the term treatment codes for the intercept parameters and the interaction treatment*height codes for the slopes. The NOINT option is used because the treatment term explicitly supplies the intercepts, and the SOLUTION option has GLM print the parameter estimates, which it does not do by default. Contrast statements are added to test two hypotheses:
\(H_0: \beta_{11} - \beta_{12} = 0\): are the two light treatments, 1 and 2, equivalent in their slope values, and
\(H_0: \frac{(\beta_{11} + \beta_{12})}{2} - \beta_{13} = 0\): is the average slope of the light treatments equivalent to the shaded treatment slope. Note that the corresponding CONTRAST statement multiplies this hypothesis through by 2 (coefficients 1 1 -2) to avoid fractional coefficients; rescaling a contrast does not change its F test.
proc glm data=soybean;
class treatment;
model yield = treatment treatment*height/ ss3 noint solution;
contrast 'Compare light treatment slopes' treatment*height 1 -1 0;
contrast 'Compare average light treatment slope to shaded slope' treatment*height 1 1 -2;
run;
Class Level Information

Class | Levels | Values |
---|---|---|
treatment | 3 | 1 2 3 |
Number of Observations Read | 44 |
---|---|
Number of Observations Used | 44 |
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 6 | 7133.816325 | 1188.969387 | 74843.0 | <.0001 |
Error | 38 | 0.603675 | 0.015886 | ||
Uncorrected Total | 44 | 7134.420000 |
R-Square | Coeff Var | Root MSE | yield Mean |
---|---|---|---|
0.998048 | 1.012003 | 0.126040 | 12.45455 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
treatment | 3 | 134.2944937 | 44.7648312 | 2817.85 | <.0001 |
height*treatment | 3 | 11.4778011 | 3.8259337 | 240.83 | <.0001 |
Contrast | DF | Contrast SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Compare light treatment slopes | 1 | 0.00497139 | 0.00497139 | 0.31 | 0.5792 |
Compare average light treatment slope to shaded slope | 1 | 0.07180107 | 0.07180107 | 4.52 | 0.0401 |
Parameter | Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|
treatment 1 | 9.45920319 | 0.18457288 | 51.25 | <.0001 |
treatment 2 | 13.21771422 | 0.18466955 | 71.57 | <.0001 |
treatment 3 | 5.86098445 | 0.22088018 | 26.53 | <.0001 |
height*treatment 1 | 0.05677291 | 0.00368273 | 15.42 | <.0001 |
height*treatment 2 | 0.05385465 | 0.00369476 | 14.58 | <.0001 |
height*treatment 3 | 0.06539931 | 0.00396257 | 16.50 | <.0001 |
Some notes from this output: The total number of observations, 44, is the sum of the treatment n's from the regressions above. There are now 6 degrees of freedom in the model, one for each parameter (three intercepts and three slopes), leaving 44-6=38 DF for the error term. The tests on treatment and treatment*height evaluate whether any of the intercepts or slopes, respectively, are zero (we already know they differ from zero from the preceding regression analyses).
The two contrasts are listed below the ANOVA table. The F test for the first is small and has a large p-value; hence we would conclude that the slopes of the two light treatments are similar. The second contrast has a larger F value and a p-value of 0.04, indicating there is some evidence that the shaded slope differs from the light treatments. The plot of observed and predicted values for each treatment supports this, showing the shaded treatment to have a slightly steeper slope than the lighted treatments.
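A minimal sketch for reproducing that plot, assuming an OUTPUT statement is added to the PROC GLM call to save predicted values (the data set name glmstats and variable name pred_yield are illustrative):
proc glm data=soybean;
class treatment;
model yield = treatment treatment*height/ noint solution;
output out=glmstats p=pred_yield; * save predicted values for plotting;
run;
proc sort data=glmstats;
by treatment height; * sort so the series lines draw cleanly;
run;
proc sgplot data=glmstats;
scatter y=yield x=height/ group=treatment;
series y=pred_yield x=height/ group=treatment;
title2 'Observed and Predicted Yield by Treatment';
run;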
Lastly, note that the parameter estimates shown by the SOLUTION option in the MODEL statement are exactly the same as those in the preceding regression outputs. This should always be true and should always be checked to ensure the formulated full model is correct.
Kutner, M.H., Nachtsheim, C.J., and Neter, J. Applied Linear Regression Models. McGraw-Hill/Irwin, 2004.
Belsley, D.A., Kuh, E., and Welsch, R.E. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons, 2005.