1 Introduction

This document is an introduction to estimating and diagnosing simple and multiple linear regression fixed effect models in SAS. Some additional knowledge of SAS may be useful for understanding certain aspects of this tutorial and the readers are referred to: Data Step and Graphics and Plotting. Application of mixed model methodology to linear regression is also possible. While much of the theoretical background and diagnostics still apply to these models, the details and considerations of setting these models up in SAS are covered elsewhere. Likewise, readers interested in fixed or mixed nonlinear modeling are referred to the specific tutorials on those subjects.

In simple linear regression, we define a linear relationship between a continuous dependent response, \(y_i\), and an independent continuous regressor variable, \(x_1\) as:

\[y_{i} = \beta_0 + \beta_1 \cdot x_{1} + \epsilon_{i}\] where \(\beta_0\) and \(\beta_1\) are an intercept term and regression coefficient, respectively. The last term, \(\epsilon_i\), is the regression error assumed to be distributed as: \(\epsilon_i \sim N(0, \sigma^2)\). The index \(i\) refers to the \(i^{th}\) observation. With some additional considerations, this model can also be theoretically extended to k multiple regressor variables, e.g. \(x_2, x_3, \ldots, x_k\) and their corresponding coefficients, \(\beta_2, \beta_3, \ldots, \beta_k\).

Regression models have various uses including prediction of the response \(y_i\), description and exploration of potential relationships between variables, and structural comparisons of estimated relationships between groups.

2 Data used in examples

2.1 Photosynthesis

Two data examples will be used here. The first is adapted from Steele and Torrie, Principles and Procedures of Statistics: A Biometrical Approach, McGraw-Hill Book Co., 1980, pg 244. These data describe a study investigating the changes in the response, photosynthetic rate, that correspond to changes in two regressor variables: irradiation level and leaf resistance to water vapor. A data step to read in these data would be:

data photo;
    input photo irrad resist;
    cards;
    348     294      990
    131     190      968
    402     294      1868
    731     550      1814
    526     550      2521
    1346    2000     1516
    655     550      1935
    360     550      4675
    618     550      2234
    1385    2000     1158
    1550    2000     985
    1415    2000     1697
    1467    2000     646
    842     800      1086
    927     800      998
    1099    1200     911
    1086    1200     765
    910     1200     1284
    1255    1600     915
    1137    1600     1410
    349     400      4111
    498     400      1802
    989     800      801
    829     800      983
;

2.2 Soybean Yield

The second set of data is taken from Allen and Cady, Analyzing Experimental Data by Regression, Lifetime Learning Pub., 1982, pg 234 and describes soybean yields as they relate to plant height with an additional qualitative factor describing the controlled lighting conditions: 1) Ambient light, 2) Additional artificial light, and 3) Partial shading. The SAS code for reading this data is:

data soybean;
    input yield height treatment;
    cards;
    12.2    48  1
    12.4    52  1
    11.9    42  1
    11.3    35  1
    11.8    40  1
    12.1    48  1
    13.1    60  1
    12.7    61  1
    12.4    50  1
    11.4    33  1
    12.3    48  1
    12.2    51  1
    12.6    56  1
    13.2    65  1
    12.3    51  1
    16.6    63  2
    15.8    50  2
    16.5    63  2
    15.0    33  2
    15.4    38  2
    15.6    45  2
    15.8    50  2
    15.8    48  2
    16.0    50  2
    15.0    35  2
    16.2    50  2
    16.7    62  2
    15.8    49  2
    15.9    52  2
     9.5    52  3
     9.5    54  3
     9.6    58  3
     8.8    45  3
     9.5    57  3
     9.8    62  3
     9.1    52  3
    10.3    67  3
     9.5    55  3
     8.5    40  3
     8.6    41  3
    10.4    67  3
     9.4    55  3
    10.2    66  3
     9.3    56  3
;
run;

3 Simple Linear Regression

3.1 Initial steps

A first step in regression analysis should always be an examination of the data. Graphical methods in simple linear regression are particularly useful. Here we use the first data set to produce a scatter plot looking at photosynthesis rate vs irradiance using the general plotting procedure PROC SGPLOT.

title1 'Least Squares Regression';

proc sgplot data=photo;
    scatter y=photo x=irrad/; 
    xaxis label='Irradiation' LABELATTRS=( Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
    yaxis label="Photosynthesis Rate" LABELATTRS=( Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
title2 'Observed Data';
run;

The SGPlot Procedure


This initial plot is useful for discerning characteristics of potential relationships. Here we see that the relationship between photosynthesis and irradiation is positive. We can also see that this relationship is more or less proportional or linear. Hence, a linear regression is a plausible model for these data and we expect that any estimated model should reflect the positive direction of the relationship.

3.2 Estimation and Diagnostics

A common means of estimating linear regression models in SAS is the procedure PROC REG. While there are other procedures that can accomplish this task, the REG procedure has more diagnostic functionality built into it that make it the best choice for a fixed linear regression. As with all SAS procedures, there are required and optional statements. A summary of these is shown below.

3.2.1 Required Statements

MODEL: The only required statement under PROC REG is the MODEL statement. This statement defines the model as well as the order in which variables are included. The general form of the MODEL statement is: Dependent var = Regressor var’s / option1 option2 … Here, dependent var is a numeric response (the variable to be modeled) and regressor var’s is a list of one or more numeric variables paired with the dependent var. In most applications, both dependent variables and regressor variables are continuous, numeric values. Options are keywords which specify various details of the analysis, and as usual in SAS, they are listed at the end of a statement after a slash (/). More than one MODEL statement can be used in each call to PROC REG, if needed.

Unlike the MODEL statement of other procedures, PROC REG does not allow the use of crossed or nested terms. Thus, the symbols *, |, and () should not appear in the MODEL statement. If a crossed term is required in the model, it must be explicitly defined in a preceding DATA STEP or read in with the original data.

MODEL Statement Options: PROC REG provides a wide array of options for regression analysis. All of these, however, fall into three broad categories. Printing options determine what details of the analysis should be printed in the output. Examples would be P, CLM, or I for predicted values, prediction intervals, and the \((X' X)^{-1}\) matrix, respectively. The second type of options are diagnostic. Examples include R, DW, and INFLUENCE, for a preliminary residual analysis, the Durbin-Watson statistic, and influence statistics, respectively. The last category of options are used to specify the model itself. Examples of these might be NOINT, SELECTION, and INCLUDE. These, in order, result in the fit of a model with no intercept, requesting REG use various model selection procedures when multiple regressor variables are present, and providing a list of regressor variables that must be included in all models tested.

OUTPUT: The OUTPUT statement will save requested regression statstics to a new SAS data set. Examples of these would be predicted values, raw or studentized residuals, confidence intervals, etc. The format of the OUTPUT statement is: OUTPUT OUT=(data set name) option=name1 option=name2 …, where the options might be P for predicted values, RSTUDENT or R for studentized or raw residuals, respecively, or L95M and U95M for lower and upper 95% confidence intervals on the estimated model. The name1, name2, etc. give the SAS variable names for the respective output statistics. Although PROC REG produces many diagnostic statsitics and plots by default, outputting these statistics to a separate data set allows SAS to utilize them in other procedures, such as further computation or more sophisticated plotting.

The code below demonstrates a call to PROC REG for the photosynthesis data set. The predicted values and residuals are saved to the output data set, STATS.


proc reg data=photo;
    model photo=irrad ;
    output out=stats p=pred_photo r=resid rstudent = student;
run;
Model: MODEL1
Dependent Variable: photo

Number of Observations Read 24
Number of Observations Used 24

Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 1 3549556 3549556 221.17 <.0001
Error 22 353079 16049
Corrected Total 23 3902635

Root MSE 126.68489 R-Square 0.9095
Dependent Mean 868.95833 Adj R-Sq 0.9054
Coeff Var 14.57894

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t|
Intercept 1 245.42764 49.26053 4.98 <.0001
irrad 1 0.61512 0.04136 14.87 <.0001



Model: MODEL1
Dependent Variable: photo

Panel of fit diagnostics for photo.


Scatter plot of residuals by irrad for photo.


Scatterplot of photo by irrad overlaid with the fit line, a 95% confidence band and lower and upper 95% prediction limits.


From the resulting output, we can see that there were n=24 observations used. In the ANOVA table, there were 23 total degrees of freedom (DF=n-1), 1 DF for the model and 23-1=22 DF for the error term. The model F test (model mean square/error mean square) is large (F=221.2; p=0.0001) indicating a relationship is detectable. Some summary information for CV, \(r^2\), and other statistics are given in the next table. The final table gives the estimated parameter values for the intercept and the regression coefficient. The intercept, or predicted photosynthetic rate when irradiation = 0, is 245.4 and the rate of change in photosynthesis with increasing irradiation is 0.62. This value is positive, as expected, has a small p-value, and estimates that photosynthetic rate will increase 0.62 units for every unit increase in irradiation.

The final section are default graphics displaying various residual and other diagnostics. For example, the studentized residual plots indicate the residuals were of reasonable magnitude (under normality assumptions we expected them to be between 2 to 3 units of the mean of 0) and have a somewhat random distribution (there may be some curvature in the residual pattern indicating a linear model may not completely capture this response relationship). The Cook’s D plot suggests that the second data point may be high leverage affecting the regression because it is above the reference line. The final two plots display a larger version of the raw residuals and the predicted line with 95% confidence bounds on the model (shaded area) and individual points (dashed lines). Overall, this model seems to adequately provide a simple description of the relationship between photosynthetic rate and irradiation. For more information on regression and regression diagnostics, see: Applied Linear Regression Models. Kutner, Nachtsheim, and Neter and Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Belsley, Kuh, and Welsch.

We can also produce some of these plots using the output data set STATS. This data set contains all the original data from data set PHOTO with the additional statistics we requested. The code below uses this output data set to produce a more customizable version of these plots.


proc print data=stats;
run;

proc sgplot data=stats;
    scatter y=photo x=irrad/FILLEDOUTLINEDMARKERS markerattrs=(symbol=circlefilled size=14) MARKERFILLATTRS= (color=blue)
     MARKEROUTLINEATTRS=(color=black); 
    series y=pred_photo x=irrad;
    xaxis label='Irradiation' LABELATTRS=( Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
    yaxis label="Photosynthesis Rate" LABELATTRS=( Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
title2 'Observed vs Predicted';
run;
    
proc sgplot data=stats;
    scatter y=resid x=pred_photo/FILLEDOUTLINEDMARKERS markerattrs=(symbol=circlefilled size=14) MARKERFILLATTRS= (color=green)
     MARKEROUTLINEATTRS=(color=black); 
    refline 0/axis=y;
    xaxis label='Predicted Value' LABELATTRS=( Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
    yaxis label="Residual" LABELATTRS=( Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
title2 'Raw residuals vs Predicted';
run;

proc sgplot data=stats;
    scatter y=student x=pred_photo/FILLEDOUTLINEDMARKERS markerattrs=(symbol=circlefilled size=14) MARKERFILLATTRS= (color=orange)
     MARKEROUTLINEATTRS=(color=black); 
    refline 0/axis=y;
    xaxis label='Predicted Value' LABELATTRS=( Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
    yaxis label="Studentized Residual" LABELATTRS=( Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
title2 'Studentized Residuals vs Predicted';
run;
Obs photo irrad resist pred_photo resid student
1 348 294 990 426.27 -78.274 -0.64140
2 131 190 968 362.30 -231.301 -2.08141
3 402 294 1868 426.27 -24.274 -0.19717
4 731 550 1814 583.75 147.254 1.21474
5 526 550 2521 583.75 -57.746 -0.46280
6 1346 2000 1516 1475.68 -129.676 -1.11325
7 655 550 1935 583.75 71.254 0.57259
8 360 550 4675 583.75 -223.746 -1.93696
9 618 550 2234 583.75 34.254 0.27363
10 1385 2000 1158 1475.68 -90.676 -0.76696
11 1550 2000 985 1475.68 74.324 0.62579
12 1415 2000 1697 1475.68 -60.676 -0.50929
13 1467 2000 646 1475.68 -8.676 -0.07238
14 842 800 1086 737.53 104.473 0.83885
15 927 800 998 737.53 189.473 1.58327
16 1099 1200 911 983.58 115.424 0.92962
17 1086 1200 765 983.58 102.424 0.82134
18 910 1200 1284 983.58 -73.576 -0.58548
19 1255 1600 915 1229.63 25.374 0.20403
20 1137 1600 1410 1229.63 -92.626 -0.75408
21 349 400 4111 491.48 -142.477 -1.18439
22 498 400 1802 491.48 6.523 0.05250
23 989 800 801 737.53 251.473 2.20391
24 829 800 983 737.53 91.473 0.73161



The SGPlot Procedure




The SGPlot Procedure




The SGPlot Procedure


4 Multiple Linear Regression

In this example, we extend the analysis above to multiple regression by examining the relationship of photosynthesis rate with both irradiation and resistance to water vapor. As with simple linear regression, a good place to start is to examine the potential individual relationships graphically. The resulting plots may not be as clear as a simple linear model because the two variables can be influencing the response simultaneously, however, general patterns can often be seen.


title1 'Multiple Regression';

proc sgplot data=photo;
    scatter y=photo x=irrad/; 
    xaxis label='Irradiation' LABELATTRS=(Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
    yaxis label="Photosynthesis Rate" LABELATTRS=( Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
title2 'Observed Data: Irradiation';
run;

proc sgplot data=photo;
    scatter y=photo x=resist/; 
    xaxis label='Water Vapor Resistance' LABELATTRS=(Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
    yaxis label="Photosynthesis Rate" LABELATTRS=( Family=Arial Size=15 Weight=Bold) VALUEATTRS=(Family=Arial Size=12 Weight=Bold);
title2 'Observed Data: Resistance';
run;

The SGPlot Procedure




The SGPlot Procedure


The plot for irradiance is unchanged from previously. The new plot for resistance indicates a possible negative relationship with photosynthesis rate.

4.1 Diagnostics

While the modeling process and many diagnostics are the same for multiple regression, introducing two or more regressor variables into the model requires additional considerations and diagnostics. Specifically, we now must consider whether the regressors are related to one another. This is referred to as collinearity and can cause problems through inflated variability. SAS provides the means to assess this through COLLIN and VIF options on the MODEL statement as well as the default and additional plotting options.


proc reg data=photo plots = DFBetas(label);
    model photo=irrad resist/r influence collinoint vif;
    output out=stats p=pred_photo r=resid rstudent = student;
run;
Model: MODEL1
Dependent Variable: photo

Number of Observations Read 24
Number of Observations Used 24

Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 2 3648208 1824104 150.56 <.0001
Error 21 254427 12116
Corrected Total 23 3902635

Root MSE 110.07085 R-Square 0.9348
Dependent Mean 868.95833 Adj R-Sq 0.9286
Coeff Var 12.66699

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t| Variance
Inflation
Intercept 1 400.60562 69.20397 5.79 <.0001 0
irrad 1 0.57294 0.03886 14.74 <.0001 1.16925
resist 1 -0.07086 0.02483 -2.85 0.0095 1.16925

Collinearity Diagnostics (intercept adjusted)
Number Eigenvalue Condition
Index
Proportion of Variation
irrad resist
1 1.38046 1.00000 0.30977 0.30977
2 0.61954 1.49272 0.69023 0.69023



Model: MODEL1
Dependent Variable: photo

Output Statistics
Obs Dependent
Variable
Predicted
Value
Std
Error
Mean
Predict
Residual Std Error
Residual
Student
Residual
Cook's D RStudent Hat Diag
H
Cov
Ratio
DFFITS DFBETAS
Intercept irrad resist
1 348 498.8958 42.6780 -150.8958 101.5 -1.487 0.130 -1.5345 0.1503 0.9757 -0.6454 -0.6187 0.5082 0.3849
2 131 440.8694 46.2506 -309.8694 99.882 -3.102 0.688 -4.1136 0.1766 0.2235 -1.9048 -1.8332 1.5588 1.1340
3 402 436.6793 34.4530 -34.6793 104.5 -0.332 0.004 -0.3246 0.0980 1.2633 -0.1070 -0.0738 0.0786 0.0113
4 731 587.1774 27.9986 143.8226 106.5 1.351 0.042 1.3799 0.0647 0.9422 0.3629 0.2205 -0.2057 -0.0156
5 526 537.0783 32.4028 -11.0783 105.2 -0.105 0.000 -0.1028 0.0867 1.2655 -0.0317 -0.0031 0.0090 -0.0160
6 1346 1439 43.8862 -93.0506 100.9 -0.922 0.054 -0.9184 0.1590 1.2160 -0.3993 0.1951 -0.3427 -0.1168
7 655 578.6031 28.0307 76.3969 106.4 0.718 0.012 0.7092 0.0649 1.1490 0.1868 0.0976 -0.0981 0.0120
8 360 384.4425 75.2381 -24.4425 80.342 -0.304 0.027 -0.2976 0.4672 2.1442 -0.2787 0.1438 -0.0413 -0.2587
9 618 557.4155 29.4553 60.5845 106.1 0.571 0.008 0.5619 0.0716 1.1897 0.1560 0.0467 -0.0630 0.0489
10 1385 1464 42.1524 -79.4191 101.7 -0.781 0.035 -0.7736 0.1467 1.2417 -0.3207 0.1100 -0.2608 -0.0300
11 1550 1477 41.9689 73.3219 101.8 0.721 0.029 0.7121 0.1454 1.2565 0.2937 -0.0776 0.2285 -0.0025
12 1415 1426 45.4047 -11.2247 100.3 -0.112 0.001 -0.1093 0.1702 1.3925 -0.0495 0.0272 -0.0429 -0.0189
13 1467 1501 42.8739 -33.7002 101.4 -0.332 0.007 -0.3253 0.1517 1.3432 -0.1376 0.0144 -0.0945 0.0281
14 842 781.9984 28.4019 60.0016 106.3 0.564 0.008 0.5548 0.0666 1.1846 0.1482 0.1230 -0.0680 -0.0813
15 927 788.2343 29.6573 138.7657 106.0 1.309 0.045 1.3331 0.0726 0.9669 0.3730 0.3182 -0.1743 -0.2235
16 1099 1024 27.3153 75.4266 106.6 0.707 0.011 0.6987 0.0616 1.1475 0.1790 0.0969 0.0056 -0.0918
17 1086 1034 29.3413 52.0808 106.1 0.491 0.006 0.4819 0.0711 1.2038 0.1333 0.0801 -0.0024 -0.0801
18 910 997.1421 23.9219 -87.1421 107.4 -0.811 0.011 -0.8042 0.0472 1.1044 -0.1791 -0.0562 -0.0328 0.0356
19 1255 1252 31.8257 2.5358 105.4 0.024 0.000 0.0235 0.0836 1.2631 0.0071 0.0006 0.0037 -0.0018
20 1137 1217 31.1000 -80.3877 105.6 -0.761 0.017 -0.7535 0.0798 1.1567 -0.2219 0.0511 -0.1507 -0.0306
21 349 338.4681 62.1805 10.5319 90.825 0.116 0.002 0.1132 0.3191 1.6969 0.0775 -0.0290 0.0000 0.0668
22 498 502.0874 31.7018 -4.0874 105.4 -0.039 0.000 -0.0378 0.0830 1.2621 -0.0114 -0.0078 0.0078 0.0013
23 989 802.1940 32.8231 186.8060 105.1 1.778 0.103 1.8827 0.0889 0.7788 0.5882 0.5223 -0.2818 -0.4061
24 829 789.2972 29.8820 39.7028 105.9 0.375 0.004 0.3670 0.0737 1.2248 0.1035 0.0887 -0.0485 -0.0628

Bar chart of studentized residuals and Cook's D for photo


Sum of Residuals 0
Sum of Squared Residuals 254427
Predicted Residual SS (PRESS) 337749

Panel of fit diagnostics for photo.


Panel of DFBETAS by observation for photo.


Panel of scatterplots of residuals by regressors for photo.


In this model there are now two degrees of freedom, one for each regressor, and the model has a large F value (150.56, p < 0.0001). The \(r^2\) has increased to \(r^2=0.93\) from the simple linear example above (\(r^2=0.91\)), however, some caution is warranted as this measure will inherently increase as regressors are added to the model. The parameter estimates table gives more information. Both regressors have large t values and corresponding smaller p-values suggesting they are relevant in the relationship to photosynthetic rate. As before, the coefficient for irradiation is positive, although a bit smaller than the previous estimate, and as suspected from the initial data plots, the coefficient for water vapor resistance is negative. The last column in this table gives the VIF or, variance inflation factor, for each regressor. This statistic is computed from the multiple correlation, \(r^2_i\), of each regressor relative to the other regressors, i.e.  \[VIF = 1/(1 - r^2_i)\] where i refers to the \(i^{th}\) regressor. For ‘large’ sample sizes, values larger than 1 may indicate collinearity. In this case, both values are larger than 1, however, the sample size is not that large at n=24, so this cutoff may not be a good measure.

The next table gives further metrics for assessing collinearity. The condition index is conputed from a variance decomposition routine. Values above 10 are usually considered problematic. Both regressors fall well below this value. The last measures are the proportion of variance values. Starting with the smallest eigenvalue (0.6195) the proportion for each variable is 0.69. Again, for large sample sizes, a cutoff of above 0.5 is often suggested for indicating collinearity problems, however, given the lower sample size here, these probably are not an issue even though they are greater than 0.5.

The final table gives influence statistics for each observation. Some of these are repeated from the plots. The DFFits and DFBetas statistics, however, are new. These respectively measure how much each observation influences the model fit, or the respective parameter estimates. Larger values may point to observations that have more influence on the regression fit compared to the other data. Here, the second observation appears to have the most influence. One way to see this graphically is the plots=dfbetas(label) option in the PROC REG call (labeled Influence diagnotics for photo). Although some other points (1 and 23) are slightly beyond the reference lines, this plot shows the second observation to be quite different from the other observations. Notably, this observation also was indicated in the Cook’s D, residual, and leverage plots. A good rule of thumb when lookig at any statistical diagnostic is not to rely on one measure or strict cutoff values for decision making, but rather to make a decision across several metrics. In this case we would want to go back and reassess the second data point. Was it recorded correctly? Are there any field notes relevant to that point that may call into question it’s validity. That said, typically, you should NOT delete observations unless you have a defensible reason for doing so. Many times in scientific investigations, it is the data that do not fit the model that are the most informative.

5 Comparing regression lines (Dummy Variable Regression, DVR)

5.1 Introduction - defining dummy variables

One common use of regression is the incorporation of qualitative treatment structures. Doing so allows the comparison of model relationships based on different subsets of the data. This, however, requires that the models be of the same mathematical form, i.e. parameter estimates from linear models with different regressors are not comparible. All the models used should also fit their respective data well. In this technique, the models will be a proxy for the data, so they should represent the data in a reliable fashion. Given these constraints, two or more models can be compared by combining the models into a single full model via a special class of regressors, dummy variables. Dummy variables are binary, usually coded as [0,1] that act like an on off switch corresponding to the different qualitative or treatment variable levels. A full dummy variable model allows the user to estimate all treatment regression models simultaneously and compare combinations of parameters across the models as needed to test hypotheses of interest.

Suppose, for example, we have two treatments and have linear regression models for each:

\[y_{i1} = \beta_{01} + \beta_{11} \cdot x_{i} + \epsilon_{i1} \hspace{5cm} (1)\]

\[y_{i2} = \beta_{02} + \beta_{12} \cdot x_{i} + \epsilon_{i2} \hspace{5cm} (2)\]

where the second subscript refers to treatment 1 or 2. Together, we have four parameters, two intercepts and two slopes. It may be of interest, for example, to compare the two slope parameters, \(\beta_{11}\) and \(\beta_{12}\). To do this we would need a common pooled estimate of the error, however, our models each have their own error terms, \(\epsilon_{i1}\) and \(\epsilon_{i2}\). In order to build a single full model with one common error term, we define new binary dummy variables that encode the treatment structure:

\[ D_1 = \begin{bmatrix} 1 \hspace{1cm} if \hspace{.2cm} treatment = 1\\ 0 \hspace{1cm} if \hspace{.2cm} treatment = 2 \end{bmatrix}\] \[ D_2 = \begin{bmatrix} 0 \hspace{1cm} if \hspace{.2cm} treatment = 1\\ 1 \hspace{1cm} if \hspace{.2cm} treatment = 2 \end{bmatrix}\] We can now write a full model as:

\[ y_i = (D_1 \cdot \beta_{01} + D_2 \cdot \beta_{02}) \hspace{.2cm} + \hspace{.2cm} (D_1 \cdot \beta_{11} + D_2 \cdot \beta_{12}) \cdot x_i \hspace{.2cm} + \hspace{.2cm} \epsilon_i \hspace{5cm} (3) \]

If we pick up an observation from treatment 1, then \(D_1=1\) and \(D_2=0\) and the full model reduces to model (1) above. If we have an observation from treatment 2, the full model returns model (2). Most importantly, we now have one common pooled error term, \(\epsilon_i\) and can proceed to define contrasts to test hypotheses involving comparison of the regression parameters. This concept can be extended to any number of treatments by defining additional dummy variables, however, some caution is advised as the number of estimated parameters will increase with each treatment we add.

Some may recognize this model as Analysis of Covariance (ANCOVA), where the model contains both continuous variables and qualitative factors. In fact, the two models are equivalent mathematically, however, they differ in their objectives. ANCOVA seeks to adjust mean comparisons of qualitative treatment effects using continuous covariates, while DVR focuses on comparing the continuous relationships across qualitative factors. Beware that DVR is sometimes incorrectly referred to as ANCOVA and, while they are mathematically equivalent, they have different analytical goals.

In SAS, either PROC REG or more general linear model procedures such as PROC GLM, PROC MIXED, or PROC GLIMMIX can be used to fit a full model. Although it is technically possible in PROC REG, using that method requires manually defining dummy variables and their interactions separately for each treatment in a data step. A more expedient method uses the general linear model procedures which define dummy variables internally. In this example we will concentrate on the procedure PROC GLM for the fixed model case. Extension to mixed models through PROC MIXED or PROC GLIMMIX is also possible and similar enough that users can simply apply these principles to those procedures.

5.1.1 Example using Soybean yield data

The following code estimates the simple linear regression of soybean yield as a function of plant height, separately for each treatment. The procedure PROC REG is used for the initial estimation and a BY statement is used to process each treatment independently. Note that sorting the data by treatment is necessary before doing this. Diagnostic plots are omitted here for brevity. In a real analysis, however, this step would be used with full diagnostic evaluation, as shown in the previous examples above, to ensure that each treatment regression was adequate and representative of the data.

proc sort data= soybean;
  by treatment;
run;

proc reg data=soybean plots=none;
  model yield = height;
  by treatment;
run;
Model: MODEL1
Dependent Variable: yield

Number of Observations Read 15
Number of Observations Used 15

Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 1 3.77540 3.77540 244.66 <.0001
Error 13 0.20060 0.01543
Corrected Total 14 3.97600

Root MSE 0.12422 R-Square 0.9495
Dependent Mean 12.26000 Adj R-Sq 0.9457
Coeff Var 1.01322

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t|
Intercept 1 9.45920 0.18191 52.00 <.0001
height 1 0.05677 0.00363 15.64 <.0001



Model: MODEL1
Dependent Variable: yield

Number of Observations Read 14
Number of Observations Used 14

Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 1 3.37515 3.37515 205.60 <.0001
Error 12 0.19699 0.01642
Corrected Total 13 3.57214

Root MSE 0.12813 R-Square 0.9449
Dependent Mean 15.86429 Adj R-Sq 0.9403
Coeff Var 0.80764

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t|
Intercept 1 13.21771 0.18773 70.41 <.0001
height 1 0.05385 0.00376 14.34 <.0001



Model: MODEL1
Dependent Variable: yield

Number of Observations Read 15
Number of Observations Used 15

Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 1 4.32725 4.32725 272.97 <.0001
Error 13 0.20608 0.01585
Corrected Total 14 4.53333

Root MSE 0.12591 R-Square 0.9545
Dependent Mean 9.46667 Adj R-Sq 0.9510
Coeff Var 1.32999

Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t|
Intercept 1 5.86098 0.22064 26.56 <.0001
height 1 0.06540 0.00396 16.52 <.0001

Assuming the previous regressions fit the data well, the next step is to form and estimate the full model using PROC GLM. We use this procedure because it can utilize the CLASS statement for the qualitative variable treatment. Internally, this causes SAS to define dummy variables for each treatment level, which can then be entered into the GLM MODEL statement. In the model, the term treatment codes for the intercept parameters and the interaction treatment*height codes for the slopes. The noint option is used because we have explicitly given SAS our own intercepts, and the solution option has GLM print the estimates, which is does not do by default. Contrast statements are added to test two hypotheses:

\(H_0: \beta_{11} - \beta_{12} = 0\) ; are the two light treatments, 1 and 2, equivalent in their slope values, and
\(H_0: \frac{(\beta_{11} + \beta_{12})}{2} - \beta_{13} = 0\) ; is the average slope of the light treatments equivalent to the shaded treatment.

proc glm data=soybean;
  class treatment;
  model yield = treatment treatment*height/ ss3 noint solution;
  contrast 'Compare light treatment slopes' treatment*height 1 -1 0;
  contrast 'Compare average light treatment slope to shaded slope' treatment*height 1 1 -2;
run;
Class Level Information
Class Levels Values
treatment 3 1 2 3

Number of Observations Read 44
Number of Observations Used 44



Dependent Variable: yield

Source DF Sum of Squares Mean Square F Value Pr > F
Model 6 7133.816325 1188.969387 74843.0 <.0001
Error 38 0.603675 0.015886
Uncorrected Total 44 7134.420000

R-Square Coeff Var Root MSE yield Mean
0.998048 1.012003 0.126040 12.45455

Source DF Type III SS Mean Square F Value Pr > F
treatment 3 134.2944937 44.7648312 2817.85 <.0001
height*treatment 3 11.4778011 3.8259337 240.83 <.0001

Contrast DF Contrast SS Mean Square F Value Pr > F
Compare light treatment slopes 1 0.00497139 0.00497139 0.31 0.5792
Compare average light treatment slope to shaded slope 1 0.07180107 0.07180107 4.52 0.0401

Parameter Estimate Standard
Error
t Value Pr > |t|
treatment 1 9.45920319 0.18457288 51.25 <.0001
treatment 2 13.21771422 0.18466955 71.57 <.0001
treatment 3 5.86098445 0.22088018 26.53 <.0001
height*treatment 1 0.05677291 0.00368273 15.42 <.0001
height*treatment 2 0.05385465 0.00369476 14.58 <.0001
height*treatment 3 0.06539931 0.00396257 16.50 <.0001

Analysis of Covariance for yield by height categorized by treatment


Some notes from this output: The total number of observations, 44, is the sum of each treatment n from the regressions above. There are now 6 degrees of freedom in the model, one for each parameter: Three intercepts and three slopes. The tests on treatment and treatment*height evaluate whether any of the intercepts or slopes are zero, respectively (we already know they are different from zero from the preceeding regression analyses).

The two contrasts are listed below the ANOVA table. The F test for the first is small and has a large p-value, hence we would conclude that the two slopes of the light treatments are similar. The second contrast has a larger F value and p-value of 0.04, indicating there is some evidence that the shaded slope differs from the light treatments. The plot of observed and predicted values for each treatment supports this showing the shaded treatment to have a slightly steeper slope than the lighted treatments.

Lastly, note that the parameter estimates shown from the solution option in the MODEL statement are exactly the same as the preceeding regression outputs.This should always be true and should always be checked to ensure the formulated full model is correct.

6 References:

Applied Linear Regression Models. Kutner, M.H. and Nachtsheim, C. and Neter, J., McGraw-Hill/Irwin, 2004.

Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. David A. Belsley, Edwin Kuh, Roy E. Welsch. John Wiley & Sons, 2005 - Mathematics - 292 pg.