1 Introduction

Generalized linear models are an extension of traditional linear models that allow non-normally distributed data to be incorporated into the linear modeling framework. While the topic can be complex, Stroup (2014) provides an excellent introduction, which is partially reproduced here. Those wishing to use generalized linear models are strongly encouraged to read the complete article and reference it in their published works.

Analysis of variance rests on three assumptions: independent observations, normally distributed data, and homogeneous variance, the latter meaning that the variance among experimental units does not change with treatment. However, data with non-normal distributions are common in most areas of research. Examples include the percentage of seeds that germinate (binomial), weed count per plot (Poisson or negative binomial), time to flowering (exponential or gamma), disease rating category (multinomial), and proportion of leaf area affected (beta), to name a few. Data from emerging genomic and other “-omic” research often share characteristics with non-normal distributions. For all distributions except the normal, the variance depends on the mean. As a consequence, whenever the normality assumption is violated, the equal variance assumption must also be violated, at least assuming that treatments affect the mean response. The question addressed here is: how should such data be analyzed?

Between the early 1990s and the late 2000s, advances in statistical theory and methodology that had been incubating for decades, enabled by rapid and sustained increases in computing capability, combined to dramatically change the conversation. The advance specifically relevant to this discussion is the GLMM. Generalized linear mixed models extend the linear model theory underpinning ANOVA to accommodate data that may be non-normal, may have heterogeneous variance, and, indeed, may be correlated. Viewed through the GLMM lens, the pre-1990s understanding of non-normal data—still pervasive in the agricultural research community—is antiquated at best, obsolete at worst.

By the mid-2000s, practical GLMM software began to appear. SAS introduced PROC GLIMMIX in 2005, and several GLMM tools appeared in R as well (e.g., glmmPQL, gee, lme4). This was a watershed moment for statistical analysis. For the first time, usable software existed to implement the full range of statistical models explicitly intended to accommodate both complex experiments (primarily a mixed model issue) and non-normal data (primarily a generalized linear model issue). The extensive development of theory and methodology during the previous decades became available to researchers in accessible form. This explains why the entire question of how to do statistical analysis of non-normal data from experimental research is now being reassessed. Statistics, like all other disciplines, is dynamic; it is not a fixed set of unchanging rules passed down from Fisher and company.

(The overview above is adapted from Stroup, 2014.)

Terminology

Statistics is an evolving science, and some terms and abbreviations have become outdated or confusing. Below are some common model types and their abbreviations. Be aware that, in this context, the abbreviation “GLM” does not refer to Proc GLM in SAS. The models dealt with here are referred to as GLMM or GLIM.

LM: Linear Model. Assumes a fixed linear process fit to Normal data. (PROC GLM).

LMM: Linear Mixed Model. Assumes a linear process with fixed and random components fit to Normal data. (PROC MIXED).

GLM: Generalized Linear Model. Assumes a fixed linear process fit to Normal or non-normal data. (PROC GENMOD). Note: This is different from PROC GLM!

GLMM or GLIM: Generalized Linear Mixed Model. Assumes a linear process with fixed and random components fit to Normal or non-normal data. (PROC GLIMMIX).


As mentioned above, the normal distribution is unlike the others in the exponential family of distributions that we use for statistical inference. Specifically, its mean and variance are independent and separate quantities. In all the other distributions we use, the variance is some function of the mean. The table below lists some common examples. In the Poisson distribution, for example, the mean and variance are the same quantity.

Distribution        | Mean     | Variance                | Type of Data & Range
--------------------|----------|-------------------------|-------------------------------------------------
Normal              | \(\mu\)  | \(\sigma^2\)            | Continuous, \((-\infty, \infty)\)
Binomial            | \(Np\)   | \(Np(1-p)\)             | Discrete count out of a total, integer \([0, N]\)
Negative Binomial   | \(\mu\)  | \(\mu + \phi\mu^2\)     | Discrete count, integer \([0, \infty)\)
Poisson             | \(\mu\)  | \(\mu\)                 | Discrete count, integer \([0, \infty)\)
Exponential         | \(\mu\)  | \(\mu^2\)               | Continuous, \([0, \infty)\)
Beta                | \(p\)    | \(p(1-p)/(1+\phi)\)     | Continuous, \([0, 1]\)

This departure from normality not only affects the assumption of constant variance, but also becomes important during estimation. Consider, for example, the commonly used model for a randomized complete block (RCB) experimental design:

\[y_{ij} = \mu + \tau_i + \beta_j + \tau\beta_{ij}\] where \(y_{ij}\) is the response, \(\mu\) is the overall mean, \(\tau_i\) is the \(i^{th}\) treatment effect, \(\beta_j\) is the \(j^{th}\) block effect, and \(\tau\beta_{ij}\) is the block-treatment interaction. Under an assumption of normality, the interaction term (usually taken as the model residual) provides the estimate of the variance, \(\sigma^2\). In most statistical software, this assumption is implicit and the interaction is not explicitly stated in the code. In SAS Proc Mixed, for example, we would specify:

Proc Mixed;
  Class treatment block;
  model y = treatment;   /* fixed treatment effect */
  random block;          /* block*treatment omitted; the residual estimates the variance */
run;

That is, the random statement does not specify block*treatment; SAS "assumes" the residual will estimate the variance.

When we switch to another distribution such as the Poisson, however, the mean and variance are equal, and hence we must specify the full model to accommodate both the mean and the variance. The corresponding code using the generalized linear mixed model procedure Proc Glimmix would then be:

Proc Glimmix;
  Class treatment block;
  Model y = treatment/dist=poisson;   /* assume a Poisson response */
  Random block block*treatment;       /* both random terms stated explicitly */
run;

To understand this, we must think about the process generating the data versus what we can observe. At the base of the data process, we assume the underlying nature of the data takes some form, say Poisson in this case, e.g., the process generates discrete counts of weeds in a plot or cells in a culture. The process of measuring it, however, is subject to sampling and error. That error is typically, with good reason, assumed to be random and normally distributed. What we actually observe is a combination of these two and is referred to as the marginal distribution; it is neither Poisson nor normal. What the Proc Glimmix model above does is allow us to separate the two: the Model and Random statements together describe the base unit response as Poisson, given (or after accounting for) the normally distributed random effects of blocks.

An important side note is the case where the underlying base distribution is normal. When that is combined with the distribution of the random effects, also normal, the result is itself normal. Hence, in the past, when we assumed a normal distribution, this issue never arose or needed to be considered.

The final consideration for generalized models is referred to as scale. When applying generalized distributions in linear modeling, estimation is best carried out on what is called the “natural parameter” of the distribution. This term refers to a function that linearizes or “links” the distribution to a linear additive predictor. The concept is similar, but not equivalent, to the classical practice of transforming data before analysis. For example, in a classical analysis we might take the logarithm of count data before analysis; in a generalized linear model with Poisson count data, we use a log link function. Other distributions have their own associated link functions: the binomial and Beta distributions use a logit link, while the Poisson and exponential distributions use a logarithmic (log) link. The normal distribution (and the associated lognormal distribution) is, again, unique relative to the others; its link is the identity function, which leaves the data unchanged, like multiplying by 1.0 or adding 0.0 to a data value.

This is important to understand because the model (and all its estimates and inferences) will be based on the link values, or model scale. To bring these back to an interpretable framework, we must untransform, or inverse link, the model scale estimates back to the data scale. These terms will be used and demonstrated in the examples that follow.
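For instance, with the logit link used for binomial data, a model scale estimate \(\eta\) and the corresponding data scale probability \(p\) are related by

\[\eta = \log\left(\frac{p}{1-p}\right), \qquad p = \frac{e^{\eta}}{1+e^{\eta}}\]

so a model scale estimate of \(\eta = 1.27\), for example, inverse links to \(p = e^{1.27}/(1+e^{1.27}) \approx 0.78\) on the data scale, a calculation that reappears in the binomial example later in this tutorial.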

2 Data used in examples

Three data examples are used for demonstration below and are from Stroup (2014). Readers wishing to access these data are referred to the supplementary files associated with that article.

2.1 Binomial Data

The first example, intro_binomial, is a set of binomial observations in a randomized complete block design. Each binomial observation consists of counts (Count) taken from 100 observations (Total) from each experimental unit. There are two treatments (Treatment) and eight blocks (Block). These data could represent responses such as the number of diseased plants in a sample of 100 plants from each plot in a field experiment or the number of insect traps with a target species out of 100 traps per plot, etc.

proc print data=intro_binomial;
title1 'Binomial Data';
title2 'Stroup, W. W. 2014.  Rethinking the Analysis of Non-Normal Data in Plant and Soil Science. Agron. J. 106:1–17. ';
run;
Binomial Data
Stroup, W. W. 2014. Rethinking the Analysis of Non-Normal Data in Plant and Soil Science. Agron. J. 106:1–17.

Obs block Treatment Count Total
1 1 0 98 100
2 1 1 94 100
3 2 0 95 100
4 2 1 36 100
5 3 0 93 100
6 3 1 85 100
7 4 0 94 100
8 4 1 88 100
9 5 0 99 100
10 5 1 91 100
11 6 0 61 100
12 6 1 82 100
13 7 0 84 100
14 7 1 43 100
15 8 0 92 100
16 8 1 71 100

2.2 Count Data

The second data set represents discrete counts recorded from a randomized complete block design. Again, there are two treatments (Trt) and eight blocks (Block). The counts are values greater than or equal to zero. Examples here might be the number of insects observed in a segment of time or the number of cows with mastitis per dairy. The underlying distribution for these counts can be considered either Poisson or Negative Binomial; the details are discussed further below.


proc print data=RCB_Counts;
title1 'Poisson/Negative Binomial Data';
title2 'Stroup, W. W. 2014.  Rethinking the Analysis of Non-Normal Data in Plant and Soil Science. Agron. J. 106:1–17. ';
run;
Poisson/Negative Binomial Data
Stroup, W. W. 2014. Rethinking the Analysis of Non-Normal Data in Plant and Soil Science. Agron. J. 106:1–17.

Obs Block Trt Count
1 1 0 1
2 1 1 36
3 2 0 5
4 2 1 109
5 3 0 21
6 3 1 30
7 4 0 7
8 4 1 48
9 5 0 2
10 5 1 0
11 6 0 6
12 6 1 2
13 7 0 0
14 7 1 5
15 8 0 19
16 8 1 26

2.3 Proportional Data

The last data set pertains to data that are inherently recorded as a proportion or percentage. Such data are unique in that they are constrained to a minimum of 0.0 on the low end and 1.0 (or 100%) on the upper end. Such data may be encountered in nutrient sampling (e.g., %C or %N) or in compositional data such as the percentage of a given taxon or the percentage makeup of a soil sample. The binomial data from the first data set could also be expressed as a proportion; however, if the complete binomial data (counts and totals) are present, it is advisable to treat them as binomial rather than as a proportion. Before using the proportional data type, data expressed as a percentage should be converted to a proportion by dividing by 100. Also, note that some data are recorded as a percent change or percent difference. While such data are similar to proportional data, they are not constrained to [0, 1]; they can exceed 1.0 or even be negative, and therefore must be handled in a different manner. For true proportional data, a strong candidate distribution is the Beta distribution, which is naturally constrained to the range [0, 1].


proc print data=beta_intro;
title1 'Proportion (Beta) Data';
title2 'Stroup, W. W. 2014.  Rethinking the Analysis of Non-Normal Data in Plant and Soil Science. Agron. J. 106:1–17. ';
run;
Proportion (Beta) Data
Stroup, W. W. 2014. Rethinking the Analysis of Non-Normal Data in Plant and Soil Science. Agron. J. 106:1–17.

Obs block trt proportion
1 1 1 0.573
2 1 2 0.925
3 2 1 0.044
4 2 2 0.835
5 3 1 0.888
6 3 2 0.949
7 4 1 0.008
8 4 2 0.941
9 5 1 0.990
10 5 2 0.994
11 6 1 0.409
12 6 2 0.958
13 7 1 0.117
14 7 2 0.520
15 8 1 0.926
16 8 2 0.975



There are other potential distributions that will not be covered here. These include the lognormal distribution (useful for positively skewed continuous data such as concentrations), binary data with two responses (Yes/No, 0/1, +/-, etc.), and even the normal distribution, which defaults to a traditional mixed model analysis. Although special circumstances can arise with any of these, the general concepts outlined here will apply to these distributions as well.



Additional Notes on Data

Many of the link functions associated with generalized linear modeling involve a logarithmic function. This can cause problems if the data contain values of zero, or if an argument in the link function resolves to zero. While there are more sophisticated models for handling zeros specifically (e.g., zero-inflated or hurdle models), those are beyond the scope of this tutorial. In many cases, the problems can be mitigated by altering the zero values slightly through the addition of a small constant, e.g., 0.0001. In the case of proportional data, this must include recoding values of 1.0 to 0.9999 as well. While this is not an optimal solution, it can yield viable results provided the data do not contain an excessive number of problematic values. Where excessive zeros are present, it may be advisable to consider alternative analyses, such as a conditional analysis of only the non-zero values, an analysis that omits treatments with a consistent zero response, or even forgoing inferential statistical analysis altogether and simply summarizing the data for reporting.
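As a minimal sketch of this recoding (assuming a hypothetical data set named props with a variable proportion), a data step could shift the boundary values before analysis:

data props_adj;                                       /* hypothetical data set names */
   set props;
   if proportion = 0 then proportion = 0.0001;        /* shift exact zeros */
   else if proportion = 1 then proportion = 0.9999;   /* shift exact ones (proportional data only) */
run;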


3 Required Statements

The statements and syntax used by Proc Glimmix are very similar to those of other procedures, such as Proc GLM or Proc Mixed.

Following the call to Proc Glimmix, there are three statement types that must be specified for ANOVA. The first is the CLASS statement, which defines variables that represent factor groupings or classifications, for example, treatment and replication ID numbers or letters. Since the values of these variables represent the levels of an effect, SAS will accept either numeric or alphabetic data in the CLASS statement. Note: data do not have to be sorted according to the CLASS variables.

The second required statement is MODEL. The MODEL statement specifies the response variable and how the independent fixed effect factors of the CLASS statement enter the model. The general form is:

MODEL Response = Fixed factors/<options>;

where Fixed factors is a list of all or some of the variables given in the CLASS statement. Proc Glimmix also provides the ability to state interaction as well as nested terms in the MODEL statement. Interaction effects represent combinations of factors and are specified with the * symbol between two or more terms, for example A*B. This notation can become cumbersome when many interactions are present, so a shorthand exists: a vertical bar between effects signifies all possible interactions. For example, A|B|C produces the effects A, B, A*B, C, A*C, B*C, and A*B*C, in that order. Nesting occurs when one effect exists completely within another effect, either in time or space. These effects are denoted with parentheses, such as B(A), read as ‘factor B within factor A’. The *, |, and () operators can be used alone or in combination to specify the exact model required. Brief examples follow.
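With hypothetical factors A and B, the first two MODEL statements below are equivalent, and the third shows a nested term:

MODEL y = A B A*B;    /* two factors plus their interaction */
MODEL y = A|B;        /* bar shorthand for the same three effects */
MODEL y = A B(A);     /* factor B nested within factor A */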

The Response in the MODEL statement differs when using binomial data. In that case, it is written in fractional form, X/Y, where X is the variable for counts and Y is the variable for the total samples (trials) evaluated per experimental unit.
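For example, using the Count and Total variables from the binomial data set introduced above:

MODEL Count/Total = Treatment;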

The <options> portion of the MODEL statement can differ from previous procedures. Specifically, this is where we tell SAS what distribution to assume (DIST=), as well as the link function (LINK=). For example, to request a Poisson distribution with a log link we would use:

MODEL y = A B A*B/ DIST=poisson link=log;

If the DIST option is omitted, SAS assumes a normal distribution. If the LINK option is omitted, SAS assumes the predefined default link for that distribution (refer to the SAS documentation on Glimmix for details). Only one MODEL statement can be given in a Proc Glimmix call.

The last required statement for a mixed model is the RANDOM statement. This statement tells SAS which factors to treat as random effects. These factors must be listed in the CLASS statement along with the fixed effects. As with fixed effects, random factors can be specified as interaction or nested terms in the same manner as in the MODEL statement. Unlike the MODEL statement, multiple RANDOM statements can be used within one Proc Glimmix call.

There are two methods of specifying RANDOM statements: direct and indirect. Each method can achieve the same goal, but the method of estimation for the random effects may differ between the two. In some cases where SAS has trouble estimating the model, the indirect method may help. The general form of each is:

Direct Form

Random replication;

Indirect Form

Random intercept/subject=replication;

In the indirect form, intercept and subject are keywords that must be used. Here, subject specifies the basic experimental or observational unit from which the variance components are computed. If multiple random effects are necessary and the indirect form is used, a separate RANDOM statement must be specified for each random effect. Alternatively, in the direct form, multiple random effects can be listed in a single RANDOM statement.
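For instance, using the block and treatment factors from the RCB examples, the two forms below specify the same pair of random effects:

Random block block*treatment;       /* direct: both effects in one statement */

Random intercept/subject=block;     /* indirect: equivalent to random block */
Random treatment/subject=block;     /* indirect: equivalent to random block*treatment */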

Both the direct and indirect forms can also take options. Most commonly, these options specify how the variance-covariance matrix should be structured, using a Type= option. By default, Proc Glimmix assumes a “variance component” structure in which a single variance is estimated for each random effect; this can also be requested explicitly with Type=VC. Other available types allow for the estimation of additional covariances between observations or subjects. The choice of covariance structure depends on the experimental setup, and SAS provides many different types. In using them, however, care should be taken to use the simplest structure that still represents the data well. Specifying a variance-covariance structure that is too complex will inevitably lead to estimation problems.

3.1 Additional Statements and Options

One of the strengths of mixed models is their ability to incorporate correlation structure. Repeated measures designs are an example of this and are accommodated in Proc Glimmix. Unlike Proc Mixed, however, this is done through the RANDOM statement outlined above. As with the standard RANDOM statement, the SUBJECT and TYPE options may be used to specify the basic unit at which the repeated effect occurs (e.g., individual plots, animals, etc.) as well as the correlation structure associated with it. The available correlation structures are numerous and cover both temporal and spatial effects. Typical examples for time effects are UN (unstructured), CS (compound symmetry), and AR(1) (first-order autoregressive). Spatial correlation can also be specified for most common semivariogram models, such as SP(POW), SP(GAUS), and SP(EXP) for power, Gaussian, and exponential models, respectively. As before, care should be taken when specifying correlation structures with large numbers of components to avoid computational problems in estimation.
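As a hypothetical sketch, if each experimental unit were measured repeatedly over time on a variable named, say, time (which would also appear in the CLASS statement), a first-order autoregressive correlation among the repeated measurements could be requested as:

Random time/subject=block*treatment type=AR(1) residual;   /* R-side repeated measures; names are illustrative */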

LSMEANS: Although Proc Glimmix estimates models using maximum likelihood techniques, SAS has retained the nomenclature LSMEANS, or Least Squares Means, for estimating means of fixed effects. SAS provides for comparison of LSMEANS through the DIFF or PDIFF options, which produce a table of all possible pairwise comparisons and their associated t-tests. It is important to remember that the probabilities associated with these comparisons are applicable to a limited number of preplanned tests only. If a multiple comparison adjustment is needed, the Adjust= option can be used with any of several adjustment types, such as Tukey, Bonferroni, etc. The default is equivalent to Fisher's LSD. In addition to the mean estimates, standard errors are produced for each mean. LSMEANS can also be used for interaction and nested effects, when applicable. Note, however, that LSMEANS can only be estimated for fixed effects listed in the CLASS statement.
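For example, to request all pairwise treatment comparisons with a Tukey adjustment, confidence limits, and data scale means:

LSMEANS Treatment/pdiff adjust=tukey cl ilink;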

One additional feature of Proc Glimmix over Proc Mixed is its plotting capability, which is particularly useful when investigating the nature of potential interactions. LSMEANS can be plotted within Proc Glimmix with confidence intervals, colored groupings, and joined points. An example for a two-factor interaction might be:

LSMEANS A*B/plots=mean(join cl sliceby = A ilink);

Here a plot of means versus factor B is produced, containing joined data points for each level of A with 95% confidence intervals. The ilink (inverse link) option requests that the plot be drawn on the data scale.

CONTRAST: Single degree of freedom contrasts, which test specific hypotheses of interest, are an alternative to pairwise comparisons. SAS provides for these with the CONTRAST statement. A single df contrast has the general form:

CONTRAST 'any label' Factor name {coefficients};

The label portion allows you to give a name to the hypothesis being tested so it can be identified in the output. The Factor name identifies which fixed effect the contrast works on, and the coefficients specify the contrast itself. An example might be:

CONTRAST 'Trts vs Ctrl' A 2 -1 -1;

Here the average of the LSMEANS for the last two levels of fixed effect A is contrasted with the first level. Some contrasts test more than one hypothesis at a time (multiple df contrasts); the component contrasts are separated by commas in the CONTRAST statement. For example:

CONTRAST 'Testit' A 2 -1 -1, A 0 1 -1;

would be a 2 degree of freedom contrast that jointly tests the average of levels 2 and 3 against level 1 of factor A, simultaneously with a comparison of level 2 against level 3 of factor A.

ODS: ODS is an abbreviation for Output Delivery System. In SAS, ODS is very general and serves many functions. Here, it allows us to output and save to a SAS data set many of the internal statistical values involved in ANOVA, most commonly model estimates and, specifically for ANOVA, LSMEANS. This gives the user the ability to manipulate, plot, or export the estimated means from SAS. The general format is:

ODS output LSMeans=<name>;

Here, “LSMeans” is a keyword indicating the statistic we want to save. Note the spelling and capitalization, as it must be written exactly in this form. The <name> portion is what you want to call the data set containing the means. This data set will have predefined variables indicating the fixed effect, its level, the mean value (Estimate), its standard error (StdErr), and the df, t-value, and p-value for testing whether the mean equals zero. If confidence limits (CL) were requested with the LSMeans, they will be output as well, as “Lower” and “Upper”. When multiple fixed effects are listed in the LSMEANS statement, the means for all listed effects will be present in the saved data set. Note: to use the ODS output statement, the requested statistic, such as LSMeans, must be produced as part of the Proc Glimmix call.
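As a minimal sketch, using the binomial example from the next section (the data set name trt_means is an arbitrary choice):

proc glimmix data=intro_binomial;
   class block Treatment;
   model Count/Total=Treatment/dist=binomial;
   random block block*treatment;
   lsmeans Treatment/cl ilink;     /* the LSMeans must be requested here... */
   ods output LSMeans=trt_means;   /* ...for ODS to save them to a data set */
run;

proc print data=trt_means;         /* inspect Estimate, StdErr, Mean, Lower, Upper, etc. */
run;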

4 Examples

4.1 Binomial Data

For binomial data, the response on the left-hand side of the model consists of two parts: the counts, indicating how many times something happened or was observed (Events), and a total, representing the number of times an event was assessed (Trials). In this hypothetical data set there were 100 trials for each treatment and block combination (experimental unit). These could be, say, 100 seeds in a germination tray or 100 wells on a reaction plate. The corresponding events or counts might be how many of the 100 seeds germinated or how many of the 100 wells showed a positive reaction. This scenario is repeated for each treatment in each block.


proc glimmix data=intro_binomial;
   class block Treatment;
   model Count/Total=Treatment/dist=binomial link=logit;   /* events/trials response, logit link */
   random block block*treatment;                           /* full random model */
   lsmeans Treatment / cl ilink plots=mean(cl ilink);      /* data scale means, CIs, and mean plot */
   title1 'Binomial Example';
run;
Binomial Example

Model Information
Data Set WORK.INTRO_BINOMIAL
Response Variable (Events) Count
Response Variable (Trials) Total
Response Distribution Binomial
Link Function Logit
Variance Function Default
Variance Matrix Not blocked
Estimation Technique Residual PL
Degrees of Freedom Method Containment

Class Level Information
Class Levels Values
block 8 1 2 3 4 5 6 7 8
Treatment 2 0 1

Number of Observations Read 16
Number of Observations Used 16
Number of Events 1306
Number of Trials 1600

Dimensions
G-side Cov. Parameters 2
Columns in X 3
Columns in Z 24
Subjects (Blocks in V) 1
Max Obs per Subject 16

Optimization Information
Optimization Technique Dual Quasi-Newton
Parameters in Optimization 2
Lower Boundaries 2
Upper Boundaries 0
Fixed Effects Profiled
Starting From Data

Iteration History
Iteration  Restarts  Subiterations  Objective Function  Change  Max Gradient
0 0 3 48.292827554 0.00650964 2.589E-6
1 0 3 48.309327634 0.00060717 1.529E-6
2 0 2 48.309580567 0.00001102 0.000015
3 0 1 48.309583194 0.00000165 2.598E-6
4 0 0 48.309582638 0.00000000 2.409E-6

Convergence criterion (PCONV=1.11022E-8) satisfied.

Fit Statistics
-2 Res Log Pseudo-Likelihood 48.31
Generalized Chi-Square 13.69
Gener. Chi-Square / DF 0.98

Covariance Parameter Estimates
Cov Parm  Estimate  Standard Error
block 0.5201 0.6037
block*Treatment 0.8335 0.4926

Type III Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
Treatment 1 7 6.75 0.0355

Treatment Least Squares Means
Treatment  Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower  Upper  Mean  Standard Error (Mean)  Lower (Mean)  Upper (Mean)
0 2.5502 0.4405 7 5.79 0.0007 0.05 1.5087 3.5917 0.9276 0.02958 0.8189 0.9732
1 1.2698 0.4232 7 3.00 0.0199 0.05 0.2691 2.2704 0.7807 0.07245 0.5669 0.9064

[Figure: Inverse-linked Count/Total least squares means for Treatment, with 95% confidence limits.]


4.1.1 Model Information

The initial portion of the output describes the data and options used. For Glimmix, there is additional information regarding the distribution and link function assumed. This is followed by the class levels recognized by SAS and the observations used. As with any ANOVA, this should be checked carefully to assure the data are being read correctly and that the classification factors are represented as you expect. Here we have two levels of treatment and eight blocks. For the binomial, the data are summarized by observations (16) as well as the total number of events (counts) and trials (totals). Every block and treatment combination had 100 trials, so the number of trials is reported as 16 × 100 = 1600; the number of events (1306) is the sum of all the counts.

Additional information on the estimation is given in subsequent tables. We won't cover the details here, but the final message in the section:

Convergence criterion (PCONV=1.11022E-8) satisfied.

is important. It indicates that the iterative algorithm of the estimation process was able to settle on an answer. You should always see a similar message when doing these analyses.

In the model fit table, an important statistic for discrete data is the generalized Chi-Square/DF. This value is an indicator of a condition known as overdispersion, which means the estimated variance is bigger than expected. Because the non-normal distributions have known relationships between their means and variances (see the table in the introduction above), we have some idea of what the variance of each should be. If this statistic is close to 1.0, there is little evidence of overdispersion. If it is notably larger, the data may be overdispersed; that is, the model, as specified, is not accounting for enough variability given the assumed distribution. In this case, the value is 0.98, which is close to one, suggesting the model is adequate. If, as an exercise, the model were rerun omitting the block*treatment random effect, the statistic would be 7.05, strongly suggesting the model is not accounting for enough variability. If an analysis shows overdispersion, there are several possibilities: 1) as in this example, we have incorrectly specified the model and perhaps omitted something from the random effects; 2) we may have assumed the wrong distribution (more on this in the next example); or 3) there is unaccounted-for variability in our data that we may or may not be able to adjust for. Models with overdispersion can lead to incorrect inference and should, therefore, be viewed skeptically.

4.1.2 Model Estimation

The final tables give estimates for the variance components (covariance parameter estimates), the ANOVA table for fixed effects, and the estimated means. The fixed effect of treatment here shows weak evidence for treatment differences (p=0.0355).

Estimated means are in the last table. For the binomial distribution, the parameter of interest is \(p\), the probability of an event (e.g., seed germination or a positive reaction) occurring. The logit link function used for modeling is defined as:

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\]

These logit values for each treatment are shown in the Estimate column of the table, along with their standard errors and tests of whether they equal zero. Note that these tests are on the linked (model scale) logit estimates, not the probability estimates of \(p\). The logit value equals zero only when the argument of the log function is 1.0, that is, when \(p = 1 - p\), or \(p = 0.5\). Therefore, these tests assess whether the event and non-event are equally likely. Such tests may or may not make sense, depending on the problem. The later columns of this table are usually of more interest to researchers and result from the ilink (inverse link) option in the LSMeans statement. This takes us off the model scale of estimation and back to the more interpretable data scale. The Mean column (and those to the right of it) shows the estimated probabilities of the event in each treatment, 0.92 and 0.78. Because we added the cl option, we also get 95% confidence intervals for these estimates.

The plot following the means table results from the plots= option and depicts each mean and its 95% confidence interval. These are also on the data scale, due to the ilink option in the plot request. One important note on this graph: the confidence intervals are non-symmetric. This occurs because the binomial parameter, \(p\), is restricted to between 0.0 and 1.0. With non-normal data this will always be the case and, because of this, plots that display means plus or minus standard errors are incorrect. It is strongly advised that plots be reported with confidence intervals, not standard errors, when using GLMM.

4.2 Count Data

Count data are common in agricultural and biological research. In this case, we again assume a randomized complete block design with two treatments and eight blocks. Here the response in the model is a single variable, count. The Poisson distribution is a good fit here, as it describes the probability of some number of discrete events occurring in a set time or space. Examples would be the number of animal behaviors (drinking, eating, resting, etc.) in an hour, or the number of seeds in a weighed sample core. As in the binomial case, the random statement specifies a full model with all possible random effects.

In many biological data sets, count data can also exhibit overdispersion. For such data, a common adjustment is to change the assumed distribution to the negative binomial (negbin in SAS). This distribution has an extra parameter, \(\phi\), to account for the extra variation. Code for this model is shown below (commented out) for demonstration. The nloptions statement in that example helps SAS converge on an estimate by extending the default number of iterations.

proc glimmix data=RCB_Counts;
   class block trt;
   model count=trt/d=poisson link=log;
   lsmeans trt/plots=mean(ilink cl);
   random block block*trt;
   title1 'Poisson Example';
run;



NOTE: Convergence criterion (PCONV=1.11022E-8) satisfied.
NOTE: PROCEDURE GLIMMIX used (Total process time):
      real time           2.28 seconds
      cpu time            0.53 seconds
      

/*
proc glimmix data=RCB_Counts;
   class block trt;
   model count=trt/d=negbin link=log;
   nloptions maxiter=50;
   random block;
   lsmeans trt/plots=mean(ilink cl);
run;
*/
Poisson Example

Model Information
Data Set WORK.RCB_COUNTS
Response Variable Count
Response Distribution Poisson
Link Function Log
Variance Function Default
Variance Matrix Not blocked
Estimation Technique Residual PL
Degrees of Freedom Method Containment

Class Level Information
Class Levels Values
Block 8 1 2 3 4 5 6 7 8
Trt 2 0 1

Number of Observations Read 16
Number of Observations Used 16

Dimensions
G-side Cov. Parameters 2
Columns in X 3
Columns in Z 24
Subjects (Blocks in V) 1
Max Obs per Subject 16

Optimization Information
Optimization Technique Dual Quasi-Newton
Parameters in Optimization 2
Lower Boundaries 2
Upper Boundaries 0
Fixed Effects Profiled
Starting From Data

Iteration History
Iteration  Restarts  Subiterations  Objective Function  Change  Max Gradient
0 0 5 55.140021235 0.08825631 3.34E-7
1 0 3 55.810329353 0.02397786 0.00003
2 0 3 55.922604042 0.00270732 2.989E-7
3 0 2 55.928975696 0.00015000 2.107E-7
4 0 1 55.929217383 0.00000511 4.093E-6
5 0 0 55.929227014 0.00000000 3.668E-6

Convergence criterion (PCONV=1.11022E-8) satisfied.

Fit Statistics
-2 Res Log Pseudo-Likelihood 55.93
Generalized Chi-Square 15.16
Gener. Chi-Square / DF 1.08

Covariance Parameter Estimates
Cov Parm  Estimate  Standard Error
Block 0.7912 0.9337
Block*Trt 1.2672 0.8284

Type III Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
Trt 1 7 3.75 0.0939

Trt Least Squares Means
Trt  Estimate  Standard Error  DF  t Value  Pr > |t|
0 1.5283 0.5447 7 2.81 0.0263
1 2.7142 0.5260 7 5.16 0.0013

[Figure: Inverse-linked Count least squares means for Trt, with 95% confidence limits.]


4.2.1 Model Information and Estimates

The output from this model is generally similar to the binomial example above. Here the link function is log, so the Estimate column in the LSMeans table contains log-scale values. In this run the ilink request appears only in the plot options; adding the ilink option to the LSMEANS statement itself (as in the binomial example) would append data scale estimated counts under a Mean column and its associated columns. As seen in the previous example, the confidence intervals in the mean plot are non-symmetric, and the estimates, in this case, are bounded to be strictly greater than zero.

4.3 Proportional Data

Proportional data are measured naturally on a percentage or proportional scale and are best described with a Beta distribution. They do not have the Trials of binomial data and, on the proportional scale, must lie between 0.0 and 1.0. Percentages should be divided by 100 prior to analysis, and values equal to 0.0 or 1.0 should be changed to 0.0001 and 0.9999, respectively, to avoid computational problems. As with the binomial distribution, the link here is the logit and the parameter of interest is the proportion for an event. Examples could include the percent of a field plot showing disease or the percent N in a soil sample.

The Beta distribution can be tricky to fit. Data that contain numerous values close to 0.0 or 1.0 can cause computational difficulties, and getting such analyses to converge may require several adjustments to advanced estimation options not covered here.

The code below demonstrates this distribution in Glimmix. The estimation method quadrature is recommended in the call to Glimmix for proportional data. For quadrature estimation, however, SAS requires the indirect method of specifying the random statement(s). Like the negative binomial, the Beta has an extra parameter, \(\phi\), to account for variability, so in this case the random term block*trt is omitted.

proc glimmix data=beta_intro method=quadrature;
   class block trt;
   model proportion=trt/d=beta link=logit;
   random intercept / subject=block;
   lsmeans trt / cl ilink plots=mean(ilink cl);
   title 'Proportional (Beta) Data';
run;




NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: PROCEDURE GLIMMIX used (Total process time):
      real time           2.32 seconds
      cpu time            0.54 seconds
      

Proportional (Beta) Data

Model Information
Data Set WORK.BETA_INTRO
Response Variable proportion
Response Distribution Beta
Link Function Logit
Variance Function Default
Variance Matrix Blocked By block
Estimation Technique Maximum Likelihood
Likelihood Approximation Gauss-Hermite Quadrature
Degrees of Freedom Method Containment

Class Level Information
Class Levels Values
block 8 1 2 3 4 5 6 7 8
trt 2 1 2

Number of Observations Read 16
Number of Observations Used 16

Dimensions
G-side Cov. Parameters 1
R-side Cov. Parameters 1
Columns in X 3
Columns in Z per Subject 1
Subjects (Blocks in V) 8
Max Obs per Subject 2

Optimization Information
Optimization Technique Dual Quasi-Newton
Parameters in Optimization 4
Lower Boundaries 2
Upper Boundaries 0
Fixed Effects Not Profiled
Starting From GLM estimates
Quadrature Points 7

Iteration History
Iteration  Restarts  Evaluations  Objective Function  Change  Max Gradient
0 0 4 -13.93330413 . 4.587547
1 0 2 -15.86535019 1.93204606 1.979748
2 0 5 -16.82623233 0.96088214 0.768319
3 0 5 -16.84689488 0.02066255 1.019098
4 0 2 -16.8584461 0.01155122 0.754649
5 0 4 -16.88715996 0.02871386 0.162441
6 0 3 -16.89102183 0.00386188 0.093296
7 0 2 -16.89203944 0.00101761 0.104623
8 0 4 -16.89414547 0.00210603 0.026818
9 0 3 -16.89420959 0.00006412 0.000963
10 0 3 -16.89420988 0.00000028 0.00035
11 0 3 -16.89420989 0.00000001 9.328E-6

Convergence criterion (GCONV=1E-8) satisfied.

Fit Statistics
-2 Log Likelihood -16.89
AIC (smaller is better) -8.89
AICC (smaller is better) -5.26
BIC (smaller is better) -8.58
CAIC (smaller is better) -4.58
HQIC (smaller is better) -11.04

Fit Statistics for Conditional Distribution
-2 log L(proportion | r. effects) -23.35
Pearson Chi-Square 11.94
Pearson Chi-Square / DF 0.75

Covariance Parameter Estimates
Cov Parm  Subject  Estimate  Standard Error
Intercept block 0.3328 0.8315
Scale 1.8540 1.2657

Type III Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
trt 1 7 3.89 0.0892

trt Least Squares Means
trt  Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower  Upper  Mean  Standard Error (Mean)  Lower (Mean)  Upper (Mean)
1 -0.05378 0.4332 7 -0.12 0.9047 0.05 -1.0781 0.9705 0.4866 0.1082 0.2539 0.7252
2 1.3565 0.6275 7 2.16 0.0674 0.05 -0.1274 2.8404 0.7952 0.1022 0.4682 0.9448

[Figure: Inverse-linked proportion least squares means for trt, with 95% confidence limits.]


References:

Stroup, W. W. 2014. “Rethinking the Analysis of Non-Normal Data in Plant and Soil Science.” Agronomy Journal 106: 1–17.