Type 3_ANOVA_ChickWeight

Note: This article shows an R example on how to conduct a type 3 analysis of variance (ANOVA) on the dataset ChickWeight. For general information about ANOVA please refer to the following article: ANOVA.

In short: In this article, an exemplary type 3 ANOVA is conducted on the dataset ChickWeight which examines if the weight of chicks differs depending on different diets. The results are further investigated with post-hoc tests and visualized.

before running the code you might need to install the packages using the command install.packages() eg. install.packages(”car”).

The Dataset[edit]

The dataset ChickWeight is included in the R base package and contains information on the effect of the early diet on the weight of chicks. Chicks were fed with 4 different protein diets and weighted at birth and every second day after until day 20 and also on day 21.

The variables that will be included in the analysis are:

  • Chick: An ordered factor giving a unique identifier for each chick.
  • weight: A numeric variable containing the bodyweight of the chick.
  • diet: A factor with the levels 1-4, indicating the diet the respective chick received.
  • Time: A numeric vector containing the number of days since birth when the measurement was made.

Examining the Dataset[edit]

Before conducting the analysis, we can examine the dataset and check for the different assumptions of an ANOVA to make sure that the ANOVA produces valid results. The assumptions of the ANOVA are:

  • Normality of the residuals of the model - this can be approximated by measuring the normality of the dependent variable for each factor
  • Equal variance for each factor
  • Independent treatment groups
  • Equal size in each treatment group (balanced design)

The command str() displays the internal structure while summary() displays information about the different variables.

To check for the equal size of groups, the summary() command can be used. Since the chicks were weighted on different days, the measurements are not independent of each other. The data therefore needs to be adapted. The normality of the different factors can be visualized with histograms or tested with a Shapiro-Wilk test. The variances can be tested with a Levene test or visualized in a boxplot, which is usually done.

#Load the necessary library
library(car)

#load the data
data("ChickWeight")

#Inspect the dataset

str(ChickWeight)
summary(ChickWeight)

#Convert Diet to a factor variable (if not already)
ChickWeight$Diet <- as.factor(ChickWeight$Diet)

Output[edit]

The structure and summary of the dataset deliver following results:

> str(ChickWeight)
Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and 'data.frame':	578 obs. of  4 variables:
 $ weight: num  42 51 59 64 76 93 106 125 149 171 ...
 $ Time  : num  0 2 4 6 8 10 12 14 16 18 ...
 $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
 $ Diet  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "formula")=Class 'formula'  language weight ~ Time | Chick
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "outer")=Class 'formula'  language ~Diet
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "labels")=List of 2
  ..$ x: chr "Time"
  ..$ y: chr "Body weight"
 - attr(*, "units")=List of 2
  ..$ x: chr "(days)"
  ..$ y: chr "(gm)"

> summary(ChickWeight)
     weight           Time           Chick     Diet   
 Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
 1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
 Median :103.0   Median :10.00   20     : 12   3:120  
 Mean   :121.8   Mean   :10.72   10     : 12   4:118  
 3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
 Max.   :373.0   Max.   :21.00   19     : 12          
                                 (Other):506

For the numeric variables weight and time, the minimum and maximum values as well as the quartiles and median and mean are shown. For the factorial variables chick and diet, it is indicated how many datapoints for each factor exist. Each treatment group has an equal size.

A problem for conducting an ANOVA in this dataset lies in repeated measurements of the weight of the chicks over the course of 21 days. The measurements are thus not independent and conducting an ANOVA on this data could lead to a type 1 error (falsely rejecting the 0 hypothesis) and claiming a significant effect where there is none.

Adjusting the Dataset[edit]

To perform an ANVOA we need to either apply a model that corrects for repeated measurements like a mixed effect model or adjust the dataset in a a way that an ANOVA can be performed. In this case we can try to create a subset where we only perform the ANOVA for one of the days that the weight was measured. To include as much of the effect of the diet as possible, we chose day 21. After creating the subset, the structure and summary is checked again.

SubsetDay21 <- subset(ChickWeight, ChickWeight$Time == 21)
summary(SubsetDay21)
str(SubsetDay21)

Output[edit]

SubsetDay21 <- subset(ChickWeight, ChickWeight$Time == 21)
> summary(SubsetDay21)
     weight           Time        Chick    Diet  
 Min.   : 74.0   Min.   :21   13     : 1   1:16  
 1st Qu.:167.0   1st Qu.:21   9      : 1   2:10  
 Median :205.0   Median :21   20     : 1   3:10  
 Mean   :218.7   Mean   :21   10     : 1   4: 9  
 3rd Qu.:266.0   3rd Qu.:21   17     : 1         
 Max.   :373.0   Max.   :21   19     : 1         
                              (Other):39         
> str(ChickWeight)
Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and 'data.frame':	578 obs. of  4 variables:
 $ weight: num  42 51 59 64 76 93 106 125 149 171 ...
 $ Time  : num  0 2 4 6 8 10 12 14 16 18 ...
 $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
 $ Diet  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "formula")=Class 'formula'  language weight ~ Time | Chick
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "outer")=Class 'formula'  language ~Diet
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "labels")=List of 2
  ..$ x: chr "Time"
  ..$ y: chr "Body weight"
 - attr(*, "units")=List of 2
  ..$ x: chr "(days)"
  ..$ y: chr "(gm)"
> str(SubsetDay21)
Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and 'data.frame':	45 obs. of  4 variables:
 $ weight: num  205 215 202 157 223 157 305 98 124 175 ...
 $ Time  : num  21 21 21 21 21 21 21 21 21 21 ...
 $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 17 14 11 18 12 20 5 7 13 ...
 $ Diet  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
>

The subset only contains the measurements for day 21 so the independence of the measurements is now given. The problem now lies in the unbalanced design. The number of measurements for each factor is different for each diet. A normal type 1 ANOVA can thus not be applied. For this case a type 2 or type 3 ANOVA can be applied that is more robust against unbalance designs.

Conducting Type 3 ANOVA[edit]

A type 3 ANOVA is conducted with the subset containing only the data of day 21. To conduct a Type 3 ANOVA the command Anova() with a capital A needs to be used. This command uses the Anova() function of the library car where the type of ANOVA can be specified.

The data is then visualized in a boxplot (Fig. 1).

#perform type 3 ANOVA for subset of day 21, does Diet affect the weight
options(contrasts = c("contr.sum", "contr.poly"))
model <- lm(weight ~ Diet, data = SubsetDay21)
result <- Anova(model, type = 3)
print(result)

#summarise data in boxplots for day 21 (Fig. 1)
boxplot(weight ~ Diet, data = SubsetDay21,
        xlab = "Diet",
        ylab = "Weight on day 21",
        main = "Weight on day 21 by diet",
        col = "burlywood1"
        )

Output[edit]

Fig. 1: Boxplot showing the weight on day 21 of the chicks depending on the diet.
> print(result)
Anova Table (Type III tests)

Response: weight
             Sum Sq Df  F value    Pr(>F)    
(Intercept) 2174324  1 531.1463 < 2.2e-16 ***
Diet          57164  3   4.6547  0.006858 ** 
Residuals    167839 41                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

For interpreting the results, we will only take a look at the effect of Diet. The intercept does not matter for our analysis, since it is an indicator that is of importance for a regression analysis.

  • Df: The Degrees of freedom are a measure of the Independent values that can vary in our design. For each factor they are calculated by n-1.
  • Sum Sq: The sum of squares is a measure of how much variance is explained by the independent variable. The group Sum Sq plus the residuals Sum Sq equals the total variation. If you divide the Sum Sq of one variable by the total variation, you get the explained variation by variable group. Since we ignore the intercept in this calculation, Diet explains roughly 25% of the variation.
  • Pr: The p - value indicates the probability that the 0 hypothesis (Diet has no effect on weight) is true. Since the p-value for diet is below 0.05, we can reject the 0-hypothesis and thus assume an effect of Diet on weight.

Comparing with type 1 ANOVA[edit]

To test if the type 3 ANOVA is more robust against unbalanced designs, we can perform a type 1 ANOVA and compare the results.

#Perform type 1 ANOVA: Does diet affect chick weight?
anova_result <- aov(weight ~ Diet, data = SubsetDay21)

#Display ANOVA summary
summary(anova_result)

Output[edit]

> #Display ANOVA summary
> summary(anova_result)
            Df Sum Sq Mean Sq F value  Pr(>F)   
Diet         3  57164   19055   4.655 0.00686 **
Residuals   41 167839    4094                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

In this case the unbalanced design seems to have negligible effects on the ANOVA since the results of the type 1 and type 3 ANOVA are nearly the same.

Evaluating Model[edit]

A type 3 ANOVA still assumes certain distributions even if it is more robust against unbalanced designs:

  • Normal distribution of the residuals
  • Homogeneity of variances between the different groups

The normality of the residuals can be examined with a Shapiro-Wilk-test or simply with a histogram. For the homogeneity of variances, a Levene´s test which is included in the library car can be performed or the boxplots can be checked.

#Check assumptions
#1. Normality of residuals using Shapiro-Wilk test and a histogram (Fig. 2)
shapiro.test(residuals(model))
hist(residuals(model))

Output[edit]

> shapiro.test(residuals(model))

	Shapiro-Wilk normality test

data:  residuals(model)
W = 0.98846, p-value = 0.9293

>

The Shapiro-Wilk test has the 0-hypothesis that the data is normally distributed. Since the p-value is above 0.05 we assume a normal distribution that can also be seen in the histogram (Fig. 2). As it can be seen in the boxplot (Fig. 1), the equal variance between factors is not completely given however an ANOVA can still be performed.

Fig. 2: Histogram of the type 3 ANOVA

Post-hoc Test[edit]

An ANOVA simply analyses if there is a difference between all the means. A post-hoc test is testing for the differences of the means in the individual groups, correcting for errors in the significance levels when we perform multiple tests.

for the post-hoc test the TukeyHSD test is used. It only needs to be applied if the ANOVA itself delivers a significant result.

#perform TukeyHSD test on model
TukeyHSD(aov(model))

Output[edit]

> TukeyHSD(aov(model))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = model)

$Diet
         diff        lwr       upr     p adj
2-1  36.95000  -32.11064 106.01064 0.4868095
3-1  92.55000   23.48936 161.61064 0.0046959
4-1  60.80556  -10.57710 132.18821 0.1192661
3-2  55.60000  -21.01591 132.21591 0.2263918
4-2  23.85556  -54.85981 102.57092 0.8486781
4-3 -31.74444 -110.45981  46.97092 0.7036249

The TukeyHSD test shows a significant difference only between diet 3 and 1. In this case we can conclude that the diet of chicks impacts their weight however to different extent.

Further Resources[edit]

ChickWeight dataset: RDocumentation

ANOVA() command: RDocumentation

ANOVA: Cookbook for R


The author of this entry is Jana Simon