One-way ANOVA Chickwts

Note: This article shows an R example on how to conduct a one-way analysis of variance (ANOVA) on the dataset chickwts. For general information about ANOVA please refer to the following article: ANOVA.

In short: In this article, a one-way ANOVA is conducted on the dataset chickwts, which examines differences in the weight of chickens based on different feed supplements.

The Dataset[edit]

The Puromycin dataset is included in the R base package shows the results of an experiment where newly hatched chicks allocated into six groups. Each group received a different feed supplement, and the weights of the chicks (in grams) after six weeks were recorded along with the corresponding feed type.

The variables included in this analysis are:

  • weight: a numeric vector containing the chick weight
  • feed: a factor containing the different feed types (casein, horsebean, linseed, meatmeal, soybean and sunflower)

We want to know if different feed types explain the variation in chick weight to check the effectiveness of the various feed supplements. This is tested using an ANOVA, which evaluates whether the null hypothesis can be rejected in favour of the alternative hypothesis.

  • Null hypothesis (H0): There are no differences between the group means of the different feed supplements.
  • Alternative hypothesis (H1): There are differences between the group means of the different feed supplements.

To conduct the analysis you might need to download libraries that are used in the code with the command install.packages(”name of library”).

Inspecting the Data[edit]

Before conducting the analysis we can inspect the dataset and prepare it for our analysis and check for the assumptions of an ANOVA. These are:

  • normality of the residuals of the model - this can be approximated by measuring the normality of the dependent variable for each factor
  • Equal variance between the factors
  • Equal amount of measurements for each factor (balanced design)
  • Independent measurements in each group

The package multcompView is installed to use a function required later for the post-hoc test. The data is inspected with the commands str() and summary() which provide information on the structure of the dataset and the number of measurements for each factor. To check for the normal distribution we can check the histogram of weight (Fig. 1). The normality of the residuals should still be checked later. Equal variance can be checked by an F-test or the data can be visualised in a box plot (Fig. 2). Converting the variable “state” into a factor ensures that an ANOVA, in which the predictor is factorial, can be performed.

install.packages("multcompView")
library(multcompView)

# Load the built-in chickwts dataset
data("chickwts")
head(chickwts)

#Overview of dataset structure 
str(chickwts)

#Check summary statistics 
summary(chickwts)

#normality of variance between factos
boxplot(chickwts$weight~chickwts$feed,
col = terrain.colors(6))

#check for assumptions 
#normality of dependent variable 
hist(chickwts$weight)

#Convert state to factor (it already is, but ensure it)
chickwts$feed <- as factor(chickwts$feed)

Output[edit]

Fig. 1: Histogram of weight
Fig. 2: Boxplots of chick weight by feed type
> str(chickwts)
'data.frame':	71 obs. of  2 variables:
 $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
 $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...

> summary(chickwts)
     weight             feed   
 Min.   :108.0   casein   :12  
 1st Qu.:204.5   horsebean:10  
 Median :258.0   linseed  :12  
 Mean   :261.3   meatmeal :11  
 3rd Qu.:323.5   soybean  :14  
 Max.   :423.0   sunflower:12

The str() command is used to examine the structure of a dataframe and its variable types. The variable “weight” is numeric, while “feed” is a factor. The summary gives basic information about the values of the different variables like the minimum and maximum values, quartiles, median and mean. For factor variables the different factor levels and their abundance in the dataset is printed. We need this information to check whether the design is balanced, which is an important assumption for ANOVA. A balanced design means in this case that all groups of chickens have approximately the same sample size. The histogram roughly shows a normal distribution (Fig. 1). The boxplots can be used to examine if the variance between the different feed supplements differs (Fig. 2). As it can be seen the variance is not quite the same for each factor which means other variables potentially influence the reaction rate. The results of the following analysis therefore need to be interpreted with care.

Conducting the one-way-ANOVA[edit]

To test for differences in the mean weights of the different chicken groups, a one-way ANOVA is conducted. It is important to specify "weight" as the dependent variable, followed by "feed" as the independent variable that explains it.

#one-way ANOVA model
chickwts_model<-aov(weight~feed, data=chickwts)

#Display ANOVA summary
summary(chickwts_model)

Output[edit]

Df Sum Sq Mean Sq F value   Pr(>F)    
feed         5 231129   46226   15.37 5.94e-10 ***
Residuals   65 195556    3009                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Df: The Degrees of freedom are a measure of the Independent values that can vary in our design. For each factor they are calculated by n-1.

Pr: The P-value is under 0.05, thus we reject the null hypothesis that there is no difference between the group means. This means that the different feed types have a significant effect on the weight of the chicken.

Sum Sq: The sum of squares represents the total variation in the data, which is divided into variation explained by the predictor and unexplained variation (residuals). The proportion of explained variance is calculated by dividing the “feed” variable sum of squares by the total sum of squares. Around 54% of the variation in chicken weight can be explained by the type of feed, while the remaining variation is unexplained and may be due to factors such as genetics, environmental conditions, or random variation.

Evaluating ANOVA Results[edit]

An ANOVA has certain assumptions that can be checked with different tests. To check for the normal distribution of the residuals a Shapiro-Wilk-test is performed. Since the result of this test is quite dependent on a sample size, checking for a normal distribution with histograms is usually more robust (Fig. 3). The residuals can also be plotted in diagnostic plots (Fig. 4). The residuals are the difference between the observed values and the values predicted by the model. In the case of ANOVA, they represent the difference between the actual observations and the group means estimated by the model. The normality of the residuals is an important assumption of ANOVA. If this assumption is not fulfilled, the results of the analysis may not be reliable.

#Check assumptions
#Residual normality
shapiro.test(residuals(chickwts_model))

#Plot diagnostic plots
par(mfrow=c(2,2))
plot(chickwts_model)

#histogram of residuals
par(mfrow=c(1,1))
hist(residuals(chickwts_model))

Output[edit]

Fig. 3: Histogram residuals Chickwts
Fig. 4: Diagnostic plots Chickwts
Shapiro-Wilk normality test

data:  residuals(chickwts_model)
W = 0.98616, p-value = 0.6272

The P-value in the Shapiro-Wilk-test is above 0.05 which means the null-hypothesis that the residuals are normally distributed cannot be rejected and we assume a normal distribution. The histogram of the residuals suggests that they are approximately normally distributed, indicating that this assumption is fulfilled (Fig. 3). The Q–Q plot (Fig. 4) provides the same information in a different form: if the residuals are normally distributed, the points should lie close to the reference line. In this case, the points closely follow the line, which is consistent with the pattern observed in the histogram. In the residuals vs. fitted plot (Fig. 4), the residuals should ideally be randomly scattered like “stars in the sky”. Due to the structure of the ANOVA model, some grouping of residuals is unavoidable. However, the residuals appear to be roughly evenly distributed across the fitted values, indicating no clear violations of model assumptions.

Post-hoc-Test[edit]

An ANOVA simply analyses if there is a significant difference between all the means. A post-hoc test examines which specific group means differ from each other, correcting for errors in the significance levels when we perform multiple tests. For the post-hoc test the TukeyHSD test is used. It only needs to be applied if the ANOVA itself delivers a significant result. The function multcompView is used to display the results of the post-hoc test using letters. Groups with the same letters do not differ significantly from each other, while groups that do not share the same letter differ significantly.

#Post-hoc test
tukey_mod<-TukeyHSD(chickwts_model)
print(tukey_mod)
tukey_letters <- multcompView::multcompLetters4(chickwts_model, tukey_mod)
tukey_letters

Output[edit]

> print(tukey_mod)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = weight ~ feed, data = chickwts)

$feed
                           diff         lwr       upr     p adj
horsebean-casein    -163.383333 -232.346876 -94.41979 0.0000000
linseed-casein      -104.833333 -170.587491 -39.07918 0.0002100
meatmeal-casein      -46.674242 -113.906207  20.55772 0.3324584
soybean-casein       -77.154762 -140.517054 -13.79247 0.0083653
sunflower-casein       5.333333  -60.420825  71.08749 0.9998902
linseed-horsebean     58.550000  -10.413543 127.51354 0.1413329
meatmeal-horsebean   116.709091   46.335105 187.08308 0.0001062
soybean-horsebean     86.228571   19.541684 152.91546 0.0042167
sunflower-horsebean  168.716667   99.753124 237.68021 0.0000000
meatmeal-linseed      58.159091   -9.072873 125.39106 0.1276965
soybean-linseed       27.678571  -35.683721  91.04086 0.7932853
sunflower-linseed    110.166667   44.412509 175.92082 0.0000884
soybean-meatmeal     -30.480519  -95.375109  34.41407 0.7391356
sunflower-meatmeal    52.007576  -15.224388 119.23954 0.2206962
sunflower-soybean     82.488095   19.125803 145.85039 0.003884

The Tukey HSD test shows that several feed types differ significantly in their effects on chicken weight. For example, linseed and casein differ strongly (p < 0.001), as do soybean and casein (p < 0.01). In contrast, linseed and soybean do not differ significantly (p > 0.05), indicating similar mean weights for these groups. This pattern is also reflected in the letter display, where groups sharing the same letter are not significantly different. For instance, linseed (“bc”) and soybean (“b”) share a common letter and therefore do not differ significantly, while casein (“a”) and linseed (“bc”) do not share a letter and are significantly different.


The author of this entry is Hauke Haese