One-way ANOVA Chickwts
Note: This article shows an R example on how to conduct a one-way analysis of variance (ANOVA) on the dataset chickwts. For general information about ANOVA please refer to the following article: ANOVA.
In short: In this article, a one-way ANOVA is conducted on the dataset chickwts, which examines differences in the weight of chickens based on different feed supplements.
Contents
The Dataset[edit]
The Puromycin dataset is included in the R base package shows the results of an experiment where newly hatched chicks allocated into six groups. Each group received a different feed supplement, and the weights of the chicks (in grams) after six weeks were recorded along with the corresponding feed type.
The variables included in this analysis are:
- weight: a numeric vector containing the chick weight
- feed: a factor containing the different feed types (casein, horsebean, linseed, meatmeal, soybean and sunflower)
We want to know if different feed types explain the variation in chick weight to check the effectiveness of the various feed supplements. This is tested using an ANOVA, which evaluates whether the null hypothesis can be rejected in favour of the alternative hypothesis.
- Null hypothesis (H0): There are no differences between the group means of the different feed supplements.
- Alternative hypothesis (H1): There are differences between the group means of the different feed supplements.
To conduct the analysis you might need to download libraries that are used in the code with the command install.packages(”name of library”).
Inspecting the Data[edit]
Before conducting the analysis we can inspect the dataset and prepare it for our analysis and check for the assumptions of an ANOVA. These are:
- normality of the residuals of the model - this can be approximated by measuring the normality of the dependent variable for each factor
- Equal variance between the factors
- Equal amount of measurements for each factor (balanced design)
- Independent measurements in each group
The package multcompView is installed to use a function required later for the post-hoc test. The data is inspected with the commands str() and summary() which provide information on the structure of the dataset and the number of measurements for each factor. To check for the normal distribution we can check the histogram of weight (Fig. 1). The normality of the residuals should still be checked later. Equal variance can be checked by an F-test or the data can be visualised in a box plot (Fig. 2). Converting the variable “state” into a factor ensures that an ANOVA, in which the predictor is factorial, can be performed.
install.packages("multcompView")
library(multcompView)
# Load the built-in chickwts dataset
data("chickwts")
head(chickwts)
#Overview of dataset structure
str(chickwts)
#Check summary statistics
summary(chickwts)
#normality of variance between factos
boxplot(chickwts$weight~chickwts$feed,
col = terrain.colors(6))
#check for assumptions
#normality of dependent variable
hist(chickwts$weight)
#Convert state to factor (it already is, but ensure it)
chickwts$feed <- as factor(chickwts$feed)
Output[edit]
> str(chickwts)
'data.frame': 71 obs. of 2 variables:
$ weight: num 179 160 136 227 217 168 108 124 143 140 ...
$ feed : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
> summary(chickwts)
weight feed
Min. :108.0 casein :12
1st Qu.:204.5 horsebean:10
Median :258.0 linseed :12
Mean :261.3 meatmeal :11
3rd Qu.:323.5 soybean :14
Max. :423.0 sunflower:12
The str() command is used to examine the structure of a dataframe and its variable types. The variable “weight” is numeric, while “feed” is a factor.
The summary gives basic information about the values of the different variables like the minimum and maximum values, quartiles, median and mean. For factor variables the different factor levels and their abundance in the dataset is printed. We need this information to check whether the design is balanced, which is an important assumption for ANOVA. A balanced design means in this case that all groups of chickens have approximately the same sample size. The histogram roughly shows a normal distribution (Fig. 1). The boxplots can be used to examine if the variance between the different feed supplements differs (Fig. 2). As it can be seen the variance is not quite the same for each factor which means other variables potentially influence the reaction rate. The results of the following analysis therefore need to be interpreted with care.
Conducting the one-way-ANOVA[edit]
To test for differences in the mean weights of the different chicken groups, a one-way ANOVA is conducted. It is important to specify "weight" as the dependent variable, followed by "feed" as the independent variable that explains it.
#one-way ANOVA model chickwts_model<-aov(weight~feed, data=chickwts) #Display ANOVA summary summary(chickwts_model)
Output[edit]
Df Sum Sq Mean Sq F value Pr(>F) feed 5 231129 46226 15.37 5.94e-10 *** Residuals 65 195556 3009 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Df: The Degrees of freedom are a measure of the Independent values that can vary in our design. For each factor they are calculated by n-1.
Pr: The P-value is under 0.05, thus we reject the null hypothesis that there is no difference between the group means. This means that the different feed types have a significant effect on the weight of the chicken.
Sum Sq: The sum of squares represents the total variation in the data, which is divided into variation explained by the predictor and unexplained variation (residuals). The proportion of explained variance is calculated by dividing the “feed” variable sum of squares by the total sum of squares. Around 54% of the variation in chicken weight can be explained by the type of feed, while the remaining variation is unexplained and may be due to factors such as genetics, environmental conditions, or random variation.
Evaluating ANOVA Results[edit]
An ANOVA has certain assumptions that can be checked with different tests. To check for the normal distribution of the residuals a Shapiro-Wilk-test is performed. Since the result of this test is quite dependent on a sample size, checking for a normal distribution with histograms is usually more robust (Fig. 3). The residuals can also be plotted in diagnostic plots (Fig. 4). The residuals are the difference between the observed values and the values predicted by the model. In the case of ANOVA, they represent the difference between the actual observations and the group means estimated by the model. The normality of the residuals is an important assumption of ANOVA. If this assumption is not fulfilled, the results of the analysis may not be reliable.
#Check assumptions #Residual normality shapiro.test(residuals(chickwts_model)) #Plot diagnostic plots par(mfrow=c(2,2)) plot(chickwts_model) #histogram of residuals par(mfrow=c(1,1)) hist(residuals(chickwts_model))
Output[edit]
Shapiro-Wilk normality test data: residuals(chickwts_model) W = 0.98616, p-value = 0.6272
The P-value in the Shapiro-Wilk-test is above 0.05 which means the null-hypothesis that the residuals are normally distributed cannot be rejected and we assume a normal distribution. The histogram of the residuals suggests that they are approximately normally distributed, indicating that this assumption is fulfilled (Fig. 3). The Q–Q plot (Fig. 4) provides the same information in a different form: if the residuals are normally distributed, the points should lie close to the reference line. In this case, the points closely follow the line, which is consistent with the pattern observed in the histogram. In the residuals vs. fitted plot (Fig. 4), the residuals should ideally be randomly scattered like “stars in the sky”. Due to the structure of the ANOVA model, some grouping of residuals is unavoidable. However, the residuals appear to be roughly evenly distributed across the fitted values, indicating no clear violations of model assumptions.
Post-hoc-Test[edit]
An ANOVA simply analyses if there is a significant difference between all the means. A post-hoc test examines which specific group means differ from each other, correcting for errors in the significance levels when we perform multiple tests. For the post-hoc test the TukeyHSD test is used. It only needs to be applied if the ANOVA itself delivers a significant result. The function multcompView is used to display the results of the post-hoc test using letters. Groups with the same letters do not differ significantly from each other, while groups that do not share the same letter differ significantly.
#Post-hoc test tukey_mod<-TukeyHSD(chickwts_model) print(tukey_mod) tukey_letters <- multcompView::multcompLetters4(chickwts_model, tukey_mod) tukey_letters
Output[edit]
> print(tukey_mod)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = weight ~ feed, data = chickwts)
$feed
diff lwr upr p adj
horsebean-casein -163.383333 -232.346876 -94.41979 0.0000000
linseed-casein -104.833333 -170.587491 -39.07918 0.0002100
meatmeal-casein -46.674242 -113.906207 20.55772 0.3324584
soybean-casein -77.154762 -140.517054 -13.79247 0.0083653
sunflower-casein 5.333333 -60.420825 71.08749 0.9998902
linseed-horsebean 58.550000 -10.413543 127.51354 0.1413329
meatmeal-horsebean 116.709091 46.335105 187.08308 0.0001062
soybean-horsebean 86.228571 19.541684 152.91546 0.0042167
sunflower-horsebean 168.716667 99.753124 237.68021 0.0000000
meatmeal-linseed 58.159091 -9.072873 125.39106 0.1276965
soybean-linseed 27.678571 -35.683721 91.04086 0.7932853
sunflower-linseed 110.166667 44.412509 175.92082 0.0000884
soybean-meatmeal -30.480519 -95.375109 34.41407 0.7391356
sunflower-meatmeal 52.007576 -15.224388 119.23954 0.2206962
sunflower-soybean 82.488095 19.125803 145.85039 0.003884
The Tukey HSD test shows that several feed types differ significantly in their effects on chicken weight. For example, linseed and casein differ strongly (p < 0.001), as do soybean and casein (p < 0.01). In contrast, linseed and soybean do not differ significantly (p > 0.05), indicating similar mean weights for these groups. This pattern is also reflected in the letter display, where groups sharing the same letter are not significantly different. For instance, linseed (“bc”) and soybean (“b”) share a common letter and therefore do not differ significantly, while casein (“a”) and linseed (“bc”) do not share a letter and are significantly different.
The author of this entry is Hauke Haese