One-way ANOVA_Puromycin
Note: This article shows an R example on how to conduct a one-way analysis of variance (ANOVA) on the dataset Puromycin. For general information about ANOVA please refer to the following article: ANOVA.
In short: In this article an exemplary one-way ANOVA is conducted on the dataset Puromycin, which examines the difference in the reaction of cells based on their treatment with Puromycin. The result of the ANOVA is then compared with the result of a t-test and the results are visualized.
Contents
The Dataset[edit]
The Puromycin dataset is included in the R base package and shows the rate of an enzymatic reaction as a function of substrate concentration for two groups of cells — treated and untreated with Puromycin.
The variables we will look at in our analysis are:
- rate: a numeric vector containing reaction rates of the treated and untreated cells.
- state: a factor with the levels “treated” and “untreated”.
We will test whether the mean reaction rate differs between the treated and untreated groups. This indicates if Puromycin has an influence on the reaction rates of the cells.
To conduct the analysis you might need to download libraries that are used in the code with the command install.packages(”name of library”).
Examining the Dataset[edit]
Before conducting the analysis we can inspect the dataset and prepare it for our analysis and check for the assumptions of an ANOVA. These are:
- Normality of the residuals of the model - this can be approximated by measuring the normality of the dependent variable for each factor
- Equal variance between the factors
- Equal amount of measurements for each factor (balanced design)
- Independent measurements in each group
The data is inspected with the commands str() and summary() which provide information on the structure of the dataset and the number of measurements for each factor. To check for the normal distribution we can check the histogram of rate. This is only a rough check of the assumption. The normality of the residuals should still be checked later. To check for equal variance either an F-test can be used or the data can be visualized in a box plot.
Converting the variable “state” into a factor ensures that an ANOVA, in which the predictor is factorial, can be performed.
#Load the built-in Puromycin dataset
data("Puromycin")
head(Puromycin)
#Overview of dataset structure
str(Puromycin)
#Check summary statistics
summary(Puromycin)
#check for assumptions
#normality of dependent variable with histogram (Fig. 1)
hist(Puromycin$rate)
#Convert state to factor (it already is, but ensure it)
Puromycin$state <- as.factor(Puromycin$state)
Output[edit]
> str(Puromycin)
'data.frame': 23 obs. of 3 variables:
$ conc : num 0.02 0.02 0.06 0.06 0.11 0.11 0.22 0.22 0.56 0.56 ...
$ rate : num 76 47 97 107 123 139 159 152 191 201 ...
$ state: Factor w/ 2 levels "treated","untreated": 1 1 1 1 1 1 1 1 1 1 ...
> summary(Puromycin)
conc rate state
Min. :0.0200 Min. : 47.0 treated :12
1st Qu.:0.0600 1st Qu.: 91.5 untreated:11
Median :0.1100 Median :124.0
Mean :0.3122 Mean :126.8
3rd Qu.:0.5600 3rd Qu.:158.5
Max. :1.1000 Max. :207.0
>
With the str() command, the structure of the dataframe, including the kind of variables it includes, can be examined. “conc” and “rate” are both numeric, while “state” is a factor.
The summary gives basic information about the vales of the different variables like the minimum and maximum values, quartiles, median and mean. For factorial variables the different factor levels and their abundance in the dataset is printed. This can be helpful to determine wether tests like the ANOVA, which require balanced designs, can be performed. In this case the number of measurements for treated and untreated slightly differs which should be kept in mind when evaluating the results of the analysis.
The histogram roughly shows a normal distribution of rate. Especially that there is no clear skew to one side.
The boxplots that are created later to visualize the result of the ANOVA can also be used to examine if the variance between the factors treated and untreated differs. As it can be seen the variance is not quite the same for each factor which means other variables potentially influence the reaction rate. The results of the following analysis therefore need to be interpreted with care.
Conducting ANOVA[edit]
To test the difference between the means of the reaction rates of the treated and untreated cells, a one-way ANOVA is conducted. Since there is only two groups that we want to compare, this could also be achieved with a t-test.
#Fit one-way ANOVA model anova_puro <- aov(rate ~ state, data = Puromycin) #Display ANOVA summary summary(anova_puro)
The examined data can be visualized in a boxplot (Fig.2) to visualize the difference between means.
boxplot(rate ~ state, data = Puromycin,
main = "Reaction Rate by Puromycin Treatment",
xlab = "Treatment State",
ylab = "Reaction Rate",
col = "lightgreen")
Output[edit]
> summary(anova_puro)
Df Sum Sq Mean Sq F value Pr(>F)
state 1 5464 5464 2.596 0.122
Residuals 21 44201 2105
>
- Df: The degrees of freedom are a measure of the independent values that can vary in our design. For each factor they are calculated by n-1.
- Pr: The P-value is above 0.05, thus we cannot reject the 0 Hypothesis that there is no difference between the group means. This means that the treatment with Puromycin seems to have no significant effect on the reaction rate of the cells.
- Sum Sq: The sum of squares is a measure of how much variance is explained by the predictor variable. The group Sum Sq plus the residuals Sum Sq equals to the total variation. If the Sum Sq of the variable state is divided by the Total Sum Sq (sum of squares of residuals plus the variables), you get the explained variation by variable group. In this case only a very small part of the variation is explained by the state which ties in with the non-significant p-value.
Conducting a T-test[edit]
Since an ANOVA is used to test for the difference in means of more than two groups, in this case we could also perform a t-test to test for the difference between the two treatments.
t.test(rate ~ state, data = Puromycin)
Output[edit]
> t.test(rate ~ state, data = Puromycin)
Welch Two Sample t-test
data: rate by state
t = 1.6375, df = 19.578, p-value = 0.1175
alternative hypothesis: true difference in means between group treated and group untreated is not equal to 0
95 percent confidence interval:
-8.504864 70.216985
sample estimates:
mean in group treated mean in group untreated
141.5833 110.7273
The results of the t-test and the ANOVA are nearly the same and in both cases not significant. The difference is due to the different assumptions of the tests. The ANOVA assumes equal variances. The welch t-test, which is performed by R if not specified otherwise, does not assume equal variances and therefore delivers a slightly different result. If the student two sample t-test had been performed, the p-values would have been the same.
Evaluating ANOVA Results[edit]
An ANOVA has certain assumptions that can be checked with different tests. To check for the normal distribution of the residuals a Shapiro-Wilk-test is performed. Since the result of this test is quite dependent on a sample size, checking for a normal distribution with histograms is usually more robust. The residuals can also be plotted in diagnostic plots.
#Check assumptions #Residual normality shapiro.test(residuals(anova_puro)) #Plot diagnostic plots (Fig. 3) par(mfrow=c(2,2)) plot(anova_puro) #histogram of residuals (Fig. 4) hist(residuals(anova_puro))
Output[edit]
> shapiro.test(residuals(anova_puro)) Shapiro-Wilk normality test data: residuals(anova_puro) W = 0.96255, p-value = 0.5168
The p-value in the Shapiro-Wilk-test is above 0.05 which means the 0-Hypothesis that the residuals are normally distributed cannot be rejected, and we assume a normal distribution.
The residuals vs. fitted should ideally be equally distributed for each factor. In this model there seems to be roughly the same amount of variance for each factor although the distribution could be more equal. In the Q-Q plot, the residuals should roughly follow the line. We can still see the factors here which is unavoidable in an ANOVA. The diagnostic plots used here are hence more appropriate for a regression analysis. Alternatively, one can look at the distribution of the residuals on all factor levels, ideally with boxplots or histograms.
The residuals are roughly normally distributed, however they show a slight negative skew (left skew).
Further Resources[edit]
Puromycin dataset: RDocumentation
ANOVA: Cookbook for R
T-test: Cookbook for R
Q-Q-plot: Cookbook for R
The author of this entry is Jana Simon