Chi-Square Test_HairEyeColor
Note: This article shows an example of how to conduct a chi-square test of stochastic independence on the dataset HairEyeColor. For more general information regarding the chi-square test, please refer to the article about Simple Statistical Tests.
In short: In this article, two exemplary chi-square tests of stochastic independence are conducted on the dataset HairEyeColor and the results are visualized.
The Dataset
The dataset HairEyeColor is an object of class table, stored as a 3-dimensional array. It ships with base R in the datasets package and cross-tabulates 592 observations on 3 variables: the hair colour, the eye colour and the sex of statistics students.
The variables are:
- Hair containing the levels Black, Brown, Red, Blond
- Eye containing the levels Brown, Blue, Hazel, Green
- Sex containing the levels Male and Female
#load dataset
data("HairEyeColor")
str(HairEyeColor)
The structure of the dataset can be examined with the command str(), which delivers the following output:
> str(HairEyeColor)
 'table' num [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
 - attr(*, "dimnames")=List of 3
  ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
  ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
  ..$ Sex : chr [1:2] "Male" "Female"
>
The output shows that the object is a table with three dimensions of size 4x4x2, which means there are 4 categories for Hair, 4 categories for Eye, and 2 categories for Sex. The names of the categories are listed under dimnames.
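The same information can also be queried piece by piece, which is a quick sanity check before running any test. The following is a minimal sketch that uses only base R functions and assumes nothing beyond the dataset itself:

```r
# load the dataset (part of the datasets package that ships with R)
data("HairEyeColor")

# size of each dimension, in the order Hair, Eye, Sex
dim(HairEyeColor)

# category names of each variable
dimnames(HairEyeColor)

# total number of observations across all cells
sum(HairEyeColor)
```

The total returned by sum() should match the 592 observations mentioned in the dataset description.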
Chi-Square Test of Hair and Eye
A chi-square test of the relation between hair and eye colour tests the null hypothesis that the two attributes are independent. To do so, it compares the observed counts with the counts that would be expected if the null hypothesis were true.
To perform the test on the dataset, the counts for male and female students are summed to give a two-way table that contains only hair and eye colour. The chi-square test is then performed on this table. To illustrate the test, the expected values can also be printed and compared with the observed values.
#create table that sums up male and female to only include hair and eye color
hair_eye <- margin.table(HairEyeColor, c("Hair", "Eye"))
hair_eye
#perform chisquare test on hair and eye colour
chisq.test(hair_eye)
#compare values with rounded expected values that the chisquare test uses.
round((chisq.test(hair_eye))$expected)
Output
> hair_eye
       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16
>
> #perform chisquare test on hair and eye color
> chisq.test(hair_eye)
Pearson's Chi-squared test
data: hair_eye
X-squared = 138.29, df = 9, p-value < 2.2e-16
>
> #compare values with rounded expected values that the chisquare test uses.
> round((chisq.test(hair_eye))$expected)
       Eye
Hair    Brown Blue Hazel Green
  Black    40   39    17    12
  Brown   106  104    45    31
  Red      26   26    11     8
  Blond    47   46    20    14
>
The values used to evaluate the chi-square test are:
- Df: The degrees of freedom are a measure of how many values can vary once the row and column totals are fixed. For this table, this means (4-1)x(4-1) = 9.
- X-squared: The X-squared value is a measure of how much the observed values differ from the expected values under independence. For a given number of degrees of freedom, a larger X-squared value is stronger evidence against independence.
- p-value: The p-value is derived from the df and the X-squared value. Here it is below 2.2e-16, far under the usual significance level of 0.05, which means we can reject the null hypothesis that there is no relation between hair colour and eye colour and assume an association between the two variables.
When comparing the expected values with the observed values, we can see that they differ strongly in several cells. For example, 94 students with blond hair and blue eyes were observed against about 46 expected, and only 7 with blond hair and brown eyes against about 47 expected. Since eye colour and hair colour are both genetically influenced traits, an association between them is plausible.
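The three values described above can be reproduced by hand, which makes clear what chisq.test() computes. The following is a sketch under the assumption that no continuity correction is involved (R applies one only to 2x2 tables); the variable names expected, x_squared, df and p_value are chosen here for illustration:

```r
data("HairEyeColor")
hair_eye <- margin.table(HairEyeColor, c("Hair", "Eye"))

# expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(hair_eye), colSums(hair_eye)) / sum(hair_eye)

# Pearson's chi-square statistic: sum of (observed - expected)^2 / expected
x_squared <- sum((hair_eye - expected)^2 / expected)

# degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(hair_eye) - 1) * (ncol(hair_eye) - 1)

# p-value: upper tail of the chi-square distribution with df degrees of freedom
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

round(x_squared, 2)
df
p_value
```

The results should agree with the X-squared and df values reported by chisq.test(hair_eye) in the output above.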
Visualizing the Results
The results of a chi-square test can be visualized in different plots. In this case, an association plot is used that shows, for each cell, the deviation of the observed from the expected values (Fig. 1). Bars above the baseline indicate more observations than expected, bars below it indicate fewer. The height of a bar is proportional to the cell's Pearson residual, whose square is the contribution of that cell to the X-squared value.
assocplot(hair_eye, main = "Hair and Eye Colour")
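The deviations drawn in the association plot can also be inspected as numbers. The sketch below recreates the hair_eye table so that it runs on its own and reads the Pearson residuals from the result of chisq.test(). The call to mosaicplot() with shade = TRUE is an alternative base R visualization and is not the plot used in this article:

```r
data("HairEyeColor")
hair_eye <- margin.table(HairEyeColor, c("Hair", "Eye"))

# Pearson residuals: (observed - expected) / sqrt(expected)
res <- chisq.test(hair_eye)$residuals
round(res, 1)

# the squared residuals are the cell contributions and sum to X-squared
sum(res^2)

# alternative plot: tile area reflects the cell counts,
# shading reflects the size and sign of the residuals
mosaicplot(hair_eye, shade = TRUE, main = "Hair and Eye Colour")
```

Positive residuals correspond to bars above the baseline in the association plot, negative residuals to bars below it.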
Chi-Square Test of Sex and Hair
For comparison with a pair of variables that is likely to be less strongly related, a chi-square test of sex and hair colour is performed. A new table is created in which the counts are summed over the different eye colours.
#create table that sums up eye color to only include sex and hair color
hair_sex <- margin.table(HairEyeColor, c("Hair", "Sex"))
hair_sex
#perform chisquare test on sex and hair color
chisq.test(hair_sex)
#compare values with rounded expected values that the chisquare test uses.
round((chisq.test(hair_sex))$expected)
Output
> hair_sex
       Sex
Hair    Male Female
  Black   56     52
  Brown  143    143
  Red     34     37
  Blond   46     81
>
> #perform chisquare test on sex and hair color
> chisq.test(hair_sex)
Pearson's Chi-squared test
data: hair_sex
X-squared = 7.9942, df = 3, p-value = 0.04613
>
> #compare values with rounded expected values that the chisquare test uses.
> round((chisq.test(hair_sex))$expected)
       Sex
Hair    Male Female
  Black   51     57
  Brown  135    151
  Red     33     38
  Blond   60     67
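>
With p = 0.046, the result falls only just below the 0.05 level, and the observed and expected counts are close in most cells. The largest gap is for blond hair, with 46 males observed against about 60 expected. The evidence for an association between sex and hair colour is therefore much weaker than for hair and eye colour.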
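To put the two tests on a common scale, an effect size can be computed in addition to the test statistic. The sketch below uses Cramér's V, which is not part of the original analysis and is brought in here as an assumption about what a reader might want next. The helper cramers_v() is defined in the block for illustration and is not a base R function:

```r
data("HairEyeColor")
hair_eye <- margin.table(HairEyeColor, c("Hair", "Eye"))
hair_sex <- margin.table(HairEyeColor, c("Hair", "Sex"))

# cramers_v() is a hypothetical helper written for this example
# V = sqrt(X-squared / (n * (min(rows, columns) - 1))), ranging from 0 to 1
cramers_v <- function(tab) {
  x_squared <- unname(chisq.test(tab)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  sqrt(x_squared / (n * k))
}

cramers_v(hair_eye)
cramers_v(hair_sex)
```

A larger value indicates a stronger association, so the two results can be compared directly even though the tables have different dimensions.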
Further Information
HairEyeColor dataset: RDocumentation
The author of this entry is Jana Simon.