Correlation airquality

Note: This article shows an R example on how to conduct a correlation on the dataset airquality. For general information about correlations, please refer to the article about Correlations.

In short: This article conducts exemplary correlations on the dataset airquality in R. The correlations are visualised in scatterplots, a correlation matrix of the dataset is generated with the library corrplot, and a scatterplot matrix is created with the library PerformanceAnalytics.

The Dataset

The dataset airquality includes daily air quality measurements in New York from May to September 1973.

The variables we will include in our analysis are:

  • Ozone: Integer variable containing the mean ozone concentration in parts per billion from 1 pm to 3 pm at Roosevelt Island
  • Solar.R: Integer variable containing the solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 8 am to 12 pm at Central Park
  • Wind: Numeric variable containing the average wind speed in miles per hour at 7 am and 10 am at LaGuardia Airport
  • Temp: Integer variable containing the maximum daily temperature in degrees Fahrenheit at LaGuardia Airport

Before running the code, you might need to install the libraries used by running install.packages("corrplot") or install.packages("PerformanceAnalytics").
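As a minimal sketch (not part of the original workflow), the following lines check which of the two libraries are still missing before anything is installed. The variable names pkgs, installed and missing_pkgs are chosen here for illustration only.

```r
# Packages used in this article
pkgs <- c("corrplot", "PerformanceAnalytics")

# requireNamespace() returns FALSE if a package is not installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install
}
```

The installation call is left commented out so that nothing is downloaded unintentionally.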

Examining the Dataset

Before conducting the correlations, the dataset is examined to get an idea of its structure.

Correlations are used to test the relationship between two continuous variables. Pearson's r measures the strength of a linear relationship, and its significance test assumes that both variables are approximately normally distributed. Spearman's rho, on the other hand, is based on ranks and is therefore more robust against non-normally distributed data.

library(corrplot)
library(PerformanceAnalytics)
#load dataset
data("airquality")
df <- airquality

#inspect the data
head(df)
str(df)
summary(df)

Output

> head(df)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

> str(df)
'data.frame':	153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
 
 > summary(df)
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                     
 
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0  
                               
>

The commands str(), summary() and head() provide an overview of the values in the dataset. The structure provides insight into the data type of the variables. In this case, all variables except Wind are integers. The summary provides more detailed information about the individual columns of the dataframe. Here we can see that Ozone and Solar.R have missing values, as indicated by the NA's (NA stands for “not available” and represents missing values). Calculations that involve missing values return NA by default, so many statistical tests cannot be performed on them directly. The data therefore has to be adjusted before the analysis can continue.
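The missing values can also be counted directly. The following sketch uses only base R functions on the built-in dataset; the object names na_counts and n_complete are chosen here for illustration.

```r
# Count missing values per column of the built-in dataset
na_counts <- colSums(is.na(airquality))
print(na_counts)

# A single NA makes the result of a calculation NA by default
mean(airquality$Ozone)                # NA
mean(airquality$Ozone, na.rm = TRUE)  # mean of the observed values only

# Number of rows without any missing value
n_complete <- sum(complete.cases(airquality))
print(n_complete)
```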

Adjusting the Data

Missing data needs to be dealt with on a case-by-case basis. In some cases, NA's represent zeros that were entered incorrectly into the dataframe, in which case NA can simply be replaced by 0. In other cases, there are just a few missing values and the affected rows can simply be removed. When there are many missing values, other methods of replacing NA's that do not affect the statistical explanatory power need to be applied. In this case, the command na.omit() is used to remove every row in the dataframe that contains a missing value. Since there are 153 observations in total and 37 NA's in the variable Ozone alone, a lot of the data is lost when performing this command: only 111 complete rows remain. This is not ideal for statistical analyses, but it is often considered less problematic than replacing the missing values with guessed ones.

df <- na.omit(df)
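
As an alternative to removing rows up front, cor() can drop missing values separately for each pair of variables. The sketch below is only a suggestion for keeping more observations; it is not used in the rest of this article, and the object names are chosen for illustration.

```r
vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# Listwise deletion: only rows that are complete in all four variables
cor_listwise <- cor(na.omit(airquality[, vars]), method = "spearman")

# Pairwise deletion: each coefficient uses all rows complete for that pair
cor_pairwise <- cor(airquality[, vars], method = "spearman",
                    use = "pairwise.complete.obs")

# Wind and Temp have no missing values, so this pair uses all 153 rows
round(cor_pairwise - cor_listwise, 3)
```

With pairwise deletion, each coefficient may be based on a different number of observations, which should be kept in mind when comparing them.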

To create a correlation matrix of the dataframe, Month and Day are removed to prevent non-meaningful correlations (e.g. between month and day).

#create dataset with only the measured variables (drop Month and Day) to be able to perform correlations
df_numeric <- df[ , !(names(df) %in% c("Month", "Day")) ]

head(df_numeric)

Performing Correlations

When performing correlations, one should usually start with a hypothesis to limit type I errors (falsely rejecting the null hypothesis), which can occur when a correlation between variables is found by chance. The more correlations are performed, the more likely it is that at least one of them appears significant purely by chance.
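
To illustrate this, here is a minimal sketch that assumes independent tests at a significance level of 0.05. With the four variables used here, six pairwise correlations are possible. The p-values passed to p.adjust() are made up purely for illustration.

```r
alpha <- 0.05
k <- choose(4, 2)   # number of variable pairs among four variables: 6

# Probability of at least one false positive among k independent tests
fwer <- 1 - (1 - alpha)^k
round(fwer, 3)      # 0.265

# Hypothetical p-values (illustration only), corrected with Holm's method
p_raw <- c(0.001, 0.012, 0.030, 0.040, 0.200, 0.700)
p_holm <- p.adjust(p_raw, method = "holm")
p_holm
```

Under these assumptions, the chance of at least one false positive across six tests is roughly one in four, which is why a correction such as Holm's method can be useful when many correlations are explored.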

Before performing the correlations, the normality of the data is checked with histograms (Fig. 1).

#check for normal distributions of variables with histograms
par(mfrow = c(2,2)) # create 2 by 2 grid
#create a histogram for each variable in the dataframe
for (v in names(df_numeric)) {
  hist(df_numeric[[v]],
       main = v,
       xlab = v)
}
par(mfrow = c(1,1)) #reset grid
Fig. 1: Histograms of the variables in the dataset

Wind, Solar.R and Temp are roughly normally distributed. Ozone, however, has a strong positive (right) skew and is not normally distributed.
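
Histograms can be complemented by a formal test. The following sketch applies the Shapiro-Wilk test from base R to each variable; using this test here is a suggestion and not part of the original analysis. A small p-value indicates a deviation from a normal distribution.

```r
df_numeric <- na.omit(airquality)[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Shapiro-Wilk test per variable; extract only the p-value
shapiro_p <- sapply(df_numeric, function(x) shapiro.test(x)$p.value)
signif(shapiro_p, 3)
```

With more than a hundred observations, the test can flag deviations that look minor in a histogram, so both should be considered together.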

Correlation of Temperature and Ozone

A Spearman correlation of Ozone and Temp is performed, since Ozone is not normally distributed. Spearman's rank correlation coefficient replaces the values of both variables by their ranks and measures how well the relationship between them can be described by a monotonic function. Since it uses ranks, it also captures monotonic relationships that are not linear, is less affected by extreme values, and works for non-normally distributed data.
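
The relation between both coefficients can be checked directly: Spearman's rho equals Pearson's r calculated on the ranks of the values. A minimal sketch, with object names chosen for illustration:

```r
df_numeric <- na.omit(airquality)[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Spearman's rho as returned by cor()
rho <- cor(df_numeric$Ozone, df_numeric$Temp, method = "spearman")

# Pearson's r on the ranks (ties receive average ranks)
r_ranks <- cor(rank(df_numeric$Ozone), rank(df_numeric$Temp))

all.equal(rho, r_ranks)  # TRUE
```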

The correlation is then plotted in a scatterplot (Fig. 2).

#perform correlation
cor.test(df_numeric$Ozone, df_numeric$Temp, method = "spearman")
#Plot Temp against Ozone in a scatterplot
plot(df_numeric$Ozone, df_numeric$Temp,
     xlab = "Ozone [ppb]",
     ylab = "Temperature [Fahrenheit]",
     main = "Temperature and Ozone", 
     pch = 16,
     col = "cadetblue")

Output

> cor.test(df_numeric$Ozone, df_numeric$Temp, method = "spearman")

	Spearman's rank correlation rho

data:  df_numeric$Ozone and df_numeric$Temp
S = 51753, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.7729319 
>
Fig. 2: Scatterplot of Temp and Ozone

The correlation delivers a highly significant p-value and a high positive correlation coefficient. The significant p-value indicates that the null hypothesis, that there is no correlation between Ozone and Temp, can be rejected. The positive correlation coefficient indicates that higher values of one variable tend to be accompanied by higher values of the other. We can thus assume that temperature and ozone concentration are related in some way. However, the correlation alone tells us nothing about causality. For causal statements, further analyses such as a regression with sufficient theoretical underpinning would be needed.

Correlation of Temperature and Wind

Next, the correlation of Temp and Wind is tested and a scatterplot is created. Since both variables are roughly normally distributed, Pearson's r is used.

#perform correlation
cor.test(df_numeric$Wind, df_numeric$Temp, method = "pearson")
#fit a linear model to obtain a regression line
modelTW <- lm(Temp~Wind, data = df_numeric)
#plot correlation in a scatterplot
plot(df_numeric$Wind, df_numeric$Temp,
     xlab = "Wind [mph]",
     ylab = "Temperature [Fahrenheit]",
     main = "Temperature and Wind", 
     pch = 16,
     col = "burlywood")
#add the regression line of the linear model to the scatterplot
abline(modelTW)

Output

> cor.test(df_numeric$Wind, df_numeric$Temp, method = "pearson")

	Pearson's product-moment correlation

data:  df_numeric$Wind and df_numeric$Temp
t = -5.9827, df = 109, p-value = 2.842e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.6256061 -0.3425410
sample estimates:
       cor 
-0.4971897 

>
Fig. 3: Scatterplot of Temp and Wind

This correlation also delivers a highly significant p-value, together with a moderate negative correlation coefficient (r ≈ -0.50). There is a negative relation between Temp and Wind, which means that as one variable increases, the other tends to decrease, and vice versa.
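
The test statistic reported by cor.test() follows from the correlation coefficient and the sample size via t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom. As a sketch, this can be reproduced by hand; the names t_manual and p_manual are chosen here for illustration.

```r
df_numeric <- na.omit(airquality)[, c("Ozone", "Solar.R", "Wind", "Temp")]

r <- cor(df_numeric$Wind, df_numeric$Temp)   # Pearson's r
n <- nrow(df_numeric)                        # number of complete observations

# t statistic with n - 2 degrees of freedom
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
# two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)

c(t = t_manual, df = n - 2, p = p_manual)
```

This also shows why a moderate coefficient can be highly significant: with a given r, the t statistic grows with the number of observations.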

Correlation Matrix

To explore data, a correlation matrix can be used. This can be created with the library “corrplot”. It provides the correlation coefficients for each combination of variables of the dataframe.

#create plot of a correlation matrix that shows the Spearman correlation coefficient
corrplot(cor(df_numeric, method = "spearman"), method = "number")
Fig. 4: Correlation matrix

The correlation coefficients for each combination of variables are printed in the matrix (Fig. 4). Blue represents a positive and red represents a negative correlation. The intensity of the colour represents how strong the correlation is. Each variable has a correlation coefficient of 1.0 with itself. Apart from that, the most strongly correlated pairs of variables seem to be Temp and Ozone as well as Wind and Ozone. Solar.R and Wind have the weakest correlation.
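
To read the strongest and weakest pairs from the matrix without relying on the colours, the coefficients can be listed and sorted. The following is a minimal base R sketch; the object names cm, idx and pairs_df are chosen here for illustration.

```r
df_numeric <- na.omit(airquality)[, c("Ozone", "Solar.R", "Wind", "Temp")]
cm <- cor(df_numeric, method = "spearman")

# Keep each pair once by using the upper triangle of the matrix
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs_df <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                       var2 = colnames(cm)[idx[, "col"]],
                       rho  = cm[idx])

# Sort by absolute strength of the correlation
pairs_df[order(-abs(pairs_df$rho)), ]
```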

Scatterplot Matrix

With the library “PerformanceAnalytics” a scatterplot matrix can be created.

#calling the chart.Correlation() function and defining a few parameters.
chart.Correlation(df_numeric, histogram = TRUE)
Fig. 5: Scatterplot matrix

The scatterplot matrix (Fig. 5) provides the correlation coefficients as well as their significance, indicated by the asterisks, in the upper right half of the plot. On the diagonal, the histograms for each variable are printed. In the lower half, the scatterplots for each combination of variables are printed. Note that chart.Correlation() calculates Pearson correlations by default, so the coefficients can differ from the Spearman coefficients in the correlation matrix. This plot can be useful to evaluate a large number of correlations at the same time, since it presents a large amount of data in one plot.
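
A similar overview can be sketched with the base R function pairs(), without an additional library. The helper panel_rho below is a hypothetical name introduced here; it prints Spearman's rho in the upper panels, while the lower panels show the scatterplots with a smoothed line.

```r
df_numeric <- na.omit(airquality)[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Hypothetical helper: writes Spearman's rho into a panel
panel_rho <- function(x, y, ...) {
  op <- par(usr = c(0, 1, 0, 1))  # use a 0-1 coordinate system in the panel
  on.exit(par(op))                # restore the coordinates afterwards
  rho <- cor(x, y, method = "spearman")
  text(0.5, 0.5, sprintf("%.2f", rho), cex = 1.5)
}

# Scatterplots below the diagonal, coefficients above it
pairs(df_numeric, lower.panel = panel.smooth, upper.panel = panel_rho,
      pch = 16, col = "cadetblue")
```

Unlike chart.Correlation(), this sketch shows neither histograms nor significance stars; it is only meant as a lightweight alternative.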

Further Resources

Correlation Plots

Airquality dataset: RDocumentation

Corrplot library: RDocumentation

PerformanceAnalytics library: RDocumentation


The author of this entry is Jana Simon