Principal Component Analysis USJudgeRatings
Note: This article shows an example of how to conduct a Principal Component Analysis on the USJudgeRatings dataset. For a more in-depth explanation of Principal Component Analysis, please refer to the following article: Principal Component Analysis.
The Dataset
The dataset is included in base R (in the datasets package). It comprises lawyers' ratings of 43 state judges in the US Superior Court on 12 numeric variables.
- CONT: Number of contacts of a lawyer with the judge
- INTG: Judicial integrity, traits such as honesty, fairness, and incorruptibility
- DMNR: Demeanor, the judge’s behavior and attitude in court (e.g. respectful, calm, professional)
- DILG: Diligence, how hard-working and thorough the judge is in handling cases
- CFMG: Case flow managing, ability to organize and manage cases efficiently to avoid delays
- DECI: Prompt decisions, how quickly and decisively the judge makes rulings
- PREP: Preparation for trial, how well the judge prepares for court proceedings
- FAMI: Familiarity with law, the judge’s knowledge and understanding of legal principles
- ORAL: Sound oral rulings, quality and clarity of the judge’s spoken decisions in court
- WRIT: Sound written rulings, quality and clarity of the judge’s written decisions
- PHYS: Physical ability, the judge’s stamina and capacity to handle the workload
- RTEN: Worthy of retention, overall evaluation of whether the judge should remain in office
Examining the Dataset
Before conducting the analysis, we install and load the packages needed for visualisation and get an idea of what the dataset looks like.
install.packages("psych")       # only needed once
library(psych)                  # for corPlot()
install.packages("factoextra")  # only needed once
library(factoextra)             # for PCA visualisation
ratings <- USJudgeRatings
head(ratings)
Output
The first six rows of the dataset are the following:
> head(ratings)
CONT INTG DMNR DILG CFMG DECI PREP FAMI ORAL WRIT PHYS RTEN
AARONSON,L.H. 5.7 7.9 7.7 7.3 7.1 7.4 7.1 7.1 7.1 7.0 8.3 7.8
ALEXANDER,J.M. 6.8 8.9 8.8 8.5 7.8 8.1 8.0 8.0 7.8 7.9 8.5 8.7
ARMENTANO,A.J. 7.2 8.1 7.8 7.8 7.5 7.6 7.5 7.5 7.3 7.4 7.9 7.8
BERDON,R.I. 6.8 8.8 8.5 8.8 8.3 8.5 8.7 8.7 8.4 8.5 8.8 8.7
BRACKEN,J.J. 7.3 6.4 4.3 6.5 6.0 6.2 5.7 5.7 5.1 5.3 5.5 4.8
BURNS,E.B. 6.2 8.8 8.7 8.5 7.9 8.0 8.1 8.0 8.0 8.0 8.6 8.6
Each row contains the measures for one judge. The variables are all numeric, which is necessary for conducting a PCA. If you are unsure, you can also check the structure of the dataset with str(ratings).
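The check mentioned above can also be done programmatically. The following is a minimal sketch using only base R; the helper names all_numeric, dims and n_missing are chosen here for illustration and do not come from the original analysis.

```r
# Work on a copy of the built-in dataset
ratings <- USJudgeRatings

# TRUE only if every column is numeric (a requirement for PCA)
all_numeric <- all(sapply(ratings, is.numeric))

# Number of rows (judges) and columns (rated variables)
dims <- dim(ratings)

# Count missing values; prcomp() stops with an error if NAs are present
n_missing <- sum(is.na(ratings))

print(all_numeric)
print(dims)
print(n_missing)
```

If any column were non-numeric or contained missing values, it would have to be removed or handled before calling prcomp.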
Conducting the Principal Component Analysis
The function used to conduct a Principal Component Analysis in R is prcomp. It is important to set the argument scale. = TRUE (note the trailing dot in the argument name) so that the variables are scaled to unit variance; centering is already done by default (center = TRUE). Together, these settings standardise all variables to a mean of 0 and a standard deviation of 1, ensuring that variables measured on different scales contribute equally to the analysis.
# perform PCA
ratings_PCA <- prcomp(ratings, scale. = TRUE)
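To illustrate what the scaling option does, the sketch below standardises the data by hand with scale() and runs prcomp without any further centering or scaling. Under this setup, both approaches should yield the same standard deviations of the principal components. The object names pca_scaled, pca_manual and ratings_std are illustrative only.

```r
ratings <- USJudgeRatings

# PCA with built-in centering (default) and scaling
pca_scaled <- prcomp(ratings, scale. = TRUE)

# Manual standardisation: each column gets mean 0 and standard deviation 1
ratings_std <- scale(ratings)

# PCA on the already standardised data, with centering and scaling switched off
pca_manual <- prcomp(ratings_std, center = FALSE, scale. = FALSE)

# Compare the standard deviations of the principal components
same_sdev <- isTRUE(all.equal(pca_scaled$sdev, pca_manual$sdev))
print(same_sdev)

# Confirm the standardisation itself
print(round(colMeans(ratings_std), 10))
print(apply(ratings_std, 2, sd))
```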
Inspection of Analysis Results
The next step is the visualisation and inspection of the analysis results. To assess how much of the total variance in the dataset is captured by each principal component (PC), a scree plot is created using fviz_eig.
fviz_eig(ratings_PCA, addlabels = TRUE, barfill = "steelblue") +
  labs(title = "PCA Scree Plot: Variance Explained by Principal Components")
Output
The scree plot reveals that the first two principal components together explain over 90% of the total variance, suggesting that the dataset can be reasonably represented in two dimensions.
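The percentages shown in the scree plot can also be computed directly from the prcomp result, since the variance of each principal component is the square of its standard deviation. This is a minimal base R sketch; var_explained and cum_var are illustrative names.

```r
ratings_PCA <- prcomp(USJudgeRatings, scale. = TRUE)

# Variance of each PC divided by the total variance
var_explained <- ratings_PCA$sdev^2 / sum(ratings_PCA$sdev^2)

# Cumulative share of variance captured by the first k PCs
cum_var <- cumsum(var_explained)

# Percentages for the first three components
print(round(100 * var_explained[1:3], 1))
print(round(100 * cum_var[1:3], 1))
```

The same figures are reported in the "Proportion of Variance" and "Cumulative Proportion" rows of summary(ratings_PCA).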
Contribution of the Variables to each Principal Component
To understand which original variables drive each PC, their individual contributions are plotted.
fviz_contrib(ratings_PCA, choice = "var", axes = 1)
fviz_contrib(ratings_PCA, choice = "var", axes = 2)
Output
As you can see in the plot on the left, for PC1, ORAL, WRIT, RTEN, PREP, DILG, CFMG and DECI contribute at comparable levels. For PC2, CONT is the dominant contributing variable, as you can see in the right plot.
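For readers who prefer numbers over bars, the contributions can be approximated by hand. The sketch below assumes that the contribution of a variable to a PC is its squared loading expressed as a percentage, which is our reading of what fviz_contrib plots rather than something stated in this article. The names contrib and top_pc2 are illustrative.

```r
ratings_PCA <- prcomp(USJudgeRatings, scale. = TRUE)

# Each column of the rotation matrix has unit length, so the squared
# loadings in a column sum to 1; multiplying by 100 gives percentages
contrib <- 100 * ratings_PCA$rotation^2

# Contributions to PC1 and PC2, largest first
print(round(sort(contrib[, "PC1"], decreasing = TRUE), 1))
print(round(sort(contrib[, "PC2"], decreasing = TRUE), 1))

# Variable with the largest contribution to PC2
top_pc2 <- names(which.max(contrib[, "PC2"]))
print(top_pc2)
```

With 12 variables, a perfectly even split would give each variable a contribution of 100 / 12, or about 8.3%, which is a useful benchmark when reading the values.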
Visualisation of PCA
Now we can create the biplot to see how the different variables correlate with each other.
fviz_pca_biplot(
  ratings_PCA,
  repel = TRUE,
  label = "var", # this removes the judge name labels for better readability
  title = "PCA of Judge Rating Data"
)
Output
The points show where each individual judge falls in this new coordinate system; they are unlabelled because we removed the judge names in the code.
The main takeaways from this plot are:
- The arrows show how the original variables correlate with the two PCs and, in turn, with each other.
- All arrows point in a similar direction, except for CONT. This means that the professional qualities rated by lawyers are correlated with each other. As CONT stands for the number of contacts between lawyer and judge, it is an indicator of the lawyer's familiarity with the judge’s work.
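The direction and length of the arrows can be checked numerically by correlating each original variable with the scores on the first two PCs. This is a minimal base R sketch; var_pc_cor is an illustrative name. Note that the sign of a principal component is arbitrary, so only the relative signs between variables are meaningful.

```r
ratings <- USJudgeRatings
ratings_PCA <- prcomp(ratings, scale. = TRUE)

# Correlation of every variable with the scores on PC1 and PC2
var_pc_cor <- cor(ratings, ratings_PCA$x[, 1:2])
print(round(var_pc_cor, 2))

# For a PCA on standardised data, the same values result from
# multiplying each loading by the standard deviation of its PC
scaled_loadings <- sweep(ratings_PCA$rotation[, 1:2], 2,
                         ratings_PCA$sdev[1:2], "*")
print(isTRUE(all.equal(var_pc_cor, scaled_loadings,
                       check.attributes = FALSE)))
```

Variables whose rows have similar values in both columns correspond to arrows pointing in a similar direction in the biplot.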
Correlation Matrix
To examine the different correlation coefficients more closely, a correlation matrix can be created.
corPlot(ratings)
Output
Overall, the correlation coefficients between CONT and the other variables are very low. CONT is slightly negatively correlated with DMNR (the judge’s behaviour and attitude in court) and INTG (judicial integrity), the two variables with the lowest arrows in the biplot. A negative coefficient means that judges with more lawyer contacts tend to receive slightly lower ratings on these two variables. CONT is slightly positively correlated with CFMG (case flow managing) and DECI (prompt decisions): more frequent encounters may lead lawyers to perceive judges as more efficient. Because the coefficients are low and there are only 43 observations, these interpretations and explanations have to be treated with caution.
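To read the relevant coefficients without the plot, and to get a rough sense of the uncertainty that comes with 43 observations, one could extract the CONT row of the correlation matrix and run a correlation test. The sketch below uses base R only; choosing DMNR for the test is an illustrative choice, and cont_cor and test_dmnr are illustrative names.

```r
ratings <- USJudgeRatings

# Pearson correlations of CONT with the other 11 variables
cont_cor <- cor(ratings)["CONT", -1]
print(round(sort(cont_cor), 2))

# Correlation test for one pair, including a 95% confidence interval
test_dmnr <- cor.test(ratings$CONT, ratings$DMNR)
print(test_dmnr$estimate)
print(test_dmnr$conf.int)
```

A confidence interval that includes 0 would indicate that the data do not allow a firm statement about the direction of the relationship.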
The author of this entry is Antonia Ucher