5  Exploratory Data Analysis

chatGPT 4 - robot octopus, embodied by an AI, holding data science and machine learning tools, colorful underwater, cyberpunk

Exploratory data analysis (EDA) is helpful for summarizing variables, accessing data quality, exploring multivariate trends and refining data analysis strategies.

5.1 Data Summary

When faced with a new data set, a great first step is to identify the types of variables and summarize them. Lets use this as an opportunity to practice the skills we have learned so far. Later we will use some R libraries to take our data summary skills to a new level.

5.1.1 Overview and summary

library(dplyr)

data<-mtcars
data(data)

#data structure
str(data)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
#variable types
lapply(data,class)  %>%
  unlist() %>%
  setNames(.,names(data))
      mpg       cyl      disp        hp      drat        wt      qsec        vs 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
       am      gear      carb 
"numeric" "numeric" "numeric" 
#basic summary
summary(data)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  

Next lets use the skimr library for a quick data overview.

5.1.2 skimr

library(skimr)

skim(data) 
Data summary
Name data
Number of rows 32
Number of columns 11
_______________________
Column type frequency:
numeric 11
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mpg 0 1 20.09 6.03 10.40 15.43 19.20 22.80 33.90 ▃▇▅▁▂
cyl 0 1 6.19 1.79 4.00 4.00 6.00 8.00 8.00 ▆▁▃▁▇
disp 0 1 230.72 123.94 71.10 120.83 196.30 326.00 472.00 ▇▃▃▃▂
hp 0 1 146.69 68.56 52.00 96.50 123.00 180.00 335.00 ▇▇▆▃▁
drat 0 1 3.60 0.53 2.76 3.08 3.70 3.92 4.93 ▇▃▇▅▁
wt 0 1 3.22 0.98 1.51 2.58 3.33 3.61 5.42 ▃▃▇▁▂
qsec 0 1 17.85 1.79 14.50 16.89 17.71 18.90 22.90 ▃▇▇▂▁
vs 0 1 0.44 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
am 0 1 0.41 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
gear 0 1 3.69 0.74 3.00 3.00 4.00 4.00 5.00 ▇▁▆▁▂
carb 0 1 2.81 1.62 1.00 2.00 2.00 4.00 8.00 ▇▂▅▁▁

Do you recognize the output format? This is the same as we previously created in the data wrangling section. Additional information includes overview of missing values, quantiles and variable histograms. This method is specific for different types.

library(dplyr)
mtcars %>%
  mutate(cyl=as.factor(cyl)) %>%
  skim()
Data summary
Name Piped data
Number of rows 32
Number of columns 11
_______________________
Column type frequency:
factor 1
numeric 10
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
cyl 0 1 FALSE 3 8: 14, 4: 11, 6: 7

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mpg 0 1 20.09 6.03 10.40 15.43 19.20 22.80 33.90 ▃▇▅▁▂
disp 0 1 230.72 123.94 71.10 120.83 196.30 326.00 472.00 ▇▃▃▃▂
hp 0 1 146.69 68.56 52.00 96.50 123.00 180.00 335.00 ▇▇▆▃▁
drat 0 1 3.60 0.53 2.76 3.08 3.70 3.92 4.93 ▇▃▇▅▁
wt 0 1 3.22 0.98 1.51 2.58 3.33 3.61 5.42 ▃▃▇▁▂
qsec 0 1 17.85 1.79 14.50 16.89 17.71 18.90 22.90 ▃▇▇▂▁
vs 0 1 0.44 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
am 0 1 0.41 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
gear 0 1 3.69 0.74 3.00 3.00 4.00 4.00 5.00 ▇▁▆▁▂
carb 0 1 2.81 1.62 1.00 2.00 2.00 4.00 8.00 ▇▂▅▁▁

Notice how we get a separate summary for each variable type.

5.2 summarytools summary

library(summarytools)

.summary <- dfSummary(data)
# view(.summary)  #view html report

5.3 Data Quality

Data quality assessment is an important first step for any analysis. Ideally the experimental design includes replicated quality control samples which can be used for this purpose. For this demo we will assess variability for a grouping variable.

.summary<-mtcars %>%
  mutate(cyl=as.factor(cyl)) %>%
  group_by(cyl) %>%
  skim()

.summary
Data summary
Name Piped data
Number of rows 32
Number of columns 11
_______________________
Column type frequency:
numeric 10
________________________
Group variables cyl

Variable type: numeric

skim_variable cyl n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mpg 4 0 1 26.66 4.51 21.40 22.80 26.00 30.40 33.90 ▇▃▂▃▃
mpg 6 0 1 19.74 1.45 17.80 18.65 19.70 21.00 21.40 ▅▂▂▁▇
mpg 8 0 1 15.10 2.56 10.40 14.40 15.20 16.25 19.20 ▂▁▇▃▂
disp 4 0 1 105.14 26.87 71.10 78.85 108.00 120.65 146.70 ▇▂▂▆▃
disp 6 0 1 183.31 41.56 145.00 160.00 167.60 196.30 258.00 ▇▁▁▂▂
disp 8 0 1 353.10 67.77 275.80 301.75 350.50 390.00 472.00 ▇▅▃▂▅
hp 4 0 1 82.64 20.93 52.00 65.50 91.00 96.00 113.00 ▃▆▁▇▃
hp 6 0 1 122.29 24.26 105.00 110.00 110.00 123.00 175.00 ▇▃▁▁▂
hp 8 0 1 209.21 50.98 150.00 176.25 192.50 241.25 335.00 ▇▂▃▁▁
drat 4 0 1 4.07 0.37 3.69 3.81 4.08 4.16 4.93 ▇▅▃▁▂
drat 6 0 1 3.59 0.48 2.76 3.35 3.90 3.91 3.92 ▂▂▁▂▇
drat 8 0 1 3.23 0.37 2.76 3.07 3.12 3.22 4.22 ▃▇▁▁▁
wt 4 0 1 2.29 0.57 1.51 1.89 2.20 2.62 3.19 ▇▅▇▂▅
wt 6 0 1 3.12 0.36 2.62 2.82 3.21 3.44 3.46 ▅▂▁▂▇
wt 8 0 1 4.00 0.76 3.17 3.53 3.76 4.01 5.42 ▇▇▁▁▃
qsec 4 0 1 19.14 1.68 16.70 18.56 18.90 19.95 22.90 ▃▇▇▁▂
qsec 6 0 1 17.98 1.71 15.50 16.74 18.30 19.17 20.22 ▃▇▃▃▇
qsec 8 0 1 16.77 1.20 14.50 16.10 17.18 17.56 18.00 ▂▂▁▅▇
vs 4 0 1 0.91 0.30 0.00 1.00 1.00 1.00 1.00 ▁▁▁▁▇
vs 6 0 1 0.57 0.53 0.00 0.00 1.00 1.00 1.00 ▆▁▁▁▇
vs 8 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
am 4 0 1 0.73 0.47 0.00 0.50 1.00 1.00 1.00 ▃▁▁▁▇
am 6 0 1 0.43 0.53 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
am 8 0 1 0.14 0.36 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
gear 4 0 1 4.09 0.54 3.00 4.00 4.00 4.00 5.00 ▁▁▇▁▂
gear 6 0 1 3.86 0.69 3.00 3.50 4.00 4.00 5.00 ▃▁▇▁▂
gear 8 0 1 3.29 0.73 3.00 3.00 3.00 3.00 5.00 ▇▁▁▁▁
carb 4 0 1 1.55 0.52 1.00 1.00 2.00 2.00 2.00 ▇▁▁▁▇
carb 6 0 1 3.43 1.81 1.00 2.50 4.00 4.00 6.00 ▃▁▇▁▂
carb 8 0 1 3.50 1.56 2.00 2.25 3.50 4.00 8.00 ▇▇▁▁▁

We used group_by to summarize trends for each level of our grouping variable cyl. Notice how the data is formatted. This is referred to as a melted or long data format. Next, lets calculate the coefficient of variation (CV; std/mean) for each column in our data and level of cyl.

#calculate CV
CV <-.summary %>%
 select(contains(c('mean','sd'))) %>%
  {(.[2]/.[1]) * 100} %>%
  setNames(.,'CV')

.summary['CV']<- CV

Plot the CV vs. the mean for each variable separately for each level of cyl.

library(ggplot2)
library(ggrepel)

theme_set(theme_minimal()) # set theme globaly
options(repr.plot.width = 2, repr.plot.height =3) # globaly set wi

ggplot(.summary, aes(x=numeric.mean,y=CV,color=cyl)) + 
  geom_point(size=2, alpha=.75) +
  geom_text_repel(aes(label=skim_variable),show.legend = FALSE) +
  facet_grid(.~cyl) +
  scale_color_brewer(palette = 'Set1') +
  xlab('mean') +
  ylab('Coefficient of variation')

This plot is useful to identify variables with low precision which may need to be omitted for further analyses. In the case of mtcars we see the variables with high CV compared to their mean should all be categorical. For example we can check that am shows a large differences for levels of cyl.

mtcars %>%
  select(one_of(c('cyl','am'))) %>%
  table()
   am
cyl  0  1
  4  3  8
  6  4  3
  8 12  2

5.4 Multivariate Analysis

Next lets visualize sample (row) trends given all variables (columns) using principal components analysis (PCA). PCA is a powerful technique to identify unknown patterns and/or evaluate assumptions about your data. First, lets calculate the principal components and visualize their variance explained.

#calculate and show eigenvalue summary
pca_obj <- prcomp(data,scale= TRUE)

5.4.1 Visualize optimal principal components (PCs) to retain.

#notice the summary method does not return the results as printed.
#we could modify the method todo so or replicate results
eigenvals<-pca_obj$sdev
eigenvals_cumsum<-cumsum(eigenvals)
var_explained<-sum(eigenvals)
prop_var_exp<-eigenvals/var_explained
prop_cumsum<-eigenvals_cumsum/var_explained

pca_eigen<-data.frame(PC=1:length(eigenvals),var_explained=prop_var_exp,eigen_cumsum=prop_cumsum)

Plot eigenvalues to select total number of PCs

library(ggplot2)
library(reshape2)

df<-melt(pca_eigen,id.vars = 'PC')

ggplot(df, aes(
  x = as.factor(PC),
  y = value,
  fill = variable,
  group = variable
)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_fill_brewer(
    palette = 'Set1',
    labels = c('variance explained', 'cummulative variance\nexplained')
  ) +
  theme_minimal() +
  geom_hline(yintercept = .8, linetype = "dashed") +
  xlab('Principal Components') 

5.5 Create a scatter plot matrix to visualize PC scores

Visualize each pairwise PC comparison and color sample scores by cyl.

library(GGally)
library(plotly)

limit<-5
group<-'cyl'

df <- pca_obj$x[,1:limit] %>%
  as.data.frame() %>%
  mutate(id = rownames(.)) %>%
  left_join(mtcars %>% select(one_of(group)) %>% mutate(id = rownames(.))) %>%
  select(-id) %>%
  mutate(group=as.factor(!!sym(group)))

  
p <- ggpairs(df, ggplot2::aes(colour=group ))

ggplotly(p)

5.5.1 Visualize specific PC sample scores

Plot the principal plane (PC1 vs PC2) and identify scores for different numbers of cyl with ellipses.

library(ggrepel)
scores_df <- pca_obj$x %>%
  as.data.frame() %>%
  mutate(id = rownames(.)) %>%
  left_join(mtcars %>% mutate(id = rownames(.), cyl = as.factor(cyl)))

ggplot(scores_df, aes(x = PC1, y = PC2, color = cyl)) +
  geom_point(size=3) +
  geom_text_repel(aes(label=id), show.legend = FALSE, max.overlaps = nrow(df),force=100) +
  stat_ellipse(aes(fill=cyl), geom = "polygon",alpha=.25, show.legend = FALSE) +
  ggtitle('Scores')

This visualization is helpful for identify similarity within and between different numbers of cyl. For example, we can see that the biggest differences (in variables) are between cyl=4 and cyl=8. If we expect sample scores to be bivariate normal in the scores space, samples outside the ellipses can be helpful for identify moderate outliers among groups of cyl.

5.5.2 Visualize variable loadings

df <- pca_obj$rotation %>%
  as.data.frame() %>%
  mutate(id = rownames(.)) 

ggplot(df, aes(x = PC1, y = PC2)) +
  geom_point() +
  geom_text_repel(aes(label=id), show.legend = FALSE) +
  ggtitle('Loadings')

5.5.3 Create a custom component for axis labels

We will show the percent explained for each PC.

prcomp_axis_label <- function(obj,digits=1) {
  
  n<-obj %>% length()
  .name<-
    paste0(
    rep('PC',n),
    c(1:n)
  )
  
  paste0(.name,'[',round(obj,digits),'%]') %>%
    setNames(.name)

}

prcomp_axis_label(pca_eigen$var_explained*100,0)
       PC1        PC2        PC3        PC4        PC5        PC6        PC7 
"PC1[33%]" "PC2[21%]" "PC3[10%]"  "PC4[7%]"  "PC5[6%]"  "PC6[6%]"  "PC7[5%]" 
       PC8        PC9       PC10       PC11 
 "PC8[4%]"  "PC9[4%]" "PC10[3%]" "PC11[2%]" 
my_labels<-prcomp_axis_label(pca_eigen$var_explained*100,0)
x <-'PC1'
y<-'PC4'

.df<-df %>%
  select(one_of(c(x,y,'id')))

ggplot(.df, aes_string(x = x, y = y)) +
  geom_point() +
  geom_text_repel(aes(label=id), show.legend = FALSE, max.overlaps = nrow(df),force=3) +
  ggtitle('Variable loadings') +
  ylab(my_labels[y]) +
  xlab(my_labels[x])

Loadings can be used to identify which variables are most different between groups of sample scores. For example, since groups of cyl scores are spread in the x-axis (PC1) we can look at the variables with largest loadings on PC1 (largest negative and positive PC1 values (position on the x-axis) to identify the largest differences in variables between groups of cyl. This can be useful to identify that cars with smaller number of cyl have higher mpg and lower disp.

We can investigate this observation by making a custom visualization.

p<-ggplot(mtcars, aes(x = disp, y = mpg,color=as.factor(cyl))) +
  geom_point(size=3)

p

Notice how we can almost perfectly separate groups of cyl based on these two dimensions?

library(tidyr)
#calculate min and max for variables given groups of cyl
.ranges<-mtcars %>% 
  group_by(cyl) %>% 
  summarise(min_mpg = min(mpg), max_mpg = max(mpg),
            min_disp = min(disp), max_disp = max(disp)) %>%
  gather(variable, value, -cyl) %>% 
  separate(variable, into = c("variable", "stat"), sep = "_") %>% 
  spread(stat, value)


#add rectangles to the plot
#note: this is sub optimal as we need to know what is x or y axis in the plot
tmp<- .ranges %>%
  split(., .$cyl)

#need to know colors to set rectangle colors
p<-p + 
  scale_colour_brewer(
  palette = 'Set1',
  aesthetics = "colour"
)
#get color codes
library(RColorBrewer)
.colors<-brewer.pal(length(levels(mtcars$cyl)), 'Set1')

for(x in 1:length(tmp)){
  
  i<-tmp[[x]]
  xmin<- i %>%
    filter(variable == 'min') %>%
    select(disp) %>% 
    .[1,,drop=TRUE]
  xmax<- i %>%
    filter(variable == 'max') %>%
    select(disp) %>% 
    .[1,,drop=TRUE]
  ymin<- i %>%
    filter(variable == 'min') %>%
    select(mpg) %>% 
    .[1,,drop=TRUE]
  ymax<- i %>%
    filter(variable == 'max') %>%
    select(mpg) %>% 
    .[1,,drop=TRUE]
    
p<-p +
  annotate("rect", xmin = xmin , xmax = xmax, ymin = ymin, ymax = ymax,
           alpha = .5,fill = .colors[x])
  
}


p + guides(color=guide_legend(title="number of cylinders"))

5.6 Appendix