The best way to learn how to manipulate data in R is to find a simple data set and practice different transformations. The datasets package has many examples to experiment with.
While we will work with demo data for the rest of the examples, in practice you will likely want to import your own custom data sets.
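For example, a common way to import your own data is from a CSV file; a minimal sketch (the file name my_data.csv is just a placeholder):

# hypothetical example: read a comma-separated file from your working directory
my_data <- read.csv('my_data.csv', header = TRUE)
str(my_data) # inspect the imported object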
Take a look at some of the available data sets in the datasets package.
if(!require(datasets)) install.packages('datasets') # install the package the first time
library(datasets) # load the package in the session

.data <- data()
# str(.data) # take a look at the resulting object
head(.data$results[, c('Item', 'Title')]) # extract specific elements
     Item                      Title
[1,] "billboard"               "Song rankings for Billboard top 100 in the year 2000"
[2,] "cms_patient_care"        "Data from the Centers for Medicare & Medicaid Services"
[3,] "cms_patient_experience"  "Data from the Centers for Medicare & Medicaid Services"
[4,] "construction"            "Completed construction in the US in 2018"
[5,] "fish_encounters"         "Fish encounters"
[6,] "household"               "Household data"
Let's find some data about cars.
# let's look for a keyword (i.e. substring) in the Title of the data sets
keyword <- 'car'
# try ?grepl to see how else it can be used
cars_id <- grepl(keyword, as.data.frame(.data$results)$Title, ignore.case = TRUE)
.data$results[cars_id, ]
     Package    LibPath                                          Item                      Title
[1,] "tidyr"    "C:/Users/dgrap/AppData/Local/R/win-library/4.3" "cms_patient_care"        "Data from the Centers for Medicare & Medicaid Services"
[2,] "tidyr"    "C:/Users/dgrap/AppData/Local/R/win-library/4.3" "cms_patient_experience"  "Data from the Centers for Medicare & Medicaid Services"
[3,] "datasets" "C:/Program Files/R/R-4.3.3/library"             "CO2"                     "Carbon Dioxide Uptake in Grass Plants"
[4,] "datasets" "C:/Program Files/R/R-4.3.3/library"             "cars"                    "Speed and Stopping Distances of Cars"
[5,] "datasets" "C:/Program Files/R/R-4.3.3/library"             "mtcars"                  "Motor Trend Car Road Tests"
See here for more examples of string processing in R.
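For a quick taste, here are a few other common base R string helpers (toy examples using one of the titles above):

# a few common base R string operations
x <- 'Motor Trend Car Road Tests'
toupper(x)             # convert to upper case
gsub('Car', 'Auto', x) # replace a substring
strsplit(x, ' ')       # split into a list of words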
Let's load and review the mtcars data set.
data("mtcars")# View(mtcars) #table view for small data - uncomment this line to see an interactive table#note we can also look at this data in the environment tab of Rstudio
Summarize the data.
summary(mtcars) # see more advanced data summaries in the `Exploratory Data Analysis` section
mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
Next, let's introduce a more readable way to chain R functions: the pipe operator %>%.
# let's format the mean miles per gallon to two digits
round(mean(mtcars$mpg), 2)
[1] 20.09
# we can rewrite such nested calls as a pipe, where x %>% f(.) is equivalent to f(x).
# The '.' can be implicit or used to denote the 'x' on the left-hand side.
# here we pipe mpg into sd() and round the result to two digits
mtcars$mpg %>% sd(.) %>% round(., 2)
[1] 6.03
Note that %>% can be imported in different ways and ultimately comes from the magrittr package. More recent versions of R (4.1+) also provide a native pipe operator, |>, in base R.
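For example, assuming R 4.1 or later, the same calculation can be written with the native pipe:

# base R (>= 4.1) native pipe; no extra packages required
mtcars$mpg |> sd() |> round(2)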
We can use the R package dplyr to create a custom summary. Let's calculate the mean and standard deviation of each column.
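A minimal sketch of how such a summary (stored as my_summary, which we reuse below) can be built; note that summarise_each() is superseded by across() in current dplyr:

library(dplyr)
# calculate the mean and standard deviation of every column;
# the result is one row with columns named <variable>_mean and <variable>_stdev
(my_summary <- mtcars %>%
  summarise_each(funs(mean = mean, stdev = sd)))
# modern equivalent: mtcars %>% summarise(across(everything(), list(mean = mean, stdev = sd)))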
We used the summarise_each function to apply functions (e.g. mean, sd) to each column and return the results as a data.frame. We will learn more about functions later.
A better format could be to output the results as one row per original column in the data. We can check the dplyr cheat sheet to see what other data wrangling operations this package enables. We will use the dplyr column selection helpers to select and manipulate elements of the data.frame based on their column names.
means <- my_summary %>% select(ends_with('_mean')) %>% t()
stdevs <- my_summary %>% select(!ends_with('_mean')) %>% t()

(my_summary2 <- data.frame(
  variable = colnames(mtcars), # create new variable
  mean = means,
  stdev = stdevs
))
In addition to dplyr, the tidyr package also offers many useful data manipulation functions.
Let's round the results and then create a summary formatted as mean +/- stdev. To do this we will create our first function. A function simply executes a set of operations (calls) on a set of inputs (arguments).
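For example, a minimal (hypothetical) function add_one takes one argument and returns the value of its last evaluated expression:

add_one <- function(x) {
  x + 1 # the value of the last expression is returned
}
add_one(5) # returns 6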
# let's start with the logic and then convert it to a function
# inputs
digits <- 1
x <- my_summary2[, 2, drop = FALSE] # data to test with -- second column

# step 1 - round
x %>% round(., digits)
# create a function to do both at the same time on arbitrary inputs
# note we are using roxygen syntax (Ctrl+Shift+Alt+R) to also document our function, which is relevant when making R packages

#' summary_function
#'
#' @param x data.frame
#' @param digits int, number of digits to round to
#' @param name str, column name of results
#' @param sep str, what to separate the combined columns with
#'
#' @return data.frame where each column is rounded to `digits` and combined into a string collapsed on `sep`.
#' @export
#' @details Round each column to `digits` and combine all columns into a string collapsed on `sep`.
#' @examples
summary_function <- function(x, digits, name = 'mean_sd', sep = ' +/- ') {
  x %>%
    mutate(across(everything(), ~ round(.x, digits))) %>% # round every column to `digits`
    unite(., col = !!name, sep = sep) # use !! or {{}} for string arguments passed to dplyr verbs; read more: https://dplyr.tidyverse.org/articles/programming.html
}

# call the function
(tmp <- my_summary2 %>%
  select(one_of(c('mean', 'stdev'))) %>%
  summary_function(., digits = 2))
Note, it can also be useful to call functions by their string names using do.call('function_name', list(<arguments>)).
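For example, the two calls below are equivalent:

# calling mean() directly vs. by its string name
mean(mtcars$mpg)
do.call('mean', list(mtcars$mpg))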
3.2 Loops
When we executed a function over each column, the calculation was looped n times, where n equals the number of columns. Modern libraries like dplyr, tidyr, and purrr do this internally. Next, let's explore how to create loops ourselves. The simplest way to loop is with the for function. Note that R is a vectorized language and explicit looping is often discouraged because it is much slower; even so, loops are very useful for prototyping complex logic and for writing code that is simpler to read, understand, and debug.
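As a minimal illustration, here is a for loop that prints the square of each value in a short vector:

# iterate over a vector and print each squared value
for (i in 1:3) {
  print(i^2)
}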
Let's use apply to mimic summarise.
means <- apply(mtcars, 2, mean) # the margin: 1 == across rows, 2 == across columns
stdevs <- apply(mtcars, 2, 'sd') # functions or their names are supported
data.frame(variable = colnames(mtcars), mean = means, stdev = stdevs)
variable mean stdev
mpg mpg 20.090625 6.0269481
cyl cyl 6.187500 1.7859216
disp disp 230.721875 123.9386938
hp hp 146.687500 68.5628685
drat drat 3.596563 0.5346787
wt wt 3.217250 0.9784574
qsec qsec 17.848750 1.7869432
vs vs 0.437500 0.5040161
am am 0.406250 0.4989909
gear gear 3.687500 0.7378041
carb carb 2.812500 1.6152000
Next, let's repeat our column summary calculation using a for loop.
results <- list() # initialize an empty list to store results in.
# Note it is more efficient to pre-allocate a list of the same length as the number of elements you want to store.
for (i in 1:ncol(mtcars)) {
  # print(i) # see the iterated variable's value
  results$mean[i] <- mtcars[, i] %>% mean() # store the result in position [i] of the list element named 'mean'
  results$sd[i] <- mtcars[, i] %>% sd()
}
data.frame(variable = colnames(mtcars), results)
lapply is a more convenient and versatile alternative to a for loop.
lapply(mtcars, function(x){ c(mean = mean(x), sd = sd(x)) }) %>%
  do.call('rbind', .) %>% # combine list elements row-wise; use 'cbind' to combine column-wise
  data.frame(variable = colnames(mtcars), .)
variable mean sd
mpg mpg 20.090625 6.0269481
cyl cyl 6.187500 1.7859216
disp disp 230.721875 123.9386938
hp hp 146.687500 68.5628685
drat drat 3.596563 0.5346787
wt wt 3.217250 0.9784574
qsec qsec 17.848750 1.7869432
vs vs 0.437500 0.5040161
am am 0.406250 0.4989909
gear gear 3.687500 0.7378041
carb carb 2.812500 1.6152000
Next, we will build on our function to create summaries for groups of rows. Let's summarize miles per gallon (mpg) for cars with different numbers of cylinders (cyl). First, let's create functions for the individual steps.
# we need to regenerate our original analysis
# we can take this opportunity to functionalize all the steps
# 1) calculate the mean and standard deviation of each column
# 2) pivot the data
# 3) create a custom summary

# 1 - execute function(s) on each column
column_summary <- function(data, functions = c(mean = mean, stdev = sd)){
  data %>% summarise_each(., funs(!!!(functions))) # use !!! for functions or unquoted arguments
}

# test #1
(x <- mtcars %>% column_summary())
# 2 - format results
# we can explicitly pass the column names we want to separate, or infer them based on a suffix

# 2 A. infer the common suffix
get_unique_suffix <- function(data, sep = '_'){
  colnames(data) %>% # get column names
    strsplit(., sep) %>% # split strings on the separator
    do.call('rbind', .) %>% # combine list elements row-wise
    .[, 2] %>% # get the second column; better to reference by name
    unique() # get unique values
}

# test 2 A
get_unique_suffix(x)
[1] "mean" "stdev"
# 2 B. transpose elements
transpose_on_suffix <- function(data, sep = '_'){
  suffixes <- data %>% get_unique_suffix(., sep)

  # loop over the suffixes and transpose
  lapply(suffixes, function(x){
    data %>%
      select(ends_with(x)) %>%
      t() # transpose operation, i.e. rotate rows to columns
  }) %>%
    do.call('cbind', .) %>% # bind list elements column-wise
    data.frame() %>% # make sure it is a data.frame
    setNames(., suffixes) # set the column names
}

# test 2 B
transpose_on_suffix(x)
# we could A) create a custom loop or B) modify our original one to handle a grouping variable

# A)
# we will split the data into list elements for each group and execute our simple workflow
data <- mtcars # make this more generic
tmp <- data %>% mutate(groups = as.factor(cyl)) # note, we need to save to an intermediate object for split to play nicely with dplyr

tmp %>%
  split(., .$groups) %>%
  lapply(., function(x){
    x %>%
      select(-groups) %>% # remove the factor, which would cause an issue -- native dplyr handles this for us
      column_summary(.) %>%
      transpose_on_suffix(.) %>%
      mutate(groups = x$groups %>% unique(),
             variable = colnames(data))
    # note, we lost the variable names during the calculation. Some options to fix this are
    # A) save and carry forward variables in the original calculation (best -- complicated) or
    # B) set variables as our data column names (simple, but hard for others to understand and verify as correct)
  }) %>%
  do.call('rbind', .)
# B)
# to do this with dplyr we would need to modify transpose_on_suffix:
# we need to account for column_summary yielding results for each level of our grouping variable.
# This exercise is not for the faint of heart. For now let's go with plan A, the path of least resistance.
# Bonus: try to use an AI code helper to see how it would solve this task using dplyr and tidyr.
# data %>%
#   mutate(groups = as.factor(cyl)) %>%
#   group_by(groups) %>%
#   column_summary() %>%
#   transpose_on_suffix(.)
# our original function would need to keep track of the grouping variable levels. An easy solution is not obvious.
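For reference, here is one possible dplyr/tidyr-only sketch of the grouped summary, using across() and pivot_longer() (reshaping is introduced in the next section); treat it as a starting point rather than the definitive solution:

library(dplyr)
library(tidyr)

# group by the number of cylinders, summarize every remaining column,
# then reshape so each variable gets its own row with mean and stdev columns
mtcars %>%
  group_by(cyl) %>%
  summarise(across(everything(), list(mean = mean, stdev = sd))) %>%
  pivot_longer(-cyl, names_to = c('variable', '.value'), names_sep = '_')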
3.3 Reshaping
Reshaping is the act of changing the data structure between long and wide formats. We might call the conversion from wide to long melting, and the conversion from long to wide pivoting or casting. These operations are particularly useful for data visualization. Let's briefly take a look at some examples below.
In a wide format every row represents a unique sample and the columns are the sample's descriptors or measurements.
We can transform this into a long format wherein we have multiple rows per sample.
if(!require('reshape2')) {
  install.packages('reshape2')
  library('reshape2', character.only = TRUE)
}

# add our sample index to the data ... note alternative functions can do this automatically
df <- mtcars
df$name <- row.names(df)

melted_df <- melt(df, id.vars = 'name')
head(melted_df)
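Going the other way, we can cast (pivot) the long data back into a wide format; a quick sketch using reshape2::dcast():

# cast the long data back to a wide format: one row per sample, one column per variable
wide_df <- dcast(melted_df, name ~ variable, value.var = 'value')
head(wide_df)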
The dplyr package provides many methods for combining data frames based on common indices.
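As a minimal sketch of a join, consider two small tables (constructed here just for illustration) that share a name key column:

library(dplyr)
# two toy tables built from mtcars that share the 'name' key
cars_cyl <- data.frame(name = rownames(mtcars), cyl = mtcars$cyl)
cars_mpg <- data.frame(name = rownames(mtcars), mpg = mtcars$mpg)
left_join(cars_cyl, cars_mpg, by = 'name') %>% head() # keep all rows of the left table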
3.5 Error handling
Sometimes we may observe unexpected errors when executing functions over parts of the data (e.g. the sample size is too low). We could handle this by checking for and removing possible problems beforehand, or (more simply) by handling errors during the calculation.
Let's take a moment to learn about error handling.
# the general form for error handling using base R
# tryCatch({
#   expression # function call
# }, warning = function(w){
#   # code that handles warnings
# }, error = function(e){
#   # code that handles errors
# }, finally = {
#   # clean-up code, always run on exit
# })

f <- function(a){
  a + 1
}

data <- c(1:10)
f(data)
[1] 2 3 4 5 6 7 8 9 10 11
# data <- c('1','a') # uncomment this to see an error message
# f(data)
tryCatch(f(data), error = function(e){ print(as.character(e)) }) # in this toy example any error is caught and we print the error message as a string
[1] 2 3 4 5 6 7 8 9 10 11
Note, an alternative is to use the purrr::safely function, which returns a more standardized list consisting of the result and the error.
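A quick sketch of purrr::safely(), wrapping log() so that errors are captured instead of thrown:

library(purrr)

safe_log <- safely(log) # returns a wrapped version of log()
safe_log(10)  # $result holds log(10), $error is NULL
safe_log('a') # $result is NULL, $error holds the caught error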
3.6 Debugging
Debugging is the act of investigating the logic of code and/or verifying its correctness. The browser and debug functions can be used to interactively run code and view its state.
# browser can be used as a breakpoint to pause code execution and inspect its state
f <- function(x){
  x <- x + rnorm(1)
  browser() # use c = continue, n = next line and Q = quit the debugger
  x
}
# f(2) # uncomment to run

# debug sets a breakpoint any time a given function is run
f <- function(x){
  x <- x + rnorm(1)
  x
}
# debug(f) # uncomment to run
f(2)
[1] 3.278332
# when you are done
undebug(f)
3.7 Reproducing randomness
Many R functions have random or stochastic components. The set.seed function can be used to reproduce the results of functions with stochastic components.
# set.seed sets the global random seed
set.seed(1)
c(rnorm(1), rnorm(1))
[1] -0.6264538 0.1836433
set.seed(2)
c(rnorm(1), rnorm(1))
[1] -0.8969145 0.1848492
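Setting the seed once also makes an entire sequence of random draws reproducible, for example inside a loop; a quick sketch:

# the seed is set once, so every draw in the loop is reproducible end to end
set.seed(1)
sapply(1:3, function(i) rnorm(1))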
Data wrangling is an inherent part of every data science workflow. We will build upon the data wrangling skills you have learned so far in the next sections.