2  Data types

chatGPT 4 04/2024 - a 4 headed hydra mythical creature, with heads of a dog, duck, parrot and chicken, cartoon style

R uses different types to describe individual units of data (datum) and stores these in different types of data objects. The following examples are meant to get you up and running with common R data types and data.frames. Take a look in the appendix for more comprehensive examples.

The major types of datum include:

  • logical - TRUE or FALSE
  • numeric - numbers
  • character - text (strings)

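Individual datum can be combined in the following common data structures:

  • vector - one dimension, one type
  • data.frame - a table whose columns can have different types
  • list - elements of different types and lengths
  • matrix - two dimensions, one type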

Note that many advanced data types exist which are optimized for specific use cases, e.g. arrays and tibbles.

2.1 Hands on

Let's create and manipulate some datum.

2.1.1 Making datum

Logical vectors can be used to subset data and to execute logical comparisons (e.g. and, or and not), and they are interpreted numerically as TRUE = 1 and FALSE = 0. Let's create a logical datum and assign (i.e. save) it to the name obj.

obj<-TRUE
str(obj)
 logi TRUE
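The other basic types can be created in the same way. The following is a minimal sketch; the names num and chr are made up for this illustration and are not used elsewhere in the chapter.

```r
num <- 3.14          # a numeric datum
chr <- 'dog'         # a character datum (text goes in quotes)

typeof(num)          # "double"
typeof(chr)          # "character"

# logicals behave like 1 and 0 in arithmetic
two <- TRUE + TRUE   # 2
zero <- FALSE * 10   # 0
```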

2.1.2 Introspection

Using the str function is a great way to inspect the properties of a datum and data type. Other useful introspection functions include:

  • length - number of elements
  • dim - dimensions (e.g. number of rows and columns)
  • class - object-oriented class
  • typeof - type of the datum
  • summary - numeric summary
  • names - object names used for subsetting
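As a quick illustration, here is a sketch of what these functions return for the single logical datum created above (obj is recreated so the example runs on its own).

```r
obj <- TRUE

length(obj)    # 1: a single element
class(obj)     # "logical"
typeof(obj)    # "logical"
dim(obj)       # NULL: a plain vector has no dimensions
names(obj)     # NULL: no names have been set
summary(obj)   # for a logical: its mode and a count of each value
```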

2.1.3 Combining datum

Let's add another datum element to our obj.

obj<-c(obj,0)
obj
[1] 1 0

This action used c to combine obj with another element. Because a vector can only hold one type, R coerced (i.e. changed) the logical TRUE to its numeric translation (see above), so the result is numeric.

str(obj)
 num [1:2] 1 0

We can convert types. For example, let's convert the binary (i.e. two-valued) obj back to a logical vector (i.e. an object with one dimension (its length) and one type).

as.logical(obj)
[1]  TRUE FALSE
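The sketch below summarizes how coercion behaves when types are mixed, and shows a few of the explicit as.* conversions. The variable names are chosen only for this example.

```r
# mixing types in c() coerces everything to the most general type:
# logical -> numeric -> character
mixed_num <- c(TRUE, 0)
mixed_chr <- c(TRUE, 0, 'a')
typeof(mixed_num)                  # "double"
typeof(mixed_chr)                  # "character"

# explicit conversion
as.numeric(TRUE)                   # 1
as.character(1)                    # "1"
back <- as.logical(c(1, 0, 5))     # TRUE FALSE TRUE: any non-zero number is TRUE

# text that is not a number becomes NA (R also warns unless suppressed)
not_a_number <- suppressWarnings(as.numeric('dog'))
```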

2.1.4 Data types

Using c we already created a vector data type. Next, let's create a character vector and explore how logical vectors can be used to subset (i.e. select) specific datum.

pet<-c('dog','parrot','chicken','duck')
bird<-c(FALSE,TRUE,TRUE,TRUE)
pet[bird]
[1] "parrot"  "chicken" "duck"   

Here we used the [] operator to extract elements from the object pet based on the logical vector bird. This returns the elements of pet at the positions where bird is TRUE. Logical vectors are commonly used for comparisons (e.g. logical operations).

For example, we can easily get all pets which are not birds. This takes advantage of conditional subsetting.

pet[!bird]
[1] "dog"

The example above first negates (!) bird, turning every TRUE into FALSE and vice versa, then extracts the elements of pet at the positions where the result is TRUE.

We can combine multiple logical vectors using logical operations. For example, let's define whether each animal is friendly and select all friendly birds.

friendly<-c(TRUE,TRUE,TRUE,FALSE)# define characteristics of each pet
pet[friendly & bird]
[1] "parrot"  "chicken"
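A few related helpers are often used together with logical vectors. The sketch below recreates the vectors above so that it runs on its own.

```r
pet <- c('dog', 'parrot', 'chicken', 'duck')
bird <- c(FALSE, TRUE, TRUE, TRUE)
friendly <- c(TRUE, TRUE, TRUE, FALSE)

pet[friendly | bird]    # friendly OR bird: all four pets
which(bird)             # positions of the TRUE values: 2 3 4
sum(bird)               # number of birds: 3 (TRUE counts as 1)
any(friendly & bird)    # TRUE: there is at least one friendly bird
all(bird)               # FALSE: the dog is not a bird
```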

Exercise: How would you select all unfriendly birds?

2.1.5 Combining datum in data objects

The examples above created vectors for the various combinations of datum. A vector is R's default method to store one-dimensional (i.e. having only a length) combinations of objects of the same type. Next, let's store multiple vectors of different types using a data.frame.

(df<-data.frame(pet=pet,bird=bird,friendly=friendly)) # note we can place the assignment '<-' in parentheses '()' to print the results
      pet  bird friendly
1     dog FALSE     TRUE
2  parrot  TRUE     TRUE
3 chicken  TRUE     TRUE
4    duck  TRUE    FALSE

We can check the dimensions, and column and row names of data.frames as follows.

dim(df) # rows columns 
[1] 4 3
colnames(df)
[1] "pet"      "bird"     "friendly"
rownames(df)
[1] "1" "2" "3" "4"
#note we can get both the row and column names using 'dimnames'
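As a small sketch of the dimnames approach, along with two related helpers (the data.frame is recreated here so the example is self-contained):

```r
df <- data.frame(pet = c('dog', 'parrot', 'chicken', 'duck'),
                 bird = c(FALSE, TRUE, TRUE, TRUE),
                 friendly = c(TRUE, TRUE, TRUE, FALSE))

nrow(df)               # 4
ncol(df)               # 3
dn <- dimnames(df)     # a list: row names first, then column names
dn[[1]]                # "1" "2" "3" "4"
dn[[2]]                # "pet" "bird" "friendly"
str(df)                # shows the type of every column at once
```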

When we created the data.frame we named its columns, which allows us to extract them using the $ notation. For example, we can do the following to get all birds which are not friendly.

df[df$bird & !df$friendly,]
   pet bird friendly
4 duck TRUE    FALSE

Notice that we can index a two-dimensional data.frame using the same [] operator and specify rows or columns using the notation [rows,columns]. The example above returned all rows where bird is TRUE and friendly is FALSE; leaving the columns position empty returns all columns.
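A few more variations of the [rows,columns] notation are sketched below. The data.frame is recreated so the example runs on its own.

```r
df <- data.frame(pet = c('dog', 'parrot', 'chicken', 'duck'),
                 bird = c(FALSE, TRUE, TRUE, TRUE),
                 friendly = c(TRUE, TRUE, TRUE, FALSE))

df[1:2, ]                                 # rows 1 and 2, all columns
df[, c('pet', 'friendly')]                # all rows, two columns selected by name
birds <- df[df$bird, 'pet']               # a single column comes back as a vector
birds                                     # "parrot" "chicken" "duck"
birds_df <- df[df$bird, 'pet', drop = FALSE]  # drop = FALSE keeps the data.frame shape
```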

If we wanted to only know the pet which is a bird and is not friendly we can do as follows.

df[df$bird & !df$friendly,]$pet
[1] "duck"

A common data analysis task is to ask if specific elements are in (%in%) the elements of a data object. For example, we can get all birds whose pet value is not 'duck'.

df[!df$pet %in% 'duck' & df$bird,]
      pet bird friendly
2  parrot TRUE     TRUE
3 chicken TRUE     TRUE
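%in% is most useful when matching against several values at once. In the sketch below, not_wanted and keep are names made up for this illustration, and the parentheses around the %in% test are optional but make the order of evaluation explicit.

```r
df <- data.frame(pet = c('dog', 'parrot', 'chicken', 'duck'),
                 bird = c(FALSE, TRUE, TRUE, TRUE),
                 friendly = c(TRUE, TRUE, TRUE, FALSE))

not_wanted <- c('duck', 'dog')
df$pet %in% not_wanted                        # TRUE FALSE FALSE TRUE

keep <- !(df$pet %in% not_wanted) & df$bird   # birds that are not in not_wanted
df[keep, ]                                    # the parrot and chicken rows
```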

Note that we could have done this in a more programmatic manner by operating on objects created from the individual calculations. This makes your code easier to read and simplifies recalculation when the inputs change.

friendly_bird<-df$bird & df$friendly

df[friendly_bird,]
      pet bird friendly
2  parrot TRUE     TRUE
3 chicken TRUE     TRUE

2.1.6 Missing values

Missing values are denoted in R as NA. Other special values include Inf, -Inf, NaN and NULL, interpreted as positive infinity, negative infinity, not a number and undefined, respectively. Missing values need to be omitted, imputed or explicitly handled in functions; otherwise they can cause errors or be propagated as NA.
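The short sketch below shows how these special values behave in a few common calculations. The vector x is made up for this illustration.

```r
x <- c(1, NA, 3)

sum(x)                  # NA: the missing value is propagated
sum(x, na.rm = TRUE)    # 4: many functions can drop NA on request
is.na(x)                # FALSE TRUE FALSE

NA & FALSE              # FALSE: the answer is FALSE whatever the NA is
NA & TRUE               # NA: the answer depends on the unknown value

1 / 0                   # Inf
0 / 0                   # NaN
is.na(NaN)              # TRUE: is.na also flags NaN
length(NULL)            # 0: NULL has no elements
```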

Next, let's add missing values to our data and retry some of the examples above to see what happens. To create a missing value we will use a logical operation to select a specific row, select a column for that row and then assign NA to it.

original_df<-df # save original data
df[df$pet == 'dog',]$bird<-NA
df
      pet bird friendly
1     dog   NA     TRUE
2  parrot TRUE     TRUE
3 chicken TRUE     TRUE
4    duck TRUE    FALSE

Let's see what happens when we try to select all friendly birds.

friendly_bird<-df$bird & df$friendly
df[friendly_bird,]
       pet bird friendly
NA    <NA>   NA       NA
2   parrot TRUE     TRUE
3  chicken TRUE     TRUE

Notice that the NA is propagated into the results as a row of NA values, which can cause errors later.

We can omit the NA either before (easiest) or after the calculation.

df<-na.omit(df) # remove all rows containing a missing value

#notice we want to recalculate the original logical operators for the new data since it changed shape
friendly_bird<-df$bird & df$friendly
df[friendly_bird,]
      pet bird friendly
2  parrot TRUE     TRUE
3 chicken TRUE     TRUE
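To handle the NA after the calculation instead, one option is to keep only the positions which are definitely TRUE. The following is a sketch of two common ways to do this; the data.frame with the missing value is recreated so the example runs on its own.

```r
df <- data.frame(pet = c('dog', 'parrot', 'chicken', 'duck'),
                 bird = c(NA, TRUE, TRUE, TRUE),
                 friendly = c(TRUE, TRUE, TRUE, FALSE))

friendly_bird <- df$bird & df$friendly     # NA TRUE TRUE FALSE

# which() returns only the positions that are TRUE and ignores NA
by_which <- df[which(friendly_bird), ]

# %in% TRUE turns NA into FALSE
by_in <- df[friendly_bird %in% TRUE, ]
```

Both approaches return the parrot and chicken rows without the extra NA row.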

We can also check if a row or a column has an NA and treat it in a custom manner. For example, we can replace it.

df<-original_df # recreate missing value
df[df$pet == 'dog',]$bird<-NA

#replace
df[is.na(df)]<-FALSE
df
      pet  bird friendly
1     dog FALSE     TRUE
2  parrot  TRUE     TRUE
3 chicken  TRUE     TRUE
4    duck  TRUE    FALSE

We can do as follows to remove any columns where all values are NA.

#create a bad column
df<-original_df
df$bad<-NA
df
      pet  bird friendly bad
1     dog FALSE     TRUE  NA
2  parrot  TRUE     TRUE  NA
3 chicken  TRUE     TRUE  NA
4    duck  TRUE    FALSE  NA
#check for missing
all_missing_columns<-colSums(is.na(df)) == nrow(df)
#remove any columns meeting missing criteria
df[,!all_missing_columns]
      pet  bird friendly
1     dog FALSE     TRUE
2  parrot  TRUE     TRUE
3 chicken  TRUE     TRUE
4    duck  TRUE    FALSE

In the example above we created a new column named ‘bad’, assigned all its values to NA (missing), counted the number of missing values in each column, evaluated if the number of missing is equal to the number of rows and then removed any columns meeting these criteria from the data.frame.
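The same counting idea works across rows. The sketch below uses rowSums and complete.cases to find and drop rows with missing values; the data.frame is recreated with one NA for this illustration.

```r
df <- data.frame(pet = c('dog', 'parrot', 'chicken', 'duck'),
                 bird = c(NA, TRUE, TRUE, TRUE),
                 friendly = c(TRUE, TRUE, TRUE, FALSE))

missing_per_row <- rowSums(is.na(df))   # 1 0 0 0: only the dog row has a missing value
complete <- complete.cases(df)          # FALSE TRUE TRUE TRUE
df[complete, ]                          # keeps the three rows without any NA
```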

Exercise: How would you remove columns with more than a given percentage of missing values?

2.2 Lists

After data.frames, lists are the most commonly used data types in R. A list can store datum of different types and of unequal lengths. We will learn more about lists later. For now, let's compare lists to data.frames.

Let's first create a list and then convert a data.frame to a list.

(df_list<- list(pet=pet,bird = bird,friendly = friendly)) # name = value
$pet
[1] "dog"     "parrot"  "chicken" "duck"   

$bird
[1] FALSE  TRUE  TRUE  TRUE

$friendly
[1]  TRUE  TRUE  TRUE FALSE
df_list2<-as.list(original_df) # we can also convert a data.frame to a list

message('The two lists are identical: ',identical(df_list,df_list2)) # we can check an assertion and print

Similar to data.frames we can extract list elements based on their name.

df_list$pet
[1] "dog"     "parrot"  "chicken" "duck"   

We can also get items based on their numeric index (order in list).

df_list[1] # this returns a list of length one (the element name and its values)
$pet
[1] "dog"     "parrot"  "chicken" "duck"   
df_list[[1]] # this returns only the values
[1] "dog"     "parrot"  "chicken" "duck"   
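The difference between [] and [[]] is easiest to see by checking the class of what is returned. The sketch below also adds an element of a different length to show how lists differ from data.frames; the element name owner and its value are hypothetical and only used for this illustration.

```r
df_list <- list(pet = c('dog', 'parrot', 'chicken', 'duck'),
                bird = c(FALSE, TRUE, TRUE, TRUE),
                friendly = c(TRUE, TRUE, TRUE, FALSE))

class(df_list[1])           # "list": [] returns a smaller list
class(df_list[[1]])         # "character": [[]] returns the element itself
df_list[['pet']][2]         # "parrot": extract by name, then index the vector
df_list[c('pet', 'bird')]   # [] can return several elements at once

# unlike data.frame columns, list elements can have different lengths
df_list$owner <- 'unknown'  # hypothetical element of length one
lengths(df_list)            # the length of each element: 4 4 4 1
```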

We can unpack all the elements in the list and return a vector.

unlist(df_list)  #notice mixed types are converted to strings (i.e. quoted text)
     pet1      pet2      pet3      pet4     bird1     bird2     bird3     bird4 
    "dog"  "parrot" "chicken"    "duck"   "FALSE"    "TRUE"    "TRUE"    "TRUE" 
friendly1 friendly2 friendly3 friendly4 
   "TRUE"    "TRUE"    "TRUE"   "FALSE" 

2.3 Matrices and Arrays

Matrices (two-dimensional tables) and arrays (n-dimensional tables) are often used for specialized mathematical calculations. Matrices can be useful for organizing vectors into different dimensions.

Let's represent a vector as a table with a custom number of rows and columns.

tmp<-unlist(df_list)
(mat<-matrix(tmp,ncol=3)) # note this fills by columns
     [,1]      [,2]    [,3]   
[1,] "dog"     "FALSE" "TRUE" 
[2,] "parrot"  "TRUE"  "TRUE" 
[3,] "chicken" "TRUE"  "TRUE" 
[4,] "duck"    "TRUE"  "FALSE"
# matrix(tmp,ncol=3,byrow = TRUE)  #fill by row

Matrices are useful for many other purposes. Unlike data.frames they do not store mixed types (i.e. when numeric and other types are mixed all values are coerced to strings).
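For comparison, here is a minimal sketch of a numeric matrix and a three-dimensional array. The objects m and a are made up for this illustration.

```r
m <- matrix(1:6, nrow = 2)   # 2 rows x 3 columns, filled by column
dim(m)                       # 2 3
m[2, 3]                      # 6: row 2, column 3
t(m)                         # transpose: 3 rows x 2 columns
prod_m <- m %*% t(m)         # matrix multiplication gives a 2 x 2 matrix

a <- array(1:24, dim = c(2, 3, 4))   # 2 rows x 3 columns x 4 layers
dim(a)                               # 2 3 4
a[1, 2, 3]                           # 15: row 1, column 2, layer 3
```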

We can convert a matrix to a data.frame.

df2<-as.data.frame(mat)
dimnames(df2)<-dimnames(original_df)# set dimension names
df2
      pet  bird friendly
1     dog FALSE     TRUE
2  parrot  TRUE     TRUE
3 chicken  TRUE     TRUE
4    duck  TRUE    FALSE
str(df2) # notice our original types have not been preserved
'data.frame':   4 obs. of  3 variables:
 $ pet     : chr  "dog" "parrot" "chicken" "duck"
 $ bird    : chr  "FALSE" "TRUE" "TRUE" "TRUE"
 $ friendly: chr  "TRUE" "TRUE" "TRUE" "FALSE"
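If the original types are needed, one option is to convert the columns back explicitly. The sketch below works on a copy named df3 (a name made up here) so that df2 stays unchanged for the comparison that follows; the objects are recreated so the example runs on its own.

```r
original_df <- data.frame(pet = c('dog', 'parrot', 'chicken', 'duck'),
                          bird = c(FALSE, TRUE, TRUE, TRUE),
                          friendly = c(TRUE, TRUE, TRUE, FALSE))

# round trip through a character matrix, as in the text above
mat <- matrix(unlist(as.list(original_df)), ncol = 3)
df3 <- as.data.frame(mat)
names(df3) <- names(original_df)

# convert the text "TRUE"/"FALSE" back to logical values
df3$bird <- as.logical(df3$bird)
df3$friendly <- as.logical(df3$friendly)
str(df3)
```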

Lastly, we can compare R objects. This is very useful for debugging why some examples work and others don’t.

identical(original_df,df2) # more advanced methods can show what is different
[1] FALSE
all.equal(original_df,df2)
[1] "Attributes: < Component \"row.names\": Modes: numeric, character >"              
[2] "Attributes: < Component \"row.names\": target is numeric, current is character >"
[3] "Component \"bird\": Modes: logical, character"                                   
[4] "Component \"bird\": target is logical, current is character"                     
[5] "Component \"friendly\": Modes: logical, character"                               
[6] "Component \"friendly\": target is logical, current is character"                 
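The difference between identical and all.equal also matters for numbers. As a sketch, identical requires an exact match, while all.equal allows for tiny floating point differences. Because all.equal returns a description rather than FALSE when objects differ, it is usually wrapped in isTRUE.

```r
a <- 0.1 + 0.2
b <- 0.3

identical(a, b)          # FALSE: the two values differ in the last binary digits
a == b                   # FALSE for the same reason
isTRUE(all.equal(a, b))  # TRUE: equal within a small tolerance
```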

2.4 Appendix