<-TRUE
objstr(obj)
logi TRUE
R
uses different types
to describe individual units of data, datum
, and stores these in different types
of data objects
. The following examples are meant to get you up and running with common R
data types and data.frames
. Take a look in the appendix for more comprehensive examples.
The major types of datum
include.
logical
- TRUE
or FALSE
numeric
- 1.0
integer
- 1
complex
- 1i
character
- ‘1’factor
- categorical made up of keys = labels
and values = levels
pairsraw
- 01
Individual datum
can be combined in the following common data structures
:
vector
- one
-dimensional storage for objects of the same type
matrix
- two
-dimensional storage for objects of the same type
data.frame
- two
-dimensional storage for objects of different type
slist
- n
-dimensional storage for objects of different type
sNote, many advanced data types exist which are optimized for specific use cases e.g. arrays
and tibbles.
Lets create and manipulate some datum
datum
Logical vectors can be used to subset data, execute logical comparisons (e.g. and
, or
and not
) and are interpreted numerically as TRUE = 1
and FALSE = 0
. Lets create a logical
datum
and assign
(i.e. save) it to the name obj
.
<-TRUE
objstr(obj)
logi TRUE
Using the str
function is a great way to inspect the properties of a datum
and data type
. Other useful introspection functions include:
length
- lengthdim
- dimensions (e.g. number of rows and columns)class
- object-oriented classtypeof
- type of the datum
summary
- numeric summarynames
- object names used for subsettingdatum
Lets add some more datum
elements
to our obj
.
<-c(obj,0)
obj obj
[1] 1 0
This action took advantage of the numeric
translation of logicals
(see above) and used c
to add another element. R
also coerced
(i.e. changed) the data types
in the process to numeric
.
str(obj)
num [1:2] 1 0
We can convert types
. For example lets convert the binary
(i.e two values) obj
back to a logical vector
(i.e. object with a one
-dimension (i.e. length
) and one type
).
as.logical(obj)
[1] TRUE FALSE
Data types
Using c
we already created a vector
data type
. Next lets create a character vector and explore how logical vectors can be used to subset
(i.e. select) specific datum
.
<-c('dog','parrot','chicken','duck')
pet<-c(FALSE,TRUE,TRUE,TRUE)
bird pet[bird]
[1] "parrot" "chicken" "duck"
Doing this we have used the []
function to extract elements
from the the object pets
based on the logical
vector
object bird
. This returns all values equal to TRUE
in bird
. Logical vectors are commonly used for comparisons (e.g. logical operations).
For example, we get can easily get all pets
which are not
birds
. This takes advantage of conditional subsetting
.
!bird] pet[
[1] "dog"
The example above first calculates all objects not (!)
equal to bird == TRUE
then extracts all TRUE
values from the object pets
(rows).
We can use logical operations
using multiple logical
vectors. For example, lets define if an animals is friendly
and select all friendly
bird
(s).
<-c(TRUE,TRUE,TRUE,FALSE)# define characteristics of each pet
friendly& bird] pet[friendly
[1] "parrot" "chicken"
Exercise
: How would you select all unfriendly birds?
datum
in data objects
The examples above created vector
s for the various combinations of datum
. A vector
is R
s default method to store one
-dimensional (i.e. only have length) combinations of objects of the same type
. Next lets store multiple vectors of different types using a data.frame
.
<-data.frame(pet=pet,bird=bird,friendly=friendly)) # note we can place the assignment '<-' in parentheses '()' to print the results (df
pet bird friendly
1 dog FALSE TRUE
2 parrot TRUE TRUE
3 chicken TRUE TRUE
4 duck TRUE FALSE
We can check the dimensions, and column and row names of data.frames
as follows.
dim(df) # rows columns
[1] 4 3
colnames(df)
[1] "pet" "bird" "friendly"
rownames(df)
[1] "1" "2" "3" "4"
#note we can get both and column and row names using 'dimnames'
When we created the data frame we name
ed the elements, which allows us to subset them using the $
notation. For example, we can do the following to get all bird
elements which are not
friendly
.
$bird & !df$friendly,] df[df
pet bird friendly
4 duck TRUE FALSE
Notice, we can index a two
-dimensional data.frame
using the same []
operator and specify rows
or columns
using the notation [rows,columns]
. The example above returned all rows which are bird = TRUE
and not
friendly = TRUE
and return all columns
for the results.
If we wanted to only know the pet
which is a bird
and is not friendly
we can do as follows.
$bird & !df$friendly,]$pet df[df
[1] "duck"
A common data analysis task is to ask if specific elements
are %in%
(in) the elements
of a data object
. For example, we can get all bird
s which are not equal to pet ='duck'
.
!df$pet %in% 'duck' & df$bird,] df[
pet bird friendly
2 parrot TRUE TRUE
3 chicken TRUE TRUE
Note, we could have done this is in a more programmatic
manner by operating on objects created from the individual calculations. This makes it easier to read your code and simplifies recalculations give changes in input
s.
<-df$bird & df$friendly
friendly_bird
df[friendly_bird,]
pet bird friendly
2 parrot TRUE TRUE
3 chicken TRUE TRUE
Missing values are denoted in R
as NA
. Note, other special definitions include Inf
,-Inf
, NaN
and NULL
, interpreted as positive and negative infinity
, not a number
and undefined
, respectively. Missing values need to be omitted
, imputed
or handled in functions
else these can cause errors or will be propagated as NA
.
Next lets add missing values to or data and retry some of the examples above to see what happens. To create missing values we will use a logical operation to select a specific row, select a column for that row and then assign to NA
.
<-df # save original data
original_df$pet == 'dog',]$bird<-NA
df[df df
pet bird friendly
1 dog NA TRUE
2 parrot TRUE TRUE
3 chicken TRUE TRUE
4 duck TRUE FALSE
Lets see what happens when we try select all friendly birds.
<-df$bird & df$friendly
friendly_bird df[friendly_bird,]
pet bird friendly
NA <NA> NA NA
2 parrot TRUE TRUE
3 chicken TRUE TRUE
Notice the NA
is propagated in the results which can later cause errors.
We can omit the NA
either before (easiest) or after the calculation.
<-na.omit(df) # remove all columns and/or rows with a missing value
df
#notice we want to recalculate the original logical operators for the new data since it changed shape
<-df$bird & df$friendly
friendly_bird df[friendly_bird,]
pet bird friendly
2 parrot TRUE TRUE
3 chicken TRUE TRUE
We can also check if a row or a column has an NA
and treated in a custom manner. For example we can replace it.
<-original_df # recreate missing value
df$pet == 'dog',]$bird<-NA
df[df
#replace
is.na(df)]<-FALSE
df[ df
pet bird friendly
1 dog FALSE TRUE
2 parrot TRUE TRUE
3 chicken TRUE TRUE
4 duck TRUE FALSE
We can do as follow to remove any columns with all
values == NA
.
#create a bad column
<-original_df
df$bad<-NA
df df
pet bird friendly bad
1 dog FALSE TRUE NA
2 parrot TRUE TRUE NA
3 chicken TRUE TRUE NA
4 duck TRUE FALSE NA
#check for missing
<-colSums(is.na(df)) == nrow(df)
all_missing_columns#remove any columns meeting missing criteria
!all_missing_columns] df[,
pet bird friendly
1 dog FALSE TRUE
2 parrot TRUE TRUE
3 chicken TRUE TRUE
4 duck TRUE FALSE
In the example above we created a new column named ‘bad’, assigned all its values to NA
(missing), counted the number of missing values in each column, evaluated if the number of missing is equal to the number of rows and then removed any columns meeting these criteria from the data.frame
.
Exercise
: How would you remove columns with greater than some%
missing values?
After data.frames
, lists
are the most commonly used data types in R
. A list can be used to store different types
of datum
and of unequal lengths. We will learn more about lists
later. For now lets compare lists to data.frames
.
Lets first create a list and then convert a data.frame
to a list.
<- list(pet=pet,bird = bird,friendly = friendly)) # name = value (df_list
$pet
[1] "dog" "parrot" "chicken" "duck"
$bird
[1] FALSE TRUE TRUE TRUE
$friendly
[1] TRUE TRUE TRUE FALSE
<-as.list(original_df) # we can also convert a data.frame to a list
df_list2
message('The two lists are identical: ',identical(df_list,df_list2)) # we can check an assertion and print
Similar to data.frames
we can extract list elements based on their name.
$pet df_list
[1] "dog" "parrot" "chicken" "duck"
We can also get items based on their numeric index (order in list).
1] # this returns the list element name and values df_list[
$pet
[1] "dog" "parrot" "chicken" "duck"
1]] # this returns only the values df_list[[
[1] "dog" "parrot" "chicken" "duck"
We can unpack all the elements in the list and return a vector
.
unlist(df_list) #notice mixed types are converted to strings (i.e. quoted text)
pet1 pet2 pet3 pet4 bird1 bird2 bird3 bird4
"dog" "parrot" "chicken" "duck" "FALSE" "TRUE" "TRUE" "TRUE"
friendly1 friendly2 friendly3 friendly4
"TRUE" "TRUE" "TRUE" "FALSE"
Matrices (2
dimensional tables) and arrays (n
dimensional table) are often used for specialized mathematical calculations. Matrices can be useful for organizing vectors into different dimensions.
Lets represent a vector as table with custom number of rows and columns.
<-unlist(df_list)
tmp<-matrix(tmp,ncol=3)) # note this fills by columns (mat
[,1] [,2] [,3]
[1,] "dog" "FALSE" "TRUE"
[2,] "parrot" "TRUE" "TRUE"
[3,] "chicken" "TRUE" "TRUE"
[4,] "duck" "TRUE" "FALSE"
# matrix(tmp,ncol=3,byrow = TRUE) #fill by row
Matrices are useful for many other purposes. Unlike data.frame
s they do not store mixed types (i.e. when numeric and other types are mixed all values are coerced
to strings
).
We can convert a matrix to a data.frame
.
<-as.data.frame(mat)
df2dimnames(df2)<-dimnames(original_df)# set dimension names
df2
pet bird friendly
1 dog FALSE TRUE
2 parrot TRUE TRUE
3 chicken TRUE TRUE
4 duck TRUE FALSE
str(df2) # notice our original types may have not been preserved
'data.frame': 4 obs. of 3 variables:
$ pet : chr "dog" "parrot" "chicken" "duck"
$ bird : chr "FALSE" "TRUE" "TRUE" "TRUE"
$ friendly: chr "TRUE" "TRUE" "TRUE" "FALSE"
Lastly, we can compare R
objects
. This is very useful for debugging why some examples work and others don’t.
identical(original_df,df2) # more advanced methods can show what is different
[1] FALSE
all.equal(original_df,df2)
[1] "Attributes: < Component \"row.names\": Modes: numeric, character >"
[2] "Attributes: < Component \"row.names\": target is numeric, current is character >"
[3] "Component \"bird\": Modes: logical, character"
[4] "Component \"bird\": target is logical, current is character"
[5] "Component \"friendly\": Modes: logical, character"
[6] "Component \"friendly\": target is logical, current is character"