Lists
A list is a generic vector. Think of it as a vector that can hold elements of different types.
Let me take you back a bit. Some objects are very small, e.g. 4 is an object. Thus if we define f <- 4,
object f will have only one element in it, which is 4. On the other hand, some objects are big and can contain very many elements, e.g. atomic vectors (which we will simply refer to as vectors) can contain 10000+ elements. But one thing to note is that all the elements that form such an object must be of the same class/type.
What if I have elements that are of different classes, yet they all should be in one object? How is that even possible?
It is the weekend and you decide to go out shopping. You sit down and make a SHOPPING LIST. This shopping list contains things that are different, yet they all form the SHOPPING LIST. So we have an object, the shopping list, which contains various elements of different classes: the prices being numeric, the names of things to buy being characters, etc. How can we implement this in a computer?
Thus something called a LIST was born. A list is exactly what its name suggests: it can contain many things which are different and have different lengths. We define a list using the function list():
mylist <- list(c(1,2,4), "Target/Wallmart", c("sugar", "milk", "chocolate"), c("Shirt", "pants"), c(2,4,4,20,30))
Well, if someone were to look at this list, they might not understand what is happening. Let me give it names:
names(mylist) <- c("quantity", "store", "foodstuff", "clothing","prices")
mylist
$quantity
[1] 1 2 4
$store
[1] "Target/Wallmart"
$foodstuff
[1] "sugar" "milk" "chocolate"
$clothing
[1] "Shirt" "pants"
$prices
[1] 2 4 4 20 30
Element Extraction from list
Look at this list again. What do you see? The quantity and prices are numeric while foodstuff and clothing are character. How can we tell? We can extract the elements and check the class. Note that to extract elements, we use the [[ operator. If we use [, we are just taking part of the list rather than the elements themselves.
Using position to call the first element
mylist[[1]]
[1] 1 2 4
Using names to call the first element
mylist[['quantity']]
[1] 1 2 4
Compare the above with:
mylist[1]
$quantity
[1] 1 2 4
This is just part of the list, which is itself a list.
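To see the difference concretely, we can check the class of what each operator returns (rebuilding a small version of mylist so the snippet stands on its own):

```r
mylist <- list(quantity = c(1, 2, 4), store = "Target/Wallmart",
               foodstuff = c("sugar", "milk", "chocolate"))
class(mylist[[1]])  # "numeric": the element itself
class(mylist[1])    # "list": a one-element sub-list
```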
Another cool operator that we can use for lists is the $ operator. This can be used to access the elements by name, e.g.
mylist$quantity
[1] 1 2 4
mylist$foodstuff
[1] "sugar" "milk" "chocolate"
A list can contain another list:
b <- list(1, 2, list(3, 4, c(5, 6)), 7)
b
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[[3]][[1]]
[1] 3
[[3]][[2]]
[1] 4
[[3]][[3]]
[1] 5 6
[[4]]
[1] 7
Changing/replacing values
b[1] <- 5
b
[[1]]
[1] 5
[[2]]
[1] 2
[[3]]
[[3]][[1]]
[1] 3
[[3]][[2]]
[1] 4
[[3]][[3]]
[1] 5 6
[[4]]
[1] 7
b[1] <- 1:5
Warning in b[1] <- 1:5: number of items to replace is not a multiple of
replacement length
b
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[[3]][[1]]
[1] 3
[[3]][[2]]
[1] 4
[[3]][[3]]
[1] 5 6
[[4]]
[1] 7
b[[1]] <- 1:5
b
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 2
[[3]]
[[3]][[1]]
[1] 3
[[3]][[2]]
[1] 4
[[3]][[3]]
[1] 5 6
[[4]]
[1] 7
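A related trick worth knowing alongside replacement (standard base-R behaviour, though not shown above): assigning NULL to a list element removes it entirely.

```r
b <- list(1, 2, list(3, 4, c(5, 6)), 7)
b[[2]] <- NULL   # assigning NULL deletes the element
length(b)        # the list now has 3 elements; the old third element is now second
```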
Unlisting a list
Once we have a list, we can unlist it to make it a vector. When unlisting, note that the class hierarchy will be followed, i.e. everything will be promoted/coerced to the highest class. For instance, if we have a character in the list, everything that is not a character will become a character after unlisting.
unlist(b)
[1] 1 2 3 4 5 2 3 4 5 6 7
Once unlisted, it is difficult to revert back to the list we had before. There is a way to do it, using the relist function.
Traversing a list
Traversing a list requires iterative methods. There are few, if any, functions that would work on a list as a whole. That is because a list can contain elements of different types which need to be manipulated differently.
Let's assume we have a list that contains height, weight and age.
data1 <- list(Height = c(173, 178, 190, 141),
              Weight = c(204, 189, 213, 153),
              Age = c(21, 26, 18, 18))
data1
$Height
[1] 173 178 190 141
$Weight
[1] 204 189 213 153
$Age
[1] 21 26 18 18
Is it practical to find the mean of everything? Not really. So we would have to find the mean of each variable:
mean(data1$Height)
[1] 170.5
mean(data1$Weight)
[1] 189.75
mean(data1$Age)
[1] 20.75
Doesn’t this feel repetitive? Programming should not be repetitive. If you are repeating a bunch of code then you are missing the whole objective of programming. So how can we go about finding the means?
We can use iterative methods:
Write a for loop that finds the mean of each variable in data1
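One possible solution to the exercise, as a sketch (data1 rebuilt here so the snippet is self-contained):

```r
data1 <- list(Height = c(173, 178, 190, 141),
              Weight = c(204, 189, 213, 153),
              Age    = c(21, 26, 18, 18))
means <- numeric(0)
for (name in names(data1)) {
  # index the list by name and store the mean under the same name
  means[name] <- mean(data1[[name]])
}
means  # Height 170.50, Weight 189.75, Age 20.75
```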
MapReduce
This is a whole concept on its own. But here we will simply learn how to traverse lists without explicitly writing the iterative methods. Under the hood, the functions are doing iterations, but we do not need to know how they implement the process; rather, we are interested in using the functions to solve the problems at hand. There are common higher-order functions used in functional programming languages. These include Reduce, Filter, Find, Map, Negate and Position. We will look at most of them in this section while skipping some. To build up to these functions, we will start with the *apply family of functions, i.e. lapply, sapply, mapply, vapply, rapply, tapply, eapply.
lapply
When you want to apply a function to each element of a list in turn and get a list back
lapply(data1, mean)
$Height
[1] 170.5
$Weight
[1] 189.75
$Age
[1] 20.75
sapply
When you want to apply a function to each element of a list in turn, but you want a simplified vector back rather than a list. Here, R will try to simplify your result into an array. If it cannot, it will return a list.
sapply(data1, mean)
Height Weight Age
170.50 189.75 20.75
Note that in this case, sapply is just lapply with the function simplify2array called on the final result of lapply.
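To convince ourselves, we can call simplify2array on the lapply result and compare it with sapply directly (data1 rebuilt so the snippet runs on its own):

```r
data1 <- list(Height = c(173, 178, 190, 141),
              Weight = c(204, 189, 213, 153),
              Age    = c(21, 26, 18, 18))
res_l <- lapply(data1, mean)       # a named list
res_s <- simplify2array(res_l)     # collapsed to a named numeric vector
identical(res_s, sapply(data1, mean))  # the two routes agree
```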
Assume I want to add 10 to each element in the list data1. What can we do?
data1 + 10
Error in data1 + 10: non-numeric argument to binary operator
We would have to traverse the list using any of the learned methods, or even iterative methods:
lapply(data1, `+`, 10)
$Height
[1] 183 188 200 151
$Weight
[1] 214 199 223 163
$Age
[1] 31 36 28 28
Well, we could also make use of anonymous functions:
lapply(data1, function(x)x + 10)
$Height
[1] 183 188 200 151
$Weight
[1] 214 199 223 163
$Age
[1] 31 36 28 28
lapply(data1, \(x) x + 10)
$Height
[1] 183 188 200 151
$Weight
[1] 214 199 223 163
$Age
[1] 31 36 28 28
Map
This is a multivariate version of lapply. Think of it as lapply, but it traverses more than one vector at a time.
E.g. suppose you want to create a sequence from 2:5, then another from 1:4 and another from 10:13. Note that we can divide this problem such that we have two vectors, one containing the start points, i.e. c(2, 1, 10), and the other the end points, c(5, 4, 13), then we can iterate across the two:
start <- c(2, 1, 10)
end <- c(5, 4, 13)
Map(seq, start, end)
[[1]]
[1] 2 3 4 5
[[2]]
[1] 1 2 3 4
[[3]]
[1] 10 11 12 13
Note that with Map, the first argument is the function, then the lists/vectors we want to iterate over.
Note that if we were to use a for-loop, we could do:
result <- list()
for(i in seq_along(start)){
  result[[i]] <- seq(start[i], end[i])
}
result
[[1]]
[1] 2 3 4 5
[[2]]
[1] 1 2 3 4
[[3]]
[1] 10 11 12 13
mapply
This is the multivariate version of sapply. It will try to simplify the result to an array and, if it cannot, leave the result as a list. Note that it is otherwise identical to Map, save for this attempt to simplify the results. In fact, Map is defined as mapply(..., SIMPLIFY = FALSE).
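A small sketch of this simplification behaviour, reusing the start/end vectors from the Map example (the second call, with unequal output lengths, is my own illustrative addition):

```r
start <- c(2, 1, 10)
end   <- c(5, 4, 13)
m <- mapply(seq, start, end)
m  # every seq result has length 4, so the results simplify to a 4x3 matrix
l <- mapply(seq, c(1, 1), c(2, 5))
l  # lengths 2 and 5 cannot be simplified, so the result stays a list
```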
Assume you had the vector c("cat", "fish", "hamster") and you want to convert it into the following output:
[[1]]
[1] "cat" "cat" "cat"
[[2]]
[1] "fish" "fish"
[[3]]
[1] "hamster"
How would you go about it? Try using a for loop, then using Map. In both cases you will need to use the rep function. Check different solutions.
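One possible set of solutions, as a sketch (the names pets and times are my own):

```r
pets  <- c("cat", "fish", "hamster")
times <- c(3, 2, 1)

# for-loop solution
result <- list()
for (i in seq_along(pets)) {
  result[[i]] <- rep(pets[i], times[i])
}

# Map solution; unname() strips the names Map takes from its first argument
result2 <- unname(Map(rep, pets, times))
identical(result, result2)  # both routes give the same list
```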
Exercises
1. Try this one out
split
This function splits a vector/list into various parts. It takes the vector/list to be split as the first argument and another vector/list, the grouping vector, as the second. The data in the first argument is then grouped according to the second argument, and a list containing the grouped data is returned. Below is an example.
Suppose you have the grades of two students as follows:
grades <- c(90, 87, 78, 95, 89)
students <- c(1, 2, 1, 1, 2)
You can split the grades to a list whereby the first element is the grades of the first student, the second element is the grades of the second student etc.
split(grades, students)
$`1`
[1] 90 78 95
$`2`
[1] 87 89
With this, we can easily obtain the mean/max/min/sd etc. of each student.
sapply(split(grades, students), mean)
1 2
87.66667 88.00000
unsplit
This function reverses what split does. (Rarely used.) You can learn more by running help("unsplit").
tapply
This is the combination of sapply and split. Although not implemented as such, it works the same way as grouping the data and then, for each group, performing an action.
tapply(grades, students, mean)
1 2
87.66667 88.00000
So far we are using simple examples to grasp the concept of what the functions are doing.
Filter
This is a function used to keep items that satisfy a condition while discarding items that do not. The condition needs to be in a function format that returns either TRUE or FALSE; this type of function is called a predicate function. Usage: Filter(predicate, list/vector).
Take note of the capitalized F in Filter, as you will meet other functions with similar names.
Suppose in my data1 above, I need to keep variables with all values greater than 150:
data1
$Height
[1] 173 178 190 141
$Weight
[1] 204 189 213 153
$Age
[1] 21 26 18 18
Note that we only have to keep Weight. We could do:
Filter(\(x)all(x>150), data1)
$Weight
[1] 204 189 213 153
What happened? We have a predicate, which is an anonymous function, and used that to filter through our list. Below is the procedure:
my_predicate <- function(x) all(x > 150)
my_predicate(data1$Height)
[1] FALSE
my_predicate(data1$Weight)
[1] TRUE
my_predicate(data1$Age)
[1] FALSE
We see that only Weight returned TRUE. The rest returned FALSE.
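Two of the related functionals listed earlier, Find and Position, use the same predicate idea: Find returns the first element that satisfies the predicate, Position returns its index. A small sketch, with data1 rebuilt so it runs on its own:

```r
data1 <- list(Height = c(173, 178, 190, 141),
              Weight = c(204, 189, 213, 153),
              Age    = c(21, 26, 18, 18))
Find(\(x) all(x > 150), data1)      # the first matching element: the Weight vector
Position(\(x) all(x > 150), data1)  # its index: 2
```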
Note that we can also use sapply to filter:
data1[sapply(data1, my_predicate)] # data1[sapply(data1, \(z)all(z>150))]
$Weight
[1] 204 189 213 153
Quiz: Why did we use single brackets above instead of double brackets for extraction?
Reduce
Reduce uses a binary function to successively combine the elements of a given vector and a possibly given initial value. Starting from an initial position, Reduce iteratively combines the elements of a vector/list using the given function. E.g. assume we want to add the first 5 numbers, i.e. 1 + 2 + 3 + 4 + 5. We can do so by first adding 2 to 1, then adding 3 to the result of 2 + 1, then adding 4 to the result obtained previously, etc., i.e. (((1 + 2) + 3) + 4) + 5.
Note that the function/operation MUST be a binary operator/function, i.e. a function that takes only two elements a, b and returns an aggregated value/container.
vec <- 1:5
Reduce(`+`, vec)
[1] 15
Reduce(\(a, b)a + b, vec)
[1] 15
This is just sum(vec)
Note that Reduce has other parameters, e.g. accumulate = TRUE ensures that the result at each stage is also returned.
Reduce(`+`, vec, accumulate = TRUE)
[1] 1 3 6 10 15
Reduce(function(x,y)x+y, vec, accumulate = TRUE)
[1] 1 3 6 10 15
This is similar to cumsum(1:5)
We can also pass in an initial value from where we want the function to start at:
Reduce('+', vec, init = 10, accumulate = TRUE)
[1] 10 11 13 16 20 25
similar to cumsum(c(10, vec))
Well, assume we wanted to compute something like log(exp(acos(sin(cos(0.5)))))
We could simply do:
log(exp(acos(sin(cos(0.5)))))
[1] 0.6932138
Reduce(\(x, f)f(x), list(cos,sin, acos, exp, log), init = 0.5)
[1] 0.6932138
Reduce(\(x, f)f(x), list(0.5, cos,sin, acos, exp, log))
[1] 0.6932138
Simply? Is anything above simple? Could you explain what happened on the second and third lines?
What if we wanted to have the list with the functions listed as they appear, i.e. log, exp, acos, sin, cos?
Reduce(\(f, x)f(x), list(log, exp, acos, sin, cos, 0.5), right = TRUE)
[1] 0.6932138
Well, not bad. But what exactly happened? Let's try something else. We know that while some operations are commutative, others are not. For example, in division, \(a/b\neq b/a\), and the grouping matters too. If we wanted to do 1/2/3/4:
1/2/3/4
[1] 0.04166667
(((1/2)/3)/4)
[1] 0.04166667
Reduce("/",1:4)
[1] 0.04166667
But what if we wanted to do 1/(2/(3/4)), i.e. we first do the rightmost division, then move to the left? That is when we pass the logical value TRUE to the parameter right within Reduce:
1/(2/(3/4))
[1] 0.375
Reduce(`/`, 1:4, right = TRUE)
[1] 0.375
What were the intermediate values?
Reduce(`/`, 1:4, right = TRUE, accumulate = TRUE)
[1] 0.375000 2.666667 0.750000 4.000000
The rightmost value is 4. Then 3/4 = 0.75, then 2/0.75 = 2.6667. Lastly, 1/2.6667 = 0.375.
Note that we are using these simple examples to depict what Reduce does. So far, Reduce is the only function we have seen whereby the result of the current function evaluation depends on the previous results. Thus Reduce is practically an iterative function.
With this in mind, we can use Reduce to run a given operation/function n times, where n is the length of the passed-in vector/list.
We therefore can use it to quickly run some iterations: eg
log(log(log(log(1e10))))
[1] 0.1337832
Reduce(\(x,y)log(x), 1:4, init = 1e10)
[1] 0.1337832
Reduce(\(x,y)y(x), rep(c(log), 4), init = 1e10)
[1] 0.1337832
Reduce(\(x,y)log(x), rep(1e10, 5))
[1] 0.1337832
Reduce(\(x,y)log(x), c(1e10, 1:4))
[1] 0.1337832
Reduce(\(x,y)log(y), 1:4, init = 1e10, right = TRUE)
[1] 0.1337832
Reduce(\(x, y)x(y), c(rep(list(log), 4), 1e10), right = TRUE)
[1] 0.1337832
Explain how the above give similar results.
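The function-pipeline trick above can be packaged into a small compose helper (the name compose is my own; this is just a sketch built on Reduce, applying the functions left to right):

```r
compose <- function(...) {
  fns <- list(...)
  # the returned function threads x through each f in turn
  function(x) Reduce(\(acc, f) f(acc), fns, x)
}
compose(cos, sin, acos, exp, log)(0.5)  # same as log(exp(acos(sin(cos(0.5)))))
```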
Although Reduce has forward and backward directions, i.e. forward is right = FALSE and backward is right = TRUE, we will mostly stick to the forward direction.
We can also use Reduce to perform quick iterative methods. Recall the sqrt_heron function that you wrote back in exercise 3, question 8; we can easily implement it using Reduce:
S <- 125348
n <- 15 # number of iterations
Reduce(function(x, i) (x + S/x)/2, 1:n, 1)
[1] 354.0452
Since i is not used within the function, 1:n is not compulsory. As long as we have a vector of length 15, the function will run 15 times. E.g. to calculate \(\sqrt{10}\) using 5 iterations we can do:
Reduce(function(x, i) (x + 10/x)/2, rep(0, 5), 1)
[1] 3.162278
Reduce(function(x, i) (x + 10/x)/2, c("a", "b", "c", "d", "e"), 1)
[1] 3.162278
Note that I even used strings. That is because the lambda function is just looping over the length of the vector and not over the elements of the vector.
We can then write a function that takes in S and number of iterations:
sqrt_heron2 <- function(x, n = 20, init = 1){
  Reduce(\(y, i) (y + x/y)/2, seq_len(n), init)
}
sqrt_heron2(10)
[1] 3.162278
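Passing accumulate = TRUE to the same idea lets us watch the Heron iterates converge, here for the square root of 10 (a sketch; the variable name approx is my own):

```r
# six values: the init, then one value per iteration, converging to sqrt(10)
approx <- Reduce(\(y, i) (y + 10/y)/2, 1:5, 1, accumulate = TRUE)
approx
```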
To learn more about continued fractions, pi estimation etc. you can run help("Reduce"), then copy the examples and paste them in your editor, playing around and checking what the results are.
DataFrames
These are lists whose elements have the same length, i.e. each variable within the list has the same number of elements, and thus can be represented in a 2D array/matrix notation with each row being one observation.
Consider the list data1 above. We notice that we have three variables, i.e. Height, Weight and Age. We could arrange the data so that it appears like a matrix.
dat2 <- data.frame(data1)
dat2
Height Weight Age
1 173 204 21
2 178 189 26
3 190 213 18
4 141 153 18
The above looks just like a matrix. Why do we need a dataframe instead of a matrix? Because a dataframe can hold a different datatype for each variable. Suppose we had another variable that contained names. This column/variable would be of class character and not numeric. The dataframe ensures that while the characters remain characters, the numerics remain numeric.
eg:
dat2$name <- c("Ally", "Billy", "Cicy", "Lily")
dat2[['YOB']] <- c(2001, 2001, 2002, 2005)
dat2
Height Weight Age name YOB
1 173 204 21 Ally 2001
2 178 189 26 Billy 2001
3 190 213 18 Cicy 2002
4 141 153 18 Lily 2005
We can loop through the data using any of the functions above to check the class for each variable:
sapply(dat2, class) #apply(dat2, 2, class)
Height Weight Age name YOB
"numeric" "numeric" "numeric" "character" "numeric"
Note that Height, Weight, Age and Year of Birth are all numeric while name on the other hand is character.
This cannot be the case with a matrix, since a matrix can only contain data of one type/class. If you try to mix numbers and characters, they will all be converted to the character class.
dat3 <- as.matrix(dat2)
dat3
Height Weight Age name YOB
[1,] "173" "204" "21" "Ally" "2001"
[2,] "178" "189" "26" "Billy" "2001"
[3,] "190" "213" "18" "Cicy" "2002"
[4,] "141" "153" "18" "Lily" "2005"
You do not even need to check the class of each column. You can see the quotation marks, meaning everything is a string/character.
We can use the class function to determine the class to which an object in R belongs:
class(dat3)
[1] "matrix" "array"
Dataframe Extraction
You can extract elements from a dataframe exactly the same way as you extract them from a list or even from a matrix.
Extract the first element of the first variable
dat2[1, 1]
[1] 173
dat2[[1]][1]
[1] 173
dat2[1, "Height"]
[1] 173
dat2[["Height"]][1]
[1] 173
dat2$Height[1]
[1] 173
Why can’t you do dat2[1][1] or even dat2['Height'][1]?
Extract a variable:
dat2$Height
[1] 173 178 190 141
dat2[['Height']]
[1] 173 178 190 141
Extract the 2nd observation:
dat2[2, ]
Height Weight Age name YOB
2 178 189 26 Billy 2001
In our case, the data does not have unique row names that identify each row. If it did, we could have been able to use that to extract the specific row we are interested in.
Note that while in matrices the columns do not necessarily need to have names, this is not the case for dataframes: each column MUST be named. The names should be unique, although base R does allow duplicated names. One is advised to have unique names for each column to ensure each column is uniquely identifiable.
Quiz: Why is it not necessary for the columns of a matrix to be named? Because the columns might be representing the same thing with the difference being in position. Eg an image which is made up of pixels, each pixel containing the values from 0-255 can be represented as a matrix, whereby the row and column depict the position that the pixel occupies in the image. Note that everything is a pixel, just the position is different. On the other hand for a dataframe, all the columns represent different items/objects/variables/measurements.
With regards to dataframes, we have various functions in base R that we will probably learn while learning their equivalents from the tidyverse package. These include by, aggregate, sweep etc.
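As a quick taste of one of these (a sketch only; dat2 is rebuilt here so the snippet is self-contained), aggregate groups the rows by one variable and summarises another:

```r
dat2 <- data.frame(Height = c(173, 178, 190, 141),
                   Weight = c(204, 189, 213, 153),
                   Age    = c(21, 26, 18, 18))
# mean Height for each distinct Age; the two 18-year-olds are averaged together
res <- aggregate(Height ~ Age, data = dat2, FUN = mean)
res
```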