Lists
A list is a generic vector. Think of it as a vector that can hold elements of different types.
Let me take you back a bit. Some objects are very small, e.g. 4 is an object. Thus if we define f <- 4,
object f will have only one element in it, which is 4. On the other hand, some objects are big and can contain very many elements, e.g. atomic vectors (which we will simply refer to as vectors) can contain 10000+ elements. But one thing to note is that all the elements that form such an object must be of the same class/type.
What if I have elements that are of different classes, yet they all should be in one object? How is that even possible?
It is the weekend and you decide to go out shopping. You sit down and make a SHOPPING LIST. This shopping list contains things that are different, yet they all form the SHOPPING LIST. So we have an object, the shopping list, which contains various elements of different classes: the prices being numeric, the names of things to buy being characters, etc. How can we implement this in a computer?
Thus something called a LIST was born. A list is exactly what its name suggests: it can contain many things which are different and have different lengths. We define a list using the function list():
mylist <- list(c(1,2,4), "Target/Wallmart", c("sugar", "milk", "chocolate"), c("Shirt", "pants"), c(2,4,4,20,30))
Well, if someone were to look at this list, they might not understand what is happening. Let me give it names:
names(mylist) <- c("quantity", "store", "foodstuff", "clothing","prices")
mylist
$quantity
[1] 1 2 4
$store
[1] "Target/Wallmart"
$foodstuff
[1] "sugar" "milk" "chocolate"
$clothing
[1] "Shirt" "pants"
$prices
[1] 2 4 4 20 30
Element Extraction from list
Look at this list again. What do you see? The quantity and prices are numeric while foodstuff and clothing are character. How can we tell? We can extract the elements and check the class. Note that to extract elements, we use the [[ operator. If we use [, we are just taking part of the list rather than the elements themselves.
Using position to call the first element
mylist[[1]]
[1] 1 2 4
Using names to call the first element
mylist[['quantity']]
[1] 1 2 4
Compare the above with:
mylist[1]
$quantity
[1] 1 2 4
This is just part of the list, which is itself a list.
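To see the difference concretely, we can check the class of what each operator returns (rebuilding a small version of mylist so the snippet stands on its own):

```r
mylist <- list(quantity = c(1, 2, 4), store = "Target/Wallmart",
               foodstuff = c("sugar", "milk", "chocolate"))
class(mylist[[1]])  # "numeric": the element itself
class(mylist[1])    # "list": a one-element sub-list
```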
Another cool operator that we can use for lists is the $ operator. This can be used to access the elements by name, e.g.
mylist$quantity
[1] 1 2 4
mylist$foodstuff
[1] "sugar" "milk" "chocolate"
A list can contain another list:
b <- list(1, 2, list(3, 4, c(5, 6)), 7)
b
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[[3]][[1]]
[1] 3
[[3]][[2]]
[1] 4
[[3]][[3]]
[1] 5 6
[[4]]
[1] 7
Changing/replacing values
b[1] <- 5
b
[[1]]
[1] 5
[[2]]
[1] 2
[[3]]
[[3]][[1]]
[1] 3
[[3]][[2]]
[1] 4
[[3]][[3]]
[1] 5 6
[[4]]
[1] 7
b[1] <- 1:5
Warning in b[1] <- 1:5: number of items to replace is not a multiple of
replacement length
b
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[[3]][[1]]
[1] 3
[[3]][[2]]
[1] 4
[[3]][[3]]
[1] 5 6
[[4]]
[1] 7
b[[1]] <- 1:5
b
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 2
[[3]]
[[3]][[1]]
[1] 3
[[3]][[2]]
[1] 4
[[3]][[3]]
[1] 5 6
[[4]]
[1] 7
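A related trick worth knowing alongside replacement (standard base-R behaviour, though not shown above): assigning NULL to a list element removes it entirely.

```r
b <- list(1, 2, list(3, 4, c(5, 6)), 7)
b[[2]] <- NULL   # assigning NULL deletes the element
length(b)        # the list now has 3 elements; the old third element is now second
```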
Unlisting a list
Once we have a list, we can unlist it to make it a vector. When unlisting, note that the class hierarchy will be followed, i.e. everything will be promoted/coerced to the highest class. For instance, if we have a character in the list, everything that is not a character will become a character after unlisting.
unlist(b)
[1] 1 2 3 4 5 2 3 4 5 6 7
Once unlisted, it is difficult to revert back to the list we had before. There is a way to do it, using the relist function.
Traversing a list
Traversing a list requires iterative methods. There are few, if any, functions that would work on a list as a whole. That is because a list can contain elements of different types which need to be manipulated differently.
Let's assume we have a list that contains height, weight and age.
data1 <- list(Height = c(173, 178, 190, 141),
              Weight = c(204, 189, 213, 153),
              Age = c(21, 26, 18, 18))
data1
$Height
[1] 173 178 190 141
$Weight
[1] 204 189 213 153
$Age
[1] 21 26 18 18
Is it practical to find the mean of everything? Not really. So we would have to find the mean of each variable:
mean(data1$Height)
[1] 170.5
mean(data1$Weight)
[1] 189.75
mean(data1$Age)
[1] 20.75
Doesn’t this feel repetitive? Programming should not be repetitive. If you are repeating a bunch of code then you are missing the whole objective of programming. So how can we go about finding the means?
We can use iterative methods:
Write a for loop that finds the mean of each variable in data1
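One possible solution to the exercise, as a sketch (data1 rebuilt here so the snippet is self-contained):

```r
data1 <- list(Height = c(173, 178, 190, 141),
              Weight = c(204, 189, 213, 153),
              Age    = c(21, 26, 18, 18))
means <- numeric(0)
for (name in names(data1)) {
  # index the list by name and store the mean under the same name
  means[name] <- mean(data1[[name]])
}
means  # Height 170.50, Weight 189.75, Age 20.75
```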
MapReduce
This is a whole concept on its own. But here we will simply learn how to traverse lists without explicitly writing the iterative methods. Under the hood, the functions are doing iterations, but we do not need to know how they implement the process; rather, we are interested in using the functions to solve the problems at hand. There are common higher-order functions used in functional programming languages. These include Reduce, Filter, Find, Map, Negate and Position. We will look at most of them in this section while skipping some. To build up to these functions, we will start with the *apply family of functions, i.e. lapply, sapply, mapply, vapply, rapply, tapply, eapply.
lapply
When you want to apply a function to each element of a list in turn and get a list back
lapply(data1, mean)
$Height
[1] 170.5
$Weight
[1] 189.75
$Age
[1] 20.75
sapply
When you want to apply a function to each element of a list in turn, but you want a simplified vector back rather than a list. Here, R will try to simplify your result into an array. If it cannot, it will return a list.
sapply(data1, mean)
Height Weight Age
170.50 189.75 20.75
Note that in this case, sapply is just lapply with the function simplify2array called on the final result of lapply.
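To convince ourselves, we can call simplify2array on the lapply result and compare it with sapply directly (data1 rebuilt so the snippet runs on its own):

```r
data1 <- list(Height = c(173, 178, 190, 141),
              Weight = c(204, 189, 213, 153),
              Age    = c(21, 26, 18, 18))
res_l <- lapply(data1, mean)       # a named list
res_s <- simplify2array(res_l)     # collapsed to a named numeric vector
identical(res_s, sapply(data1, mean))  # the two routes agree
```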
Assume I want to add 10 to each element in the list data1. What can we do?
data1 + 10
Error in data1 + 10: non-numeric argument to binary operator
We would have to traverse the list using any of the learned methods, or even iterative methods:
lapply(data1, `+`, 10)
$Height
[1] 183 188 200 151
$Weight
[1] 214 199 223 163
$Age
[1] 31 36 28 28
Well, we could also make use of anonymous functions:
lapply(data1, function(x)x + 10)
$Height
[1] 183 188 200 151
$Weight
[1] 214 199 223 163
$Age
[1] 31 36 28 28
lapply(data1, \(x) x + 10)
$Height
[1] 183 188 200 151
$Weight
[1] 214 199 223 163
$Age
[1] 31 36 28 28
Map
This is a multivariate version of lapply. Think of it as lapply, but it traverses more than one vector at a time.
E.g. suppose you want to create a sequence from 2:5, then another from 1:4 and another from 10:13. Note that we can divide this problem such that we have two vectors, one containing the start points, i.e. c(2, 1, 10), and the other the end points, c(5, 4, 13), then we can iterate across the two:
start <- c(2, 1, 10)
end <- c(5, 4, 13)
Map(seq, start, end)
[[1]]
[1] 2 3 4 5
[[2]]
[1] 1 2 3 4
[[3]]
[1] 10 11 12 13
Note that with Map, the first argument is the function, then the lists/vectors we want to iterate over.
Note that if we were to use a for-loop, we could do:
result <- list()
for(i in seq_along(start)){
  result[[i]] <- seq(start[i], end[i])
}
result
[[1]]
[1] 2 3 4 5
[[2]]
[1] 1 2 3 4
[[3]]
[1] 10 11 12 13
mapply
This is the multivariate version of sapply. It will try to simplify the result to an array and, if it cannot, leave the result as a list. Note that it is otherwise identical to Map, save for this attempt to simplify the results. In fact, Map is defined as mapply(..., SIMPLIFY = FALSE).
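A small sketch of this simplification behaviour, reusing the start/end vectors from the Map example (the second call, with unequal output lengths, is my own illustrative addition):

```r
start <- c(2, 1, 10)
end   <- c(5, 4, 13)
m <- mapply(seq, start, end)
m  # every seq result has length 4, so the results simplify to a 4x3 matrix
l <- mapply(seq, c(1, 1), c(2, 5))
l  # lengths 2 and 5 cannot be simplified, so the result stays a list
```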
Assume you had the vector c("cat", "fish", "hamster") and you want to convert it into the following output:
[[1]]
[1] "cat" "cat" "cat"
[[2]]
[1] "fish" "fish"
[[3]]
[1] "hamster"
How would you go about it? Try using a for loop, then using Map. In both cases you will need to use the rep function. Check different solutions.
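One possible set of solutions, as a sketch (the names pets and times are my own):

```r
pets  <- c("cat", "fish", "hamster")
times <- c(3, 2, 1)

# for-loop solution
result <- list()
for (i in seq_along(pets)) {
  result[[i]] <- rep(pets[i], times[i])
}

# Map solution; unname() strips the names Map takes from its first argument
result2 <- unname(Map(rep, pets, times))
identical(result, result2)  # both routes give the same list
```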
Exercises
1. Try this one out
split
This function splits a vector/list into various parts. It takes the vector/list to be split as the first argument and another vector/list, the grouping vector, as the second. The data in the first argument is then grouped according to the second argument, and a list containing the grouped data is returned. Below is an example.
Suppose you have the grades of two students as follows:
grades <- c(90, 87, 78, 95, 89)
students <- c(1, 2, 1, 1, 2)
You can split the grades to a list whereby the first element is the grades of the first student, the second element is the grades of the second student etc.
split(grades, students)
$`1`
[1] 90 78 95
$`2`
[1] 87 89
With this, we can easily obtain the mean/max/min/sd etc. of each student.
sapply(split(grades, students), mean)
1 2
87.66667 88.00000
unsplit
This function reverses what split does. (Rarely used.) You can learn more by running help("unsplit").
tapply
This is the combination of sapply and split. Although not implemented as such, it works the same way as grouping the data and then, for each group, performing an action.
tapply(grades, students, mean)
1 2
87.66667 88.00000
So far we are using simple examples to grasp the concept of what the functions are doing.
Filter
This is a function used to keep items that satisfy a condition while discarding items that do not. The condition needs to be in a function format that returns either TRUE or FALSE; this type of function is called a predicate function. Usage: Filter(predicate, list/vector).
Take note of the capitalized F in Filter, as you will meet other functions with similar names.
Suppose in my data1 above, I need to keep variables with all values greater than 150:
data1
$Height
[1] 173 178 190 141
$Weight
[1] 204 189 213 153
$Age
[1] 21 26 18 18
Note that we only have to keep Weight. We could do:
Filter(\(x)all(x>150), data1)
$Weight
[1] 204 189 213 153
What happened? We have a predicate, which is an anonymous function, and used that to filter through our list. Below is the procedure:
my_predicate <- function(x) all(x > 150)
my_predicate(data1$Height)
[1] FALSE
my_predicate(data1$Weight)
[1] TRUE
my_predicate(data1$Age)
[1] FALSE
We see that only Weight returned TRUE. The rest returned FALSE.
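Two of the related functionals listed earlier, Find and Position, use the same predicate idea: Find returns the first element that satisfies the predicate, Position returns its index. A small sketch, with data1 rebuilt so it runs on its own:

```r
data1 <- list(Height = c(173, 178, 190, 141),
              Weight = c(204, 189, 213, 153),
              Age    = c(21, 26, 18, 18))
Find(\(x) all(x > 150), data1)      # the first matching element: the Weight vector
Position(\(x) all(x > 150), data1)  # its index: 2
```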
Note that we can also use sapply to filter:
data1[sapply(data1, my_predicate)] # data1[sapply(data1, \(z)all(z>150))]
$Weight
[1] 204 189 213 153
Quiz: Why did we use single brackets above instead of double brackets for extraction?
Reduce
Reduce uses a binary function to successively combine the elements of a given vector and a possibly given initial value. Starting from an initial position, Reduce iteratively combines the elements of a vector/list using the given function. E.g. assume we want to add the first 5 numbers, i.e. 1 + 2 + 3 + 4 + 5. We can do so by first adding 2 to 1, then adding 3 to the result of 2 + 1, then adding 4 to the result obtained previously, etc., i.e. (((1 + 2) + 3) + 4) + 5.
Note that the function/operation MUST be a binary operator/function, i.e. a function that takes only two elements a, b and returns an aggregated value/container.
vec <- 1:5
Reduce(`+`, vec)
[1] 15
Reduce(\(a, b)a + b, vec)
[1] 15
This is just sum(vec)
Note that Reduce has other parameters, e.g. accumulate = TRUE ensures that the result at each stage is also returned.
Reduce(`+`, vec, accumulate = TRUE)
[1] 1 3 6 10 15
Reduce(function(x,y)x+y, vec, accumulate = TRUE)
[1] 1 3 6 10 15
This is similar to cumsum(1:5)
We can also pass in an initial value from where we want the function to start at:
Reduce('+', vec, init = 10, accumulate = TRUE)
[1] 10 11 13 16 20 25
similar to cumsum(c(10, vec))
Well, assume we wanted to compute something like log(exp(acos(sin(cos(0.5)))))
We could simply do:
log(exp(acos(sin(cos(0.5)))))
[1] 0.6932138
Reduce(\(x, f)f(x), list(cos,sin, acos, exp, log), init = 0.5)
[1] 0.6932138
Reduce(\(x, f)f(x), list(0.5, cos,sin, acos, exp, log))
[1] 0.6932138
Simply? Is anything above simple? Could you explain what happened on the second and third lines?
What if we wanted to have the list with the functions listed as they appear, i.e. log, exp, acos, sin, cos?
Reduce(\(f, x)f(x), list(log, exp, acos, sin, cos, 0.5), right = TRUE)
[1] 0.6932138
Well, not bad. But what exactly happened? Let's try something else. We know that while some operations are commutative, others are not. For example, in division, \(a/b\neq b/a\), and the grouping matters too. If we wanted to do 1/2/3/4:
1/2/3/4
[1] 0.04166667
(((1/2)/3)/4)
[1] 0.04166667
Reduce("/",1:4)
[1] 0.04166667
But what if we wanted to do 1/(2/(3/4)), i.e. we first do the rightmost division, then move to the left? That is when we pass the logical value TRUE to the parameter right within Reduce:
1/(2/(3/4))
[1] 0.375
Reduce(`/`, 1:4, right = TRUE)
[1] 0.375
What were the intermediate values?
Reduce(`/`, 1:4, right = TRUE, accumulate = TRUE)
[1] 0.375000 2.666667 0.750000 4.000000
The rightmost value is 4. Then 3/4 = 0.75, then 2/0.75 = 2.6667. Lastly, 1/2.6667 = 0.375.
Note that we are using these simple examples to depict what Reduce does. So far, Reduce is the only function we have seen whereby the result of the current function evaluation depends on the previous results. Thus Reduce is practically an iterative function.
With this in mind, we can use Reduce to run a given operation/function n times, where n is the length of the passed-in vector/list.
We therefore can use it to quickly run some iterations: eg
log(log(log(log(1e10))))
[1] 0.1337832
Reduce(\(x,y)log(x), 1:4, init = 1e10)
[1] 0.1337832
Reduce(\(x,y)y(x), rep(c(log), 4), init = 1e10)
[1] 0.1337832
Reduce(\(x,y)log(x), rep(1e10, 5))
[1] 0.1337832
Reduce(\(x,y)log(x), c(1e10, 1:4))
[1] 0.1337832
Reduce(\(x,y)log(y), 1:4, init = 1e10, right = TRUE)
[1] 0.1337832
Reduce(\(x, y)x(y), c(rep(list(log), 4), 1e10), right = TRUE)
[1] 0.1337832
Explain how the above give similar results.
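The function-pipeline trick above can be packaged into a small compose helper (the name compose is my own; this is just a sketch built on Reduce, applying the functions left to right):

```r
compose <- function(...) {
  fns <- list(...)
  # the returned function threads x through each f in turn
  function(x) Reduce(\(acc, f) f(acc), fns, x)
}
compose(cos, sin, acos, exp, log)(0.5)  # same as log(exp(acos(sin(cos(0.5)))))
```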
Although Reduce has forward and backward directions, i.e. forward is right = FALSE and backward is right = TRUE, we will mostly stick to the forward direction.
We can also use Reduce to perform quick iterative methods. Recall the sqrt_heron function that you wrote back in exercise 3, question 8; we can easily implement it using Reduce:
S <- 125348
n <- 15 # number of iterations
Reduce(function(x, i) (x + S/x)/2, 1:n, 1)
[1] 354.0452
Since i is not used within the function, 1:n is not compulsory. As long as we have a vector of length 15, the function will run 15 times. E.g. to calculate \(\sqrt{10}\) using 5 iterations we can do:
Reduce(function(x, i) (x + 10/x)/2, rep(0, 5), 1)
[1] 3.162278
Reduce(function(x, i) (x + 10/x)/2, c("a", "b", "c", "d", "e"), 1)
[1] 3.162278
Note that I even used strings. That is because the lambda function is just looping over the length of the vector and not over the elements of the vector.
We can then write a function that takes in S and number of iterations:
sqrt_heron2 <- function(x, n = 20, init = 1){
  Reduce(\(y, i) (y + x/y)/2, seq_len(n), init)
}
sqrt_heron2(10)
[1] 3.162278
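Passing accumulate = TRUE to the same idea lets us watch the Heron iterates converge, here for the square root of 10 (a sketch; the variable name approx is my own):

```r
# six values: the init, then one value per iteration, converging to sqrt(10)
approx <- Reduce(\(y, i) (y + 10/y)/2, 1:5, 1, accumulate = TRUE)
approx
```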
To learn more about continued fractions, pi estimation etc. you can run help("Reduce"), then copy the examples and paste them in your editor, playing around and checking what the results are.
DataFrames
These are lists whose elements have the same length, i.e. each variable within the list has the same number of elements, and thus can be represented in a 2D array/matrix notation with each row being one observation.
Consider the list data1 above. We notice that we have three variables, i.e. Height, Weight and Age. We could arrange the data so that it appears like a matrix.
dat2 <- data.frame(data1)
dat2
Height Weight Age
1 173 204 21
2 178 189 26
3 190 213 18
4 141 153 18
The above looks just like a matrix. Why do we need a dataframe instead of a matrix? Because a dataframe can hold a different datatype for each variable. Suppose we had another variable that contained names. This column/variable would be of class character and not numeric. The dataframe ensures that while the characters remain characters, the numerics remain numeric.
eg:
dat2$name <- c("Ally", "Billy", "Cicy", "Lily")
dat2[['YOB']] <- c(2001, 2001, 2002, 2005)
dat2
Height Weight Age name YOB
1 173 204 21 Ally 2001
2 178 189 26 Billy 2001
3 190 213 18 Cicy 2002
4 141 153 18 Lily 2005
We can loop through the data using any of the functions above to check the class for each variable:
sapply(dat2, class) #apply(dat2, 2, class)
Height Weight Age name YOB
"numeric" "numeric" "numeric" "character" "numeric"
Note that Height, Weight, Age and Year of Birth are all numeric while name on the other hand is character.
This cannot be the case with a matrix, since a matrix can only contain data of one type/class. If you try to mix numbers and characters, they will all be converted to the character class.
dat3 <- as.matrix(dat2)
dat3
Height Weight Age name YOB
[1,] "173" "204" "21" "Ally" "2001"
[2,] "178" "189" "26" "Billy" "2001"
[3,] "190" "213" "18" "Cicy" "2002"
[4,] "141" "153" "18" "Lily" "2005"
You do not even need to check the class of each column. You can see the quotation marks, meaning everything is a string/character.
We can use the class function to determine the class to which an object in R belongs:
class(dat3)
[1] "matrix" "array"
Dataframe Extraction
You can extract elements from a dataframe exactly the same way as you extract them from a list or even from a matrix.
Extract the first element of the first variable
dat2[1, 1]
[1] 173
dat2[[1]][1]
[1] 173
dat2[1, "Height"]
[1] 173
dat2[["Height"]][1]
[1] 173
dat2$Height[1]
[1] 173
Why can’t you do dat2[1][1] or even dat2['Height'][1]?
Extract a variable:
dat2$Height
[1] 173 178 190 141
dat2[['Height']]
[1] 173 178 190 141
Extract the 2nd observation:
dat2[2, ]
Height Weight Age name YOB
2 178 189 26 Billy 2001
In our case, the data does not have unique row names that identify each row. If it did, we could have been able to use that to extract the specific row we are interested in.
Note that while in matrices the columns do not necessarily need to have names, this is not the case for dataframes: each column MUST be named. The names should be unique, although base R does allow duplicated names. One is advised to have unique names for each column to ensure each column is uniquely identifiable.
Quiz: Why is it not necessary for the columns of a matrix to be named? Because the columns might be representing the same thing with the difference being in position. Eg an image which is made up of pixels, each pixel containing the values from 0-255 can be represented as a matrix, whereby the row and column depict the position that the pixel occupies in the image. Note that everything is a pixel, just the position is different. On the other hand for a dataframe, all the columns represent different items/objects/variables/measurements.
With regards to dataframes, we have various functions in base R that we will probably learn while learning their equivalents from the tidyverse package. These include by, aggregate, sweep etc.
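As a quick taste of one of these (a sketch only; dat2 is rebuilt here so the snippet is self-contained), aggregate groups the rows by one variable and summarises another:

```r
dat2 <- data.frame(Height = c(173, 178, 190, 141),
                   Weight = c(204, 189, 213, 153),
                   Age    = c(21, 26, 18, 18))
# mean Height for each distinct Age; the two 18-year-olds are averaged together
res <- aggregate(Height ~ Age, data = dat2, FUN = mean)
res
```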