10 Iteration
In programming, iteration is the act of repeating a set of instructions. This can be done several different ways:
- Repeat until some condition is met.
- Repeat a specified number of times.
- Repeat for each element of a collection.
In R, the last example here, repeating each element of a collection, is by far the most common because vectors (collections of elements) are the most important data structure in R. As such, R has built-in functions that make this type of iteration extremely easy. While R does provide the usual iteration abilities through the use of for
and while
loops, these should not be your go-to methods for performing iteration with R.
After reading this chapter you should be able to:
- Use
lapply
and related functions to iterate over the elements of a vector. - Use
for
andwhile
loops to repeatedly evaluate R expressions. - Avoid common pitfalls when using loops in R.
10.1 Apply Functions
One of the most common operations that you will encounter while programming with R is running a function with each element of some vector as input and then collecting the results in a vector.
There are many functions, built-in and otherwise, to accomplish this task. We will begin by looking at the most important and generic function, lapply
.
10.1.1 lapply
The function in R that performs the operation described above is the lapply
function. The general syntax is:
lapply(X = some_list, FUN = f)
That is, some_list
is a vector (atomic vector or list) that the function f
will be “applied” to each element of. Note that it is customary to not name the arguments to lapply
.
lapply(some_list, f)
Let’s start with a very simple example.
lapply(1:3, log)
#> [[1]]
#> [1] 0
#>
#> [[2]]
#> [1] 0.6931472
#>
#> [[3]]
#> [1] 1.098612
Here we see the log
function applied to each of the elements of the vector 1:3
. This would be the same as running the following:
list(
log(1),
log(2),
log(3)
)
#> [[1]]
#> [1] 0
#>
#> [[2]]
#> [1] 0.6931472
#>
#> [[3]]
#> [1] 1.098612
Clearly, this isn’t a particularly useful example, as we could simply do the following:1
log(1:3)
#> [1] 0.0000000 0.6931472 1.0986123
Although, note that lapply
is returning a list, but the above returns an atomic vector. More on that in a moment.
For now, know that lapply
will return a list that has the same length as the input vector.2
Let’s look at an example of iterating over a list.
set.seed(42)
= list(a = runif(5),
ex_list b = runif(5),
c = runif(5))
ex_list
#> $a
#> [1] 0.9148060 0.9370754 0.2861395 0.8304476 0.6417455
#>
#> $b
#> [1] 0.5190959 0.7365883 0.1346666 0.6569923 0.7050648
#>
#> $c
#> [1] 0.4577418 0.7191123 0.9346722 0.2554288 0.4622928
lapply(ex_list, max)
#> $a
#> [1] 0.9370754
#>
#> $b
#> [1] 0.7365883
#>
#> $c
#> [1] 0.9346722
Again, here the input was a list of length three, so the output is as well. You might wish the output was an atomic vector. Again, more on that soon.
lapply(ex_list, range)
#> $a
#> [1] 0.2861395 0.9370754
#>
#> $b
#> [1] 0.1346666 0.7365883
#>
#> $c
#> [1] 0.2554288 0.9346722
Finally, a slightly more useful example. This returns the same object as the following:
list(
range(ex_list[[1]]),
range(ex_list[[2]]),
range(ex_list[[3]])
)
#> [[1]]
#> [1] 0.2861395 0.9370754
#>
#> [[2]]
#> [1] 0.1346666 0.7365883
#>
#> [[3]]
#> [1] 0.2554288 0.9346722
Hopefully, it is becoming clear that lapply
can be used to write concise, useful, and readable code.
What if we want to use a function with more than one argument? For example:
= function(x, c, p) {
multiply_and_power * x ^ p
c }
multiply_and_power(x = 2, c = 3, p = 0.5)
#> [1] 4.242641
multiply_and_power(x = 2, c = 1:3, p = 0.5)
#> [1] 1.414214 2.828427 4.242641
Be aware that depending on how we specify the values we pass to the arguments, there is likely going to be some length coercion taking place.
To use this function together with lapply
, we simply add the values of the additional parameters as arguments to lapply
.3
lapply(1:3, multiply_and_power, c = 1:5, p = 2)
#> [[1]]
#> [1] 1 2 3 4 5
#>
#> [[2]]
#> [1] 4 8 12 16 20
#>
#> [[3]]
#> [1] 9 18 27 36 45
What did this code do?
list(
multiply_and_power(x = 1, c = 1:5, p = 2),
multiply_and_power(x = 2, c = 1:5, p = 2),
multiply_and_power(x = 3, c = 1:5, p = 2)
)
#> [[1]]
#> [1] 1 2 3 4 5
#>
#> [[2]]
#> [1] 4 8 12 16 20
#>
#> [[3]]
#> [1] 9 18 27 36 45
What if we wanted to iterate over a different argument, say c
instead of x
? Specify x
and p
in the call to lapply
. Now lapply
will iterate over c
.
lapply(1:3, multiply_and_power, x = 1:5, p = 2)
#> [[1]]
#> [1] 1 4 9 16 25
#>
#> [[2]]
#> [1] 2 8 18 32 50
#>
#> [[3]]
#> [1] 3 12 27 48 75
So, this time, we did the following:
list(
multiply_and_power(x = 1:5, c = 1, p = 2),
multiply_and_power(x = 1:5, c = 2, p = 2),
multiply_and_power(x = 1:5, c = 3, p = 2)
)
#> [[1]]
#> [1] 1 4 9 16 25
#>
#> [[2]]
#> [1] 2 8 18 32 50
#>
#> [[3]]
#> [1] 3 12 27 48 75
Sure, you could simply use this instead, but imagine needed to iterate over 1:100000
instead of 1:5
.
10.1.2 sapply
Let’s return to the example that found the maximum of each element of a list.
set.seed(42)
= list(a = runif(5),
ex_list b = runif(5),
c = runif(5))
lapply(ex_list, max)
#> $a
#> [1] 0.9370754
#>
#> $b
#> [1] 0.7365883
#>
#> $c
#> [1] 0.9346722
As expected, the result is a list. However, notice that each element of said list is an atomic vector of length one, of the same type. We could actually check that using lapply
.
lapply(lapply(ex_list, max), typeof)
#> $a
#> [1] "double"
#>
#> $b
#> [1] "double"
#>
#> $c
#> [1] "double"
lapply(lapply(ex_list, max), length)
#> $a
#> [1] 1
#>
#> $b
#> [1] 1
#>
#> $c
#> [1] 1
It probably seems like what we really want as output here is an atomic vector that is the same length as the input vector. We can obtain this result by switching from lapply
to sapply
.
sapply(ex_list, max)
#> a b c
#> 0.9370754 0.7365883 0.9346722
The s
in sapply
refers to the simplifying action taken by the function. Much of the details of how the simplification works follow the usual rules of the coercion hierarchy. It is probably best not to worry too much about these rules, but also not rely on simplification too much. Generally, it is best to use sapply
in the case we’ve just seen here: you are certain the result of the function applied to each element is an atomic vector of length one, each with the same type.
Another example:
sapply(1:3, log)
#> [1] 0.0000000 0.6931472 1.0986123
But again, this example isn’t truly necessary, as the following is even better:
log(1:3)
#> [1] 0.0000000 0.6931472 1.0986123
We show this to demonstrate that many operations in R are already vectorized, so there is no need to iterate.
10.1.3 Other Apply Functions
Other apply functions exist. Many are rarely used. One that might be of interest is vapply
which will do simplification like sapply
, but the user will need to specify the expected outcome of each iteration, which will make the simplification more predictable.
vapply(1:3, log, double(1))
#> [1] 0.0000000 0.6931472 1.0986123
vapply(1:3, log, integer(1))
#> Error in vapply(1:3, log, integer(1)): values must be type 'integer',
#> but FUN(X[[1]]) result is type 'double'
Another that you will likely see is the apply
function. We would advise avoiding this unless you truly understand what it does. Also, beware, it should probably not be used with data frames.4
10.2 Loops
Loops are another form of control flow. They allow you to explicitly specify the repetition of some code, in contrast to the apply functions above that did so implicitly.5
Welcome to R Club.
- The first rule of R Club is: Do not use
for
loops!- The second rule of R Club is: Do not use
for
loops!- And the third and final rule: If you have to use a
for
loop, do not grow vectors!— Unknown
Loops are very common in programming, however, in R, it is probably best to avoid them unless you truly need them. The general heuristic you should use to determine if you need a loop or apply function is:
- Use a loop when the result of the next iteration depends on the result of the previous iteration.
- Use an apply function when the results of each iteration are independent.6
10.2.1 for
The most common looping structure is a for
loop. The generic syntax is:
for (element in vector) {
code_to_run }
We’ll refer to element
as the loop variable.
Let’s look at a specific example.
# pre-allocate storage vector
= double(length = 5)
x
# perform loop
for (i in 1:5) {
= i ^ 2
x[[i]]
}
# check results
x
#> [1] 1 4 9 16 25
First, note that for
is not a function, which is why you should consider placing a space between it and the parenthesis that follows. Next, (i in 1:5)
is considered the header of the loop which defines how the iteration will take place. Here the name of the loop variable is i
and it will take a value from the vector 1:5
each time the body runs. The code inside the braces, {}
is called the body of the loop, much like the body of a function.
- Each time through the loop,
i
, will take one of the values from1:5
. Or generally, the loop variable will take the value of each element of some vector. - For each value of
i
, the codex[i] = i ^ 2
will run. In general, for each value of the looping variable, the code in the body will run. And often, that code will depend on the looping variable, like we see here.
So, the above for loop ran each of the following:
1]] = 1 ^ 2
x[[2]] = 2 ^ 2
x[[3]] = 3 ^ 2
x[[4]] = 4 ^ 2
x[[5]] = 5 ^ 2 x[[
This should make it clear that the purpose of a loop is to repeat code, without actually having to repeatedly type the code.
As has become a theme, this for
loop is truly useless in R. We could have simply done:
1:5) ^ 2 (
#> [1] 1 4 9 16 25
Here, i
is functioning much like the name of a function argument, except now, we pass a new value, an element of 1:5
, each time through the loop.
You can use any name you want for the loop variable, but i
, j
, and k
are most common.
for (some_long_var_name in 1:5) {
print(some_long_var_name)
}
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
A for
loop is a very powerful structure, so it will not be possible for us to illustrate all possible usage examples. Let’s look at a correct loop written poorly, then the same loop written better, and try to draw some conclusion about best practices with for
loops.
Before proceeding, let’s introduce the seq_along
function.
seq_along(5:1)
#> [1] 1 2 3 4 5
Essentially, seq_along
returns the indexes of a vector. Or, you could think of it as returning the result of the following:
1:length(5:1)
#> [1] 1 2 3 4 5
Let’s use a for
loop to create a sequence of numbers. The first two numbers will be 10
, and 5
. Elements after that will be calculated as:
\[ x_i = 3 \cdot \frac{x_{i - 1}}{x_{i - 2}} \]
We’ll use a loop to create a sequence of length ten that follows this specification.
First, a bad example of how to write a loop to accomplish this:
# perform loop
for (i in 1:10) {
if (i == 1) {
= 10
x else if (i == 2) {
} = c(x, 5)
x else {
} = c(x, 3 * x[i - 1] / x[i - 2])
x
}
}
# check results
x
#> [1] 10.0 5.0 1.5 0.9 1.8 6.0 10.0 5.0 1.5 0.9
We see the correct resulting vector, x
, but we have used multiple sub-optimal techniques. In particular, we “grew” the x
vector. The use of x = c(x, some_new_element)
takes what was x
, then creates a new x
but combining the previous x
with some new element. Do not do this. This is one of the reasons people incorrectly think R is slow. This operation is slow, but there is no need for it.
Instead, let’s pre-allocate the x
which we will store our results in.
# pre-allocate x to be a double vector of the correct length
= double(10)
x
# perform loop
for (i in seq_along(x)) {
if (i == 1) {
= 10
x[[i]] else if (i == 2) {
} = 5
x[[i]] else {
} = 3 * x[i - 1] / x[i - 2]
x[[i]]
}
}
# check results
x
#> [1] 10.0 5.0 1.5 0.9 1.8 6.0 10.0 5.0 1.5 0.9
This time, since x
already existed, we are simply replacing individual elements of an already existing vector. This is faster. Any time you grow or add new elements (that is you increase the length of a vector) there is a copy operation taking place under the hood that you could have avoided.
Also, by pre-allocating x
, we can now use seq_along(x)
. In some applications we might be creating x
with a program, and we wouldn’t know its length ahead of time! This will avoid having to specify the length in two locations in code.
Some general ideas to keep in mind:
- Do not attempt to iterate over and store results in the same vector.
- Pre-allocate a “results” vector and update individual elements as you progress through the loop. Do not grow vectors.
- Use
seq_along
and iterate over indexes rather than elements of a vector.
We’ve already discussed why the second item is a problem. Let’s now create an example that demonstrate items one and three.
The following function will check if an number is even.
= function(x) {
is_even %% 2 == 0
x }
We also create a vector y
that stores some numbers.
# create data
set.seed(42)
= sample(1:10, size = 20, replace = TRUE)
y
# view data
y
#> [1] 1 5 1 9 10 4 2 10 1 8 7 4 9 5 4 10 2 3 9 9
Our goal is to create a logical vector, the same length as y
, containing TRUE
at any index where y
is even.
This will, not work:
for (i in y) {
= is_even(i)
y[[i]] }
To better see the issue, temporarily place a print()
statement inside the loop.7
# create data
set.seed(42)
= sample(1:10, size = 20, replace = TRUE)
y
# perform loop
for (i in y) {
print(i)
= is_even(i)
y[[i]] }
#> [1] 1
#> [1] 5
#> [1] 1
#> [1] 9
#> [1] 10
#> [1] 4
#> [1] 2
#> [1] 10
#> [1] 1
#> [1] 8
#> [1] 7
#> [1] 4
#> [1] 9
#> [1] 5
#> [1] 4
#> [1] 10
#> [1] 2
#> [1] 3
#> [1] 9
#> [1] 9
# check results
y
#> [1] 0 1 0 1 0 4 0 1 0 1 7 4 9 5 4 10 2 3 9 9
So i
takes values from y
, but by doing so, we don’t have access to the indexes at which we need to replace with the result of is_even
. Let’s use seq_along
.
# create data
set.seed(42)
= sample(1:10, size = 20, replace = TRUE)
y
# perform loop
for (i in seq_along(y)) {
= is_even(y[i])
y[[i]]
}
# check results
y
#> [1] 0 0 0 0 1 1 1 1 0 1 0 1 0 0 1 1 1 0 0 0
Note that inside of i
, we now need to change i
to y[i]
to get the value rather than the index each time through the loop.
But there’s still an issue! We have 0
and 1
instead of FALSE
and TRUE
. Coercion!
# create data
set.seed(42)
= sample(1:10, size = 20, replace = TRUE)
y
# pre-allocate storage vector
= logical(length(y))
res
# perform loop
for (i in seq_along(y)) {
= is_even(y[[i]])
res[[i]]
}
# check results
res
#> [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
#> [13] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
Much better. But again, remember, many things in R are vectorized:
is_even(y)
#> [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
#> [13] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
This example did not need a loop, because results from one iteration to the next were independent. In the previous example, this was not the case, and was an example of when you truly need a loop.
Note that these examples have used atomic vectors, but, no reason we couldn’t use a list!
10.2.2 while
A while
loop will repeat code until a specified condition is no longer met. The general syntax is:
while (condition) {
code_to_run }
Let’s see an example.
# create some data
= 5
x
# pre-allocate storage vector
= double(length = length(x))
y
# perform loop
while (x > 0) {
print(x)
= x ^ 2
y[[x]] = x - 1
x }
#> [1] 5
#> [1] 4
#> [1] 3
#> [1] 2
#> [1] 1
# check results
x
#> [1] 0
y
#> [1] 1 4 9 16 25
Here, the loop runs until x
is no longer greater than 0
.
Notice that if we don’t modify x
inside the loop, it would run forever! An infinite loop!8
You will likely see for
loops more often, but while
loops are useful when you don’t know how many iterations you’ll need ahead of time, but you can describe a stopping condition.
= 1 # setup initial data
x = 0 # setup result vector
y
# perform loop
while(x > .Machine$double.eps) {
= y + x
y = x / 2
x
}
# check results
y
This example demonstrates a method to numerically evaluate the following sum.9
\[ \sum_{k = 0}^{\infty} \left(\frac{1}{2}\right) ^ k \]
Because we cannot actually sum up an infinite number of terms, as that would take forever, we instead sum up all terms that are indistinguishable from zero to the computer. In this case, .Machine$double.eps
gives us the smallest possible number that R can recognize on the machine that processed this chapter. Because x
will only become smaller as the loop continues, we know that once the loop stops, all future terms would have also been indistinguishable from zero.
$double.eps .Machine
#> [1] 2.220446e-16
10.2.3 repeat
A repeat
loop will continually repeat an expression, without ever stopping.
repeat {
42
}
The above is not run, because it would never stop! It could have also been written using a while
loop:
while (TRUE) {
42
}
Because there is no built-in stopping rule when using repeat
, it is not a feature we will return to often. If necessary, in order to exit a repeat
loop, the break
expression may be used.
= 42
x repeat {
print(x)
if (x < 1) {
break
}= sqrt(x) - 1
x }
#> [1] 42
#> [1] 5.480741
#> [1] 1.341098
#> [1] 0.1580579
10.3 Summary
- TODO: You’ve learned to…
10.4 What’s Next?
- TODO: ?
This example is easier to write, easier to read, and because of vectorization, much faster.↩︎
Think
l
for list. Although, it is unclear if that is the etymology of the name of thelapply
function.↩︎If you check the documentation for
lapply
, you’ll notice an argument called...
. More on this later, but this is what allows R to pass these additional arguments to the function.↩︎The
apply
function is useful when working with matrix objects, which so far we have been avoiding.↩︎Technically, the apply functions could be said to be “hiding loops” as they are mostly just convenience functions wrapped around
for
loops.↩︎Also check that you can’t just use a vectorized operation.↩︎
This is a simple and naive, but incredibly powerful debugging practice.↩︎
If you experience an infinite loop, use Ctrl + C in the console to escape it. Or press the stop button in RStudio.↩︎
Observant readers will recognize this sum as a geometric series and note that there is an analytical solution.↩︎