1. Introduction to {purrr}

A basic introductory tutorial of the purrr::map() family

Shijing Zhou
05-28-2021

What is {purrr}?

{purrr} is a handy package that provides a number of helpful functions for iteration, that is, for applying a function repeatedly across the elements of a vector or list. In this tutorial, different uses of purrr::map() and its variants are demonstrated.

What can you do with purrr::map()?

purrr::map() and its variants are functionals: functions that take another function as input, apply it to each element of a list or atomic vector, and return an object of the same length as the input. The difference between purrr::map() and its variants lies in the return type: purrr::map() always returns a list, whereas each of the other variants returns an atomic vector of the indicated type.

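Some commonly used variants include map_lgl() (returns a logical vector), map_int() (integer), map_dbl() (double), and map_chr() (character).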
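As a minimal sketch of how the return types differ, the example below applies map() and four typed variants to a small made-up list. The list x and the result names are illustrative assumptions, not part of the tutorial data.

```r
library(purrr)

# Hypothetical toy input: three numeric vectors of different lengths
x <- list(a = 1:3, b = c(2.5, 4.5), c = 10)

means_list <- map(x, mean)                             # a list of length 3
means_dbl  <- map_dbl(x, mean)                         # a named double vector
n_elements <- map_int(x, length)                       # a named integer vector
is_long    <- map_lgl(x, ~ length(.x) > 1)             # a named logical vector
pasted     <- map_chr(x, ~ paste(.x, collapse = "-"))  # a named character vector
```

Each typed variant throws an error if the function does not return a single value of the expected type, which makes mistakes easier to spot early.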

The map() family’s arguments are relatively simple, but it can take a while to get used to them. Its basic anatomy is as follows:

map(your_data, some_function_or_formula_or_vector, any_necessary_arguments_for_function)

For example,

map_dbl(data, mean, na.rm = TRUE)

would return a vector containing the mean (removing NAs) of each column of data.
Instead of supplying a function as input, you can also write equivalent code supplying a formula instead.

map_dbl(data, ~mean(.x, na.rm = TRUE))
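To check that the two forms really are equivalent, here is a small sketch on a made-up data frame (df and its columns are assumptions used only for illustration).

```r
library(purrr)

# Hypothetical data frame with one missing value per column
df <- data.frame(x = c(1, 2, NA), y = c(10, NA, 30))

# Function form: extra arguments (na.rm = TRUE) are passed on to mean()
by_function <- map_dbl(df, mean, na.rm = TRUE)

# Formula form: .x stands for the current column
by_formula <- map_dbl(df, ~ mean(.x, na.rm = TRUE))

identical(by_function, by_formula)
```

Both calls return a named double vector with one mean per column.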

You can also input “anonymous” functions

map(data, function(x) x + 2) # map(), not map_dbl(): x + 2 returns a whole column, not a single number

and vectors for indexing.

map(your_complex_list, c(1, 4))
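Indexing with a vector can be hard to picture, so here is a rough sketch on a made-up nested list (people and its fields are hypothetical). A character vector extracts by name, a numeric vector extracts by position, and a vector or list with several entries indexes one level deeper with each entry.

```r
library(purrr)

# Hypothetical nested list: each element holds a name and a vector of scores
people <- list(
  list(name = "Ana", scores = c(90, 85, 70)),
  list(name = "Ben", scores = c(60, 75, 80))
)

# Extract the "name" element of each person
names_chr <- map_chr(people, "name")

# Extract the third score of each person, indexing by name and then position
third_by_name <- map_dbl(people, list("scores", 3))

# The same extraction by position only: second element, then its third value
third_by_pos <- map_dbl(people, c(2, 3))
```

So map(your_complex_list, c(1, 4)) would take the first element of each list entry and then the fourth element within it.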

Let’s demonstrate purrr::map() with a few examples from real data.

Load package and data

For this tutorial, we will use an open dataset containing responses from N = 2,495 individuals to a conspiracist ideation measure called the Generic Conspiracist Beliefs Scale (GCBS; Brotherton et al., 2013), a personality measure called the Ten Item Personality Inventory (TIPI; Gosling, Rentfrow, & Swann, 2003), and various demographic and validity check items. For more information about the data, see our second post.

library(tidyverse) # note: {purrr} is a {tidyverse} package
conspiracy <- rio::import(here::here("content", "dataCT.csv"))

Let’s take a look at those variables in the dataset.

str(conspiracy)
'data.frame':   2495 obs. of  72 variables:
 $ Q1          : int  5 5 2 5 5 1 4 5 1 1 ...
 $ Q2          : int  5 5 4 4 4 1 3 4 1 2 ...
 $ Q3          : int  3 5 1 1 1 1 3 3 1 1 ...
 $ Q4          : int  5 5 2 2 4 1 3 3 1 1 ...
 $ Q5          : int  5 5 2 4 4 1 4 4 1 1 ...
 $ Q6          : int  5 3 2 5 5 1 3 5 1 5 ...
 $ Q7          : int  5 5 4 4 4 1 3 5 1 1 ...
 $ Q8          : int  3 5 2 1 3 1 4 5 1 1 ...
 $ Q9          : int  4 1 2 4 1 1 2 5 1 1 ...
 $ Q10         : int  5 4 4 5 5 1 3 5 1 4 ...
 $ Q11         : int  5 4 2 5 5 1 3 5 1 1 ...
 $ Q12         : int  5 5 4 5 5 1 2 5 1 1 ...
 $ Q13         : int  3 4 0 1 3 1 2 3 1 1 ...
 $ Q14         : int  5 4 2 4 5 1 3 4 1 1 ...
 $ Q15         : int  5 5 4 5 5 1 4 5 1 5 ...
 $ E1          : int  7070 4086 27535 4561 8841 15267 7249 8024 4654 23787 ...
 $ E2          : int  7469 13107 7814 5589 7575 7112 4651 7343 6076 12375 ...
 $ E3          : int  7383 2807 7762 3506 3832 4798 5496 6808 3032 2006 ...
 $ E4          : int  6540 5030 10290 3784 7775 5214 3936 6794 3984 3650 ...
 $ E5          : int  9098 7405 8558 5093 4160 3683 7831 8743 4328 3188 ...
 $ E6          : int  4998 7864 10538 3555 5216 4130 6816 6196 4070 48851 ...
 $ E7          : int  6971 16234 4740 3158 7559 4487 6167 7762 4012 9013 ...
 $ E8          : int  4713 2603 4162 1887 5792 2376 2032 4797 2430 2128 ...
 $ E9          : int  6032 14174 6492 7678 10296 3273 4000 8015 4191 2898 ...
 $ E10         : int  5878 9423 11512 2304 5455 5501 3583 5764 8444 10420 ...
 $ E11         : int  4031 11683 6874 3604 3864 3790 4481 5717 4224 5820 ...
 $ E12         : int  4386 12718 11440 2724 11799 7777 5071 5352 4404 2049 ...
 $ E13         : int  9077 4816 0 2689 7872 4553 2368 6387 1065 9901 ...
 $ E14         : int  5113 6806 11418 2657 10543 5944 4408 9671 5533 3838 ...
 $ E15         : int  4204 4823 9872 3824 4224 4028 6103 5622 4964 7208 ...
 $ introelapse : int  11 6 7 5 4 35 12 27 2 26 ...
 $ testelapse  : int  95 125 141 58 105 87 75 104 67 148 ...
 $ surveyelapse: int  142 144 90 135 210 154 67 186 121 118 ...
 $ TIPI1       : int  5 6 6 6 1 4 2 4 4 1 ...
 $ TIPI2       : int  3 7 6 7 3 2 5 5 5 6 ...
 $ TIPI3       : int  6 6 6 7 7 6 4 6 6 3 ...
 $ TIPI4       : int  2 7 1 5 2 2 2 2 2 1 ...
 $ TIPI5       : int  6 6 7 7 6 6 5 7 4 5 ...
 $ TIPI6       : int  6 3 5 6 4 5 6 4 5 7 ...
 $ TIPI7       : int  7 7 6 5 5 6 2 5 6 6 ...
 $ TIPI8       : int  2 5 5 1 5 3 3 5 2 5 ...
 $ TIPI9       : int  7 1 7 5 5 6 5 3 7 7 ...
 $ TIPI10      : int  1 1 7 1 3 2 5 1 2 4 ...
 $ VCL1        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ VCL2        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ VCL3        : int  1 0 1 1 0 1 0 0 0 1 ...
 $ VCL4        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ VCL5        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ VCL6        : int  0 0 1 0 0 0 0 0 0 0 ...
 $ VCL7        : int  0 0 1 0 0 0 0 0 0 0 ...
 $ VCL8        : int  0 0 1 1 0 0 0 0 0 0 ...
 $ VCL9        : int  0 0 0 0 0 1 0 0 0 0 ...
 $ VCL10       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ VCL11       : int  1 1 1 1 0 1 0 0 0 0 ...
 $ VCL12       : int  0 0 0 0 0 1 0 0 0 0 ...
 $ VCL13       : int  1 0 1 1 1 1 1 0 1 1 ...
 $ VCL14       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ VCL15       : int  1 1 1 0 1 1 1 1 1 1 ...
 $ VCL16       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ education   : int  3 1 4 3 2 3 2 2 1 3 ...
 $ urban       : int  0 2 2 1 2 1 2 1 3 3 ...
 $ gender      : int  1 2 2 1 1 1 1 1 1 1 ...
 $ engnat      : int  2 1 2 1 1 1 1 1 1 2 ...
 $ age         : int  28 14 26 25 37 34 17 23 17 28 ...
 $ hand        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ religion    : int  2 1 1 12 2 7 1 2 4 2 ...
 $ orientation : int  1 2 1 1 2 1 1 1 2 1 ...
 $ race        : int  5 4 4 4 4 4 4 4 4 4 ...
 $ voted       : int  2 2 1 1 2 1 2 2 2 1 ...
 $ married     : int  1 1 1 1 2 2 1 1 1 2 ...
 $ familysize  : int  1 1 2 3 2 2 2 3 2 3 ...
 $ major       : chr  "ACTING" "" "philosophy" "history" ...

Using tidyr::nest() along with purrr::map()

By looking at the dataset, you can tell it is pretty massive. What if you are interested in how responses differ by gender group? You probably want to split the data. Instead of using split() from base R, you can use tidyr::nest() (attached along with the rest of the tidyverse), which does a similar job but stores each group’s rows in a list column.

by_gender <- conspiracy %>% 
  group_by(gender) %>% 
  nest()

by_gender
# A tibble: 4 x 2
# Groups:   gender [4]
  gender data                 
   <int> <list>               
1      1 <tibble [1,222 x 71]>
2      2 <tibble [1,137 x 71]>
3      3 <tibble [130 x 71]>  
4      0 <tibble [6 x 71]>    

From the output, we can see that the new dataset by_gender contains a list column, data, that holds one tibble per gender category. We can play with the list column a little bit more.
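To make the comparison with split() concrete, here is a minimal sketch on a made-up data frame (toy and its values are assumptions, standing in for the conspiracy data).

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the real data: a grouping variable and one item
toy <- data.frame(
  gender = c(1, 2, 1, 2, 3),
  TIPI5  = c(6, 5, 7, 4, 6)
)

# Base R: a named list with one data frame per gender
split_list <- split(toy, toy$gender)

# tidyr: one row per gender, with the remaining columns in a list column
nested <- toy %>%
  group_by(gender) %>%
  nest()
```

One visible difference is that split() keeps the grouping column inside every piece, while nest() moves it out into its own column, which is why the nested tibbles in the tutorial have 71 columns rather than 72.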

Let’s say we want to know how many observations were under each gender category.

by_gender <- conspiracy %>% 
  group_by(gender) %>% 
  nest() %>% 
  mutate(n = map(data, nrow))

by_gender
# A tibble: 4 x 3
# Groups:   gender [4]
  gender data                  n        
   <int> <list>                <list>   
1      1 <tibble [1,222 x 71]> <int [1]>
2      2 <tibble [1,137 x 71]> <int [1]>
3      3 <tibble [130 x 71]>   <int [1]>
4      0 <tibble [6 x 71]>     <int [1]>

As you can see, map() returned the number of observations for each gender category as a list, which is not convenient to read when all we want is such simple information. In cases like this, the variants of map() come in handy.

Using purrr::map_dbl()

To simplify the previous output, we probably want a double (i.e., numeric) vector rather than a list. This is a good time to use purrr::map_dbl().

by_gender <- conspiracy %>% 
  group_by(gender) %>% 
  nest() %>% 
  mutate(n = map_dbl(data, nrow))

by_gender
# A tibble: 4 x 3
# Groups:   gender [4]
  gender data                      n
   <int> <list>                <dbl>
1      1 <tibble [1,222 x 71]>  1222
2      2 <tibble [1,137 x 71]>  1137
3      3 <tibble [130 x 71]>     130
4      0 <tibble [6 x 71]>         6

Much easier to read now, isn’t it?
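Because nrow() returns an integer, purrr::map_int() would arguably be an even closer fit than map_dbl() here. The sketch below illustrates the difference using the built-in mtcars data split by cyl as a stand-in for the nested conspiracy data (that substitution is an assumption made so the example runs on its own).

```r
library(purrr)

# Stand-in for the nested data: mtcars split by number of cylinders
groups <- split(mtcars, mtcars$cyl)

n_dbl <- map_dbl(groups, nrow)  # integers are safely converted to doubles
n_int <- map_int(groups, nrow)  # keeps the counts as integers

typeof(n_dbl)
typeof(n_int)
```

Either variant gives a plain vector of counts; the choice mainly affects the column type shown in the tibble.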

Do some more complex analysis with purrr::map()!

Let’s say we are interested in the relationship between education (education) and participants’ self-ratings on the “open to new experiences, complex” item (TIPI5). Is level of education a good predictor of openness to new experiences in different gender groups?

by_gender <- by_gender %>% 
  mutate(edu_m = map(data, ~lm(TIPI5 ~ education, data = .x)))

by_gender
# A tibble: 4 x 4
# Groups:   gender [4]
  gender data                      n edu_m 
   <int> <list>                <dbl> <list>
1      1 <tibble [1,222 x 71]>  1222 <lm>  
2      2 <tibble [1,137 x 71]>  1137 <lm>  
3      3 <tibble [130 x 71]>     130 <lm>  
4      0 <tibble [6 x 71]>         6 <lm>  
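A list column of models is not very informative on its own, but the typed variants can pull single numbers out of each model. The sketch below shows one way this might look, using the built-in mtcars data grouped by cyl as a stand-in (the data, the model mpg ~ wt, and the column names are all assumptions for illustration).

```r
library(dplyr)
library(tidyr)
library(purrr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    # Fit one linear model per group, stored in a list column
    wt_m  = map(data, ~ lm(mpg ~ wt, data = .x)),
    # Extract the slope for wt from each model
    slope = map_dbl(wt_m, ~ coef(.x)[["wt"]]),
    # Extract R-squared from each model summary
    r_sq  = map_dbl(wt_m, ~ summary(.x)$r.squared)
  )
```

The same pattern should carry over to the edu_m column, for example with map_dbl(edu_m, ~ summary(.x)$r.squared).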

What about religion (religion) predicting openness to new experiences (TIPI5) in different gender groups?

by_gender <- by_gender %>% 
  mutate(religion_m = map(data, ~lm(TIPI5 ~ religion, data = .x)))

by_gender
# A tibble: 4 x 5
# Groups:   gender [4]
  gender data                      n edu_m  religion_m
   <int> <list>                <dbl> <list> <list>    
1      1 <tibble [1,222 x 71]>  1222 <lm>   <lm>      
2      2 <tibble [1,137 x 71]>  1137 <lm>   <lm>      
3      3 <tibble [130 x 71]>     130 <lm>   <lm>      
4      0 <tibble [6 x 71]>         6 <lm>   <lm>      

From the outputs above, you can see that the results from the two linear models are stored in two separate columns, and that using mutate() after nest() wraps the model for each gender in a list column. But how do we know which variable does a better job of predicting openness to new experiences?

Parallel iteration with purrr::map2()

As mentioned previously, map() transforms the input by applying a function to each element of a list or atomic vector. However, {purrr} also provides a function to iterate over two vectors concurrently, purrr::map2(), which takes the following form:

map2(vector1, vector2, a_function/formula/vector, arguments)
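Before applying this to models, a tiny sketch with made-up vectors may help (x, y, and the other inputs are hypothetical). In the formula form, .x refers to the current element of the first vector and .y to the matching element of the second.

```r
library(purrr)

x <- c(1, 2, 3)
y <- c(10, 20, 30)

# Add the vectors pairwise: 1 + 10, 2 + 20, 3 + 30
sums <- map2_dbl(x, y, ~ .x + .y)

# Extra arguments (sep = "_") are passed on to the function, as with map()
joined <- map2_chr(c("a", "b"), c("x", "y"), paste, sep = "_")
```

Like map(), map2() has typed variants such as map2_dbl() and map2_chr() that return atomic vectors instead of lists.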

Using purrr::map2(), we can compare the models we created previously!
Just as you would use stats::anova() to compare two models, we can use stats::anova() within purrr::map2() to compare two list columns of models, one pair at a time.

mods <- by_gender %>%
  mutate(
    edu_m = map(data, ~lm(TIPI5 ~ education, data = .x)),
    religion_m = map(data, ~lm(TIPI5 ~ religion, data = .x))
  ) %>% 
  mutate(comp = map2(edu_m, religion_m, anova))

mods
# A tibble: 4 x 6
# Groups:   gender [4]
  gender data                      n edu_m  religion_m comp           
   <int> <list>                <dbl> <list> <list>     <list>         
1      1 <tibble [1,222 x 71]>  1222 <lm>   <lm>       <anova [2 x 6]>
2      2 <tibble [1,137 x 71]>  1137 <lm>   <lm>       <anova [2 x 6]>
3      3 <tibble [130 x 71]>     130 <lm>   <lm>       <anova [2 x 6]>
4      0 <tibble [6 x 71]>         6 <lm>   <lm>       <anova [2 x 6]>

Now we have a model comparison for each gender group, stored in a list column! From there, you can extract any information you want.
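As one possible way of extracting information from the comparison, the sketch below pulls the residual sum of squares (RSS) for each model out of every anova table. It uses the built-in mtcars data grouped by cyl as a stand-in, and the models and column names are assumptions. Because the two models in each pair have the same number of predictors and are not nested, the F test in the table is not informative, so the RSS values serve only as a rough descriptive comparison.

```r
library(dplyr)
library(tidyr)
library(purrr)

mods_cyl <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    m_wt = map(data, ~ lm(mpg ~ wt, data = .x)),
    m_hp = map(data, ~ lm(mpg ~ hp, data = .x)),
    # Compare the two models for each group
    comp = map2(m_wt, m_hp, anova),
    # Row 1 of each anova table is the first model, row 2 the second
    rss_wt = map_dbl(comp, ~ .x$RSS[1]),
    rss_hp = map_dbl(comp, ~ .x$RSS[2])
  )

select(mods_cyl, cyl, rss_wt, rss_hp)
```

A smaller RSS means the model leaves less variation in the outcome unexplained within that group.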

This concludes our tutorial on the purrr::map() family. We hope you had fun with functionals!