A basic introductory tutorial on the purrr::map() family
{purrr} is a handy package that provides a number of helpful functions for iterating over vectors with functions. In this tutorial, different uses of purrr::map() and its variants are demonstrated.
purrr::map() and its variants are functionals, meaning they are functions that take another function as input, apply that function to the specified data, and return the results as output. purrr::map() and its “family” of functions transform their input by applying a function to each element of a list or atomic vector, and they return an object of the same length as the input. The difference between purrr::map() and its variants is that purrr::map() always returns a list, whereas the other variants return an atomic vector of the indicated type.
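To make the difference concrete, here is a minimal sketch that applies mean() to a small made-up list (the object scores is hypothetical and not part of the tutorial's dataset), once with map() and once with map_dbl().

```r
library(purrr)

# A small illustrative list (hypothetical data)
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

as_list <- map(scores, mean)      # always a list, one element per input element
as_dbl  <- map_dbl(scores, mean)  # a named double vector of the same length

is.list(as_list)    # TRUE
is.double(as_dbl)   # TRUE
as_dbl              # a = 2, b = 15, c = 5
```

Both results have three elements, one per element of scores; only the container type differs.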
Some commonly used variants include:
purrr::map_lgl(): returns a logical vector
purrr::map_int(): returns an integer vector
purrr::map_dbl(): returns a double vector
purrr::map_chr(): returns a character vector
purrr::map_df(): returns a data frame, often used for batch-loading data
The map() family’s arguments are relatively simple, but it can take a while to get used to them. The basic anatomy is as follows:
map(your_data, some_function_or_formula_or_vector, any_necessary_arguments_for_function)
For example,
map_dbl(data, mean, na.rm = TRUE)
would return a vector containing the mean (ignoring NAs) of each column of data.
Instead of supplying a function by name, you can write equivalent code using a formula, in which .x stands for the current element.
map_dbl(data, ~mean(.x, na.rm = TRUE))
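As a quick check that the two spellings agree, the sketch below runs both on a tiny made-up data frame (toy is a hypothetical stand-in for data).

```r
library(purrr)

# Hypothetical data frame with one missing value per column
toy <- data.frame(x = c(1, NA, 3), y = c(4, 5, NA))

by_name    <- map_dbl(toy, mean, na.rm = TRUE)        # extra arguments passed along
by_formula <- map_dbl(toy, ~ mean(.x, na.rm = TRUE))  # .x is the current column

by_name      # x = 2, y = 4.5
all.equal(by_name, by_formula)
```

Passing extra arguments after the function name is shorter, while the formula form makes it explicit where each element goes.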
You can also supply “anonymous” functions
map(data, function(x) x + 2)
and vectors for indexing.
map(your_complex_list, c(1, 4))
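Indexing with a vector is the least obvious of these forms, so here is a small sketch. A numeric vector such as c(1, 4) is read as a path: take the first element of each item, then the fourth element of that. The list nested is made up for illustration.

```r
library(purrr)

# Hypothetical nested list: each item holds a character vector and a label
nested <- list(
  list(c("a1", "a2", "a3", "a4"), "x"),
  list(c("b1", "b2", "b3", "b4"), "y")
)

# Element 1 of each item, then element 4 within it
fourth <- map_chr(nested, c(1, 4))
fourth   # "a4" "b4"
```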
purrr::map() with a few examples from real data
For this tutorial, we will be using an open dataset that contains N = 2495 individuals’ responses to a conspiracist ideation measure called the Generic Conspiracist Beliefs Scale (GCBS; Brotherton et al., 2013), a personality measure called the Ten Item Personality Inventory (TIPI; Gosling, Rentfrow, & Swann, 2003), and various demographic and validity check items. For more information about the data, see our second post.
Let’s take a look at the variables in the dataset.
str(conspiracy)
'data.frame': 2495 obs. of 72 variables:
$ Q1 : int 5 5 2 5 5 1 4 5 1 1 ...
$ Q2 : int 5 5 4 4 4 1 3 4 1 2 ...
$ Q3 : int 3 5 1 1 1 1 3 3 1 1 ...
$ Q4 : int 5 5 2 2 4 1 3 3 1 1 ...
$ Q5 : int 5 5 2 4 4 1 4 4 1 1 ...
$ Q6 : int 5 3 2 5 5 1 3 5 1 5 ...
$ Q7 : int 5 5 4 4 4 1 3 5 1 1 ...
$ Q8 : int 3 5 2 1 3 1 4 5 1 1 ...
$ Q9 : int 4 1 2 4 1 1 2 5 1 1 ...
$ Q10 : int 5 4 4 5 5 1 3 5 1 4 ...
$ Q11 : int 5 4 2 5 5 1 3 5 1 1 ...
$ Q12 : int 5 5 4 5 5 1 2 5 1 1 ...
$ Q13 : int 3 4 0 1 3 1 2 3 1 1 ...
$ Q14 : int 5 4 2 4 5 1 3 4 1 1 ...
$ Q15 : int 5 5 4 5 5 1 4 5 1 5 ...
$ E1 : int 7070 4086 27535 4561 8841 15267 7249 8024 4654 23787 ...
$ E2 : int 7469 13107 7814 5589 7575 7112 4651 7343 6076 12375 ...
$ E3 : int 7383 2807 7762 3506 3832 4798 5496 6808 3032 2006 ...
$ E4 : int 6540 5030 10290 3784 7775 5214 3936 6794 3984 3650 ...
$ E5 : int 9098 7405 8558 5093 4160 3683 7831 8743 4328 3188 ...
$ E6 : int 4998 7864 10538 3555 5216 4130 6816 6196 4070 48851 ...
$ E7 : int 6971 16234 4740 3158 7559 4487 6167 7762 4012 9013 ...
$ E8 : int 4713 2603 4162 1887 5792 2376 2032 4797 2430 2128 ...
$ E9 : int 6032 14174 6492 7678 10296 3273 4000 8015 4191 2898 ...
$ E10 : int 5878 9423 11512 2304 5455 5501 3583 5764 8444 10420 ...
$ E11 : int 4031 11683 6874 3604 3864 3790 4481 5717 4224 5820 ...
$ E12 : int 4386 12718 11440 2724 11799 7777 5071 5352 4404 2049 ...
$ E13 : int 9077 4816 0 2689 7872 4553 2368 6387 1065 9901 ...
$ E14 : int 5113 6806 11418 2657 10543 5944 4408 9671 5533 3838 ...
$ E15 : int 4204 4823 9872 3824 4224 4028 6103 5622 4964 7208 ...
$ introelapse : int 11 6 7 5 4 35 12 27 2 26 ...
$ testelapse : int 95 125 141 58 105 87 75 104 67 148 ...
$ surveyelapse: int 142 144 90 135 210 154 67 186 121 118 ...
$ TIPI1 : int 5 6 6 6 1 4 2 4 4 1 ...
$ TIPI2 : int 3 7 6 7 3 2 5 5 5 6 ...
$ TIPI3 : int 6 6 6 7 7 6 4 6 6 3 ...
$ TIPI4 : int 2 7 1 5 2 2 2 2 2 1 ...
$ TIPI5 : int 6 6 7 7 6 6 5 7 4 5 ...
$ TIPI6 : int 6 3 5 6 4 5 6 4 5 7 ...
$ TIPI7 : int 7 7 6 5 5 6 2 5 6 6 ...
$ TIPI8 : int 2 5 5 1 5 3 3 5 2 5 ...
$ TIPI9 : int 7 1 7 5 5 6 5 3 7 7 ...
$ TIPI10 : int 1 1 7 1 3 2 5 1 2 4 ...
$ VCL1 : int 1 1 1 1 1 1 1 1 1 1 ...
$ VCL2 : int 1 1 1 1 1 1 1 1 1 1 ...
$ VCL3 : int 1 0 1 1 0 1 0 0 0 1 ...
$ VCL4 : int 1 1 1 1 1 1 1 1 1 1 ...
$ VCL5 : int 1 1 1 1 1 1 1 1 1 1 ...
$ VCL6 : int 0 0 1 0 0 0 0 0 0 0 ...
$ VCL7 : int 0 0 1 0 0 0 0 0 0 0 ...
$ VCL8 : int 0 0 1 1 0 0 0 0 0 0 ...
$ VCL9 : int 0 0 0 0 0 1 0 0 0 0 ...
$ VCL10 : int 1 1 1 1 1 1 1 1 1 1 ...
$ VCL11 : int 1 1 1 1 0 1 0 0 0 0 ...
$ VCL12 : int 0 0 0 0 0 1 0 0 0 0 ...
$ VCL13 : int 1 0 1 1 1 1 1 0 1 1 ...
$ VCL14 : int 1 1 1 1 1 1 1 1 1 1 ...
$ VCL15 : int 1 1 1 0 1 1 1 1 1 1 ...
$ VCL16 : int 1 1 1 1 1 1 1 1 1 1 ...
$ education : int 3 1 4 3 2 3 2 2 1 3 ...
$ urban : int 0 2 2 1 2 1 2 1 3 3 ...
$ gender : int 1 2 2 1 1 1 1 1 1 1 ...
$ engnat : int 2 1 2 1 1 1 1 1 1 2 ...
$ age : int 28 14 26 25 37 34 17 23 17 28 ...
$ hand : int 1 1 1 1 1 1 1 1 1 1 ...
$ religion : int 2 1 1 12 2 7 1 2 4 2 ...
$ orientation : int 1 2 1 1 2 1 1 1 2 1 ...
$ race : int 5 4 4 4 4 4 4 4 4 4 ...
$ voted : int 2 2 1 1 2 1 2 2 2 1 ...
$ married : int 1 1 1 1 2 2 1 1 1 2 ...
$ familysize : int 1 1 2 3 2 2 2 3 2 3 ...
$ major : chr "ACTING" "" "philosophy" "history" ...
tidyr::nest() along with purrr::map()
By looking at the dataset, you can tell it is pretty large. What if you are interested in how responses differ by gender? You probably want to split the data. Instead of using split() from base R, you can get a similar result with tidyr::nest(), which stores each group's data in a list column.
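For comparison, here is a minimal sketch of both approaches side by side. It uses the built-in mtcars data grouped by cyl, not the conspiracy dataset, so that it runs on its own.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Base R: a named list of data frames, one per group
split_list <- split(mtcars, mtcars$cyl)

# tidyr: a tibble with one row per group and the group's data in a list column
nested <- mtcars %>%
  group_by(cyl) %>%
  nest()

# Either way, each group's data can be passed to a map() variant
map_int(split_list, nrow)
map_int(nested$data, nrow)
```

The nested version keeps the grouping variable next to its data in a single tibble, which makes it easy to add further columns with mutate().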
by_gender <- conspiracy %>%
group_by(gender) %>%
nest()
by_gender
# A tibble: 4 x 2
# Groups: gender [4]
gender data
<int> <list>
1 1 <tibble [1,222 x 71]>
2 2 <tibble [1,137 x 71]>
3 3 <tibble [130 x 71]>
4 0 <tibble [6 x 71]>
From the output, we can see that the new dataset by_gender contains a list column, data, that holds one tibble per gender category. We can work with this list column a little bit more.
Let’s say we want to know how many observations fall under each gender category.
by_gender <- conspiracy %>%
group_by(gender) %>%
nest() %>%
mutate(n = map(data, nrow))
by_gender
# A tibble: 4 x 3
# Groups: gender [4]
gender data n
<int> <list> <list>
1 1 <tibble [1,222 x 71]> <int [1]>
2 2 <tibble [1,137 x 71]> <int [1]>
3 3 <tibble [130 x 71]> <int [1]>
4 0 <tibble [6 x 71]> <int [1]>
As you can see, map() returned the number of observations as a list column, which is not convenient to read for such simple information. In cases like this, the variants of map() come in handy.
purrr::map_dbl()
To simplify the previous output, we want a vector of doubles (i.e., numeric) rather than a list. This is a good time to use purrr::map_dbl().
by_gender <- conspiracy %>%
group_by(gender) %>%
nest() %>%
mutate(n = map_dbl(data, nrow))
by_gender
# A tibble: 4 x 3
# Groups: gender [4]
gender data n
<int> <list> <dbl>
1 1 <tibble [1,222 x 71]> 1222
2 2 <tibble [1,137 x 71]> 1137
3 3 <tibble [130 x 71]> 130
4 0 <tibble [6 x 71]> 6
Isn't that much easier to read?
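Since nrow() returns an integer, map_int() would work here as well. The sketch below, on the built-in mtcars data rather than the conspiracy dataset, shows that the two variants differ only in the type of vector they return.

```r
library(purrr)

# Hypothetical grouping: mtcars split by number of cylinders
groups <- split(mtcars, mtcars$cyl)

n_dbl <- map_dbl(groups, nrow)  # integer results are converted to double
n_int <- map_int(groups, nrow)  # results stay integer

typeof(n_dbl)  # "double"
typeof(n_int)  # "integer"
```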
purrr::map()!
Let’s say we are interested in the relationship between education (education) and participants’ self-ratings on the “open to new experiences, complex” item (TIPI5). Is level of education a good predictor of openness to new experiences within each gender group?
by_gender <- by_gender %>%
mutate(edu_m = map(data, ~lm(TIPI5 ~ education, data = .x)))
by_gender
# A tibble: 4 x 4
# Groups: gender [4]
gender data n edu_m
<int> <list> <dbl> <list>
1 1 <tibble [1,222 x 71]> 1222 <lm>
2 2 <tibble [1,137 x 71]> 1137 <lm>
3 3 <tibble [130 x 71]> 130 <lm>
4 0 <tibble [6 x 71]> 6 <lm>
What about religion (religion) predicting openness to new experiences (TIPI5) in different gender groups?
by_gender <- by_gender %>%
mutate(religion_m = map(data, ~lm(TIPI5 ~ religion, data = .x)))
by_gender
# A tibble: 4 x 5
# Groups: gender [4]
gender data n edu_m religion_m
<int> <list> <dbl> <list> <list>
1 1 <tibble [1,222 x 71]> 1222 <lm> <lm>
2 2 <tibble [1,137 x 71]> 1137 <lm> <lm>
3 3 <tibble [130 x 71]> 130 <lm> <lm>
4 0 <tibble [6 x 71]> 6 <lm> <lm>
From the outputs above, you can see that the results from the two linear models are stored in two separate list columns, with one model per gender, because we used mutate() after nest(). But how do we know which variable does a better job of predicting openness to new experiences?
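One simple way to get a first impression is to pull a fit statistic out of each model with map_dbl(). The sketch below is an illustration on the built-in mtcars data, not on the conspiracy dataset, and the choice of R-squared and the column names wt_r2 and hp_r2 are assumptions made for the example.

```r
library(dplyr)
library(tidyr)
library(purrr)

fits <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    # Two competing single-predictor models per group
    wt_m = map(data, ~ lm(mpg ~ wt, data = .x)),
    hp_m = map(data, ~ lm(mpg ~ hp, data = .x)),
    # Extract R-squared from each fitted model as a plain double
    wt_r2 = map_dbl(wt_m, ~ summary(.x)$r.squared),
    hp_r2 = map_dbl(hp_m, ~ summary(.x)$r.squared)
  )

fits %>% select(cyl, wt_r2, hp_r2)
```

Because the extracted values are ordinary numeric columns, they can be read directly in the printed tibble.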
purrr::map2()
As mentioned previously, map() transforms its input by applying a function to each element of a list or atomic vector. However, {purrr} also provides a function to iterate over two vectors in parallel, purrr::map2(), which takes the following form:
map2(vector1, vector2, a_function/formula/vector, arguments)
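As a minimal sketch of this form, the example below pairs up the elements of two short made-up vectors. In a formula, .x refers to the element from the first vector and .y to the element from the second.

```r
library(purrr)

# Two hypothetical vectors of equal length
x <- c(1, 2, 3)
y <- c(10, 20, 30)

sums     <- map2_dbl(x, y, ~ .x + .y)              # formula form
products <- map2_dbl(x, y, function(a, b) a * b)   # anonymous function form

sums      # 11 22 33
products  # 10 40 90
```

Like map(), map2() has typed variants such as map2_dbl(), and plain map2() returns a list.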
Using purrr::map2(), we can compare the models we created previously! Just as you would use stats::anova() to compare two models, we can use stats::anova() within purrr::map2() to compare two list columns of models.
mods <- by_gender %>%
mutate(
edu_m = map(data, ~lm(TIPI5 ~ education, data = .x)),
religion_m = map(data, ~lm(TIPI5 ~ religion, data = .x))
) %>%
mutate(comp = map2(edu_m, religion_m, anova))
mods
# A tibble: 4 x 6
# Groups: gender [4]
gender data n edu_m religion_m comp
<int> <list> <dbl> <list> <list> <list>
1 1 <tibble [1,222 x 71]> 1222 <lm> <lm> <anova [2 x 6]>
2 2 <tibble [1,137 x 71]> 1137 <lm> <lm> <anova [2 x 6]>
3 3 <tibble [130 x 71]> 130 <lm> <lm> <anova [2 x 6]>
4 0 <tibble [6 x 71]> 6 <lm> <lm> <anova [2 x 6]>
Now we have a model comparison for each gender group in a list column! From there, you can extract any information you want.
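As one example of such an extraction, the sketch below pulls the p-value of the F test out of each comparison with map_dbl(). It is an illustration under assumptions: it uses the built-in mtcars data rather than the conspiracy dataset, and it compares nested models (the second adds a predictor to the first), for which the F test in the anova table is defined. The column names small_m, big_m and p_value are made up for the example.

```r
library(dplyr)
library(tidyr)
library(purrr)

comps <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    small_m = map(data, ~ lm(mpg ~ wt, data = .x)),
    big_m   = map(data, ~ lm(mpg ~ wt + hp, data = .x)),
    # One anova table per group, comparing the two models
    comp    = map2(small_m, big_m, anova),
    # The p-value sits in the second row of the "Pr(>F)" column
    p_value = map_dbl(comp, ~ .x[["Pr(>F)"]][2])
  )

comps %>% select(cyl, p_value)
```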
This concludes our tutorial on the purrr::map() family. We hope you had fun with functionals!