Are {tidyverse} function names getting longer?

Are we approaching sentence-length tidyverse functions? quarto::read_blog_carefully_with_eyes() to find out.

tidyverse
plotly
Author

Jack Davison

Published

March 4, 2023


Note

None of this is meant as an insult to the tidyverse developers, of course, who have done excellent work making R one of the top languages for data analysis. This is all in good fun!

Scrolling through #rstats Twitter recently I noticed a lot of conversation about the recent tidyr deprecations being overly wordy. After all, separate() being superseded by separate_wider_position() does feel almost comically long!

This made me think, however - which tidyverse packages have the wordiest functions? And are functions always shorter than the functions that supersede them? Let’s use R to find out.

Data

First, we need to load the whole tidyverse. I’m not talking about the “core” tidyverse - we need the whole thing!

(pkgs <- tidyverse::tidyverse_packages(include_self = FALSE))
 [1] "broom"         "conflicted"    "cli"           "dbplyr"       
 [5] "dplyr"         "dtplyr"        "forcats"       "ggplot2"      
 [9] "googledrive"   "googlesheets4" "haven"         "hms"          
[13] "httr"          "jsonlite"      "lubridate"     "magrittr"     
[17] "modelr"        "pillar"        "purrr"         "ragg"         
[21] "readr"         "readxl"        "reprex"        "rlang"        
[25] "rstudioapi"    "rvest"         "stringr"       "tibble"       
[29] "tidyr"         "xml2"         

Let’s use purrr to rapidly load them all. We’ll also need plotly and ggiraph for data visualisation.

purrr::walk(pkgs, ~ library(.x, character.only = TRUE))

library(ggiraph)
library(plotly)

We can now easily extract a list of functions in a package using the ls() function.

ls("package:broom")
[1] "augment"         "augment_columns" "bootstrap"       "confint_tidy"   
[5] "finish_glance"   "fix_data_frame"  "glance"          "tidy"           
[9] "tidy_irlba"     

The lifecycle package can even tell us which functions are superseded or not for many tidyverse packages.

lifecycle::pkg_lifecycle_statuses(package = "ggplot2")
    package                 fun  lifecycle
4   ggplot2                aes_ deprecated
5   ggplot2          aes_string deprecated
6   ggplot2               aes_q deprecated
8   ggplot2            aes_auto deprecated
18  ggplot2 annotation_logticks superseded
37  ggplot2          coord_flip superseded
38  ggplot2           coord_map superseded
39  ggplot2      coord_quickmap superseded
171 ggplot2              gg_dep deprecated

This process can then be tidied by writing a function and once again using purrr.

list_functions_tbl <- function(pkg){
  life <- lifecycle::pkg_lifecycle_statuses(package = pkg) %>%
    select(-package)
  ls(str_glue("package:{pkg}")) %>%
    tibble() %>%
    set_names("fun") %>%
    mutate(pkg = {{pkg}},
           fun.len = str_length(fun)) %>%
    left_join(life, by = join_by(fun))
}

functions <-
  purrr::map(pkgs, list_functions_tbl) %>%
  list_rbind() %>%
  relocate(lifecycle, .after = pkg)

slice_sample(functions, n = 10)
# A tibble: 10 × 4
   fun            pkg       lifecycle   fun.len
   <chr>          <chr>     <chr>         <int>
 1 data_frame     dplyr     <NA>             10
 2 as.tbl         dplyr     deprecated        6
 3 nanoseconds    lubridate <NA>             11
 4 is_cpl_na      rlang     questioning       9
 5 cur_data       dplyr     deprecated        8
 6 GuideAxisTheta ggplot2   <NA>             14
 7 do             dplyr     superseded        2
 8 env_browse     rlang     <NA>             10
 9 flatten_int    purrr     superseded       11
10 enquo          rlang     <NA>              5

Here we have the function lengths, but it may also be interesting to see the number of constituent words that make up each function. For example, separate() is one word, whereas separate_wider_position() is three. We can achieve this using tidytext.

counts <-
  functions %>%
  mutate(fun2 = snakecase::to_snake_case(fun) %>% str_replace_all("_", " ")) %>%
  tidytext::unnest_tokens(word, fun2, drop = FALSE) %>%
  count(pkg, fun, name = "fun.words")

functions <-
  left_join(functions, counts, by = join_by(pkg, fun))

The last thing we’ll do is mark the “core” tidyverse so we can easily distinguish a key package like dplyr from a more niche one like xml2. Note that I’m including lubridate in the “core” tidyverse as it is due to be added in an upcoming update.

functions <-
  functions %>%
  mutate(pkg.core = pkg %in% c("ggplot2", "dplyr", "tidyr", "readr", "purrr", "tibble", "stringr", "forcats", "lubridate"),
         .after = pkg)

Function Lengths

Now we have our data in order, lets investigate the distributions of some of these data. First, let’s consider the function character length distributions of the different packages in Figure 1.

functions %>%
  mutate(pkg = fct_reorder(pkg, fun.len, median),
         color = if_else(pkg.core, "Core", "Non-Core")) %>%
  plot_ly(x = ~pkg, y = ~fun.len, colors = "Dark2") %>%
  add_boxplot(color = ~color) %>%
  layout(yaxis = list(title = "Function Length"),
         xaxis = list(title = "Package"))
Figure 1: The distributions of different function name lengths in tidyverse packages. Hover over each boxplot to see the summary statistics they represent.

The package with the longest median function length is rstudioapi with a median value of 14 characters. The shortest is crayon with a median of 7. Both of these packages are not “core” and, furthermore, they’re not particularly common to call directly! Looking at just the “core” tidyverse, ggplot2 has the highest median function length and lubridate has the lowest. Not particularly surprising when we compare mammoths like scale_linewidth_continuous() to small fries like ymd()!

Table 1 shows the longest and shortest 10 functions in the tidyverse. The longest functions are from googlesheets4, although these are exported internal vctrs methods. The longest named function that anyone is likely to call commonly is the aforementioned scale_linewidth_continuous() at a massive 26 characters long. On the other end, the shortest function in the tidyverse is dplyr::n() at only one character.

bind_rows(
  "longest" = slice_max(functions, n = 10, order_by = fun.len),
  "shortest" = slice_min(functions, n = 10, order_by = fun.len),
  .id = "type"
) %>%
  knitr::kable()
Table 1: The longest and shortest function names in the tidyverse by characters.
type fun pkg pkg.core lifecycle fun.len fun.words
longest ffi_standalone_check_number_1.0.7 rlang FALSE NA 33 7
longest vec_ptype2.googlesheets4_formula googlesheets4 FALSE NA 32 6
longest db_supports_table_alias_with_as dbplyr FALSE NA 31 6
longest vec_cast.googlesheets4_formula googlesheets4 FALSE NA 30 5
longest cli_progress_builtin_handlers cli FALSE NA 29 4
longest getRStudioPackageDependencies rstudioapi FALSE NA 29 5
longest registerCommandStreamCallback rstudioapi FALSE NA 29 4
longest ffi_standalone_is_bool_1.0.7 rlang FALSE NA 28 7
longest launcherPlacementConstraint rstudioapi FALSE NA 27 3
longest ansi_has_hyperlink_support cli FALSE NA 26 4
longest scale_linewidth_continuous ggplot2 TRUE NA 26 3
shortest n dplyr TRUE NA 1 1
shortest no cli FALSE NA 2 1
shortest do dplyr TRUE superseded 2 1
shortest id dplyr TRUE defunct 2 1
shortest am lubridate TRUE NA 2 1
shortest hm lubridate TRUE NA 2 1
shortest ms lubridate TRUE NA 2 1
shortest my lubridate TRUE NA 2 1
shortest pm lubridate TRUE NA 2 1
shortest tz lubridate TRUE NA 2 1
shortest ym lubridate TRUE NA 2 1
shortest yq lubridate TRUE NA 2 1
shortest or magrittr FALSE NA 2 1
shortest !! rlang FALSE NA 2 NA
shortest := rlang FALSE NA 2 NA
shortest ll rlang FALSE NA 2 1
shortest UQ rlang FALSE deprecated 2 1

Looking at the distribution of the number of words in Figure 2, we can see that most tidyverse functions are made up of 2 words (e.g., separate_wider()). The wordiest two functions, however, have a massive 5 words. These are the relatively new fct_na_level_to_value() and fct_na_value_to_level() from forcats, both of which superseded fct_explicit_na() which itself was three words long.

The least wordy tidyverse functions actually have no words at all, those being: %>%, %+%, %–%, %!>%, %$%, %<>%, %@%, %||%, !!, !!!, %@%<-, %|%, %<~%, :=. As you can tell, they’re mostly magrittr pipes and rlang syntax stuff.

functions %>%
  count(fun.words, pkg.core) %>%
  mutate(color = if_else(pkg.core, "Core", "Non-Core"),
         fun.words = if_else(is.na(fun.words), 0, fun.words)) %>%
  plot_ly(x = ~factor(fun.words), y = ~n, colors = "Dark2") %>%
  add_bars(color = ~color) %>%
  layout(xaxis = list(title = "Number of Words"))
Figure 2: Counts of the number of words in tidyverse functions. A function with 0 words has no English characters (e.g., %>%).

Superseded Functions

All of that was interesting, but lets examine the original point of this post - are superseded functions shorter than new ones?

First, lets make sure that we’re only looking at packages that are actually classified using lifecycle. Of the remaining packages, all of the NA values represent stable functions.

functions2 <-
  filter(functions, any(!is.na(lifecycle)), .by = pkg) %>%
  mutate(lifecycle = replace_na(lifecycle, "stable"))

We’ll visuaise all of the lifecycle stages available, although the three key ones we should be looking at are:

  • Stable - functions that are in current use,

  • Superseded - functions that have newer versions that are now recommended (e.g., separate()), and

  • Experimental - the newest functions that are in development (e.g., separate_wider_delim()).

Figure 3 shows the distributions of different function lengths, similar to Figure 1 but with the function lifecycle instead of its package. Deprecated and superseded functions have similar mean lengths of around about 9.5, compared to stable functions’ 11.2 and experimental functions’ 14.2! This does imply that new functions are indeed getting longer - at least on average!

functions2 %>%
  mutate(lifecycle = fct_reorder(lifecycle, fun.len, mean)) %>%
  mutate(fun.avglen = mean(fun.len, na.rm = TRUE), .by = lifecycle) %>%
  plot_ly(x = ~ lifecycle, y = ~ fun.len) %>%
  add_boxplot(name = "Boxplot", color = I("grey75")) %>%
  add_lines(
    y = ~ round(fun.avglen, 2),
    legendgroup = "marker",
    showlegend = FALSE,
    color = I("red")
  ) %>%
  add_markers(
    y = ~ round(fun.avglen, 2),
    name = "Mean Marker",
    legendgroup = "marker",
    color = I("red")
  ) %>%
  layout(yaxis = list(title = "Function Length"))
Figure 3: The distributions of function lengths at different lifecycle stages.

A similar observation is seen in Figure 4, which shows that a whopping 52% of experimental tidyverse experimental functions are three words long, plus an extra 12% at four words. Conversely, only 10% of superseded functions are three words long and none are four words.

functions2 %>%
  count(lifecycle, fun.words) %>%
  mutate(fun.words = replace_na(fun.words, 0),
         lifecycle = fct_reorder(lifecycle, fun.words, median)) %>%
  mutate(n = n / sum(n), .by = lifecycle) %>%
  plot_ly(x = ~ lifecycle, y = ~ n) %>%
  add_bars(color = ~ factor(fun.words)) %>%
  layout(barmode = "stack",
         yaxis = list(tickformat = '.0%'))
Figure 4: The percentage of tidyverse functions at different lifecycle stages and numbers of words.

Now to make this truly robust we should be comparing individual superseded functions with the functions that replaced them. To do this, I wrote out all of the superseded functions to a CSV and searched through the documentation to determine their replacements. A few notes before we continue:

  • Some superseded functions simply weren’t replaced with anything in particular. For example, dplyr::with_groups() was replaced by the by/.by argument in functions like summarise() so won’t be analysed here. Many dplyr functions were effectively replaced by across() so are also ignored.

  • A lot of readr functions were moved to meltr so aren’t considered here.

  • The purrr::map_dfr() family was replaced by a combination of map() and list_rbind(). It wouldn’t be fair to compare these because of course two functions will be longer than one!

  • tidyr possesses unnest_legacy() and nest_legacy() which are listed as superseded but, in reality, they were never labelled “legacy” originally, so these are discounted.

  • Some functions were also replaced by multiple functions. For example, do() is replaced by either reframe(), nest_by() or pick(). In that case, all three are used.

supers <- 
  read_csv(here::here("posts/2023-03-04-tidyverseLength/superseded.csv")) %>%
  left_join(select(functions, -pkg.core, -lifecycle)) %>%
  left_join(
    select(functions, -pkg.core, -lifecycle),
    by = join_by(oldfun == fun, pkg),
    suffix = c("_new", "_old")
  ) %>%
  drop_na(fun) %>%
  filter(fun != "across",
         !str_detect(oldfun, "legacy"))

slice_sample(supers, n = 10)
# A tibble: 10 × 7
   oldfun        pkg   fun   fun.len_new fun.words_new fun.len_old fun.words_old
   <chr>         <chr> <chr>       <int>         <int>       <int>         <int>
 1 separate_rows tidyr sepa…          21             3          13             2
 2 do            dplyr pick            4             1           2             1
 3 simplify      purrr list…          13             2           8             1
 4 top_frac      dplyr slic…           9             2           8             2
 5 as_vector     purrr list…          13             2           9             2
 6 flatten       purrr list…          12             2           7             1
 7 coord_map     ggpl… coor…           8             2           9             2
 8 gather        tidyr pivo…          12             2           6             1
 9 spread        tidyr pivo…          11             2           6             1
10 sample_n      dplyr slic…          12             2           8             2

Now, finally, lets compare the old and new! Figure 5 shows the function lengths of specific superseded functions and the functions they were replaced with. Most functions marked superseded that were directly replaced by one or more functions are in the dplyr, purrr and tidyr packages, and many of them do indeed get longer! The function that increased in length the most was separate(), which became the massive separate_wider_position(), gaining 15 additional characters. The function that decreased in length the most was coord_quickmap(), which was replaced by the petite coord_sf(), losing 6 characters. In this dataset, the median change between the old and new functions was an increase in 3 characters.

plt <- 
  supers %>%
  mutate(
    id = row_number(),
    diff = fun.len_new - fun.len_old,
    diff = if_else(diff > 0, paste("+", diff, sep = ""), paste(diff)),
    tooltip = str_glue("`{oldfun}()` to `{fun}()` ({diff})"),
    color = if_else(fun.len_new > fun.len_old, "Increase", "Decrease")
  ) %>%
  pivot_longer(c(fun.len_new, fun.len_old)) %>%
  mutate(
    name = case_match(name,
                      "fun.len_new" ~ "New",
                      "fun.len_old" ~ "Old"),
    name = factor(name, c("Old", "New"))
  ) %>%
  ggplot(aes(x = name, y = value, color = color)) +
  geom_line_interactive(aes(
    group = id,
    tooltip = tooltip,
    data_id = tooltip
  ),
  hover_nearest = TRUE) +
  geom_point_interactive(
    aes(data_id = tooltip),
    shape = 21,
    fill = "white",
    stroke = 1
  ) +
  coord_cartesian(clip = "off") +
  scale_x_discrete(expand = expansion()) +
  expand_limits(y = 0) +
  theme_minimal() +
  theme(
    panel.spacing = unit(.7, "cm"),
    legend.position = "top",
    aspect.ratio = 1
  ) +
  facet_wrap(vars(pkg), scales = "free_x") +
  labs(x = NULL, y = "Function Length", color = NULL)

girafe(ggobj = plt,
       options = list(
         opts_hover(""),
         opts_hover_inv("opacity:0.25")
       ))
Figure 5: The change in the length of superseded function names. Hover over the lines to identify the specific functions.

Conclusion

This blog post was on a bit of a silly topic, but I’m hopeful that it was a useful demonstration of various tidyverse functions and the plotly package for interactive data visualisation. The key conclusion that I’m taking is that, while there is a lot of variation, superseded functions do tend to be shorter than the functions that replace them! Yesterday we had separate(), today we have separate_wider_position(), and tomorrow separate_wider_integer_position_ignore_toofew_removecols() may be on the cards!