Are we approaching sentence-length tidyverse functions? quarto::read_blog_carefully_with_eyes() to find out.
tidyverse
plotly
Author
Jack Davison
Published
March 4, 2023
Note
None of this is meant as an insult to the tidyverse developers, of course, who have done excellent work making R one of the top languages for data analysis. This is all in good fun!
Scrolling through #rstats Twitter recently I noticed a lot of conversation about the recent tidyr deprecations being overly wordy. After all, separate() being superseded by separate_wider_position() does feel almost comically long!
This made me think, however - which tidyverse packages have the wordiest functions? And are superseded functions always shorter than the functions that replace them? Let’s use R to find out.
Data
First, we need to load the whole tidyverse. I’m not talking about the “core” tidyverse - we need the whole thing!
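The code that builds the `functions` table is not reproduced here. As a rough sketch of the idea, the following uses base R only, with three packages that ship with every R installation (`stats`, `utils` and `tools`) as stand-ins for the tidyverse package list. The helper name `list_exports()` and the object name `functions_sketch` are hypothetical; the column names `pkg`, `fun` and `fun.len` are chosen to match those used later in the post.

```r
# Hypothetical helper: list the exported objects of one package
# along with the character length of each name.
# Note: exports can include non-function objects as well as functions.
list_exports <- function(pkg) {
  exports <- getNamespaceExports(pkg)
  data.frame(
    pkg = pkg,
    fun = exports,
    fun.len = nchar(exports),
    stringsAsFactors = FALSE
  )
}

# Stand-in package list (assumption): swap in the tidyverse packages here.
pkgs <- c("stats", "utils", "tools")

# One row per exported name, stacked across packages.
functions_sketch <- do.call(rbind, lapply(pkgs, list_exports))

# Peek at the longest names.
head(functions_sketch[order(-functions_sketch$fun.len), ])
```

The same shape of table (one row per package and function, plus a length column) is what the rest of the analysis relies on.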
Here we have the function lengths, but it may also be interesting to see the number of constituent words that make up each function. For example, separate() is one word, whereas separate_wider_position() is three. We can achieve this using tidytext.
counts <- functions %>%
  mutate(fun2 = snakecase::to_snake_case(fun) %>% str_replace_all("_", " ")) %>%
  tidytext::unnest_tokens(word, fun2, drop = FALSE) %>%
  count(pkg, fun, name = "fun.words")

functions <- left_join(functions, counts, by = join_by(pkg, fun))
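To make the word-counting rule concrete, here is a minimal base R sketch that does not use snakecase or tidytext. The helper `count_words()` is hypothetical and only approximates the pipeline above: it splits camelCase at lower-to-upper boundaries, treats underscores and dots as separators, and counts the remaining pieces. Its results may differ from the real pipeline on edge cases such as operators or runs of capital letters.

```r
# Hypothetical helper: approximate the number of words in a function name.
count_words <- function(fun) {
  # Insert a space at lower-to-upper case boundaries (camelCase).
  spaced <- gsub("([a-z0-9])([A-Z])", "\\1 \\2", fun)
  # Treat underscores and dots as word separators.
  spaced <- gsub("[_.]+", " ", spaced)
  # Split on whitespace and count the non-empty pieces.
  pieces <- strsplit(trimws(spaced), "\\s+")
  vapply(pieces, function(p) sum(nzchar(p)), integer(1))
}

count_words(c("separate", "separate_wider_position"))
```

For the two names from the example above, this gives one and three words respectively.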
The last thing we’ll do is mark the “core” tidyverse so we can easily distinguish a key package like dplyr from a more niche one like xml2. Note that I’m including lubridate in the “core” tidyverse as it is due to be added in an upcoming update.
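The code for this step is not shown. As a minimal sketch, assuming a data frame with a `pkg` column, the flag could be built with `%in%`. The vector name `core_pkgs` and the toy data frame are illustrative only.

```r
# Core tidyverse packages, plus lubridate as described above.
core_pkgs <- c(
  "ggplot2", "dplyr", "tidyr", "readr", "purrr",
  "tibble", "stringr", "forcats", "lubridate"
)

# Toy stand-in for the real `functions` table (assumption).
toy <- data.frame(
  pkg = c("dplyr", "xml2", "lubridate"),
  fun = c("n", "read_xml", "ymd")
)

# TRUE for packages in the core list, FALSE otherwise.
toy$pkg.core <- toy$pkg %in% core_pkgs
toy
```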
Now that we have our data in order, let’s investigate some of its distributions. First, let’s consider the distribution of function name lengths (in characters) for each package in Figure 1.
functions %>%
  mutate(
    pkg = fct_reorder(pkg, fun.len, median),
    color = if_else(pkg.core, "Core", "Non-Core")
  ) %>%
  plot_ly(x = ~pkg, y = ~fun.len, colors = "Dark2") %>%
  add_boxplot(color = ~color) %>%
  layout(
    yaxis = list(title = "Function Length"),
    xaxis = list(title = "Package")
  )
The package with the longest median function length is rstudioapi with a median value of 14 characters. The shortest is crayon with a median of 7. Neither of these packages is “core” and, furthermore, neither is particularly common to call directly! Looking at just the “core” tidyverse, ggplot2 has the highest median function length and lubridate has the lowest. Not particularly surprising when we compare mammoths like scale_linewidth_continuous() to small fries like ymd()!
Table 1 shows the 10 longest and 10 shortest function names in the tidyverse (with a few extra rows where lengths are tied). The longest names come from rlang and googlesheets4, although these look like exported internals rather than functions you would call yourself (the googlesheets4 entries are vctrs methods). The longest name that anyone is likely to call regularly is the aforementioned scale_linewidth_continuous() at a massive 26 characters. At the other end, the shortest function in the tidyverse is dplyr::n() at only one character.
bind_rows(
  "longest" = slice_max(functions, n = 10, order_by = fun.len),
  "shortest" = slice_min(functions, n = 10, order_by = fun.len),
  .id = "type"
) %>%
  knitr::kable()
Table 1: The longest and shortest function names in the tidyverse by characters.
| type | fun | pkg | pkg.core | lifecycle | fun.len | fun.words |
|---|---|---|---|---|---|---|
| longest | ffi_standalone_check_number_1.0.7 | rlang | FALSE | NA | 33 | 7 |
| longest | vec_ptype2.googlesheets4_formula | googlesheets4 | FALSE | NA | 32 | 6 |
| longest | db_supports_table_alias_with_as | dbplyr | FALSE | NA | 31 | 6 |
| longest | vec_cast.googlesheets4_formula | googlesheets4 | FALSE | NA | 30 | 5 |
| longest | cli_progress_builtin_handlers | cli | FALSE | NA | 29 | 4 |
| longest | getRStudioPackageDependencies | rstudioapi | FALSE | NA | 29 | 5 |
| longest | registerCommandStreamCallback | rstudioapi | FALSE | NA | 29 | 4 |
| longest | ffi_standalone_is_bool_1.0.7 | rlang | FALSE | NA | 28 | 7 |
| longest | launcherPlacementConstraint | rstudioapi | FALSE | NA | 27 | 3 |
| longest | ansi_has_hyperlink_support | cli | FALSE | NA | 26 | 4 |
| longest | scale_linewidth_continuous | ggplot2 | TRUE | NA | 26 | 3 |
| shortest | n | dplyr | TRUE | NA | 1 | 1 |
| shortest | no | cli | FALSE | NA | 2 | 1 |
| shortest | do | dplyr | TRUE | superseded | 2 | 1 |
| shortest | id | dplyr | TRUE | defunct | 2 | 1 |
| shortest | am | lubridate | TRUE | NA | 2 | 1 |
| shortest | hm | lubridate | TRUE | NA | 2 | 1 |
| shortest | ms | lubridate | TRUE | NA | 2 | 1 |
| shortest | my | lubridate | TRUE | NA | 2 | 1 |
| shortest | pm | lubridate | TRUE | NA | 2 | 1 |
| shortest | tz | lubridate | TRUE | NA | 2 | 1 |
| shortest | ym | lubridate | TRUE | NA | 2 | 1 |
| shortest | yq | lubridate | TRUE | NA | 2 | 1 |
| shortest | or | magrittr | FALSE | NA | 2 | 1 |
| shortest | !! | rlang | FALSE | NA | 2 | NA |
| shortest | := | rlang | FALSE | NA | 2 | NA |
| shortest | ll | rlang | FALSE | NA | 2 | 1 |
| shortest | UQ | rlang | FALSE | deprecated | 2 | 1 |
Looking at the distribution of the number of words in Figure 2, we can see that most tidyverse functions are made up of 2 words (e.g., pivot_longer()). The wordiest two functions, however, have a massive 5 words. These are the relatively new fct_na_level_to_value() and fct_na_value_to_level() from forcats, both of which superseded fct_explicit_na() which itself was three words long.
The least wordy tidyverse functions actually have no words at all, those being: %>%, %+%, %--%, %!>%, %$%, %<>%, %@%, %||%, !!, !!!, %@%<-, %|%, %<~%, :=. As you can tell, they’re mostly magrittr pipes and rlang syntax stuff.
functions %>%
  count(fun.words, pkg.core) %>%
  mutate(
    color = if_else(pkg.core, "Core", "Non-Core"),
    fun.words = if_else(is.na(fun.words), 0, fun.words)
  ) %>%
  plot_ly(x = ~factor(fun.words), y = ~n, colors = "Dark2") %>%
  add_bars(color = ~color) %>%
  layout(xaxis = list(title = "Number of Words"))
Superseded Functions
All of that was interesting, but let’s examine the original point of this post - are superseded functions shorter than new ones?
First, let’s make sure that we’re only looking at packages that are actually classified using lifecycle. Of the remaining packages, all of the NA values represent stable functions.
We’ll visualise all of the lifecycle stages available, although the three key ones we should be looking at are:
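The filtering code is not shown. One way it might look, as a base R sketch on a handful of example rows: keep only packages that have at least one function with a lifecycle badge, then relabel the remaining `NA` values as stable. The object name `functions2` matches the name used in later code, but the `toy` data frame and the rule for deciding that a package uses lifecycle are assumptions.

```r
# A few example rows standing in for the real table (assumption).
toy <- data.frame(
  pkg = c("tidyr", "tidyr", "dplyr", "crayon"),
  fun = c("separate", "pivot_longer", "do", "red"),
  lifecycle = c("superseded", NA, "superseded", NA)
)

# Assumption: a package "uses lifecycle" if any of its functions has a badge.
uses_lifecycle <- tapply(!is.na(toy$lifecycle), toy$pkg, any)
keep <- names(uses_lifecycle)[uses_lifecycle]

# Drop packages without badges, then treat unlabelled functions as stable.
functions2 <- toy[toy$pkg %in% keep, ]
functions2$lifecycle[is.na(functions2$lifecycle)] <- "stable"
functions2
```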
Stable - functions that are in current use,
Superseded - functions that have newer versions that are now recommended (e.g., separate()), and
Experimental - the newest functions that are in development (e.g., separate_wider_delim()).
Figure 3 shows the distributions of function lengths, similar to Figure 1 but grouped by lifecycle stage instead of package. Deprecated and superseded functions have similar mean lengths of around 9.5 characters, compared to 11.2 for stable functions and 14.2 for experimental ones! This does imply that new functions are indeed getting longer - at least on average!
functions2 %>%
  mutate(lifecycle = fct_reorder(lifecycle, fun.len, mean)) %>%
  mutate(fun.avglen = mean(fun.len, na.rm = TRUE), .by = lifecycle) %>%
  plot_ly(x = ~lifecycle, y = ~fun.len) %>%
  add_boxplot(name = "Boxplot", color = I("grey75")) %>%
  add_lines(
    y = ~round(fun.avglen, 2),
    legendgroup = "marker",
    showlegend = FALSE,
    color = I("red")
  ) %>%
  add_markers(
    y = ~round(fun.avglen, 2),
    name = "Mean Marker",
    legendgroup = "marker",
    color = I("red")
  ) %>%
  layout(yaxis = list(title = "Function Length"))
A similar observation is seen in Figure 4, which shows that a whopping 52% of experimental tidyverse functions are three words long, plus an extra 12% at four words. Conversely, only 10% of superseded functions are three words long and none are four words.
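The code behind Figure 4 is not shown. Percentages of this kind can be computed as row-wise proportions of a lifecycle-by-word-count table. The sketch below uses made-up rows purely to show the calculation; the numbers are not the real tidyverse figures, and `toy` and `shares` are hypothetical names.

```r
# Made-up rows for illustration only; not the real tidyverse data.
toy <- data.frame(
  lifecycle = c("experimental", "experimental", "experimental", "experimental",
                "superseded", "superseded"),
  fun.words = c(3, 3, 2, 4, 1, 2)
)

# Row-wise proportions: each lifecycle stage sums to 1.
shares <- prop.table(table(toy$lifecycle, toy$fun.words), margin = 1)

# Express as whole-number percentages.
round(100 * shares)
```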
Now to make this truly robust we should be comparing individual superseded functions with the functions that replaced them. To do this, I wrote out all of the superseded functions to a CSV and searched through the documentation to determine their replacements. A few notes before we continue:
Some superseded functions simply weren’t replaced with anything in particular. For example, dplyr::with_groups() was replaced by the by/.by argument in functions like summarise() so won’t be analysed here. Many dplyr functions were effectively replaced by across() so are also ignored.
A lot of readr functions were moved to meltr so aren’t considered here.
The purrr::map_dfr() family was replaced by a combination of map() and list_rbind(). It wouldn’t be fair to compare these because of course two functions will be longer than one!
tidyr possesses unnest_legacy() and nest_legacy() which are listed as superseded but, in reality, they were never labelled “legacy” originally, so these are discounted.
Some functions were also replaced by multiple functions. For example, do() is replaced by either reframe(), nest_by() or pick(). In that case, all three are used.
supers <- read_csv(here::here("posts/2023-03-04-tidyverseLength/superseded.csv")) %>%
  left_join(select(functions, -pkg.core, -lifecycle)) %>%
  left_join(
    select(functions, -pkg.core, -lifecycle),
    by = join_by(oldfun == fun, pkg),
    suffix = c("_new", "_old")
  ) %>%
  drop_na(fun) %>%
  filter(fun != "across", !str_detect(oldfun, "legacy"))

slice_sample(supers, n = 10)
Now, finally, let’s compare the old and new! Figure 5 shows the function lengths of specific superseded functions and the functions they were replaced with. Most functions marked superseded that were directly replaced by one or more functions are in the dplyr, purrr and tidyr packages, and many of them do indeed get longer! The function that increased in length the most was separate(), which became the massive separate_wider_position(), gaining 15 additional characters. The function that decreased in length the most was coord_quickmap(), which was replaced by the petite coord_sf(), losing 6 characters. In this dataset, the median change between the old and new functions was an increase of 3 characters.
plt <- supers %>%
  mutate(
    id = row_number(),
    diff = fun.len_new - fun.len_old,
    diff = if_else(diff > 0, paste("+", diff, sep = ""), paste(diff)),
    tooltip = str_glue("`{oldfun}()` to `{fun}()` ({diff})"),
    color = if_else(fun.len_new > fun.len_old, "Increase", "Decrease")
  ) %>%
  pivot_longer(c(fun.len_new, fun.len_old)) %>%
  mutate(
    name = case_match(
      name,
      "fun.len_new" ~ "New",
      "fun.len_old" ~ "Old"
    ),
    name = factor(name, c("Old", "New"))
  ) %>%
  ggplot(aes(x = name, y = value, color = color)) +
  geom_line_interactive(
    aes(group = id, tooltip = tooltip, data_id = tooltip),
    hover_nearest = TRUE
  ) +
  geom_point_interactive(
    aes(data_id = tooltip),
    shape = 21,
    fill = "white",
    stroke = 1
  ) +
  coord_cartesian(clip = "off") +
  scale_x_discrete(expand = expansion()) +
  expand_limits(y = 0) +
  theme_minimal() +
  theme(
    panel.spacing = unit(.7, "cm"),
    legend.position = "top",
    aspect.ratio = 1
  ) +
  facet_wrap(vars(pkg), scales = "free_x") +
  labs(x = NULL, y = "Function Length", color = NULL)

girafe(
  ggobj = plt,
  options = list(opts_hover(""), opts_hover_inv("opacity:0.25"))
)
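The two headline changes quoted above are easy to check by hand with `nchar()`. This is a small base R sketch, separate from the analysis pipeline; the vectors `old` and `new` are typed in manually rather than read from the CSV.

```r
# Function names without the trailing parentheses.
old <- c("separate", "coord_quickmap")
new <- c("separate_wider_position", "coord_sf")

# Positive diff means the replacement is longer than the original.
data.frame(old, new, diff = nchar(new) - nchar(old))
```

The `diff` column reproduces the +15 and -6 character changes mentioned for separate() and coord_quickmap().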
Conclusion
This blog post was on a bit of a silly topic, but I’m hopeful that it was a useful demonstration of various tidyverse functions and the plotly package for interactive data visualisation. The key conclusion that I’m taking is that, while there is a lot of variation, superseded functions do tend to be shorter than the functions that replace them! Yesterday we had separate(), today we have separate_wider_position(), and tomorrow separate_wider_integer_position_ignore_toofew_removecols() may be on the cards!