library(tidyverse)
3 select, filter & arrange
The tidyverse
is a collection of R packages designed for data science. It includes packages such as ggplot2
for data visualization, dplyr
for data manipulation, and tidyr
for reshaping data. The tidyverse
is built around the idea of “tidy data,” which is a standardized way of organizing and structuring data for analysis. The packages in the tidyverse
are designed to work together seamlessly, making it a popular choice for data scientists and analysts who use R.
3.1 Setup
Read data
<- read_csv("data/adsl.csv") adsl
Rows: 306 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (23): STUDYID, USUBJID, DTHFL, AGEU, SEX, RACE, ETHNIC, ARMCD, ARM, ACT...
dbl (7): SUBJID, SITEID, AGE, DMDY, TRTDURD, DTHADY, LDDTHELD
lgl (3): RFICDTC, REGION1, DTHA30FL
dttm (3): RFPENDTC, TRTSDTM, TRTEDTM
date (14): RFSTDTC, RFENDTC, RFXSTDTC, RFXENDTC, DTHDTC, DMDTC, TRTSDT, TRTE...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We can have a look at the data using many different commands / functions, e.g. the head()
function which gives us the first six observations:
head(adsl)
# A tibble: 6 × 50
STUDYID USUBJID SUBJID RFSTDTC RFENDTC RFXSTDTC RFXENDTC RFICDTC
<chr> <chr> <dbl> <date> <date> <date> <date> <lgl>
1 CDISCPILOT… 01-701… 1015 2014-01-02 2014-07-02 2014-01-02 2014-07-02 NA
2 CDISCPILOT… 01-701… 1023 2012-08-05 2012-09-02 2012-08-05 2012-09-01 NA
3 CDISCPILOT… 01-701… 1028 2013-07-19 2014-01-14 2013-07-19 2014-01-14 NA
4 CDISCPILOT… 01-701… 1033 2014-03-18 2014-04-14 2014-03-18 2014-03-31 NA
5 CDISCPILOT… 01-701… 1034 2014-07-01 2014-12-30 2014-07-01 2014-12-30 NA
6 CDISCPILOT… 01-701… 1047 2013-02-12 2013-03-29 2013-02-12 2013-03-09 NA
# … with 42 more variables: RFPENDTC <dttm>, DTHDTC <date>, DTHFL <chr>,
# SITEID <dbl>, AGE <dbl>, AGEU <chr>, SEX <chr>, RACE <chr>, ETHNIC <chr>,
# ARMCD <chr>, ARM <chr>, ACTARMCD <chr>, ACTARM <chr>, COUNTRY <chr>,
# DMDTC <date>, DMDY <dbl>, TRT01P <chr>, TRT01A <chr>, TRTSDTM <dttm>,
# TRTSTMF <chr>, TRTEDTM <dttm>, TRTETMF <chr>, TRTSDT <date>, TRTEDT <date>,
# TRTDURD <dbl>, SCRFDT <date>, EOSDT <date>, EOSSTT <chr>, FRVDT <date>,
# RANDDT <date>, DTHDT <date>, DTHADY <dbl>, LDDTHELD <dbl>, …
3.2 dplyr
dplyr
is a package in the tidyverse
that provides a set of functions for efficiently manipulating and cleaning data. It is built around the idea of “verbs” that correspond to common data manipulation tasks, such as select()
for selecting specific columns from a data frame, filter()
for filtering rows based on certain conditions, arrange()
for sorting data-frames and group_by()
and summarize()
for grouping and summarizing data by one or more variables.
dplyr
is not strictly needed for any of that, everything can be done in base R. However, dplyr
provides a framework to write readable code and a pipeline to work efficiently.
There are various functions within dplyr
for datawrangling which follow a consistent structure. The first input of the most used dplyr
functions is the data-frame. Then follow arguments specifying the behaviour of the function. Compared to the base r syntax we do not have to write column / variable names in quotation marks; dplyr
syntax lets us refer to columns within a data-frame without the need to always reference the data-frame of origin.
3.2.1 select
The select
function lets us select all variables mentioned in the arguments (and drops all other variables). Alternatively, we can selectively drop variables if we place a minus (-
) in front of the variable name.
We can first have a look at all variable names of the data-frame:
names(adsl)
[1] "STUDYID" "USUBJID" "SUBJID" "RFSTDTC" "RFENDTC" "RFXSTDTC"
[7] "RFXENDTC" "RFICDTC" "RFPENDTC" "DTHDTC" "DTHFL" "SITEID"
[13] "AGE" "AGEU" "SEX" "RACE" "ETHNIC" "ARMCD"
[19] "ARM" "ACTARMCD" "ACTARM" "COUNTRY" "DMDTC" "DMDY"
[25] "TRT01P" "TRT01A" "TRTSDTM" "TRTSTMF" "TRTEDTM" "TRTETMF"
[31] "TRTSDT" "TRTEDT" "TRTDURD" "SCRFDT" "EOSDT" "EOSSTT"
[37] "FRVDT" "RANDDT" "DTHDT" "DTHADY" "LDDTHELD" "LSTALVDT"
[43] "AGEGR1" "SAFFL" "RACEGR1" "REGION1" "LDDTHGR1" "DTH30FL"
[49] "DTHA30FL" "DTHB30FL"
And then select the desired variables:
# dplyr::select
select(adsl,
STUDYID,
USUBJID,
ARM,
AGE,
SEX, RACE)
# A tibble: 306 × 6
STUDYID USUBJID ARM AGE SEX RACE
<chr> <chr> <chr> <dbl> <chr> <chr>
1 CDISCPILOT01 01-701-1015 Placebo 63 F WHITE
2 CDISCPILOT01 01-701-1023 Placebo 64 M WHITE
3 CDISCPILOT01 01-701-1028 Xanomeline High Dose 71 M WHITE
4 CDISCPILOT01 01-701-1033 Xanomeline Low Dose 74 M WHITE
5 CDISCPILOT01 01-701-1034 Xanomeline High Dose 77 F WHITE
6 CDISCPILOT01 01-701-1047 Placebo 85 F WHITE
7 CDISCPILOT01 01-701-1057 Screen Failure 59 F WHITE
8 CDISCPILOT01 01-701-1097 Xanomeline Low Dose 68 M WHITE
9 CDISCPILOT01 01-701-1111 Xanomeline Low Dose 81 F WHITE
10 CDISCPILOT01 01-701-1115 Xanomeline Low Dose 84 M WHITE
# … with 296 more rows
We end up with a new data-frame including only the selected variables. Note here that we do not save the resulting data-frame at the moment.
There are also some helper functions to use within the select
function of dplyr
. starts_with()
ends_with()
num_range()
. They allow us to select multiple columns sharing a naming structure. num_range()
let’s us select consecutively numbered columns, e.g.: num_range("example", 1:4)
would select the columns named: example1, example2, example3, example4.
We can try out starts_with()
:
select(adsl,
USUBJID, starts_with("trt"))
# A tibble: 306 × 10
USUBJID TRT01P TRT01A TRTSDTM TRTSTMF TRTEDTM TRTETMF
<chr> <chr> <chr> <dttm> <chr> <dttm> <chr>
1 01-701… Place… Place… 2014-01-02 00:00:00 H 2014-07-02 23:59:59 H
2 01-701… Place… Place… 2012-08-05 00:00:00 H 2012-09-01 23:59:59 H
3 01-701… Xanom… Xanom… 2013-07-19 00:00:00 H 2014-01-14 23:59:59 H
4 01-701… Xanom… Xanom… 2014-03-18 00:00:00 H 2014-03-31 23:59:59 H
5 01-701… Xanom… Xanom… 2014-07-01 00:00:00 H 2014-12-30 23:59:59 H
6 01-701… Place… Place… 2013-02-12 00:00:00 H 2013-03-09 23:59:59 H
7 01-701… Scree… Scree… NA <NA> NA <NA>
8 01-701… Xanom… Xanom… 2014-01-01 00:00:00 H 2014-07-09 23:59:59 H
9 01-701… Xanom… Xanom… 2012-09-07 00:00:00 H 2012-09-16 23:59:59 H
10 01-701… Xanom… Xanom… 2012-11-30 00:00:00 H 2013-01-23 23:59:59 H
# … with 296 more rows, and 3 more variables: TRTSDT <date>, TRTEDT <date>,
# TRTDURD <dbl>
And ends_with()
:
# in this df, all variables that contain dates end with "DT".
# We can select them:
select(adsl,
USUBJID, ends_with("DT"))
# A tibble: 306 × 9
USUBJID TRTSDT TRTEDT SCRFDT EOSDT FRVDT RANDDT
<chr> <date> <date> <date> <date> <date> <date>
1 01-701-1015 2014-01-02 2014-07-02 NA 2014-07-02 NA 2014-01-02
2 01-701-1023 2012-08-05 2012-09-01 NA 2012-09-02 2013-02-18 2012-08-05
3 01-701-1028 2013-07-19 2014-01-14 NA 2014-01-14 NA 2013-07-19
4 01-701-1033 2014-03-18 2014-03-31 NA 2014-04-14 2014-09-15 2014-03-18
5 01-701-1034 2014-07-01 2014-12-30 NA 2014-12-30 NA 2014-07-01
6 01-701-1047 2013-02-12 2013-03-09 NA 2013-03-29 2013-07-28 2013-02-12
7 01-701-1057 NA NA 2013-12-20 NA NA NA
8 01-701-1097 2014-01-01 2014-07-09 NA 2014-07-09 NA 2014-01-01
9 01-701-1111 2012-09-07 2012-09-16 NA 2012-09-17 2013-02-22 2012-09-07
10 01-701-1115 2012-11-30 2013-01-23 NA 2013-01-23 2013-05-20 2012-11-30
# … with 296 more rows, and 2 more variables: DTHDT <date>, LSTALVDT <date>
If we want a data-frame that does not include any dates, we can make use of the minus sign in combination with the ends_with()
function:
select(adsl,
-ends_with("DT"))
# A tibble: 306 × 42
STUDYID USUBJID SUBJID RFSTDTC RFENDTC RFXSTDTC RFXENDTC RFICDTC
<chr> <chr> <dbl> <date> <date> <date> <date> <lgl>
1 CDISCPILO… 01-701… 1015 2014-01-02 2014-07-02 2014-01-02 2014-07-02 NA
2 CDISCPILO… 01-701… 1023 2012-08-05 2012-09-02 2012-08-05 2012-09-01 NA
3 CDISCPILO… 01-701… 1028 2013-07-19 2014-01-14 2013-07-19 2014-01-14 NA
4 CDISCPILO… 01-701… 1033 2014-03-18 2014-04-14 2014-03-18 2014-03-31 NA
5 CDISCPILO… 01-701… 1034 2014-07-01 2014-12-30 2014-07-01 2014-12-30 NA
6 CDISCPILO… 01-701… 1047 2013-02-12 2013-03-29 2013-02-12 2013-03-09 NA
7 CDISCPILO… 01-701… 1057 NA NA NA NA NA
8 CDISCPILO… 01-701… 1097 2014-01-01 2014-07-09 2014-01-01 2014-07-09 NA
9 CDISCPILO… 01-701… 1111 2012-09-07 2012-09-17 2012-09-07 2012-09-16 NA
10 CDISCPILO… 01-701… 1115 2012-11-30 2013-01-23 2012-11-30 2013-01-23 NA
# … with 296 more rows, and 34 more variables: RFPENDTC <dttm>, DTHDTC <date>,
# DTHFL <chr>, SITEID <dbl>, AGE <dbl>, AGEU <chr>, SEX <chr>, RACE <chr>,
# ETHNIC <chr>, ARMCD <chr>, ARM <chr>, ACTARMCD <chr>, ACTARM <chr>,
# COUNTRY <chr>, DMDTC <date>, DMDY <dbl>, TRT01P <chr>, TRT01A <chr>,
# TRTSDTM <dttm>, TRTSTMF <chr>, TRTEDTM <dttm>, TRTETMF <chr>,
# TRTDURD <dbl>, EOSSTT <chr>, DTHADY <dbl>, LDDTHELD <dbl>, AGEGR1 <chr>,
# SAFFL <chr>, RACEGR1 <chr>, REGION1 <lgl>, LDDTHGR1 <chr>, DTH30FL <chr>, …
3.2.2 filter
The filter
function allows us to look at a subset of observations. As input, the function requires a logical vector and (of course) a data-frame. This time, we first save the reduced (selected) data-frame and use that as the first argument to filter
.
<- select(adsl,
selected_data
STUDYID,
USUBJID,
ARM,
AGE,
SEX, RACE)
The logical vector is generally created within the function call and can use any of the following logic operators:
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== equal
!= not equal
!x not x (negation)
x | y x OR y
x & y x AND y
x %in% y logical vector of length x with TRUE if element of x is in y
Within filter
, we can chain logical vectors by separating them with a comma (,
). Lets have a look at women that are 70 and older:
filter(selected_data,
>= 70,
AGE == "F") SEX
# A tibble: 141 × 6
STUDYID USUBJID ARM AGE SEX RACE
<chr> <chr> <chr> <dbl> <chr> <chr>
1 CDISCPILOT01 01-701-1034 Xanomeline High Dose 77 F WHITE
2 CDISCPILOT01 01-701-1047 Placebo 85 F WHITE
3 CDISCPILOT01 01-701-1111 Xanomeline Low Dose 81 F WHITE
4 CDISCPILOT01 01-701-1133 Xanomeline High Dose 81 F WHITE
5 CDISCPILOT01 01-701-1146 Xanomeline High Dose 75 F WHITE
6 CDISCPILOT01 01-701-1153 Placebo 79 F WHITE
7 CDISCPILOT01 01-701-1162 Screen Failure 82 F WHITE
8 CDISCPILOT01 01-701-1181 Xanomeline High Dose 79 F WHITE
9 CDISCPILOT01 01-701-1192 Xanomeline Low Dose 80 F WHITE
10 CDISCPILOT01 01-701-1203 Placebo 81 F BLACK OR AFRICAN A…
# … with 131 more rows
Now we have a reduced data frame with female patients over 70. However, the nested call is not very intuitive to read. If any more functions get added to this code, it becomes even less readable. That is where the pipe operator (%>%
) comes in.
# select, filter, & pipe:
%>% # This pipe forwards adsl to the select function as its first argument
adsl select(STUDYID,
USUBJID,
ARM,
AGE,
SEX, %>% # this pipe forwards the selected variables to the filter function
RACE) filter(AGE >= 70,
== "F") SEX
# A tibble: 141 × 6
STUDYID USUBJID ARM AGE SEX RACE
<chr> <chr> <chr> <dbl> <chr> <chr>
1 CDISCPILOT01 01-701-1034 Xanomeline High Dose 77 F WHITE
2 CDISCPILOT01 01-701-1047 Placebo 85 F WHITE
3 CDISCPILOT01 01-701-1111 Xanomeline Low Dose 81 F WHITE
4 CDISCPILOT01 01-701-1133 Xanomeline High Dose 81 F WHITE
5 CDISCPILOT01 01-701-1146 Xanomeline High Dose 75 F WHITE
6 CDISCPILOT01 01-701-1153 Placebo 79 F WHITE
7 CDISCPILOT01 01-701-1162 Screen Failure 82 F WHITE
8 CDISCPILOT01 01-701-1181 Xanomeline High Dose 79 F WHITE
9 CDISCPILOT01 01-701-1192 Xanomeline Low Dose 80 F WHITE
10 CDISCPILOT01 01-701-1203 Placebo 81 F BLACK OR AFRICAN A…
# … with 131 more rows
There is another inline operator which can be very useful within the filter function; %in%
. With this operator, we can select rows based on a prespecified vector of values. This can be useful if there are specified values (e.g., specific USUBJID
) which we would like to look at.
# we save 4 USUBJID's in a vector:
<- c("01-716-1151", "01-710-1443", "01-708-1184", "01-705-1186")
lookup_ids
# and then create a logical vector which returns TRUE for every entry in the
# USUBJID vector which are represented in the lookup_ids, and else FALSE
$USUBJID %in% lookup_ids adsl
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[97] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[217] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[265] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[301] FALSE FALSE FALSE FALSE FALSE FALSE
# this approach can be used in the filter function:
%>%
adsl select(STUDYID, USUBJID, ARM, AGE, SEX, RACE) %>%
filter(USUBJID %in% lookup_ids)
# A tibble: 4 × 6
STUDYID USUBJID ARM AGE SEX RACE
<chr> <chr> <chr> <dbl> <chr> <chr>
1 CDISCPILOT01 01-705-1186 Placebo 84 F WHITE
2 CDISCPILOT01 01-708-1184 Screen Failure 70 F WHITE
3 CDISCPILOT01 01-710-1443 Screen Failure 88 F WHITE
4 CDISCPILOT01 01-716-1151 Xanomeline Low Dose 83 F WHITE
Note that within the filter
function (and in all major dplyr
functions) R looks for the requested variables first within the supplied data-frame and afterwards in the global environment.
3.2.3 arrange
We can sort the dataframe with the arrange() function. It allows the sorting based on multiple variables. Note that the order of arranging variables determines the sorting hierarchy, so in this example we first order by AGE
and
%>%
adsl select(STUDYID, USUBJID, ARM, AGE, SEX, RACE) %>%
filter(AGE >= 70,
== "F") %>%
SEX arrange(ARM, AGE)
# A tibble: 141 × 6
STUDYID USUBJID ARM AGE SEX RACE
<chr> <chr> <chr> <dbl> <chr> <chr>
1 CDISCPILOT01 01-705-1282 Placebo 70 F BLACK OR AFRICAN AMERICAN
2 CDISCPILOT01 01-704-1260 Placebo 71 F WHITE
3 CDISCPILOT01 01-703-1210 Placebo 72 F WHITE
4 CDISCPILOT01 01-716-1026 Placebo 73 F WHITE
5 CDISCPILOT01 01-718-1150 Placebo 73 F WHITE
6 CDISCPILOT01 01-708-1087 Placebo 74 F WHITE
7 CDISCPILOT01 01-708-1316 Placebo 74 F WHITE
8 CDISCPILOT01 01-709-1001 Placebo 76 F WHITE
9 CDISCPILOT01 01-710-1077 Placebo 76 F WHITE
10 CDISCPILOT01 01-715-1397 Placebo 76 F WHITE
# … with 131 more rows
To sort by descending order, we can use the helper function desc()
within arrange()
:
%>%
adsl select(STUDYID, USUBJID, ARM, AGE, SEX, RACE) %>%
filter(AGE >= 70,
== "F") %>%
SEX arrange(ARM, desc(AGE))
# A tibble: 141 × 6
STUDYID USUBJID ARM AGE SEX RACE
<chr> <chr> <chr> <dbl> <chr> <chr>
1 CDISCPILOT01 01-710-1083 Placebo 89 F WHITE
2 CDISCPILOT01 01-710-1368 Placebo 88 F WHITE
3 CDISCPILOT01 01-714-1035 Placebo 88 F WHITE
4 CDISCPILOT01 01-701-1387 Placebo 87 F WHITE
5 CDISCPILOT01 01-704-1233 Placebo 87 F WHITE
6 CDISCPILOT01 01-716-1024 Placebo 87 F WHITE
7 CDISCPILOT01 01-705-1349 Placebo 86 F WHITE
8 CDISCPILOT01 01-710-1271 Placebo 86 F WHITE
9 CDISCPILOT01 01-716-1108 Placebo 86 F WHITE
10 CDISCPILOT01 01-701-1047 Placebo 85 F WHITE
# … with 131 more rows