library(dplyr)
library(lubridate)
<- readRDS("data/dm.rds")
dm <- readRDS("data/ae.rds") ae
5 mutate
Creating New Columns Using mutate()
The equivalent of creating a new variables in SAS inside a data
step is to use the mutate()
function. In the simplest case a static value is assigned to the new column.
<- dm %>% mutate(DATASET = "ADSL") adsl
This will set the value of the new variable DATASET
to "ADSL"
for all records.
%>% select(DATASET) adsl
# A tibble: 306 × 1
DATASET
<chr>
1 ADSL
2 ADSL
3 ADSL
4 ADSL
5 ADSL
6 ADSL
7 ADSL
8 ADSL
9 ADSL
10 ADSL
# … with 296 more rows
Note that new variables are always appended after existing columns such that DATASET
is now the last column of adsl
.
colnames(adsl)
[1] "STUDYID" "DOMAIN" "USUBJID" "SUBJID" "RFSTDTC" "RFENDTC"
[7] "RFXSTDTC" "RFXENDTC" "RFICDTC" "RFPENDTC" "DTHDTC" "DTHFL"
[13] "SITEID" "AGE" "AGEU" "SEX" "RACE" "ETHNIC"
[19] "ARMCD" "ARM" "ACTARMCD" "ACTARM" "COUNTRY" "DMDTC"
[25] "DMDY" "DATASET"
Assigning the value of an existing column to a new column is the same as in SAS. The new column name goes to the left of =
and the existing column to the right.
<- adsl %>% mutate(TRT01P = ARM)
adsl %>% select(ARM, TRT01P) adsl
# A tibble: 306 × 2
ARM TRT01P
<chr> <chr>
1 Placebo Placebo
2 Placebo Placebo
3 Xanomeline High Dose Xanomeline High Dose
4 Xanomeline Low Dose Xanomeline Low Dose
5 Xanomeline High Dose Xanomeline High Dose
6 Placebo Placebo
7 Screen Failure Screen Failure
8 Xanomeline Low Dose Xanomeline Low Dose
9 Xanomeline Low Dose Xanomeline Low Dose
10 Xanomeline Low Dose Xanomeline Low Dose
# … with 296 more rows
In most cases new variables are created by applying functions on existing variables to somehow transform them.
<- adsl %>% mutate(RFSTDT = ymd(RFSTDTC)) adsl
You can create multiple new variables inside mutate()
similar to how you would do it inside a data
step.
<- ae %>% mutate(
adae ASTDT = ymd(AESTDTC),
ASTDY = ASTDT - TRTSDT + 1
)%>% select(AESTDTC, ASTDT, TRTSDT, ASTDY) adae
# A tibble: 1,191 × 4
AESTDTC ASTDT TRTSDT ASTDY
<chr> <date> <date> <drtn>
1 2014-01-03 2014-01-03 2014-01-02 2 days
2 2014-01-03 2014-01-03 2014-01-02 2 days
3 2014-01-09 2014-01-09 2014-01-02 8 days
4 2012-08-26 2012-08-26 2012-08-05 22 days
5 2012-08-07 2012-08-07 2012-08-05 3 days
6 2012-08-07 2012-08-07 2012-08-05 3 days
7 2012-08-07 2012-08-07 2012-08-05 3 days
8 2013-07-21 2013-07-21 2013-07-19 3 days
9 2013-08-08 2013-08-08 2013-07-19 21 days
10 2014-08-27 2014-08-27 2014-07-01 58 days
# … with 1,181 more rows
Just like in SAS you can use conditional logic to assign different values to a new variable depending on which value another variable has using if_else()
.
%>%
adae mutate(ASTDY = if_else(ASTDT <= TRTSDT, ASTDT - TRTSDT, ASTDT - TRTSDT + 1)) %>%
select(USUBJID, TRTSDT, ASTDT, ASTDY)
# A tibble: 1,191 × 4
USUBJID TRTSDT ASTDT ASTDY
<chr> <date> <date> <drtn>
1 01-701-1015 2014-01-02 2014-01-03 2 days
2 01-701-1015 2014-01-02 2014-01-03 2 days
3 01-701-1015 2014-01-02 2014-01-09 8 days
4 01-701-1023 2012-08-05 2012-08-26 22 days
5 01-701-1023 2012-08-05 2012-08-07 3 days
6 01-701-1023 2012-08-05 2012-08-07 3 days
7 01-701-1023 2012-08-05 2012-08-07 3 days
8 01-701-1028 2013-07-19 2013-07-21 3 days
9 01-701-1028 2013-07-19 2013-08-08 21 days
10 01-701-1034 2014-07-01 2014-08-27 58 days
# … with 1,181 more rows
At this point let’s make a small excursion to cover how R handles missing values, i.e. NA
, when using conditional logic. Unlike in SAS where missing numbers are the smallest possible values such that . < 10
is true, in R any comparison involving NA
returns NA
as a result.
NA < 9
[1] NA
NA == 0
[1] NA
This is the same when using if_else()
.
$AGE[1] <- NA
adsl%>%
adsl mutate(AGEGR = if_else(AGE >= 65, "Elderly", "Adult")) %>%
select(USUBJID, AGE, AGEGR)
# A tibble: 306 × 3
USUBJID AGE AGEGR
<chr> <int> <chr>
1 01-701-1015 NA <NA>
2 01-701-1023 64 Adult
3 01-701-1028 71 Elderly
4 01-701-1033 74 Elderly
5 01-701-1034 77 Elderly
6 01-701-1047 85 Elderly
7 01-701-1057 59 Adult
8 01-701-1097 68 Elderly
9 01-701-1111 81 Elderly
10 01-701-1115 84 Elderly
# … with 296 more rows
To check whether a value is missing use the is.na()
function.
is.na(NA)
[1] TRUE
is.na("NA")
[1] FALSE
Finally, it’s noteworthy that is are actually different types on NA
s in R. We’ll make use of them next.
Type | Example | Missing Value |
---|---|---|
character |
"Brazil" |
NA_character_ |
double |
2.51 |
NA_real_ |
integer |
1L |
NA_integer_ |
logical |
FALSE |
NA |
If the logic is more complex than a simple if_else()
than use case_when()
instead.
%>%
adsl mutate(
AGEGR1 = case_when(
< 18 ~ "<18",
AGE < 45 ~ "<45",
AGE < 65 ~ "<65",
AGE TRUE ~ ">=65"
) %>%
) select(USUBJID, AGE, AGEGR1)
# A tibble: 306 × 3
USUBJID AGE AGEGR1
<chr> <int> <chr>
1 01-701-1015 NA >=65
2 01-701-1023 64 <65
3 01-701-1028 71 >=65
4 01-701-1033 74 >=65
5 01-701-1034 77 >=65
6 01-701-1047 85 >=65
7 01-701-1057 59 <65
8 01-701-1097 68 >=65
9 01-701-1111 81 >=65
10 01-701-1115 84 >=65
# … with 296 more rows
The final condition TRUE
is the is a catch all term and must be used with some caution. Notice what happened to the AGE
of the first subject whose value we set to NA
above.
To mitigate this you should either explicitly handle missing values as a separate condition or be explicit for all cases. The former would look something like this.
%>%
adsl mutate(
AGEGR1 = case_when(
is.na(AGE) ~ NA_character_,
< 18 ~ "<18",
AGE < 45 ~ "<45",
AGE < 65 ~ "<65",
AGE TRUE ~ ">=65"
) %>%
) select(USUBJID, AGE, AGEGR1)
# A tibble: 306 × 3
USUBJID AGE AGEGR1
<chr> <int> <chr>
1 01-701-1015 NA <NA>
2 01-701-1023 64 <65
3 01-701-1028 71 >=65
4 01-701-1033 74 >=65
5 01-701-1034 77 >=65
6 01-701-1047 85 >=65
7 01-701-1057 59 <65
8 01-701-1097 68 >=65
9 01-701-1111 81 >=65
10 01-701-1115 84 >=65
# … with 296 more rows
And the latter like this.
%>%
adsl mutate(
AGEGR1 = case_when(
< 18 ~ "<18",
AGE < 45 ~ "<45",
AGE < 65 ~ "<65",
AGE >= 65 ~ ">=65"
AGE
) %>%
) select(USUBJID, AGE, AGEGR1)
# A tibble: 306 × 3
USUBJID AGE AGEGR1
<chr> <int> <chr>
1 01-701-1015 NA <NA>
2 01-701-1023 64 <65
3 01-701-1028 71 >=65
4 01-701-1033 74 >=65
5 01-701-1034 77 >=65
6 01-701-1047 85 >=65
7 01-701-1057 59 <65
8 01-701-1097 68 >=65
9 01-701-1111 81 >=65
10 01-701-1115 84 >=65
# … with 296 more rows
Finally, note that when a value does not match any of the conditions given which may be the case when not using a final TRUE
then it is assigned NA
.
%>%
adsl mutate(
AGEGR1 = case_when(
< 18 ~ "<18",
AGE < 45 ~ "<45",
AGE < 65 ~ "<65"
AGE
) %>%
) select(USUBJID, AGE, AGEGR1)
# A tibble: 306 × 3
USUBJID AGE AGEGR1
<chr> <int> <chr>
1 01-701-1015 NA <NA>
2 01-701-1023 64 <65
3 01-701-1028 71 <NA>
4 01-701-1033 74 <NA>
5 01-701-1034 77 <NA>
6 01-701-1047 85 <NA>
7 01-701-1057 59 <65
8 01-701-1097 68 <NA>
9 01-701-1111 81 <NA>
10 01-701-1115 84 <NA>
# … with 296 more rows