5 mutate

Creating New Columns Using mutate()

Author

Thomas Neitmann

library(dplyr)
library(lubridate)
dm <- readRDS("data/dm.rds")
ae <- readRDS("data/ae.rds")

The equivalent of creating a new variables in SAS inside a data step is to use the mutate() function. In the simplest case a static value is assigned to the new column.

adsl <- dm %>% mutate(DATASET = "ADSL")

This will set the value of the new variable DATASET to "ADSL" for all records.

adsl %>% select(DATASET)

# A tibble: 306 × 1
   DATASET
   <chr>  
 1 ADSL   
 2 ADSL   
 3 ADSL   
 4 ADSL   
 5 ADSL   
 6 ADSL   
 7 ADSL   
 8 ADSL   
 9 ADSL   
10 ADSL   
# … with 296 more rows

Note that new variables are always appended after existing columns such that DATASET is now the last column of adsl.

colnames(adsl)

 [1] "STUDYID"  "DOMAIN"   "USUBJID"  "SUBJID"   "RFSTDTC"  "RFENDTC" 
 [7] "RFXSTDTC" "RFXENDTC" "RFICDTC"  "RFPENDTC" "DTHDTC"   "DTHFL"   
[13] "SITEID"   "AGE"      "AGEU"     "SEX"      "RACE"     "ETHNIC"  
[19] "ARMCD"    "ARM"      "ACTARMCD" "ACTARM"   "COUNTRY"  "DMDTC"   
[25] "DMDY"     "DATASET"

Assigning the value of an existing column to a new column is the same as in SAS. The new column name goes to the left of = and the existing column to the right.

adsl <- adsl %>% mutate(TRT01P = ARM)
adsl %>% select(ARM, TRT01P)

# A tibble: 306 × 2
   ARM                  TRT01P              
   <chr>                <chr>               
 1 Placebo              Placebo             
 2 Placebo              Placebo             
 3 Xanomeline High Dose Xanomeline High Dose
 4 Xanomeline Low Dose  Xanomeline Low Dose 
 5 Xanomeline High Dose Xanomeline High Dose
 6 Placebo              Placebo             
 7 Screen Failure       Screen Failure      
 8 Xanomeline Low Dose  Xanomeline Low Dose 
 9 Xanomeline Low Dose  Xanomeline Low Dose 
10 Xanomeline Low Dose  Xanomeline Low Dose 
# … with 296 more rows

In most cases new variables are created by applying functions on existing variables to somehow transform them.

adsl <- adsl %>% mutate(RFSTDT = ymd(RFSTDTC))

You can create multiple new variables inside mutate() similar to how you would do it inside a data step.

adae <- ae %>% mutate(
  ASTDT = ymd(AESTDTC),
  ASTDY = ASTDT - TRTSDT + 1
)
adae %>% select(AESTDTC, ASTDT, TRTSDT, ASTDY)

# A tibble: 1,191 × 4
   AESTDTC    ASTDT      TRTSDT     ASTDY  
   <chr>      <date>     <date>     <drtn> 
 1 2014-01-03 2014-01-03 2014-01-02  2 days
 2 2014-01-03 2014-01-03 2014-01-02  2 days
 3 2014-01-09 2014-01-09 2014-01-02  8 days
 4 2012-08-26 2012-08-26 2012-08-05 22 days
 5 2012-08-07 2012-08-07 2012-08-05  3 days
 6 2012-08-07 2012-08-07 2012-08-05  3 days
 7 2012-08-07 2012-08-07 2012-08-05  3 days
 8 2013-07-21 2013-07-21 2013-07-19  3 days
 9 2013-08-08 2013-08-08 2013-07-19 21 days
10 2014-08-27 2014-08-27 2014-07-01 58 days
# … with 1,181 more rows

Just like in SAS you can use conditional logic to assign different values to a new variable depending on which value another variable has using if_else().

adae %>%
  mutate(ASTDY = if_else(ASTDT <= TRTSDT, ASTDT - TRTSDT, ASTDT - TRTSDT + 1)) %>% 
  select(USUBJID, TRTSDT, ASTDT, ASTDY)

# A tibble: 1,191 × 4
   USUBJID     TRTSDT     ASTDT      ASTDY  
   <chr>       <date>     <date>     <drtn> 
 1 01-701-1015 2014-01-02 2014-01-03  2 days
 2 01-701-1015 2014-01-02 2014-01-03  2 days
 3 01-701-1015 2014-01-02 2014-01-09  8 days
 4 01-701-1023 2012-08-05 2012-08-26 22 days
 5 01-701-1023 2012-08-05 2012-08-07  3 days
 6 01-701-1023 2012-08-05 2012-08-07  3 days
 7 01-701-1023 2012-08-05 2012-08-07  3 days
 8 01-701-1028 2013-07-19 2013-07-21  3 days
 9 01-701-1028 2013-07-19 2013-08-08 21 days
10 01-701-1034 2014-07-01 2014-08-27 58 days
# … with 1,181 more rows

At this point let’s make a small excursion to cover how R handles missing values, i.e. NA, when using conditional logic. Unlike in SAS where missing numbers are the smallest possible values such that . < 10 is true, in R any comparison involving NA returns NA as a result.

NA < 9

[1] NA

NA == 0

[1] NA

This is the same when using if_else().

adsl$AGE[1] <- NA
adsl %>% 
  mutate(AGEGR = if_else(AGE >= 65, "Elderly", "Adult")) %>% 
  select(USUBJID, AGE, AGEGR)

# A tibble: 306 × 3
   USUBJID       AGE AGEGR  
   <chr>       <int> <chr>  
 1 01-701-1015    NA <NA>   
 2 01-701-1023    64 Adult  
 3 01-701-1028    71 Elderly
 4 01-701-1033    74 Elderly
 5 01-701-1034    77 Elderly
 6 01-701-1047    85 Elderly
 7 01-701-1057    59 Adult  
 8 01-701-1097    68 Elderly
 9 01-701-1111    81 Elderly
10 01-701-1115    84 Elderly
# … with 296 more rows

To check whether a value is missing use the is.na() function.

is.na(NA)

[1] TRUE

is.na("NA")

[1] FALSE

Finally, it’s noteworthy that is are actually different types on NAs in R. We’ll make use of them next.

Types of `NA` in R
Type	Example	Missing Value
`character`	`"Brazil"`	`NA_character_`
`double`	`2.51`	`NA_real_`
`integer`	`1L`	`NA_integer_`
`logical`	`FALSE`	`NA`

If the logic is more complex than a simple if_else() than use case_when() instead.

adsl %>% 
  mutate(
    AGEGR1 = case_when(
      AGE < 18 ~ "<18",
      AGE < 45 ~ "<45",
      AGE < 65 ~ "<65",
      TRUE ~ ">=65"
    ) 
  ) %>% 
  select(USUBJID, AGE, AGEGR1)

# A tibble: 306 × 3
   USUBJID       AGE AGEGR1
   <chr>       <int> <chr> 
 1 01-701-1015    NA >=65  
 2 01-701-1023    64 <65   
 3 01-701-1028    71 >=65  
 4 01-701-1033    74 >=65  
 5 01-701-1034    77 >=65  
 6 01-701-1047    85 >=65  
 7 01-701-1057    59 <65   
 8 01-701-1097    68 >=65  
 9 01-701-1111    81 >=65  
10 01-701-1115    84 >=65  
# … with 296 more rows

The final condition TRUE is the is a catch all term and must be used with some caution. Notice what happened to the AGE of the first subject whose value we set to NA above.

To mitigate this you should either explicitly handle missing values as a separate condition or be explicit for all cases. The former would look something like this.

adsl %>% 
  mutate(
    AGEGR1 = case_when(
      is.na(AGE) ~ NA_character_,
      AGE < 18 ~ "<18",
      AGE < 45 ~ "<45",
      AGE < 65 ~ "<65",
      TRUE ~ ">=65"
    ) 
  ) %>% 
  select(USUBJID, AGE, AGEGR1)

# A tibble: 306 × 3
   USUBJID       AGE AGEGR1
   <chr>       <int> <chr> 
 1 01-701-1015    NA <NA>  
 2 01-701-1023    64 <65   
 3 01-701-1028    71 >=65  
 4 01-701-1033    74 >=65  
 5 01-701-1034    77 >=65  
 6 01-701-1047    85 >=65  
 7 01-701-1057    59 <65   
 8 01-701-1097    68 >=65  
 9 01-701-1111    81 >=65  
10 01-701-1115    84 >=65  
# … with 296 more rows

And the latter like this.

adsl %>% 
  mutate(
    AGEGR1 = case_when(
      AGE < 18 ~ "<18",
      AGE < 45 ~ "<45",
      AGE < 65 ~ "<65",
      AGE >= 65 ~ ">=65"
    ) 
  ) %>% 
  select(USUBJID, AGE, AGEGR1)

# A tibble: 306 × 3
   USUBJID       AGE AGEGR1
   <chr>       <int> <chr> 
 1 01-701-1015    NA <NA>  
 2 01-701-1023    64 <65   
 3 01-701-1028    71 >=65  
 4 01-701-1033    74 >=65  
 5 01-701-1034    77 >=65  
 6 01-701-1047    85 >=65  
 7 01-701-1057    59 <65   
 8 01-701-1097    68 >=65  
 9 01-701-1111    81 >=65  
10 01-701-1115    84 >=65  
# … with 296 more rows

Finally, note that when a value does not match any of the conditions given which may be the case when not using a final TRUE then it is assigned NA.

adsl %>% 
  mutate(
    AGEGR1 = case_when(
      AGE < 18 ~ "<18",
      AGE < 45 ~ "<45",
      AGE < 65 ~ "<65"
    ) 
  ) %>% 
  select(USUBJID, AGE, AGEGR1)

# A tibble: 306 × 3
   USUBJID       AGE AGEGR1
   <chr>       <int> <chr> 
 1 01-701-1015    NA <NA>  
 2 01-701-1023    64 <65   
 3 01-701-1028    71 <NA>  
 4 01-701-1033    74 <NA>  
 5 01-701-1034    77 <NA>  
 6 01-701-1047    85 <NA>  
 7 01-701-1057    59 <65   
 8 01-701-1097    68 <NA>  
 9 01-701-1111    81 <NA>  
10 01-701-1115    84 <NA>  
# … with 296 more rows