Add New Variable(s) to the Input Dataset Based on Variables from Another Dataset

Add new variable(s) to the input dataset based on variables from another dataset. The observations to merge can be selected by a condition (filter_add argument) and/or selecting the first or last observation for each by group (order and mode argument).

Usage

derive_vars_merged(
  dataset,
  dataset_add,
  by_vars,
  order = NULL,
  new_vars = NULL,
  filter_add = NULL,
  mode = NULL,
  exist_flag = NULL,
  true_value = "Y",
  false_value = NA_character_,
  missing_values = NULL,
  check_type = "warning",
  duplicate_msg = NULL,
  relationship = NULL
)

Arguments

dataset

Input dataset

The variables specified by the by_vars argument are expected to be in the dataset.

dataset_add

Additional dataset

The variables specified by the by_vars, the new_vars, and the order argument are expected.

by_vars

Grouping variables

The input dataset and the selected observations from the additional dataset are merged by the specified variables.

Variables can be renamed by naming the element, i.e. by_vars = exprs(<name in input dataset> = <name in additional dataset>), similar to the dplyr joins.

Permitted Values: list of variables created by exprs() e.g. exprs(USUBJID, VISIT)

order

Sort order

If the argument is set to a non-null value, for each by group the first or last observation from the additional dataset is selected with respect to the specified order.

Variables defined by the new_vars argument can be used in the sort order.

For handling of NAs in sorting variables see Sort Order.

Permitted Values: list of expressions created by exprs(), e.g., exprs(ADT, desc(AVAL)) or NULL

new_vars

Variables to add

The specified variables from the additional dataset are added to the output dataset. Variables can be renamed by naming the element, i.e., new_vars = exprs(<new name> = <old name>).

For example new_vars = exprs(var1, var2) adds variables var1 and var2 from dataset_add to the input dataset.

And new_vars = exprs(var1, new_var2 = old_var2) takes var1 and old_var2 from dataset_add and adds them to the input dataset renaming old_var2 to new_var2.

Values of the added variables can be modified by specifying an expression. For example, new_vars = LASTRSP = exprs(str_to_upper(AVALC)) adds the variable LASTRSP to the dataset and sets it to the upper case value of AVALC.

If the argument is not specified or set to NULL, all variables from the additional dataset (dataset_add) are added.

Permitted Values: list of variables or named expressions created by exprs()

filter_add

Filter for additional dataset (dataset_add)

Only observations fulfilling the specified condition are taken into account for merging. If the argument is not specified, all observations are considered.

Variables defined by the new_vars argument can be used in the filter condition.

Permitted Values: a condition

mode

Selection mode

Determines if the first or last observation is selected. If the order argument is specified, mode must be non-null.

If the order argument is not specified, the mode argument is ignored.

Permitted Values: "first", "last", NULL

exist_flag

Exist flag

If the argument is specified (e.g., exist_flag = FLAG), the specified variable (e.g., FLAG) is added to the input dataset. This variable will be the value provided in true_value for all selected records from dataset_add which are merged into the input dataset, and the value provided in false_value otherwise.

Permitted Values: Variable name

true_value

True value

The value for the specified variable exist_flag, applicable to the first or last observation (depending on the mode) of each by group.

Permitted Values: An atomic scalar

false_value

False value

The value for the specified variable exist_flag, NOT applicable to the first or last observation (depending on the mode) of each by group.

Permitted Values: An atomic scalar

missing_values

Values for non-matching observations

For observations of the input dataset (dataset) which do not have a matching observation in the additional dataset (dataset_add) the values of the specified variables are set to the specified value. Only variables specified for new_vars can be specified for missing_values.

Permitted Values: named list of expressions, e.g., exprs(BASEC = "MISSING", BASE = -1)

check_type

Check uniqueness?

If "warning", "message", or "error" is specified, the specified message is issued if the observations of the (restricted) additional dataset are not unique with respect to the by variables and the order.

If the order argument is not specified, the check_type argument is ignored: if the observations of the (restricted) additional dataset are not unique with respect to the by variables, an error is issued.

Permitted Values: "none", "message", "warning", "error"

duplicate_msg

Message of unique check

If the uniqueness check fails, the specified message is displayed.

Default:

paste(
  "Dataset {.arg dataset_add} contains duplicate records with respect to",
  "{.var {vars2chr(by_vars)}}."
)

relationship

Expected merge-relationship between the by_vars variable(s) in dataset (input dataset) and the dataset_add (additional dataset) containing the additional new_vars.

This argument is passed to the dplyr::left_join() function. See https://dplyr.tidyverse.org/reference/mutate-joins.html#arguments for more details.

Permitted Values: "one-to-one", "many-to-one", NULL.

Value

The output dataset contains all observations and variables of the input dataset and additionally the variables specified for new_vars from the additional dataset (dataset_add).

Details

The new variables (new_vars) are added to the additional dataset (dataset_add).
The records from the additional dataset (dataset_add) are restricted to those matching the filter_add condition.
If order is specified, for each by group the first or last observation (depending on mode) is selected.
The variables specified for new_vars are merged to the input dataset using left_join(). I.e., the output dataset contains all observations from the input dataset. For observations without a matching observation in the additional dataset the new variables are set as specified by missing_values (or to NA for variables not in missing_values). Observations in the additional dataset which have no matching observation in the input dataset are ignored.

Examples

library(dplyr, warn.conflicts = FALSE)
vs <- tribble(
  ~STUDYID,  ~DOMAIN,  ~USUBJID, ~VSTESTCD,      ~VISIT, ~VSSTRESN, ~VSSTRESU,       ~VSDTC,
  "PILOT01",    "VS", "01-1302",  "HEIGHT", "SCREENING",     177.8,      "cm", "2013-08-20",
  "PILOT01",    "VS", "01-1302",  "WEIGHT", "SCREENING",     81.19,      "kg", "2013-08-20",
  "PILOT01",    "VS", "01-1302",  "WEIGHT",  "BASELINE",      82.1,      "kg", "2013-08-29",
  "PILOT01",    "VS", "01-1302",  "WEIGHT",    "WEEK 2",     81.19,      "kg", "2013-09-15",
  "PILOT01",    "VS", "01-1302",  "WEIGHT",    "WEEK 4",     82.56,      "kg", "2013-09-24",
  "PILOT01",    "VS", "01-1302",  "WEIGHT",    "WEEK 6",     80.74,      "kg", "2013-10-08",
  "PILOT01",    "VS", "01-1302",  "WEIGHT",    "WEEK 8",      82.1,      "kg", "2013-10-22",
  "PILOT01",    "VS", "01-1302",  "WEIGHT",   "WEEK 12",      82.1,      "kg", "2013-11-05",
  "PILOT01",    "VS", "17-1344",  "HEIGHT", "SCREENING",     163.5,      "cm", "2014-01-01",
  "PILOT01",    "VS", "17-1344",  "WEIGHT", "SCREENING",     58.06,      "kg", "2014-01-01",
  "PILOT01",    "VS", "17-1344",  "WEIGHT",  "BASELINE",     58.06,      "kg", "2014-01-11",
  "PILOT01",    "VS", "17-1344",  "WEIGHT",    "WEEK 2",     58.97,      "kg", "2014-01-24",
  "PILOT01",    "VS", "17-1344",  "WEIGHT",    "WEEK 4",     57.97,      "kg", "2014-02-07",
  "PILOT01",    "VS", "17-1344",  "WEIGHT",    "WEEK 6",     58.97,      "kg", "2014-02-19",
  "PILOT01",    "VS", "17-1344",  "WEIGHT",    "WEEK 8",     57.79,      "kg", "2014-03-14"
)

dm <- tribble(
  ~STUDYID,  ~DOMAIN,  ~USUBJID, ~AGE,   ~AGEU,
  "PILOT01",    "DM", "01-1302",   61, "YEARS",
  "PILOT01",    "DM", "17-1344",   64, "YEARS"
)


# Merging all dm variables to vs
derive_vars_merged(
  vs,
  dataset_add = select(dm, -DOMAIN),
  by_vars = exprs(STUDYID, USUBJID)
) %>%
  select(STUDYID, USUBJID, VSTESTCD, VISIT, VSSTRESN, AGE, AGEU)
#> # A tibble: 15 × 7
#>    STUDYID USUBJID VSTESTCD VISIT     VSSTRESN   AGE AGEU 
#>    <chr>   <chr>   <chr>    <chr>        <dbl> <dbl> <chr>
#>  1 PILOT01 01-1302 HEIGHT   SCREENING    178.     61 YEARS
#>  2 PILOT01 01-1302 WEIGHT   SCREENING     81.2    61 YEARS
#>  3 PILOT01 01-1302 WEIGHT   BASELINE      82.1    61 YEARS
#>  4 PILOT01 01-1302 WEIGHT   WEEK 2        81.2    61 YEARS
#>  5 PILOT01 01-1302 WEIGHT   WEEK 4        82.6    61 YEARS
#>  6 PILOT01 01-1302 WEIGHT   WEEK 6        80.7    61 YEARS
#>  7 PILOT01 01-1302 WEIGHT   WEEK 8        82.1    61 YEARS
#>  8 PILOT01 01-1302 WEIGHT   WEEK 12       82.1    61 YEARS
#>  9 PILOT01 17-1344 HEIGHT   SCREENING    164.     64 YEARS
#> 10 PILOT01 17-1344 WEIGHT   SCREENING     58.1    64 YEARS
#> 11 PILOT01 17-1344 WEIGHT   BASELINE      58.1    64 YEARS
#> 12 PILOT01 17-1344 WEIGHT   WEEK 2        59.0    64 YEARS
#> 13 PILOT01 17-1344 WEIGHT   WEEK 4        58.0    64 YEARS
#> 14 PILOT01 17-1344 WEIGHT   WEEK 6        59.0    64 YEARS
#> 15 PILOT01 17-1344 WEIGHT   WEEK 8        57.8    64 YEARS


# Merge last weight to adsl
adsl <- tribble(
  ~STUDYID,   ~USUBJID, ~AGE,   ~AGEU,
  "PILOT01", "01-1302",   61, "YEARS",
  "PILOT01", "17-1344",   64, "YEARS"
)


derive_vars_merged(
  adsl,
  dataset_add = vs,
  by_vars = exprs(STUDYID, USUBJID),
  order = exprs(convert_dtc_to_dtm(VSDTC)),
  mode = "last",
  new_vars = exprs(LASTWGT = VSSTRESN, LASTWGTU = VSSTRESU),
  filter_add = VSTESTCD == "WEIGHT",
  exist_flag = vsdatafl
) %>%
  select(STUDYID, USUBJID, AGE, AGEU, LASTWGT, LASTWGTU, vsdatafl)
#> # A tibble: 2 × 7
#>   STUDYID USUBJID   AGE AGEU  LASTWGT LASTWGTU vsdatafl
#>   <chr>   <chr>   <dbl> <chr>   <dbl> <chr>    <chr>   
#> 1 PILOT01 01-1302    61 YEARS    82.1 kg       Y       
#> 2 PILOT01 17-1344    64 YEARS    57.8 kg       Y       


# Derive treatment start datetime (TRTSDTM)
ex <- tribble(
  ~STUDYID,  ~DOMAIN,  ~USUBJID, ~EXSTDY, ~EXENDY,     ~EXSTDTC,     ~EXENDTC,
  "PILOT01",    "EX", "01-1302",       1,      18, "2013-08-29", "2013-09-15",
  "PILOT01",    "EX", "01-1302",      19,      69, "2013-09-16", "2013-11-05",
  "PILOT01",    "EX", "17-1344",       1,      14, "2014-01-11", "2014-01-24",
  "PILOT01",    "EX", "17-1344",      15,      63, "2014-01-25", "2014-03-14"
)
## Impute exposure start date to first date/time
ex_ext <- derive_vars_dtm(
  ex,
  dtc = EXSTDTC,
  new_vars_prefix = "EXST",
  highest_imputation = "M",
)
## Add first exposure datetime and imputation flags to adsl
derive_vars_merged(
  select(dm, STUDYID, USUBJID),
  dataset_add = ex_ext,
  by_vars = exprs(STUDYID, USUBJID),
  new_vars = exprs(TRTSDTM = EXSTDTM, TRTSDTF = EXSTDTF, TRTSTMF = EXSTTMF),
  order = exprs(EXSTDTM),
  mode = "first"
)
#> # A tibble: 2 × 5
#>   STUDYID USUBJID TRTSDTM             TRTSDTF TRTSTMF
#>   <chr>   <chr>   <dttm>              <chr>   <chr>  
#> 1 PILOT01 01-1302 2013-08-29 00:00:00 NA      H      
#> 2 PILOT01 17-1344 2014-01-11 00:00:00 NA      H      

# Derive treatment end datetime (TRTEDTM)
## Impute exposure end datetime to last time, no date imputation
ex_ext <- derive_vars_dtm(
  ex,
  dtc = EXENDTC,
  new_vars_prefix = "EXEN",
  time_imputation = "last",
)
## Add last exposure datetime and imputation flag to adsl
derive_vars_merged(
  select(adsl, STUDYID, USUBJID),
  dataset_add = ex_ext,
  filter_add = !is.na(EXENDTM),
  by_vars = exprs(STUDYID, USUBJID),
  new_vars = exprs(TRTEDTM = EXENDTM, TRTETMF = EXENTMF),
  order = exprs(EXENDTM),
  mode = "last"
)
#> # A tibble: 2 × 4
#>   STUDYID USUBJID TRTEDTM             TRTETMF
#>   <chr>   <chr>   <dttm>              <chr>  
#> 1 PILOT01 01-1302 2013-11-05 23:59:59 H      
#> 2 PILOT01 17-1344 2014-03-14 23:59:59 H      
# Modify merged values and set value for non matching observations
adsl <- tribble(
  ~USUBJID, ~SEX, ~COUNTRY,
  "ST42-1", "F",  "AUT",
  "ST42-2", "M",  "MWI",
  "ST42-3", "M",  "NOR",
  "ST42-4", "F",  "UGA"
)

advs <- tribble(
  ~USUBJID, ~PARAMCD, ~AVISIT,    ~AVISITN, ~AVAL,
  "ST42-1", "WEIGHT", "BASELINE",        0,    66,
  "ST42-1", "WEIGHT", "WEEK 2",          1,    68,
  "ST42-2", "WEIGHT", "BASELINE",        0,    88,
  "ST42-3", "WEIGHT", "WEEK 2",          1,    55,
  "ST42-3", "WEIGHT", "WEEK 4",          2,    50
)

derive_vars_merged(
  adsl,
  dataset_add = advs,
  by_vars = exprs(USUBJID),
  new_vars = exprs(
    LSTVSCAT = if_else(AVISIT == "BASELINE", "BASELINE", "POST-BASELINE")
  ),
  order = exprs(AVISITN),
  mode = "last",
  missing_values = exprs(LSTVSCAT = "MISSING")
)
#> # A tibble: 4 × 4
#>   USUBJID SEX   COUNTRY LSTVSCAT     
#>   <chr>   <chr> <chr>   <chr>        
#> 1 ST42-1  F     AUT     POST-BASELINE
#> 2 ST42-2  M     MWI     BASELINE     
#> 3 ST42-3  M     NOR     POST-BASELINE
#> 4 ST42-4  F     UGA     MISSING