
Add New Variable(s) to the Input Dataset Based on Variables from Another Dataset
Source:R/derive_merged.R
derive_vars_merged.RdAdd new variable(s) to the input dataset based on variables from another
dataset. The observations to merge can be selected by a condition
(filter_add argument) and/or selecting the first or last observation for
each by group (order and mode argument).
Usage
derive_vars_merged(
dataset,
dataset_add,
by_vars,
order = NULL,
new_vars = NULL,
filter_add = NULL,
mode = NULL,
exist_flag = NULL,
true_value = "Y",
false_value = NA_character_,
missing_values = NULL,
check_type = "warning",
duplicate_msg = NULL,
relationship = NULL
)Arguments
- dataset
-
Input dataset
The variables specified by the
by_varsargument are expected to be in the dataset.- Permitted values
a dataset, i.e., a
data.frameor tibble- Default value
none
- dataset_add
-
Additional dataset
The variables specified by the
by_vars, thenew_vars, and theorderargument are expected.- Permitted values
a dataset, i.e., a
data.frameor tibble- Default value
none
- by_vars
-
Grouping variables
The input dataset and the selected observations from the additional dataset are merged by the specified variables.
Variables can be renamed by naming the element, i.e.
by_vars = exprs(<name in input dataset> = <name in additional dataset>), similar to thedplyrjoins.- Permitted values
list of variables created by
exprs(), e.g.,exprs(USUBJID, VISIT)- Default value
none
- order
-
Sort order
If the argument is set to a non-null value, for each by group the first or last observation from the additional dataset is selected with respect to the specified order.
Variables defined by the
new_varsargument can be used in the sort order.For handling of
NAs in sorting variables see Sort Order.- Permitted values
list of variables created by
exprs(), e.g.,exprs(USUBJID, VISIT)- Default value
NULL
- new_vars
-
Variables to add
The specified variables from the additional dataset are added to the output dataset. Variables can be renamed by naming the element, i.e.,
new_vars = exprs(<new name> = <old name>).For example
new_vars = exprs(var1, var2)adds variablesvar1andvar2fromdataset_addto the input dataset.And
new_vars = exprs(var1, new_var2 = old_var2)takesvar1andold_var2fromdataset_addand adds them to the input dataset renamingold_var2tonew_var2.Values of the added variables can be modified by specifying an expression. For example,
new_vars = LASTRSP = exprs(str_to_upper(AVALC))adds the variableLASTRSPto the dataset and sets it to the upper case value ofAVALC.If the argument is not specified or set to
NULL, all variables from the additional dataset (dataset_add) are added. In the case when a variable exists in both datasets, an error is issued to ensure the user either adds toby_vars, removes or renames.- Permitted values
list of variables created by
exprs(), e.g.,exprs(USUBJID, VISIT)- Default value
NULL
- filter_add
-
Filter for additional dataset (
dataset_add)Only observations fulfilling the specified condition are taken into account for merging. If the argument is not specified, all observations are considered.
Variables defined by the
new_varsargument can be used in the filter condition.- Permitted values
an unquoted condition, e.g.,
AVISIT == "BASELINE"- Default value
NULL
- mode
-
Selection mode
Determines if the first or last observation is selected. If the
orderargument is specified,modemust be non-null.If the
orderargument is not specified, themodeargument is ignored.- Permitted values
"first","last"- Default value
NULL
- exist_flag
-
Exist flag
If the argument is specified (e.g.,
exist_flag = FLAG), the specified variable (e.g.,FLAG) is added to the input dataset. This variable will be the value provided intrue_valuefor all selected records fromdataset_addwhich are merged into the input dataset, and the value provided infalse_valueotherwise.- Permitted values
an unquoted symbol, e.g.,
AVAL- Default value
NULL
- true_value
-
True value
The value for the specified variable
exist_flag, applicable to the first or last observation (depending on the mode) of each by group.- Permitted values
a character scalar, i.e., a character vector of length one
- Default value
"Y"
- false_value
-
False value
The value for the specified variable
exist_flag, NOT applicable to the first or last observation (depending on the mode) of each by group.- Permitted values
a character scalar, i.e., a character vector of length one
- Default value
NA_character_
- missing_values
-
Values for non-matching observations
For observations of the input dataset (
dataset) which do not have a matching observation in the additional dataset (dataset_add) the values of the specified variables are set to the specified value. Only variables specified fornew_varscan be specified formissing_values.- Permitted values
list of expressions created by
exprs(), e.g.,exprs(BASEC = "MISSING", BASE = -1)- Default value
NULL
- check_type
-
Check uniqueness?
If
"warning","message", or"error"is specified, the specified message is issued if the observations of the (restricted) additional dataset are not unique with respect to the by variables and the order.If the
orderargument is not specified, thecheck_typeargument is ignored: if the observations of the (restricted) additional dataset are not unique with respect to the by variables, an error is issued.- Permitted values
"none","message","warning","error"- Default value
"warning"
- duplicate_msg
-
Message of unique check
If the uniqueness check fails, the specified message is displayed.
- Permitted values
a console message to be printed, e.g.
"Attention"or for longer messages usepaste("Line 1", "Line 2")- Default value
-
paste( "Dataset {.arg dataset_add} contains duplicate records with respect to", "{.var {vars2chr(by_vars)}}." )
- relationship
-
Expected merge-relationship between the
by_varsvariable(s) indataset(input dataset) and thedataset_add(additional dataset) containing the additionalnew_vars.This argument is passed to the
dplyr::left_join()function. See https://dplyr.tidyverse.org/reference/mutate-joins.html#arguments for more details.- Permitted values
"one-to-one","many-to-one"- Default value
NULL
Value
The output dataset contains all observations and variables of the
input dataset and additionally the variables specified for new_vars from
the additional dataset (dataset_add).
Details
The new variables (
new_vars) are added to the additional dataset (dataset_add).The records from the additional dataset (
dataset_add) are restricted to those matching thefilter_addcondition.If
orderis specified, for each by group the first or last observation (depending onmode) is selected.The variables specified for
new_varsare merged to the input dataset usingleft_join(). I.e., the output dataset contains all observations from the input dataset. For observations without a matching observation in the additional dataset the new variables are set as specified bymissing_values(or toNAfor variables not inmissing_values). Observations in the additional dataset which have no matching observation in the input dataset are ignored.
See also
General Derivation Functions for all ADaMs that returns variable appended to dataset:
derive_var_extreme_flag(),
derive_var_joined_exist_flag(),
derive_var_merged_ef_msrc(),
derive_var_merged_exist_flag(),
derive_var_merged_summary(),
derive_var_obs_number(),
derive_var_relative_flag(),
derive_vars_cat(),
derive_vars_computed(),
derive_vars_joined(),
derive_vars_merged_lookup(),
derive_vars_transposed()
Examples
Note on usage versus derive_vars_joined()
The question between using derive_vars_merged() or the more powerful
derive_vars_joined() comes down to how you need to select the observations
to be merged.
If the observations from
dataset_addto merge can be selected by a condition (filter_add) using only variables fromdataset_add, then always usederive_vars_merged()as it requires less resources (time and memory). A common example of this would be a randomization date inADSL, where you are simply merging on a date fromDSaccording to a certainDSDECODcondition such asDSDECOD == "RANDOMIZATION".However, if the selection of the observations from
dataset_addcan depend on variables from both datasets, then usederive_vars_joined(). An example of this would be assigning period variables fromADSLto anADAE, where you now need to check each adverse event start date against the period start and end dates to decide which period value to join.
Basic merge of a full dataset
Merge all demographic variables onto a vital signs dataset.
The variable
DOMAINexists in both datasets so note the use ofselect(dm, -DOMAIN)in thedataset_addargument. Without this an error would be issued to notify the user.
library(tibble)
library(dplyr, warn.conflicts = FALSE)
vs <- tribble(
~DOMAIN, ~USUBJID, ~VSTESTCD, ~VISIT, ~VSSTRESN, ~VSDTC,
"VS", "01", "HEIGHT", "SCREENING", 178.0, "2013-08-20",
"VS", "01", "WEIGHT", "SCREENING", 81.9, "2013-08-20",
"VS", "01", "WEIGHT", "BASELINE", 82.1, "2013-08-29",
"VS", "01", "WEIGHT", "WEEK 2", 81.9, "2013-09-15",
"VS", "01", "WEIGHT", "WEEK 4", 82.6, "2013-09-24",
"VS", "02", "WEIGHT", "BASELINE", 58.6, "2014-01-11"
) %>%
mutate(STUDYID = "AB42")
dm <- tribble(
~DOMAIN, ~USUBJID, ~AGE, ~AGEU,
"DM", "01", 61, "YEARS",
"DM", "02", 64, "YEARS",
"DM", "03", 85, "YEARS"
) %>%
mutate(STUDYID = "AB42")
derive_vars_merged(
vs,
dataset_add = select(dm, -DOMAIN),
by_vars = exprs(STUDYID, USUBJID)
) %>%
select(USUBJID, VSTESTCD, VISIT, VSSTRESN, AGE, AGEU)
#> # A tibble: 6 × 6
#> USUBJID VSTESTCD VISIT VSSTRESN AGE AGEU
#> <chr> <chr> <chr> <dbl> <dbl> <chr>
#> 1 01 HEIGHT SCREENING 178 61 YEARS
#> 2 01 WEIGHT SCREENING 81.9 61 YEARS
#> 3 01 WEIGHT BASELINE 82.1 61 YEARS
#> 4 01 WEIGHT WEEK 2 81.9 61 YEARS
#> 5 01 WEIGHT WEEK 4 82.6 61 YEARS
#> 6 02 WEIGHT BASELINE 58.6 64 YEARS
Merge only the first/last value (order and mode)
Merge the last occurring weight for each subject to the demographics dataset.
To enable sorting by visit date
convert_dtc_to_dtm()is used to convert to a datetime, within theorderargument.Then the
modeargument is set to"last"to ensure the last sorted value is taken. Be cautious ifNAvalues are possible in theordervariables - see Sort Order.The
filter_addargument is used to restrict the vital signs records only to weight assessments.
derive_vars_merged(
dm,
dataset_add = vs,
by_vars = exprs(STUDYID, USUBJID),
order = exprs(convert_dtc_to_dtm(VSDTC)),
mode = "last",
new_vars = exprs(LSTWT = VSSTRESN),
filter_add = VSTESTCD == "WEIGHT"
) %>%
select(USUBJID, AGE, AGEU, LSTWT)
#> # A tibble: 3 × 4
#> USUBJID AGE AGEU LSTWT
#> <chr> <dbl> <chr> <dbl>
#> 1 01 61 YEARS 82.6
#> 2 02 64 YEARS 58.6
#> 3 03 85 YEARS NA
Handling duplicates (check_type)
The source records are checked regarding duplicates with respect to the by variables and the order specified. By default, a warning is issued if any duplicates are found. Note the results here with a new vital signs dataset containing a duplicate last weight assessment date.
vs_dup <- tribble(
~DOMAIN, ~USUBJID, ~VSTESTCD, ~VISIT, ~VSSTRESN, ~VSDTC,
"VS", "01", "WEIGHT", "WEEK 2", 81.1, "2013-09-24",
"VS", "01", "WEIGHT", "WEEK 4", 82.6, "2013-09-24"
) %>%
mutate(STUDYID = "AB42")
derive_vars_merged(
dm,
dataset_add = vs_dup,
by_vars = exprs(STUDYID, USUBJID),
order = exprs(convert_dtc_to_dtm(VSDTC)),
mode = "last",
new_vars = exprs(LSTWT = VSSTRESN),
filter_add = VSTESTCD == "WEIGHT"
) %>%
select(USUBJID, AGE, AGEU, LSTWT)
#> # A tibble: 3 × 4
#> USUBJID AGE AGEU LSTWT
#> <chr> <dbl> <chr> <dbl>
#> 1 01 61 YEARS 82.6
#> 2 02 64 YEARS NA
#> 3 03 85 YEARS NA
#> Warning: Dataset contains duplicate records with respect to `STUDYID`, `USUBJID`, and
#> `convert_dtc_to_dtm(VSDTC)`
#> ℹ Run `admiral::get_duplicates_dataset()` to access the duplicate records
For investigating the issue, the dataset of the duplicate source records
can be obtained by calling get_duplicates_dataset():
get_duplicates_dataset()
#> Duplicate records with respect to `STUDYID`, `USUBJID`, and
#> `convert_dtc_to_dtm(VSDTC)`.
#> # A tibble: 2 × 9
#> STUDYID USUBJID convert_dtc_to_dtm(VSDT…¹ DOMAIN VSTESTCD VISIT VSSTRESN VSDTC
#> * <chr> <chr> <dttm> <chr> <chr> <chr> <dbl> <chr>
#> 1 AB42 01 2013-09-24 00:00:00 VS WEIGHT WEEK… 81.1 2013…
#> 2 AB42 01 2013-09-24 00:00:00 VS WEIGHT WEEK… 82.6 2013…
#> # ℹ abbreviated name: ¹`convert_dtc_to_dtm(VSDTC)`
#> # ℹ 1 more variable: LSTWT <dbl>
Common options to solve the issue:
Specifying additional variables for
order- this is the most common approach, adding something like a sequence variable.Restricting the source records by specifying/updating the
filter_addargument.Setting
check_type = "none"to ignore any duplicates, but then in this case the last occurring record would be chosen according to the sort order of the inputdataset_add. This is not often advisable, unless the order has no impact on the result, as the temporary sort order can be prone to variation across an ADaM script.
Modify values dependent on the merge (new_vars and missing_values)
For the last occurring weight for each subject, add a categorization of which visit it occurred at to the demographics dataset.
In the
new_varsargument, other functions can be utilized to modify the merged values. For example, in the below case we want to categorize the visit as"BASELINE"or"POST-BASELINE"usingif_else().The
missing_valuesargument assigns a specific value for subjects with no matching observations - see subject"03"in the below example.
derive_vars_merged(
dm,
dataset_add = vs,
by_vars = exprs(STUDYID, USUBJID),
order = exprs(convert_dtc_to_dtm(VSDTC)),
mode = "last",
new_vars = exprs(
LSTWTCAT = if_else(VISIT == "BASELINE", "BASELINE", "POST-BASELINE")
),
filter_add = VSTESTCD == "WEIGHT",
missing_values = exprs(LSTWTCAT = "MISSING")
) %>%
select(USUBJID, AGE, AGEU, LSTWTCAT)
#> # A tibble: 3 × 4
#> USUBJID AGE AGEU LSTWTCAT
#> <chr> <dbl> <chr> <chr>
#> 1 01 61 YEARS POST-BASELINE
#> 2 02 64 YEARS BASELINE
#> 3 03 85 YEARS MISSING
Check existence of records to merge (exist_flag, true_value and false_value)
Similar to the above example, now we prefer to have a separate flag variable to show whether a selected record was merged.
The name of the new variable is set with the
exist_flagargument.The values of this new variable are assigned via the
true_valueandfalse_valuearguments.
derive_vars_merged(
dm,
dataset_add = vs,
by_vars = exprs(STUDYID, USUBJID),
order = exprs(convert_dtc_to_dtm(VSDTC)),
mode = "last",
new_vars = exprs(
LSTWTCAT = if_else(VISIT == "BASELINE", "BASELINE", "POST-BASELINE")
),
filter_add = VSTESTCD == "WEIGHT",
exist_flag = WTCHECK,
true_value = "Y",
false_value = "MISSING"
) %>%
select(USUBJID, AGE, AGEU, LSTWTCAT, WTCHECK)
#> # A tibble: 3 × 5
#> USUBJID AGE AGEU LSTWTCAT WTCHECK
#> <chr> <dbl> <chr> <chr> <chr>
#> 1 01 61 YEARS POST-BASELINE Y
#> 2 02 64 YEARS BASELINE Y
#> 3 03 85 YEARS <NA> MISSING
Creating more than one variable from the merge (new_vars)
Derive treatment start datetime and associated imputation flags.
In this example we first impute exposure datetime and associated flag variables as a separate first step to be used in the
orderargument.In the
new_varsarguments, you can see how both datetime and the date and time imputation flags are all merged in one call.
ex <- tribble(
~DOMAIN, ~USUBJID, ~EXSTDTC,
"EX", "01", "2013-08-29",
"EX", "01", "2013-09-16",
"EX", "02", "2014-01-11",
"EX", "02", "2014-01-25"
) %>%
mutate(STUDYID = "AB42")
ex_ext <- derive_vars_dtm(
ex,
dtc = EXSTDTC,
new_vars_prefix = "EXST",
highest_imputation = "M"
)
derive_vars_merged(
dm,
dataset_add = ex_ext,
by_vars = exprs(STUDYID, USUBJID),
new_vars = exprs(TRTSDTM = EXSTDTM, TRTSDTF = EXSTDTF, TRTSTMF = EXSTTMF),
order = exprs(EXSTDTM),
mode = "first"
) %>%
select(USUBJID, TRTSDTM, TRTSDTF, TRTSTMF)
#> # A tibble: 3 × 4
#> USUBJID TRTSDTM TRTSDTF TRTSTMF
#> <chr> <dttm> <chr> <chr>
#> 1 01 2013-08-29 00:00:00 <NA> H
#> 2 02 2014-01-11 00:00:00 <NA> H
#> 3 03 NA <NA> <NA>
Further examples
Further example usages of this function can be found in the Generic Derivations vignette.