Add New Records Within By Groups Using Aggregation Functions

It is not uncommon to have an analysis need whereby one needs to derive an analysis value (AVAL) from multiple records. The ADaM basic dataset structure variable DTYPE is available to indicate when a new derived records has been added to a dataset.

Usage

derive_summary_records(
  dataset = NULL,
  dataset_add,
  dataset_ref = NULL,
  by_vars,
  filter_add = NULL,
  set_values_to,
  missing_values = NULL
)

Arguments

dataset

Input dataset

The variables specified by the by_vars argument are expected to be in the dataset.

dataset_add

Additional dataset

The variables specified for by_vars are expected. Observations from the specified dataset are going to be used to calculate and added as new records to the input dataset (dataset).

dataset_ref

Reference dataset

The variables specified for by_vars are expected. For each observation of the specified dataset a new observation is added to the input dataset.

by_vars

Grouping variables

Variables to consider for generation of groupwise summary records. Providing the names of variables in exprs() will create a groupwise summary and generate summary records for the specified groups.

Permitted Values: list of variables created by exprs() e.g. exprs(USUBJID, VISIT)

filter_add

Filter condition as logical expression to apply during summary calculation. By default, filtering expressions are computed within by_vars as this will help when an aggregating, lagging, or ranking function is involved.

For example,

filter_add = (AVAL > mean(AVAL, na.rm = TRUE)) will filter all AVAL values greater than mean of AVAL with in by_vars.
filter_add = (dplyr::n() > 2) will filter n count of by_vars greater than 2.

set_values_to

Variables to be set

The specified variables are set to the specified values for the new observations.

Set a list of variables to some specified value for the new records

LHS refer to a variable.
RHS refers to the values to set to the variable. This can be a string, a symbol, a numeric value, an expression or NA. If summary functions are used, the values are summarized by the variables specified for by_vars.

For example:

  set_values_to = exprs(
    AVAL = sum(AVAL),
    DTYPE = "AVERAGE",
  )

missing_values

Values for missing summary values

For observations of the reference dataset (dataset_ref) which do not have a complete mapping defined by the summarization defined in set_values_to. Only variables specified for set_values_to can be specified for missing_values.

Permitted Values: named list of expressions, e.g., exprs(AVAL = -9999)

Value

A data frame with derived records appended to original dataset.

Details

When all records have same values within by_vars then this function will retain those common values in the newly derived records. Otherwise new value will be set to NA.

Examples

library(tibble)
library(dplyr)

adeg <- tribble(
  ~USUBJID,   ~EGSEQ, ~PARAM,             ~AVISIT,    ~EGDTC,             ~AVAL, ~TRTA,
  "XYZ-1001", 1,      "QTcF Int. (msec)", "Baseline", "2016-02-24T07:50", 385,   NA_character_,
  "XYZ-1001", 2,      "QTcF Int. (msec)", "Baseline", "2016-02-24T07:52", 399,   NA_character_,
  "XYZ-1001", 3,      "QTcF Int. (msec)", "Baseline", "2016-02-24T07:56", 396,   NA_character_,
  "XYZ-1001", 4,      "QTcF Int. (msec)", "Visit 2",  "2016-03-08T09:45", 384,   "Placebo",
  "XYZ-1001", 5,      "QTcF Int. (msec)", "Visit 2",  "2016-03-08T09:48", 393,   "Placebo",
  "XYZ-1001", 6,      "QTcF Int. (msec)", "Visit 2",  "2016-03-08T09:51", 388,   "Placebo",
  "XYZ-1001", 7,      "QTcF Int. (msec)", "Visit 3",  "2016-03-22T10:45", 385,   "Placebo",
  "XYZ-1001", 8,      "QTcF Int. (msec)", "Visit 3",  "2016-03-22T10:48", 394,   "Placebo",
  "XYZ-1001", 9,      "QTcF Int. (msec)", "Visit 3",  "2016-03-22T10:51", 402,   "Placebo",
  "XYZ-1002", 1,      "QTcF Int. (msec)", "Baseline", "2016-02-22T07:58", 399,   NA_character_,
  "XYZ-1002", 2,      "QTcF Int. (msec)", "Baseline", "2016-02-22T07:58", 410,   NA_character_,
  "XYZ-1002", 3,      "QTcF Int. (msec)", "Baseline", "2016-02-22T08:01", 392,   NA_character_,
  "XYZ-1002", 4,      "QTcF Int. (msec)", "Visit 2",  "2016-03-06T09:50", 401,   "Active 20mg",
  "XYZ-1002", 5,      "QTcF Int. (msec)", "Visit 2",  "2016-03-06T09:53", 407,   "Active 20mg",
  "XYZ-1002", 6,      "QTcF Int. (msec)", "Visit 2",  "2016-03-06T09:56", 400,   "Active 20mg",
  "XYZ-1002", 7,      "QTcF Int. (msec)", "Visit 3",  "2016-03-24T10:50", 412,   "Active 20mg",
  "XYZ-1002", 8,      "QTcF Int. (msec)", "Visit 3",  "2016-03-24T10:53", 414,   "Active 20mg",
  "XYZ-1002", 9,      "QTcF Int. (msec)", "Visit 3",  "2016-03-24T10:56", 402,   "Active 20mg"
) %>%
  mutate(
    ADTM = convert_dtc_to_dtm(EGDTC)
  )

# Summarize the average of the triplicate ECG interval values (AVAL)
derive_summary_records(
  adeg,
  dataset_add = adeg,
  by_vars = exprs(USUBJID, PARAM, AVISIT),
  set_values_to = exprs(
    AVAL = mean(AVAL, na.rm = TRUE),
    DTYPE = "AVERAGE"
  )
) %>%
  arrange(USUBJID, AVISIT)
#> # A tibble: 24 × 9
#>    USUBJID  EGSEQ PARAM       AVISIT EGDTC  AVAL TRTA  ADTM                DTYPE
#>    <chr>    <dbl> <chr>       <chr>  <chr> <dbl> <chr> <dttm>              <chr>
#>  1 XYZ-1001     1 QTcF Int. … Basel… 2016…  385  NA    2016-02-24 07:50:00 NA   
#>  2 XYZ-1001     2 QTcF Int. … Basel… 2016…  399  NA    2016-02-24 07:52:00 NA   
#>  3 XYZ-1001     3 QTcF Int. … Basel… 2016…  396  NA    2016-02-24 07:56:00 NA   
#>  4 XYZ-1001    NA QTcF Int. … Basel… NA     393. NA    NA                  AVER…
#>  5 XYZ-1001     4 QTcF Int. … Visit… 2016…  384  Plac… 2016-03-08 09:45:00 NA   
#>  6 XYZ-1001     5 QTcF Int. … Visit… 2016…  393  Plac… 2016-03-08 09:48:00 NA   
#>  7 XYZ-1001     6 QTcF Int. … Visit… 2016…  388  Plac… 2016-03-08 09:51:00 NA   
#>  8 XYZ-1001    NA QTcF Int. … Visit… NA     388. NA    NA                  AVER…
#>  9 XYZ-1001     7 QTcF Int. … Visit… 2016…  385  Plac… 2016-03-22 10:45:00 NA   
#> 10 XYZ-1001     8 QTcF Int. … Visit… 2016…  394  Plac… 2016-03-22 10:48:00 NA   
#> # ℹ 14 more rows

# Derive more than one summary variable
derive_summary_records(
  adeg,
  dataset_add = adeg,
  by_vars = exprs(USUBJID, PARAM, AVISIT),
  set_values_to = exprs(
    AVAL = mean(AVAL),
    ADTM = max(ADTM),
    DTYPE = "AVERAGE"
  )
) %>%
  arrange(USUBJID, AVISIT) %>%
  select(-EGSEQ, -TRTA)
#> # A tibble: 24 × 7
#>    USUBJID  PARAM            AVISIT   EGDTC       AVAL ADTM                DTYPE
#>    <chr>    <chr>            <chr>    <chr>      <dbl> <dttm>              <chr>
#>  1 XYZ-1001 QTcF Int. (msec) Baseline 2016-02-2…  385  2016-02-24 07:50:00 NA   
#>  2 XYZ-1001 QTcF Int. (msec) Baseline 2016-02-2…  399  2016-02-24 07:52:00 NA   
#>  3 XYZ-1001 QTcF Int. (msec) Baseline 2016-02-2…  396  2016-02-24 07:56:00 NA   
#>  4 XYZ-1001 QTcF Int. (msec) Baseline NA          393. 2016-02-24 07:56:00 AVER…
#>  5 XYZ-1001 QTcF Int. (msec) Visit 2  2016-03-0…  384  2016-03-08 09:45:00 NA   
#>  6 XYZ-1001 QTcF Int. (msec) Visit 2  2016-03-0…  393  2016-03-08 09:48:00 NA   
#>  7 XYZ-1001 QTcF Int. (msec) Visit 2  2016-03-0…  388  2016-03-08 09:51:00 NA   
#>  8 XYZ-1001 QTcF Int. (msec) Visit 2  NA          388. 2016-03-08 09:51:00 AVER…
#>  9 XYZ-1001 QTcF Int. (msec) Visit 3  2016-03-2…  385  2016-03-22 10:45:00 NA   
#> 10 XYZ-1001 QTcF Int. (msec) Visit 3  2016-03-2…  394  2016-03-22 10:48:00 NA   
#> # ℹ 14 more rows

# Sample ADEG dataset with triplicate record for only AVISIT = 'Baseline'
adeg <- tribble(
  ~USUBJID,   ~EGSEQ, ~PARAM,             ~AVISIT,    ~EGDTC,             ~AVAL, ~TRTA,
  "XYZ-1001", 1,      "QTcF Int. (msec)", "Baseline", "2016-02-24T07:50", 385,   NA_character_,
  "XYZ-1001", 2,      "QTcF Int. (msec)", "Baseline", "2016-02-24T07:52", 399,   NA_character_,
  "XYZ-1001", 3,      "QTcF Int. (msec)", "Baseline", "2016-02-24T07:56", 396,   NA_character_,
  "XYZ-1001", 4,      "QTcF Int. (msec)", "Visit 2",  "2016-03-08T09:48", 393,   "Placebo",
  "XYZ-1001", 5,      "QTcF Int. (msec)", "Visit 2",  "2016-03-08T09:51", 388,   "Placebo",
  "XYZ-1001", 6,      "QTcF Int. (msec)", "Visit 3",  "2016-03-22T10:48", 394,   "Placebo",
  "XYZ-1001", 7,      "QTcF Int. (msec)", "Visit 3",  "2016-03-22T10:51", 402,   "Placebo",
  "XYZ-1002", 1,      "QTcF Int. (msec)", "Baseline", "2016-02-22T07:58", 399,   NA_character_,
  "XYZ-1002", 2,      "QTcF Int. (msec)", "Baseline", "2016-02-22T07:58", 410,   NA_character_,
  "XYZ-1002", 3,      "QTcF Int. (msec)", "Baseline", "2016-02-22T08:01", 392,   NA_character_,
  "XYZ-1002", 4,      "QTcF Int. (msec)", "Visit 2",  "2016-03-06T09:53", 407,   "Active 20mg",
  "XYZ-1002", 5,      "QTcF Int. (msec)", "Visit 2",  "2016-03-06T09:56", 400,   "Active 20mg",
  "XYZ-1002", 6,      "QTcF Int. (msec)", "Visit 3",  "2016-03-24T10:53", 414,   "Active 20mg",
  "XYZ-1002", 7,      "QTcF Int. (msec)", "Visit 3",  "2016-03-24T10:56", 402,   "Active 20mg"
)

# Compute the average of AVAL only if there are more than 2 records within the
# by group
derive_summary_records(
  adeg,
  dataset_add = adeg,
  by_vars = exprs(USUBJID, PARAM, AVISIT),
  filter_add = n() > 2,
  set_values_to = exprs(
    AVAL = mean(AVAL, na.rm = TRUE),
    DTYPE = "AVERAGE"
  )
) %>%
  arrange(USUBJID, AVISIT)
#> # A tibble: 16 × 8
#>    USUBJID  EGSEQ PARAM            AVISIT   EGDTC             AVAL TRTA    DTYPE
#>    <chr>    <dbl> <chr>            <chr>    <chr>            <dbl> <chr>   <chr>
#>  1 XYZ-1001     1 QTcF Int. (msec) Baseline 2016-02-24T07:50  385  NA      NA   
#>  2 XYZ-1001     2 QTcF Int. (msec) Baseline 2016-02-24T07:52  399  NA      NA   
#>  3 XYZ-1001     3 QTcF Int. (msec) Baseline 2016-02-24T07:56  396  NA      NA   
#>  4 XYZ-1001    NA QTcF Int. (msec) Baseline NA                393. NA      AVER…
#>  5 XYZ-1001     4 QTcF Int. (msec) Visit 2  2016-03-08T09:48  393  Placebo NA   
#>  6 XYZ-1001     5 QTcF Int. (msec) Visit 2  2016-03-08T09:51  388  Placebo NA   
#>  7 XYZ-1001     6 QTcF Int. (msec) Visit 3  2016-03-22T10:48  394  Placebo NA   
#>  8 XYZ-1001     7 QTcF Int. (msec) Visit 3  2016-03-22T10:51  402  Placebo NA   
#>  9 XYZ-1002     1 QTcF Int. (msec) Baseline 2016-02-22T07:58  399  NA      NA   
#> 10 XYZ-1002     2 QTcF Int. (msec) Baseline 2016-02-22T07:58  410  NA      NA   
#> 11 XYZ-1002     3 QTcF Int. (msec) Baseline 2016-02-22T08:01  392  NA      NA   
#> 12 XYZ-1002    NA QTcF Int. (msec) Baseline NA                400. NA      AVER…
#> 13 XYZ-1002     4 QTcF Int. (msec) Visit 2  2016-03-06T09:53  407  Active… NA   
#> 14 XYZ-1002     5 QTcF Int. (msec) Visit 2  2016-03-06T09:56  400  Active… NA   
#> 15 XYZ-1002     6 QTcF Int. (msec) Visit 3  2016-03-24T10:53  414  Active… NA   
#> 16 XYZ-1002     7 QTcF Int. (msec) Visit 3  2016-03-24T10:56  402  Active… NA