DM

Introduction

This article describes how to create a demographics (DM) domain using the {sdtm.oak} package.

Before reading this article, users are encouraged to review two articles in the {sdtm.oak} package documentation that introduce the key concepts: Algorithms & Sub-Algorithms (https://pharmaverse.github.io/sdtm.oak/articles/algorithms.html) and Creating an Interventions Domain (https://pharmaverse.github.io/sdtm.oak/articles/interventions_domain.html). The latter explains concepts such as oak_id_vars and condition_add in detail, offers guidance on which mapping algorithms or functions to use for different mappings, and describes how those algorithms and functions work.

In this article, we will dive directly into programming and provide further explanation only where it is required.

Programming workflow

Read in data
Create oak_id_vars
Read in CT
Create reference dates configuration file
Map Topic Variable
Map Rest of the Variables
Map Reference Date Variables
Create SDTM derived variables
Add Labels and Attributes

Read in data

Read all the raw datasets into the environment. In this example, the raw datasets needed are ec_raw, ds_raw and dm_raw. Users can read them from the {pharmaverseraw} package using the code below:

library(sdtm.oak)
library(pharmaverseraw)
library(dplyr)

dm_raw <- pharmaverseraw::dm_raw
ds_raw <- pharmaverseraw::ds_raw
ec_raw <- pharmaverseraw::ec_raw

Demographics Raw dataset.

Sample of Data

Disposition Raw dataset.

Sample of Data

Study Drug Administration Raw dataset.

Sample of Data

SDTM aCRF

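The SDTM annotated CRFs (aCRFs) for the raw datasets are linked below:

Demographics aCRF: https://github.com/pharmaverse/pharmaverseraw/blob/main/vignettes/articles/aCRFs/Demographics_aCRF.pdf

Exposure_as_collected aCRF: https://github.com/pharmaverse/pharmaverseraw/blob/main/vignettes/articles/aCRFs/Exposure_as_collected_aCRF.pdf

Subject_Disposition_aCRF: https://github.com/pharmaverse/pharmaverseraw/blob/main/vignettes/articles/aCRFs/Subject_Disposition_aCRF.pdf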

Create oak_id_vars

dm_raw <- dm_raw %>%
  generate_oak_id_vars(
    pat_var = "PATNUM",
    raw_src = "dm_raw"
  )

ds_raw <- ds_raw %>%
  generate_oak_id_vars(
    pat_var = "PATNUM",
    raw_src = "ds_raw"
  )

ec_raw <- ec_raw %>%
  generate_oak_id_vars(
    pat_var = "PATNUM",
    raw_src = "ec_raw"
  )

For example, the Demographics raw dataset with oak_id_vars added:

Sample of Data
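
To make the three identifier columns concrete, the base R sketch below mimics what generate_oak_id_vars() is understood to add: oak_id as a row counter, raw_source as the dataset name, and patient_number copied from the patient variable. The helper add_id_vars_sketch() and the toy data are hypothetical and only illustrate the idea; they are not the package implementation.

```r
# Hypothetical stand-in for a raw dataset; all values are made up.
toy_raw <- data.frame(
  PATNUM = c("701-1015", "701-1023", "701-1028"),
  IT.AGE = c(63, 64, 71),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of {sdtm.oak}): prepend the three id columns.
add_id_vars_sketch <- function(raw_dat, pat_var, raw_src) {
  ids <- data.frame(
    oak_id = seq_len(nrow(raw_dat)),     # one running number per raw record
    raw_source = raw_src,                # name of the raw dataset
    patient_number = raw_dat[[pat_var]], # copy of the patient identifier
    stringsAsFactors = FALSE
  )
  cbind(ids, raw_dat)
}

toy_with_ids <- add_id_vars_sketch(toy_raw, pat_var = "PATNUM", raw_src = "toy_raw")
print(toy_with_ids)
```

Together, these three columns identify each raw record, which is why the later mapping calls pass id_vars = oak_id_vars().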

Read in CT

Controlled terminology (CT) is part of the SDTM specification and is prepared by the user. In this example, the study controlled terminology file is named sdtm_ct.csv. Users can read it using the code below, which assumes the file is stored in a metadata folder relative to the working directory:

study_ct <- read.csv("metadata/sdtm_ct.csv")

Sample of Data

Create reference dates configuration file

Create the reference date configuration file: a data frame that lists the raw variables to be used for calculating the reference dates. The data frame should have the columns listed below:

raw_dataset_name: Name of the raw dataset.
date_var: Date variable name from the raw dataset.
time_var: Time variable name from the raw dataset.
dformat: Format of the date collected in raw data.
tformat: Format of the time collected in raw data.
sdtm_var_name: Reference date variable name in DM domain where the raw variable is used.

ref_date_conf_df <- tibble::tribble(
  ~raw_dataset_name, ~date_var,     ~time_var,      ~dformat,      ~tformat, ~sdtm_var_name,
  "ec_raw",       "IT.ECSTDAT", NA_character_, "dd-mmm-yyyy", NA_character_,     "RFXSTDTC",
  "ec_raw",       "IT.ECENDAT", NA_character_, "dd-mmm-yyyy", NA_character_,     "RFXENDTC",
  "ec_raw",       "IT.ECSTDAT", NA_character_, "dd-mmm-yyyy", NA_character_,      "RFSTDTC",
  "ec_raw",       "IT.ECENDAT", NA_character_, "dd-mmm-yyyy", NA_character_,      "RFENDTC",
  "dm_raw",            "IC_DT", NA_character_,  "mm/dd/yyyy", NA_character_,      "RFICDTC",
  "ds_raw",          "DSDTCOL",     "DSTMCOL",  "mm-dd-yyyy",         "H:M",     "RFPENDTC",
  "ds_raw",          "DEATHDT", NA_character_,  "mm/dd/yyyy", NA_character_,       "DTHDTC"
)

Sample of Data
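
Because every reference date derivation depends on this configuration, a quick sanity check before use can catch typos early. The sketch below is one possible check in base R; check_ref_date_conf() and the toy configuration are hypothetical and are not provided by {sdtm.oak}.

```r
# Toy configuration using the column layout described above (values illustrative).
toy_conf <- data.frame(
  raw_dataset_name = c("ec_raw", "ec_raw", "ds_raw"),
  date_var = c("IT.ECSTDAT", "IT.ECENDAT", "DSDTCOL"),
  time_var = c(NA, NA, "DSTMCOL"),
  dformat = c("dd-mmm-yyyy", "dd-mmm-yyyy", "mm-dd-yyyy"),
  tformat = c(NA, NA, "H:M"),
  sdtm_var_name = c("RFSTDTC", "RFENDTC", "RFPENDTC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: report missing columns and rows where a time variable
# is given without a time format (or the other way round).
check_ref_date_conf <- function(conf) {
  required <- c(
    "raw_dataset_name", "date_var", "time_var",
    "dformat", "tformat", "sdtm_var_name"
  )
  missing_cols <- setdiff(required, names(conf))
  unpaired_time <- conf$sdtm_var_name[is.na(conf$time_var) != is.na(conf$tformat)]
  list(missing_cols = missing_cols, unpaired_time = unpaired_time)
}

conf_check <- check_ref_date_conf(toy_conf)
print(conf_check)
```

Both elements of the returned list are empty when the configuration passes these two checks.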

Map Topic Variable

In the DM domain, SUBJID is the topic variable. It can be mapped from PATNUM using a simple dplyr::mutate() statement.

dm <- dm_raw %>%
  mutate(
    SUBJID = substr(PATNUM, 5, 8)
  ) %>%
  select(oak_id, raw_source, patient_number, SUBJID)

Sample of Data
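
The substr() call keeps characters 5 through 8 of PATNUM. The short sketch below shows the effect on made-up values that assume a "site-subject" layout such as "701-1015"; the actual layout of PATNUM should be confirmed against the raw data. It also shows a pattern-based alternative that does not depend on fixed character positions.

```r
# Hypothetical PATNUM values; the real layout may differ.
patnum <- c("701-1015", "701-1023")

# Position-based extraction, as used in the mapping above.
subjid_by_position <- substr(patnum, 5, 8)

# Pattern-based alternative: drop everything up to and including the first hyphen.
subjid_by_pattern <- sub("^[^-]*-", "", patnum)

print(subjid_by_position)
print(subjid_by_pattern)
```

For these toy values both approaches return the same subject identifiers; the pattern-based form is more tolerant of site codes of varying length.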

Map Rest of the Variables

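Map the rest of the variables in the DM domain using sdtm.oak::assign_no_ct() or sdtm.oak::assign_ct(), depending on whether the variable has controlled terminology associated with it. AGEU is a constant value that is subject to controlled terminology, so it is mapped using sdtm.oak::hardcode_ct().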

dm <- dm %>%
  # Map AGE using assign_no_ct
  assign_no_ct(
    raw_dat = dm_raw,
    raw_var = "IT.AGE",
    tgt_var = "AGE",
    id_vars = oak_id_vars()
  ) %>%
  # Map AGEU using hardcode_ct
  hardcode_ct(
    raw_dat = dm_raw,
    raw_var = "IT.AGE",
    tgt_var = "AGEU",
    tgt_val = "Year",
    ct_spec = study_ct,
    ct_clst = "C66781",
    id_vars = oak_id_vars()
  ) %>%
  # Map SEX using assign_ct
  assign_ct(
    raw_dat = dm_raw,
    raw_var = "IT.SEX",
    tgt_var = "SEX",
    ct_spec = study_ct,
    ct_clst = "C66731",
    id_vars = oak_id_vars()
  ) %>%
  # Map ETHNIC using assign_ct
  assign_ct(
    raw_dat = dm_raw,
    raw_var = "IT.ETHNIC",
    tgt_var = "ETHNIC",
    ct_spec = study_ct,
    ct_clst = "C66790",
    id_vars = oak_id_vars()
  ) %>%
  # Map RACE using assign_ct
  assign_ct(
    raw_dat = dm_raw,
    raw_var = "IT.RACE",
    tgt_var = "RACE",
    ct_spec = study_ct,
    ct_clst = "C74457",
    id_vars = oak_id_vars()
  ) %>%
  # Map ARM using assign_ct
  assign_ct(
    raw_dat = dm_raw,
    raw_var = "PLANNED_ARM",
    tgt_var = "ARM",
    ct_spec = study_ct,
    ct_clst = "ARM",
    id_vars = oak_id_vars()
  ) %>%
  # Map ARMCD using assign_no_ct
  assign_no_ct(
    raw_dat = dm_raw,
    raw_var = "PLANNED_ARMCD",
    tgt_var = "ARMCD",
    id_vars = oak_id_vars()
  ) %>%
  # Map ACTARM using assign_ct
  assign_ct(
    raw_dat = dm_raw,
    raw_var = "ACTUAL_ARM",
    tgt_var = "ACTARM",
    ct_spec = study_ct,
    ct_clst = "ARM",
    id_vars = oak_id_vars()
  ) %>%
  # Map ACTARMCD using assign_no_ct
  assign_no_ct(
    raw_dat = dm_raw,
    raw_var = "ACTUAL_ARMCD",
    tgt_var = "ACTARMCD",
    id_vars = oak_id_vars()
  )

ℹ These terms could not be mapped per the controlled terminology: "Placebo" and "Screen Failure".
ℹ These terms could not be mapped per the controlled terminology: "Placebo" and "Screen Failure".

Sample of Data
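
The messages about terms that could not be mapped are easier to interpret with a simplified picture of a controlled terminology lookup. The base R sketch below is an illustration only: map_ct_sketch(), the toy codelist rows, and the column names codelist_code, collected_value and term_value are assumptions made for this example, and the real matching in {sdtm.oak} may be more elaborate (for instance, it may also consider synonyms).

```r
# Toy controlled terminology; rows and column names are assumptions.
toy_ct <- data.frame(
  codelist_code = c("C66731", "C66731", "ARM"),
  collected_value = c("Male", "Female", "Xanomeline High Dose"),
  term_value = c("M", "F", "Xanomeline High Dose"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: recode collected values within one codelist and
# report the values that have no match. Unmatched values are kept as collected.
map_ct_sketch <- function(x, ct, clst) {
  ct <- ct[ct$codelist_code == clst, ]
  idx <- match(x, ct$collected_value)
  mapped <- ifelse(is.na(idx), x, ct$term_value[idx])
  unmapped <- unique(x[is.na(idx) & !is.na(x)])
  list(values = mapped, unmapped = unmapped)
}

sex <- map_ct_sketch(c("Male", "Female"), toy_ct, "C66731")
arm <- map_ct_sketch(
  c("Xanomeline High Dose", "Placebo", "Screen Failure"),
  toy_ct, "ARM"
)

print(sex$values)
print(arm$unmapped)
```

In this sketch, "Placebo" and "Screen Failure" are reported as unmapped because the toy ARM codelist has no rows for them. When such a message appears in a real study, the usual remedy is to review the study controlled terminology file for the affected codelist.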

Map Reference Date Variables

Use sdtm.oak::oak_cal_ref_dates() to calculate the reference date variables in ISO 8601 format. The function takes the raw variable names from the reference date configuration file and calculates the minimum or maximum date based on the min_max parameter.

The variable RFSTDTC is the reference start date/time for the subject in ISO 8601 character format. It is usually equivalent to the date/time when the subject was first exposed to study treatment. As specified in the reference date configuration file, we therefore need the minimum IT.ECSTDAT date for each subject from the ec_raw dataset, so "min" is selected for the min_max parameter.

dm <- dm %>%
  # Derive RFSTDTC using oak_cal_ref_dates
  oak_cal_ref_dates(
    ds_in = .,
    der_var = "RFSTDTC",
    min_max = "min",
    ref_date_config_df = ref_date_conf_df,
    raw_source = list(
      ec_raw = ec_raw,
      ds_raw = ds_raw,
      dm_raw = dm_raw
    )
  )

The variable RFENDTC is the reference end date/time for the subject in ISO 8601 character format. It is usually equivalent to the date/time when the subject was determined to have ended the trial, and often equivalent to the date/time of last exposure to study treatment. As specified in the reference date configuration file, we need the maximum IT.ECENDAT date for each subject from the ec_raw dataset, so "max" is selected for the min_max parameter.

dm <- dm %>%
  # Derive RFENDTC using oak_cal_ref_dates
  oak_cal_ref_dates(
    ds_in = .,
    der_var = "RFENDTC",
    min_max = "max",
    ref_date_config_df = ref_date_conf_df,
    raw_source = list(
      ec_raw = ec_raw,
      ds_raw = ds_raw,
      dm_raw = dm_raw
    )
  )

Sample of Data
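
The min and max logic for RFSTDTC and RFENDTC can be pictured with a small base R sketch. It converts dd-mmm-yyyy strings to ISO 8601 dates and then takes the earliest start and latest end per subject. The helper dmy_to_iso() and the toy exposure records are hypothetical, and the sketch ignores missing or partial dates, which a real derivation has to handle.

```r
# Toy exposure records; all values are made up.
toy_ec <- data.frame(
  patient_number = c("1015", "1015", "1023", "1023"),
  IT.ECSTDAT = c("02-Jan-2014", "20-Jan-2014", "05-Aug-2012", "27-Aug-2012"),
  IT.ECENDAT = c("19-Jan-2014", "02-Jul-2014", "26-Aug-2012", "01-Sep-2012"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert complete "dd-mmm-yyyy" strings to "yyyy-mm-dd".
# Month names are matched against English abbreviations, independent of locale.
dmy_to_iso <- function(x) {
  parts <- do.call(rbind, strsplit(x, "-", fixed = TRUE))
  month <- match(tolower(parts[, 2]), tolower(month.abb))
  sprintf("%s-%02d-%s", parts[, 3], month, parts[, 1])
}

# Complete ISO 8601 dates sort correctly as text, so min() and max() on the
# strings give the earliest and latest date per subject.
first_start <- vapply(
  split(dmy_to_iso(toy_ec$IT.ECSTDAT), toy_ec$patient_number),
  min, character(1)
)
last_end <- vapply(
  split(dmy_to_iso(toy_ec$IT.ECENDAT), toy_ec$patient_number),
  max, character(1)
)

print(first_start)
print(last_end)
```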

The same derivation logic applies to the other reference date/time variables.

dm <- dm %>%
  # Derive RFXSTDTC using oak_cal_ref_dates
  oak_cal_ref_dates(
    ds_in = .,
    der_var = "RFXSTDTC",
    min_max = "min",
    ref_date_config_df = ref_date_conf_df,
    raw_source = list(
      ec_raw = ec_raw,
      ds_raw = ds_raw,
      dm_raw = dm_raw
    )
  ) %>%
  # Derive RFXENDTC using oak_cal_ref_dates
  oak_cal_ref_dates(
    ds_in = .,
    der_var = "RFXENDTC",
    min_max = "max",
    ref_date_config_df = ref_date_conf_df,
    raw_source = list(
      ec_raw = ec_raw,
      ds_raw = ds_raw,
      dm_raw = dm_raw
    )
  ) %>%
  # Derive RFICDTC using oak_cal_ref_dates
  oak_cal_ref_dates(
    ds_in = .,
    der_var = "RFICDTC",
    min_max = "min",
    ref_date_config_df = ref_date_conf_df,
    raw_source = list(
      ec_raw = ec_raw,
      ds_raw = ds_raw,
      dm_raw = dm_raw
    )
  ) %>%
  # Derive RFPENDTC using oak_cal_ref_dates
  oak_cal_ref_dates(
    ds_in = .,
    der_var = "RFPENDTC",
    min_max = "max",
    ref_date_config_df = ref_date_conf_df,
    raw_source = list(
      ec_raw = ec_raw,
      ds_raw = ds_raw,
      dm_raw = dm_raw
    )
  ) %>%
  # Map DTHDTC using oak_cal_ref_dates
  oak_cal_ref_dates(
    ds_in = .,
    der_var = "DTHDTC",
    min_max = "min",
    ref_date_config_df = ref_date_conf_df,
    raw_source = list(
      ec_raw = ec_raw,
      ds_raw = ds_raw,
      dm_raw = dm_raw
    )
  )
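
RFPENDTC is the only variable in the configuration that pairs a date variable (DSDTCOL, format mm-dd-yyyy) with a time variable (DSTMCOL, format H:M). The sketch below shows, in base R, how such a pair could be combined into an ISO 8601 value. The helper to_iso_datetime() and the sample values are hypothetical, and the sketch assumes complete, zero-padded dates and times; it is not how {sdtm.oak} performs the conversion internally.

```r
# Hypothetical helper: combine "mm-dd-yyyy" dates and "HH:MM" times into
# ISO 8601 strings. When the time is missing, only the date is returned.
to_iso_datetime <- function(date_mdy, time_hm) {
  d <- do.call(rbind, strsplit(date_mdy, "-", fixed = TRUE))
  iso_date <- sprintf("%s-%s-%s", d[, 3], d[, 1], d[, 2])
  ifelse(is.na(time_hm), iso_date, paste0(iso_date, "T", time_hm))
}

# Made-up collected values: one with a time, one without.
iso_values <- to_iso_datetime(
  date_mdy = c("07-02-2014", "09-01-2012"),
  time_hm = c("11:45", NA)
)
print(iso_values)
```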

Create SDTM derived variables

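dm <- dm %>%
  mutate(
    # dm was built from dm_raw, so rows are assumed to be in the same order
    STUDYID = dm_raw$STUDY,
    DOMAIN = "DM",
    USUBJID = paste0("01-", dm_raw$PATNUM),
    COUNTRY = dm_raw$COUNTRY,
    # Death flag: "Y" when a death date exists, otherwise missing
    DTHFL = dplyr::if_else(is.na(DTHDTC), NA_character_, "Y")
  ) %>%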
  # Map DMDTC using assign_datetime
  assign_datetime(
    raw_dat = dm_raw,
    raw_var = "COL_DT",
    tgt_var = "DMDTC",
    raw_fmt = c("m/d/y"),
    id_vars = oak_id_vars()
  ) %>%
  # Derive study day
  derive_study_day(
    sdtm_in = .,
    dm_domain = .,
    tgdt = "DMDTC",
    refdt = "RFXSTDTC",
    study_day_var = "DMDY"
  )

Sample of Data
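
The study day derivation follows the usual SDTM convention: dates on or after the reference date are counted from day 1, earlier dates are negative, and there is no day 0. The base R sketch below illustrates that rule, which derive_study_day() is expected to follow, together with the death flag logic. study_day_sketch() and the sample dates are hypothetical, and the sketch assumes complete ISO 8601 dates.

```r
# Hypothetical helper: SDTM-style study day for complete "yyyy-mm-dd" dates.
study_day_sketch <- function(target, reference) {
  diff_days <- as.integer(as.Date(target) - as.Date(reference))
  # On or after the reference date: add 1 so the reference date is day 1.
  # Before the reference date: keep the negative difference (no day 0).
  ifelse(diff_days >= 0, diff_days + 1L, diff_days)
}

dmdy <- study_day_sketch(
  target = c("2014-01-02", "2013-12-26", "2014-01-05"),
  reference = "2014-01-02"
)
print(dmdy)

# Death flag: "Y" only when a death date is present.
dthdtc <- c(NA, "2014-03-10")
dthfl <- ifelse(is.na(dthdtc), NA_character_, "Y")
print(dthfl)
```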

Add Labels and Attributes

This step is yet to be developed. Please refer to the {metatools} package to investigate options.
