9  tidyr

Transposing data using tidyrand Joins

Author

Zelos Zhu

9.1 Transposing Data

9.1.1 Some Context

As we know, data can often be represented in several ways. Multiple observations of a variable can be organized by rows or by columns.

Table A.

ID Pre Post
x 1 2
y 3 4

Table B.

ID Time Value
x Pre 1
x Post 2
y Pre 3
y Post 4

When observations are spread along a row as multiple columns, we refer to the data as being in “wide” format (See Table A). When observations are spread along a column as multiple rows, we refer to the data as being in “long” format (See Table B). SDTM data for the most part generally adheres to the “long” structure, but as programmers we need to know how to work with both to suit our needs.

To get the desired shape of data, there are two useful functions from the tidyr package to make this transformation, aptly named: pivot_longer() and pivot_wider() . These can be seen as the R-equivalent of proc transpose in SAS.

9.1.2 Setup

library(dplyr)
library(tidyr)
suppdm <- readRDS("data/suppdm.rds") %>%
  select(USUBJID, QNAM, QVAL)

head(suppdm, 10)
# A tibble: 10 × 3
   USUBJID     QNAM     QVAL 
   <chr>       <chr>    <chr>
 1 01-701-1015 COMPLT16 Y    
 2 01-701-1015 COMPLT24 Y    
 3 01-701-1015 COMPLT8  Y    
 4 01-701-1015 EFFICACY Y    
 5 01-701-1015 ITT      Y    
 6 01-701-1015 SAFETY   Y    
 7 01-701-1023 EFFICACY Y    
 8 01-701-1023 ITT      Y    
 9 01-701-1023 SAFETY   Y    
10 01-701-1028 COMPLT16 Y    

As we see here, in our SUPPDM domain, the data is currently in the “long” format. If we wanted to transform the dataset such that each of the unique values of QNAM was their own column, we are looking to transpose the data from “long” to “wide”. In this case, we use pivot_wider().

suppdm_wide <- suppdm %>% 
  pivot_wider(
    names_from = "QNAM", # assign column names based on QNAM
    values_from = "QVAL" # retrieve values from QVAL
  )
suppdm_wide
# A tibble: 254 × 7
   USUBJID     COMPLT16 COMPLT24 COMPLT8 EFFICACY ITT   SAFETY
   <chr>       <chr>    <chr>    <chr>   <chr>    <chr> <chr> 
 1 01-701-1015 Y        Y        Y       Y        Y     Y     
 2 01-701-1023 <NA>     <NA>     <NA>    Y        Y     Y     
 3 01-701-1028 Y        Y        Y       Y        Y     Y     
 4 01-701-1033 <NA>     <NA>     <NA>    Y        Y     Y     
 5 01-701-1034 Y        Y        Y       Y        Y     Y     
 6 01-701-1047 <NA>     <NA>     <NA>    Y        Y     Y     
 7 01-701-1097 Y        Y        Y       Y        Y     Y     
 8 01-701-1111 <NA>     <NA>     <NA>    Y        Y     Y     
 9 01-701-1115 <NA>     <NA>     Y       Y        Y     Y     
10 01-701-1118 Y        Y        Y       Y        Y     Y     
# … with 244 more rows

Voila! This “wide” dataset may prove useful for joins (to be discussed later). But for now, let’s pretend that this “wide” format is how our original data came to us in. If we wanted to take these respective flagging columns and turn them into a “long” format, we use pivot_longer().

suppdm_long <- suppdm_wide %>% 
  pivot_longer(
    cols = c("COMPLT16", "COMPLT24", "COMPLT8", "EFFICACY", "ITT", "SAFETY"), 
    names_to = "QNAM",
    values_to = "QVAL"
  )
suppdm_long
# A tibble: 1,524 × 3
   USUBJID     QNAM     QVAL 
   <chr>       <chr>    <chr>
 1 01-701-1015 COMPLT16 Y    
 2 01-701-1015 COMPLT24 Y    
 3 01-701-1015 COMPLT8  Y    
 4 01-701-1015 EFFICACY Y    
 5 01-701-1015 ITT      Y    
 6 01-701-1015 SAFETY   Y    
 7 01-701-1023 COMPLT16 <NA> 
 8 01-701-1023 COMPLT24 <NA> 
 9 01-701-1023 COMPLT8  <NA> 
10 01-701-1023 EFFICACY Y    
# … with 1,514 more rows

As you can see, as we pivoted back, we didn’t come up with an exact duplicate of our original suppdm dataframe. This is because the default of pivot_longer() is not to drop NA values, which can be modified with the values_drop_na function input, just one of the many powerful additional function inputs from both of these pivoting functions. pivot_wider() and pivot_longer() were designed to handle a variety of situations when transposing data in the most flexible of ways.

suppdm_long <- suppdm_wide %>% 
  pivot_longer(
    cols = c("COMPLT16", "COMPLT24", "COMPLT8", "EFFICACY", "ITT", "SAFETY"), 
    names_to = "flag",
    values_to = "flag_value",
    values_drop_na = TRUE
  )
suppdm_long
# A tibble: 1,197 × 3
   USUBJID     flag     flag_value
   <chr>       <chr>    <chr>     
 1 01-701-1015 COMPLT16 Y         
 2 01-701-1015 COMPLT24 Y         
 3 01-701-1015 COMPLT8  Y         
 4 01-701-1015 EFFICACY Y         
 5 01-701-1015 ITT      Y         
 6 01-701-1015 SAFETY   Y         
 7 01-701-1023 EFFICACY Y         
 8 01-701-1023 ITT      Y         
 9 01-701-1023 SAFETY   Y         
10 01-701-1028 COMPLT16 Y         
# … with 1,187 more rows

Bonus Trick: The names_to/values_to function arguments can prove to be helpful as a renaming step during the data cleaning process too!

9.2 Relational Data (Joins)

When a pair of tables need to be joined together, we have a variety of functions that can achieve such a task:

  • left_join()

  • right_join()

  • full_join()

  • inner_join()

The use of these functions is very similar to proc sql in SAS. left_join() will cover most of use cases and is demonstrated below:

dm <- admiral.test::admiral_dm %>% 
  select(STUDYID, USUBJID, AGE, ARM)

dm_suppdm <- dm %>% 
  left_join(suppdm_wide, by = "USUBJID")

head(dm_suppdm)
# A tibble: 6 × 10
  STUDYID      USUBJID    AGE ARM   COMPL…¹ COMPL…² COMPLT8 EFFIC…³ ITT   SAFETY
  <chr>        <chr>    <dbl> <chr> <chr>   <chr>   <chr>   <chr>   <chr> <chr> 
1 CDISCPILOT01 01-701-…    63 Plac… Y       Y       Y       Y       Y     Y     
2 CDISCPILOT01 01-701-…    64 Plac… <NA>    <NA>    <NA>    Y       Y     Y     
3 CDISCPILOT01 01-701-…    71 Xano… Y       Y       Y       Y       Y     Y     
4 CDISCPILOT01 01-701-…    74 Xano… <NA>    <NA>    <NA>    Y       Y     Y     
5 CDISCPILOT01 01-701-…    77 Xano… Y       Y       Y       Y       Y     Y     
6 CDISCPILOT01 01-701-…    85 Plac… <NA>    <NA>    <NA>    Y       Y     Y     
# … with abbreviated variable names ¹​COMPLT16, ²​COMPLT24, ³​EFFICACY

The join can also be completed with different column names as long as you define the join-key relationship, demonstrated below:

dummy1 <- data.frame(
  STUDYID = c("TRIALX", "TRIALX"),
  USUBJID = c("1001", "1002"),
  AGE = c(18, 22)
)

dummy2 <- data.frame(
  STUDYID = c("TRIALX", "TRIALX"),
  SUBJECT = c("1001", "1002"),
  SEX = c("M", "F")
)

dummy3 <- dummy1 %>% 
  left_join(dummy2, by = c("STUDYID" = "STUDYID", "USUBJID" = "SUBJECT"))

head(dummy3)
  STUDYID USUBJID AGE SEX
1  TRIALX    1001  18   M
2  TRIALX    1002  22   F