7  summarizing data

Author

Thomas Neitmann

library(dplyr)
dm <- readRDS("data/dm.rds")
ae <- readRDS("data/ae.rds")

While mutate() adds a new variable for all existing records to a dataset, summarize() aggregates one or more columns of a dataset thereby “collapsing” it. In the simplest case a single variable is aggregated using a summary function such as mean().

dm %>% 
  summarize(avg_age = mean(AGE, na.rm = TRUE))
# A tibble: 1 × 1
  avg_age
    <dbl>
1    75.1

Just like you can create multiple variables inside a single call to mutate() you can aggregate multiple variables (or the same variable with multiple summary functions) inside summarize().

dm %>% 
  summarize(
    avg_age = mean(AGE, na.rm = TRUE),
    median_age = median(AGE, na.rm = TRUE)
  )
# A tibble: 1 × 2
  avg_age median_age
    <dbl>      <dbl>
1    75.1         77

So far we aggregated only numeric variables. Another useful aggregation is counting the number of records.

dm %>% 
  summarize(N = n())
# A tibble: 1 × 1
      N
  <int>
1   306

This becomes quite powerful when combining summarize() with group_by(). This should look rather familiar to you if you every aggregated data using proc sql.

dm %>% 
  group_by(COUNTRY) %>% 
  summarize(n = n()) %>% 
  ungroup()
# A tibble: 9 × 2
  COUNTRY         n
  <chr>       <int>
1 Canada         86
2 China          55
3 Germany        49
4 Greece         13
5 Japan          12
6 Mexico          5
7 Spain           6
8 Switzerland     9
9 USA            71

Note that it is best practice to ungroup() the dataset after you aggregated it. Failing to do so can lead to some rather bogus error when continuing to manipulate the aggregated dataset, e.g. using mutate().

group_by() and summarize() can be used with numeric variables as well. In addition one can group by more than a single variable.

dm %>% 
  group_by(ARM, COUNTRY) %>% 
  summarize(avg_age = mean(AGE, na.rm = TRUE)) %>% 
  ungroup()
# A tibble: 32 × 3
   ARM            COUNTRY     avg_age
   <chr>          <chr>         <dbl>
 1 Placebo        Canada         79.2
 2 Placebo        China          70.5
 3 Placebo        Germany        75.2
 4 Placebo        Greece         75.8
 5 Placebo        Japan          70  
 6 Placebo        Mexico         65  
 7 Placebo        Spain          83  
 8 Placebo        Switzerland    69.3
 9 Placebo        USA            74.9
10 Screen Failure Canada         75.1
# … with 22 more rows