library(dplyr)
<- readRDS("data/dm.rds")
dm <- readRDS("data/ae.rds") ae
7 summarizing data
While mutate()
adds a new variable for all existing records to a dataset, summarize()
aggregates one or more columns of a dataset thereby “collapsing” it. In the simplest case a single variable is aggregated using a summary function such as mean()
.
%>%
dm summarize(avg_age = mean(AGE, na.rm = TRUE))
# A tibble: 1 × 1
avg_age
<dbl>
1 75.1
Just like you can create multiple variables inside a single call to mutate()
you can aggregate multiple variables (or the same variable with multiple summary functions) inside summarize()
.
%>%
dm summarize(
avg_age = mean(AGE, na.rm = TRUE),
median_age = median(AGE, na.rm = TRUE)
)
# A tibble: 1 × 2
avg_age median_age
<dbl> <dbl>
1 75.1 77
So far we aggregated only numeric variables. Another useful aggregation is counting the number of records.
%>%
dm summarize(N = n())
# A tibble: 1 × 1
N
<int>
1 306
This becomes quite powerful when combining summarize()
with group_by()
. This should look rather familiar to you if you every aggregated data using proc sql
.
%>%
dm group_by(COUNTRY) %>%
summarize(n = n()) %>%
ungroup()
# A tibble: 9 × 2
COUNTRY n
<chr> <int>
1 Canada 86
2 China 55
3 Germany 49
4 Greece 13
5 Japan 12
6 Mexico 5
7 Spain 6
8 Switzerland 9
9 USA 71
Note that it is best practice to ungroup()
the dataset after you aggregated it. Failing to do so can lead to some rather bogus error when continuing to manipulate the aggregated dataset, e.g. using mutate()
.
group_by()
and summarize()
can be used with numeric variables as well. In addition one can group by more than a single variable.
%>%
dm group_by(ARM, COUNTRY) %>%
summarize(avg_age = mean(AGE, na.rm = TRUE)) %>%
ungroup()
# A tibble: 32 × 3
ARM COUNTRY avg_age
<chr> <chr> <dbl>
1 Placebo Canada 79.2
2 Placebo China 70.5
3 Placebo Germany 75.2
4 Placebo Greece 75.8
5 Placebo Japan 70
6 Placebo Mexico 65
7 Placebo Spain 83
8 Placebo Switzerland 69.3
9 Placebo USA 74.9
10 Screen Failure Canada 75.1
# … with 22 more rows