An SDTM DTC variable may include data that is represented in ISO 8601 format as a
complete date/time, a partial date/time, or an incomplete date/time.
sdtm.oak provides the create_iso8601()
function that allows flexible mapping of date and time values in various
formats to a single date-time ISO 8601 format.
Introduction
To perform conversion to the ISO 8601 format you need to pass two key arguments:
- At least one vector of dates, times, or date-times of
character
type; - A date/time format via the
.format
parameter that instructscreate_iso8601()
on which date/time components to expect.
create_iso8601("2000 01 05", .format = "y m d")
#> [1] "2000-01-05"
create_iso8601("22:35:05", .format = "H:M:S")
#> [1] "-----T22:35:05"
By default the .format
parameter understands a few
reserved characters:
-
"y"
for year -
"m"
for month -
"d"
for day -
"H"
for hours -
"M"
for minutes -
"S"
for seconds
Besides character vectors of dates and times, you may also pass a single vector of date-times, provided you adjust the format:
create_iso8601("2000-01-05 22:35:05", .format = "y-m-d H:M:S")
#> [1] "2000-01-05T22:35:05"
Multiple inputs
If you have dates and times in separate vectors then you will need to pass a format for each vector:
create_iso8601("2000-01-05", "22:35:05", .format = c("y-m-d", "H:M:S"))
#> [1] "2000-01-05T22:35:05"
In addition, like most R functions that take vectors as input,
create_iso8601()
is vectorized:
date <- c("2000-01-05", "2001-12-25", "1980-06-18", "1979-09-07")
time <- c("00:12:21", "22:35:05", "03:00:15", "07:09:00")
create_iso8601(date, time, .format = c("y-m-d", "H:M:S"))
#> [1] "2000-01-05T00:12:21" "2001-12-25T22:35:05" "1980-06-18T03:00:15"
#> [4] "1979-09-07T07:09:00"
But the number of elements in each of the inputs has to match or you will get an error:
date <- c("2000-01-05", "2001-12-25", "1980-06-18", "1979-09-07")
time <- "00:12:21"
try(create_iso8601(date, time, .format = c("y-m-d", "H:M:S")))
#> Error in create_iso8601(date, time, .format = c("y-m-d", "H:M:S")) :
#> All vectors in `...` must be of the same length.
You can combine individual date and time components coming in as separate inputs; here is a contrived example of year, month and day together, hour, and minute:
year <- c("99", "84", "00", "80", "79", "1944", "1953")
month_and_day <- c("jan 1", "apr 04", "mar 06", "jun 18", "sep 07", "sep 13", "sep 14")
hour <- c("12", "13", "05", "23", "16", "16", "19")
min <- c("0", "60", "59", "42", "44", "10", "13")
create_iso8601(year, month_and_day, hour, min, .format = c("y", "m d", "H", "M"))
#> [1] "1999-01-01T12:00" "1984-04-04T13:60" "2000-03-06T05:59" "1980-06-18T23:42"
#> [5] "1979-09-07T16:44" "1944-09-13T16:10" "1953-09-14T19:13"
The .format
argument must be always named; otherwise, it
will be treated as if it were one of the inputs and interpreted as
missing.
try(create_iso8601("2000-01-05", "y-m-d"))
#> Error in create_iso8601("2000-01-05", "y-m-d") :
#> argument ".format" is missing, with no default
Format variations
The .format
parameter can easily accommodate variations
in the format of the inputs:
create_iso8601("2000-01-05", .format = "y-m-d")
#> [1] "2000-01-05"
create_iso8601("2000 01 05", .format = "y m d")
#> [1] "2000-01-05"
create_iso8601("2000/01/05", .format = "y/m/d")
#> [1] "2000-01-05"
Individual components may come in a different order, so adjust the format accordingly:
create_iso8601("2000 01 05", .format = "y m d")
#> [1] "2000-01-05"
create_iso8601("05 01 2000", .format = "d m y")
#> [1] "2000-01-05"
create_iso8601("01 05, 2000", .format = "m d, y")
#> [1] "2000-01-05"
All other individual characters given in the format are taken strictly, e.g. the number of spaces matters:
date <- c("2000 01 05", "2000 01 05", "2000 01 05", "2000 01 05")
create_iso8601(date, .format = "y m d")
#> [1] "2000-01-05" NA NA NA
create_iso8601(date, .format = "y m d")
#> [1] NA "2000-01-05" NA NA
create_iso8601(date, .format = "y m d")
#> [1] NA NA "2000-01-05" NA
create_iso8601(date, .format = "y m d")
#> [1] NA NA NA "2000-01-05"
The format can include regular expressions though:
create_iso8601(date, .format = "y\\s+m\\s+d")
#> [1] "2000-01-05" "2000-01-05" "2000-01-05" "2000-01-05"
By default, a streak of the reserved characters is treated as if only one was provided, so these formats are equivalent:
date <- c("2000-01-05", "2001-12-25", "1980-06-18", "1979-09-07")
time <- c("00:12:21", "22:35:05", "03:00:15", "07:09:00")
create_iso8601(date, time, .format = c("y-m-d", "H:M:S"))
#> [1] "2000-01-05T00:12:21" "2001-12-25T22:35:05" "1980-06-18T03:00:15"
#> [4] "1979-09-07T07:09:00"
create_iso8601(date, time, .format = c("yyyy-mm-dd", "HH:MM:SS"))
#> [1] "2000-01-05T00:12:21" "2001-12-25T22:35:05" "1980-06-18T03:00:15"
#> [4] "1979-09-07T07:09:00"
create_iso8601(date, time, .format = c("yyyyyyyy-m-dddddd", "H:MMMMM:SSSS"))
#> [1] "2000-01-05T00:12:21" "2001-12-25T22:35:05" "1980-06-18T03:00:15"
#> [4] "1979-09-07T07:09:00"
Multiple alternative formats
When an input vector contains values with varying formats, a single format may not be adequate to encompass all variations. In such situations, it’s advisable to list multiple alternative formats. This approach ensures that each format is tried sequentially until one matches the data in the vector.
date <- c("2000/01/01", "2000-01-02", "2000 01 03", "2000/01/04")
create_iso8601(date, .format = "y-m-d")
#> [1] NA "2000-01-02" NA NA
create_iso8601(date, .format = "y m d")
#> [1] NA NA "2000-01-03" NA
create_iso8601(date, .format = "y/m/d")
#> [1] "2000-01-01" NA NA "2000-01-04"
create_iso8601(date, .format = list(c("y-m-d", "y m d", "y/m/d")))
#> [1] "2000-01-01" "2000-01-02" "2000-01-03" "2000-01-04"
Consider the order in which you supply the formats, as it can be significant. If multiple formats could potentially match, the sequence determines which format is applied first.
create_iso8601("07 04 2000", .format = list(c("d m y", "m d y")))
#> [1] "2000-04-07"
create_iso8601("07 04 2000", .format = list(c("m d y", "d m y")))
#> [1] "2000-07-04"
Note that if you are passing alternative formats, then the
.format
argument must be a list whose length matches the
number of inputs.
Parsing of date or time components
By default, date or time components are parsed as follows:
- year: either parsed from a two- or four-digit year;
- month: either as a numeric month (single or two-digit number) or as an English abbreviated month name (e.g. Jan, Jun or Dec) regardless of case;
- month day: are parsed from two-digit numbers;
- hour and minute: are parsed from single or two-digit numbers;
- second: is parsed from single or two-digit numbers with an optional fractional part.
# Years: two-digit or four-digit numbers.
years <- c("0", "1", "00", "01", "15", "30", "50", "68", "69", "80", "99")
create_iso8601(years, .format = "y")
#> [1] NA NA "2000" "2001" "2015" "2030" "2050" "2068" "1969" "1980"
#> [11] "1999"
# Adjust the point where two-digits years are mapped to 2000's or 1900's.
create_iso8601(years, .format = "y", .cutoff_2000 = 20L)
#> [1] NA NA "2000" "2001" "2015" "1930" "1950" "1968" "1969" "1980"
#> [11] "1999"
# Both numeric months (two-digit only) and abbreviated months work out of the box
months <- c("0", "00", "1", "01", "Jan", "jan")
create_iso8601(months, .format = "m")
#> [1] NA "--00" NA "--01" "--01" "--01"
# Month days: single or two-digit numbers, anything else results in NA.
create_iso8601(c("1", "01", "001", "10", "20", "31"), .format = "d")
#> [1] "----01" "----01" NA "----10" "----20" "----31"
# Hours
create_iso8601(c("1", "01", "001", "10", "20", "31"), .format = "H")
#> [1] "-----T01" "-----T01" NA "-----T10" "-----T20" "-----T31"
# Minutes
create_iso8601(c("1", "01", "001", "10", "20", "60"), .format = "M")
#> [1] "-----T-:01" "-----T-:01" NA "-----T-:10" "-----T-:20"
#> [6] "-----T-:60"
# Seconds
create_iso8601(c("1", "01", "23.04", "001", "10", "20", "60"), .format = "S")
#> [1] "-----T-:-:01" "-----T-:-:01" "-----T-:-:23.04" NA
#> [5] "-----T-:-:10" "-----T-:-:20" "-----T-:-:60"
Allowing alternative date or time values
If date or time component values include special values, e.g. values
encoding missing values, then you can indicate those values as possible
alternatives such that the parsing will tolerate them; use the
.na
argument:
create_iso8601("U DEC 2019 14:00", .format = "d m y H:M")
#> [1] NA
create_iso8601("U DEC 2019 14:00", .format = "d m y H:M", .na = "U")
#> [1] "2019-12--T14:00"
create_iso8601("U UNK 2019 14:00", .format = "d m y H:M")
#> [1] NA
create_iso8601("U UNK 2019 14:00", .format = "d m y H:M", .na = c("U", "UNK"))
#> [1] "2019----T14:00"
In this case you could achieve the same result using regexps:
create_iso8601("U UNK 2019 14:00", .format = "(d|U) (m|UNK) y H:M")
#> [1] "2019----T14:00"
Changing reserved format characters
There might be cases when the reserved characters — "y"
,
"m"
, "d"
, "H"
, "M"
,
"S"
— might get in the way of specifying an adequate
format. For example, you might be tempted to use format
"HHMM"
to try to parse a time such as
"14H00M"
. You could assume that the first “H” codes for
parsing the hour, and the second “H” to be a literal “H” but, actually,
"HH"
will be taken to mean parsing hours, and
"MM"
to parse minutes. You can use the function
fmt_cmp()
to specify alternative format regexps for the
format, replacing the default characters.
In the next example, we reassign new format strings for the hour and
minute components, thus freeing the "H"
and
"M"
patterns from being interpreted as hours and minutes,
and to be taken literally:
create_iso8601("14H00M", .format = "HHMM")
#> [1] NA
create_iso8601("14H00M", .format = "xHwM", .fmt_c = fmt_cmp(hour = "x", min = "w"))
#> [1] "-----T14:00"
Note that you need to make sure that the format component regexps are
mutually exclusive, i.e. they don’t have overlapping matches; otherwise
create_iso8601()
will fail with an error. In the next
example both months and minutes could be represented by an
"m"
in the format resulting in an ambiguous format
specification.
fmt_cmp(hour = "h", min = "m")
#> $sec
#> [1] "S+"
#>
#> $min
#> [1] "m"
#>
#> $hour
#> [1] "h"
#>
#> $mday
#> [1] "d+"
#>
#> $mon
#> [1] "m+"
#>
#> $year
#> [1] "y+"
#>
#> attr(,"class")
#> [1] "fmt_c"
try(create_iso8601("14H00M", .format = "hHmM", .fmt_c = fmt_cmp(hour = "h", min = "m")))
#> Error in purrr::map2(dots, .format, ~parse_dttm(dttm = .x, fmt = .y, na = .na, :
#> ℹ In index: 1.
#> Caused by error in `purrr::map()` at sdtm.oak/R/dtc_parse_dttm.R:78:3:
#> ℹ In index: 1.
#> Caused by error in `parse_dttm_fmt()` at sdtm.oak/R/parse_dttm_fmt.R:387:3:
#> ! Patterns in `fmt_c` have overlapping matches.