The severity of a disease
(commonly understood as the case fatality risk, CFR) might change during
the course of an outbreak for multiple biological, epidemiological and
behavioural reasons. cfr_time_varying()
offers a convenient
way to understand changes in disease severity over time using an
approach with a finer time resolution that calculates severity over a
moving time window, using methods from Nishiura
et al. (2009).
New to calculating disease severity using cfr? You might want to see the “Get started” vignette first.
There are substantial changes to the characteristics of an outbreak over time — such as the introduction of therapeutics or a changing case definition. We want to estimate how disease severity in the form of the case fatality risk (CFR) changes over time while correcting for the delay in reporting the outcomes of cases.
library(cfr)
# packages to wrangle and plot data
library(dplyr)
library(tidyr)
library(purrr)
library(scales)
library(ggplot2)
This example shows time-varying severity estimation using cfr and data from the Covid-19 pandemic in the United Kingdom.
We load example Covid-19 daily case and death data provided with the
cfr package as covid_data
, and subset for the
first year of U.K. data.
We would expect the estimated CFR to change over this period due to changes in pandemic response policy, such as changes in case definitions, implementation and relaxation of lockdowns, and new variants emerging.
# get Covid data loaded with the package
data("covid_data")
# filter for the U.K
df_covid_uk <- filter(
covid_data,
country == "United Kingdom", date <= "2020-12-31"
)
# View the first few rows and recall necessary columns: date, cases, deaths
head(df_covid_uk)
#> date country cases deaths
#> 1 2020-01-03 United Kingdom 0 0
#> 2 2020-01-04 United Kingdom 0 0
#> 3 2020-01-05 United Kingdom 0 0
#> 4 2020-01-06 United Kingdom 0 0
#> 5 2020-01-07 United Kingdom 0 0
#> 6 2020-01-08 United Kingdom 0 0
We retrieve the appropriate distribution of the duration between symptom onset and deaths reported in Linton et al. (2020); this is a lognormal distribution with μ = 2.577 and σ = 0.440.
Linton et al. (2020) fitted a discrete lognormal distribution — but we use a continuous distribution here. See the vignette on delay distributions for more on when using a continuous instead of discrete distribution is acceptable, and on using discrete distributions with cfr.
Note also that we use the central estimates for each distribution parameter, and by ignoring uncertainty in these parameters the uncertainty in the resulting CFR is likely to be underestimated.
We use the cfr_time_varying()
function within the
cfr package to calculate the time-varying CFR for the Covid-19
pandemic in the U.K., and plot the results.
The burn_in
time is used to determine how many days at
the start of the outbreak are excluded from the CFR calculation,
potentially due to poor data quality at the beginning of an outbreak.
The default value is 7, which ignores the first week of data.
The smoothing_window
is used to smooth the case and
death data using a rolling median with a window of the corresponding
size (in days) using stats::runmed()
internally — this is
disabled by default. Users should apply smoothing if there are reporting
artefacts such as lower reporting on weekends.
# calculating the naive time-varying CFR
df_covid_cfr_uk_naive <- cfr_time_varying(
df_covid_uk,
burn_in = 7,
smoothing_window = 7
)
# calculating the corrected time-varying CFR
df_covid_cfr_uk_corrected <- cfr_time_varying(
df_covid_uk,
delay_density = function(x) dlnorm(x, meanlog = 2.577, sdlog = 0.440),
burn_in = 7,
smoothing_window = 7
)
# assign method tag and plot
df_covid_cfr_uk_naive$method <- "naive"
df_covid_cfr_uk_corrected$method <- "corrected"
df_covid_cfr_uk <- bind_rows(df_covid_cfr_uk_naive, df_covid_cfr_uk_corrected)
ggplot(df_covid_cfr_uk) +
geom_ribbon(
aes(
x = date, ymin = severity_low, ymax = severity_high,
fill = method
),
alpha = 0.5
) +
geom_line(
aes(
x = date, y = severity_estimate, colour = method
)
) +
scale_x_date(
date_labels = "%b-%Y"
) +
scale_y_continuous(
labels = percent
) +
scale_fill_brewer(
palette = "Dark2",
name = NULL,
labels = c("Naive CFR", "Corrected CFR")
) +
scale_colour_brewer(
palette = "Dark2",
name = NULL,
labels = c("Naive CFR", "Corrected CFR")
) +
labs(
x = "Date", y = "CFR (%)"
) +
coord_cartesian(
expand = FALSE
) +
theme_classic() +
theme(legend.position = "top")
#> Warning: Removed 93 rows containing missing values or values outside the scale range
#> (`geom_line()`).
Note that the severity estimates and confidence
intervals in cfr_time_varying()
are obtained from a
Binomial test on deaths (treated as ‘successes’) and estimated outcomes
or cases (depending on whether delay correction is applied; treated as
‘trials’).
cfr_time_varying()
and other cfr functions can
be conveniently applied over nested data to estimate the time-varying
severity of Covid-19.
We refer the user to the book R for Data Science for a better explanation of some of the code used here, including from the packages in the Tidyverse.
We use the example Covid-19 cases and deaths data provided with the
package as covid_data
, while excluding four countries which
only provide weekly data (with zeros for dates in between).
# countries with weekly reporting
weekly_reporting <- c("France", "Germany", "Spain", "Ukraine")
covid_data <- filter(covid_data, !country %in% weekly_reporting)
# for each country, get the time-varying severity estimate,
# correcting for delays and smoothing the case and death data
# first nest the data; nest() from {tidyr}
df_covid_cfr <- nest(
covid_data,
.by = country
)
# define delay density function
delay_density <- function(x) dlnorm(x, meanlog = 2.577, sdlog = 0.440)
# to each nested data frame, apply the function `cfr_time_varying`
# overwrite the `data` column, as all data will be preserved
df_covid_cfr <- mutate(
df_covid_cfr,
# using map() from {purrr}
data = map(
.x = data, .f = cfr_time_varying,
# arguments to the function
delay_density = delay_density,
smoothing_window = 7, burn_in = 7
)
)
# unnest the cfr data; unnest() from {tidyr}
df_covid_cfr <- unnest(df_covid_cfr, cols = data)
For simplicity, we use the same delay distribution between onset and death for all countries — users should note that this likely introduces biases given inter-country differences in testing or reporting policies.
Finally we plot the time-varying CFR for a selection of three countries with large outbreaks of Covid-19: Brazil, India, and the United States.
filter(df_covid_cfr, country %in% c("Brazil", "India", "United States")) %>%
ggplot() +
geom_ribbon(
aes(
x = date, ymin = severity_low, ymax = severity_high,
group = country
),
fill = "grey"
) +
geom_line(
aes(
x = date, y = severity_estimate, colour = country
)
) +
scale_x_date(
date_labels = "%b-%Y"
) +
scale_y_continuous(
labels = percent
) +
scale_colour_brewer(
palette = "Dark2"
) +
labs(
x = "Date", y = "CFR (%)"
) +
coord_cartesian(
ylim = c(0, 0.15),
expand = FALSE
) +
theme_classic() +
theme(legend.position = "top")
#> Warning: Removed 215 rows containing missing values or values outside the scale range
#> (`geom_line()`).
cfr_time_varying()
estimates the number of cases which
have a known outcome over time following Nishiura
et al. (2009), by calculating a
quantity kt for each day
within the input data, which represents the number of cases with a known
adverse outcome (usually death), on day t.
$$ k_t = \sum_{j = 0}^t c_t f_{j - t}. $$
We then assume that the severity measure (usually CFR) is binomially distributed in the following way
$$ d_t \sim {\sf Binomial}(k_t, \theta_t). $$
We use maximum likelihood estimation to determine the value of θt for each t, where θ represents the severity measure of interest.
The precise severity measure — case fatality risk (CFR), infection fatality risk (IFR), hospitalisation fatality risk (HFR), etc. — that θ represents depends upon the input data given by the user.
Note that the function arguments
burn_in
and smoothing_window
are not
explicitly used in this calculation. burn_in
controls how
many estimates at the beginning of the outbreak are replaced with
NA
s — the calculation above is not applied to the first
burn_in
data points. The calculation is applied to the
smoothed data, if a smoothing_window
is specified.