The rate of growth or decline in the incidence of a transmissible infectious disease is often quantified as the population-level reproduction number (R). However, there may be individual-level heterogeneity in transmission that this metric does not capture.
In their landmark work, Lloyd-Smith et al. (2005) showed that variation in individual-level disease transmission often exceed that of a poisson model and in some cases that of a geometric model. They fit a negative binomial distribution to show that the distribution of secondary cases displayed overdispersion, in other words, the variance exceeded the mean. In the negative binomial distribution this is quantified by the dispersion (k) parameter being less than 1.
This vignette demonstrates how to use the {superspreading} and {fitdistrplus} R packages to estimate the parameters of individual-level transmission and select the best fitting model.
Additionally, {ggplot2} is used for plotting.
library(superspreading)
library(fitdistrplus)
#> Loading required package: MASS
#> Loading required package: survival
library(ggplot2)
For this example we use transmission chain data from Faye et al. (2015) from the Ebola virus disease outbreak in West Africa, between the period from February to August 2014. Specifically, this data reconstructs the pathway of Ebola transmission in Conakry, Guinea.
This data has been used by Althaus (2015) to show that the Ebola outbreak in Conakry displayed signatures of overdispersion (i.e. superspreading events), and by Kucharski et al. (2016) to determine the effectiveness of ring- and mass-vaccination campaigns for ongoing Ebola outbreaks.
# total number of cases (i.e. individuals in transmission chain)
n <- 152
# number of secondary cases for all cases
all_cases_transmission <- c(
1, 2, 2, 5, 14, 1, 4, 4, 1, 3, 3, 8, 2, 1, 1, 4, 9, 9, 1, 1, 17, 2, 1,
1, 1, 4, 3, 3, 4, 2, 5, 1, 2, 2, 1, 9, 1, 3, 1, 2, 1, 1, 2
)
# add zeros for each cases that did not cause a secondary case
# (i.e. cases that did not transmit)
all_cases <- c(
all_cases_transmission,
rep(0, n - length(all_cases_transmission))
)
# fit a standard set of offspring distribution models:
# - Poisson
# - Geometric
# - Negative Binomial
pois_fit <- fitdist(data = all_cases, distr = "pois")
geom_fit <- fitdist(data = all_cases, distr = "geom")
nbinom_fit <- fitdist(data = all_cases, distr = "nbinom")
model_tbl <- ic_tbl(pois_fit, geom_fit, nbinom_fit)
col.names <- gsub(
pattern = "^Delta", replacement = "$\\\\Delta$", x = colnames(model_tbl)
)
col.names <- gsub(pattern = "^w", replacement = "$w$", x = col.names)
knitr::kable(model_tbl, col.names = col.names, row.names = FALSE, digits = 1)
distribution | AIC | ΔAIC | wAIC | BIC | ΔBIC | wBIC |
---|---|---|---|---|---|---|
nbinom | 358.4 | 0.0 | 1 | 364.4 | 0.0 | 1 |
geom | 413.6 | 55.2 | 0 | 416.6 | 52.1 | 0 |
pois | 602.4 | 244.0 | 0 | 605.4 | 241.0 | 0 |
The best performing model, for both AIC and BIC comparison, is the negative binomial.
AIC and BIC are model fitting metrics (information criterion) to determine the model that best fits the data while penalising model complexity to prevent overfitting and allows comparison of models with a different number of parameters. In this case the Poisson and Geometric models have a single parameter, whereas the Negative binomial has two parameters.
The parameter for the negative binomial show that there is
overdispersion (size
is the dispersion parameter k, and mu
is the mean,
or R) in transmission and thus
the EVD transmission chain data shows that superspreading events are a
possible realisation of EVD transmission dynamics.
These values match those reported in Althaus (2015) and here we reproduce figure 1 from that paper to display the tail of the distribution from which superspreading events can be possible.
# tally cases
tally <- table(all_cases)
# pad with zeros when number of cases not in tally
num_cases <- rep(0, 21)
names(num_cases) <- as.character(seq(0, 20, 1))
num_cases[names(tally)] <- tally
# convert cases to proportional of total cases to plot on the same scale as
# the density
prop_num_cases <- num_cases / sum(num_cases)
# create data frame with proportion of cases, density and number of secondary
# cases
nbinom_data <- data.frame(
x = seq(0, 20, 1),
prop_num_cases = prop_num_cases,
density = dnbinom(
x = seq(0, 20, 1),
size = nbinom_fit$estimate[["size"]],
mu = nbinom_fit$estimate[["mu"]]
)
)
# make plot
ggplot(data = nbinom_data) +
geom_col(
mapping = aes(x = x, y = prop_num_cases),
fill = "cyan3",
colour = "black"
) +
geom_point(
mapping = aes(x = x, y = density),
colour = "#f58231",
size = 2
) +
geom_line(
mapping = aes(x = x, y = density),
colour = "#f58231"
) +
scale_x_continuous(name = "Number of secondary cases") +
scale_y_continuous(name = "Frequency") +
theme_bw()
Next we partitioned the data into index cases and secondary cases following Kucharski et al. (2016).
index_case_transmission <- c(2, 17, 5, 1, 8, 2, 14)
secondary_case_transmission <- c(
1, 2, 1, 4, 4, 1, 3, 3, 1, 1, 4, 9, 9, 1, 2, 1, 1, 1, 4, 3, 3, 4, 2,
5, 1, 2, 2, 1, 9, 1, 3, 1, 2, 1, 1, 2
)
# Format data into index and non-index cases
# total non-index cases
n_non_index <- sum(c(index_case_transmission, secondary_case_transmission))
# transmission from all non-index cases
non_index_cases <- c(
secondary_case_transmission,
rep(0, n_non_index - length(secondary_case_transmission))
)
We fit the negative binomial model for both the index and non-index case transmission events.
# Estimate R and k for index and non-index cases
param_index <- fitdist(
data = index_case_transmission,
distr = "nbinom"
)
param_index
#> Fitting of the distribution ' nbinom ' by maximum likelihood
#> Parameters:
#> estimate Std. Error
#> size 1.596646 1.025029
#> mu 7.000771 2.320850
param_non_index <- fitdist(
data = non_index_cases,
distr = "nbinom"
)
param_non_index
#> Fitting of the distribution ' nbinom ' by maximum likelihood
#> Parameters:
#> estimate Std. Error
#> size 0.1937490 0.05005421
#> mu 0.6619608 0.14197451
Advanced options
This section of the vignette discusses some alternative transmission models to quantify overdispersion and may be considered advanced in comparison to the above example.
Research subsequent to Lloyd-Smith et al. (2005) has shown that contact data from infectious disease outbreaks often shows signatures of overdispersion and can therefore cause superspreading events. However, overdispersion can be modelled by other distributions to the negative binomial.
Kremer et al. (2021) showed for SARS-CoV-2 that superspreading may be better modelled by a Poisson mixture distribution (or Poisson compound distribution). Four Poisson mixture distribution were applied: Poisson-Gamma, Poisson-Lognormal, Poisson-Weibull, and Poisson-Generalised-Gamma; finding the Poisson-lognormal and Poisson-Weibull fit best for three data sets.
The Poisson-Gamma distribution is the same as the negative binomial distribution. Hereafter we refer to it is the negative binomial.
R has the density functions and cumulative distribution functions for
the poisson (dpois()
and ppois()
), geometric
(dgeom()
and pgeom()
) and negative binomial
(dnbinom()
and pnbinom()
) distributions, but
does not supply Poisson compound distributions.
In the {superspreading} R package we supply the density and
cumulative distribution functions for the Poisson-lognormal
(dpoislnorm()
and ppoislnorm()
) and
Poisson-Weibull (dpoisweibull()
and
ppoisweibull()
).
These functions can be used with {fitdistrplus}, as shown for the other distributions above.
# fit an expanded set of offspring distribution models:
# - Poisson
# - Geometric
# - Negative Binomial
# - Poisson-lognormal compound
# - Poisson-Weibull compound
pois_fit <- fitdist(data = all_cases, distr = "pois")
geom_fit <- fitdist(data = all_cases, distr = "geom")
nbinom_fit <- fitdist(data = all_cases, distr = "nbinom")
poislnorm_fit <- fitdist(
data = all_cases,
distr = "poislnorm",
start = list(meanlog = 1, sdlog = 1)
)
poisweibull_fit <- fitdist(
data = all_cases,
distr = "poisweibull",
start = list(shape = 1, scale = 1)
)
model_tbl <- ic_tbl(
pois_fit, geom_fit, nbinom_fit, poislnorm_fit, poisweibull_fit
)
col.names <- gsub(
pattern = "^Delta", replacement = "$\\\\Delta$", x = colnames(model_tbl)
)
col.names <- gsub(pattern = "^w", replacement = "$w$", x = col.names)
knitr::kable(model_tbl, col.names = col.names, row.names = FALSE, digits = 1)
distribution | AIC | ΔAIC | wAIC | BIC | ΔBIC | wBIC |
---|---|---|---|---|---|---|
nbinom | 358.4 | 0.0 | 0.6 | 364.4 | 0.0 | 0.6 |
poisweibull | 359.8 | 1.4 | 0.3 | 365.9 | 1.4 | 0.3 |
poislnorm | 362.8 | 4.4 | 0.1 | 368.9 | 4.4 | 0.1 |
geom | 413.6 | 55.2 | 0.0 | 416.6 | 52.1 | 0.0 |
pois | 602.4 | 244.0 | 0.0 | 605.4 | 241.0 | 0.0 |
The negative binomial is still the best performing fitting model, even with the addition of the Poisson-lognormal and Poisson-Weibull models. Although the second best model, the Poisson-Weibull, performs similarly well to the negative binomial (ΔAIC < 2 & ΔBIC < 2).
Here we plot the density of each distribution given the maximum likelihood estimates to inspect differences between the two distributions.
# create data frame with proportion of cases, density of each distribution
dist_compare_data <- data.frame(
x = seq(0, 20, 1),
prop_num_cases = c(prop_num_cases, rep(0, 21)),
dens = c(
dnbinom(
x = seq(0, 20, 1),
size = nbinom_fit$estimate[["size"]],
mu = nbinom_fit$estimate[["mu"]]
),
poisweibull_density = dpoisweibull(
x = seq(0, 20, 1),
shape = poisweibull_fit$estimate[["shape"]],
scale = poisweibull_fit$estimate[["scale"]]
)
),
dist = c(rep("Negative Binomial", 21), rep("Poisson-Weibull", 21))
)
# make plot
ggplot(data = dist_compare_data) +
geom_col(
mapping = aes(x = x, y = prop_num_cases),
fill = "cyan3",
colour = "black"
) +
geom_point(
mapping = aes(
x = x,
y = dens,
colour = dist
),
size = 2
) +
geom_line(mapping = aes(x = x, y = dens, colour = dist)) +
scale_x_continuous(name = "Number of secondary cases") +
scale_y_continuous(name = "Frequency") +
scale_colour_manual(
name = "Distribution",
values = c("#f58231", "#ffe119")
) +
theme_bw() +
theme(legend.position = "top")
The fit of each model is very similar, as expected from the close AIC and BIC (these models have the same numbers of free parameters). The Poisson-Weibull compound distribution can adequately capture the overdispersion in the EVD transmission data. A noticeable difference between the models fitting to the number of cases that caused a single secondary cases. Here the negative binomial matches the empirical data, whereas the Poisson-Weibull slightly overestimates the number.
Overall, {superspreading} and {fitdistrplus} (or another fitting package of your choice) can be used to estimate and understand individual-level transmission dynamics when transmission chain data is available.