--- title: "Estimate individual-level transmission" output: rmarkdown::html_vignette bibliography: references.json link-citations: true vignette: > %\VignetteIndexEntry{Estimate individual-level transmission} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` The rate of growth or decline in the incidence of a transmissible infectious disease is often quantified as the population-level reproduction number ($R$). However, there may be individual-level heterogeneity in transmission that this metric does not capture. In their landmark work, @lloyd-smithSuperspreadingEffectIndividual2005 showed that variation in individual-level disease transmission often exceed that of a poisson model and in some cases that of a geometric model. They fit a negative binomial distribution to show that the distribution of secondary cases displayed overdispersion, in other words, the variance exceeded the mean. In the negative binomial distribution this is quantified by the dispersion ($k$) parameter being less than 1. This vignette demonstrates how to use the {superspreading} and {fitdistrplus} R packages to estimate the parameters of individual-level transmission and select the best fitting model. Additionally, {ggplot2} is used for plotting. ```{r setup} library(superspreading) library(fitdistrplus) library(ggplot2) ``` ## Transmission data For this example we use transmission chain data from @fayeChainsTransmissionControl2015 from the Ebola virus disease outbreak in West Africa, between the period from February to August 2014. Specifically, this data reconstructs the pathway of Ebola transmission in Conakry, Guinea. This data has been used by @althausEbolaSuperspreading2015 to show that the Ebola outbreak in Conakry displayed signatures of overdispersion (i.e. superspreading events), and by @kucharskiEffectivenessRingVaccination2016 to determine the effectiveness of ring- and mass-vaccination campaigns for ongoing Ebola outbreaks. ```{r, prep-data} # total number of cases (i.e. individuals in transmission chain) n <- 152 # number of secondary cases for all cases all_cases_transmission <- c( 1, 2, 2, 5, 14, 1, 4, 4, 1, 3, 3, 8, 2, 1, 1, 4, 9, 9, 1, 1, 17, 2, 1, 1, 1, 4, 3, 3, 4, 2, 5, 1, 2, 2, 1, 9, 1, 3, 1, 2, 1, 1, 2 ) # add zeros for each cases that did not cause a secondary case # (i.e. cases that did not transmit) all_cases <- c( all_cases_transmission, rep(0, n - length(all_cases_transmission)) ) ``` ```{r, fit-dists} # fit a standard set of offspring distribution models: # - Poisson # - Geometric # - Negative Binomial pois_fit <- fitdist(data = all_cases, distr = "pois") geom_fit <- fitdist(data = all_cases, distr = "geom") nbinom_fit <- fitdist(data = all_cases, distr = "nbinom") ``` ```{r, fit-table} model_tbl <- ic_tbl(pois_fit, geom_fit, nbinom_fit) col.names <- gsub( pattern = "^Delta", replacement = "$\\\\Delta$", x = colnames(model_tbl) ) col.names <- gsub(pattern = "^w", replacement = "$w$", x = col.names) knitr::kable(model_tbl, col.names = col.names, row.names = FALSE, digits = 1) ``` The best performing model, for both AIC and BIC comparison, is the negative binomial. AIC and BIC are model fitting metrics (information criterion) to determine the model that best fits the data while penalising model complexity to prevent overfitting and allows comparison of models with a different number of parameters. In this case the Poisson and Geometric models have a single parameter, whereas the Negative binomial has two parameters. ```{r, nbinom-estim} nbinom_fit$estimate ``` The parameter for the negative binomial show that there is overdispersion (`size` is the dispersion parameter $k$, and `mu` is the mean, or $R$) in transmission and thus the EVD transmission chain data shows that superspreading events are a possible realisation of EVD transmission dynamics. These values match those reported in @althausEbolaSuperspreading2015 and here we reproduce figure 1 from that paper to display the tail of the distribution from which superspreading events can be possible. ```{r, prep-plot-nbinom} # tally cases tally <- table(all_cases) # pad with zeros when number of cases not in tally num_cases <- rep(0, 21) names(num_cases) <- as.character(seq(0, 20, 1)) num_cases[names(tally)] <- tally # convert cases to proportional of total cases to plot on the same scale as # the density prop_num_cases <- num_cases / sum(num_cases) # create data frame with proportion of cases, density and number of secondary # cases nbinom_data <- data.frame( x = seq(0, 20, 1), prop_num_cases = prop_num_cases, density = dnbinom( x = seq(0, 20, 1), size = nbinom_fit$estimate[["size"]], mu = nbinom_fit$estimate[["mu"]] ) ) ``` ```{r, plot-nbinom, fig.cap="Number of secondary cases from the empirical data (bar plot) and the density of the negative binomial with the maximum likelihood estimates when fit to the empirical data (points and line). This plot is reproduced from Althaus (2015) figure 1.", fig.width = 8, fig.height = 5} # make plot ggplot(data = nbinom_data) + geom_col( mapping = aes(x = x, y = prop_num_cases), fill = "cyan3", colour = "black" ) + geom_point( mapping = aes(x = x, y = density), colour = "#f58231", size = 2 ) + geom_line( mapping = aes(x = x, y = density), colour = "#f58231" ) + scale_x_continuous(name = "Number of secondary cases") + scale_y_continuous(name = "Frequency") + theme_bw() ``` Next we partitioned the data into index cases and secondary cases following @kucharskiEffectivenessRingVaccination2016. ```{r, split-data} index_case_transmission <- c(2, 17, 5, 1, 8, 2, 14) secondary_case_transmission <- c( 1, 2, 1, 4, 4, 1, 3, 3, 1, 1, 4, 9, 9, 1, 2, 1, 1, 1, 4, 3, 3, 4, 2, 5, 1, 2, 2, 1, 9, 1, 3, 1, 2, 1, 1, 2 ) # Format data into index and non-index cases # total non-index cases n_non_index <- sum(c(index_case_transmission, secondary_case_transmission)) # transmission from all non-index cases non_index_cases <- c( secondary_case_transmission, rep(0, n_non_index - length(secondary_case_transmission)) ) ``` We fit the negative binomial model for both the index and non-index case transmission events. ```{r, fit-nbinom} # Estimate R and k for index and non-index cases param_index <- fitdist( data = index_case_transmission, distr = "nbinom" ) param_index param_non_index <- fitdist( data = non_index_cases, distr = "nbinom" ) param_non_index ``` ## Superspreading using alternative distributions ::: {.alert .alert-primary} **Advanced options** This section of the vignette discusses some alternative transmission models to quantify overdispersion and may be considered advanced in comparison to the above example. ::: Research subsequent to @lloyd-smithSuperspreadingEffectIndividual2005 has shown that contact data from infectious disease outbreaks often shows signatures of overdispersion and can therefore cause superspreading events. However, overdispersion can be modelled by other distributions to the negative binomial. @kremerQuantifyingSuperspreadingCOVID192021 showed for SARS-CoV-2 that superspreading may be better modelled by a Poisson mixture distribution (or Poisson compound distribution). Four Poisson mixture distribution were applied: Poisson-Gamma, Poisson-Lognormal, Poisson-Weibull, and Poisson-Generalised-Gamma; finding the Poisson-lognormal and Poisson-Weibull fit best for three data sets. ::: {.alert .alert-warning} The Poisson-Gamma distribution is the same as the negative binomial distribution. Hereafter we refer to it is the negative binomial. ::: R has the density functions and cumulative distribution functions for the poisson (`dpois()` and `ppois()`), geometric (`dgeom()` and `pgeom()`) and negative binomial (`dnbinom()` and `pnbinom()`) distributions, but does not supply Poisson compound distributions. In the {superspreading} R package we supply the density and cumulative distribution functions for the Poisson-lognormal (`dpoislnorm()` and `ppoislnorm()`) and Poisson-Weibull (`dpoisweibull()` and `ppoisweibull()`). These functions can be used with {fitdistrplus}, as shown for the other distributions above. ```{r, fit-extra-dists} # fit an expanded set of offspring distribution models: # - Poisson # - Geometric # - Negative Binomial # - Poisson-lognormal compound # - Poisson-Weibull compound pois_fit <- fitdist(data = all_cases, distr = "pois") geom_fit <- fitdist(data = all_cases, distr = "geom") nbinom_fit <- fitdist(data = all_cases, distr = "nbinom") poislnorm_fit <- fitdist( data = all_cases, distr = "poislnorm", start = list(meanlog = 1, sdlog = 1) ) poisweibull_fit <- fitdist( data = all_cases, distr = "poisweibull", start = list(shape = 1, scale = 1) ) ``` ```{r, fit-extra-table} model_tbl <- ic_tbl( pois_fit, geom_fit, nbinom_fit, poislnorm_fit, poisweibull_fit ) col.names <- gsub( pattern = "^Delta", replacement = "$\\\\Delta$", x = colnames(model_tbl) ) col.names <- gsub(pattern = "^w", replacement = "$w$", x = col.names) knitr::kable(model_tbl, col.names = col.names, row.names = FALSE, digits = 1) ``` The negative binomial is still the best performing fitting model, even with the addition of the Poisson-lognormal and Poisson-Weibull models. Although the second best model, the Poisson-Weibull, performs similarly well to the negative binomial ($\Delta$AIC < 2 & $\Delta$BIC < 2). Here we plot the density of each distribution given the maximum likelihood estimates to inspect differences between the two distributions. ```{r, prep-plot-dists} # create data frame with proportion of cases, density of each distribution dist_compare_data <- data.frame( x = seq(0, 20, 1), prop_num_cases = c(prop_num_cases, rep(0, 21)), dens = c( dnbinom( x = seq(0, 20, 1), size = nbinom_fit$estimate[["size"]], mu = nbinom_fit$estimate[["mu"]] ), poisweibull_density = dpoisweibull( x = seq(0, 20, 1), shape = poisweibull_fit$estimate[["shape"]], scale = poisweibull_fit$estimate[["scale"]] ) ), dist = c(rep("Negative Binomial", 21), rep("Poisson-Weibull", 21)) ) ``` ```{r, plot-dists, fig.cap="Number of secondary cases from the empirical data (bar plot) and the density of the negative binomial (orange) and poisson-Weibull (pink) with the maximum likelihood estimates when fit to the empirical data (points and line).", fig.width = 8, fig.height = 5} # make plot ggplot(data = dist_compare_data) + geom_col( mapping = aes(x = x, y = prop_num_cases), fill = "cyan3", colour = "black" ) + geom_point( mapping = aes( x = x, y = dens, colour = dist ), size = 2 ) + geom_line(mapping = aes(x = x, y = dens, colour = dist)) + scale_x_continuous(name = "Number of secondary cases") + scale_y_continuous(name = "Frequency") + scale_colour_manual( name = "Distribution", values = c("#f58231", "#ffe119") ) + theme_bw() + theme(legend.position = "top") ``` The fit of each model is very similar, as expected from the close AIC and BIC (these models have the same numbers of free parameters). The Poisson-Weibull compound distribution can adequately capture the overdispersion in the EVD transmission data. A noticeable difference between the models fitting to the number of cases that caused a single secondary cases. Here the negative binomial matches the empirical data, whereas the Poisson-Weibull slightly overestimates the number. Overall, {superspreading} and {fitdistrplus} (or another fitting package of your choice) can be used to estimate and understand individual-level transmission dynamics when transmission chain data is available. ## References