---
title: "Estimate individual-level transmission"
output: rmarkdown::html_vignette
bibliography: references.json
link-citations: true
vignette: >
  %\VignetteIndexEntry{Estimate individual-level transmission}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The rate of growth or decline in the incidence of a transmissible infectious disease
is often quantified as the population-level reproduction number ($R$). However,
there may be individual-level heterogeneity in transmission that this metric does not capture. 

In their landmark work, @lloyd-smithSuperspreadingEffectIndividual2005 showed
that variation in individual-level disease transmission often exceed that of a poisson
model and in some cases that of a geometric model. They fit a negative binomial
distribution to show that the distribution of secondary cases displayed overdispersion,
in other words, the variance exceeded the mean. In the negative binomial distribution
this is quantified by the dispersion ($k$) parameter being less than 1.

This vignette demonstrates how to use the {superspreading} and {fitdistrplus} R
packages to estimate the parameters of individual-level transmission and select
the best fitting model.

Additionally, {ggplot2} is used for plotting.

```{r setup}
library(superspreading)
library(fitdistrplus)
library(ggplot2)
```

## Transmission data

For this example we use transmission chain data from @fayeChainsTransmissionControl2015 from the Ebola virus disease outbreak in West Africa, between the period from February to August 2014. Specifically, this data reconstructs the pathway of Ebola transmission in Conakry, Guinea.

This data has been used by @althausEbolaSuperspreading2015 to show that the Ebola outbreak in Conakry displayed signatures of overdispersion (i.e. superspreading events), and by @kucharskiEffectivenessRingVaccination2016 to determine the effectiveness of ring- and mass-vaccination campaigns for ongoing Ebola outbreaks.

```{r, prep-data}
# total number of cases (i.e. individuals in transmission chain)
n <- 152

# number of secondary cases for all cases
all_cases_transmission <- c(
  1, 2, 2, 5, 14, 1, 4, 4, 1, 3, 3, 8, 2, 1, 1, 4, 9, 9, 1, 1, 17, 2, 1,
  1, 1, 4, 3, 3, 4, 2, 5, 1, 2, 2, 1, 9, 1, 3, 1, 2, 1, 1, 2
)

# add zeros for each cases that did not cause a secondary case
# (i.e. cases that did not transmit)
all_cases <- c(
  all_cases_transmission,
  rep(0, n - length(all_cases_transmission))
)
```

```{r, fit-dists}
# fit a standard set of offspring distribution models:
# - Poisson
# - Geometric
# - Negative Binomial

pois_fit <- fitdist(data = all_cases, distr = "pois")
geom_fit <- fitdist(data = all_cases, distr = "geom")
nbinom_fit <- fitdist(data = all_cases, distr = "nbinom")
```

```{r, fit-table}
model_tbl <- ic_tbl(pois_fit, geom_fit, nbinom_fit)
col.names <- gsub(
  pattern = "^Delta", replacement = "$\\\\Delta$", x = colnames(model_tbl)
)
col.names <- gsub(pattern = "^w", replacement = "$w$", x = col.names)
knitr::kable(model_tbl, col.names = col.names, row.names = FALSE, digits = 1)
```

The best performing model, for both AIC and BIC comparison, is the negative binomial.

AIC and BIC are model fitting metrics (information criterion) to determine the model 
that best fits the data while penalising model complexity to prevent overfitting and allows
comparison of models with a different number of parameters. In this case the Poisson
and Geometric models have a single parameter, whereas the Negative binomial has two
parameters.

```{r, nbinom-estim}
nbinom_fit$estimate
```

The parameter for the negative binomial show that there is overdispersion 
(`size` is the dispersion parameter $k$, and `mu` is the mean, or $R$) in
transmission and thus the EVD transmission chain data shows that superspreading
events are a possible realisation of EVD transmission dynamics.

These values match those reported in @althausEbolaSuperspreading2015 and here 
we reproduce figure 1 from that paper to display the tail of the distribution
from which superspreading events can be possible.

```{r, prep-plot-nbinom}
# tally cases
tally <- table(all_cases)

# pad with zeros when number of cases not in tally
num_cases <- rep(0, 21)
names(num_cases) <- as.character(seq(0, 20, 1))
num_cases[names(tally)] <- tally

# convert cases to proportional of total cases to plot on the same scale as
# the density
prop_num_cases <- num_cases / sum(num_cases)

# create data frame with proportion of cases, density and number of secondary
# cases
nbinom_data <- data.frame(
  x = seq(0, 20, 1),
  prop_num_cases = prop_num_cases,
  density = dnbinom(
    x = seq(0, 20, 1),
    size = nbinom_fit$estimate[["size"]],
    mu = nbinom_fit$estimate[["mu"]]
  )
)
```

```{r, plot-nbinom, fig.cap="Number of secondary cases from the empirical data (bar plot) and the density of the negative binomial with the maximum likelihood estimates when fit to the empirical data (points and line). This plot is reproduced from Althaus (2015) figure 1.", fig.width = 8, fig.height = 5}
# make plot
ggplot(data = nbinom_data) +
  geom_col(
    mapping = aes(x = x, y = prop_num_cases),
    fill = "cyan3",
    colour = "black"
  ) +
  geom_point(
    mapping = aes(x = x, y = density),
    colour = "#f58231",
    size = 2
  ) +
  geom_line(
    mapping = aes(x = x, y = density),
    colour = "#f58231"
  ) +
  scale_x_continuous(name = "Number of secondary cases") +
  scale_y_continuous(name = "Frequency") +
  theme_bw()
```

Next we partitioned the data into index cases and secondary cases following @kucharskiEffectivenessRingVaccination2016.

```{r, split-data}
index_case_transmission <- c(2, 17, 5, 1, 8, 2, 14)
secondary_case_transmission <- c(
  1, 2, 1, 4, 4, 1, 3, 3, 1, 1, 4, 9, 9, 1, 2, 1, 1, 1, 4, 3, 3, 4, 2,
  5, 1, 2, 2, 1, 9, 1, 3, 1, 2, 1, 1, 2
)

# Format data into index and non-index cases

# total non-index cases
n_non_index <- sum(c(index_case_transmission, secondary_case_transmission))
# transmission from all non-index cases
non_index_cases <- c(
  secondary_case_transmission,
  rep(0, n_non_index - length(secondary_case_transmission))
)
```

We fit the negative binomial model for both the index and non-index case transmission events.

```{r, fit-nbinom}
# Estimate R and k for index and non-index cases
param_index <- fitdist(
  data = index_case_transmission,
  distr = "nbinom"
)
param_index
param_non_index <- fitdist(
  data = non_index_cases,
  distr = "nbinom"
)
param_non_index
```

## Superspreading using alternative distributions

::: {.alert .alert-primary}
**Advanced options** 

This section of the vignette discusses some alternative transmission models to 
quantify overdispersion and may be considered advanced in comparison to the above
example.
:::

Research subsequent to @lloyd-smithSuperspreadingEffectIndividual2005 has shown that 
contact data from infectious disease outbreaks often shows signatures of overdispersion and can therefore cause superspreading events. However, overdispersion can be modelled by other distributions to the negative
binomial. 

@kremerQuantifyingSuperspreadingCOVID192021 showed for SARS-CoV-2 that superspreading
may be better modelled by a Poisson mixture distribution (or Poisson compound distribution).
Four Poisson mixture distribution were applied: Poisson-Gamma, Poisson-Lognormal, Poisson-Weibull,
and Poisson-Generalised-Gamma; finding the Poisson-lognormal and Poisson-Weibull fit best
for three data sets.

::: {.alert .alert-warning}
The Poisson-Gamma distribution is the same as the negative binomial distribution. 
Hereafter we refer to it is the negative binomial.
:::

R has the density functions and cumulative distribution functions for the poisson 
(`dpois()` and `ppois()`), geometric (`dgeom()` and `pgeom()`) and negative binomial
(`dnbinom()` and `pnbinom()`) distributions, but does not supply Poisson compound 
distributions. 

In the {superspreading} R package we supply the density and cumulative
distribution functions for the Poisson-lognormal (`dpoislnorm()` and `ppoislnorm()`) 
and Poisson-Weibull (`dpoisweibull()` and `ppoisweibull()`).

These functions can be used with {fitdistrplus}, as shown for the other distributions above.

```{r, fit-extra-dists}
# fit an expanded set of offspring distribution models:
# - Poisson
# - Geometric
# - Negative Binomial
# - Poisson-lognormal compound
# - Poisson-Weibull compound

pois_fit <- fitdist(data = all_cases, distr = "pois")
geom_fit <- fitdist(data = all_cases, distr = "geom")
nbinom_fit <- fitdist(data = all_cases, distr = "nbinom")
poislnorm_fit <- fitdist(
  data = all_cases,
  distr = "poislnorm",
  start = list(meanlog = 1, sdlog = 1)
)
poisweibull_fit <- fitdist(
  data = all_cases,
  distr = "poisweibull",
  start = list(shape = 1, scale = 1)
)
```

```{r, fit-extra-table}
model_tbl <- ic_tbl(
  pois_fit, geom_fit, nbinom_fit, poislnorm_fit, poisweibull_fit
)
col.names <- gsub(
  pattern = "^Delta", replacement = "$\\\\Delta$", x = colnames(model_tbl)
)
col.names <- gsub(pattern = "^w", replacement = "$w$", x = col.names)
knitr::kable(model_tbl, col.names = col.names, row.names = FALSE, digits = 1)
```

The negative binomial is still the best performing fitting model, even with the addition of the Poisson-lognormal
and Poisson-Weibull models. Although the second best model, the Poisson-Weibull, performs similarly well to the 
negative binomial ($\Delta$AIC < 2 & $\Delta$BIC < 2).

Here we plot the density of each distribution given the maximum likelihood estimates to inspect differences
between the two distributions.


```{r, prep-plot-dists}
# create data frame with proportion of cases, density of each distribution
dist_compare_data <- data.frame(
  x = seq(0, 20, 1),
  prop_num_cases = c(prop_num_cases, rep(0, 21)),
  dens = c(
    dnbinom(
      x = seq(0, 20, 1),
      size = nbinom_fit$estimate[["size"]],
      mu = nbinom_fit$estimate[["mu"]]
    ),
    poisweibull_density = dpoisweibull(
      x = seq(0, 20, 1),
      shape = poisweibull_fit$estimate[["shape"]],
      scale = poisweibull_fit$estimate[["scale"]]
    )
  ),
  dist = c(rep("Negative Binomial", 21), rep("Poisson-Weibull", 21))
)
```

```{r, plot-dists, fig.cap="Number of secondary cases from the empirical data (bar plot) and the density of the negative binomial (orange) and poisson-Weibull (pink) with the maximum likelihood estimates when fit to the empirical data (points and line).", fig.width = 8, fig.height = 5}
# make plot
ggplot(data = dist_compare_data) +
  geom_col(
    mapping = aes(x = x, y = prop_num_cases),
    fill = "cyan3",
    colour = "black"
  ) +
  geom_point(
    mapping = aes(
      x = x,
      y = dens,
      colour = dist
    ),
    size = 2
  ) +
  geom_line(mapping = aes(x = x, y = dens, colour = dist)) +
  scale_x_continuous(name = "Number of secondary cases") +
  scale_y_continuous(name = "Frequency") +
  scale_colour_manual(
    name = "Distribution",
    values = c("#f58231", "#ffe119")
  ) +
  theme_bw() +
  theme(legend.position = "top")
```

The fit of each model is very similar, as expected from the close AIC and BIC (these models have the same numbers of free parameters). 
The Poisson-Weibull compound distribution can adequately capture the overdispersion in the EVD transmission data. A noticeable difference
between the models fitting to the number of cases that caused a single secondary cases. Here the negative binomial matches the empirical
data, whereas the Poisson-Weibull slightly overestimates the number.

Overall, {superspreading} and {fitdistrplus} (or another fitting package of your choice) can be used to estimate and 
understand individual-level transmission dynamics when transmission chain data is available.

## References