---
title: "Methods for calculating the proportion of transmission"
output: rmarkdown::html_vignette
bibliography: references.json
link-citations: true
vignette: >
  %\VignetteIndexEntry{Methods for calculating the proportion of transmission}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette explores the `proportion_transmission()` function in {superspreading}. The function calculates what proportion of cases we would expect to cause a certain proportion of transmission for a particular infectious disease (e.g. "how much transmission comes from the top 10% of infectious individuals?")

To perform this calculation, we assume that the offspring distribution of disease transmission depends both on the distribution of individual variability in transmissibility, which we define using a Gamma distribution with mean $R$, as well as stochastic transmission within a population, which we define using a Poisson process, following @lloyd-smithSuperspreadingEffectIndividual2005.

If we put a Gamma distributed individual transmissibility into a Poisson distribution, the result a negative binomial distribution. This is defined by two parameters: $R$, the mean of the negative binomial distribution and the average number of secondary cases caused by a typical primary case; and $k$, the dispersion parameter of the negative binomial distribution and controls the heterogeneity in transmission. A smaller $k$ results is more variability (overdispersion) in transmission and thus superspreading events are more likely. 

::: {.alert .alert-info}
Poisson and geometric offspring distributions are special cases of the negative binomial offspring distribution. By setting $k$ to `Inf` (or approximately infinite, $> 10^5$) then the offspring distribution is a Poisson distribution. By setting $k$ to 1 the offspring distribution is a geometric distribution.

It is currently not possible to calculate the proportion transmission using the Poisson-Lognormal and Poisson-Weibull distributions (whose density and cumulative distribution functions are included in the {superspreading} package).
:::

The proportion of transmission can be calculated using two methods, both of which are included in the `proportion_transmission()` function and can be changed using the `method` argument. The first method focuses on transmission as it occurs in reality, accounting both for variation in the mean number of secondary cases at the individual *and* the stochastic nature of onwards transmission within a population; the second method focuses only on variation in the mean number of secondary cases at the individual level. The first method is denoted $p_{80}$ and the second $t_{20}$. The $p_{80}$ method is the default (`method = "p_80"`). 

::: {.alert .alert-danger}
The output of `method = "p_80"` and `method = "t_20"` have different interpretations and cannot be used interchangeably without understanding the differences in what they are measuring. 

The output of `method = "p_80"` gives the proportion of cases that generate a certain proportion of realised transmission. The most common use case is calculating what proportion of cases would cause 80% of transmission during an outbreak of the infection. Thus a small proportion in the output `<data.frame>` means that there is a lot of overdispersion in individual-level transmission. The `percent_transmission` argument when `method = "p_80"` is to set the proportion of transmission.

The output of `method = "t_20"` gives the proportion of cases that we would expected to produced by a certain proportion of the most infectious individuals. This is commonly used to calculate what proportion of cases are expected to be caused by the most infectious 20% of individuals. A high proportion in the output `<data.frame>` means that there is a lot of overdispersion in the transmission. The `percent_transmission` argument when `method = "t_20"` is to set the proportion of most infectious cases to calculate their proportion of total transmission.

The key difference is that in a realised large outbreak (i.e. one that includes stochastic transmission), it is highly likely that some individuals will generate no secondary cases. This is because even a non-zero expected number of secondary cases can produce a zero when drawn from a Poisson process.
:::

## Definitions

The formula for the $p_{80}$ method as stated in @endoEstimatingOverdispersionCOVID192020 is:

$$
1 - p_{80} = \int^{X}_{0} \text{NB} ( \lfloor x \rfloor; k, \left( \frac{k}{(R_0) + k} \right) dx
$$
where $X$ satisfies

$$
1 - 0.8 = \frac{1}{R_0} \int^{X}_{0} \lfloor x \rfloor \text{NB} \left( \lfloor x \rfloor; k, \frac{k}{R_0 + k} \right) dx
$$
Additionally, @endoEstimatingOverdispersionCOVID192020 showed that:

$$
\frac{1}{R_0} \int^{X}_{0} \lfloor x \rfloor \text{NB} \left( \lfloor x \rfloor; k, \frac{k}{R_0 + k} \right) dx = \int^{X-1}_{0} \text{NB} \left( \lfloor x \rfloor; k + 1, \frac{k}{R_0 + k} \right)
$$

The $t_{20}$ method calculates, as stated by @lloyd-smithSuperspreadingEffectIndividual2005: "the expected proportion of transmission due to the most infectious 20% of cases, $t_{20}$", given by $t_{20} = 1 - F_{trans}(x_{20})$, where $F_{trans}$ is defined as:

$$
F_{trans}(x) = \frac{1}{R_0} \int^{x}_{0} u f_v (u) du
$$

$f_\nu(x)$ is the probability density function (pdf) of the gamma distribution of the individual reproduction number $\nu$.

For both methods the proportion of transmission can be modified using the `percent_transmission` argument in `proportion_transmission()` so they are not fixed at 80 and 20, respectively, for $p_{80}$ and $t_{20}$.

There are two methods for calculating the $p_{80}$ method, analytically as given by @endoEstimatingOverdispersionCOVID192020, or numerically by sampling from a negative binomial distribution (`proportion_transmission(..., simulate = TRUE)`. For the purpose of this vignette to compare all of the methods we will term the analytical calculation $p_{80}$, and the numerical calculation $p_{80}^{sim}$.

Now that the $p_{80}$ and $t_{20}$ methods have been defined we'll explore the functionality of each, their characteristics and compare them, making note of any unexpected behaviour to watch out for.

First we load the {superspreading}, {ggplot2}, {purrr} and {dplyr} R packages.

```{r setup}
library(superspreading)
library(ggplot2)
library(purrr)
library(dplyr)
```

## Exploring each method

To show the proportion of transmission using both methods we can load the estimates of $R$ and $k$ estimated from @lloyd-smithSuperspreadingEffectIndividual2005, which are stored in the library of epidemiological parameters in the {epiparameter} R package, and the interoperability of {epiparameter} and {superspreading} to directly supply the parameters for the offspring distributions (`<epiparameter>` objects) to the `offspring_dist` argument in `proportion_transmission()`.

```{r, load-offspring-dist}
library(epiparameter)
offspring_dists <- epiparameter_db(
  epi_dist = "offspring distribution"
)
```

Here we create a table with the estimates of the mean ($R$) and dispersion ($k$) for each disease from @lloyd-smithSuperspreadingEffectIndividual2005.

```{r}
diseases <- vapply(offspring_dists, `[[`, FUN.VALUE = character(1), "disease")
offspring_dists <- offspring_dists[!duplicated(diseases)]
diseases <- diseases[!duplicated(diseases)]

offspring_dist_params <- vapply(
  offspring_dists, get_parameters, FUN.VALUE = numeric(2)
)

offspring_dist_params <- data.frame(
  disease = diseases,
  t(offspring_dist_params)
)
offspring_dist_params
```

Using these parameter estimates for the negative binomial offspring distribution of each disease we can append the calculated proportion of transmission caused by the most infectious 20% of cases, using the $t_{20}$ method, and the calculated proportion of transmission causing 80% of onward transmission, using the $p_{80}$ method.

```{r}
offspring_dist_params$t20 <- do.call(
  rbind,
  apply(
    offspring_dist_params,
    MARGIN = 1,
    function(x) {
      proportion_transmission(
        R = as.numeric(x[2]), k = as.numeric(x[3]), percent_transmission = 0.2,
        method = "t_20", format_prop = FALSE
      )
    }
  )
)[, 3]

offspring_dist_params$p80 <- do.call(
  rbind,
  apply(
    offspring_dist_params,
    MARGIN = 1,
    function(x) {
      proportion_transmission(
        R = as.numeric(x[2]), k = as.numeric(x[3]), percent_transmission = 0.8,
        method = "p_80", format_prop = FALSE
      )
    }
  )
)[, 3]
offspring_dist_params
```

It can be seen in the table above that when an offspring distribution has a smaller dispersion ($k$) parameter the proportion of cases produced by the most infectious 20% of cases is high, and the proportion of cases that produce 80% of transmission is low. The estimates for SARS demonstrate this point. Both metrics lead to the same conclusion that disease transmission dynamics for the 2003 SARS outbreak in Singapore are heterogeneous and superspreading is a important aspect of the outbreak.

The variability in individual-level transmission comes from modelling the individual reproduction number, $\nu$ as a gamma distribution, instead of assuming that the individual-level reproduction number is equal for all individuals, which would result in a Poisson offspring distribution (see @lloyd-smithSuperspreadingEffectIndividual2005 [Supplementary Material](https://static-content.springer.com/esm/art%3A10.1038%2Fnature04153/MediaObjects/41586_2005_BFnature04153_MOESM1_ESM.pdf) for more information). To visualise the inherent variability resulting from the gamma-distributed reproduction number and how that compares to the assumption of a homogeneous population we can plot the expected proportion of transmission against the proportion of infectious cases. 

For this we will need to write a couple of custom functions.

```{r}
# nolint start for `:::`
get_infectious_curve <- function(R, k) {
  # upper limit of x when y = 0
  upper_u <- superspreading:::solve_for_u(prop = 0, R = R, k = k)
  upper_u <- round(upper_u)
  u_seq <- superspreading:::lseq(from = 1e-5, to = upper_u, length.out = 500)
  res <- lapply(u_seq, function(upper) {
    integrate(
      function(u) u * superspreading:::fvx(x = u, R, k),
      lower = 0, upper = upper
    )$value / R
  })
  expected_v_more_than_x <- 1 - unlist(res)
  proportion_more_than_x <- 1 - superspreading:::pgammaRk(u_seq, R = R, k = k)
  data.frame(
    exp_t = expected_v_more_than_x,
    prop_i = proportion_more_than_x
  )
}
# nolint end
```

We can now reproduce Figure 1b in @lloyd-smithSuperspreadingEffectIndividual2005 using the offspring distribution parameters obtained above.

```{r, fig.width=8}
infect_curve <- map(offspring_dist_params %>%
      group_split(disease), function(x) {
        get_infectious_curve(R = x$mean, k = x$dispersion) %>%
          mutate(
            disease = x$disease,
            R = x$mean, k = x$dispersion
          )
      })

infect_curve <- do.call(rbind, infect_curve)

ggplot(
  data = infect_curve,
  aes(x = prop_i, y = exp_t, colour = disease)
) +
  geom_line() +
  geom_abline(slope = 1) +
  geom_vline(xintercept = 0.2, linetype = 2, alpha = 0.2) +
  scale_x_continuous(breaks = seq(0, 1, by = 0.2)) +
  theme_classic() +
  theme(
    aspect.ratio = 1,
    legend.position = "right"
  ) +
  annotate(
    geom = "text",
    angle = 45,
    size = 2.5,
    x = 0.5,
    y = 0.5,
    vjust = 1.5,
    label = "Homogeneous population"
  ) +
  labs(
    x = "Proportion of infectious cases (ranked)",
    y = "Expected proportion of transmission",
    colour = ""
  ) +
  coord_cartesian(expand = FALSE)
```

This plot shows the variability in transmission owing to the gamma-distributed individual reproduction number ($\nu$). If we take a slice through the above plot when the proportion of infectious cases equals 0.2 (shown by the dashed line) we can calculate the proportion of transmission caused by the most infectious 20% of cases using the `proportion_tranmission(..., method = "t_20")` function. 

```{r, fig.width=8}
k_seq <- superspreading:::lseq(from = 0.01, to = 100, length.out = 1000) # nolint
y <- map_dbl(
  k_seq,
  function(x) {
    proportion_transmission(
      R = 2, k = x, percent_transmission = 0.2,
      method = "t_20", format_prop = FALSE
    )[, 3]
  }
)
prop_t20 <- data.frame(k_seq, y)

ggplot() +
  geom_line(data = prop_t20, mapping = aes(x = k_seq, y = y)) +
  geom_point(
  data = offspring_dist_params,
    mapping = aes(
      x = dispersion,
      y = t20,
      fill = disease
    ),
    shape = 21,
    size = 3
  ) +
  geom_hline(yintercept = c(0.2, 0.8), linetype = 2, alpha = 0.2) +
  theme_classic() +
  theme(
    aspect.ratio = 1
  ) +
  scale_y_continuous(
    name = paste(
      "Proportion of transmission expected from",
      "the most infectious 20% of cases",
      sep = " \n"
    ),
    limits = c(0, 1),
    breaks = seq(0, 1, by = 0.2)
  ) +
  scale_x_log10(name = "Dispersion parameter", expand = c(0, 0)) +
  labs(fill = "")
```

The above plot replicates Figure 1c from @lloyd-smithSuperspreadingEffectIndividual2005. It shows how different diseases have different proportion of transmission from the most infectious 20% owing to varying degrees of overdispersion ($k$).

The plot uses an $R$ of 2, however, one characteristic of the $t_{20}$ method is that different values of $R$ do not influence the proportion of transmission, so the plot would look identical with other value of $R$. This is because as shown the equation defining the $t_{20}$ method above the integral is divided by $R_0$ so that only $f_{\nu}(u)$ control the proportion of transmission.

```{r}
# For k = 0.5
proportion_transmission(
  R = 0.1, k = 0.5, percent_transmission = 0.8, method = "t_20"
)
proportion_transmission(
  R = 1, k = 0.5, percent_transmission = 0.8, method = "t_20"
)
proportion_transmission(
  R = 5, k = 0.5, percent_transmission = 0.8, method = "t_20"
)

# For k = 2
proportion_transmission(
  R = 0.1, k = 2, percent_transmission = 0.8, method = "t_20"
)
proportion_transmission(
  R = 1, k = 2, percent_transmission = 0.8, method = "t_20"
)
proportion_transmission(
  R = 5, k = 2, percent_transmission = 0.8, method = "t_20"
)
```

This is not the case for $p_{80}$, where changes in both $R$ and $k$ influence the proportion of transmission.

```{r}
# For k = 0.5
proportion_transmission(
  R = 0.1, k = 0.5, percent_transmission = 0.8, method = "p_80"
)
proportion_transmission(
  R = 1, k = 0.5, percent_transmission = 0.8, method = "p_80"
)
proportion_transmission(
  R = 5, k = 0.5, percent_transmission = 0.8, method = "p_80"
)

# For k = 2
proportion_transmission(
  R = 0.1, k = 2, percent_transmission = 0.8, method = "p_80"
)
proportion_transmission(
  R = 1, k = 2, percent_transmission = 0.8, method = "p_80"
)
proportion_transmission(
  R = 5, k = 2, percent_transmission = 0.8, method = "p_80"
)
```

One thing that was mentioned above is that the interpretation of the $p_{80}$ and $t_{20}$ methods are not interchangeable, additionally, $t_{80}$ and $p_{20}$ are not equal. Stated differently, the $p_{80}$ method to calculate the proportion of transmission that cause 20% of cases, and the $t_{20}$ method to calculate the proportion of transmission caused by the most infectious 80% are not equivalent. It is also the case that $1 - p_{80} \neq t_{20}$, thus $1 - t_{20} \neq p_{80}$.

Here we vary $R$ and $k$ and show that by setting the $p_{80}$ method to `percent_transmission = 0.2`, and the $t_{20}$ method to `percent_transmission = 0.8` the two cannot be interchangeably interpreted as outlined in the box above.

```{r}
# R = 1, k = 0.5
proportion_transmission(
  R = 1, k = 0.5, percent_transmission = 0.2, method = "p_80"
)
proportion_transmission(
  R = 1, k = 0.5, percent_transmission = 0.8, method = "t_20"
)

# R = 3, k = 2
proportion_transmission(
  R = 3, k = 2, percent_transmission = 0.2, method = "p_80"
)
proportion_transmission(
  R = 3, k = 2, percent_transmission = 0.8, method = "t_20"
)
```

Here we show that $1 - p_{80} \neq t_{20}$ and $1 - t_{20} \neq p_{80}$.

```{r}
1 - proportion_transmission(
  R = 1, k = 0.5, percent_transmission = 0.8, method = "p_80",
  format_prop = FALSE
)[, 3]
proportion_transmission(
  R = 1, k = 0.5, percent_transmission = 0.2, method = "t_20"
)

1 - proportion_transmission(
  R = 1, k = 0.5, percent_transmission = 0.2, method = "t_20",
  format_prop = FALSE
)[, 3]
proportion_transmission(
  R = 1, k = 0.5, percent_transmission = 0.8, method = "p_80"
)
```

The $t_{20}$ method allows for true homogeneity when $k \rightarrow \infty$ ($t_{20} = 20\%$), whereas the $p_{80}$ method does not allow for true homogeneity ($p_{80} \neq 80\%$).

```{r}
proportion_transmission(
  R = 1, k = Inf, percent_transmission = 0.8, method = "p_80"
)
proportion_transmission(
  R = 1, k = Inf, percent_transmission = 0.8, method = "t_20"
)
```