epichains provides methods to analyse and simulate the size and length of branching processes with an arbitrary offspring distribution. In this vignette we lay out the mathematical concepts behind the functionality available in the package.
Branching processes are a class of models that are used to model the growth of populations. They assume that each member of the population produces a number of offspring, Z, that is a random variable with probability mass function p(Z = z|θ), called the offspring distribution. Their use has a long history in epidemiology, where the population is interpreted as a pathogen, and the offspring as new hosts that it infects (Farrington, Kanaan, and Gay 2003). Below we will call these infected individuals cases but the methods could be applied in other contexts where branching processes are to be used.
To simulate from a branching process, we start with a single case and proceed in discrete steps or generations, drawing from the offspring distribution p(Z = z|θ) to generate new cases from each case.
Given an infector i and infectee j, we can additionally assign them a distribution of times T that approximates when the infection event occurred. If we define T as a random variable with distribution f(T = t; θ) we can assign each case j a time tj which, if case j has been affected by case i is given by f(tj − ti|θ). If we identify the timing of cases by the time of their symptom onset this is the serial interval, but depending on case definitions this could be another interval.
Branching process simulations end when they have gone extinct, that is, no more offspring are being produced, or because of some stopping criterion. To summarise the simulations, we either study the size or length of the resulting chain of cases. The size S of a chain is the number of cases that have occurred over the course of the simulation including the initial case so that S ≥ 1. The length L of a chain is the number of generations that have been simulated including the initial case so that L ≥ 1.
By characterising a chains of cases by their size of length we can conduct inference to learn about the underlying parameters θ from observation of chain sizes or chain lengths (Blumberg and Lloyd-Smith 2013). In general this is only possible for subcritical branching processes, i.e. ones where the mean number of offspring is less than 1, as otherwise the branching process could grow forever. However, we can expand the theory to supercritical branching processes, i.e. ones where the mean number of offspring is greater than 1, by defining a cut off of chain size or length beyond which we treat the chain as if it had infinite size or length, respectively.
We show the equations for the size and length distributions for some offspring distributions where they can be derived analytically: Poisson, negative binomial, geometric, and a gamma-Borel mixture.
If the offspring distribution is a Poisson distribution, we can interpret its rate parameter λ as the basic reproduction number R0 of the pathogen. In the more general case where the offspring definition can be overdispersed leading to superspreading we can use a negative binomial offspring distribution with mean μ and overdispersion k Blumberg and Lloyd-Smith (2013). In that case, the mean parameter μ is interpreted as the basic reproduction number R0 of the pathogen. The negative binomial distribution arises from a Poisson-gamma mixture and thus a branching process with negative binomial distributed offspring can be interpreted as one with Poisson distributed offspring where the basic reproduction number R0 itself varies according to a gamma distribution. The amount of variation in R0 is then interpreted as individual-level variation in transmission representing overdispersion or superspreading, and the degree to which this happens is given by the overdispersion parameter k.
The probability p of a chain of size S given R0 and k in a branching process with negative binomial offspring distribution is given in Eq. 9 of Blumberg and Lloyd-Smith (2013)
$$ p(S|R_0, k) = \frac{\Gamma(kS + S - 1)}{\Gamma(kS)\Gamma(S + 1)} \frac{\left(\frac{R_0}{k}\right)^{S - 1}}{\left( 1 + \frac{R_0}{k} \right)^{kS + S - 1}} $$
where Γ is the gamma function. In order to estimate S from a given R0 and k we can define a likelihood function L(S) = p(S|R0, k). The corresponding log-likelihood is
The log-likelihood for Poisson distributed offspring follows from this where k tends to infinity (corresponding to Eq. 2.2 in Farrington, Kanaan, and Gay (2003))
LL(S) = (S − 1)log R0 − SR0 + (S − 2)log S − log Γ(S)
In all cases the point estimate for the basic reproduction number $\hat{R_0}$ is related to the mean chain size S̄ by
$$ \hat{R_0} = 1 - \frac{1}{\bar{S}} $$
The cumulative mass function F(L) of observing a chain of length L when offspring is Poisson distributed is given by Eq. (2.5) in Farrington, Kanaan, and Gay (2003) (there called “outbreak duration”):
F(L) = e−R0EL(eR0e−R0)
where EL(x) is the iterated exponential function, E0(x) = 1, EL + 1(x) = xEL(x).
For geometric distributed offspring (corresponding to a negative Binomial with k = 1) this function is given by
$$ F(L) = \frac{ 1- R_0^{L + 1} } {1 - R_0^{L - 2}} $$
In both cases f(L) denotes cumulative mass functions and therefore the probability of observing a chain of length L is therefore f(L) − f(L − 1).
The probability distribution of outbreak sizes from a branching process with a Poisson offspring distribution (Eq. 2.2 in Farrington, Kanaan, and Gay (2003)) is a special case of the Borel-Tanner distribution starting with 1 individual. An alternative to the negative binomial offspring distribution which represents a Poisson-gamma mixture is a Borel-gamma mixture. This could represent situations where the variation is not at the individual level but at the chain level, i.e. transmission chains is homogeneous but there is heterogeneity between chains. In that case, it can be shown that the resulting log-likelihood of chain sizes is
When analytic likelihoods are not available a numerical approximation is used to derive the distributions. In order to do this, the simulation functionality is be used to generate n simulated chains and the value of the cumulative mass function P(S|θ) at the observed S approximated by the empirical cumulative distribution function: P(S|θ) ≈ ∑i1(xi < = S) where 1 is the indicator function and xi the i-th observed chain size (or length, if the interest is in L). In order to improve this approximation a linear approximation is applied to the values of the empirical distribution function (at the expense of normalisation to 1).
The (unnormalised) probability of observing S is then given by p(S|θ) = P(S|θ) − P(S − 1|θ) and a an equivalent relationship is used for L.