This vignette outlines the design decisions taken during the development of the {simulist} R package, and gives some of the reasoning behind each decision along with its possible pros and cons.
This document is primarily intended for those interested in understanding the code within the package and for potential package contributors.
The {simulist} package aims to simulate data on infectious disease outbreaks, primarily line list data, but also contacts data. Each of these output types has an associated exported function: sim_linelist() and sim_contacts(). There is also a function to simulate and output both of these data types, sim_outbreak(). This latter function is useful for interoperability with the {epicontacts} R package (see the visualisation vignette), and provides linked line list and contacts datasets, which are common in outbreaks, such as the MERS dataset within the {outbreaks} R package.
The simulation functions return either a <data.frame> or a list of <data.frame>s. Using the same well-known data structure across functions makes the output easy for users to understand.
When using age-stratified risks of hospitalisation and death (see the Age-stratified hospitalisation and death risks vignette for details) there is an interaction between function arguments. The <data.frame> that defines the age stratification in the hosp_risk, hosp_death_risk and non_hosp_death_risk arguments gives the lower bound of each age group. The upper bound of each age group is derived from the next lower bound, but the upper bound of the oldest age group is defined by the upper age given to the population_age argument. This interaction of arguments is not ideal, as it can be more difficult for users to understand (as outlined in The Tidy Design book). However, the interaction does not change the interpretation of each argument, which would be more convoluted. This design decision was taken when we changed the structure of the <data.frame> accepted as input to the *_risk arguments from having two columns for a lower and upper age limit, to a single column of lower age bounds. This change was made in pull request #14 and follows the design of {socialmixr} for defining age bounds. The newer structure is preferred because it prevents the user from supplying overlapping or non-contiguous age bounds (i.e. missing age groups).
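As a minimal sketch of the derivation described above (not the package's internal code), the upper bounds can be computed from a single column of lower bounds in base R. The helper name derive_age_bounds(), the column names and the example ages are assumptions made for illustration, as is the choice that each group ends one year below the next lower bound.

```r
# Hypothetical helper: derive upper age bounds from a vector of lower bounds.
# Assumes integer ages, so each group ends one year below the next lower bound.
derive_age_bounds <- function(lower, population_age) {
  stopifnot(
    is.numeric(lower),
    !is.unsorted(lower, strictly = TRUE),
    length(population_age) == 2,
    max(lower) <= population_age[2]
  )
  # every group but the oldest ends just below the next group's lower bound;
  # the oldest group ends at the upper age of the population
  upper <- c(lower[-1] - 1, population_age[2])
  data.frame(lower = lower, upper = upper)
}

age_bounds <- derive_age_bounds(
  lower = c(1, 10, 65),
  population_age = c(1, 90)
)
age_bounds
```

Because only lower bounds are supplied, the groups cannot overlap or leave gaps; the only way to change the oldest group's upper limit is through the population age range.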
The columns holding the contact relationships (edges of the network) are named from and to. These names match those in {epicontacts} <epicontacts> objects. If the columns passed to the from and to arguments of epicontacts::make_epicontacts() are not named from and to, they will be silently renamed in the resulting <epicontacts> object. Naming these columns from and to in the output of sim_contacts() and sim_outbreak() prevents any confusion when the data are used with {epicontacts}. The names are also preferred because they are usefully descriptive.
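To illustrate why fixed names are convenient, here is a small sketch using a hand-made <data.frame> rather than real sim_contacts() output. The column names infector and infectee and the helper standardise_edge_names() are hypothetical; the sketch only mimics a rename in base R and does not call {epicontacts}.

```r
# Hand-made contacts table with non-standard edge column names (illustrative).
contacts <- data.frame(
  infector = c("A", "A", "B"),
  infectee = c("B", "C", "D")
)

# Hypothetical helper mimicking a rename of the two edge columns.
standardise_edge_names <- function(x, from, to) {
  names(x)[names(x) == from] <- "from"
  names(x)[names(x) == to] <- "to"
  x
}

renamed <- standardise_edge_names(contacts, from = "infector", to = "infectee")
names(renamed)

# A table that already uses from and to comes back unchanged,
# so there is nothing for the user to keep track of.
unchanged <- standardise_edge_names(renamed, from = "from", to = "to")
identical(unchanged, renamed)
```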
Exported functions that simulate data use the naming convention sim_*() (where * is the placeholder). Internal functions that simulate data have a dot (.) prefix (e.g. .sim_internal()). Functions that create fixed data structures (i.e. data factory functions) use the naming convention create_*() or .create_*().
The config argument in the simulation functions reduces the number of arguments in the exported functions and provides as simple a user interface as possible. The choice of what gets its own function argument and what is confined to the config list is based on anticipated frequency of use, importance and technical detail. That is to say, settings that are unlikely to be changed by the user, or that require an advanced understanding of the simulation model to change, are placed within the config list and given default values by create_config().
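The pattern of a defaults-providing factory can be sketched in base R as follows. This is not the implementation of create_config(): the function make_config() and the element example_delay_mean are hypothetical, and treating "adjusted" as the default network value is an assumption made for the example.

```r
# Hypothetical config factory. Only the element name network comes from the
# text; example_delay_mean is invented for illustration.
make_config <- function(...) {
  defaults <- list(
    network = "adjusted",
    example_delay_mean = 3
  )
  overrides <- list(...)
  unknown <- setdiff(names(overrides), names(defaults))
  if (length(overrides) > 0 &&
        (is.null(names(overrides)) || length(unknown) > 0)) {
    stop("Config elements must be named and match the defaults.", call. = FALSE)
  }
  # replace defaults with any user-supplied values
  utils::modifyList(defaults, overrides)
}

default_config <- make_config()
custom_config <- make_config(network = "unadjusted")
custom_config$network
```

Most users would only ever call the factory with no arguments, which keeps rarely changed settings out of the signatures of the simulation functions.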
Input checking of the config list takes place within the call stack of the exported sim_*() functions, at the point where particular elements of the config list are used, and not in the create_config() function. Therefore, it is possible to create an invalid config list with create_config(). For example, if config$network is NULL, or is neither "adjusted" nor "unadjusted", the internal function .sim_network_bp() will throw an error.
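The consequence of checking at the point of use can be sketched as below. The function .use_network() is a hypothetical stand-in for an internal function and is not the code of .sim_network_bp(); the error message wording is also an assumption.

```r
# Hypothetical internal function: validates config$network only when used.
.use_network <- function(config) {
  valid <- c("adjusted", "unadjusted")
  network <- config$network
  if (!is.character(network) || length(network) != 1 || !network %in% valid) {
    stop(
      "config$network must be one of: ",
      paste(valid, collapse = ", "),
      call. = FALSE
    )
  }
  network
}

# Building an invalid config succeeds, because nothing is checked at creation.
bad_config <- list(network = "random")

# The error only surfaces once the element is used inside the simulation.
result <- tryCatch(
  .use_network(bad_config),
  error = function(cnd) conditionMessage(cnd)
)
result
```

The trade-off is that an invalid configuration is reported later than it could be, in exchange for keeping each check next to the code that depends on it.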
The column names of the line list data produced by sim_linelist() and sim_outbreak() match the tag names used in the {linelist} R package (an Epiverse-TRACE R package). This is for continuity of design more than any functional reason. The line list data from {simulist} functions is not tagged sensu {linelist} tagging. There is an inconsistent use of hospitalisation and admission: in the simulated line list the column is date_admission, but internally the package uses hospitalisation (e.g. .add_hospitalisation()). This is because I think hospitalisation is more descriptive, but date_admission is the name used by {linelist}.
{simulist} implements its own branching process model (.sim_network_bp()), which tracks contacts of infectious individuals. This is a simple random network model, but for future versions of {simulist} we will make the code modular in order to accept other simulation models. This will remove the burden on {simulist} to simulate from a range of model types.
The sim_linelist(), sim_contacts() and sim_outbreak() functions do not have arguments that change the dimensions of the <data.frame> they return (or, in the case of sim_outbreak(), a list of two <data.frame>s). Instead, we recommend modifying the line list or contact tracing data after the simulation, and provide a vignette to guide users through common data wrangling tasks in wrangling-linelist.Rmd. Not including arguments that remove or add columns in the output <data.frame>s reduces the complexity of the functions. By limiting the simulation function arguments to those that parameterise, and do not change the dimensionality of, the simulated data, the package is also more robust when used in pipelines or other automated approaches, where the data needs to be predictably formatted.
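As a brief sketch of post-simulation wrangling, the example below uses a hand-made <data.frame> as a stand-in for simulated output. Apart from date_admission, the column names and all values are assumptions for illustration and may not match the real output of sim_linelist().

```r
# Hand-made stand-in for a simulated line list (columns are illustrative).
linelist <- data.frame(
  id = 1:3,
  case_name = c("Case A", "Case B", "Case C"),
  date_onset = as.Date(c("2023-01-01", "2023-01-03", "2023-01-04")),
  date_admission = as.Date(c(NA, "2023-01-05", NA))
)

# Drop a column after the simulation instead of via a simulation argument.
anonymised <- linelist[, setdiff(names(linelist), "case_name"), drop = FALSE]

# Keep only the rows for cases with an admission date.
hospitalised <- anonymised[!is.na(anonymised$date_admission), , drop = FALSE]
dim(hospitalised)
```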
Several parts of the {simulist} codebase use indices for determining which individuals are infected, for allocation to vectors, and for other uses. R’s subsetting ([) can use logical vectors or numeric vectors, but in {simulist} these are differentiated by the names *_idx for variables holding a numeric vector of indices, and *_lgl_idx for a logical vector of indices. This makes it safer and more readable to call functions like sum() or which() on index vectors.
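The difference between the two kinds of index can be shown with a short base R sketch. The variable names follow the convention described above, but the data are invented for illustration.

```r
# Illustrative infection status for ten individuals.
infected <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)

# Logical index: one element per individual.
infected_lgl_idx <- infected
# Numeric index: positions of the infected individuals.
infected_idx <- which(infected_lgl_idx)

# sum() on a logical index counts individuals ...
n_infected <- sum(infected_lgl_idx)
# ... whereas sum() on a numeric index would add up positions,
# so length() is the right way to count with a numeric index.
stopifnot(n_infected == length(infected_idx))

# Both index types select the same elements.
ages <- c(34, 12, 56, 71, 8, 45, 23, 67, 29, 50)
identical(ages[infected_idx], ages[infected_lgl_idx])
```

With the suffix in the name, a reader can tell at a glance whether sum() yields a count or a meaningless total of positions.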
The aim is to restrict the number of dependencies to a minimal required set for ease of maintenance. The current hard dependencies are:
{stats} is distributed with the R language, so it is viewed as a lightweight dependency that should already be installed on a user’s machine if they have R. {checkmate} is an input checking package widely used across Epiverse-TRACE packages. {epiparameter} is used to easily access epidemiological parameters from the package’s library. The package is currently unstable and actively developed; however, using it in another package can inform the development path of {epiparameter}. {randomNames} provides a utility function for generating random names for case and contact data. The functionality could be replicated in {simulist}; however, the {randomNames} package is maintained and contains a range of name generation settings, which warrants its use as a dependency.
The soft dependencies (and their minimum version requirements) are:
{knitr} and {rmarkdown} are used for generating documentation. {spelling} and {testthat} are used for testing the code base. {ggplot2} is used for plotting within the vignettes. {incidence2} and {epicontacts} are used in vignettes to demonstrate interoperability with downstream packages, with a focus on data visualisation.
There are no special requirements for contributing to {simulist}; please follow the package contributing guide.