This document outlines the design decisions guiding the development of the {readepi} R package, the reasoning behind them, and the possible pros and cons of each decision.
Importing data from its source, whatever that is, into the R
environment is the first step in many outbreak analysis workflows.
Epidemiological data are often stored in files of different formats, the
most popular being ‘.txt’, ‘.tab’, ‘.csv’, and ‘.xlsx’.
Many R packages have been developed over the years to facilitate importing data stored in such files. We recommend the {rio} package for relatively small files and the {data.table} package for large files.
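As a minimal sketch of that recommendation, the snippet below writes a tiny CSV file to a temporary location and reads it back with both packages. The example data and column names are made up for illustration; rio::import() and data.table::fread() are the usual entry points of the two packages.

```r
# Hypothetical example data: a tiny linelist written to a temporary CSV file.
linelist <- data.frame(
  case_id = c("c01", "c02", "c03"),
  age = c(34L, 51L, 7L)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(linelist, csv_path, row.names = FALSE)

# Small files: rio::import() picks a reader based on the file extension.
small <- rio::import(csv_path)

# Large files: data.table::fread() is designed for fast reading.
large <- data.table::fread(csv_path)
```

Both calls return the same three rows; the difference only starts to matter as file size grows.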
To enable easy, real-time access to well-structured data, many organisations now store their data in public repositories, relational database management systems (RDBMS), and health information systems (HIS) wrapped with specific Application Programming Interfaces (APIs). As such, we aim to build a centralized tool that lets users import data from various HIS and RDBMS.
Our first attempt consisted of using the currently available R packages designed to access data from specific HIS. These packages are usually tied to a single HIS and can’t be used to query others: {fingertipsR}, {REDCapR}, {godataR}, and {globaldothealth} can be used to fetch data from Fingertips, REDCap, goData, and Global.Health respectively. In version 0.1.0 of the {readepi} package, when a user requests data from a specific HIS, the corresponding package is called internally.
As each package was designed to target a specific HIS, this approach increases our dependency on many other packages and makes it challenging to offer a unified framework for importing data from multiple HIS.
To address this challenge, the PIs of Epiverse-TRACE came up with a list of potential data sources. We aim to build a tool that requests and fetches the data of interest from these sources in the same way. The data sources include distributed health information systems and public databases, as shown in the figure below.
The {readepi} package is intended to import data from two common sources of institutional health-related data: health information systems (HIS) wrapped with specific Application Programming Interfaces (APIs) and relational database management systems (RDBMS) that run on specific servers.
Importing data from any of these sources requires the user to have
the right access. The user is also expected to provide the relevant
query parameters to fetch the target data. Hence, the {readepi} package
is structured around one main function (read_epidata()) and
two auxiliary functions (authenticate() and get_metadata()).
The previous version of {readepi} (0.1.0) supports
importing data from HIS APIs such as REDCap (Research Electronic Data
Capture), DHIS2 (District Health Information System 2), and Fingertips
as well as RDBMS such as MS SQL, SQLite, MySQL, and PostgreSQL.
In the next versions, the read_epidata() function will
also allow data import from HIS like goData, Global.Health, SORMAS,
and ODK. It will also include functionalities for importing data from
RDBMS such as MS ACCESS and SQLite.
The read_epidata() function returns a list
object containing one or more data frames. Each data frame
corresponds to the data from a specified source.
The get_metadata() function returns a data dictionary
containing information about the data structure. The
authenticate() function returns a connection object that is
used in the query request.
The aim of {readepi} is to simplify and standardize the process of fetching health data from APIs and servers. We aim for it to require minimal arguments to access and pull the data of interest from the target source.
authenticate(): a function used to authenticate a user
(check who the user is) and check whether the user is authorized to
access the requested database or API. This function is fundamental to
establishing the connection to the source and so to the success of the data
import. We will ensure that once an argument is provided for the authentication, it can easily be retrieved from the connection object and used in the 2 other functions, so that the user does not have to supply it again in another request.
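The idea of carrying authentication arguments inside the connection object can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: every name ending in _sketch, the fields stored in the object, and the URL are hypothetical, and no server is contacted.

```r
# Illustrative sketch only: not the actual {readepi} implementation.
# All names ending in "_sketch" are hypothetical.
authenticate_sketch <- function(type, base_url, user_name, password) {
  stopifnot(
    is.character(type), length(type) == 1L,
    is.character(base_url), length(base_url) == 1L
  )
  # A real implementation would contact the server here and fail early
  # if the credentials are rejected; this sketch only stores the arguments.
  structure(
    list(
      type = type, base_url = base_url,
      user_name = user_name, password = password
    ),
    class = "connection_sketch"
  )
}

# Downstream functions read what they need from the connection object,
# so the user does not have to supply the same arguments twice.
describe_source_sketch <- function(conn) {
  stopifnot(inherits(conn, "connection_sketch"))
  sprintf("%s source at %s (user: %s)", conn$type, conn$base_url, conn$user_name)
}

conn <- authenticate_sketch(
  type = "REDCap",
  base_url = "https://redcap.example.org/api/", # placeholder URL
  user_name = "analyst",
  password = "not-a-real-password"
)
description <- describe_source_sketch(conn)
```

Keeping the source type and address in one object means later calls only need the connection plus their own query parameters.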
The type argument refers to the name of the data source
of interest. The current version of the package will cover the
following:
i) RDBMS: “MS SQL”, “MySQL”, “PostgreSQL”, “SQLite”, “MS ACCESS”,
ii) APIs: “REDCap”, “DHIS2”, “ODK”, “Fingertips”, “goData”, “SORMAS”
get_metadata(): will be used to retrieve the data
schema (table list + data dictionary) from a database or the data
dictionary from an endpoint of the target API. It offers
interoperability with the clean_based_on_dictionay() function of the {cleanepi} R
package.
read_epidata(): this is the main function of the
package. It can be used to read from both HIS and RDBMS. An internal function (read_*, where * is the
source name) is designed for fetching data from a specific HIS. This
implementation will make it possible to update a specific
function while leaving the remaining structure untouched (easy
maintenance).
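A minimal sketch of this dispatch pattern is shown below. The stub readers stand in for the internal read_* functions and return made-up data; their names, arguments, and the type field on the connection object are assumptions made for illustration only.

```r
# Illustrative sketch only: stub readers stand in for the internal read_*
# functions; none of these names are taken from the {readepi} code base.
read_redcap_sketch <- function(conn, query) {
  data.frame(record_id = 1:2, source = "REDCap")
}
read_dhis2_sketch <- function(conn, query) {
  data.frame(event = c("e1", "e2", "e3"), source = "DHIS2")
}

read_epidata_sketch <- function(conn, query = NULL) {
  # Pick the reader that matches the source type stored in the connection.
  reader <- switch(
    conn$type,
    "REDCap" = read_redcap_sketch,
    "DHIS2" = read_dhis2_sketch,
    stop("Unsupported source type: ", conn$type)
  )
  # Return a named list of data frames, one element per source.
  stats::setNames(list(reader(conn, query)), conn$type)
}

res <- read_epidata_sketch(list(type = "DHIS2"))
```

Because each source has its own reader, supporting a new HIS in this sketch means adding one function and one switch() entry, without touching the other readers.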
Note that, when reading from RDBMS, the query argument
can be an SQL query or a list with a vector of table names,
fields and rows to subset on. For HIS, the elements of this list can
vary depending on the API that is being queried. We strongly recommend
reading the vignette on the query_parameters for more details.
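To make the two forms of the query argument concrete, here is a rough sketch of how a list could be turned into an SQL statement for an RDBMS. The element names (table, fields, rows) and the helper itself are assumptions for illustration and may not match the documented query parameters; the sketch also handles a single table only.

```r
# Illustrative sketch only: the list element names (table, fields, rows)
# are assumptions, not the documented {readepi} query parameters.
build_sql_sketch <- function(query) {
  # A character string is treated as a ready-made SQL query.
  if (is.character(query)) {
    return(query)
  }
  stopifnot(is.list(query), !is.null(query$table))
  # Select every column unless specific fields are requested.
  fields <- if (is.null(query$fields)) {
    "*"
  } else {
    paste(query$fields, collapse = ", ")
  }
  sql <- sprintf("SELECT %s FROM %s", fields, query$table)
  # Optional row filter, written as an SQL condition.
  if (!is.null(query$rows)) {
    sql <- paste(sql, "WHERE", query$rows)
  }
  sql
}

from_sql <- build_sql_sketch("SELECT * FROM patients")
from_list <- build_sql_sketch(
  list(table = "patients", fields = c("id", "age"), rows = "age > 18")
)
```

Pasting strings into SQL is only acceptable in a sketch. Code that handles user input should quote identifiers and bind values through the database driver instead.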
To guard against API security risks and issues, we will ensure that our API requests follow the practices listed below:
The read_epidata() function will rely mainly on 2
packages:
{httr2} or {data.table}: used to construct and perform the API requests,
{dplyr}: used for its data wrangling functionalities.
The read_from_server() function will rely mainly on the
packages below:
{DBI}: for its functionalities to connect to a database and fetch data from a table,
{pool}: for its ability to handle multiple connections,
{odbc}: for its drivers required to fetch data from several DBMS,
{RMySQL}: for its drivers to fetch data from MySQL databases.
It also has a system dependency for macOS and Linux users. This will be described in detail in the vignette.
Both functions will require all other packages that are needed in the package development process including:
{checkmate}, {httptest2}, {bookdown}, {rmarkdown}, {testthat} (>= 3.0.0), {knitr}
There are no special requirements for contributing to {readepi}; please follow the package contributing guide.