readepi: Reading data from health information systems

Overview

Health-related data in general, and epidemiological data in particular, are stored in files, relation database management systems (RDBMS), or health information systems (HIS). Each category includes numerous options, such as various file formats, RDBMS types, and HIS APIs. Import data from such repositories involve the usage of different format-specific functions or API-specific packages, which is an exhausting task for end users.

The main objective of {readepi} package is to simplify the process of reading health-related data from diverse sources, allowing the user to focus more on downstream analysis tasks. {readepi} also streamlines the way data is read into a single function —it imports data from a specified systems (can be SQL database, or HIS API) using the readepi() function, which returns a list object containing one or more data frames.

The current implementation of {readepi} provides a function for reading data from three common HIS: (REDCap, DHIS2, and Fingertips), and RDBMS such as MS SQL, MySQL, and PostgreSQL. Other utility functions for processing and manipulating the imported data, such as row or column sub-setting, are also included in this package.

# LOAD readepi
library(readepi)

Reading data from RDBMS

Health related research data are usually stored in either relational databases or non-SQL databases. For example, at MRCG@LSHTM, projects data are stored in relational databases. A SQL-based database is run under a specific sever. The current version of the {readepi} package supports reading data from MS SQL, MySQL, and PostgreSQL servers.

To read data from a MS SQL database, the user is expected to have, at least, read access to the database. The user credential details should be stored in file. Also, users who are using an operating system other than Microsoft need to have the appropriate MS driver installed into their system. In the following section we show how to install these drivers and describe the structure of the credentials file.

MS SQL drivers for OSX-based systems

Reading data from Unix-based systems require to install an MS ODBC driver that is compatible with the version of the target MS SQL server. For example, ODBC Driver 17 is compatible with SQL server 2019, version 15.0.

Mac users can follow the instructions below to install the MS SQL ODBC driver. Choose the appropriate driver, open the terminal, and run these instructions.

  1. installation of MS SQL driver 17 on Mac
driver=17
brew install unixodbc
brew tap microsoft/mssql-release https://github.com/Microsoft/homebrew-mssql-release
brew update
brew install msodbcsql${driver}
brew install mssql-tools
ODBCSYSINI=/
  1. installation of MS SQL driver 13.1 on Mac
brew install unixodbc
brew tap microsoft/mssql-release https://github.com/Microsoft/homebrew-mssql-release
brew update
brew install [email protected]
brew install [email protected]
ODBCSYSINI=/

MS SQL drivers for Linux-based systems

Note that this requires Ubuntu 16.04 and above. Choose the appropriate driver, open the terminal and type the instructions below:

  1. installation of MS SQL driver version 17 on Ubuntu
driver=17
sudo su
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/$(lsb_release -rs)/prod.list > /etc/apt/sources.list.d/mssql-release.list
exit

sudo apt-get update
sudo ACCEPT_EULA=Y apt-get install -y msodbcsql${driver}
sudo ACCEPT_EULA=Y apt-get install -y mssql-tools
echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bashrc
source ~/.bashrc
sudo apt-get install -y unixodbc-dev
  1. installation of MS SQL driver version 13.1 on Ubuntu
sudo su
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
exit
sudo apt-get update
sudo ACCEPT_EULA=Y apt-get install msodbcsql
sudo ACCEPT_EULA=Y apt-get install mssql-tools
echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bashrc
source ~/.bashrc
sudo apt-get install unixodbc-dev
  1. installation of MS SQL driver on other Linux distributions

To install an MS SQL driver for other Linux distributions, such as Debian and Fedora, click here and follow the instructions. After the installation, check the installed drivers using:

odbc::odbcListDrivers()

If this command does not return the list of installed drivers or if you are facing issues during the driver installation process, consult the odbc github page or the MS documentation on this topic.

It is also important to view the data stored in the MS SQL server. For that purpose, we recommend you to install a graphical user interface (GUI) such as Azure Data Studio.

Main arguments

The readepi() function expects the following arguments:

  • data_source: (required) this is the URL of the target HIS.
  • records: (optional) a vector or a comma-separated string of subject IDs. When specified, only the records that correspond to these subjects will be imported.
  • fields: (optional) a vector or a comma-separated string of column names. If provided, only those columns will be imported.
  • id_position: (optional) the column position of the variable that unique identifies the subjects. It’s default value is 1.
  • id_col_name: (optional) the name of the column that unique identifies the subjects.

Credentials file

The access to data stored in a database management system requires valid user credentials, such as a user name and password. The {readepi} package expect that the user credential details to be saved in a tab-separated file. The path to that constitute the value for the credentials_file argument of the readepi() function.

This credential (tab-delimited) file must contain the following columns:

  1. user_name: the user name. 
  2. password: the user password (in REDCap, this corresponds to token). 
  3. host_name: the host name (in REDCap, this corresponds to the host URL). 
  4. project_id: the name of the database (for relation DB) or the name of project (for REDCap API) that contains data of interest. 
  5. comment: (optional) a brief description of the project or database of interest. 
  6. dbms: the type of the DBMS. Possible values are: REDCap when reading from REDCap; MySQL when reading from a MySQL server, SQLServer when reading from an MS SQL server, PostgreSQL when reading from a PostgreSQL system. 
  7. port: (optional) the port ID, this is only used for MS SQL servers.

To see the structure of a credentials file used as a template in the {readepi} package, use the show_example_file() function as shown below.

# DISPLAY THE STRUCTURE OF THE TEMPLATE CREDENTIALS FILE
show_example_file()
#>   user_name                         password                          host_name
#> 1     admin                         district      https://play.im.dhis2.org/dev
#> 2    rfamro                                         mysql-rfam-public.ebi.ac.uk
#> 3     kmane 9A81268476645C4E5F03428B8AC3AA7B https://bbmc.ouhsc.edu/redcap/api/
#>   project_id                          comment   dbms port
#> 1 DHIS2_DEMO      this is a test DHIS2 server  DHIS2   NA
#> 2       Rfam      this is a test MySQL server  MySQL 4497
#> 3    SD_DATA testing access to REDCap project REDCap   NA

Examples

This section contains various examples demonstrating how {readepi} imports data from databases. We assume that the credentials are stored in a file as explained above. Furthermore, these examples are based on a MySQL server that does not need the user to supply the driver name.

# DEFINING THE PATH TO THE TEMPLATE CREDENTIALS FILE
credentials_file <- system.file("extdata", "test.ini", package = "readepi")

Listing names of all tables in a database

To display list of tables from a database of interest, use:

show_tables(
  data_source      = "mysql-rfam-public.ebi.ac.uk",
  credentials_file = credentials_file,
  driver_name      = ""
)
# use driver_name = "ODBC Driver 17 for SQL Server" when reading from MS SQL
# server

Fetching data from a MySQL server using the table names

When importing data from MS SQL, MySQL and PostgreSQL servers, the user needs to provide the following additional arguments:

  • driver_name: the name of the MS driver (only for MS SQL servers).
  • from: the name of the table from which the data will be fetched. This can also be the SQL query to be used for fetching data from the system.
# VISUALIZING THE FIRST 5 ROWS OF THE TABLE 'author'
visualise_table(
  data_source      = "mysql-rfam-public.ebi.ac.uk",
  credentials_file = credentials_file,
  from             = "author", # this is the table name
  driver_name      = ""
)

# READING ALL FIELDS AND ALL RECORDS FROM ONE TABLE (`author`)
dat <- readepi(
  data_source      = "mysql-rfam-public.ebi.ac.uk",
  credentials_file = credentials_file,
  from             = "author", # this is the table name
  driver_name      = ""
)

# READING SPECIFIED FIELDS AND ALL RECORDS FROM ONE TABLE
fields <- "author_id,name,last_name,initials"
dat    <- readepi(
  data_source      = "mysql-rfam-public.ebi.ac.uk",
  credentials_file = credentials_file,
  from             = "author", # this is the table name
  driver_name      = "",
  fields           = fields # these are  the columns of interest.
)

# READING SPECIFIED RECORDS AND ALL FIELDS FROM ONE TABLE
records <- "1, 34, 15, 70, 118, 20"
dat     <- readepi(
  data_source      = "mysql-rfam-public.ebi.ac.uk",
  credentials_file = credentials_file,
  from             = "author", # this is the table name
  driver_name      = "",
  records          = records,
  id_position      = 1L
)

# READING SPECIFIED FIELDS AND RECORDS ONE THE TABLE
dat <- readepi(
  data_source      = "mysql-rfam-public.ebi.ac.uk",
  credentials_file = credentials_file,
  from             = "author", # this is the table name
  driver_name      = "",
  records          = records,
  fields           = fields,
  id_col_name      = "author_id"
)

# READING DATA FROM SEVERAL TABLES
table_names <- c("author", "family_author")
dat <- readepi(
  data_source      = "mysql-rfam-public.ebi.ac.uk",
  credentials_file = credentials_file,
  from             = table_names, # this is the table name
  driver_name      = ""
)

# READING DATA FROM SEVERAL TABLES AND SUBSETTING FIELDS ACROSS TABLES
fields <- c(
  "author_id,name,last_name,initials",
  "rfam_acc,author_id"
)
# the first string in the field vector corresponds to the name of the
# columns of interest from the first table specified in the `table_names`
# argument and so on...
dat <- readepi(
  data_source      = "mysql-rfam-public.ebi.ac.uk",
  credentials_file = credentials_file,
  from             = table_names, # this is the table name
  driver_name      = "",
  fields           = fields
)

# READING DATA FROM SEVERAL TABLES AND SUBSETTING RECORDS ACROSS TABLES
records <- c(
  "1, 34, 15, 70, 118, 20",
  "RF00591,RF01420,RF01421"
)
# "note that first string in the records vector corresponds to the records of
# interest from the first table specified in the `table_name` argument and so
# on... when the id column is not the first column in a table,
# use the `id_position`"
dat <- readepi(
  data_source      = "mysql-rfam-public.ebi.ac.uk",
  credentials_file = credentials_file,
  from             = table_names, # this is the table name
  driver_name      = "",
  records          = records,
  id_position      = c(1, 1)
)

# READING DATA FROM SEVERAL TABLES AND SUBSETTING RECORDS AND FIELDS ACROSS
# TABLES
dat <- readepi(
  data_source      = "mysql-rfam-public.ebi.ac.uk",
  credentials_file = credentials_file,
  from             = table_names, # this is the table name
  driver_name      = "",
  records          = records,
  fields           = fields,
  id_col_name      = c("author_id", "rfam_acc")
)

Fetching data using an SQL query

# SELECT FEW COLUMNS FROM ONE TABLE AND LEFT JOIN WITH ANOTHER TABLE
dat <- readepi(
  data_source      = "mysql-rfam-public.ebi.ac.uk",
  credentials_file = credentials_file,
  from             = "select author.author_id, author.name,
  family_author.author_id from author left join family_author on
  author.author_id = family_author.author_id",
  driver_name      = ""
)

# SELECT ALL DATA FROM THE author TABLE
dat <- readepi(
  data_source      = "mysql-rfam-public.ebi.ac.uk",
  credentials_file = credentials_file,
  from             = "select * from author",
  driver_name      = ""
)

# SELECT FEW COLUMNS FROM THE author TABLE
dat <- readepi(
  data_source      = "mysql-rfam-public.ebi.ac.uk",
  credentials_file = credentials_file,
  from             = "select author_id, name, last_name from author",
  driver_name      = ""
)

# SELECT FEW RECORDS FROM THE author TABLE
dat <- readepi(
  data_source = "mysql-rfam-public.ebi.ac.uk",
  credentials_file = credentials_file,
  source = "select * from author where author_id in ('1','20','50')",
  driver_name = ""
)

# SELECT FEW RECORDS AND FIELDS FROM THE author TABLE
dat <- readepi(
  data_source      = "mysql-rfam-public.ebi.ac.uk",
  credentials_file = credentials_file,
  from             = "select author_id, name, last_name from author where
  author_id in ('1','20','50')",
  driver_name      = ""
)

Reading data from HIS

The current version of {readepi} supports reading data from three common HIS: REDCap, DHIS2, and Fingertips.

Importing data from REDCap

Research Electronic Data Capture (REDCap) is a web-based application and workflow methodology for designing clinical and translational research databases. To import data from a REDCap-based repo, the readepi() function takes the following arguments:

  • data_source: the URL of the REDCap server (mandatory)
  • credentials_file: the credentials file (mandatory)
  • records: the list of the desired records (optional)
  • fields: the list of the desired columns (optional)

Both the data and its associated metadata will be returned after a successful import.

# READING ALL FIELDS AND RECORDS FROM A REDCap PROJECT
dat <- readepi(
  data_source      = "https://bbmc.ouhsc.edu/redcap/api/",
  credentials_file = credentials_file
)
project_data     <- dat[["data"]]
project_metadeta <- dat[["metadata"]]

# READING SPECIFIC FIELDS AND ALL RECORDS FROM THE PROJECT
fields <- c("record_id", "name_first", "age", "bmi")
dat    <- readepi(
  data_source      = "https://bbmc.ouhsc.edu/redcap/api/",
  credentials_file = credentials_file,
  fields           = fields
)

# READING SPECIFIC RECORDS AND ALL FIELDS FROM THE PROJECT
records <- c("1", "3", "5")
dat     <- readepi(
  data_source      = "https://bbmc.ouhsc.edu/redcap/api/",
  credentials_file = credentials_file,
  records          = records,
  id_col_name      = "record_id"
)

# READING SPECIFIC RECORDS AND FIELDS FROM THE PROJECT
dat <- readepi(
  data_source      = "https://bbmc.ouhsc.edu/redcap/api/",
  credentials_file = credentials_file,
  records          = records,
  fields           = fields,
  id_col_name      = "record_id"
)
project_data     <- dat[["data"]]
project_metadeta <- dat[["metadata"]]

Importing data from DHIS2

DHIS2 is an open source software that has transformed global health information management. The readepi() function can import data from DHIS2-based repositories with following arguments:

  • data_source: the URL of the target DHIS2 system (required)
  • credentials_file: the credentials file (required)
  • dataset: a vector or a comma-separated list of dataset identifier (optional)
  • organisation_unit: a vector or a comma-separated list of organisation unit identifier (optional)
  • data_element_group: a vector or a comma-separated list of data element group (optional)
  • start_date: the start date for the time span of the values to export (required)
  • end_date: the end date for the time span of the values to export (required)
  • id_col_name: the column name with the records of interest (optional)
  • records: the list of the desired records (optional)
  • fields: the list of the desired columns (optional)
# GETTING THE DATA ELEMENT IDENTIFIERS AND NAMES
data_elements <- dhis2_get_attributes(
  base_url  = "https://play.dhis2.org/dev/",
  user_name = "admin",
  password  = "district",
  which     = "dataElements"
)

# GETTING THE DATASET IDENTIFIERS AND NAMES
datasets <- dhis2_get_attributes(
  base_url  = "https://play.dhis2.org/dev/",
  user_name = "admin",
  password  = "district",
  which     = "dataSets"
)

# GETTING THE ORGANISATION UNIT IDENTIFIERS AND NAMES
organisation_units <- dhis2_get_attributes(
  base_url  = "https://play.dhis2.org/dev/",
  user_name = "admin",
  password  = "district",
  which     = "organisationUnits"
)

# GETTING THE DATA ELEMENT GROUP IDENTIFIERS AND NAMES
data_element_groups <- dhis2_get_attributes(
  base_url  = "https://play.dhis2.org/dev/",
  user_name = "admin",
  password  = "district",
  which     = "dataElementGroups"
)

# READING THE DATASET ID `BfMAe6Itzgt`
dat <- readepi(
  data_source        = "https://play.dhis2.org/dev",
  credentials_file   = credentials_file,
  dataset            = "BfMAe6Itzgt",
  organisation_unit  = "Umh4HKqqFp6",
  data_element_group = NULL,
  start_date         = "2014",
  end_date           = "2023"
)

# READING DATA FROM 2 DATASETS `BfMAe6Itzgt`
dat <- readepi(
  data_source        = "https://play.dhis2.org/dev",
  credentials_file   = credentials_file,
  dataset            = "BfMAe6Itzgt,BfMAe6Itzgt",
  organisation_unit  = "Umh4HKqqFp6",
  data_element_group = NULL,
  start_date         = "2014",
  end_date           = "2023"
)

# READING SPECIFIC DATA ELEMENTS FROM THE DATASET ID `BfMAe6Itzgt`
data_elts <- c("FTRrcoaog83", "eY5ehpbEsB7", "Ix2HsbDMLea")
dat <- readepi(
  data_source        = "https://play.dhis2.org/dev",
  credentials_file   = credentials_file,
  dataset            = "BfMAe6Itzgt",
  organisation_unit  = "Umh4HKqqFp6",
  data_element_group = NULL,
  start_date         = "2014",
  end_date           = "2023",
  records            = data_elts,
  id_col_name        = "dataElement"
)

# READING SPECIFIC COLUMNS FROM A DATASET
dat <- readepi(
  data_source        = "https://play.dhis2.org/dev",
  credentials_file   = credentials_file,
  dataset            = "BfMAe6Itzgt,BfMAe6Itzgt",
  organisation_unit  = "Umh4HKqqFp6",
  data_element_group = NULL,
  start_date         = "2014",
  end_date           = "2023",
  fields             = c("dataElement", "period", "value")
)
test_data <- data[["data"]]

Importing data from Fingertips

Fingertips is a repository for public health data indicators in England, in which data is organised into themed profiles. The readepi() function allows you to import data from Fingertips-based APIs using the following arguments:

  • indicator_id: the indicator ID
  • indicator_name: the indicator name
  • area_type_id: the area type ID. This determines the geographic area for the imported data (required)
  • parent_area_type_id: the parent area type code of the specified area type ID
  • profile_id: the profile ID
  • profile_name: the profile name
  • domain_id: the domain ID
  • domain_name: the domain name
  • records: the list of the desired records
  • fields: the list of the desired columns
  • id_col_name: the column name with the records of interest
  • id_position: the column position of the variable that unique identifies the subjects. default is 1.

It is worth noting that, while reading such data, the readepi() function makes wrapper-calls to the major functions in fingertipsR, a pre-existing R package that only reads data from Fingertips databases.

Examples

This section provides various examples demonstrating how {readepi} imports data from Fingertips-based APIs.

# GET THE INFORMATION ABOUT THE INDICATOR PROFILES, DOMAIN, AREA TYPE, ...
metadata <- fingertips_get_metadata()
head(metadata[["indicator_profile_domain"]])
head(metadata[["indicator_ids_names"]])
head(metadata[["area_type"]])

# IMPORTING DATA USING THE INDICATOR ID
dat <- readepi(
  indicator_id = 90362L,
  area_type_id = 202L
)

# IMPORTING DATA USING THE INDICATOR NAME
dat <- readepi(
  indicator_name = "Healthy life expectancy at birth",
  area_type_id   = 202L
)

# IMPORTING DATA USING THE DOMAIN NAME
dat <- readepi(
  domain_name  = "A. Overarching indicators",
  area_type_id = 202L
)

dat <- readepi(
  indicator_name = "Healthy life expectancy at birth",
  area_type_id   = 202L,
  domain_name    = "A. Overarching indicators"
)

# IMPORTING DATA USING THE PROFILE ID
dat <- readepi(
  profile_id   = 19L,
  area_type_id = 202L
)

# IMPORTING DATA FROM SPECIFIC INDICATOR, DOMAIN, PROFILE, AREA TYPE
dat <- readepi(
  indicator_id        = 90362L,
  indicator_name      = "Healthy life expectancy at birth",
  area_type_id        = 202L,
  parent_area_type_id = 6L,
  profile_id          = 19L,
  profile_name        = "Public Health Outcomes Framework",
  domain_id           = 1000049L,
  domain_name         = "A. Overarching indicators",
  fields              = NULL,
  records             = NULL,
  id_position         = NULL,
  id_col_name         = NULL
)

# IMPORTING DATA AND SUBSETTING SPECIFIC RECORDS AND FIELDS
dat <- readepi(
  indicator_id = 90362L,
  area_type_id = 202L,
  fields       = c("IndicatorID", "AreaCode", "Age", "Value"),
  records      = c("E92000001", "E12000002", "E12000009"),
  id_col_name  = "AreaCode"
)