Extracting and Cleaning DHIS2 Data with R
Accessing Health Information Systems Using the khisr Package
Project Summary
Goal: Extract family planning service delivery data from FPAM's DHIS2 instance using R, clean and reshape the raw output, create derived indicators (service type, client category, CYP), and produce an analysis-ready dataset.
Context: DHIS2 is the most widely used health management information system in low- and middle-income countries. Manual data exports from the DHIS2 web interface are slow, error-prone, and not reproducible. This project demonstrates a fully programmatic pipeline from API connection to a clean CSV output.
Pipeline: API authentication, metadata exploration (org units, data elements, indicators), bulk analytics extraction, string parsing of composite category fields, indicator classification, and CYP (Couple Years of Protection) calculation.
Tools: R, khisr, tidyverse, janitor, naniar, here
Step 1: Load Packages
library(khisr)
library(tidyverse)
library(janitor)
library(naniar)
library(here)
Step 2: Connect to DHIS2
The khisr package provides R bindings to the DHIS2 Web API. Authentication uses a base URL, username, and password. Credentials are read from environment variables and are never hardcoded in shared scripts like this one.
# Credentials are stored in .Renviron for security:
# FPAM_DHIS2_BASE_URL, FPAM_DHIS2_USERNAME, FPAM_DHIS2_PASSWORD
base_url <- Sys.getenv("FPAM_DHIS2_BASE_URL")
username <- Sys.getenv("FPAM_DHIS2_USERNAME")
password <- Sys.getenv("FPAM_DHIS2_PASSWORD")
dhis2_connection <- khis_cred(
  username = username,
  password = password,
  server = base_url
)
In practice, I set these environment variables in my .Renviron file so credentials stay out of version control.
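Sys.getenv() returns an empty string when a variable is not set, so a typo in .Renviron only surfaces later as a failed login. A minimal sketch of an early check follows; the helper name require_env is hypothetical and not part of khisr.

```r
# Hypothetical helper (not part of khisr): stop early if any required
# environment variable is missing or empty.
require_env <- function(vars) {
  values <- Sys.getenv(vars, unset = "")
  missing <- vars[!nzchar(values)]
  if (length(missing) > 0) {
    stop("Missing environment variables: ",
         paste(missing, collapse = ", "),
         call. = FALSE)
  }
  invisible(values)
}

# Intended use, before calling khis_cred():
# require_env(c("FPAM_DHIS2_BASE_URL",
#               "FPAM_DHIS2_USERNAME",
#               "FPAM_DHIS2_PASSWORD"))
```

After editing .Renviron, R has to be restarted (or readRenviron("~/.Renviron") called) for the new values to be visible.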
Step 3: Explore DHIS2 Metadata
Before extracting data, I explore the DHIS2 metadata to identify the correct IDs for organisation units, data elements, indicators, and category combinations. This step is necessary because DHIS2 stores everything by UID, not by name.
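Because every later step depends on hand-copied UIDs, it can help to check them before sending a request. DHIS2 UIDs are 11 characters long: a letter followed by ten letters or digits. The sketch below uses two real IDs from this project plus a deliberately truncated and a duplicated entry; the helper name is_dhis2_uid is hypothetical.

```r
# Hypothetical helper: TRUE when a string has the DHIS2 UID shape
# (one letter followed by ten letters or digits).
is_dhis2_uid <- function(x) {
  grepl("^[A-Za-z][A-Za-z0-9]{10}$", x)
}

# Two valid IDs, one truncated ID, and one repeated ID for illustration
ids <- c("oFHX9KdPmvq", "pWgcJqTDaR6", "oFHX9KdPmv", "oFHX9KdPmvq")

malformed  <- ids[!is_dhis2_uid(ids)]  # entries with the wrong shape
duplicates <- ids[duplicated(ids)]     # entries listed more than once
```

A well-formed UID can still point at nothing on the server, so this only catches copy-and-paste damage, not wrong IDs.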
Organisation Units
Organisation units represent the geographic and administrative hierarchy: national, regional, district, and service delivery point (SDP) levels.
org_units <- get_organisation_units()
# Preview the org unit hierarchy
org_units |>
  head(20)
I filtered the full list down to 59 SDPs by selecting only the organisation unit IDs relevant to our projects:
org_id <- c("oFHX9KdPmvq", "pWgcJqTDaR6", "ZBJZ84GRg8T", "FioTrY26WbQ", "qADMYLLGeof", "KtvkHQyYVrM", "FFB1iq38BbR", "k0lDJ1PtPic", "JqBOUCD9fy6", "uYGZAP1vriT", "Oay6QzUgYuL", "Ny1hpCtux3y", "PU2zDIptR7U", "EB7epqcMzF4", "D5qiBUI3k8G", "ike32nYVxfD", "VQY6DZDlC29", "wNY0hZm1WOI", "CM2IVSh3Atq", "iwmLCeYmDms",
  "BG1pb5DtYhg", "AzeIukoY6lH", "Ro9iFw5Kyd8", "jXpsTENuryd", "kzTLZPcEuIJ", "KBa6xDhcUnO", "AOeFNxC7zRn", "ycdjNkkY8LK", "tWrBEq7DpeQ", "xqKcIvkUpDr", "mNOLTn5nj71", "i2smYapUm85", "Uu3xAwwG6Q6", "ouj5uTN3ecR", "ISyZ4sSvcAN", "n3ROWDE4oxm", "Y0eMusceYzl", "GAmB6ajj1E0", "StMM0959QJU", "iaV1nG9GEPS", "mXwvMIb1z40", "lWpazEuoOil", "YcUVBPS3DYO", "LgAGspUhatq", "iAFjbgJukkc", "rLBCWv3QtOM", "E4IIABpkuT1", "pd6s2OsPnrs",
  "x0Bo9zDgE85", "U5FGVOxMHPG", "z64iYPQFfgz", "qF5j8wHj8ax", "nxSuQduPyAI", "Pk43HlRvlxK", "QiiJs70qbiR", "BJ7NARvScbH", "xeZFQsKcoQL", "Khd1okEVaHz", "b53t9dVC7g1")
Dimensions and Category Combinations
DHIS2 disaggregates data by category combinations (age group, client type, gender). Understanding these is necessary for parsing extracted data.
dimensions <- get_dimensions()
dimensions
cat_combos <- get_category_option_combos()
cat_combos |>
  head(20)
Indicators and Data Elements
Indicators are calculated values (e.g., percentages). Data elements are raw reported values (e.g., number of clients served).
indicators <- get_indicators()
indicators |>
  head(20)
data_element_group <- get_data_element_groups()
data_element_group |>
  head(20)
# 61 data element IDs covering SRV (service) and Items (commodities) data
d_elements_ids <- c("pFMjZXlfAf7", "Krdr1AlUQ8N", "aOz9WW0pPO4", "C5P131CTInb", "N057bnpkZkx", "vMPhO6JOlnk", "fG5OLScYkeK", "aPUhz2LnX5X", "icy8aA0g0kl", "ksuWY4XNe2T", "Wt9yNZBo8nP", "ve4iq3vU00H", "vTjYQWr90ry", "wD2kg2pbHpx", "gVIx8BrlUne", "L5yXk6pkft7", "QJsqAcTX3GQ", "Kh7aznnxzgp", "CBbJFfDj55n", "J1xTOgyF6uy", "mSIGrBYiRGe", "dINzbge9RMX", "N1sig4akeCt", "tDhThTdnFXZ", "Stj1UElUNLq", "zezfiM6WS8c", "j9OOIxKU1Il", "P6RUHBiShIk", "wqkK7PhnyQ9", "yuoB7pCFfsG", "lCyEE5JH8Dz", "gXki5ZovDDB", "erac6GM8IoY", "WPGYUvPGant", "UGF1QZsTdlW", "xzdCm0QxCO5", "We91kItyQ7B", "a65L6xsaSGc", "XZgcxmmwSdr", "lAJqjUjXqoU", "rYj5ng5yFZF", "wMn75u7Ng7t", "VnOGlJ5KaJS", "zaFd5ecHFT9", "fpNvBVNaS4m", "O5hC9FtmEbe", "KncZyhzMrN0", "oUNffjs9rOa", "feSIdEJFMHR", "tLqkd0OcZ4k", "pcDpDC9obr8", "DsTTM3lqCW4", "txTi3OALHRL", "EMmmVwlCTRp", "EAe58lagJ0W", "r5UEluRz1TA", "tyFGIFRt2S2", "kULUnkOQtaF", "cu0ukG91Zc0", "GT09A1kzoAg", "rOHUZld4vO6")
Step 4: Extract Data from the Analytics API
The get_analytics_by_level() function pulls aggregated data across all specified org units, data elements, and time periods in a single API call. Level 5 corresponds to individual service delivery points.
fpam_dhis2 <- get_analytics_by_level(
  element_ids = d_elements_ids,
  start_date = "2023-01-01",
  end_date = NULL,
  level = 5,
  org_ids = org_id
)
# Check the disaggregation categories returned
unique(fpam_dhis2$category)
The raw output contains one row per data element, per org unit, per period, per category combination. The category column is a comma-separated string combining age group, client type (New/Revisit), and gender.
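The one-row-per-combination claim can be verified rather than assumed. The sketch below counts rows per key on a small made-up table; the column names element, sdp, period, and category are the ones selected in Step 5, while the values (including the period format) are invented for illustration.

```r
library(dplyr)

# Made-up rows; the third row repeats the key of the second
toy <- tibble(
  element  = c("A", "A", "A"),
  sdp      = c("Dowa Static", "Dowa Static", "Dowa Static"),
  period   = c("2023-01", "2023-01", "2023-01"),
  category = c("15-19, NU, Female", "15-19, RVT, Female", "15-19, RVT, Female"),
  value    = c(4, 7, 7)
)

# Keys that occur more than once violate the expected grain
dupes <- toy |>
  count(element, sdp, period, category, name = "n_rows") |>
  filter(n_rows > 1)

dupes
```

An empty result on the real extract means each key is unique; any rows returned are worth inspecting before aggregating.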
Step 5: Clean and Reshape
5.1 Parse the Category Column
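After extraction, the category field arrives as a single string (e.g., “15-19, RVT, Female”). I split this into three separate columns using separate() and then standardize the values with case_when(). Some strings omit the client type, so gender can land in the second position; the gender rule therefore checks both the second and third pieces.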
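To see how this parsing behaves before running it on the full extract, here is a minimal sketch on three made-up category strings, one of which has only two parts. The age group "25+" is an invented value, and the case_when() rules are a condensed version of the ones used in this step.

```r
library(dplyr)
library(tidyr)

toy <- tibble(category = c("15-19, RVT, Female",
                           "20-24, NU, Male",
                           "25+, Female"))      # no client type

parsed <- toy |>
  separate(category,
           into = c("age_raw", "client_raw", "gender_raw"),
           sep = ", ",
           fill = "right",     # pad missing pieces on the right with NA
           extra = "drop") |>  # ignore any unexpected fourth piece
  mutate(
    # Unmatched values fall through to NA
    client_category = case_when(client_raw == "NU"  ~ "New",
                                client_raw == "RVT" ~ "Revisit"),
    # Gender may sit in the second or the third position
    gender = case_when(client_raw %in% c("Male", "Female") ~ client_raw,
                       gender_raw %in% c("Male", "Female") ~ gender_raw)
  )

parsed
```

For the two-part string, client_category comes out as NA and gender is still recovered from the second position.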
fpam_clean <- fpam_dhis2 |>
  select(year, sdp, ta_town, district, region, national,
         element, category, period, month, value) |>
  separate(category,
           into = c("age_raw", "client_raw", "gender_raw"),
           sep = ", ",
           fill = "right",
           extra = "drop") |>
  mutate(age_category = age_raw,
         client_category = case_when(client_raw %in% c("RVT", "NU") ~ client_raw,
                                     TRUE ~ NA_character_),
         gender = case_when(client_raw %in% c("Male", "Female") ~ client_raw,
                            gender_raw %in% c("Male", "Female") ~ gender_raw,
                            TRUE ~ NA_character_)) |>
  mutate(client_category = case_when(
    client_category == "NU" ~ "New",
    client_category == "RVT" ~ "Revisit",
    .default = NA))
5.2 Classify Contraceptive Methods
Each data element name contains the method type. I use str_detect() to group these into broad method categories.
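Two properties of this approach are worth keeping in mind: str_detect() is case-sensitive, and case_when() returns NA for any name that no pattern matches. The sketch below applies a subset of the patterns to four element names taken from this document and lists whatever was left unclassified.

```r
library(dplyr)
library(stringr)

toy <- tibble(element = c(
  "SRV - FP - Implant - Consultation (3 years)",
  "SRV - FP - Male condom - Consultation",
  "SRV - FP - EC - Consultation - Pills",
  "Items - FP - Vasectomy"
))

classified <- toy |>
  mutate(methods_group = case_when(
    str_detect(element, "Implant") ~ "Implants",
    str_detect(element, "condom")  ~ "Condoms",  # lowercase, as in the names
    str_detect(element, "Pills")   ~ "Pills"
  ))

# Element names that none of the patterns matched
unclassified <- classified |>
  filter(is.na(methods_group)) |>
  distinct(element)

unclassified
```

Running the same filter on the full dataset shows which elements still need a pattern.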
fpam_clean <- fpam_clean |>
  mutate(methods_group = case_when(
    str_detect(element, "Implant") ~ "Implants",
    str_detect(element, "Injectable") ~ "Injectables",
    str_detect(element, "Oral Contraceptives") ~ "Pills",
    str_detect(element, "Pills") ~ "Pills",
    str_detect(element, "condom") ~ "Condoms",
    str_detect(element, "IUCD") ~ "IUD",
    str_detect(element, "MVSC") ~ "Surgical Contraception",
    str_detect(element, "FVSC") ~ "Surgical Contraception"))
5.3 Create Service Indicators
I classify each data element into a high-level service indicator (Family Planning, Abortion, HIV, STI, etc.) for reporting.
FP <- c(
"SRV - FP - Injectable - Consultation ( 3 month)-DEPO",
"SRV - FP - Implant - Consultation (3 years)",
"SRV - FP - Injectable - Consultation -3 month ( Sayana press)",
"SRV - FP - Male condom - Consultation",
"SRV - FP - Implant - Consultation (4 years)",
"SRV - FP - Implant - Consultation - Removal - 5 Yrs",
"SRV - FP - Implant - Consultation (5 years)",
"SRV - FP - Oral Contraceptives - Consultation-COC",
"SRV - FP - Implant - Consultation - Removal -3 Yrs",
"SRV - FP - Implant - Consultation - Removal - 4 Yrs",
"SRV - FP - Injectable - Consultation ( 2 month)",
"SRV - FP - IUCD - Consultation (10 years)",
"SRV - FP - EC - Consultation - Pills",
"SRV - FP - Oral Contraceptives - Consultation-POP",
"SRV - FP - IUCD - Consultation - Removal - 10 Yrs",
"SRV - FP - IUCD - Consultation (5 years)",
"SRV - FP - EC - Consultation - IUCD",
"SRV - FP - FVSC - Consultation",
"SRV - FP - Female condom - Consultation",
"SRV - FP - IUCD - Consultation - Removal - 5 Yrs",
"SRV - FP - Injectable - Consultation (1 month)",
"SRV - FP - MVSC - Consultation")
fpam_clean <- fpam_clean |>
  mutate(indicator = case_when(
    element %in% FP ~ "Family Planning",
    str_detect(element, "Abortion") ~ "Abortion",
    str_detect(element, "Subfertility") ~ "Infertility",
    str_detect(element, "HIV and AIDS") ~ "HIV",
    str_detect(element, "STI/RTI") ~ "STI",
    str_detect(element, "Gynaecology") ~ "Cervical Cancer Screening",
    str_detect(element, "Other") ~ "GBV Screening",
    str_detect(element, "Obstetrics") ~ "Pregnancy Test"))
5.4 Classify Service Delivery Points
SDPs are grouped by type: static clinics, associated facilities (public health facilities supported by FPAM), outreach teams, and community-based reproductive health providers (CRHP).
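Because group membership is matched on exact names, an SDP whose name is spelled slightly differently in DHIS2 ends up with no group. One way to catch this, sketched here as an alternative to the name vectors, is a small lookup table joined onto the data. The table name sdp_lookup is hypothetical, it contains only four of the SDP names, and the third toy row has a deliberate double space.

```r
library(dplyr)

# Hypothetical lookup table with a few of the SDP names
sdp_lookup <- tibble(
  sdp       = c("Dowa Static", "Neno District Hospital",
                "Salima Outreach", "Balaka CBD"),
  sdp_group = c("Static Clinic", "Associated Facility",
                "Outreach", "CRHP")
)

toy <- tibble(sdp = c("Dowa Static", "Balaka CBD", "Dowa  Static"))

joined <- toy |>
  left_join(sdp_lookup, by = "sdp")

# SDP names that did not match any lookup entry
unmatched <- joined |>
  filter(is.na(sdp_group)) |>
  distinct(sdp)

unmatched
```

The same is.na() filter works on the sdp_group column produced by case_when().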
static_clinics <- c("Kasungu FPAM Static", "Dowa Static", "Ntcheu Static", "Mangochi Static", "Kawale Static", "Mzuzu Static", "Dedza Static")
associated <- c("Nandumbo Static", "Mwima Static", "Namanolo Static", "Kalembo Static", "Neno District Hospital", "Ntcheu District Hospital")
outreach <- c("Salima Outreach", "Kawale Outreach Team B", "Chitipa Outreach Team A", "Ntcheu Outreach Team A", "Mzuzu Outreach Team B", "Ntcheu Outreach Team B","Dedza Outreach Team C", "Dedza Outreach Team A", "Dedza Outreach Team B","Kasungu Outreach Team D", "Mchinji Outreach Team A", "Mangochi Outreach Team A",
"Kasungu Outreach Team C", "Kasungu Outreach Team B", "Kasungu Outreach", "Mchinji Outreach Team B", "Mangochi Outreach Team B", "Kawale Outreach Team C", "Kawale Outreach Team E", "Dowa Outreach", "Balaka Outreach", "Karonga Outreach Team A", "Dowa Outreach Team C", "Dowa Outreach Team B", "Salima Outreach Team C", "Mzuzu Outreach", "Salima Outreach Team B", "Kawale Outreach Team A", "Kawale Outreach Team D", "Karonga Outreach Team B")
CRHP <- c("Ntcheu CBD - CRHP 2", "Ntcheu CBD - CRHP 3", "Ntcheu CBD - CRHP 1","Salima CBDA - CRHP", "Neno CBD - CRHP", "Mchinji CBD-CRHP", "Mangochi CBD - CRHP", "Balaka CBD", "Dedza CBD - CRHP", "Mzuzu CBD - CRHP", "Dowa CBD-CRHP", "Salima CBDA - CRHP 2")
fpam_clean <- fpam_clean |>
  mutate(sdp_group = case_when(
    sdp %in% static_clinics ~ "Static Clinic",
    sdp %in% associated ~ "Associated Facility",
    sdp %in% outreach ~ "Outreach",
    sdp %in% CRHP ~ "CRHP"))
5.5 Calculate Couple Years of Protection (CYP)
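CYP is a standard family planning metric used by IPPF and other SRH organisations. It estimates the protection from pregnancy that distributed contraceptives provide, expressed in couple-years. Each contraceptive method has a fixed conversion factor per unit distributed or procedure performed. In this analysis each commodity data element is mapped to its per-unit CYP factor.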
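The CYP column holds the per-unit factor, so the CYP contributed by a row is the reported quantity multiplied by that factor. A minimal worked sketch follows, using three factors from this section and made-up quantities; the column name cyp_total is hypothetical.

```r
library(dplyr)

toy <- tibble(
  element = c("Items - FP - Male condom (Registered FP)",
              "Items - FP - Injectable - 3 month-DEPO-IM",
              "Items - FP - Implant - 5 years-Jadelle Insertion"),
  value = c(120, 40, 3),          # made-up quantities
  CYP   = c(0.00833, 0.25, 3.8)   # per-unit factors used in this section
)

# CYP generated by each row, then the overall total
toy <- toy |>
  mutate(cyp_total = value * CYP)

total_cyp <- sum(toy$cyp_total)
```

With these numbers, 120 condoms contribute roughly one couple-year, 40 injections contribute 10, and three Jadelle insertions contribute 11.4.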
fpam_cleaner <- fpam_clean |>
  mutate(CYP = case_when(
    element == "Items - FP - BTL-Caesarian Section" ~ 10,
    element == "Items - FP - BTL-Interval" ~ 10,
    element == "Items - FP - BTL-Postpartum" ~ 10,
    element == "Items - FP - EC - IUCD" ~ 4.6,
    element == "Items - FP - EC - Pills" ~ 0.05,
    element == "Items - FP - Female condom (Registered FP)" ~ 0.00833,
    element == "Items - FP - Female condom back up" ~ 0.00833,
    element == "Items - FP - Implant - 3 years-Implanon-Insertion" ~ 2.5,
    element == "Items - FP - Implant - 4 years- Levoplant-Insertion" ~ 2.5,
    element == "Items - FP - Implant - 5 years-Jadelle Insertion" ~ 3.8,
    element == "Items - FP - Injectable - 1 month" ~ 0.25,
    element == "Items - FP - Injectable - 2 month" ~ 0.25,
    element == "Items - FP - Injectable - 3 month Sayana press" ~ 0.25,
    element == "Items - FP - Injectable - 3 month-DEPO-IM" ~ 0.25,
    element == "Items - FP - Injectable - 3 month-DEPO-SC P" ~ 0.25,
    element == "Items - FP - Injectable - 3 month-DEPO-SC SI" ~ 0.25,
    element == "Items - FP - IUCD Insertion- (5 years)" ~ 4.6,
    element == "Items - FP - IUCD Insertion- (10 years)" ~ 4.6,
    element == "Items - FP - Male condom (Registered FP)" ~ 0.00833,
    element == "Items - FP - Male condom back up" ~ 0.00833,
    element == "Items - FP - Oral Contraceptives (COC)" ~ 0.0667,
    element == "Items - FP - Oral Contraceptives (POP)" ~ 0.0833,
    element == "Items - FP - Vasectomy" ~ 10))
Step 6: Final Dataset
I keep only the columns needed for analysis, convert the categorical variables to factors, and write the result to CSV.
fpam_cleaner <- fpam_cleaner |>
  select(region, district, ta_town, sdp, sdp_group,
         methods_group, element, indicator, gender,
         client_category, age_category,
         year, month, period, CYP, value) |>
  mutate(across(c(district, age_category, methods_group,
                  indicator, gender, client_category), as.factor))
write_csv(fpam_cleaner, here("fpam_data.csv"))
Summary
The complete pipeline moves from raw DHIS2 API output to an analysis-ready dataset in six steps:
- Connect to the DHIS2 API using secure credentials
- Explore metadata to identify relevant org units, data elements, and category combinations
- Extract disaggregated analytics data for 59 SDPs and 61 data elements from January 2023 to the present
- Parse the category column into age group, client type, and gender
- Classify each record by contraceptive method group, service indicator, and SDP type
- Calculate and assign CYP conversion factors for each commodity line
The output CSV feeds directly into downstream dashboards (Shiny, Power BI) and routine programme reports. This approach replaces manual DHIS2 exports, reduces errors, and makes the extraction fully reproducible.
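CSV files carry no type information, so factor columns written by write_csv() come back as plain character columns in any R consumer such as a Shiny app. A minimal sketch of restoring them on read follows; it uses a made-up two-row table and a temporary file rather than the real output.

```r
library(readr)

# Made-up table with one factor column, written to a temporary file
toy <- data.frame(district = factor(c("Dowa", "Dedza")), value = c(5, 9))
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# Default read: district comes back as character
plain <- read_csv(path, show_col_types = FALSE)

# Declaring the column type restores the factor
typed <- read_csv(path,
                  col_types = cols(district = col_factor(),
                                   value = col_double()))

class(plain$district)
class(typed$district)
```

With col_factor() and no explicit levels, the levels follow the order in which values appear in the file.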