David Mwale
Data Scientist | MSc Data Science for Health and Social Care | University of Edinburgh
I build data science products that improve health outcomes. With over 10 years of experience working with health, agriculture, and nutrition data across Malawi, I apply statistical modelling, machine learning, and reproducible research to solve real problems.
Currently finalising my MSc at the University of Edinburgh in Data Science for Health and Social Care. Open to Data Science roles.
Featured Projects
Data Cleaning and Imputation
Demonstrated a full data cleaning pipeline on deliberately corrupted Iris data (generated with the messy package): standardized over 68 messy species strings, converted data types, assessed missingness with naniar, and imputed missing values using Random Forest (missForest). Finally, I validated that the imputed distributions closely matched those of the original dataset.
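The label-standardization step can be sketched as follows. This is a minimal Python illustration of the idea, not the project's R code; the function name `standardize_species` and the rule of stripping everything except letters are assumptions made for the example.

```python
import re

# The three species labels in the original Iris dataset.
CANONICAL = ("setosa", "versicolor", "virginica")


def standardize_species(raw):
    """Map a messy label to a canonical species name.

    Returns None when the label is missing or cannot be resolved,
    so that it can be handled later as a missing value.
    """
    if raw is None:
        return None
    # Lower-case, then drop everything that is not a letter
    # (stray whitespace, punctuation, digits, underscores).
    cleaned = re.sub(r"[^a-z]", "", raw.lower())
    return cleaned if cleaned in CANONICAL else None


if __name__ == "__main__":
    for label in ["  SeTosa! ", "vir_ginica", "VERSICOLOR", "", None]:
        print(repr(label), "->", standardize_species(label))
```

Returning None for unresolvable labels keeps the cleaning and imputation steps separate: the cleaner never guesses, and anything it cannot resolve is left for the missing-data stage.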
Predicting Diabetes with Machine Learning
Built logistic regression and Random Forest classifiers on NHANES data using tidymodels. Random Forest improved sensitivity from 4% to 34% for positive diabetes cases. Model evaluation showed a ROC-AUC of 0.89, indicating good performance in distinguishing diabetic from non-diabetic participants.
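For readers less familiar with the metric, sensitivity is the share of true positive cases that the model correctly flags. The sketch below is a Python illustration with made-up counts chosen only to reproduce the percentages quoted above; they are not the project's actual confusion-matrix values.

```python
def sensitivity(true_pos, false_neg):
    """Sensitivity (recall) = TP / (TP + FN)."""
    total_positive = true_pos + false_neg
    if total_positive == 0:
        raise ValueError("no positive cases: sensitivity is undefined")
    return true_pos / total_positive


if __name__ == "__main__":
    # Hypothetical counts per 100 true diabetes cases (illustration only).
    print(f"4 of 100 cases detected:  {sensitivity(4, 96):.0%}")
    print(f"34 of 100 cases detected: {sensitivity(34, 66):.0%}")
```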
Risk Factors for Arthritis: Statistical Modelling
Investigated associations between arthritis and risk factors (sex, age, BMI, physical activity) using 360,000 observations from the CDC BRFSS survey. Applied chi-square tests, Wilcoxon rank-sum tests, relative risk ratios, and logistic regression. Females had 54% higher adjusted odds of arthritis (OR = 1.54, 95% CI: 1.52-1.57).
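The link between an odds ratio and a "percent higher odds" statement is simple arithmetic: an OR of 1.54 corresponds to (1.54 - 1) x 100 = 54% higher odds. The Python sketch below shows that conversion and, for context, an unadjusted odds ratio with a Wald 95% confidence interval from a 2x2 table. The counts are hypothetical, the function names are assumptions, and the result reported above is an adjusted OR from logistic regression, which this simple calculation does not reproduce.

```python
import math


def percent_higher_odds(odds_ratio):
    """Convert an odds ratio into a percent change in odds."""
    return (odds_ratio - 1.0) * 100.0


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald CI from a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR).
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    print(f"OR 1.54 -> {percent_higher_odds(1.54):.0f}% higher odds")
    # Hypothetical 2x2 counts, for illustration only.
    or_, lo, hi = odds_ratio_wald_ci(20, 80, 10, 90)
    print(f"OR = {or_:.2f}, 95% CI: {lo:.2f} to {hi:.2f}")
```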
Interactive Shiny Dashboard
A live, interactive R Shiny application deployed on shinyapps.io to enable real-time tracking and fast-track project decisions.
DHIS2 Data Pipeline
Built a reproducible R pipeline to extract, clean, and structure family planning service data from DHIS2 using the khisr package. Parsed composite strings in the DHIS2 category column, mapped 61 data elements to their respective FP method groups, calculated Couple Years of Protection (CYP), and exported analysis-ready data for 59 service delivery points across Malawi.
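The CYP step amounts to dividing the quantity of each method distributed by the number of units that provide one couple-year of protection, then summing across methods. The Python sketch below illustrates the calculation only; it is not the pipeline's R code. The method keys and the `UNITS_PER_CYP` values are illustrative assumptions and should be checked against the currently published conversion factors before any real use.

```python
# Units of each method assumed to provide one couple-year of protection.
# Illustrative values only: verify against the official conversion table.
UNITS_PER_CYP = {
    "male_condom": 120,        # condoms per CYP (assumed)
    "pill_cycle": 15,          # pill cycles per CYP (assumed)
    "injectable_3_month": 4,   # doses per CYP (assumed)
}


def calculate_cyp(quantities):
    """Sum CYP over methods, given {method: units distributed}.

    Raises KeyError for a method with no conversion factor, so that
    unmapped data elements are caught rather than silently dropped.
    """
    total = 0.0
    for method, units in quantities.items():
        total += units / UNITS_PER_CYP[method]
    return total


if __name__ == "__main__":
    # Hypothetical monthly distribution for one service delivery point.
    site = {"male_condom": 240, "injectable_3_month": 8}
    print(f"CYP: {calculate_cyp(site):.1f}")
```

Failing loudly on an unmapped method is a deliberate choice in this sketch: with dozens of data elements mapped to method groups, a silent default would hide mapping gaps.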
Technical Skills
Languages: R (tidyverse, tidymodels, ggplot2, Shiny), Python (pandas, matplotlib, seaborn), SQL (SQLite, MySQL, PostgreSQL)
Tools: Git/GitHub, Quarto, RMarkdown, Microsoft Azure Databricks, Power BI
Methods: Logistic regression, random forests, data wrangling, reproducible research, epidemiological analysis