In this project we will use the NHANES dataset to predict diabetes from the available risk factors. The National Health and Nutrition Examination Survey (NHANES) is a program in the US designed to assess the health and nutritional status of adults and children. The data include demographic, socio-economic, dietary, and health-related information.
Loading the packages, including the one containing the dataset.
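The library calls were not preserved in the post; the list below is an inference from the functions used later (NHANES for the data, tidymodels for the modeling stack, janitor for clean_names(), cowplot for plot_grid()), not the author's exact code.

# Assumed package list, inferred from the functions used later in this post
library(NHANES)     # the NHANES survey data
library(tidymodels) # recipes, parsnip, rsample, yardstick, workflows, ...
library(janitor)    # clean_names()
library(cowplot)    # plot_grid() for arranging the confusion-matrix plots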
Inspecting the dataset
We are going to save the selected variables into the nhanes_df object so that the original dataset stays intact.
nhanes_df <- NHANES |>
  select(Diabetes, DirectChol, BMI, MaritalStatus, Age, Gender) |>
  drop_na() |>
  clean_names()

# Changing the levels for appropriate analysis
nhanes_df <- nhanes_df |>
  mutate(diabetes = factor(diabetes, levels = c("Yes", "No"))) |>
  glimpse()
Rows: 6,786
Columns: 6
$ diabetes <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No,…
$ direct_chol <dbl> 1.29, 1.29, 1.29, 1.16, 2.12, 2.12, 2.12, 0.67, 0.96, 1…
$ bmi <dbl> 32.22, 32.22, 32.22, 30.57, 27.24, 27.24, 27.24, 23.67,…
$ marital_status <fct> Married, Married, Married, LivePartner, Married, Marrie…
$ age <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 58, 50, 33, 60,…
$ gender <fct> male, male, male, female, female, female, female, male,…
In the code below, we split our data into training and testing sets (80%/20%) and stratify by the target variable so that both sets preserve the original proportion of diabetes cases, rather than risking one set receiving nearly all of the rare positive cases.
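The splitting code itself was not preserved; a minimal rsample sketch, assuming the object names ml_split, ml_training, and ml_testing (ml_training is the name the recipe below expects):

# Stratified 80/20 split so both sets keep the same diabetes prevalence
set.seed(123)
ml_split    <- initial_split(nhanes_df, prop = 0.8, strata = diabetes)
ml_training <- training(ml_split)
ml_testing  <- testing(ml_split)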
Using the hypothetical threshold of 0.8, we can conclude that the predictors are not strongly correlated; the check itself is sketched below. After building the recipe, we fit both models using the fit() function, and then collect the predictions and combine them with the observed outcomes.
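The correlation code was not included in the post; one minimal way to reproduce the check on the numeric predictors (an assumption, not the author's exact code):

# Pairwise correlations among the numeric predictors, to compare
# against the hypothetical 0.8 threshold mentioned above
nhanes_df |>
  select(direct_chol, bmi, age) |>
  cor() |>
  round(2)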
In the code below we specify a recipe object and then add steps for engineering our features. These steps preprocess the data into a form intended to improve the analysis.
set.seed(123)
lr_recipe <- recipe(diabetes ~ ., data = ml_training) |>
  step_log(all_numeric()) |>
  step_normalize(all_numeric()) |> # Centering and scaling
  step_dummy(all_nominal(), -all_outcomes())
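The model-fitting code was not included; a plausible sketch of the logistic regression side, assuming parsnip's logistic_reg() with the glm engine and the lr_results name used by the confusion-matrix code below:

set.seed(123)
# Logistic regression specification (assumed glm engine)
lr_model <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

lr_workflow <- workflow() |>
  add_recipe(lr_recipe) |>
  add_model(lr_model)

lr_fit <- lr_workflow |>
  fit(data = ml_training)

# Combine class predictions and probabilities with the observed outcome
lr_results <- predict(lr_fit, ml_testing) |>
  bind_cols(predict(lr_fit, ml_testing, type = "prob")) |>
  bind_cols(ml_testing |> select(diabetes))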
set.seed(123)
heatmap_lr <- conf_mat(lr_results, truth = diabetes, estimate = .pred_class) |>
  autoplot(type = "heatmap")
mosaic_lr <- conf_mat(lr_results, truth = diabetes, estimate = .pred_class) |>
  autoplot(type = "mosaic")
cowplot::plot_grid(mosaic_lr, heatmap_lr)
The confusion matrix, metrics, and plots show that the model is excellent at predicting people who do not have diabetes, so it has a low false positive rate. Even though the accuracy of the model is 91.1%, it struggles to correctly identify people who actually have diabetes, making accuracy a poor measure here: out of 120 positive cases, the model correctly predicts only 4. To add more nuance to the results we will also plot the ROC curve and check the area under it, which shows the model's discriminative ability.
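The ROC code was not preserved either; a minimal yardstick version, assuming lr_results carries the .pred_Yes probability column ("Yes" is the first factor level, so it is treated as the event by default):

# ROC curve and area under the curve for the logistic regression model
lr_results |>
  roc_curve(truth = diabetes, .pred_Yes) |>
  autoplot()

lr_results |>
  roc_auc(truth = diabetes, .pred_Yes)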
The random forest model performs better than logistic regression by almost all metrics; a sketch of its specification follows the metrics below.
Accuracy : 0.93
Sensitivity : 0.35
Specificity : 0.991
ROC-AUC : 0.89
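The random forest code was not included in the post; a minimal sketch mirroring the logistic regression workflow, assuming the ranger engine and reusing the same recipe (rf_results parallels lr_results above):

set.seed(123)
# Random forest specification (assumed ranger engine, default parameters)
rf_model <- rand_forest() |>
  set_engine("ranger") |>
  set_mode("classification")

rf_workflow <- workflow() |>
  add_recipe(lr_recipe) |>
  add_model(rf_model)

rf_fit <- rf_workflow |>
  fit(data = ml_training)

rf_results <- predict(rf_fit, ml_testing) |>
  bind_cols(predict(rf_fit, ml_testing, type = "prob")) |>
  bind_cols(ml_testing |> select(diabetes))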
The random forest model improves sensitivity from 4.07% (logistic regression) to 35% (random forest), meaning that it is relatively better at identifying positive cases, although it still misses around 65% of them.
Note that other model-building practices, such as hyperparameter tuning with k-fold cross-validation, have been skipped; a hypothetical sketch of what that could look like follows.
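This sketch is purely illustrative and not part of the original analysis; the fold count, grid size, and object names are all assumptions:

set.seed(123)
# 10-fold cross-validation, stratified like the original split
folds <- vfold_cv(ml_training, v = 10, strata = diabetes)

# Mark mtry and min_n for tuning (hypothetical choice of parameters)
rf_tune_model <- rand_forest(mtry = tune(), min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("classification")

rf_tune_workflow <- workflow() |>
  add_recipe(lr_recipe) |>
  add_model(rf_tune_model)

# Evaluate a small space-filling grid across the folds
rf_tuned <- tune_grid(
  rf_tune_workflow,
  resamples = folds,
  grid = 10,
  metrics = metric_set(roc_auc, sens, spec)
)

select_best(rf_tuned, metric = "roc_auc")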
Given the dataset, we can predict that our patient (patient 1) does not have diabetes; recall that the model has high accuracy and high specificity, so it excels at identifying negative cases. Since there are far fewer patients with diabetes than without, the dataset is imbalanced. To address this, we could adjust the classification threshold away from the default 0.5, or rebalance the training data by oversampling the rare class or undersampling the majority class.
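Neither remedy is implemented in the post; a hypothetical sketch of the oversampling option using the themis package (the package and the balanced_recipe name are assumptions, not the author's code):

library(themis)  # recipe steps for class imbalance (assumed, not loaded above)

# Same recipe as before, plus upsampling of the minority "Yes" class;
# step_upsample is skipped when baking new data, so the test set is untouched
balanced_recipe <- recipe(diabetes ~ ., data = ml_training) |>
  step_log(all_numeric()) |>
  step_normalize(all_numeric()) |>
  step_dummy(all_nominal(), -all_outcomes()) |>
  step_upsample(diabetes)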