stat_modelling

1.0 Introduction

In this project we use a subset of the data from the Behavioral Risk Factor Surveillance System (BRFSS), run by the CDC, to investigate possible risk factors for arthritis. The BRFSS is a system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, the BRFSS now collects data in all 50 states. This dataset contains health and social information about non-institutionalized adults in the US in 2013. It includes 359,925 individuals (rows) and 16 variables (columns). The goal of the study is to investigate risk factors associated with arthritis.

Data preparation

To prepare the dataset for analysis, we used a variety of functions to explore its structure, variability, and completeness. The map_df() function was used to check the number of unique values for each variable, dim() to confirm the dimensions (359,925 rows, and 16 columns), and the summary() function was used to obtain summary statistics.

map_df(df_brfss,n_distinct)
# A tibble: 1 × 16
  fruits veggies   age under30 age30to64 age65plus arthritis female genhealth
   <int>   <int> <int>   <int>     <int>     <int>     <int>  <int>     <int>
1     10      10    13       2         2         2         2      2         5
# ℹ 7 more variables: education <int>, income <int>, active <int>,
#   active1 <int>, bmi <int>, bmicat <int>, activetimes <int>
df_brfss |> 
  dim() # checking dimensions of the dataset
[1] 359925     16

The dataset has 359,925 rows and 16 columns.

df_brfss |> 
  summary()
     fruits          veggies           age            under30      
 Min.   :0.0000   Min.   :0.000   Min.   : 1.000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.000   1st Qu.: 5.000   1st Qu.:0.0000  
 Median :1.0000   Median :1.000   Median : 8.000   Median :0.0000  
 Mean   :0.9969   Mean   :0.805   Mean   : 7.495   Mean   :0.0949  
 3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:10.000   3rd Qu.:0.0000  
 Max.   :9.0000   Max.   :9.000   Max.   :13.000   Max.   :1.0000  
   age30to64        age65plus        arthritis          female      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :1.0000   Median :0.0000   Median :0.0000   Median :1.0000  
 Mean   :0.6004   Mean   :0.3047   Mean   :0.3336   Mean   :0.5679  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
   genhealth       education         income          active      
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :0.0000  
 1st Qu.:2.000   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:0.0000  
 Median :2.000   Median :5.000   Median :6.000   Median :1.0000  
 Mean   :2.536   Mean   :4.935   Mean   :5.674   Mean   :0.7388  
 3rd Qu.:3.000   3rd Qu.:6.000   3rd Qu.:8.000   3rd Qu.:1.0000  
 Max.   :5.000   Max.   :6.000   Max.   :8.000   Max.   :1.0000  
    active1          bmi            bmicat       activetimes    
 Min.   :0.00   Min.   :12.02   Min.   :1.000   Min.   :  0.00  
 1st Qu.:0.00   1st Qu.:23.75   1st Qu.:2.000   1st Qu.:  0.00  
 Median :1.00   Median :26.96   Median :3.000   Median : 12.00  
 Mean   :0.94   Mean   :27.97   Mean   :2.942   Mean   : 11.83  
 3rd Qu.:1.00   3rd Qu.:30.99   3rd Qu.:4.000   3rd Qu.: 20.00  
 Max.   :2.00   Max.   :93.55   Max.   :4.000   Max.   :396.00  
df_brfss |> 
  head()
# A tibble: 6 × 16
  fruits veggies   age under30 age30to64 age65plus arthritis female genhealth
   <dbl>   <dbl> <dbl>   <dbl>     <dbl>     <dbl>     <dbl>  <dbl>     <dbl>
1      0       0     7       0         1         0         0      1         3
2      0       1     8       0         1         0         1      1         3
3      0       1     9       0         1         0         0      1         2
4      0       1    10       0         0         1         0      0         3
5      1       1     6       0         1         0         0      1         2
6      1       1     7       0         1         0         1      0         1
# ℹ 7 more variables: education <dbl>, income <dbl>, active <dbl>,
#   active1 <dbl>, bmi <dbl>, bmicat <dbl>, activetimes <dbl>

In the code below we subset the dataset to the following variables of interest:

  • Arthritis
  • Physical activity
  • BMI
  • Age (65 or over)
  • Sex

# Sub-setting and inspecting the dataset
lean_brfss<-df_brfss |> 
  select(arthritis,female,age65plus,active,bmi) |> 
  mutate(arthritis=factor(if_else(arthritis==1,"Yes","No"),
                          levels = c("Yes","No")),
         age_65_or_over=factor(if_else(age65plus==1,"Yes","No"),
                               levels = c("Yes","No")),
         Sex=factor(if_else(female==1,"Female","Male"),levels = c("Female","Male")),
         physical_activity=factor(if_else(active==1,"Yes","No"))) |> 
  glimpse()
Rows: 359,925
Columns: 8
$ arthritis         <fct> No, Yes, No, No, No, Yes, No, No, Yes, No, Yes, No, …
$ female            <dbl> 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0…
$ age65plus         <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0…
$ active            <dbl> 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1…
$ bmi               <dbl> 18.22, 27.46, 21.97, 35.94, 39.86, 30.17, 28.29, 29.…
$ age_65_or_over    <fct> No, No, No, Yes, No, No, Yes, No, No, Yes, Yes, No, …
$ Sex               <fct> Female, Female, Female, Male, Female, Male, Female, …
$ physical_activity <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, No, No, Yes, Y…

Exploratory Data Analysis

In this section we conduct an exploratory data analysis to investigate and understand the structure of our data. We use a range of graphical and numerical methods, including means, histograms, Q-Q plots, box plots, and contingency tables. Histograms were selected because they clearly show the distribution of the data (bmi in this case) and give a quick view of whether the data are skewed. Since one graphical method is rarely sufficient, Q-Q plots add a further perspective on the normality of the data. Finally, box plots are used because they make outliers easy to spot.

lean_brfss |> 
  is.na() |> 
  sum()
[1] 0

The dataset has no missing values, as the count of missing values is zero. The plot below shows the distribution of the target variable (arthritis).

lean_brfss %>% 
  ggplot(aes(x = arthritis, fill = arthritis))+
  geom_bar(aes(y = after_stat(count/sum(count))), stat = "count") +
  theme_bw() +
  labs(title = "Arthritis prevalence",
       x = "Arthritis Status",
       y = "Proportion",
       fill = "Arthritis") +
   scale_y_continuous(labels = scales::percent_format(scale = 100))

lean_brfss |> 
  ggplot(aes(x = bmi)) +
  geom_boxplot(aes(),color = "steelblue") +
  coord_flip() +
  #facet_grid(~arthritis)+
  theme_bw()

The box plot shows that bmi has many outliers, which could affect the analysis and bias the results. In the next section we remove these outliers.

Removing outliers

Q <- quantile(lean_brfss$bmi, probs = c(.25, .75), na.rm = FALSE) # lower and upper quartiles
iqr <- IQR(lean_brfss$bmi) # interquartile range
up <- Q[2] + 1.5 * iqr # upper fence
low <- Q[1] - 1.5 * iqr # lower fence
# keep only rows whose bmi lies strictly between the two fences
brfss_eliminated <- subset(lean_brfss, bmi > low & bmi < up)
with_outliers<-lean_brfss |> 
  ggplot(aes(x = bmi)) +
  geom_boxplot(aes(),color = "steelblue") +
  coord_flip() +
  #facet_grid(~arthritis)+
  theme_bw() +
  labs(title = "With outliers")
without_outliers<-brfss_eliminated |> 
  ggplot(aes(x = bmi)) +
  geom_boxplot(aes(),color = "steelblue") +
  coord_flip() +
  #facet_grid(~arthritis)+
  theme_bw() +
  labs(title = "Without outliers")
cowplot::plot_grid(with_outliers, without_outliers)
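
As a rough check on the 1.5 × IQR rule used above, the fences can be worked out by hand from the quartiles printed by summary() (23.75 and 30.99). This is only a sketch: the variable names are illustrative, and because the printed quartiles are rounded the fences are approximate.

```r
# Quartiles of bmi as printed by summary(df_brfss) (rounded to 2 d.p.)
q1 <- 23.75
q3 <- 30.99

iqr_bmi <- q3 - q1                  # interquartile range: 7.24
lower_fence <- q1 - 1.5 * iqr_bmi   # 12.89
upper_fence <- q3 + 1.5 * iqr_bmi   # 41.85

# Row counts before and after filtering, as reported in this document
n_before <- 359925
n_after  <- 348461
removed  <- n_before - n_after      # 11464 rows
pct_removed <- 100 * removed / n_before

c(lower = lower_fence, upper = upper_fence,
  pct_removed = round(pct_removed, 1))
```

The minimum BMI in the data (12.02) falls just below the lower fence and the maximum (93.55) far above the upper one, so rows are dropped at both ends, about 3.2% of the dataset in total.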

brfss_eliminated |> 
  ggplot(aes(x = bmi))+
  geom_histogram(aes(y = after_stat(density), fill = after_stat(count)),bins = 30)+
  facet_grid(~arthritis) +
  theme_bw()

sex_subt<-brfss_eliminated |> 
  select(Sex,arthritis)

activity_subt<-brfss_eliminated |> 
  select(physical_activity,arthritis)

age_subt<-brfss_eliminated |> 
  select(age_65_or_over,arthritis)
sex_subt |> 
  tbl_cross(row = Sex,
            col = arthritis,
            percent = "column") |> 
  bold_labels()
                         arthritis
Sex            Yes              No               Total
    Female     71,737 (63%)     124,693 (53%)    196,430 (56%)
    Male       42,144 (37%)     109,887 (47%)    152,031 (44%)
Total          113,881 (100%)   234,580 (100%)   348,461 (100%)

activity_subt |> 
  tbl_cross(row = physical_activity,
            col = arthritis,
            percent = "column") |> 
  bold_labels()
                              arthritis
physical_activity   Yes              No               Total
    No              36,915 (32%)     51,544 (22%)     88,459 (25%)
    Yes             76,966 (68%)     183,036 (78%)    260,002 (75%)
Total               113,881 (100%)   234,580 (100%)   348,461 (100%)

age_subt |> 
  tbl_cross(row = age_65_or_over,
            col = arthritis,
            percent = "column") |> 
  bold_labels()
                            arthritis
age_65_or_over    Yes              No               Total
    Yes           56,240 (49%)     51,271 (22%)     107,511 (31%)
    No            57,641 (51%)     183,309 (78%)    240,950 (69%)
Total             113,881 (100%)   234,580 (100%)   348,461 (100%)

brfss_eliminated %>% 
  count(Sex, arthritis) %>%
  ggplot(aes(x = Sex, 
             y = arthritis)) +
  geom_tile(aes(fill=n)) +
  labs(x = "Sex",
       y = "Arthritis",
       fill = "Count") +
  scale_fill_distiller(palette = "RdPu") +
  theme_classic(base_size = 16)

Observations

The box plots show that filtering removed the most extreme BMI values, leaving fewer outliers. The histograms show a right-skewed BMI distribution. The contingency tables report column percentages: females make up 63% of those with arthritis but 53% of those without, those aged 65 or over make up 49% versus 22%, and physically inactive individuals make up 32% versus 22%. These patterns suggest higher arthritis prevalence among females, older adults, and inactive individuals. Because BMI is skewed and contains outliers, we attempt to transform this variable prior to analysis.
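
Because tbl_cross(percent = "column") reports percentages within each arthritis status, prevalence within each sex needs row percentages instead. The sketch below rebuilds the Sex by arthritis table from the counts shown above (the object names are illustrative) and computes both.

```r
# Counts copied from the Sex x arthritis cross-table above
tab_sex <- matrix(c(71737, 124693,
                    42144, 109887),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Sex = c("Female", "Male"),
                                  arthritis = c("Yes", "No")))

# Column percentages: share of each sex within an arthritis status
col_pct <- prop.table(tab_sex, margin = 2)

# Row percentages: arthritis prevalence within each sex
row_pct <- prop.table(tab_sex, margin = 1)

round(100 * col_pct, 1)
round(100 * row_pct, 1)
```

Read by rows, about 36.5% of females and 27.7% of males report arthritis, which is the comparison the relative risk is built on.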

Analysis plan

The analysis aims to investigate and quantify associations between arthritis (Yes/No) and four risk factors: sex (Female, Male), age (65 or over versus under 65), physical activity (Yes/No), and BMI. We will use both descriptive and inferential methods to address the research questions. Descriptive statistics (means, medians) are appropriate for summarising the data and identifying preliminary patterns.

Chi-square tests will be performed to assess the association between arthritis and each categorical variable (physical activity, sex, age category).
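
A minimal sketch of such a test, including the expected-count check it relies on, is given here. It uses the Sex by arthritis counts from the cross-table in the exploratory section; the object names are illustrative.

```r
# Sex x arthritis counts from the cross-table in the exploratory section
tab_sex <- matrix(c(71737, 124693,
                    42144, 109887),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Sex = c("Female", "Male"),
                                  arthritis = c("Yes", "No")))

# For 2 x 2 tables chisq.test() applies Yates' continuity correction by default
test_sex <- chisq.test(tab_sex)

test_sex$expected               # expected counts under independence
min(test_sex$expected) >= 5     # assumption: every expected count is at least 5
test_sex$statistic              # chi-square statistic
test_sex$p.value
```

With samples this large every expected count is far above 5, so the assumption is comfortably met.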

We will also use a t-test or a Wilcoxon rank-sum test (depending on normality) to compare BMI between those who have arthritis and those who do not. BMI is a continuous numeric variable and arthritis is a binary categorical variable, so a two-group comparison is required, which makes the t-test or the Wilcoxon rank-sum test the method of choice.

We will also calculate the relative risk (risk ratio) of having arthritis by sex, age, and physical activity. Relative risk is preferred over the odds ratio here because it can be interpreted directly as a ratio of risks.
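
The calculation behind a relative risk can be sketched in a few lines. rr_wald() is a hypothetical helper written for this illustration (it is not part of epitools); it returns the ratio of two risks with a Wald confidence interval computed on the log scale, using counts from the age cross-table in the exploratory section.

```r
# Relative risk of an outcome in an exposed group versus a reference group,
# with a Wald confidence interval on the log scale.
# rr_wald() is a hypothetical helper for this sketch.
rr_wald <- function(events_exposed, n_exposed, events_ref, n_ref, conf = 0.95) {
  rr <- (events_exposed / n_exposed) / (events_ref / n_ref)
  se_log <- sqrt(1 / events_exposed - 1 / n_exposed +
                 1 / events_ref - 1 / n_ref)
  z <- qnorm(1 - (1 - conf) / 2)
  c(rr = rr,
    lower = exp(log(rr) - z * se_log),
    upper = exp(log(rr) + z * se_log))
}

# Arthritis among those aged 65 or over (exposed) versus under 65 (reference)
rr_arthritis_age <- rr_wald(56240, 107511, 57641, 240950)
rr_arthritis_age

# Same helper with "No arthritis" as the outcome and under 65 as the exposed group
rr_no_arthritis_age <- rr_wald(183309, 240950, 51271, 107511)
rr_no_arthritis_age
```

The second call treats not having arthritis as the outcome. Which column a package treats as the outcome therefore changes the reported ratio, so it is worth confirming before interpreting.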

Finally, we will fit a logistic regression model to estimate the association of sex, age, physical activity, and BMI with the target variable (arthritis), each adjusted for the others.

Data dictionary

  • Arthritis: A binary outcome variable with levels Yes (1) and No (2).
  • Physical activity: A binary explanatory variable with levels No (1) and Yes (2), indicating whether or not an individual is physically active.
  • Age 65 or over: A binary explanatory variable with levels Yes (1) and No (2), indicating whether or not the individual is aged 65 or over.
  • Sex: A binary explanatory variable with levels Female (1) and Male (2).
  • BMI: A continuous numeric variable giving the Body Mass Index (BMI) of an individual.

Hypothesis

In our analysis we test the following hypotheses:

Chi-square tests

  • Null: There is no association between arthritis and the sex of an individual.
  • Alternative: There is an association between arthritis and the sex of an individual.

  • Null: There is no association between arthritis and the physical activity of an individual.
  • Alternative: There is an association between arthritis and the physical activity of an individual.

  • Null: There is no association between arthritis and the age category (65 or over) of an individual.
  • Alternative: There is an association between arthritis and the age category (65 or over) of an individual.

Normality-tests

  • Null: The sample data comes from a normally distributed population.
  • Alternative: The sample data does not come from a normally distributed population

Investigation of assumptions

In this analysis, we make the following assumptions about our data:

  • The data are normally distributed.
  • The expected values in the contingency table for categorical variables are not less than 5 in each cell.

Visual inspection using graphs

lean_brfss |> 
  select(arthritis,bmi) |> 
  ggplot(aes(sample = bmi))+
  geom_qq()+
  geom_qq_line(colour = "red", linewidth = 1) +
  facet_grid(~arthritis) +
  theme_bw() 

brfss_eliminated |> 
  ggplot(aes(x = bmi))+
  geom_histogram(aes(y = after_stat(density), 
                     fill = after_stat(count)))+
  #geom_density(aes(y = after_stat(density)), color = "red", linewidth = 1) +
  # mean (purple) and median (orange) of the same data that is plotted
  geom_vline(xintercept = mean(brfss_eliminated$bmi), 
             color = "purple", 
             linewidth = 0.8)+
  geom_vline(xintercept = median(brfss_eliminated$bmi), 
             color = "orange", 
             linewidth = 0.8)+
  facet_grid(~arthritis) +
  theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The graphical analysis shows that bmi is not normally distributed. Both the Q-Q plots and the histograms show that the data are skewed. In the Q-Q plots, the points deviate upwards from the red line in the arthritis groups, indicating a violation of normality. In the histograms, the mean and median are not equal, which is consistent with skewness. We will attempt to transform bmi, starting with the natural log.

Data Transformation

Visual inspection shows that bmi is not normally distributed: the histograms have longer right tails, and the points in the Q-Q plots deviate from the red line. Below, we attempt to transform bmi using three methods: log, square root, and square.

Log transformation

lean_brfss |> 
  ggplot(aes(x = log(bmi)))+
  geom_histogram(aes(y = after_stat(density), 
                     fill = after_stat(count)))+
  #geom_density(aes(y = after_stat(density)), color = "red", linewidth = 1) +
  geom_vline(xintercept = mean(log(lean_brfss$bmi)), 
             color = "purple", linewidth = 0.8)+
  geom_vline(xintercept = median(log(lean_brfss$bmi)), 
             color = "orange", linewidth = 0.8)+
  facet_grid(~arthritis) +
  theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Square root transformation

lean_brfss |> 
  ggplot(aes(x = sqrt(bmi)))+
  geom_histogram(aes(y = after_stat(density), 
                     fill = after_stat(count)))+
  #geom_density(aes(y = after_stat(density)), color = "red", linewidth = 1) +
  geom_vline(xintercept = mean(sqrt(lean_brfss$bmi)), 
             color = "purple", linewidth = 0.8)+
  geom_vline(xintercept = median(sqrt(lean_brfss$bmi)), 
             color = "orange", linewidth = 0.8)+
  facet_grid(~arthritis) +
  theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Square transformation

lean_brfss |> 
  ggplot(aes(x = (bmi^2)))+
  geom_histogram(aes(y = after_stat(density), 
                     fill = after_stat(count)))+
  #geom_density(aes(y = after_stat(density)), color = "red", linewidth = 1) +
  geom_vline(xintercept = mean((lean_brfss$bmi^2)), 
             color = "purple", linewidth = 0.8)+
  geom_vline(xintercept = median((lean_brfss$bmi^2)), 
             color = "orange", linewidth = 0.8)+
  facet_grid(~arthritis) +
  theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the above outputs we can see that none of the transformations produced a normally distributed bmi. Since visual checks are necessary but not sufficient, we also apply the Shapiro-Wilk test below.
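
One way to put a number on what the histograms show is to compare sample skewness before and after each transformation. The sketch below uses simulated right-skewed values as a stand-in for bmi (not the BRFSS data), and skewness() is a small helper defined here, not a function from a package used in this report.

```r
set.seed(1234)

# Simulated right-skewed values standing in for bmi (illustrative only)
x <- rlnorm(5000, meanlog = log(27), sdlog = 0.2)

# Moment-based sample skewness: close to 0 for a symmetric distribution,
# positive for a longer right tail
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew <- sapply(list(raw = x, log = log(x), sqrt = sqrt(x), square = x^2),
               skewness)
round(skew, 2)
```

On these simulated values the log reduces right skew the most, the square root less so, and squaring makes it worse. The log works well here only because the data are log-normal by construction; real BMI values need not behave the same way.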

Test of significance

Using a random sample of 4,999 observations from the main dataset (shapiro.test() accepts at most 5,000 values), we test for normality using the Shapiro-Wilk test.

set.seed(1234) # for reproducibility
bmi_sample<-lean_brfss |>
  select(bmi) |> 
  drop_na(bmi) |> 
  sample_n(4999) 

bmi_sample |> 
  pull(bmi) |> 
  shapiro.test()

    Shapiro-Wilk normality test

data:  pull(bmi_sample, bmi)
W = 0.91633, p-value < 2.2e-16

The results show that the test statistic (W = 0.91633) is below 1, indicating a departure from normality. The p-value is extremely small (< 2.2e-16), so there is strong evidence to reject the null hypothesis and conclude that bmi does not come from a normally distributed population.

The normality assumption is not met according to either the graphics or the significance test; only the assumption for the chi-square test of independence is met. Since parametric methods such as Student's t-test and ANOVA require normally distributed data, we use non-parametric methods to analyse BMI, even though these methods have less power.

Primary data analysis

Assessing associations

sex_subt |> 
  table() |> 
  chisq.test()

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(sex_subt)
X-squared = 3016, df = 1, p-value < 2.2e-16
age_subt |> 
  table() |> 
  chisq.test()

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(age_subt)
X-squared = 27231, df = 1, p-value < 2.2e-16
activity_subt |> 
  table() |> 
  chisq.test()

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(activity_subt)
X-squared = 4413, df = 1, p-value < 2.2e-16

wilcox.test(bmi ~ arthritis, data = brfss_eliminated, conf.int = TRUE)

    Wilcoxon rank sum test with continuity correction

data:  bmi by arthritis
W = 1.5633e+10, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 1.489961 1.559952
sample estimates:
difference in location 
              1.520008 
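
The "difference in location" reported above is an estimate of the median of all pairwise differences between the two groups (the Hodges-Lehmann estimate), not a difference of means or of medians. The sketch below illustrates this on two small simulated groups; the data and object names are illustrative only. For small samples without ties R computes the estimate exactly, whereas for large samples such as this one it is approximated numerically.

```r
set.seed(1234)

# Two small simulated groups without ties (not the BRFSS data)
bmi_yes <- rnorm(15, mean = 29, sd = 5)
bmi_no  <- rnorm(15, mean = 27, sd = 5)

res <- wilcox.test(bmi_yes, bmi_no, conf.int = TRUE)

# Median of all pairwise differences between the two groups
hl <- median(outer(bmi_yes, bmi_no, "-"))

c(wilcox_estimate = unname(res$estimate), pairwise_median = hl)
```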

Assessing relative risk (RR)

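# Note: the risk ratio printed by epitab() compares the proportion in the
# second column ("No"), i.e. the likelihood of not having arthritis
sex_subt |> 
  table() |> 
  epitools::epitab(method="riskratio")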
$tab
        arthritis
Sex        Yes        p0     No        p1 riskratio    lower    upper p.value
  Female 71737 0.3652039 124693 0.6347961  1.000000       NA       NA      NA
  Male   42144 0.2772066 109887 0.7227934  1.138623 1.133424 1.143845       0

$measure
[1] "wald"

$conf.level
[1] 0.95

$pvalue
[1] "fisher.exact"
age_subt |> 
  table() |> 
  epitools::epitab(method="riskratio")
$tab
              arthritis
age_65_or_over   Yes        p0     No        p1 riskratio    lower    upper
           Yes 56240 0.5231093  51271 0.4768907  1.000000       NA       NA
           No  57641 0.2392239 183309 0.7607761  1.595284 1.584712 1.605926
              arthritis
age_65_or_over p.value
           Yes      NA
           No        0

$measure
[1] "wald"

$conf.level
[1] 0.95

$pvalue
[1] "fisher.exact"
activity_subt |> 
  table()|> 
  epitools::epitab(method="riskratio")
$tab
                 arthritis
physical_activity   Yes        p0     No        p1 riskratio  lower    upper
              No  36915 0.4173120  51544 0.5826880  1.000000     NA       NA
              Yes 76966 0.2960208 183036 0.7039792  1.208158 1.2008 1.215561
                 arthritis
physical_activity p.value
              No       NA
              Yes       0

$measure
[1] "wald"

$conf.level
[1] 0.95

$pvalue
[1] "fisher.exact"
model_data<-brfss_eliminated |> 
  select(arthritis,bmi,age_65_or_over,Sex,physical_activity)

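# arthritis has levels c("Yes", "No"); glm() treats the first level as failure,
# so this model estimates the odds of "No" (not having arthritis)
model_lr <- glm(arthritis ~ ., data = model_data, family = binomial("logit"))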
summary(model_lr)

Call:
glm(formula = arthritis ~ ., family = binomial("logit"), data = model_data)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)           1.2621793  0.0228869   55.15   <2e-16 ***
bmi                  -0.0660347  0.0007623  -86.62   <2e-16 ***
age_65_or_overNo      1.2707663  0.0079701  159.44   <2e-16 ***
SexMale               0.4330701  0.0078421   55.22   <2e-16 ***
physical_activityYes  0.3712731  0.0085998   43.17   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 440383  on 348460  degrees of freedom
Residual deviance: 400874  on 348456  degrees of freedom
AIC: 400884

Number of Fisher Scoring iterations: 4
model_lr |> 
  tbl_regression(exponentiate = TRUE, conf.int = TRUE) |> 
  add_global_p() |> 
  bold_labels() |> 
  add_glance_table(include = c(nobs,BIC,AIC,logLik))
Characteristic          OR      95% CI        p-value
bmi                     0.94    0.93, 0.94    <0.001
age_65_or_over                                <0.001
    Yes
    No                  3.56    3.51, 3.62
Sex                                           <0.001
    Female
    Male                1.54    1.52, 1.57
physical_activity                             <0.001
    No
    Yes                 1.45    1.43, 1.47

No. Obs.                348,461
BIC                     400,938
AIC                     400,884
Log-likelihood          -200,437

Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Using the finalfit package

dependent <- "arthritis"
explanatory_multi <- c("bmi","age_65_or_over","Sex","physical_activity")
model_data %>% 
  finalfit::or_plot(dependent, explanatory_multi,
          breaks = c(0.5, 1, 2, 3, 4,5),
          table_text_size = 3.5,
          title_text_size = 16)
Waiting for profiling to be done...
Waiting for profiling to be done...
Waiting for profiling to be done...
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_errorbarh()`).

Interpretation

The results from both the bivariate and multivariable analyses present strong evidence of association between arthritis and the explanatory variables (sex, age, physical activity, and BMI):

1. Sex, and Arthritis

  • Chi-square test of independence: There is a statistically significant relationship between sex and arthritis (X-squared = 3016, df = 1, p < 0.001), indicating that arthritis prevalence differs between males and females.

  • Risk ratio: Males have a lower risk of arthritis than females (27.7% versus 36.5%, RR ≈ 0.76), so the risk of arthritis in males is about 24% lower than in females. The epitab output takes the second column ("No") as the outcome, so the ratio it reports (1.14, 95% CI: 1.13-1.14, p < 0.001) is the relative likelihood of not having arthritis in males compared with females.

  • Logistic regression: Adjusted OR = 1.54 (95% CI: 1.52-1.57, p < 0.001). After adjusting for BMI, age, and physical activity, females have 54% higher odds of having arthritis than males. The 95% confidence interval excludes 1 and is very narrow, indicating that the estimate is precise.

2. Age and Arthritis

  • Chi-square test: There is strong evidence of an association between age (65 and over) and arthritis (X-squared = 27231, df = 1, p < 0.001). The p-value is less than 0.05, so we reject the null hypothesis and conclude that the association is unlikely to be due to chance.

  • Risk ratio: The risk of arthritis is 52.3% among individuals aged 65 or older and 23.9% among those under 65 (RR ≈ 2.19), so older individuals are more than twice as likely to have arthritis. The epitab output reports the ratio for not having arthritis in those under 65 relative to those aged 65 or older (1.60, 95% CI: 1.58-1.61, p < 0.001).

  • Logistic regression: Adjusted OR = 3.56 (95% CI: 3.51-3.62, p < 0.001), showing that the odds of having arthritis are about 3.6 times higher for those aged 65 or older than for those under 65, after adjusting for the other variables. The small p-value (less than 0.05) and the narrow confidence interval show that the estimate is both significant and precise.

3. Physical Activity and Arthritis

  • Chi-square test: The association between physical activity and arthritis is significant (X-squared = 4413, df = 1, p < 0.001). Given the small p-value (less than 0.05), we reject the null hypothesis and conclude that the difference is unlikely to be due to random variation alone.

  • Risk ratio: Physically inactive individuals have a higher risk of arthritis than active individuals (41.7% versus 29.6%, RR ≈ 1.41), a 41% higher risk. The epitab output reports the ratio for not having arthritis in active relative to inactive individuals (1.21, 95% CI: 1.20-1.22, p < 0.001).

  • Logistic regression: Adjusted OR = 1.45 (95% CI: 1.43-1.47, p < 0.001). Physical inactivity is associated with 45% higher odds of arthritis, holding the other variables constant. The p-value is below the 0.05 significance level, and the narrow confidence interval, which excludes 1, indicates a precise estimate.

4. BMI and Arthritis

  • Wilcoxon rank-sum test (Mann-Whitney U test): Since BMI is numeric and skewed, this non-parametric test is appropriate. The test is significant (W = 1.5633e+10, p < 0.001), and the estimated difference in location is 1.52 (95% CI: 1.49-1.56), an interval that is narrow and excludes 0. We reject the null hypothesis and conclude that individuals with arthritis tend to have a higher BMI than those without arthritis.

  • Logistic regression: The adjusted OR = 0.94 (95% CI: 0.93-0.94, p < 0.001). Because arthritis is coded with levels Yes, No, glm() models the odds of not having arthritis, so each additional unit of BMI is associated with a 6% decrease in the odds of being arthritis-free, or equivalently about a 7% increase in the odds of arthritis (exp(0.0660) ≈ 1.07). This is statistically significant and agrees in direction with the Wilcoxon result.
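
Because the direction of every odds ratio depends on which level of the outcome glm() treats as the event, the factor coding is worth checking. The sketch below uses simulated data (not the BRFSS data, and with illustrative object names) to show that reversing the level order of a binary outcome flips the sign of every coefficient and turns each odds ratio into its reciprocal.

```r
set.seed(1234)

# Simulated outcome whose probability rises with bmi (illustrative only)
n <- 2000
bmi <- rnorm(n, mean = 27, sd = 4)
has_arthritis <- rbinom(n, 1, plogis(-3 + 0.1 * bmi)) == 1
label <- ifelse(has_arthritis, "Yes", "No")

# Same outcome, two level orders
y_yes_first <- factor(label, levels = c("Yes", "No"))
y_no_first  <- factor(label, levels = c("No", "Yes"))

# glm() treats the first level as failure, so it models the second level
fit_no  <- glm(y_yes_first ~ bmi, family = binomial("logit"))  # models P(No)
fit_yes <- glm(y_no_first  ~ bmi, family = binomial("logit"))  # models P(Yes)

exp(coef(fit_no)["bmi"])    # odds ratio for not having arthritis (below 1)
exp(coef(fit_yes)["bmi"])   # odds ratio for having arthritis (the reciprocal)
```

In this report arthritis is built with levels = c("Yes", "No"), so levels(model_data$arthritis) is worth inspecting before reading the direction of any coefficient.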
