Load packages

library(ggplot2)
library(dplyr)

Load data :

load("brfss2013.RData")

Part 1: About the Data :

The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey that collects data from U.S. residents about their health-related risk behaviors. The BRFSS was established in 1984 and as of 2013 has collected data on nearly 500,000 U.S. residents. To qualify for the survey, prospective respondents must be non-institutionalized adults, aged 18 years or older, currently residing in private or college housing, and reachable by landline or cellular telephone.

The objective of the BRFSS is to collect uniform and state-specific data on preventive health practices and risk behaviors that may affect chronic diseases, injuries, and preventable infectious diseases.

Factors assessed by the BRFSS include: tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days (health-related quality of life), health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruit and vegetable consumption, arthritis burden, and seatbelt use.

Collection Of Data :

The data were collected through a cross-sectional telephone survey, as described at the following link (https://www.cdc.gov/brfss/data_documentation/index.htm), conducted monthly over landline and cellular telephones. The dataset we are working with contains 330 variables for a total of 491,775 samples in 2013. Missing values are denoted by "NA".
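
The dimensions and the amount of missing data can be checked directly. The following is a minimal sketch on a small made-up data frame (`toy` is hypothetical and stands in for brfss2013); the same calls should work on the real data once it is loaded.

```r
# Hypothetical toy data frame standing in for brfss2013 (values are made up).
toy <- data.frame(
  sleptim1 = c(7, NA, 6, 8),
  avedrnk2 = c(2, 1, NA, NA),
  sex      = factor(c("Male", "Female", NA, "Female"))
)

# Number of rows (samples) and columns (variables).
toy_dim <- dim(toy)

# Count of NA values in each variable.
na_counts <- colSums(is.na(toy))

print(toy_dim)
print(na_counts)
```

Variables with a large share of NA values deserve extra care, because filtering on them can remove most of the sample.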

Generalizability :

The results can be generalized to the non-institutionalized adult U.S. population because the survey is based on stratified random sampling. However, non-response may introduce bias, and several records are incomplete because respondents did not finish the survey. These missing data may skew our results.

Causality :

As mentioned above, this is a cross-sectional (observational) study with no random assignment, so we cannot establish causation. We can only establish association or correlation between variables.


Part 2: Research questions

Research question 1: Various studies have examined whether there is an association between alcohol and sleep duration, and similarly between depression and sleep duration. One such study is available through NCBI (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3181883/). Our first research question is therefore whether sleep duration is associated with depression and with alcohol consumption.

Research question 2: As a second question, we would like to know the relationship between dark green vegetables and heart attack. Many studies state that dark green vegetables reduce the risk of heart disease. Similarly, there is research suggesting that eating dark green vegetables "may" reduce the risk of stroke. Our second question is therefore whether eating dark green vegetables is associated with a lower risk of heart disease and stroke.

Research question 3: Finally, we are interested in the relationship between the income and the health care coverage of the respondents.


Part 3: Exploratory data analysis

Research question 1:

# Keep only respondents with both alcohol and sleep data
brfss2013 <- brfss2013 %>%
  filter(!is.na(avedrnk2), !is.na(sleptim1))

# Scatter plot of average drinks against sleep duration
ggplot(brfss2013, aes(avedrnk2, sleptim1)) +
  geom_point(color = "steelblue", alpha = 0.25) +
  labs(x = "Average Drinks", y = "Sleep Duration",
       title = "Alcohol Consumption vs. Respondents' Sleep Duration")

The plot shows that respondents who average 0-20 drinks tend to report longer sleep durations than those who average 40-60 drinks. This suggests that respondents with high alcohol consumption sleep less, although a scatter plot alone cannot confirm it.
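
One way to make this comparison less dependent on reading a dense scatter plot is to summarise sleep within ranges of alcohol consumption. The sketch below uses a made-up data frame (`toy` is hypothetical) and bin edges chosen only for illustration; it is not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data standing in for brfss2013 (values are made up).
toy <- data.frame(
  avedrnk2 = c(1, 2, 5, 12, 25, 30, 45, 60),
  sleptim1 = c(7, 8, 7, 6, 6, 5, 6, 4)
)

# Group average drinks into bins and summarise sleep within each bin.
# The bin edges are an illustrative choice, not taken from the report.
sleep_by_bin <- toy %>%
  mutate(drink_bin = cut(avedrnk2, breaks = c(0, 20, 40, 80))) %>%
  group_by(drink_bin) %>%
  summarise(n = n(), mean_sleep = mean(sleptim1), .groups = "drop")

print(sleep_by_bin)
```

Reporting the group size `n` next to each mean shows how few respondents fall into the higher bins, which matters when judging how reliable those means are.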

Next, we look at how these respondents break down by sex.

# Drop respondents with missing sex
brfss2013 <- brfss2013 %>%
  filter(!is.na(sex))

# Same scatter plot, colored by sex
ggplot(brfss2013, aes(avedrnk2, sleptim1)) +
  geom_point(aes(color = sex), size = 1.5) +
  labs(x = "Average Drinks", y = "Sleep Duration",
       title = "Alcohol Consumption vs. Respondents' Sleep Duration") +
  scale_color_manual(values = c("darkblue", "green"))

The above graph shows the same results, but here we can also distinguish respondents by sex. For a clearer picture, I have calculated below the number of male and female respondents who average 20 or more drinks.

brfss2013%>%
  filter(!is.na(sex), avedrnk2 >= 20) %>%
  select(sex,avedrnk2, sleptim1) %>%
  group_by(sex) %>%
  summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   sex    count
##   <fct>  <int>
## 1 Male     469
## 2 Female   138

We can see that there are 469 male and 138 female respondents who average 20 or more drinks.
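
Raw counts depend on how many men and women are in the sample, so a share within each sex can be easier to compare. The following is a sketch on made-up data (`toy` is hypothetical), not a result from brfss2013.

```r
library(dplyr)

# Hypothetical toy data standing in for brfss2013 (values are made up).
toy <- data.frame(
  sex      = factor(c("Male", "Male", "Male", "Male", "Female", "Female")),
  avedrnk2 = c(25, 3, 1, 30, 22, 2)
)

# For each sex: group size, number averaging 20+ drinks, and that share.
heavy_by_sex <- toy %>%
  group_by(sex) %>%
  summarise(
    respondents = n(),
    heavy       = sum(avedrnk2 >= 20),
    share_heavy = mean(avedrnk2 >= 20),
    .groups     = "drop"
  )

print(heavy_by_sex)
```

In this toy example the counts differ between the two groups while the shares are equal, which is the kind of difference a count-only table can hide.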

Now we want to know the association between depression and sleep duration

brfss2013%>%
  filter(!is.na(addepev2)) %>%
  select(addepev2, sleptim1) %>%
  group_by(addepev2) %>%
  summarise(avg_sleep = mean(sleptim1), median = median(sleptim1), maximum = max(sleptim1), minimum = min(sleptim1), inter_quart = IQR(sleptim1))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 6
##   addepev2 avg_sleep median maximum minimum inter_quart
##   <fct>        <dbl>  <dbl>   <int>   <int>       <dbl>
## 1 Yes           6.92      7      24       1           2
## 2 No            7.06      7      24       1           2

The average sleep duration of respondents with depression is 6.92 hours, slightly lower than the 7.06 hours for respondents without depression, while the median (7) and interquartile range (2) are identical in both groups. The difference is small, so this sample shows at most a weak association between depression and sleep duration.
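
Whether a small gap in group means is larger than chance variation can be checked with a two-sample test. The sketch below applies base R's Welch t-test to made-up data (`toy` is hypothetical); it ignores the survey's sampling weights and goes beyond the exploratory analysis in this report, so treat it as an illustration only.

```r
# Hypothetical toy data standing in for brfss2013 (values are made up).
toy <- data.frame(
  addepev2 = factor(rep(c("Yes", "No"), each = 5), levels = c("Yes", "No")),
  sleptim1 = c(6, 7, 5, 8, 7, 7, 8, 6, 7, 8)
)

# Mean sleep duration in each group.
group_means <- tapply(toy$sleptim1, toy$addepev2, mean)

# Welch two-sample t-test (does not assume equal variances).
welch <- t.test(sleptim1 ~ addepev2, data = toy)

print(group_means)
print(welch$p.value)
```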

Plotting the above information gives a clearer picture:

# Keep only respondents with both depression and sleep data
brfss2013 <- brfss2013 %>%
  filter(!is.na(addepev2), !is.na(sleptim1))

# Horizontal box plots of sleep duration by depression status
ggplot(brfss2013, aes(addepev2, sleptim1)) +
  geom_boxplot(aes(fill = addepev2)) +
  coord_flip() +
  labs(x = "Depressed (Yes/No)", y = "Sleep Duration",
       title = "Association between Depression and Sleep Duration",
       fill = "Depressed") +
  scale_fill_manual(values = c("#E7B800", "#FC4E07"))

The box plot above compares the distribution of sleep duration for respondents with and without depression.
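
Each filter above overwrites brfss2013, so later questions only see rows that survived the earlier filters. A sketch of an alternative, assuming one prefers to keep the full data intact, is to store each filtered subset under its own name (`toy`, `alcohol_sleep` and `greens` are hypothetical names on made-up data):

```r
library(dplyr)

# Hypothetical toy data standing in for brfss2013 (values are made up).
toy <- data.frame(
  avedrnk2 = c(2, NA, 5, 1),
  sleptim1 = c(7, 6, NA, 8),
  fvgreen  = c(101, 203, 305, NA)
)

# Question-specific subsets; `toy` itself is never overwritten.
alcohol_sleep <- toy %>% filter(!is.na(avedrnk2), !is.na(sleptim1))
greens        <- toy %>% filter(!is.na(fvgreen))

print(nrow(toy))
print(nrow(alcohol_sleep))
print(nrow(greens))
```

With this pattern the vegetable subset keeps every row that has a vegetable value, regardless of whether the alcohol or sleep answers are missing.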

Research question 2:

# Keep only respondents with both vegetable and heart attack data
brfss2013 <- brfss2013 %>%
  filter(!is.na(fvgreen), !is.na(cvdinfr4))

# Box plots of dark green vegetable consumption by heart disease status
ggplot(brfss2013, aes(cvdinfr4, fvgreen)) +
  geom_boxplot(aes(fill = cvdinfr4)) +
  labs(x = "Heart Disease", y = "Times Dark Green Vegetables Eaten",
       title = "Dark Green Vegetables vs. Heart Disease",
       fill = "Heart Disease")

From the plot above, the overall median of fvgreen is roughly 210. Using this value as a threshold, we can check whether respondents at or above it (fvgreen >= 210) differ in heart disease status. The calculation follows.

brfss2013 %>%
  filter(fvgreen >= 210) %>%
  select(fvgreen, cvdinfr4) %>%
  group_by(cvdinfr4) %>%
  summarise(avg = mean(fvgreen), med = median(fvgreen), inter_quart = IQR(fvgreen), count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 5
##   cvdinfr4   avg   med inter_quart count
##   <fct>    <dbl> <dbl>       <dbl> <int>
## 1 Yes       311.   310          11  3948
## 2 No        312.   310          15 86747

The two groups have nearly identical averages (311 vs. 312) and the same median (310), and far fewer respondents report a heart attack (3,948) than do not (86,747). We therefore cannot draw any conclusion about the relationship between heart attack and dark green vegetables from this sample.
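
A caveat on the threshold: averages near 311 suggest that fvgreen may be a coded frequency rather than a raw count. The sketch below assumes the common BRFSS coding in which the leading digit gives the unit (1xx = times per day, 2xx = times per week, 3xx = times per month) and the last two digits give the number of times. That coding is an assumption here and should be checked against the 2013 codebook; `decode_per_week` is a hypothetical helper.

```r
# Convert an assumed BRFSS frequency code to times per week.
# ASSUMPTION: 1xx = per day, 2xx = per week, 3xx = per month.
# Any other code (including 0) is returned as NA in this sketch.
decode_per_week <- function(code) {
  unit  <- code %/% 100   # leading digit: 1 = day, 2 = week, 3 = month
  times <- code %% 100    # last two digits: number of times
  ifelse(unit == 1, times * 7,
  ifelse(unit == 2, times,
  ifelse(unit == 3, times * 7 / 30, NA_real_)))
}

# Made-up example codes.
codes    <- c(101, 203, 310, 0, NA)
per_week <- decode_per_week(codes)

print(per_week)
```

If the coding assumption holds, comparing decoded weekly frequencies would be more meaningful than comparing the raw codes against a cutoff of 210.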

Next, we would like to see the association between dark green vegetables and stroke.

brfss2013%>%
  filter(!is.na(cvdstrk3), fvgreen >= 210) %>%
  select(fvgreen,cvdstrk3) %>%
  group_by(cvdstrk3) %>%
  summarise(count = n(), avg = mean(fvgreen), med = median(fvgreen), std = sd(fvgreen))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 5
##   cvdstrk3 count   avg   med   std
##   <fct>    <int> <dbl> <dbl> <dbl>
## 1 Yes       2327  311.   308  11.1
## 2 No       88216  312.   310  12.2

As with heart attack, the averages (311 vs. 312) and medians (308 vs. 310) are nearly identical across the two groups, so we cannot draw any conclusion about stroke from this sample.

Research question 3:

plot(brfss2013$income2, brfss2013$hlthpln1,
     xlab = "Income Level", ylab = "Health Care Coverage",
     main = "Income Level vs. Health Care Coverage")

In general, higher-income respondents are more likely to have health care coverage than lower-income respondents.
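
The pattern in the plot can be expressed as numbers with a table of row proportions. The following is a sketch on made-up data (`toy` is hypothetical, with only two income levels for brevity); on the real data the same calls would use brfss2013$income2 and brfss2013$hlthpln1.

```r
# Hypothetical toy data standing in for brfss2013 (values are made up).
toy <- data.frame(
  income2  = factor(c("Low", "Low", "Low", "Low", "High", "High", "High", "High"),
                    levels = c("Low", "High")),
  hlthpln1 = factor(c("Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"),
                    levels = c("Yes", "No"))
)

# Counts of coverage status within each income level.
counts <- table(toy$income2, toy$hlthpln1)

# Share of each coverage status within an income level (rows sum to 1).
row_share <- prop.table(counts, margin = 1)

print(row_share)
```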

Conclusion :

One last note: when we analyze health survey data, we must be aware that self-reported prevalence may be biased, because respondents may not be aware of their risk status. To achieve more precise estimates, researchers therefore use laboratory tests alongside self-reported data.