Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

Part 1: Data

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US), participating US territories, and the Centers for Disease Control and Prevention (CDC). Since 2011, BRFSS has conducted both landline telephone- and cellular telephone-based surveys.

The BRFSS uses a disproportionate stratified sample design for the landline telephone survey, and interviewers collect data from a randomly selected adult in each household. For the cellular telephone survey, the sample is randomly generated from a sampling frame of confirmed cellular area code and prefix combinations, and interviewers collect data from an adult respondent who participates by using a cellular telephone and resides in a private residence or college housing.
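To make "disproportionate stratified sampling" concrete, here is a minimal sketch in base R. The sampling frame, the stratum names, and the sample sizes are all made up for illustration and are not taken from the BRFSS design documents. The point is only that each stratum gets its own sample size, so sampled units carry different design weights.

```r
set.seed(1)

# Hypothetical sampling frame: two strata of very different sizes.
frame <- data.frame(
  id      = 1:1000,
  stratum = rep(c("high_density", "low_density"), times = c(800, 200))
)

# Disproportionate allocation: equal sample sizes despite unequal strata.
n_per_stratum <- c(high_density = 50, low_density = 50)

sampled <- do.call(rbind, lapply(names(n_per_stratum), function(s) {
  rows   <- frame[frame$stratum == s, ]
  picked <- rows[sample(nrow(rows), n_per_stratum[[s]]), ]
  # Design weight: number of frame units each sampled unit represents.
  picked$weight <- nrow(rows) / n_per_stratum[[s]]
  picked
}))

table(sampled$stratum)
tapply(sampled$weight, sampled$stratum, unique)
```

In this toy setup the weights add up to the size of the frame, which is why weighted estimates can still describe the whole population even though the strata are sampled at different rates.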

The data set is large: 491,775 respondents, collected across all of the states in the United States and participating US territories. Because respondents were selected by random sampling, the results can be generalized to the adult population of the United States.

The BRFSS is an observational study based on random sampling through landline and cellular telephone surveys; there is no random assignment in the survey. Without random assignment to experimental groups, causality cannot be established.


Part 2: Research questions

Research question 1: What is the relationship between educational level and income level? I am interested in whether respondents with more education have higher incomes.

Research question 2: How is general health impacted by hours of sleep based on gender? I am interested in whether the number of hours a person sleeps affects their general health, and whether hours slept differ between genders.

Research question 3: How do time spent on physical exercise and general health status affect life satisfaction? I am interested in whether more physical exercise goes together with better general health and higher life satisfaction.


Part 3: Exploratory data analysis

Research question 1: What is the relationship between educational level and income level?

First we check whether either variable contains NA or unknown values.

# Count respondents per education level (NA shows up as its own row)
brfss2013 %>%
  group_by(educa) %>%
  summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 2
##   educa                                                         count
##   <fct>                                                         <int>
## 1 Never attended school or only kindergarten                      677
## 2 Grades 1 through 8 (Elementary)                               13395
## 3 Grades 9 though 11 (Some high school)                         28141
## 4 Grade 12 or GED (High school graduate)                       142971
## 5 College 1 year to 3 years (Some college or technical school) 134197
## 6 College 4 years or more (College graduate)                   170120
## 7 <NA>                                                           2274
# Count respondents per income bracket (NA shows up as its own row)
brfss2013 %>%
  group_by(income2) %>%
  summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 9 x 2
##   income2            count
##   <fct>              <int>
## 1 Less than $10,000  25441
## 2 Less than $15,000  26794
## 3 Less than $20,000  34873
## 4 Less than $25,000  41732
## 5 Less than $35,000  48867
## 6 Less than $50,000  61509
## 7 Less than $75,000  65231
## 8 $75,000 or more   115902
## 9 <NA>               71426

Both variables contain NA values, which we need to remove before the analysis.
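The same check can be done directly, without reading the NA row off each grouped table. The sketch below uses a small made-up data frame; only the column names `educa` and `income2` mirror the real data, and the values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; only the column names mirror brfss2013.
toy <- data.frame(
  educa   = factor(c("College graduate", NA, "High school graduate", "College graduate")),
  income2 = factor(c("$75,000 or more", "Less than $50,000", NA, NA))
)

# Count missing values in each column of interest.
na_counts <- toy %>%
  summarise(across(c(educa, income2), ~ sum(is.na(.x))))

# Keep only rows where both variables are observed.
complete <- toy %>%
  filter(!is.na(educa), !is.na(income2))

na_counts
nrow(complete)
```

Using `is.na()` states the intent explicitly, whereas a comparison such as `educa != "NA"` only drops missing rows as a side effect of how `filter()` treats NA results.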

# Drop rows with missing education or income, then count each combination
q1 <- brfss2013 %>%
  group_by(educa, income2) %>%
  filter(!is.na(educa), !is.na(income2)) %>%
  summarise(count = n())
## `summarise()` regrouping output by 'educa' (override with `.groups` argument)
q1
## # A tibble: 48 x 3
## # Groups:   educa [6]
##    educa                                      income2           count
##    <fct>                                      <fct>             <int>
##  1 Never attended school or only kindergarten Less than $10,000   121
##  2 Never attended school or only kindergarten Less than $15,000    73
##  3 Never attended school or only kindergarten Less than $20,000    78
##  4 Never attended school or only kindergarten Less than $25,000    50
##  5 Never attended school or only kindergarten Less than $35,000    52
##  6 Never attended school or only kindergarten Less than $50,000    31
##  7 Never attended school or only kindergarten Less than $75,000    23
##  8 Never attended school or only kindergarten $75,000 or more      28
##  9 Grades 1 through 8 (Elementary)            Less than $10,000  2600
## 10 Grades 1 through 8 (Elementary)            Less than $15,000  1993
## # ... with 38 more rows
ggplot(data = q1) +
  geom_count(aes(x = income2, y = educa, size = count)) +
  ggtitle("Educational Level vs Income Level") +
  xlab("Income Level") +
  ylab("Educational Level") +
  theme(
    axis.title.x = element_text(size = 13),
    axis.title.y = element_text(size = 13),
    plot.title   = element_text(size = 15),
    axis.text.x  = element_text(size = 8, angle = 90),
    axis.text.y  = element_text(size = 8)
  )

The plot above shows an association between educational level and income level. We cannot establish causality from the survey data, because there was random sampling but no random assignment. However, the plot shows that higher educational levels go together with higher incomes. College graduates most often report an income of $75,000 or more. We conclude that income level tends to increase as educational level increases.
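Raw counts also reflect how many respondents fall into each education level, so it can help to look at the share of each income bracket within an education level. A minimal sketch, assuming a table shaped like `q1`; the `toy_q1` values are invented for illustration and are not survey results.

```r
library(dplyr)

# Hypothetical counts in the same shape as q1 (educa, income2, count).
toy_q1 <- data.frame(
  educa   = c("High school", "High school", "College", "College"),
  income2 = c("Less than $50,000", "$75,000 or more",
              "Less than $50,000", "$75,000 or more"),
  count   = c(60, 40, 25, 75)
)

# Share of each income bracket within an education level.
shares <- toy_q1 %>%
  group_by(educa) %>%
  mutate(prop = count / sum(count)) %>%
  ungroup()

shares
```

Within each education level the `prop` values sum to 1, so levels with very different numbers of respondents become directly comparable.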

Research question 2: How is general health impacted by hours of sleep based on gender?

# Drop rows with missing values, then count each health/sex/sleep combination
q2 <- brfss2013 %>%
  group_by(genhlth, sex, sleptim1) %>%
  filter(!is.na(genhlth), !is.na(sex), !is.na(sleptim1)) %>%
  summarise(count = n())
## `summarise()` regrouping output by 'genhlth', 'sex' (override with `.groups` argument)
q2
## # A tibble: 217 x 4
## # Groups:   genhlth, sex [10]
##    genhlth   sex   sleptim1 count
##    <fct>     <fct>    <int> <int>
##  1 Excellent Male         1    13
##  2 Excellent Male         2    40
##  3 Excellent Male         3   121
##  4 Excellent Male         4   607
##  5 Excellent Male         5  1722
##  6 Excellent Male         6  7034
##  7 Excellent Male         7 12178
##  8 Excellent Male         8 11222
##  9 Excellent Male         9  1687
## 10 Excellent Male        10   633
## # ... with 207 more rows
ggplot(data = q2) +
  geom_boxplot(aes(x = genhlth, y = sleptim1, colour = sex)) +
  ggtitle("General Health based on Hours Sleep") +
  xlab("General Health") +
  ylab("Numbers of Hours Sleep") +
  theme(
    axis.title.x = element_text(size = 13),
    axis.title.y = element_text(size = 13),
    axis.text.x  = element_text(size = 10),
    axis.text.y  = element_text(size = 10),
    plot.title   = element_text(size = 16)
  )

The plot above shows little difference in hours of sleep across genders or general health categories. Hours slept are slightly higher for poor health, but not by much. In general, the boxes span roughly 6 to 16.5 hours across the board. This suggests that there is no clear correlation between general health and hours of sleep for either gender. We can also see that hours slept do not differ much between genders: females appear to sleep slightly more, but the difference is not meaningful.
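One caveat when reading the boxes: `q2` holds one row per combination of `genhlth`, `sex`, and `sleptim1`, so a summary computed over its rows sees each distinct hour value once, regardless of `count`. The sketch below contrasts quartiles over distinct values with quartiles over respondents for a single group. The `toy_q2` numbers are invented to show the mechanism and are not survey results.

```r
# Hypothetical counts in the same shape as q2, for a single group.
toy_q2 <- data.frame(
  sleptim1 = c(4, 7, 8, 16, 24),
  count    = c(5, 50, 40, 4, 1)
)

probs <- c(0.25, 0.5, 0.75)

# Quartiles over the distinct hour values (each summary row counted once).
unweighted <- quantile(toy_q2$sleptim1, probs)

# Quartiles over respondents: repeat each hour value `count` times first.
per_respondent <- quantile(rep(toy_q2$sleptim1, toy_q2$count), probs)

unweighted
per_respondent
```

In this toy example the rare long sleep durations stretch the unweighted upper quartile, while the per-respondent quartiles stay close to the hours most people report. Plotting the unsummarised rows would be one way to get per-respondent boxes.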

Research question 3: How do time spent on physical exercise and general health affect life satisfaction?

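# Drop rows with missing values, then count each exercise/satisfaction/health combination
q3 <- brfss2013 %>%
  group_by(exerhmm1, lsatisfy, genhlth) %>%
  filter(!is.na(exerhmm1), !is.na(genhlth), !is.na(lsatisfy)) %>%
  summarise(count = n())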
## `summarise()` regrouping output by 'exerhmm1', 'lsatisfy' (override with `.groups` argument)
q3
## # A tibble: 381 x 4
## # Groups:   exerhmm1, lsatisfy [140]
##    exerhmm1 lsatisfy       genhlth   count
##       <int> <fct>          <fct>     <int>
##  1        1 Very satisfied Very good     2
##  2        1 Satisfied      Good          1
##  3        2 Very satisfied Very good     2
##  4        2 Very satisfied Good          1
##  5        2 Satisfied      Very good     1
##  6        2 Satisfied      Good          2
##  7        2 Satisfied      Fair          2
##  8        2 Dissatisfied   Very good     1
##  9        3 Very satisfied Excellent     1
## 10        3 Very satisfied Fair          1
## # ... with 371 more rows
ggplot(data = q3) +
  geom_col(aes(x = genhlth, y = exerhmm1, fill = lsatisfy)) +
  ggtitle("Physical Exercise and Quality of Life") +
  xlab("General Health") +
  ylab("Time of Physical Exercise") +
  scale_fill_discrete(name = "Life Satisfaction") +
  theme(
    axis.title.x = element_text(size = 13),
    axis.title.y = element_text(size = 13),
    axis.text.x  = element_text(size = 10),
    axis.text.y  = element_text(size = 10),
    plot.title   = element_text(size = 16)
  )

The plot above shows how time spent on physical exercise and general health status relate to life satisfaction. People who exercise little and are in poor health appear prone to low life satisfaction, while people in excellent health are most likely to be satisfied or very satisfied.
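A numeric companion to the plot is the average exercise time per health category, weighted by how many respondents share each value. This is a minimal sketch on invented data shaped like `q3`. It treats `exerhmm1` as a plain numeric duration, which is an assumption about how the variable is coded.

```r
library(dplyr)

# Hypothetical counts in the same shape as q3 (values are not survey results).
toy_q3 <- data.frame(
  genhlth  = c("Excellent", "Excellent", "Poor", "Poor"),
  exerhmm1 = c(30, 60, 10, 30),
  count    = c(1, 3, 3, 1)
)

# Count-weighted mean exercise time per general health category.
mean_exercise <- toy_q3 %>%
  group_by(genhlth) %>%
  summarise(mean_time = weighted.mean(exerhmm1, count), .groups = "drop")

mean_exercise
```

Weighting by `count` matters because each row of the summary table stands for a different number of respondents.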