The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US), participating US territories, and the Centers for Disease Control and Prevention (CDC). Since 2011, BRFSS has conducted both landline telephone- and cellular telephone-based surveys.
BRFSS uses a disproportionate stratified sample design for the landline telephone survey, and interviewers collect data from a randomly selected adult in each household. For the cellular telephone survey, the sample is randomly generated from a sampling frame of confirmed cellular area code and prefix combinations, and interviewers collect data from an adult respondent who participates by using a cellular telephone and resides in a private residence or college housing.
The dataset is large: 491,775 respondents from all of the states in the United States and participating US territories. Because respondents were selected by random sampling, the results can be generalized to the adult population of the United States.
The BRFSS is an observational study based on random sampling, conducted through landline and cellular telephone surveys. There is no random assignment in the survey, and without random assignment to experimental groups, causality cannot be established.
Research question 1: What is the relationship between educational level and income level? I am interested in whether respondents with a higher level of education report a higher income.
Research question 2: How is general health impacted by hours of sleep based on gender? I am interested in whether the number of hours a person sleeps is related to their general health, and whether hours slept differ between genders.
Research question 3: How do time spent on physical exercise and general health status impact life satisfaction? I am interested in whether more physical exercise goes with better general health and higher life satisfaction.
Research question 1: What is the relationship between educational level and income level?
First, we check whether either variable contains NA or unknown values.
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 2
## educa count
## <fct> <int>
## 1 Never attended school or only kindergarten 677
## 2 Grades 1 through 8 (Elementary) 13395
## 3 Grades 9 though 11 (Some high school) 28141
## 4 Grade 12 or GED (High school graduate) 142971
## 5 College 1 year to 3 years (Some college or technical school) 134197
## 6 College 4 years or more (College graduate) 170120
## 7 <NA> 2274
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 9 x 2
## income2 count
## <fct> <int>
## 1 Less than $10,000 25441
## 2 Less than $15,000 26794
## 3 Less than $20,000 34873
## 4 Less than $25,000 41732
## 5 Less than $35,000 48867
## 6 Less than $50,000 61509
## 7 Less than $75,000 65231
## 8 $75,000 or more 115902
## 9 <NA> 71426
Both variables contain NA values, so we need to remove them before the analysis.
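As a rough illustration of this check-and-clean step, the sketch below tallies missing values and then removes them. It runs on a small made-up data frame (`toy` and its values are hypothetical stand-ins for the two `brfss2013` columns), and it assumes the `dplyr` package is installed. It also shows why filtering with `is.na()` is a clearer way to drop missing values than comparing against the string "NA".

```r
library(dplyr)

# Hypothetical toy data standing in for two brfss2013 columns.
toy <- data.frame(
  educa   = factor(c("College", "High school", NA, "College")),
  income2 = factor(c("$75,000 or more", NA, "Less than $10,000", "Less than $50,000"))
)

# Tally every level of educa; missing values get their own <NA> row.
educa_counts <- toy %>% count(educa, name = "count")
print(educa_counts)

# Comparing a missing value with the string "NA" yields NA, not TRUE or FALSE.
# filter() removes those rows only because it discards rows whose condition is NA.
implicit <- toy %>% filter(educa != "NA", income2 != "NA")

# is.na() states the intent directly and keeps the same rows.
explicit <- toy %>% filter(!is.na(educa), !is.na(income2))

print(explicit)
```

Both filters keep the same rows here, but the `is.na()` form does not rely on how comparisons with missing values happen to behave.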
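q1 <- brfss2013 %>%
  filter(!is.na(educa), !is.na(income2)) %>%  # keep respondents who answered both questions
  group_by(educa, income2) %>%
  summarise(count = n())                      # respondents per education/income combination
## `summarise()` regrouping output by 'educa' (override with `.groups` argument)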
## # A tibble: 48 x 3
## # Groups: educa [6]
## educa income2 count
## <fct> <fct> <int>
## 1 Never attended school or only kindergarten Less than $10,000 121
## 2 Never attended school or only kindergarten Less than $15,000 73
## 3 Never attended school or only kindergarten Less than $20,000 78
## 4 Never attended school or only kindergarten Less than $25,000 50
## 5 Never attended school or only kindergarten Less than $35,000 52
## 6 Never attended school or only kindergarten Less than $50,000 31
## 7 Never attended school or only kindergarten Less than $75,000 23
## 8 Never attended school or only kindergarten $75,000 or more 28
## 9 Grades 1 through 8 (Elementary) Less than $10,000 2600
## 10 Grades 1 through 8 (Elementary) Less than $15,000 1993
## # ... with 38 more rows
ggplot(data = q1) +
  geom_count(aes(x = income2, y = educa, size = count)) +
  ggtitle("Educational Level vs Income Level") +
  xlab("Income Level") +
  ylab("Educational Level") +
  theme(axis.title.x = element_text(size = 13),
        axis.title.y = element_text(size = 13),
        plot.title = element_text(size = 15),
        axis.text.x = element_text(size = 8, angle = 90),
        axis.text.y = element_text(size = 8))
The plot above shows an association between educational level and income level. We cannot establish causality from the survey data because there was no random assignment, only random sampling. However, the plot suggests that a higher educational level goes with a higher income: respondents with a college degree most often report an income of $75,000 or more. Overall, income level tends to increase as educational level increases.
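Raw counts are strongly influenced by how many respondents fall into each educational level, so the share of each income bracket within a level can be easier to compare. The sketch below is only an illustration: it uses the eight rows printed above for the first educational level, and the names `never_attended` and `share` are introduced here for the example. It assumes the `dplyr` package is installed.

```r
library(dplyr)

# The eight q1 rows printed above for the first educational level.
never_attended <- data.frame(
  educa   = "Never attended school or only kindergarten",
  income2 = c("Less than $10,000", "Less than $15,000", "Less than $20,000",
              "Less than $25,000", "Less than $35,000", "Less than $50,000",
              "Less than $75,000", "$75,000 or more"),
  count   = c(121, 73, 78, 50, 52, 31, 23, 28)
)

# Share of each income bracket within the educational level.
shares <- never_attended %>%
  group_by(educa) %>%
  mutate(share = count / sum(count)) %>%
  ungroup()

print(shares)
```

Applied to the full `q1` table, the same `group_by(educa)` and `mutate()` steps would give the income distribution within every educational level.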
Research question 2: How is general health impacted by hours of sleep based on gender?
q2 <- brfss2013 %>%
  filter(!is.na(genhlth), !is.na(sex), !is.na(sleptim1)) %>%  # drop missing answers
  group_by(genhlth, sex, sleptim1) %>%
  summarise(count = n())                                      # respondents per combination
## `summarise()` regrouping output by 'genhlth', 'sex' (override with `.groups` argument)
## # A tibble: 217 x 4
## # Groups: genhlth, sex [10]
## genhlth sex sleptim1 count
## <fct> <fct> <int> <int>
## 1 Excellent Male 1 13
## 2 Excellent Male 2 40
## 3 Excellent Male 3 121
## 4 Excellent Male 4 607
## 5 Excellent Male 5 1722
## 6 Excellent Male 6 7034
## 7 Excellent Male 7 12178
## 8 Excellent Male 8 11222
## 9 Excellent Male 9 1687
## 10 Excellent Male 10 633
## # ... with 207 more rows
ggplot(data = q2) +
  geom_boxplot(aes(x = genhlth, y = sleptim1, colour = sex)) +
  ggtitle("General Health by Hours of Sleep") +
  theme(axis.title.x = element_text(size = 13),
        axis.title.y = element_text(size = 13),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 10),
        plot.title = element_text(size = 16)) +
  xlab("General Health") +
  ylab("Number of Hours of Sleep")
The plot above shows little difference in hours of sleep across genders or general health categories. Hours slept increase slightly for poor health, but not by much. In general, the IQR runs from 6 to 16.5 hours across the board. This suggests that there is no clear correlation between general health and hours of sleep for either gender. We can also see that hours slept do not differ much between genders: females sleep slightly more, but the difference is not meaningful.
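Because `q2` holds one row per combination of health status, gender, and sleep duration, a boxplot drawn from it gives every distinct duration the same weight, whatever its count. The sketch below shows one way to take the counts into account. It uses only the ten rows printed above for males in excellent health (the rest of that group is truncated in the output), and the names `excellent_male`, `unweighted`, and `weighted` are introduced here for illustration.

```r
# First ten rows of q2 as printed above (Excellent health, Male).
# Later rows of this group are not shown in the output, so this is a partial illustration.
excellent_male <- data.frame(
  sleptim1 = 1:10,
  count    = c(13, 40, 121, 607, 1722, 7034, 12178, 11222, 1687, 633)
)

probs <- c(0.25, 0.5, 0.75)

# Quartiles of the distinct durations only: every duration counts once.
unweighted <- quantile(excellent_male$sleptim1, probs)

# Quartiles over respondents: repeat each duration as often as it was reported.
weighted <- quantile(rep(excellent_male$sleptim1, times = excellent_male$count), probs)

print(unweighted)
print(weighted)
```

On these ten rows the respondent-level quartiles fall at 6, 7, and 8 hours, a much narrower range than the quartiles of the distinct values. Drawing the boxplot from the respondent-level data rather than from the count table would reflect this.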
Research question 3: How do time spent on physical exercise and general health status impact life satisfaction?
q3 <- brfss2013 %>%
  filter(!is.na(exerhmm1), !is.na(genhlth), !is.na(lsatisfy)) %>%  # drop missing answers
  group_by(exerhmm1, lsatisfy, genhlth) %>%
  summarise(count = n())                                           # respondents per combination
## `summarise()` regrouping output by 'exerhmm1', 'lsatisfy' (override with `.groups` argument)
## # A tibble: 381 x 4
## # Groups: exerhmm1, lsatisfy [140]
## exerhmm1 lsatisfy genhlth count
## <int> <fct> <fct> <int>
## 1 1 Very satisfied Very good 2
## 2 1 Satisfied Good 1
## 3 2 Very satisfied Very good 2
## 4 2 Very satisfied Good 1
## 5 2 Satisfied Very good 1
## 6 2 Satisfied Good 2
## 7 2 Satisfied Fair 2
## 8 2 Dissatisfied Very good 1
## 9 3 Very satisfied Excellent 1
## 10 3 Very satisfied Fair 1
## # ... with 371 more rows
ggplot(data = q3) +
  geom_col(aes(x = genhlth, y = exerhmm1, fill = lsatisfy)) +
  ggtitle("Physical Exercise and Quality of Life") +
  xlab("General Health") +
  ylab("Time of Physical Exercise") +
  scale_fill_discrete(name = "Life Satisfaction") +
  theme(axis.title.x = element_text(size = 13),
        axis.title.y = element_text(size = 13),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 10),
        plot.title = element_text(size = 16))
Looking at the plot above, we can see how time spent on physical exercise and general health status relate to life satisfaction. People who exercise little and are in poor health appear more prone to low life satisfaction, while people in excellent health are most likely to be satisfied or very satisfied.
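Since `q3` is also a table of counts, any summary of exercise time per health category should weight each row by its `count`. The sketch below illustrates this with a count-weighted mean. It uses only the ten rows printed above (the full table has 381 rows), takes `exerhmm1` exactly as recorded without converting units, and the names `q3_head`, `by_health`, `respondents`, and `mean_exer` are introduced here for the example. It assumes the `dplyr` package is installed.

```r
library(dplyr)

# First ten rows of q3 as printed above; the full table has 381 rows.
q3_head <- data.frame(
  exerhmm1 = c(1, 1, 2, 2, 2, 2, 2, 2, 3, 3),
  lsatisfy = c("Very satisfied", "Satisfied", "Very satisfied", "Very satisfied",
               "Satisfied", "Satisfied", "Satisfied", "Dissatisfied",
               "Very satisfied", "Very satisfied"),
  genhlth  = c("Very good", "Good", "Very good", "Good", "Very good",
               "Good", "Fair", "Very good", "Excellent", "Fair"),
  count    = c(2, 1, 2, 1, 1, 2, 2, 1, 1, 1)
)

# Count-weighted mean of the recorded exercise value per general health category.
by_health <- q3_head %>%
  group_by(genhlth) %>%
  summarise(
    respondents = sum(count),
    mean_exer   = weighted.mean(exerhmm1, w = count),
    .groups     = "drop"
  )

print(by_health)
```

The same pattern, grouped by `genhlth` and `lsatisfy` together, would give a per-group summary that can be plotted directly instead of stacking the raw `exerhmm1` values.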