In this project I briefly discuss the Behavioral Risk Factor Surveillance System (BRFSS) data set: how the observations in the sample were collected, and what the data collection method implies for generalizability and causality.
BRFSS is a telephone survey (landline and cell phone) conducted monthly across America with a standardized questionnaire. According to BRFSS, more than 500,000 interviews were conducted in 2011, a sample large enough for the law of large numbers to apply.
Furthermore, according to the “User Guide June 2013”, the methods BRFSS uses to reduce bias include, but are not limited to: varied questionnaire construction (to cover different lifestyles), varied phone interview times and 15 calling attempts (to avoid non-response and convenience bias), and data weighting.
Conclusion: because BRFSS uses random sampling, multistage sampling and multiple methods to reduce bias, results from the BRFSS data set can be generalized to the population it samples. However, BRFSS is an observational survey with no random assignment, so it can show associations between variables but cannot establish causality.
Research question 1: What is the relationship between tobacco use and education level, and does it differ between males and females? Variables: smokday2, educa, sex.
Research question 2: How many females have a college degree or higher compared to males? Variables: sex, educa.
Research question 3: What is the relationship between sleep and education level, and does it differ between males and females? Variables: sleptim1, educa, sex.
Research question 1:
# load the packages used below
library(dplyr)
library(ggplot2)
# create new data set 'smoke_edu' with the three variables I want
smoke_edu <- brfss2013 %>%
  select(smokday2, educa, sex)
# filter out NA, then count respondents by education level
smoke_edu %>%
  filter(!is.na(educa)) %>%
  group_by(educa) %>%
  summarise(count = n())
## # A tibble: 6 x 2
## educa count
## <fct> <int>
## 1 Never attended school or only kindergarten 677
## 2 Grades 1 through 8 (Elementary) 13395
## 3 Grades 9 though 11 (Some high school) 28141
## 4 Grade 12 or GED (High school graduate) 142971
## 5 College 1 year to 3 years (Some college or technical school) 134197
## 6 College 4 years or more (College graduate) 170120
# filter out NA, then count respondents by smoking frequency
smoke_edu %>%
  filter(!is.na(smokday2)) %>%
  group_by(smokday2) %>%
  summarise(count = n())
## # A tibble: 3 x 2
## smokday2 count
## <fct> <int>
## 1 Every day 55163
## 2 Some days 21494
## 3 Not at all 138135
# create new variable 'smoker': "No" for 'Not at all', "Yes" otherwise
smoke_edu <- smoke_edu %>%
  mutate(smoker = ifelse(smokday2 == 'Not at all', 'No', 'Yes'))
# count how many people are in each education level as table1
table1 <- smoke_edu %>%
  filter(!is.na(educa), !is.na(smokday2)) %>%
  group_by(educa) %>%
  summarise(count = n())
# count how many smokers are in each education level as table2
table2 <- smoke_edu %>%
  filter(smoker == "Yes", !is.na(educa), !is.na(smokday2)) %>%
  group_by(smoker, educa) %>%
  summarise(count = n())
# calculate the proportion of smokers in each education level
# (rows are matched by position, so both tables must list the same levels in the same order)
table1 <- table1 %>%
  mutate(percentage_of_smoker = table2$count / table1$count)
# check
table1 %>%
  select(educa, count, percentage_of_smoker)
## # A tibble: 6 x 3
## educa count percentage_of_smok~
## <fct> <int> <dbl>
## 1 Never attended school or only kindergarten 222 0.338
## 2 Grades 1 through 8 (Elementary) 5956 0.390
## 3 Grades 9 though 11 (Some high school) 16219 0.521
## 4 Grade 12 or GED (High school graduate) 70714 0.417
## 5 College 1 year to 3 years (Some college or technica~ 63166 0.370
## 6 College 4 years or more (College graduate) 58024 0.222
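The proportion above divides table2$count by table1$count by row position, which is only correct when both tables contain the same education levels in the same order. A minimal sketch of an alternative that computes the count and the proportion in a single grouped summary is shown here on a small made-up data frame; the `toy` data and the name `smoke_by_edu` are illustrative assumptions, not part of the original analysis.

```r
library(dplyr)

# Made-up rows for illustration only; these are not BRFSS values
toy <- data.frame(
  educa  = factor(c("HS", "HS", "HS", "College", "College", "College", "College")),
  smoker = c("Yes", "No", "Yes", "No", "No", "Yes", "No")
)

# One grouped summary: the count and the proportion are computed from the
# same rows, so they cannot get out of step with each other
smoke_by_edu <- toy %>%
  group_by(educa) %>%
  summarise(
    count = n(),
    proportion_of_smoker = mean(smoker == "Yes")
  )

smoke_by_edu
```

Applied to the real data, `toy` would be replaced by smoke_edu after filtering out NA values in educa and smokday2.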
# use ggplot to visualize the proportion of smokers by education level
p <- ggplot(table1, aes(x = educa, y = percentage_of_smoker)) + geom_point()
# shorten the long education labels on the x axis
p + scale_x_discrete(
  breaks = c("Never attended school or only kindergarten",
             "Grades 1 through 8 (Elementary)",
             "Grades 9 though 11 (Some high school)",
             "Grade 12 or GED (High school graduate)",
             "College 1 year to 3 years (Some college or technical school)",
             "College 4 years or more (College graduate)"),
  labels = c("Kin", "Ele", "<HS", "HS", "<CG", ">CG"))
As far as we can tell from the result, smoking habit and education level are somewhat related. Among respondents who have ever smoked, the proportion of current smokers is much higher for people without a four-year college degree than for college graduates (about 22%), and it peaks at about 52% for those with “some high school” education. Because biases may exist, further study is needed.
As we can see from the plot, there are more female smokers than male smokers; the difference is fewer than 10,000.
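The code that produced the comparison by sex is not shown above. One possible way to count smokers by sex is sketched below on made-up data; the `toy` values and the name `smoker_by_sex` are assumptions for illustration, not the original code.

```r
library(dplyr)

# Made-up rows for illustration only; these are not BRFSS values
toy <- data.frame(
  sex    = c("Male", "Female", "Female", "Male", "Female", "Male"),
  smoker = c("Yes", "Yes", "No", "No", "Yes", NA)
)

# Drop rows with unknown smoking status, keep smokers, then count them by sex
smoker_by_sex <- toy %>%
  filter(!is.na(smoker), smoker == "Yes") %>%
  count(sex, name = "smokers")

smoker_by_sex
```

In the real analysis, `toy` would be replaced by smoke_edu once the smoker variable has been created.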
Research question 2:
# select the data we want and save it into a new table 'sex_edu'
sex_edu <- brfss2013 %>%
  select(sex, educa)
# keep college graduates only and count them by sex
sex_edu <- sex_edu %>%
  filter(educa == 'College 4 years or more (College graduate)') %>%
  group_by(sex) %>%
  summarise(count = n())
# use ggplot to visualize the counts
ggplot(sex_edu, aes(x = sex, y = count)) + geom_bar(stat = 'identity')
As the plot above shows, more female than male respondents have a college degree or higher; the difference is approximately 22,568.
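Raw counts depend on how many respondents of each sex were interviewed, so a larger count does not by itself mean a larger share. A hedged sketch of comparing the share of college graduates within each sex, on made-up data (the `toy` values, the shortened education labels and the name `grad_share` are assumptions for illustration):

```r
library(dplyr)

# Made-up rows for illustration only; these are not BRFSS values
toy <- data.frame(
  sex   = c("Male", "Male", "Female", "Female", "Female", "Female"),
  educa = c("College graduate", "High school", "College graduate",
            "College graduate", "High school", "High school")
)

# Within each sex: number of graduates, number of respondents, and their ratio
grad_share <- toy %>%
  filter(!is.na(sex), !is.na(educa)) %>%
  group_by(sex) %>%
  summarise(
    graduates   = sum(educa == "College graduate"),
    respondents = n(),
    share       = graduates / respondents
  )

grad_share
```

In this toy example there are twice as many female graduates as male graduates, yet the share of graduates is the same for both sexes, which is why the two views can lead to different conclusions.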
Research question 3:
# select the data we want and save it as a new table 'sleep_edu'
sleep_edu <- brfss2013 %>%
  select(sleptim1, educa, sex)
# use the median sleep time as the benchmark for enough sleep
median(sleep_edu$sleptim1, na.rm = TRUE)
## [1] 7
# create new variable 'sleep_enough' ("Yes" only for more than 7 hours; exactly 7 hours is "No")
sleep_edu <- sleep_edu %>%
  mutate(sleep_enough = ifelse(sleptim1 > 7, "Yes", "No")) %>%
  filter(!is.na(sleep_enough), educa == 'College 4 years or more (College graduate)')
# use ggplot to visualize data
sleep_edu$sleep_enough <- as.factor(sleep_edu$sleep_enough)
sleep_edu$sex <- as.factor(sleep_edu$sex)
ggplot(sleep_edu, aes(x = sleep_enough, fill = sex)) +
  geom_bar()
Based on the plot, among people with a college degree or higher, more sleep 7 hours or fewer than sleep more than 7 hours. From a gender perspective, more females than males have a college degree or higher, regardless of sleep time.
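Because the cutoff in the analysis above is the median (7 hours) and the comparison is strict, respondents who sleep exactly 7 hours are placed in the "No" group. A small sketch on made-up sleep times shows how the choice between `>` and `>=` shifts the split; the `hours` values and variable names are assumptions for illustration.

```r
# Made-up sleep times in hours, for illustration only
hours <- c(5, 6, 7, 7, 7, 8, 9, NA)

# Median of the non-missing values, used as the cutoff
cutoff <- median(hours, na.rm = TRUE)

# Strict comparison, as in the analysis above: exactly 7 hours counts as "No"
strict <- ifelse(hours > cutoff, "Yes", "No")

# Inclusive comparison: exactly 7 hours counts as "Yes"
inclusive <- ifelse(hours >= cutoff, "Yes", "No")

# table() drops the NA entry by default
table(strict)
table(inclusive)
```

When many respondents report exactly the median value, the two definitions can produce noticeably different group sizes, so the plot should be read with the chosen definition in mind.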