Statistical inference with the GSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("gss.Rdata")

Part 1: About the Data

The General Social Survey (GSS) is a nationally representative survey of adults in the United States conducted since 1972. The GSS collects data on contemporary American society in order to monitor and explain trends in opinions, attitudes and behaviors. The GSS has adopted questions from earlier surveys which allows researchers to conduct comparisons for up to 80 years.

The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.

The GSS aims to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.

Collection of Data

The GSS is a personal interview survey and collects information on a wide range of demographic characteristics of respondents and their parents (https://en.wikipedia.org/wiki/General_Social_Survey). Since 1994, it has been conducted every other year. The target population is adults (18+) living in households in the United States. Respondents are randomly sampled from a mix of urban, suburban, and rural geographic areas. Participation in the study is strictly voluntary.

Generalizability

The General Social Survey (GSS) is an area-probability sample that uses the NORC National Sampling Frame for an equal-probability multi-stage cluster sample of housing units for the entire United States. Because the GSS sample is a random cluster sample, we can generalize results from the sample to the population of US adults and report associations between variables. As stated above, participation in the study is strictly voluntary, so there may be some bias due to non-response and voluntary response.

Causality

Causality cannot be inferred from the sample because the GSS is an observational study, not an experiment with random assignment.

Part 2: Research question

Research Question 1: During the 2008-09 financial crisis many people lost their homes, jobs, money, etc. Gallup polling (https://news.gallup.com/poll/171995/confidence-banks-remains-low.aspx) also shows that Americans have lost confidence in banks. We want to find out what proportion of Americans had confidence in banks and financial institutions during the 2008-09 financial crisis, and whether people’s perception was still affected by the crisis in 2012. In other words, was there any change from 2008 to 2012?

Research Question 2: Job satisfaction and job security are both thought to be related to age. The following link (https://smallbusiness.com/employees/tech-worker-ageism/) states that 43% of tech workers fear that they will lose their job due to age (40+). There is also a law against age discrimination, namely “The Age Discrimination in Employment Act (ADEA)”. Our next question is whether people aged 40+ are more likely to expect to lose their job than people aged 18-39.

Research Question 3: According to Wikipedia (https://en.wikipedia.org/wiki/Mid-Atlantic_(United_States)), the Mid-Atlantic is the wealthiest region, so we want to estimate the median income of Americans in the Mid-Atlantic region based on the data given to us.

Part 3: Exploratory data analysis

Research Question 1:

Before starting the analysis we store the data we need. Filtering on the two confinan levels of interest also drops every row where confinan is NA, which matters because leaving NA values in would make our analysis incorrect.

conf_bank <- gss%>%
  filter(year == "2008" | year == "2012", confinan == "Hardly Any" | confinan == "A Great Deal")

Now we will run the EDA for 2008 and 2012 separately, looking only at the extreme opinions. That is, we keep only two levels of confinan, “Hardly Any” and “A Great Deal” (of confidence), because we want to compare the people who had mostly or totally lost confidence in banks with those who still had a great deal of it.

conf_bnk_2008 <- conf_bank%>%
  filter(year == "2008") %>%
  group_by(confinan) %>%
  summarise(count = n()) %>%
  mutate(percent_conf_2008 =  count/sum(count)*100)

## `summarise()` ungrouping output (override with `.groups` argument)

conf_bnk_2008

## # A tibble: 2 x 3
##   confinan     count percent_conf_2008
##   <fct>        <int>             <dbl>
## 1 A Great Deal   263              48.3
## 2 Hardly Any     281              51.7

The chart below shows the same summary:

ggplot(conf_bnk_2008, aes(x = confinan, y= percent_conf_2008)) +
  geom_col(fill = c("black", "green")) + labs(x = "Confidence in Banks & FI", y= "Share of people (%)", title = "People's Confidence in Banks & FI in 2008")

This column chart shows the percentage of people who selected each option about their confidence in banks and financial institutions. From the summary statistics and the chart we can see that in 2008, 48.3% of these respondents still had a great deal of confidence in banks and 51.7% had hardly any.

Now we do the same for 2012

conf_bank_2012 <- conf_bank %>%
  filter(confinan == "Hardly Any" | confinan == "A Great Deal", year == "2012") %>%
  group_by(confinan)%>%
  summarise(count= n()) %>%
  mutate(percent_conf_2012 =  count/sum(count)*100)

## `summarise()` ungrouping output (override with `.groups` argument)

conf_bank_2012

## # A tibble: 2 x 3
##   confinan     count percent_conf_2012
##   <fct>        <int>             <dbl>
## 1 A Great Deal   149              23.0
## 2 Hardly Any     498              77.0

The share of people choosing “Hardly Any” has increased to 77.0%, whereas the share choosing “A Great Deal” has declined to 23.0%. Let's get a clearer picture of these changes by plotting the statistics

ggplot(conf_bank_2012, aes(x = confinan, y= percent_conf_2012)) +
  geom_col(fill = c("blue", "red")) + labs(x = "Confidence in Banks & FI", y= "Share of people (%)", title = "People's Confidence in Banks & FI in 2012")

The graph shows the same thing. However, comparing it with the 2008 graph is not enough on its own to decide whether people’s perception was still affected in 2012; for that we need a formal test.
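To put the two years side by side, the counts from the two summary tables can be combined in one small data frame. This is only a sketch: the counts are typed in by hand from the output above rather than recomputed from gss, and the names conf_counts, hardly and change_pp are made up for this illustration.

```r
# Counts copied by hand from the 2008 and 2012 summary tables above
conf_counts <- data.frame(
  year     = c(2008, 2008, 2012, 2012),
  confinan = c("A Great Deal", "Hardly Any", "A Great Deal", "Hardly Any"),
  count    = c(263, 281, 149, 498)
)

# Percentage of each response within its own survey year
conf_counts$percent <- with(conf_counts, 100 * count / ave(count, year, FUN = sum))

# Change in the "Hardly Any" share from 2008 to 2012, in percentage points
hardly <- conf_counts[conf_counts$confinan == "Hardly Any", ]
change_pp <- diff(hardly$percent)

print(conf_counts)
print(round(change_pp, 1))
```

The “Hardly Any” share rises by roughly 25 percentage points between the two years, which is the difference that is tested formally in Part 4.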

Research Question 2:

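First we will create a vector storing the age categories: people up to age 39 are categorized as “18-39” and people aged 40 or older are categorized as “40+”. Note that with the breaks used below, cut() builds intervals that are open on the left, (18, 39] and (39, 89], so respondents aged exactly 18 get NA and are dropped in the next step; the “18-39” group therefore effectively covers ages 19 to 39.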

age_category <- cut(gss$age, c(18,39,89), labels = c("18-39", "40+"))
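The interval behavior of cut() can be checked on a few boundary ages. The sketch below uses a made-up vector of ages, not the gss data, and shows the include.lowest = TRUE argument as one possible way to keep 18-year-olds in the first group. Using it in the analysis would change the counts reported in this section.

```r
# Hypothetical boundary ages, for illustration only
ages <- c(18, 19, 39, 40, 89)

# Same call as in the analysis: intervals are (18, 39] and (39, 89]
default_cut <- cut(ages, c(18, 39, 89), labels = c("18-39", "40+"))

# Variant that closes the first interval on the left: [18, 39] and (39, 89]
inclusive_cut <- cut(ages, c(18, 39, 89), labels = c("18-39", "40+"),
                     include.lowest = TRUE)

print(data.frame(ages, default_cut, inclusive_cut))
```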

After creating this vector we add it to the gss data, store the result as age_JL and, to reflect recent trends, keep only the data from 2006 onward

age_JL <- cbind(gss, age_category)

age_JL <- age_JL%>%
  # compare year as a number; a quoted "2006" would trigger a character comparison
  filter(!is.na(age_category),!is.na(joblose), year >= 2006)

Now calculating summary statistics

age_JL_stat1 <- age_JL%>%
  filter(age_category == "18-39") %>%
  group_by(joblose) %>%
   summarise(count= n()) %>%
  mutate(percent =  count/sum(count)*100)

## `summarise()` ungrouping output (override with `.groups` argument)

age_JL_stat1

## # A tibble: 4 x 3
##   joblose        count percent
##   <fct>          <int>   <dbl>
## 1 Very Likely       97    6.33
## 2 Fairly Likely    105    6.85
## 3 Not Too Likely   459   30.0 
## 4 Not Likely       871   56.9

Let's look at these statistics in a graph for a clearer picture:

ggplot(age_JL_stat1, aes(x = joblose, y= percent, fill = joblose)) +
  geom_col() + labs(x = "Chance of losing job", y= "Share of respondents (%)", title = "Ages 18-39: views on their chance of losing their job")

From this graph and the summary statistics above, 56.9% of the people in the 18-39 group think they are not likely to lose their job, whereas only 6.3% say that losing their job is very likely.

Similarly now we calculate for the people with the age of 40+

age_JL_stat2 <- age_JL%>%
  filter(age_category == "40+") %>%
  group_by(joblose) %>%
   summarise(count= n()) %>%
  mutate(percent =  count/sum(count)*100)

## `summarise()` ungrouping output (override with `.groups` argument)

age_JL_stat2

## # A tibble: 4 x 3
##   joblose        count percent
##   <fct>          <int>   <dbl>
## 1 Very Likely       98    4.77
## 2 Fairly Likely    137    6.67
## 3 Not Too Likely   577   28.1 
## 4 Not Likely      1243   60.5

ggplot(age_JL_stat2, aes(x = joblose, y= percent, fill = joblose)) +
  geom_col() + labs(x = "Chance of losing job", y= "Share of respondents (%)", title = "Ages 40+: views on their chance of losing their job")

Here 60.5% of people aged 40+ think they are not likely to lose their job, whereas only 4.8% think that losing their job is very likely.

Although it is said that people start to worry about job loss after the age of 40, here only 4.8% (Very Likely) and 6.7% (Fairly Likely) of the 40+ group expect it. Still, we can't make any inference based on these sample values alone

Research Question 3:

We will take the Middle Atlantic region and, since we want to see recent data, calculate the median income from 2008 through 2012

wealth  <- gss %>%
  filter(!is.na(coninc), year >= 2008, region == "Middle Atlantic")

Now we will calculate the median income for the Mid-Atlantic region from 2008 through 2012

wealth2 <- wealth %>%
  filter(!is.na(coninc), !is.na(region)) %>%
  group_by(region) %>%
  summarise(Med = median(coninc),Maximum = max(coninc))

## `summarise()` ungrouping output (override with `.groups` argument)

wealth2

## # A tibble: 1 x 3
##   region            Med Maximum
##   <fct>           <dbl>   <int>
## 1 Middle Atlantic 45678  178712

The median income in the Middle Atlantic region is $45,678 and the maximum income is $178,712

ggplot(data = wealth, aes(x = coninc, y = region)) +
  geom_boxplot(fill = "darkblue") + labs(x = "Income", 
  title ="Income of people in Middle-Atlantic",  y = "Region")

The boxplot shows a right-skewed distribution. For all respondents from 2008 onward, the median income is $45,678, the 1st quartile is about $22,000 and the 3rd quartile is about $76,600

Part 4: Inference

Research Question 1:

State Hypothesis:

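Null Hypothesis : The proportion of people with hardly any confidence in banks was the same in 2008 and 2012 (p_2008 = p_2012)

Alternative Hypothesis : The proportion of people with hardly any confidence in banks was lower in 2008 than in 2012 (p_2008 < p_2012), that is, confidence in banks fell between 2008 and 2012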

Check Conditions:

Independence :
1. Within groups: the survey respondents were randomly sampled, and the 10% condition is met for both samples (544 and 647 are each less than 10% of the population).
2. Between groups: the 2008 and 2012 samples are independent of each other.
Sample Size/Skew : Both samples meet the success-failure condition (at least 10 successes and 10 failures in each year)
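The success-failure condition can be verified with a few lines of arithmetic. The sketch below assumes the pooled proportion is the appropriate one to use for a hypothesis test; the counts are copied from the EDA tables in Part 3, and the variable names are made up for this illustration.

```r
# Sample sizes and "Hardly Any" counts from the EDA tables
n          <- c(`2008` = 544, `2012` = 647)
hardly_any <- c(`2008` = 281, `2012` = 498)

# Pooled proportion of "Hardly Any" under H0: p_2008 = p_2012
p_pool <- sum(hardly_any) / sum(n)

# Expected successes and failures in each year under H0
expected <- rbind(success = n * p_pool, failure = n * (1 - p_pool))

# Condition: every expected count should be at least 10
condition_met <- all(expected >= 10)

print(round(expected, 1))
print(condition_met)
```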

State methods to be used and why and how:

We will use a z-test for the difference of two proportions, because we want to know whether the proportion of people with hardly any confidence in banks was lower in 2008 than in 2012, and both variables are categorical with 2 levels each (the response and the year)

Before doing inference we must drop the unused levels first

conf_bank <- droplevels(conf_bank)

Perform Inference:

inference(y= confinan, x= year, data = conf_bank, success = "Hardly Any", type = "ht",
          method = "theoretical", statistic = "proportion", alternative = "less")

## Warning: Explanatory variable was numerical, it has been converted
##               to categorical. In order to avoid this warning, first convert
##               your explanatory variable to a categorical variable using the
##               as.factor() function

## Warning: Missing null value, set to 0

## Response variable: categorical (2 levels, success: Hardly Any)
## Explanatory variable: categorical (2 levels) 
## n_2008 = 544, p_hat_2008 = 0.5165
## n_2012 = 647, p_hat_2012 = 0.7697
## H0: p_2008 =  p_2012
## HA: p_2008 < p_2012
## z = -9.1493
## p_value = < 0.0001

Interpret Results:

With such a small p-value, we reject the null hypothesis in favor of the alternative and conclude that the proportion of people with hardly any confidence in banks was lower in 2008 than in 2012. In other words, confidence in banks was lower in 2012 than in 2008.
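The z statistic reported above can be reproduced by hand from the sample sizes and counts. This is a sketch of the standard pooled two-proportion z-test, not a look inside the inference() function; the variable names are made up for the illustration.

```r
# Counts of "Hardly Any" and sample sizes from the output above
x1 <- 281; n1 <- 544   # 2008
x2 <- 498; n2 <- 647   # 2012

p1 <- x1 / n1
p2 <- x2 / n2

# Pooled proportion and standard error under H0: p_2008 = p_2012
p_pool <- (x1 + x2) / (n1 + n2)
se     <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# Test statistic and one-sided p-value for HA: p_2008 < p_2012
z       <- (p1 - p2) / se
p_value <- pnorm(z)

print(round(z, 4))
print(p_value < 0.0001)
```

The result agrees with the z value in the inference() output to four decimal places.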

Confidence Interval:

Next we want to know by how much the proportion of people with hardly any confidence in banks differed between 2008 and 2012

inference(y= confinan, x= as.factor(year), data = conf_bank, success = "Hardly Any", type = "ci",
          method = "theoretical", statistic = "proportion")

## Response variable: categorical (2 levels, success: Hardly Any)
## Explanatory variable: categorical (2 levels) 
## n_2008 = 544, p_hat_2008 = 0.5165
## n_2012 = 647, p_hat_2012 = 0.7697
## 95% CI (2008 - 2012): (-0.3062 , -0.2001)

Interpret Results:

Based on this confidence interval, we are 95% confident that the proportion of people who had hardly any confidence in banks in 2008 was 20.01 to 30.62 percentage points lower than the proportion in 2012.

The CI and the hypothesis test lead to the same conclusion: the interval does not contain 0.
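The interval can also be checked by hand. The sketch below uses the usual unpooled standard error for a difference of two proportions together with a normal critical value; the variable names are made up for the illustration.

```r
# Sample proportions of "Hardly Any" and sample sizes from the output above
n1 <- 544; p1 <- 281 / 544   # 2008
n2 <- 647; p2 <- 498 / 647   # 2012

# Unpooled standard error of the difference p1 - p2
se_diff <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# 95% confidence interval: point estimate +/- z* x SE
z_star <- qnorm(0.975)
ci     <- (p1 - p2) + c(-1, 1) * z_star * se_diff

print(round(ci, 4))
```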

Research Question 2:

State Hypothesis:

Null Hypothesis : Expected job loss and age group are independent. How likely people think they are to lose their job does not depend on age

Alternative Hypothesis : Expected job loss and age group are dependent. How likely people think they are to lose their job does depend on age

Check conditions:

Independence :
1. The survey respondents were randomly sampled, so we can assume independence.
2. If sampling without replacement, n < 10% of population. The 3,587 observations meet this requirement.
3. Each case contributes to only one cell in the table, since every respondent belongs to one age group and gave one answer.
Sample size : Each cell must have at least 5 expected cases. The smallest observed count in the table below is 97, and the smallest expected count in the inference output is about 83, so this requirement is met.

table(age_JL$joblose, age_JL$age_category)

##                      
##                       18-39  40+
##   Very Likely            97   98
##   Fairly Likely         105  137
##   Not Too Likely        459  577
##   Not Likely            871 1243
##   Leaving Labor Force     0    0
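The expected counts needed for the sample size condition follow from the row and column totals of this table. In the sketch below the observed counts are typed in by hand from the table (the empty “Leaving Labor Force” row is left out), and the names observed and expected are made up for the illustration.

```r
# Observed counts copied from the table above, without the empty level
observed <- matrix(
  c(97, 105, 459, 871,      # 18-39
    98, 137, 577, 1243),    # 40+
  ncol = 2,
  dimnames = list(
    joblose      = c("Very Likely", "Fairly Likely", "Not Too Likely", "Not Likely"),
    age_category = c("18-39", "40+")
  )
)

# Expected count per cell = (row total x column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

print(round(expected, 2))
print(all(expected >= 5))
```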

State methods to be used and why and how:

The method used here is Chi-Square Independence Test since we are quantifying how different the observed counts are from the expected counts and we are evaluating relationship between two categorical variables.

Perform Inference:

inference(y = age_category, x= joblose ,data = age_JL, type = "ht", statistic = "proportion",
          method = "theoretical",alternative = "greater")

## Response variable: categorical (2 levels) 
## Explanatory variable: categorical (5 levels) 
## Observed:
##                 y
## x                18-39  40+
##   Very Likely       97   98
##   Fairly Likely    105  137
##   Not Too Likely   459  577
##   Not Likely       871 1243
## 
## Expected:
##                 y
## x                    18-39       40+
##   Very Likely     83.28408  111.7159
##   Fairly Likely  103.35768  138.6423
##   Not Too Likely 442.47338  593.5266
##   Not Likely     902.88486 1211.1151
## 
## H0: joblose and age_category are independent
## HA: joblose and age_category are dependent
## chi_sq = 7.0313, df = 3, p_value = 0.0709

Interpret Results :

The Chi-Square statistic is 7.0313 with 3 degrees of freedom, and the associated p-value (0.0709) is larger than the significance level of 0.05. Therefore we fail to reject the null hypothesis: the data do not provide convincing evidence that how likely people think they are to lose their job is associated with their age group. Because this is an observational study, any finding here concerns association only, not causation
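As a cross-check, the same test can be run with base R's chisq.test() on the observed counts. This sketch types the counts in by hand from the table above rather than using age_JL; the object names are made up for the illustration.

```r
# Observed counts from the contingency table (rows: joblose, columns: age group)
observed <- matrix(
  c(97, 105, 459, 871,
    98, 137, 577, 1243),
  ncol = 2,
  dimnames = list(
    joblose      = c("Very Likely", "Fairly Likely", "Not Too Likely", "Not Likely"),
    age_category = c("18-39", "40+")
  )
)

# Pearson chi-square test of independence
res <- chisq.test(observed)

print(round(unname(res$statistic), 4))  # chi-square statistic
print(unname(res$parameter))            # degrees of freedom
print(round(res$p.value, 4))            # p-value
```

The statistic, degrees of freedom and p-value agree with the inference() output above.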

Research Question 3:

Since we only want to estimate the median income of the population in the Mid-Atlantic region, there is no hypothesis to test; we build a confidence interval instead. Next we check the conditions

Check Conditions:

1. If the bootstrap distribution is extremely skewed or sparse, the bootstrap interval might be unreliable. Our data is right skewed but not extremely so, therefore this won't be a problem.
2. If the sample is biased, the resulting estimate will also be biased. Since the respondents were sampled randomly, this won't be a problem for us either.

State the method and why and how:

We will use the bootstrap method: our sample distribution is skewed to the right, so the median is a better measure of the typical income than the mean, and bootstrapping lets us build a CI for the median.

Perform Inference :

inference(y = coninc, data = wealth, boot_method = "se", type = "ci", 
          method = "simulation", statistic = "median", nsim = 15000, seed = 02022000)

## Single numerical variable
## n = 616, y_med = 45678, Q1 = 22083, Q3 = 76600
## 95% CI: (39407.0781 , 51948.9219)

Interpret Result:

We used the bootstrap because our population parameter is the median. The interval is simulation based, with 15,000 simulations, and the boot_method used was “se” (standard error): the interval is the sample median plus or minus a critical value times the standard error of the bootstrap medians. Based on this confidence interval, we are 95% confident that the median income of people living in the Mid-Atlantic region is between $39,407 and $51,949.
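The standard error bootstrap can be sketched in a few lines of base R. Everything in this example is an assumption made for illustration: the incomes are synthetic right-skewed values drawn from a log-normal distribution (not the gss data), the number of resamples is smaller than the 15,000 used above, and a normal critical value is used, which may differ from what inference() does internally.

```r
set.seed(1)

# Synthetic right-skewed "incomes" (NOT the gss data), for illustration only
incomes <- rlnorm(600, meanlog = 10.7, sdlog = 0.9)

# Resample with replacement and record the median of each bootstrap sample
nsim         <- 2000
boot_medians <- replicate(nsim, median(sample(incomes, replace = TRUE)))

# Standard error method: sample median +/- critical value x bootstrap SE
boot_se <- sd(boot_medians)
ci      <- median(incomes) + c(-1, 1) * qnorm(0.975) * boot_se

print(round(ci))
```

By construction the interval is symmetric around the sample median, just like the interval reported above, which is centered on 45,678.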

References:

https://en.wikipedia.org/wiki/General_Social_Survey

https://news.gallup.com/poll/171995/confidence-banks-remains-low.aspx

http://gss.norc.org/

https://en.wikipedia.org/wiki/Mid-Atlantic_(United_States)