Statistical inference with the GSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("gss.Rdata")

Part 1: Data

The General Social Survey (GSS) collects its data through personal interviews with an independently drawn random sample of adults (18+) living in households in the United States. Because respondents are selected by random sampling, the results can be generalized to the U.S. adult household population. The GSS is, however, an observational study.

Since it does not involve random assignment to treatments, we cannot use it to establish causality between variables in the data.

Part 2: Research question

Is there a significant association between a respondent’s sex and their opinion on how the nation is dealing with drug addiction?

We would like to find out whether men and women differ in how they view the handling of drug addiction. If they do, resources could be put into further research to determine the actual causes of the difference, since many factors, such as biological and social differences, could play a role.

Part 3: Exploratory data analysis

We use the data from 2012. The variables of interest are as follows:

1). sex : Respondent’s sex (respondent background variable)
2). natdrug : Dealing with drug addiction (attitudinal measure): whether the respondent thinks the nation spends too little, about the right amount, or too much on it

dataset <- filter(gss, year == 2012) %>%
  select(sex,natdrug)

str(dataset)

## 'data.frame':    1974 obs. of  2 variables:
##  $ sex    : Factor w/ 2 levels "Male","Female": 1 1 1 2 2 2 2 2 2 2 ...
##  $ natdrug: Factor w/ 3 levels "Too Little","About Right",..: NA NA NA NA NA 1 2 NA 2 1 ...

dim(dataset)

## [1] 1974    2

The dataset has 1974 rows (respondents) and 2 columns (variables).

levels(dataset$sex)

## [1] "Male"   "Female"

levels(dataset$natdrug)

## [1] "Too Little"  "About Right" "Too Much"

This indicates that we have two levels of sex and three of natdrug.

ggplot(dataset, aes(x = natdrug, fill = sex)) + geom_bar(position = "dodge") + ggtitle('Responses to natdrug by sex') + xlab('natdrug response') + theme_bw()

We now create a new dataset that omits respondents with missing values.

dataset <- na.omit(dataset)

table(dataset)

##         natdrug
## sex      Too Little About Right Too Much
##   Male          206         112       53
##   Female        338         178       57

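Rows with unanswered (NA) values have been removed, leaving 944 complete cases.

In our sample, women outnumber men in every response category. This largely reflects the fact that there are more female (573) than male (371) respondents, so the raw counts alone do not tell us whether opinions differ by sex.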

prop.table(table(dataset))

##         natdrug
## sex      Too Little About Right   Too Much
##   Male   0.21822034  0.11864407 0.05614407
##   Female 0.35805085  0.18855932 0.06038136

This is the joint proportion table: each cell count divided by the total number of complete cases.

mosaicplot(prop.table(table(dataset), 1), main = 'natdrug responses by sex (row proportions)')

A mosaic plot is a graphical representation of a contingency table that shows the relationship among two or more categorical variables.

Because the plot is built from row proportions, it compares the distribution of responses within each sex rather than the number of respondents. The two distributions look similar: about 56% of men and 59% of women answered "Too Little", while about 14% of men and 10% of women answered "Too Much".
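The within-sex proportions behind the mosaic plot can be checked directly. The sketch below rebuilds the contingency table from the counts printed earlier instead of loading gss.Rdata, so the name `tab` and the hard-coded counts are assumptions made for illustration only.

```r
# Counts copied from the contingency table printed above (2012, complete cases).
# `tab` is an illustrative name, not an object from the original analysis.
tab <- matrix(c(206, 112, 53,
                338, 178, 57),
              nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("Male", "Female"),
                              natdrug = c("Too Little", "About Right", "Too Much")))

# Row proportions: the distribution of natdrug responses within each sex.
row_props <- prop.table(tab, margin = 1)
print(round(row_props, 3))
```

Each row sums to 1, so the rows can be compared even though the sample contains more women than men.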

Part 4: Inference

 Hypothesis testing:-

Our null hypothesis is that sex and natdrug response are independent of each other, that is, the distribution of responses does not differ by sex. H0 : The two attributes are independent.

The alternative hypothesis is that the two variables of interest are associated. H1 : The two attributes are related to each other, that is, there is an association between the two.

 Stating the method used:-

Since we are dealing with two categorical variables, at least one of which has more than two levels, we use the chi-square test of independence.

For each cell, we take the squared difference between the observed and expected counts, divide it by the expected count, and sum these values over all cells.
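The calculation described above can be sketched by hand. The table is rebuilt from the counts printed in Part 3 so the snippet runs without gss.Rdata; the names `tab`, `expected` and `chi_sq` are illustrative assumptions, not objects from the original analysis.

```r
# Observed counts, copied from the contingency table in Part 3.
tab <- matrix(c(206, 112, 53,
                338, 178, 57),
              nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("Male", "Female"),
                              natdrug = c("Too Little", "About Right", "Too Much")))

n <- sum(tab)

# Expected count under independence: (row total * column total) / grand total.
expected <- outer(rowSums(tab), colSums(tab)) / n

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected.
chi_sq <- sum((tab - expected)^2 / expected)
print(chi_sq)
```

The result should agree, up to rounding, with the X-squared value that chisq.test() reports later in this part.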

 Checking conditions for the test:-

1). Independence: Since the sample is drawn from a large population by random sampling, the sampled observations are likely to be independent of each other.

2). Because sampling is done without replacement, the sample size should be less than 10% of the population. With 944 complete cases drawn from all U.S. adults, this condition is clearly met.

3). Each case contributes to only one cell of the table.

4). Sample size: each cell must have an expected count of at least five. Our table meets this requirement.

All the conditions for the chi-square test of independence are therefore met.
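The expected-count condition can be verified rather than asserted. A minimal sketch, again assuming the table is rebuilt from the printed counts under the illustrative name `tab`, uses the `expected` component that base R's chisq.test() returns:

```r
# Observed counts, copied from the contingency table in Part 3.
tab <- matrix(c(206, 112, 53,
                338, 178, 57),
              nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("Male", "Female"),
                              natdrug = c("Too Little", "About Right", "Too Much")))

# chisq.test() returns the expected counts under independence in $expected.
expected <- chisq.test(tab)$expected
print(round(expected, 1))

# The condition holds if every expected count is at least 5.
smallest <- min(expected)
condition_met <- all(expected >= 5)
print(smallest)
print(condition_met)
```

The smallest expected count belongs to the Male / "Too Much" cell, which has the smallest row and column totals.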

 Performing inference:-

chisq.test(table(dataset))

## 
##  Pearson's Chi-squared test
## 
## data:  table(dataset)
## X-squared = 4.1615, df = 2, p-value = 0.1248

Here, the degrees of freedom are 2 because the contingency table has r = 2 rows and c = 3 columns, and df = (r-1)*(c-1). Hence, once the row and column totals are fixed, only two cell counts are free to vary.
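As a quick check, the degrees of freedom and the p-value can be recovered from the reported statistic. This is only a sketch: the variable names are illustrative, and the statistic is typed in from the chisq.test() output above rather than recomputed.

```r
# Dimensions of the contingency table: 2 levels of sex, 3 levels of natdrug.
n_rows <- 2
n_cols <- 3
df <- (n_rows - 1) * (n_cols - 1)

# Statistic as printed by chisq.test() above.
x_squared <- 4.1615

# Upper-tail probability of the chi-square distribution with `df` degrees of freedom.
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)
print(df)
print(p_value)
```

The upper tail is used because only large values of the statistic count as evidence against independence.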

fisher.test(table(dataset))

## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(dataset)
## p-value = 0.1276
## alternative hypothesis: two.sided

Fisher’s exact test, run as a check on the chi-square result, gives a similar p-value. We assess the relationship between the two variables at the 5% level of significance.

 Result:-

The p-value (0.1248) is higher than the level of significance (alpha = 0.05), so we fail to reject the null hypothesis. The results are not statistically significant. The p-value means that, if the null hypothesis were true, the probability of obtaining a test statistic at least as extreme as the one observed would be about 12.5%, which is greater than 5%. It is not the probability that the two attributes are independent.

As far as the research question is concerned, we conclude that the data do not provide convincing evidence of an association between sex and natdrug response. This does not prove that the two are independent; it only means that we found no significant evidence against independence.

Finally, this result cannot be used to determine causality, because the GSS is an observational study and does not involve an experiment with random assignment to treatments.