HW8 IS DUE THURSDAY (Nov. 9) 11:59 PM!
The goal of the chi-square test:
to see whether two categorical variables are related in the population, given randomly sampled data
The chi-square test only shows whether the variables are related or not
What chi-square test can’t do:
tell you the direction of correlation
tell you how strong the correlation is
Generic code in R:
TABOBJECT is an object that represents a two-way table
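The generic call is not reproduced in these notes. It presumably has the form below. This is a minimal sketch: the counts in TABOBJECT are made up purely so that the call runs on its own.

```r
# Hypothetical two-way table of counts (made-up numbers, for illustration only).
TABOBJECT <- as.table(matrix(
  c(30, 20, 10, 40),
  nrow = 2,
  dimnames = list(c("NO", "YES"), c("GROUP A", "GROUP B"))
))

# Generic form: pass the stored two-way table to chisq.test().
chisq.test(TABOBJECT)
```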
For example, we want to test if using SNAPCHAT is correlated with age.
Null hypothesis:
There is no relationship between the usage of Snapchat and age in the population.
Alternative hypothesis:
There is a relationship between the usage of Snapchat and age in the population.
First, we need to make a two-way table between the SNAPCHAT variable and the categorical AGE variable (AGE2).
We store the results of the table(socmedia$SNAPCHAT, socmedia$AGE2) command in a separate object.
We can print the object called “twowaytable” to see if the two-way table has been stored successfully.
##      
##       18-34 35-54 55+
##   NO    237   443 357
##   YES   216    74  18
Then we can run the chi-square test with the stored object of “twowaytable”
## 
##  Pearson's Chi-squared test
## 
## data:  twowaytable
## X-squared = 248.75, df = 2, p-value < 2.2e-16
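The test above can be reproduced without the socmedia data by rebuilding the two-way table from the printed counts. This is a sketch under that assumption. With the data loaded, the table would instead come from table(socmedia$SNAPCHAT, socmedia$AGE2).

```r
# Rebuild the two-way table from the counts printed above.
# matrix() fills column by column: (NO, YES) for 18-34, then 35-54, then 55+.
twowaytable <- as.table(matrix(
  c(237, 216, 443, 74, 357, 18),
  nrow = 2,
  dimnames = list(c("NO", "YES"), c("18-34", "35-54", "55+"))
))

# Print the table to check it, then run the chi-square test on it.
twowaytable
chisq.test(twowaytable)
```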
We can also save results of the chi-square test into a separate object.
Then we can look for the observed and expected counts in the chi-square test:
##      
##       18-34 35-54 55+
##   NO    237   443 357
##   YES   216    74  18
##      
##          18-34    35-54       55+
##   NO  349.2647 398.6089 289.12639
##   YES 103.7353 118.3911  85.87361
We can use the observed and expected counts to calculate how each cell contributes to the chi-square statistic in R:
\(\chi^2 = \sum{\frac{(O_i - E_i)^2}{E_i}}\)
##      
##            18-34      35-54        55+
##   NO   36.085410   4.943612  15.933607
##   YES 121.495357  16.644563  53.646593
## [1] 248.7491
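The observed counts, expected counts, per-cell contributions, and their sum can be reproduced roughly as follows. This is a sketch: the table is rebuilt from the printed counts, and the object names (chisq_result, contributions) are illustrative choices, not names taken from the lecture.

```r
# Two-way table rebuilt from the printed counts (filled column by column).
twowaytable <- as.table(matrix(
  c(237, 216, 443, 74, 357, 18),
  nrow = 2,
  dimnames = list(c("NO", "YES"), c("18-34", "35-54", "55+"))
))

# Save the test results in a separate object (hypothetical name).
chisq_result <- chisq.test(twowaytable)

# Observed and expected counts stored inside the test object.
observed <- chisq_result$observed
expected <- chisq_result$expected

# Contribution of each cell: (O - E)^2 / E.
contributions <- (observed - expected)^2 / expected
contributions

# The contributions add up to the chi-square statistic.
sum(contributions)
```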
You will analyze General Social Survey data that you can find on Canvas (project.RData).
For a detailed description of each variable, please refer to the Codebook.txt file.
Choose 2 dependent and 2 independent variables (four in total)
2 dependent variables: one quantitative and one categorical
2 independent variables: one quantitative and one categorical
If you want, you may create your own secondary variables
Sections (for full description refer to the file Final project.docx):
Background
Univariate distributions
Bivariate relationships (you need to do all of these things for all four variables)
Conclusion (10pt)
R appendix (20pt)
Two quantitative variables - create a summary table that includes:
Measures of center (mean and median)
Measures of variation (standard deviation and five number summary)
Lower and upper bounds of 95% confidence intervals for the means
Two categorical variables - create a summary table that includes:
Categories for each categorical variable
Percentages of the sample in each category
Lower and upper bounds of 95% confidence intervals for the percentage in each category
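One possible way to compute the pieces of both summary tables is sketched below. The data are simulated and the variable names (income, degree) are hypothetical stand-ins, not variables from project.RData. The confidence intervals come from t.test() for the mean and prop.test() for each percentage, which is one reasonable choice rather than the required method.

```r
set.seed(1)

# Hypothetical stand-ins for project variables (simulated, with some missing values).
income <- c(rnorm(195, mean = 50, sd = 12), rep(NA, 5))
degree <- sample(c("No degree", "Degree"), 200, replace = TRUE)

# Quantitative variable: center, variation, and 95% CI for the mean.
mean_ci <- t.test(income)$conf.int          # t.test() drops missing values itself
quant_summary <- c(
  mean     = mean(income, na.rm = TRUE),
  median   = median(income, na.rm = TRUE),
  sd       = sd(income, na.rm = TRUE),
  ci_lower = mean_ci[1],
  ci_upper = mean_ci[2]
)
five_numbers <- fivenum(income)             # min, lower hinge, median, upper hinge, max

# Categorical variable: percentage and 95% CI for each category.
counts <- table(degree)
n <- sum(counts)
cat_summary <- t(sapply(names(counts), function(level) {
  ci <- prop.test(counts[[level]], n)$conf.int * 100
  c(percent = 100 * counts[[level]] / n, ci_lower = ci[1], ci_upper = ci[2])
}))

quant_summary
five_numbers
cat_summary
```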
All four variables – create univariate graphs (one for each variable). Provide an appropriate graph showing the distribution of each variable. These graphs may be histograms, boxplots, or bar graphs and should be made in R (refer to Lab 11).
Which graph should I create?
For a quantitative variable, create
- a histogram or box plot
For a categorical variable, create
- a bar chart or pie chart
For the guide to make these charts in R, check out Lab 11.
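As a rough illustration of the four chart types using base R graphics, here is a sketch on simulated data. The variables age and marital are hypothetical. Lab 11 remains the reference for the course's preferred approach.

```r
set.seed(2)

# Hypothetical variables (simulated): one quantitative, one categorical.
age <- round(rnorm(100, mean = 45, sd = 15))
marital <- sample(c("Married", "Single", "Other"), 100, replace = TRUE)

# Quantitative variable: histogram or box plot.
h <- hist(age, main = "Distribution of age", xlab = "Age")
boxplot(age, main = "Distribution of age", ylab = "Age")

# Categorical variable: bar chart or pie chart, both built from a table of counts.
counts <- table(marital)
barplot(counts, main = "Marital status", ylab = "Count")
pie(counts, main = "Marital status")
```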
Bivariate relationships (you need to do all of these things for all four variables)
Categorical IV and DV
ctable() from Lab
Categorical and Quantitative pairs
Quantitative IV and DV
In the final project, you will analyze data with missing cases
In most cases, missing data would not affect your coding in R
The summary(), table(), plot(), hist(), boxplot(), and barplot() commands are not affected by missing data
However, some other commands are affected by missing data:
mean()
sd()
cor()
For cor(), use cor(data$var1, data$var2, use = "complete.obs")
For statistical tests, you don’t need to worry about missing data
Exception: when you need to specify the number of observations (n), such as in the prop.test() function, you need to look for the valid number of observations and exclude the missing cases
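The points about missing data can be illustrated with a small sketch. The data below are made up, and na.rm = TRUE is the standard base R argument for mean() and sd(). It is shown here as a likely fix, since these notes only spell out the cor() case.

```r
# Hypothetical variables with missing values (made-up numbers).
var1 <- c(1, 2, NA, 4, 5, 6)
var2 <- c(2, 4, 6, NA, 10, 12)
answer <- c(rep("yes", 30), rep("no", 20), rep(NA, 5))

# mean() and sd() return NA when any value is missing ...
mean(var1)
# ... unless missing values are removed first.
mean(var1, na.rm = TRUE)
sd(var1, na.rm = TRUE)

# cor() needs to be told to use only the complete pairs.
cor(var1, var2, use = "complete.obs")

# prop.test() needs the valid n, so count only the non-missing cases.
n_valid <- sum(!is.na(answer))
successes <- sum(answer == "yes", na.rm = TRUE)
prop.test(successes, n_valid)
```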
There are two parts:
• Part 1: Question 1-3
• Part 2 (R part): Question 4-8
Q4. Read the codebook provided and construct a table for each of the following variables: KNWEXEC, KNWCLENR, SMALLGAP. Label each column with the meaning of the code, not the numeric code itself. (9 pts)
Hint: use the table() function to find out the number of observations in each response option
Find the meaning of each response category in Codebook.txt
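One way to show labels instead of numeric codes is to convert the variable to a factor before tabulating. This is a sketch on made-up data: the codes and the labels ("Agree", "Neutral", "Disagree") are hypothetical and are not taken from the codebook.

```r
# Hypothetical numeric codes for one survey item (made-up values, one missing).
codes <- c(1, 2, 2, 3, 1, 2, NA, 3, 3, 3)

# Attach a meaning to each code; these labels are illustrative only.
labelled <- factor(codes,
                   levels = c(1, 2, 3),
                   labels = c("Agree", "Neutral", "Disagree"))

# The table now shows the labels as column headers instead of 1, 2, 3.
table(labelled)
```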
Q5. Construct a two-way table for KNWEXEC and
KNWCLENR and store it in a variable tab. Print
the table below. (3 pts)
Generic code:
replace CATVAR1 and CATVAR2 with KNWEXEC and
KNWCLENR
you can give any name to your TABOBJECT
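The generic code is not reproduced in these notes. It presumably follows the table() pattern used in the Snapchat example. This is a sketch in which DATA is a made-up data frame standing in for the homework data.

```r
# Hypothetical data frame with two categorical variables (made-up values).
DATA <- data.frame(
  CATVAR1 = c("a", "a", "b", "b", "a", "b"),
  CATVAR2 = c("x", "y", "x", "y", "x", "x")
)

# Generic form: cross-tabulate the two variables and store the result.
TABOBJECT <- table(DATA$CATVAR1, DATA$CATVAR2)

# Print the stored two-way table.
TABOBJECT
```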
Q6. Run a chi square test on the table and store it in a variable
xtest. (9 pts)
use TABOBJECT that you created for the Q5
you can give any name to your TABOBJECT
Print out xtest object to see the results of the
test.
Q7. Repeat 5 and 6 to test the relationship between
SMALLGAP and KNWCLENR. (12 pts)
Q8. Repeat 5 and 6 to test the relationship between
SMALLGAP and KNWEXEC. (12 pts)
Follow the directions on the two previous slides