S371: Lab 12

Lab Instructor: Katya Baldina

2023-11-08

Announcement

HW8 IS DUE THURSDAY (Nov. 9) AT 11:59 PM!

Chi-square test

The goal of the chi-square test:

to see whether two categorical variables are related in the population, given randomly sampled data

The chi-square test only shows whether the variables are related or not

What the chi-square test can’t do:

Chi-square test

Generic code in R:

TABOBJECT = table(CATVAR1, CATVAR2)
chisq.test(TABOBJECT)

TABOBJECT is an object that represents a two-way table
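
The generic code can be tried on a small made-up example. This is a sketch only: the data frame `toy` and its columns `smoker` and `gender` are invented for illustration and are not part of any course data set.

```r
# Toy data, invented for illustration (100 made-up respondents)
toy <- data.frame(
  smoker = rep(c("yes", "no"), times = c(40, 60)),
  gender = c(rep(c("F", "M"), times = c(25, 15)),   # the 40 smokers
             rep(c("F", "M"), times = c(20, 40)))   # the 60 non-smokers
)

# Step 1: build the two-way table and store it in an object
tab <- table(toy$smoker, toy$gender)
tab

# Step 2: run the chi-square test on the stored table
chisq.test(tab)
```

For a 2x2 table like this one, chisq.test() applies a continuity correction by default, so the output is labeled accordingly.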

Chi-square test

For example, we want to test whether using Snapchat (SNAPCHAT) is related to age.

Null hypothesis:

There is no relationship between the usage of Snapchat and age group in the population.

Alternative hypothesis:

There is a relationship between the usage of Snapchat and age group in the population.

Chi-square test

First, we need to make a two-way table between the SNAPCHAT variable and the categorical AGE variable (AGE2).

We store the result of the table(socmedia$SNAPCHAT, socmedia$AGE2) command in a separate object.

We can print the object called “twowaytable” to see if the two-way table has been stored successfully.

twowaytable = table(socmedia$SNAPCHAT, socmedia$AGE2)
twowaytable
##      
##       18-34 35-54 55+
##   NO    237   443 357
##   YES   216    74  18

Then we can run the chi-square test on the stored object “twowaytable”.

chisq.test(twowaytable)
## 
##  Pearson's Chi-squared test
## 
## data:  twowaytable
## X-squared = 248.75, df = 2, p-value < 2.2e-16
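
To read this output, compare the p-value with a significance level. The sketch below assumes the conventional 0.05 level (the slides do not state one) and rebuilds the table from the counts printed above, so it runs without the socmedia data; the object name `res` is arbitrary.

```r
# Rebuild the two-way table from the printed counts (filled column by column)
twowaytable <- matrix(
  c(237, 216, 443, 74, 357, 18),
  nrow = 2,
  dimnames = list(c("NO", "YES"), c("18-34", "35-54", "55+"))
)

res <- chisq.test(twowaytable)

res$statistic   # the chi-square statistic (X-squared)
res$parameter   # the degrees of freedom (df)
res$p.value     # the p-value

# Assumed significance level of 0.05: TRUE means we reject the null hypothesis
res$p.value < 0.05
```

Because the p-value is far below 0.05, we would reject the null hypothesis and conclude that Snapchat usage and age group are related in the population.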

Chi-square test

We can also save results of the chi-square test into a separate object.

xtest <- chisq.test(twowaytable)

Then we can look for the observed and expected counts in the chi-square test:

xtest$observed
##      
##       18-34 35-54 55+
##   NO    237   443 357
##   YES   216    74  18
xtest$expected
##      
##          18-34    35-54       55+
##   NO  349.2647 398.6089 289.12639
##   YES 103.7353 118.3911  85.87361
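
The expected counts are what we would see if the null hypothesis were true: each one equals (row total x column total) / grand total. A minimal sketch that recomputes them by hand from the observed counts printed above; the names `observed` and `expected` are chosen here for illustration.

```r
# Observed counts copied from the output above (filled column by column)
observed <- matrix(
  c(237, 216, 443, 74, 357, 18),
  nrow = 2,
  dimnames = list(c("NO", "YES"), c("18-34", "35-54", "55+"))
)

# Expected count in each cell = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 4)

# Same values as the ones stored by chisq.test()
all.equal(expected, chisq.test(observed)$expected, check.attributes = FALSE)
```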

Chi-square test

We can use the observed and expected counts to calculate how each cell contributes to the chi-square statistic in R:

\(\chi^2 = \sum{\frac{(O_i - E_i)^2}{E_i}}\)

(xtest$observed - xtest$expected)^2/xtest$expected
##      
##            18-34      35-54        55+
##   NO   36.085410   4.943612  15.933607
##   YES 121.495357  16.644563  53.646593
sum((xtest$observed - xtest$expected)^2/xtest$expected)
## [1] 248.7491
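
The degrees of freedom and the p-value reported by chisq.test() can be recovered from this statistic. A sketch, assuming the standard formula df = (rows - 1) x (columns - 1) for a two-way table; pchisq() is a base R function, and the names `stat`, `df`, and `p_value` are arbitrary.

```r
stat <- 248.7491          # chi-square statistic computed above
df <- (2 - 1) * (3 - 1)   # 2 rows (NO/YES), 3 columns (age groups)
df

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(stat, df = df, lower.tail = FALSE)
p_value < 2.2e-16         # TRUE, matching "p-value < 2.2e-16" in the test output
```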

Final Project

You will analyze General Social Survey data that you can find on Canvas (project.RData).

For a detailed description of each variable, please refer to the Codebook.txt file.

Final Project

Choose 2 dependent and 2 independent variables (four in total)

2 dependent variables: one quantitative and one categorical

2 independent variables: one quantitative and one categorical

If you want, you may create your own secondary variables

Final Project

Sections (for full description refer to the file Final project.docx):

  1. Background

  2. Univariate distributions

  3. Bivariate relationships (you need to do all of these things for all four variables)

  4. Conclusion (10pt)

  5. R appendix (20pt)

Final Project

  2. Univariate distributions

Final Project

Which graph should I create?

For quantitative variable, create

- histogram or box plot

For categorical variable, create

- bar chart or pie chart

For a guide to making these charts in R, see Lab 11.
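
As a quick reminder, the four chart types can be drawn with base R functions. This is only a sketch on made-up data: the vectors `hours` and `degree` are hypothetical and should be replaced with your own variables from project.RData.

```r
# Hypothetical data, invented for illustration
hours  <- c(40, 35, 50, 20, 45, 40, 60, 38, 42, 30)              # quantitative
degree <- c("HS", "BA", "HS", "Grad", "BA", "HS", "BA", "HS", "Grad", "HS")  # categorical

# Quantitative variable: histogram or box plot
hist(hours, main = "Histogram of hours", xlab = "Hours")
boxplot(hours, main = "Box plot of hours")

# Categorical variable: bar chart or pie chart (both start from a table of counts)
degree_counts <- table(degree)
barplot(degree_counts, main = "Bar chart of degree")
pie(degree_counts, main = "Pie chart of degree")
```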

Final Project

  3. Bivariate relationships (you need to do all of these things for all four variables)

    • Categorical IV and DV

      • Two-way table with percentages (refer to ctable() from Lab
      • Chi-square test (if at least one variable has more than two categories) or two-sample test of proportions (if both variables have two categories)
    • Categorical and Quantitative pairs

      • two-sample t-test (if categorical variable has two categories)
      • if categorical variable has more than two categories, choose one category as reference (refer to Lab 10) and perform two-sample t-test
    • Quantitative IV and DV

      • Scatterplot (we will cover it next week or later; include the graph)
      • Correlation
      • Regression
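
The analyses in this list map onto a handful of base R functions. The sketch below runs them on simulated data; the data frame `toy`, its columns, and the counts passed to prop.test() are all made up for illustration, and the project instructions in Final project.docx take priority over this outline.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated data: one two-category variable and two quantitative variables
toy <- data.frame(
  group = rep(c("A", "B"), each = 30),
  x = rnorm(60)
)
toy$y <- 2 + 0.5 * toy$x + rnorm(60)

# Categorical IV and DV, both with two categories:
# two-sample test of proportions (made-up counts: 30 of 100 vs. 45 of 100)
pt <- prop.test(x = c(30, 45), n = c(100, 100))

# Categorical and quantitative pair: two-sample t-test
tt <- t.test(y ~ group, data = toy)

# Quantitative IV and DV: scatterplot, correlation, regression
plot(toy$x, toy$y, xlab = "x", ylab = "y")
r <- cor(toy$x, toy$y)
fit <- lm(y ~ x, data = toy)

pt; tt; r; summary(fit)
```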

Missing data

In the final project, you will analyze data with missing cases

In most cases, missing data would not affect your coding in R

The summary(), table(), plot(), hist(), boxplot(), and barplot() commands are not affected by missing data

However, some other commands are affected by missing data:
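
The slide does not list the affected commands. One common case, given here as an assumption about what is meant, is mean() and sd(): they return NA when the variable contains missing values unless na.rm = TRUE is set. A small sketch with a made-up vector `x`:

```r
x <- c(4, 8, NA, 6, 2)   # made-up data with one missing value

mean(x)                  # NA, because of the missing value
mean(x, na.rm = TRUE)    # 5, the mean of the four valid values
sd(x, na.rm = TRUE)      # standard deviation of the valid values
```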

Missing data

For statistical tests, you don’t need to worry about missing data

Exception: when you need to specify number of observations (n), such as in the prop.test() function, you need to look for the valid number of observations and exclude the missing cases
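
One way to find the valid number of observations is to count the non-missing values with is.na(). A sketch on a made-up variable `answer` with 100 cases, 20 of them missing; the variable name and the counts are hypothetical.

```r
# Made-up variable: 30 YES, 50 NO, 20 missing
answer <- c(rep("YES", 30), rep("NO", 50), rep(NA, 20))

length(answer)                             # 100: includes the missing cases, do NOT use as n
n_valid <- sum(!is.na(answer))             # 80: valid (non-missing) observations
n_yes   <- sum(answer == "YES", na.rm = TRUE)  # 30: number of YES answers

# Use the valid n, not the total number of rows
prop.test(x = n_yes, n = n_valid)
```

Note that table(answer) also drops the missing cases by default, so sum(table(answer)) gives the same valid n.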

HW 8 Guide

There are two parts:

• Part 1: Question 1-3

• Part 2 (R part): Question 4-8

HW 8 Guide

Q4. Read the codebook provided and construct a table for each of the following variables: KNWEXEC, KNWCLENR, SMALLGAP. Label each column with the meaning of the code, not the numeric code itself. (9 pts)

Hint: use the table() function to find out the number of observations in each response option

Find the meaning of each response category in codebook.txt
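
One possible way to get labeled columns is to convert the numeric codes to a factor before calling table(). This is a sketch only: the vector `resp`, its codes 1 to 3, and the labels are hypothetical, so use the actual codes and meanings from the codebook for the homework variables.

```r
# Hypothetical variable with numeric codes (1, 2, 3) and one missing value
resp <- c(1, 2, 2, 3, 1, 2, NA, 3, 2, 1)

# Attach a hypothetical label to each code
resp_labeled <- factor(resp,
                       levels = c(1, 2, 3),
                       labels = c("Agree", "Neutral", "Disagree"))

# The table now shows the labels instead of the numeric codes
resp_tab <- table(resp_labeled)
resp_tab
```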

HW 8 Guide

Q5. Construct a two-way table for KNWEXEC and KNWCLENR and store it in a variable tab. Print the table below. (3 pts)

Generic code:

TABOBJECT = table(CATVAR1, CATVAR2)

replace CATVAR1 and CATVAR2 with KNWEXEC and KNWCLENR

you can give any name to your TABOBJECT

HW 8 Guide

Q6. Run a chi-square test on the table and store it in a variable xtest. (9 pts)

xtest <- chisq.test(TABOBJECT)

use TABOBJECT that you created for the Q5

you can give any name to your TABOBJECT

Print out xtest object to see the results of the test.

HW 8 Guide

Q7. Repeat 5 and 6 to test the relationship between SMALLGAP and KNWCLENR. (12 pts)

Q8. Repeat 5 and 6 to test the relationship between SMALLGAP and KNWEXEC. (12 pts)

Follow the directions on the two previous slides