HW8 IS DUE THURSDAY (Nov. 16) AT 11:59 PM!
Did you pick your variables? Please talk to your neighbor about the variables you’ve picked, or pick them now.
Choose 2 dependent and 2 independent variables (four in total)
2 dependent variables: one quantitative and one categorical
2 independent variables: one quantitative and one categorical
If you want, you may create your own secondary variables
To correlate two variables, type:
cor(x, y)
where
x and y are your vectors with the quantitative variables
If your quantitative variables are within a dataset, you need to specify the dataset first, type $, then the name of your variables:
cor(dataset$variable1, dataset$variable2)
Let’s say we are interested in the relationship between income and life satisfaction:
We might suspect that higher income will be associated with higher life satisfaction (a positive correlation):
load("project.RData")
## Recode the variable so that a higher score
## corresponds to higher life satisfaction (see the Codebook)
GSS$SATLIFE_R = 8 - GSS$SATLIFE
cor(GSS$SATLIFE_R, GSS$INCOME)
## [1] NA
However, when we run cor(GSS$SATLIFE_R, GSS$INCOME) we get
NA. Let’s check whether our variables have missing values.
To check the amount of missingness in each variable, we use the
table() and is.na() commands:
table(is.na(GSS$SATLIFE_R))
## 
## FALSE  TRUE 
##  1169  1179
table(is.na(GSS$INCOME))
## 
## FALSE  TRUE 
##  2257    91
We find that there is indeed missingness in each of the variables. How, then, can we write R code that skips the missing cases and calculates the correlation coefficient properly?
To make R run the cor() command using complete cases only, we
need to modify our command as follows:
cor(GSS$SATLIFE_R, GSS$INCOME, use = "complete.obs")
## [1] 0.2393682
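To see why this works, here is a toy sketch with made-up vectors (not the GSS data): a single NA in either vector makes cor() return NA, and use = "complete.obs" drops the incomplete rows first.

```r
# Hypothetical vectors with missing values (not the GSS data)
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 6, NA, 10)

table(is.na(x))                  # counts observed (FALSE) vs. missing (TRUE) entries
cor(x, y)                        # NA: a single missing value poisons the result
cor(x, y, use = "complete.obs")  # keeps only rows where both x and y are observed
```

With use = "complete.obs", rows 3 and 4 are dropped and the correlation is computed on the remaining three complete pairs.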
To show the relationship between two quantitative variables, we construct a scatter plot.
The command is:
plot(x, y)
where x and y are your vectors with the quantitative variables.
If your quantitative variables are within the dataset, you need to
specify the dataset first, type $, then the name of your
variables:
plot(GSS$SATLIFE_R, GSS$INCOME, 
     main = "This is the scatter plot\nfor life satisfaction and income", 
     xlab = "Life satisfaction", 
     ylab = "Income", 
     col = "lightblue", 
     pch = 4)
Recall that linear regression looks like this in general:
\(y = \alpha + \beta x\)
where \(\alpha\) is an intercept and \(\beta\) is a slope.
For example, if we are interested in the relationship between people’s height (y) and weight (x)
We sample a bunch of people and record their heights and weights
We run a linear regression and get: height = 3 + 10*weight
Then the value of the intercept \(\alpha\) would be 3 and the value of the slope \(\beta\) would be 10
If our data (the specific group of people we sample) and/or the variables (the things we are interested in) change, the values of \(\alpha\) and \(\beta\) would change
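The height/weight example above can be sketched with simulated data; the numbers below are invented purely for illustration, with the true intercept set to 3 and the true slope set to 10.

```r
set.seed(1)                              # make the simulation reproducible
weight <- runif(100, 50, 100)            # 100 hypothetical weights
height <- 3 + 10 * weight + rnorm(100)   # true intercept 3, true slope 10, plus noise
fit <- lm(height ~ weight)               # estimate the regression from the sample
coef(fit)                                # estimates come out close to 3 and 10
```

Rerunning with a different seed (a different "sample") gives slightly different estimates, which is exactly the point of the bullet above.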
In R, we can estimate the linear regression model like this:
lm(y ~ x)
R would report the intercept (\(\alpha\)) and slope (\(\beta\)) in the output
We can also save all the details of the regression model as an object in R:
model1 = lm(y ~ x)
By typing summary(model1), we can obtain a more detailed output:
summary(model1)
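As a standalone illustration of this save-and-summarize pattern (using R’s built-in mtcars data rather than the course data):

```r
# Fit a regression of fuel efficiency (mpg) on car weight (wt) and save it as an object
model1 <- lm(mpg ~ wt, data = mtcars)
model1           # short output: intercept and slope only
summary(model1)  # detailed output: std. errors, t values, p values, R-squared
```

Printing the object gives only the coefficients; summary() adds the inference table and fit statistics.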
Let’s say we are interested in whether income predicts life satisfaction.
We suspect that it does: higher income leads to higher life satisfaction (why?).
To test this out, I run a linear regression analysis:
lm(GSS$SATLIFE_R ~ GSS$INCOME)
Simple output:
## 
## Call:
## lm(formula = GSS$SATLIFE_R ~ GSS$INCOME)
## 
## Coefficients:
## (Intercept)   GSS$INCOME  
##      4.2206       0.1139
Detailed output using the summary() command:
summary(lm(GSS$SATLIFE_R ~ GSS$INCOME))
## 
## Call:
## lm(formula = GSS$SATLIFE_R ~ GSS$INCOME)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5871 -0.5871  0.2990  0.4129  2.6655 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.2206     0.1580  26.704  < 2e-16 ***
## GSS$INCOME    0.1139     0.0138   8.251 4.38e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.06 on 1120 degrees of freedom
##   (1226 observations deleted due to missingness)
## Multiple R-squared:  0.0573, Adjusted R-squared:  0.05646 
## F-statistic: 68.07 on 1 and 1120 DF,  p-value: 4.385e-16
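To interpret these coefficients, the fitted line predicts life satisfaction as intercept + slope × income. For example, at income level 10 (the numbers below are copied from the output above; the income level is arbitrary):

```r
a <- 4.2206   # intercept from the regression output above
b <- 0.1139   # slope from the regression output above
a + b * 10    # predicted life satisfaction score at income level 10
```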
There are two parts:
Part 1: Question 1-13
Part 2 (R part): Question 1-8
I will walk through parts 1 and 2 with you today
Part 1, Q2: Estimate the correlation for each scatterplot and describe it in words (4 points)
Use Excel or a calculator and show your intermediate steps.
Part 1, Q3: For each scatterplot, draw an approximate line of best fit. Describe why it is a good regression line. (8 points)
Use your best guess and draw the two regression lines.
Part 1, Q6: What is the slope of the regression line that describes how the total fertility rate changes in relation to the percent literate?
The regression equation is shown on top of the table. Just refer to it for the value of slope (b).
Part 1, Q8: What is the intercept of this regression line?
Part 1, Q13: Is it plausible that a rise in women’s literacy does cause fertility to decline? If it is plausible, does this analysis show that such a causal relationship exists? Why or why not? (3 points)
Use your common sense to judge whether the relationship is causal or not
Part 2, Q1: Is it plausible that a rise in women’s literacy does cause fertility to decline? If it is plausible, does this analysis show that such a causal relationship exists? Why or why not? (3 points)
Use the summary() function to check out the 5-number summary
Part 2, Q2: Create a scatter plot of each confidence variable versus education. Describe the relationship between the two variables. Comment on the direction, form, and strength of the relationship and identify any clear outliers. (20 points, 5 each)
Use the plot() function to create a scatter plot. For
example, your first scatter plot code might look like this:
plot(confdata$EDUC, confdata$CONFINAN)
Part 2, Q3: Calculate the correlations for each scatterplot. Are any of these surprising? (8 points)
cor(confdata$EDUC, confdata$CONFINAN)
## [1] 0.5499412
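A useful check for Q4: in a simple (one-predictor) regression, the squared correlation equals the R-squared that lm() reports. A generic sketch with simulated data (not the homework variables):

```r
set.seed(42)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)
r  <- cor(x, y)
r2 <- summary(lm(y ~ x))$r.squared
c(r^2, r2)   # the two numbers match
```

(Here, 0.5499412^2 ≈ 0.3024, which matches the Multiple R-squared in the example regression output for Q4.)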
Part 2, Q4: Run the linear regressions in the script. Which variable is best predicted by education? How do you know? Write a sentence interpreting the measure.
Example of the regression:
## 
## Call:
## lm(formula = confdata$CONFINAN ~ confdata$EDUC)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.06707 -0.05160 -0.01274  0.05888  0.08539 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.144537   0.057700   2.505   0.0277 *
## confdata$EDUC 0.009341   0.004095   2.281   0.0416 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06177 on 12 degrees of freedom
## Multiple R-squared:  0.3024, Adjusted R-squared:  0.2443 
## F-statistic: 5.203 on 1 and 12 DF,  p-value: 0.04161