HW8 IS DUE THURSDAY (Nov. 16) AT 11:59 PM!
Did you pick your variables? Please talk to your neighbor about the variables you’ve picked, or pick them now.
Choose 2 dependent and 2 independent variables (four in total)
2 dependent variables: one quantitative and one categorical
2 independent variables: one quantitative and one categorical
If you want, you may create your own secondary variables
To correlate two variables, type:
cor(x, y)
where
x and y are your vectors with the quantitative variables
If your quantitative variables are within a dataset, you need to specify the dataset first, type $, then the name of your variables:
cor(dataset$variable1, dataset$variable2)
Let’s say we are interested in the relationship between income and life satisfaction:
We might suspect that higher income will be associated with higher life satisfaction (a positive correlation):
load("project.RData")
## Recode the variable so that a higher score
## corresponds to higher life satisfaction (see the Codebook)
GSS$SATLIFE_R = 8 - GSS$SATLIFE
cor(GSS$SATLIFE_R, GSS$INCOME)
## [1] NA
However, when we run cor(GSS$SATLIFE_R, GSS$INCOME) we get
NA. Let’s check whether our variables have missing values.
To check the amount of missingness in each variable, we use the
table() and is.na() commands:
table(is.na(GSS$SATLIFE_R))
## 
## FALSE  TRUE 
##  1169  1179
table(is.na(GSS$INCOME))
## 
## FALSE  TRUE 
##  2257    91
We find that there is indeed missingness in each of the variables. How, then, can we write R code that skips the missing cases and calculates the correlation coefficient properly?
To make R run the cor() command using complete cases only, we
need to modify our command as follows:
cor(GSS$SATLIFE_R, GSS$INCOME, use = "complete.obs")
## [1] 0.2393682
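To see why this works, here is a toy sketch with made-up vectors (not the GSS data): a single NA in either vector makes cor() return NA, and use = "complete.obs" drops the incomplete rows first.

```r
# Hypothetical vectors with missing values (not the GSS data)
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 6, NA, 10)

table(is.na(x))                  # counts observed (FALSE) vs. missing (TRUE) entries
cor(x, y)                        # NA: a single missing value poisons the result
cor(x, y, use = "complete.obs")  # keeps only rows where both x and y are observed
```

With use = "complete.obs", rows 3 and 4 are dropped and the correlation is computed on the remaining three complete pairs.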
To show the relationship between two quantitative variables, we construct a scatter plot.
The command is:
plot(x, y)
where x and y are your vectors with the quantitative variables.
If your quantitative variables are within the dataset, you need to
specify the dataset first, type $, then the name of your
variables:
plot(GSS$SATLIFE_R, GSS$INCOME, 
     main = "This is the scatter plot\nfor life satisfaction and income", 
     xlab = "Life satisfaction", 
     ylab = "Income", 
     col = "lightblue", 
     pch = 4)
Recall that linear regression looks like this in general:
\(y = \alpha + \beta x\)
where \(\alpha\) is an intercept and \(\beta\) is a slope.
For example, if we are interested in the relationship between people’s height (y) and weight (x)
We sample a bunch of people and record their heights and weights
We run a linear regression and get: height = 3 + 10*weight
Then the value of the intercept \(\alpha\) would be 3 and the value of the slope \(\beta\) would be 10
If our data (the specific group of people we sample) and/or the variables (the things we are interested in) change, the values of \(\alpha\) and \(\beta\) would change
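The height/weight example above can be sketched with simulated data; the numbers below are invented purely for illustration, with the true intercept set to 3 and the true slope set to 10.

```r
set.seed(1)                              # make the simulation reproducible
weight <- runif(100, 50, 100)            # 100 hypothetical weights
height <- 3 + 10 * weight + rnorm(100)   # true intercept 3, true slope 10, plus noise
fit <- lm(height ~ weight)               # estimate the regression from the sample
coef(fit)                                # estimates come out close to 3 and 10
```

Rerunning with a different seed (a different "sample") gives slightly different estimates, which is exactly the point of the bullet above.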
In R, we can estimate the linear regression model like this:
lm(y ~ x)
R would report the intercept (\(\alpha\)) and slope (\(\beta\)) in the output
We can also save all the details of the regression model as an object in R:
model1 = lm(y ~ x)
By typing summary(model1), we can obtain a more detailed output:
summary(model1)
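As a standalone illustration of this save-and-summarize pattern (using R’s built-in mtcars data rather than the course data):

```r
# Fit a regression of fuel efficiency (mpg) on car weight (wt) and save it as an object
model1 <- lm(mpg ~ wt, data = mtcars)
model1           # short output: intercept and slope only
summary(model1)  # detailed output: std. errors, t values, p values, R-squared
```

Printing the object gives only the coefficients; summary() adds the inference table and fit statistics.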
Let’s say we are interested in whether income predicts life satisfaction.
We suspect that it does: higher income leads to higher life satisfaction (why?).
To test this out, I run a linear regression analysis:
lm(GSS$SATLIFE_R ~ GSS$INCOME)
Simple output:
## 
## Call:
## lm(formula = GSS$SATLIFE_R ~ GSS$INCOME)
## 
## Coefficients:
## (Intercept)   GSS$INCOME  
##      4.2206       0.1139
Detailed output using the summary() command:
summary(lm(GSS$SATLIFE_R ~ GSS$INCOME))
## 
## Call:
## lm(formula = GSS$SATLIFE_R ~ GSS$INCOME)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5871 -0.5871  0.2990  0.4129  2.6655 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.2206     0.1580  26.704  < 2e-16 ***
## GSS$INCOME    0.1139     0.0138   8.251 4.38e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.06 on 1120 degrees of freedom
##   (1226 observations deleted due to missingness)
## Multiple R-squared:  0.0573, Adjusted R-squared:  0.05646 
## F-statistic: 68.07 on 1 and 1120 DF,  p-value: 4.385e-16
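To interpret these coefficients, the fitted line predicts life satisfaction as intercept + slope × income. For example, at income level 10 (the numbers below are copied from the output above; the income level is arbitrary):

```r
a <- 4.2206   # intercept from the regression output above
b <- 0.1139   # slope from the regression output above
a + b * 10    # predicted life satisfaction score at income level 10
```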
There are two parts:
Part 1: Question 1-13
Part 2 (R part): Question 1-8
I will walk through parts 1 and 2 with you today
Part 1, Q2: Estimate the correlation for each scatterplot and describe it in words (4 points)
Use Excel or a calculator and show your intermediate steps.
Part 1, Q3: For each scatterplot, draw an approximate line of best fit. Describe why it is a good regression line. (8 points)
Use your best guess and draw the two regression lines.
Part 1, Q6: What is the slope of the regression line that describes how the total fertility rate changes in relation to the percent literate?
The regression equation is shown on top of the table. Just refer to it for the value of slope (b).
Part 1, Q8: What is the intercept of this regression line?
Part 1, Q13: Is it plausible that a rise in women’s literacy does cause fertility to decline? If it is plausible, does this analysis show that such a causal relationship exists? Why or why not? (3 points)
Use your common sense to judge whether the relationship is causal or not
Part 2, Q1: Is it plausible that a rise in women’s literacy does cause fertility to decline? If it is plausible, does this analysis show that such a causal relationship exists? Why or why not? (3 points)
Use the summary() function to check out the 5-number summary
Part 2, Q2: Create a scatter plot of each confidence variable versus education. Describe the relationship between the two variables. Comment on the direction, form, and strength of the relationship and identify any clear outliers. (20 points, 5 each)
Use the plot() function to create a scatter plot. For
example, your first scatter plot code might look like this:
plot(confdata$EDUC, confdata$CONFINAN)
Part 2, Q3: Calculate the correlations for each scatterplot. Are any of these surprising? (8 points)
cor(confdata$EDUC, confdata$CONFINAN)
## [1] 0.5499412
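A useful check for Q4: in a simple (one-predictor) regression, the squared correlation equals the R-squared that lm() reports. A generic sketch with simulated data (not the homework variables):

```r
set.seed(42)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)
r  <- cor(x, y)
r2 <- summary(lm(y ~ x))$r.squared
c(r^2, r2)   # the two numbers match
```

(Here, 0.5499412^2 ≈ 0.3024, which matches the Multiple R-squared in the example regression output for Q4.)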
Part 2, Q4: Run the linear regressions in the script. Which variable is best predicted by education? How do you know? Write a sentence interpreting the measure.
Example of the regression:
## 
## Call:
## lm(formula = confdata$CONFINAN ~ confdata$EDUC)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.06707 -0.05160 -0.01274  0.05888  0.08539 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.144537   0.057700   2.505   0.0277 *
## confdata$EDUC 0.009341   0.004095   2.281   0.0416 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06177 on 12 degrees of freedom
## Multiple R-squared:  0.3024, Adjusted R-squared:  0.2443 
## F-statistic: 5.203 on 1 and 12 DF,  p-value: 0.04161