S371: Lab 6

Lab Instructor: Katya Baldina (baldina@iu.edu)

2023-09-27

Announcement

HW 3 IS DUE SATURDAY (SEP.30)!

Today

Bivariate graphs

Scatter plot

Sampling

Central Limit Theorem

HW3 Guide

Bivariate plots

Variable	Quantitative	Categorical
Quantitative	Scatter plot	Box plot, bar chart
Categorical	Box plot, bar chart	Table or bar chart

Today we will cover bivariate graphs for quantitative-quantitative variables only.

Scatter plot

To visualize the relationship between two quantitative variables, type:

plot(x, y) #x and y are names for your variables

The plot() command in R is a generic plot command: it can produce many kinds of plots depending on what you give it (e.g., the curve of a distribution function).
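As a concrete illustration, here is a minimal sketch that uses R's built-in mtcars dataset. This dataset is not part of the lab materials; it is used here only because it ships with R, and the choice of weight as x and miles per gallon as y is an assumption made for the example.

```r
# Illustrative data: the built-in mtcars dataset (an assumption, not the lab's data)
x <- mtcars$wt   # car weight (in 1000 lbs)
y <- mtcars$mpg  # fuel efficiency (miles per gallon)

# Basic scatter plot: x on the horizontal axis, y on the vertical axis
plot(x, y)
```

In this dataset, heavier cars tend to have lower miles per gallon, so the dots should slope downward from left to right.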

Scatter plot

There are many options in plot():

par(mfrow = c(2, 3)) #arrange the following plots in a grid with 2 rows and 3 columns
plot(x, y, main = "This is the main title") #set the main title
plot(x, y, xlab = "This is the x-axis title") #set the x-axis title
plot(x, y, ylab = "This is the y-axis title") #set the y-axis title
plot(x, y, col = "red") #set the color of the dots in the scatter plot
plot(x, y, pch = 0) #set the shape of the dots in the scatter plot
#pch should be an integer from 0 to 25. If you don’t specify pch,
#R uses the default value of 1 (an open circle)

Scatter plot

To combine multiple options, you just need to put all the options you want within the parentheses:

Example: I want to specify the main title, x-axis title, y-axis title, dot color, and dot shape in the scatter plot

plot(x, y, main = "This is the main title", 
     xlab = "This is the x-axis title", 
     ylab = "This is the y-axis title", 
     col = "red", 
     pch = 0)

Save the plot as a separate file

Option 1: Right-click the plot -> copy and paste it into another program (e.g., Microsoft Word) -> save as a new file
Option 2: Use the png() function and the dev.off() function:

Example: I want to save the scatter plot as a separate png file called scatter.png

#Three separate lines of code (they have to be in this order!)

#The first line asks R to create a new png file called scatter.png in the working directory
#with dimensions of 1200 x 800 pixels. All the graphs produced after this line of code
#will be put into the scatter.png file.
#(you can think of this as a “start recording” button)
png('scatter.png', width = 1200, height = 800) 

#The second line is just the usual scatter plot function
plot(x,y) 

#The third line is the dev.off() function. 
#You don’t need to put anything within the parentheses. 
#Any graphs produced between the first line (png() function) 
#and this line will be put into the scatter.png file (“stop recording button”)
dev.off()
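To confirm that the file was actually written, a minimal sketch like the following can help. The data values are made up for illustration, and the file is written to a temporary folder (via tempdir()) rather than the working directory, which is a choice made only for this example.

```r
# Hypothetical example data (made up for illustration)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Write to a temporary folder so the example does not clutter the working directory
out_file <- file.path(tempdir(), "scatter.png")

png(out_file, width = 1200, height = 800)  # "start recording"
plot(x, y, main = "Example scatter plot")
dev.off()                                  # "stop recording" and write the file

file.exists(out_file)  # TRUE if the file was written
getwd()                # the folder where a plain 'scatter.png' would be saved
```

If you forget dev.off(), the file stays open and may appear empty or incomplete.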

Sampling

Why sample?

We cannot ask every single person in our population (let’s say we are interested in the income distribution of all USA residents, ~350 million people).

So we need a feasible way to learn about the population.

We draw a sample that resembles the whole population.

Why does this work? For a random sample, the law of large numbers and the Central Limit Theorem tell us how closely sample statistics track the population.

One condition: the sample must be random so that it represents (resembles) the population.

Population vs. Sample vs. Sampling distribution

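Population: The whole group of people you want to know about

Sample: The part of the population you actually observe

Sampling: The process of selecting a sample

Sampling distribution: The distribution of a statistic (e.g., the mean) across many samples drawn from the same population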

Population vs. Sample vs. Sampling distribution

Properties	Population	Sample	Sampling distribution (of sample means)
Mean	12.01	12.23	12.04
Standard Deviation	5.3	5.52	0.17
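A table like this can be produced by simulation. The sketch below is not the code behind the numbers above: the population (a right-skewed gamma distribution), the sample size of 1,000, and the 1,000 repeated samples are all assumptions chosen for illustration, so the output will differ from the table.

```r
set.seed(371)  # arbitrary seed so the results are reproducible

# Hypothetical population: 100,000 values from a right-skewed gamma distribution
population <- rgamma(100000, shape = 5, rate = 0.42)

# One random sample of 1,000 values from the population
one_sample <- sample(population, size = 1000)

# Sampling distribution: the means of 1,000 separate random samples of size 1,000
sample_means <- replicate(1000, mean(sample(population, size = 1000)))

# Mean and standard deviation of each, side by side
results <- data.frame(
  Population            = c(mean(population), sd(population)),
  Sample                = c(mean(one_sample), sd(one_sample)),
  Sampling_distribution = c(mean(sample_means), sd(sample_means)),
  row.names             = c("Mean", "Standard Deviation")
)
round(results, 2)
```

The pattern to look for: all three means are close to each other, the sample SD is close to the population SD, and the SD of the sampling distribution is much smaller than both.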

Sampling

As you can see, if the sample is random, its mean is close to the population mean

How close?

• According to the law of large numbers, if you have a large sample size, the sample mean should be close to the population mean

• The larger the sample size, the closer the sample mean tends to be to the population mean
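The law of large numbers can be checked with a short simulation. This is a sketch under assumptions: the population is simulated from an exponential distribution with a mean of about 12, and the sample sizes are arbitrary.

```r
set.seed(371)  # arbitrary seed so the results are reproducible

# Hypothetical population with a mean of about 12
population <- rexp(100000, rate = 1 / 12)

# Draw one random sample at each sample size and record its mean
sample_sizes <- c(10, 100, 1000, 10000)
sample_means <- sapply(sample_sizes, function(n) mean(sample(population, size = n)))

# Compare each sample mean with the population mean
data.frame(
  n = sample_sizes,
  sample_mean = round(sample_means, 2),
  distance_from_population_mean = round(abs(sample_means - mean(population)), 2)
)
```

Because each row is based on a single random sample, the distances are not guaranteed to shrink at every step, but they tend to get smaller as n grows.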

Central Limit Theorem

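As the sample size (n) increases, the sampling distribution of the sample mean becomes approximately normal. Also, the mean of the sampling distribution equals the population mean, so the average of many sample means approaches the population mean as the number of samples increases.

Note: n (the size of each sample) should be large enough; a common rule of thumb is 30+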

Central Limit Theorem

No matter how skewed or weird the original distribution is (as long as its variance is finite), the sampling distribution of the mean is still approximately normal when n is large enough.
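A small simulation makes this visible. The sketch below assumes a heavily right-skewed population (simulated from an exponential distribution), a sample size of 50, and 2,000 repeated samples; all of these values are arbitrary choices for illustration.

```r
set.seed(371)  # arbitrary seed so the results are reproducible

# Hypothetical population that is heavily right-skewed
population <- rexp(100000, rate = 1)

# Sampling distribution: the means of 2,000 random samples of size 50
sample_means <- replicate(2000, mean(sample(population, size = 50)))

# Plot the two distributions side by side
par(mfrow = c(1, 2))
hist(population, main = "Population (skewed)", xlab = "Value")
hist(sample_means, main = "Sample means (n = 50)", xlab = "Sample mean")
par(mfrow = c(1, 1))  # reset the plotting layout
```

The left histogram should show a long right tail, while the right histogram should look roughly bell-shaped and centered near the population mean.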

Sampling Distribution

In reality:

What we don’t know:

population mean

What we know:

The sampling distribution is approximately normal
The standard error (standard deviation of the sampling distribution) can be estimated using the standard deviation of the sample:

\(\text{Standard Error} = \frac{\text{Sample SD}}{\sqrt{n}}\)
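To see that this estimate is reasonable, the sketch below compares the standard error computed from one sample with the standard deviation of many sample means. It assumes a simulated normal population with mean 12 and SD 5; in real research only the single sample is available, so the second calculation is for checking only.

```r
set.seed(371)  # arbitrary seed so the results are reproducible

# Hypothetical population (in practice we never see the whole population)
population <- rnorm(100000, mean = 12, sd = 5)

n <- 1000
one_sample <- sample(population, size = n)

# Estimated standard error from a single sample: sample SD / sqrt(n)
standard_error <- sd(one_sample) / sqrt(n)

# For comparison only: the SD of the means of 1,000 repeated samples
sd_of_means <- sd(replicate(1000, mean(sample(population, size = n))))

standard_error
sd_of_means
```

The two numbers should be close to each other, which is why one sample is enough to estimate how much sample means vary.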

HW3 Guide

When you load HW3.RData into RStudio, you will see the function auto.sampler in your Environment pane:

load('HW3.RData')

Q5 Guide

If you pass only one number within the parentheses, it will give you one sample of that size:

x = auto.sampler(10000)

## Generating a sample of size 10000

Using this sample, write code in your HW 3.R script that produces:

Histogram of this sample (hist() function)
Mean of this sample (mean() function)
Standard deviation of this sample (sd() function)
Standard deviation of the sampling distribution, i.e., the standard error (Hint: you will need the sqrt() function)

Q6, Q7, Q8 Guide

If you want to get the sample means for your four samples, you need to write:

means = auto.sampler(n = sample_size, samples = number_of_samples)

Replace sample_size and number_of_samples with the values that the HW asks for.