S371: Lab 7

Lab Instructor: Katya Baldina

2023-10-04

Announcement

HW 4 IS DUE TOMORROW (OCT.5)!

Today

Central Limit Theorem

Let’s draw one simple random sample of size n and compute its mean:

Note: I will draw the sample from a normal distribution with mean 5 and standard deviation 10 (here n = 1000).

x = rnorm(n=1000, mean=5, sd = 10)
mean(x)
## [1] 5.111798
sd(x)
## [1] 9.893484

This sample mean is the first data point of the sampling distribution.

x2 = rnorm(1000, 5, 10)
mean(x2)
## [1] 5.112765
sd(x2)
## [1] 10.19795

This sample mean is the second data point of the sampling distribution.

Let’s draw 1000 such samples and plot their means:

means_list = numeric(1000) # Vector that will store the mean of each sample
sample_list = vector("list", 1000) # List that will store the samples themselves
for (s in 1:1000) {

  x = rnorm(1000, 5, 10) # Draw a simple random sample of size 1000
  means_list[s] = mean(x) # Save the mean of this sample
  sample_list[[s]] = x # Save the sample itself

}

hist(means_list, main="Sampling distribution of sample means", xlab = "")

Central Limit Theorem

Now let’s compare a few individual samples with the collection of 1000 sample means:

Property   Sample 1   Sample 100   Sample 599   The 1000 sample means
Mean       5.29       5.47         5.29         5.01
SD         10.08      9.59         10.06        0.32
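The SD of 0.32 in the last column is much smaller than the SD of any single sample. Theory says the SD of the sample means should be close to the population SD divided by the square root of the sample size. Here is a minimal sketch to check that, assuming the same setup as above (population mean 5, SD 10, samples of size 1000). The seed and the names `n_obs`, `n_samples`, and `sim_means` are my own choices for illustration.

```r
set.seed(371) # arbitrary seed, only so the sketch is reproducible

n_obs = 1000     # size of each sample
n_samples = 1000 # number of samples to draw

# Draw the samples and keep only their means
sim_means = replicate(n_samples, mean(rnorm(n_obs, mean = 5, sd = 10)))

sd(sim_means)    # SD of the simulated sample means
10 / sqrt(n_obs) # theoretical value: population SD / sqrt(n), about 0.316
```

The two numbers should be close to each other and to the 0.32 in the table.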

Central Limit Theorem

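As the sample size n increases, the sampling distribution of the sample mean becomes approximately normal, even when the population itself is not normal. The sampling distribution is centered at the population mean, so as we draw more samples, the mean of our simulated sample means gets closer to the population mean.

Note: n should be large enough (a common rule of thumb is 30 or more).

Similar results hold for some other statistics as well, such as sample proportions.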
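To see that the population does not have to be normal, here is a small sketch that repeats the simulation with a right-skewed population. The exponential population (rate 1, so population mean 1), the sample sizes, and the helper name `draw_means` are assumptions made for illustration; they are not part of the lab code.

```r
set.seed(371) # arbitrary seed for reproducibility

# Hypothetical helper: draw n_samples samples of size n from an
# exponential population (rate = 1, population mean = 1) and return their means
draw_means = function(n, n_samples = 1000) {
  replicate(n_samples, mean(rexp(n, rate = 1)))
}

means_n2   = draw_means(2)   # very small samples
means_n30  = draw_means(30)  # rule-of-thumb sample size
means_n500 = draw_means(500) # large samples

par(mfrow = c(1, 3)) # three histograms side by side
hist(means_n2,   main = "n = 2",   xlab = "Sample mean")
hist(means_n30,  main = "n = 30",  xlab = "Sample mean")
hist(means_n500, main = "n = 500", xlab = "Sample mean")
par(mfrow = c(1, 1)) # reset the plotting layout
```

The histogram for n = 2 should still look skewed, while the ones for n = 30 and n = 500 should look more and more bell-shaped and more tightly centered on 1.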

Central Limit Theorem: why bother?

Imagine a presidential election is under way, and we want to know before it ends who is more likely to win: Obama or Romney (2012)?

Gallup predicted that Romney would win with 49% of the vote, with a margin of error of ±2%.

The real-world outcome (the population proportion of people who voted for Romney) was 47.2%; Obama won with 51.1%.

In other words:

Candidate   Sample Mean (Gallup Poll)   Population Mean (Election Results)
Romney      49%                         47.2%
Obama       48%                         51.1%

What is margin of error?

How did they make their prediction? (“Why they failed” is another question beyond the scope of this class; come to graduate school if you wanna know the answer ;))

Sampling Distribution

In reality:

What we don’t know:

What we know:

\(\text{Standard Error} = \frac{\text{Sample SD}}{\sqrt{n}}\)

Central Limit Theorem: Population mean vs. Sample mean vs. Sampling mean

In the real world, we can predict election results using a single sample of people (or other units of analysis) together with the properties of the CENTRAL LIMIT THEOREM and the SAMPLING DISTRIBUTION.

We pretend that the sample mean is the population mean.

Then we know the properties of the sampling distribution:

Estimate the population mean (or proportion in case of election results)

Based on what we know (the mean, the standard error, and the probabilities of the normal distribution), we can find the range of plausible values for the election result with 95% confidence:

Estimate the population mean

If we knew the population mean, the sampling distribution would be the blue distribution.

However, we only know the sample mean, so we pretend that the sampling distribution is the red distribution.

Estimate the population mean

In reality, we know only the sample mean.

Estimate the population mean

What if we draw another sample and plot it? Let’s say we collected another sample of 1000 Americans and asked whom they voted for in the election.

We find that 55% of our new sample voted for Romney (the SD of the sampling distribution is 2%).

Does the shaded area include the population mean? No.

How often would this happen if we randomly sampled 1000 Americans an infinite number of times? About 5% of the time.
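The 5% figure can be checked by simulation. The sketch below assumes the population proportion is 0.472 (Romney's actual share quoted earlier), polls of 1000 people, and an interval of sample mean ± 1.96 standard errors. The number of repetitions (10000, standing in for "infinite"), the seed, and all variable names are illustrative choices.

```r
set.seed(371) # arbitrary seed for reproducibility

p_true  = 0.472 # assumed population proportion
n_obs   = 1000  # people per poll
n_polls = 10000 # number of repeated polls
z       = 1.96  # critical value for a 95% interval

# For each simulated poll, record whether its interval misses p_true
misses = replicate(n_polls, {
  poll  = rbinom(n_obs, 1, p_true) # 1 = voted for Romney
  se    = sd(poll) / sqrt(n_obs)   # standard error
  lower = mean(poll) - z * se
  upper = mean(poll) + z * se
  p_true < lower || p_true > upper # TRUE when the interval misses
})

mean(misses) # share of polls whose interval misses; should be close to 0.05
```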

Estimate the Population Mean

How would we find the range of values so that the range would include the population mean with a very high chance?

sample mean ± (z*)(standard error)

z* - critical value

(z*)(standard error) - margin of error

\(\text{Standard Error} = \frac{\text{Sample SD}}{\sqrt{n}}\)
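As a sketch of how these pieces fit together, the function below computes the interval for any numeric sample. The function `ci_mean()` and its argument `conf_level` are hypothetical names introduced here, not part of the lab or HW files. It uses `qnorm()` to look up the critical value instead of typing 1.96 by hand.

```r
# Hypothetical helper (not from the lab materials):
# confidence interval for a mean using the normal critical value
ci_mean = function(x, conf_level = 0.95) {
  n      = length(x)
  z_star = qnorm(1 - (1 - conf_level) / 2) # critical value, about 1.96 for 95%
  se     = sd(x) / sqrt(n)                 # standard error
  moe    = z_star * se                     # margin of error
  c(lower = mean(x) - moe, estimate = mean(x), upper = mean(x) + moe)
}

set.seed(371) # arbitrary seed for reproducibility
x_demo = rnorm(1000, mean = 5, sd = 10) # same population as in the CLT example

ci_mean(x_demo)                    # 95% CI
ci_mean(x_demo, conf_level = 0.99) # 99% CI, wider than the 95% CI
```

A higher confidence level gives a larger critical value, so the margin of error grows and the interval gets wider.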

95% Confidence Interval (CI) in R

Let’s calculate the 95% CI for Romney’s election result.

We need to calculate:

set.seed(123)
x_sampl = rbinom(1000, 1, 0.49) # First, draw a random sample of 1000 Americans (1 = voted for Romney)
z = 1.96 # Critical value for a 95% CI
mean(x_sampl) # Sample mean
## [1] 0.484
sd(x_sampl)/sqrt(1000) # Standard error
## [1] 0.0158112
mean(x_sampl)+z*(sd(x_sampl)/sqrt(1000)) # Upper bound of the 95% CI
## [1] 0.5149899
mean(x_sampl)-z*(sd(x_sampl)/sqrt(1000)) # Lower bound of the 95% CI
## [1] 0.4530101

HW4 Guide

There are two parts:

HW4 Guide

In case you want to know what these lines do:

Lines 20-24: define the plot.std.normal() function (it takes no arguments), which plots a standard normal distribution from -4 to 4, filled in light blue.

Lines 30-34: define the shade.std.normal() function, which shades the area under the curve based on lb, ub, or both. The color can be chosen.

Lines 37-40: define the mark.z.value() function, which draws a vertical line at a user-specified z-value.
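If you are curious how functions like these can be built in base R, here is a rough sketch. These are stand-ins written for illustration, not the code in the HW4 file, which may differ. The `sketch_` names, the default shading color, and the dashed line style are all assumptions.

```r
# Stand-in for plot.std.normal(): draw the standard normal curve
# from -4 to 4 and fill it in light blue
sketch_plot_std_normal = function() {
  z = seq(-4, 4, length.out = 400)
  plot(z, dnorm(z), type = "l", xlab = "z", ylab = "Density")
  polygon(c(-4, z, 4), c(0, dnorm(z), 0), col = "lightblue")
}

# Stand-in for shade.std.normal(): shade the area under the curve
# between lb and ub (a missing bound defaults to the edge of the plot)
sketch_shade_std_normal = function(lb = -4, ub = 4, col = "blue") {
  z = seq(lb, ub, length.out = 400)
  polygon(c(lb, z, ub), c(0, dnorm(z), 0), col = col)
}

# Stand-in for mark.z.value(): draw a vertical dashed line at z
sketch_mark_z_value = function(z) {
  abline(v = z, lty = 2)
}

# The shading and marking functions add to an existing plot,
# so the plotting function has to be called first
sketch_plot_std_normal()
sketch_shade_std_normal(lb = -0.5, ub = 1, col = "red")
sketch_mark_z_value(z = 1)
```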

HW4 Guide

plot.std.normal() function

You cannot type anything within the parentheses

You just need to type plot.std.normal()

Then, a standard normal distribution will be drawn (filled with light blue color)

HW4 Guide

shade.std.normal() function

It can be typed in the following ways:

shade.std.normal(lb=X) 
shade.std.normal(ub=Y) 
shade.std.normal(lb=X, ub=Y)

Option in shade.std.normal() command:

shade.std.normal(lb=X, col="red") #You can change the color of the shaded area

Note: call this function only after plot.std.normal(), since it adds to the existing plot.

HW4 Guide

The first way of typing shade.std.normal():

shade.std.normal(lb=X) #lb stands for lower bound

The shaded area will be from X to the right end:

plot.std.normal()
shade.std.normal(lb=1)

HW4 Guide

The second way of typing shade.std.normal():

shade.std.normal(ub=Y) #ub stands for upper bound

The shaded area will be from the left end to Y:

plot.std.normal()
shade.std.normal(ub=1)

HW4 Guide

The third way of typing shade.std.normal():

shade.std.normal(lb=X, ub=Y)

The shaded area will be from X to Y:

plot.std.normal()
shade.std.normal(lb=-0.5, ub=1)

The col option works like the color option in other R graphs.

HW4 Guide

mark.z.value() function:

mark.z.value(z=X) #This function marks the z-value at X

plot.std.normal()
shade.std.normal(lb=-0.5, ub=1)
mark.z.value(z=1)
mark.z.value(z=-0.5)

Note: call this function only after plot.std.normal(), since it adds to the existing plot.
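The shaded areas correspond to probabilities under the standard normal curve, which base R's `pnorm()` can compute. This is a small sketch using the same bounds as the examples above; the variable names are illustrative.

```r
# Area between z = -0.5 and z = 1,
# the region shaded by shade.std.normal(lb=-0.5, ub=1)
area_between = pnorm(1) - pnorm(-0.5)
area_between # about 0.53

# Area to the right of z = 1, the region shaded by shade.std.normal(lb=1)
area_right = 1 - pnorm(1)
area_right

# Area to the left of z = 1, the region shaded by shade.std.normal(ub=1)
area_left = pnorm(1)
area_left
```

The left and right areas add up to 1, because together they cover the whole curve.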

HW4 Guide

Note for shade.std.normal() and mark.z.value() functions:
