Central Limit Theorem

Let’s draw one simple random sample of size n and compute the mean:

Note: I will draw the sample from a normal distribution with mean 5 and standard deviation 10.

x = rnorm(n=1000, mean=5, sd = 10)
mean(x)

## [1] 5.111798

sd(x)

## [1] 9.893484

This sample mean is the first data point of the sampling distribution.

x2 = rnorm(1000, 5, 10)
mean(x2)

## [1] 5.112765

sd(x2)

## [1] 10.19795

This sample mean is the second data point of the sampling distribution.

Let’s draw 1000 such samples and plot their means:

means_list = numeric(1000) # Vector that will hold the mean of each sample, to use later
sample_list = vector("list", 1000) # List that will hold the samples themselves
for (s in 1:1000) {
  x = rnorm(1000, 5, 10) # Draw a simple random sample
  means_list[s] = mean(x) # Save the mean of this sample
  sample_list[[s]] = x # Save the sample itself
}

hist(means_list, main="Sampling distribution of sample means", xlab = "")
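
The same simulation can be written more compactly. This is a minimal sketch using base R's replicate(); the seed and the names n_samples, sample_size, and sample_means are illustrative choices, not part of the lab code:

```r
set.seed(371)  # arbitrary seed (an assumption), only so the run is reproducible

n_samples   <- 1000  # how many samples to draw
sample_size <- 1000  # observations per sample

# replicate() evaluates the expression n_samples times and collects the results
sample_means <- replicate(n_samples, mean(rnorm(sample_size, mean = 5, sd = 10)))

mean(sample_means)  # should be close to the population mean, 5
sd(sample_means)    # should be close to 10 / sqrt(1000), about 0.32
```

Passing sample_means to hist() gives the same kind of plot as the loop version.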

Central Limit Theorem

Now let’s look at some samples and the sample of means from these samples:

Property	Sample 1	Sample 100	Sample 599	Mean of 1000 sample means
Mean	5.29	5.47	5.29	5.01
SD	10.08	9.59	10.06	0.32

Central Limit Theorem

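As the sample size n increases, the sampling distribution of the sample mean becomes approximately normal, whatever the shape of the population distribution. The mean of the sampling distribution equals the population mean, so the average of the simulated sample means gets closer to the population mean as we draw more samples.

Note: the sample size n should be large enough (a common rule of thumb is 30+).

The same idea applies to other statistics as well, such as sample proportions.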
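
The theorem does not require the population itself to be normal. As a rough illustration, the sketch below draws samples from a right-skewed exponential population and compares sample means for a small and a larger n. The seed, the helper functions skewness() and draw_means(), and the choice of an exponential population are all assumptions made for this example:

```r
set.seed(371)  # arbitrary seed (an assumption)

# Sample skewness: near 0 for a symmetric, bell-shaped distribution
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Draw `reps` samples of size n from a right-skewed exponential population
# (population mean = 1) and return the mean of each sample
draw_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

means_n2  <- draw_means(2)   # very small samples
means_n50 <- draw_means(50)  # larger samples

skewness(means_n2)   # clearly positive: still skewed
skewness(means_n50)  # much closer to 0: closer to a normal shape
mean(means_n50)      # close to the population mean, 1
```

Calling hist(means_n2) and hist(means_n50) should show the same contrast visually.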

Central Limit Theorem: why bother?

Imagine a presidential election is under way, and we want to know before it ends who is more likely to win: Obama or Romney (2012)?

Gallup predicted a Romney win with 49% of the vote and a margin of error of ±2%.

The real-world outcome (the population proportion of people who voted for Romney) was 47.2%; Obama won with 51.1%.

In other words:

Candidate	Sample Mean (Gallup Poll)	Population Mean (Election Results)
Romney	49%	47.2%
Obama	48%	51.1%

What is margin of error?

How did they make their prediction? (“Why they failed” is another question beyond the scope of this class; come to graduate school if you wanna know the answer ;))

Sampling Distribution

In reality:

What we don’t know:

population mean (election results; and we want to know it)

What we know:

The sampling distribution is approximately normal
The standard error (standard deviation of the sampling distribution) can be estimated using the standard deviation of the sample:

\(\text{Standard Error} = \frac{\text{Sample SD}}{\sqrt{n}}\)
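
To see that this estimate is reasonable, the sketch below compares the standard error estimated from one sample with the standard deviation of many simulated sample means. It reuses the normal population from the earlier slides (mean 5, SD 10); the seed and variable names are assumptions for illustration:

```r
set.seed(371)  # arbitrary seed (an assumption)
n <- 1000

# What we can do in practice: estimate the standard error from ONE sample
one_sample  <- rnorm(n, mean = 5, sd = 10)
se_estimate <- sd(one_sample) / sqrt(n)

# What we can only do in a simulation: draw many samples and measure
# the spread of their means directly
sim_means    <- replicate(2000, mean(rnorm(n, mean = 5, sd = 10)))
se_simulated <- sd(sim_means)

se_estimate   # both values should be near 10 / sqrt(1000), about 0.32
se_simulated
```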

Central Limit Theorem: Population mean vs. Sample mean vs. Sampling mean

In the real world, we can predict election results using a single sample of people (or another unit of analysis) and the properties of the CENTRAL LIMIT THEOREM and the SAMPLING DISTRIBUTION.

We pretend that the sample mean is the population mean.

Then we know the properties of the sampling distribution:

mean (from sample mean)
standard deviation (from the standard error formula)
sampling distribution is approximately normal

Estimate the population mean (or proportion in case of election results)

Based on what we know (the mean, the standard error, and the probabilities of the normal distribution), we can find a range of values that should contain the election result with 95% confidence:

Estimate the population mean

If we knew the population mean, the sampling distribution would be the blue distribution.

However, we only know the sample mean, so we pretend that the sampling distribution is the red distribution.

Estimate the population mean

In reality, we know only the sample mean.

Estimate the population mean

What if we draw another sample and plot it? Let’s say we collected another sample of 1000 Americans and asked whom they voted for in the election.

Suppose 55% of our new sample voted for Romney (SD is 2%).

Does the shaded area include the population mean? No.

How often would this happen if we randomly sampled 1000 Americans an infinite number of times? About 5% of the time.
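
The 5% figure can be checked by simulation. In this sketch we pretend to know the population proportion (Romney's actual 47.2% from the earlier slide), draw many samples of 1000, and count how often the interval sample mean ± 1.96 standard errors misses it. The seed, the number of repetitions, and the variable names are assumptions for illustration:

```r
set.seed(371)    # arbitrary seed (an assumption)
p_true <- 0.472  # population proportion, known only inside the simulation
n      <- 1000   # sample size
z      <- 1.96   # critical value for a 95% interval

# For each repetition: draw a sample, build the interval, record a miss
misses <- replicate(5000, {
  x     <- rbinom(n, 1, p_true)  # 1 = voted for Romney, 0 = did not
  se    <- sd(x) / sqrt(n)
  lower <- mean(x) - z * se
  upper <- mean(x) + z * se
  p_true < lower || p_true > upper
})

miss_rate <- mean(misses)  # share of intervals that miss the population mean
miss_rate                  # should be roughly 0.05
```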

Estimate the Population Mean

How would we find the range of values so that the range would include the population mean with a very high chance?

sample mean ± (z*)(standard error)

z* - critical value

(z*)(standard error) - margin of error

\(\text{Standard Error} = \frac{\text{Sample SD}}{\sqrt{n}}\)

95% Confidence Interval (CI) in R

Let’s calculate the 95% CI for Romney’s election result.

We need to calculate:

Mean
Standard error
Critical value z* is 1.96 for 95% CI
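
The value 1.96 does not need to be memorized. As a small sketch, base R's qnorm() returns it as the standard normal quantile that leaves 2.5% in each tail; the names conf_level, alpha, and z_star are illustrative:

```r
conf_level <- 0.95
alpha      <- 1 - conf_level

# Quantile of the standard normal that leaves alpha / 2 in the upper tail
z_star <- qnorm(1 - alpha / 2)
z_star  # about 1.96

# The same idea gives z* for 90% and 99% confidence intervals
qnorm(c(0.95, 0.995))
```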

set.seed(123)
x_sampl = rbinom( 1000, 1, 0.49) #First, let's draw random sample of Americans
z = 1.96 #Critical values to construct 95%CI

mean(x_sampl) #Sample Mean

## [1] 0.484

sd(x_sampl)/sqrt(1000) #Standard Error

## [1] 0.0158112

mean(x_sampl)+z*(sd(x_sampl)/sqrt(1000)) #Upper bound of 95%CI

## [1] 0.5149899

mean(x_sampl)-z*(sd(x_sampl)/sqrt(1000)) #Lower  bound of 95%CI

## [1] 0.4530101
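
As a check by hand, plugging the printed sample mean and standard error into sample mean ± (z*)(standard error) gives the same bounds (values rounded to four decimals):

\[
\text{margin of error} = z^{*} \times \text{Standard Error} = 1.96 \times 0.0158112 \approx 0.0310
\]

\[
0.484 \pm 0.0310 \;\Rightarrow\; (0.4530,\ 0.5150)
\]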

HW4 Guide

There are two parts:

Part 1: Q1-4
Part 2: Q5-6

HW4 Guide

Lines 20-40: These lines create three custom R functions called plot.std.normal(), shade.std.normal(), and mark.z.value()
For the purpose of this class, you don’t need to understand how these custom R functions are created
You just need to run these lines before running any lines below

HW4 Guide

In case you want to know what these lines do:

Lines 20-24: define the plot.std.normal() function, which takes no arguments and plots a standard normal distribution from -4 to 4. The area under the curve is filled with light blue.

Lines 30-34: define the shade.std.normal() function, which shades the area under the curve based on the lb argument, the ub argument, or both. The color can be chosen.

Lines 37-40: define the mark.z.value() function, which draws a vertical line at a user-specified z-value.
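
For the curious, here is one way functions like these could be written. This is a hypothetical sketch, not the code from the HW4 file: the sketch_ names, the default shading color, and the dashed line style are all assumptions.

```r
# Hypothetical re-implementations; the "sketch_" prefix keeps them apart
# from the functions defined in the HW4 file.

# Draw the standard normal curve on [-4, 4] and fill it with light blue
sketch_plot_std_normal <- function() {
  z <- seq(-4, 4, length.out = 400)
  plot(z, dnorm(z), type = "l", xlab = "z", ylab = "Density")
  polygon(c(-4, z, 4), c(0, dnorm(z), 0), col = "lightblue")
}

# Shade the area between lb and ub; a missing bound defaults to the plot edge
sketch_shade_std_normal <- function(lb = -4, ub = 4, col = "gray") {
  z <- seq(lb, ub, length.out = 200)
  polygon(c(lb, z, ub), c(0, dnorm(z), 0), col = col)
}

# Add a dashed vertical line at a chosen z-value
sketch_mark_z_value <- function(z) {
  abline(v = z, lty = 2)
}
```

Because polygon() and abline() add to an existing plot, the shading and marking functions in this sketch only work after the plotting function has been called.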

HW4 Guide

plot.std.normal() function

You cannot type anything within the parentheses

You just need to type plot.std.normal()

Then, a standard normal distribution will be drawn (filled with light blue color)

HW4 Guide

shade.std.normal() function

It can be typed in the following ways:

shade.std.normal(lb=X) 
shade.std.normal(ub=Y) 
shade.std.normal(lb=X, ub=Y)

Option in shade.std.normal() command:

shade.std.normal(lb=X, col="red") #You can change the color of the shaded area

Note: you can only type this function after typing plot.std.normal()

HW4 Guide

The first way of typing shade.std.normal():

shade.std.normal(lb=X) #lb stands for lower bound

The shaded area will be from X to the right end:

plot.std.normal()
shade.std.normal(lb=1)

HW4 Guide

The second way of typing shade.std.normal():

shade.std.normal(ub=Y) #ub stands for upper bound

The shaded area will be from the left end to Y:

plot.std.normal()
shade.std.normal(ub=1)

HW4 Guide

The third way of typing shade.std.normal():

shade.std.normal(lb=X, ub=Y) #The shaded area will be from X to Y

plot.std.normal()
shade.std.normal(lb=-0.5, ub=1)

The col option works like the color option in other R graphs.
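
Each shaded region corresponds to a probability under the standard normal curve. As a short sketch, base R's pnorm() gives the area to the left of a z-value, so the three shading examples shown so far can be turned into numbers; the variable names are illustrative:

```r
# Area to the right of z = 1, as in shade.std.normal(lb = 1)
area_right <- 1 - pnorm(1)

# Area to the left of z = 1, as in shade.std.normal(ub = 1)
area_left <- pnorm(1)

# Area between z = -0.5 and z = 1, as in shade.std.normal(lb = -0.5, ub = 1)
area_middle <- pnorm(1) - pnorm(-0.5)

round(c(area_right, area_left, area_middle), 4)  # 0.1587 0.8413 0.5328
```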

HW4 Guide

mark.z.value() function:

mark.z.value(z=X) #This function marks the z-value at X

plot.std.normal()
shade.std.normal(lb=-0.5, ub=1) 
mark.z.value(z=1) 
mark.z.value(z=-0.5)

Note: you can only type this function after typing plot.std.normal()

S371: Lab 7

Announcement

Today
