HW 4 IS DUE TOMORROW (OCT.5)!
Let’s draw one simple random sample of size n and compute the mean:
Note: I will draw the sample from a normal distribution with mean 5 and standard deviation 10
## [1] 5.111798
## [1] 9.893484
This sample mean is the first data point of the sampling distribution.
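For reference, a single draw like the one summarized above might look as follows. This is only a sketch: the sample size of 1000 and the seed are assumptions, so the printed values will not match the output shown above exactly.

```r
set.seed(1)  # assumed seed, chosen only to make the sketch reproducible
n <- 1000    # assumed sample size

# Draw one simple random sample from a normal distribution (mean 5, SD 10)
x <- rnorm(n, mean = 5, sd = 10)

sample_mean <- mean(x)  # should be close to 5
sample_sd <- sd(x)      # should be close to 10
print(sample_mean)
print(sample_sd)
```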
Let’s draw 1000 of such samples and plot their means:
means_list = c() #We will save means of the samples into a vector to use it later
sample_list = c()
for (s in 1:1000) {
  
  x = rnorm(1000, 5, 10) #Draw simple random sample
  means_list[s] = mean(x) #Get the mean of this sample, and save it in separate list
  sample_list[s] = list(x)
  
}
hist(means_list, main="Sampling distribution of sample means", xlab = "")
Now let’s look at some samples and the sample of means from these samples:
| Property | Sample 1 | Sample 100 | Sample 599 | Mean of 1000 sample means | 
|---|---|---|---|---|
| Mean | 5.29 | 5.47 | 5.29 | 5.01 | 
| SD | 10.08 | 9.59 | 10.06 | 0.32 | 
As the sample size n increases, the sampling distribution of the sample mean becomes approximately normal. Also, as the number of samples increases, the mean of the sample means approaches the population mean.
Note: n should be large enough (30+)
This is true for other statistics as well.
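A short simulation can make both statements concrete. The sketch below is an illustration only: the sample sizes, the number of repetitions, and the seed are assumptions. It compares the SD of the sample means with the population SD divided by the square root of n.

```r
set.seed(42)        # assumed seed
pop_mean <- 5
pop_sd <- 10
n_samples <- 1000   # number of samples drawn for each sample size

# For each sample size n, draw n_samples samples and keep their means
sample_sizes <- c(30, 100, 1000)
results <- sapply(sample_sizes, function(n) {
  means <- replicate(n_samples, mean(rnorm(n, pop_mean, pop_sd)))
  c(n = n,
    mean_of_means = mean(means),
    sd_of_means = sd(means),
    theoretical_se = pop_sd / sqrt(n))
})

# One row per sample size
print(round(t(results), 3))
```

With n = 1000 the SD of the means should come out near 0.32, in line with the table above.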
Let’s imagine there is a presidential election, and we want to know before it ends who is more likely to win: Obama or Romney (2012)?
Gallup predicted that Romney would win with 49% of the vote, with a margin of error of +/-2%.
The real-world outcome (the population proportion of people who voted for Romney) was 47.2% (Obama won with 51.1%).
In other words:
| Candidate | Sample Mean (Gallup Poll) | Population Mean (Election Results) | 
|---|---|---|
| Romney | 49% | 47.2% | 
| Obama | 48% | 51.1% | 
What is margin of error?
How did they make the prediction? (“Why they failed” is another question beyond the scope of this class; come to graduate school if you wanna know the answer ;))
In reality:
What we don’t know:
What we know:
\(\text{Standard Error} = \frac{\text{Sample SD}}{\sqrt{n}}\)
In the real world, we can predict election results using a single sample of people (or other units of analysis) together with the properties of the CENTRAL LIMIT THEOREM and the SAMPLING DISTRIBUTION.
We pretend that the sample mean is the population mean.
Then we know the properties of the sampling distribution:
mean (from sample mean)
standard deviation (from the standard error formula)
sampling distribution is approximately normal
Based on what we know (mean, standard error, and probabilities of the normal distribution), we can find the range of values that contains the election result with 95% probability:
If we knew the population mean, the sampling distribution would be the blue distribution.
However, we only know the sample mean, so we pretend that the sampling distribution is the red distribution.
In reality, we know only the sample mean.
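To put numbers on the two curves, the sketch below computes the middle 95% range of each distribution with qnorm(). It assumes a hypothetical poll of n = 1000 respondents and treats each answer as 0/1, so that the SD is approximately sqrt(p(1 - p)). Neither assumption comes from Gallup's actual methodology.

```r
n <- 1000          # assumed (hypothetical) poll size
pop_prop <- 0.472  # population proportion for Romney (election result)
samp_prop <- 0.49  # sample proportion for Romney (poll)

# Standard error = SD / sqrt(n); for 0/1 data the SD is about sqrt(p * (1 - p))
se_pop <- sqrt(pop_prop * (1 - pop_prop)) / sqrt(n)
se_samp <- sqrt(samp_prop * (1 - samp_prop)) / sqrt(n)

# Middle 95% of each (approximately normal) sampling distribution
blue_range <- qnorm(c(0.025, 0.975), mean = pop_prop, sd = se_pop)
red_range <- qnorm(c(0.025, 0.975), mean = samp_prop, sd = se_samp)

print(round(blue_range, 3))
print(round(red_range, 3))
```

Under these assumptions the red range, centered on the sample proportion, still contains the population proportion.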
What if we draw another sample and plot it? Let’s say we collected another sample of 1000 Americans and asked whom they voted for in the election.
We found that 55% of our new sample voted for Romney (SD is 2%).
Does the shaded area include the population mean? No.
How likely is this to happen if we randomly sample 1000 Americans an infinite number of times? 5% of the time.
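The 5% figure can be checked by simulation. In the sketch below, 5000 repeated samples stand in for "an infinite number of times"; that count, the seed, and the use of the election result as the population proportion are assumptions made for illustration.

```r
set.seed(2012)     # assumed seed
pop_prop <- 0.472  # population proportion for Romney (election result)
n <- 1000          # respondents per sample
n_polls <- 5000    # number of repeated samples (stand-in for "infinite")
z <- 1.96          # critical value for a 95% interval

# Number of Romney voters in each simulated sample, then the sample proportion
votes <- rbinom(n_polls, size = n, prob = pop_prop)
p_hat <- votes / n

# Standard error = sample SD / sqrt(n); sample SD of 0/1 data from the proportion
sample_sd <- sqrt(p_hat * (1 - p_hat) * n / (n - 1))
se <- sample_sd / sqrt(n)

# Does each interval (the shaded area) miss the population proportion?
lower <- p_hat - z * se
upper <- p_hat + z * se
missed <- pop_prop < lower | pop_prop > upper

miss_rate <- mean(missed)
print(miss_rate)  # expected to be near 0.05
```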
How do we find a range of values that includes the population mean with a very high probability?
sample mean ± (z*)(standard error)
z* - critical value
(z*)(standard error) - margin of error
\(\text{Standard Error} = \frac{\text{Sample SD}}{\sqrt{n}}\)
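The critical value can be looked up from the standard normal distribution with qnorm(). The sketch below does that with a hypothetical helper, critical_value(), and then plugs example numbers into the margin of error. The sample SD of 0.5 and n = 1000 are illustrative assumptions, not values from the text.

```r
# Critical value z* for a confidence level: leave (1 - level) / 2 in each tail
# (critical_value is a hypothetical helper name used only in this sketch)
critical_value <- function(level) {
  qnorm(1 - (1 - level) / 2)
}

z_90 <- critical_value(0.90)
z_95 <- critical_value(0.95)
z_99 <- critical_value(0.99)
print(round(c(z_90, z_95, z_99), 2))

# Margin of error = z* times standard error (example inputs are assumptions)
sample_sd <- 0.5
n <- 1000
standard_error <- sample_sd / sqrt(n)
margin_of_error <- z_95 * standard_error
print(round(margin_of_error, 4))
```

The 95% value rounds to 1.96, the critical value used for a 95% CI.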
Let’s calculate the 95% CI for Romney’s election result.
We need to calculate:
Mean
Standard error
Critical value z* is 1.96 for 95% CI
set.seed(123)
x_sampl = rbinom(1000, 1, 0.49) #First, let's draw a random sample of Americans
z = 1.96 #Critical value to construct the 95% CI
## [1] 0.484
## [1] 0.0158112
## [1] 0.5149899
## [1] 0.4530101
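The four printed values are consistent with the sample mean, the standard error, and the upper and lower bounds of the interval, in that order. The statements that produced them are not shown, so the lines below are a reconstruction under that assumption. The variable names x_mean, x_se, upper, and lower are made up for illustration.

```r
set.seed(123)
x_sampl <- rbinom(1000, 1, 0.49)  # random sample of 1000 voters (assumed coding: 1 = Romney)
z <- 1.96                         # critical value for a 95% CI

# Assumed reconstruction of the four printed quantities
x_mean <- mean(x_sampl)                      # sample mean (proportion)
x_se <- sd(x_sampl) / sqrt(length(x_sampl))  # standard error = sample SD / sqrt(n)
upper <- x_mean + z * x_se                   # upper bound of the 95% CI
lower <- x_mean - z * x_se                   # lower bound of the 95% CI

print(x_mean)
print(x_se)
print(upper)
print(lower)
```

Going by the printed values, the interval runs from about 0.453 to 0.515, which contains the actual result of 0.472.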
There are two parts:
Part 1: Q1-4
Part 2: Q5-6
Lines 20-40: These lines create three custom R functions called plot.std.normal(), shade.std.normal(), and mark.z.value()
For the purpose of this class, you don’t need to understand how these custom R functions are created
You just need to run these lines before running any lines below
In case you want to know what these lines do:
Lines 20-24: define the plot.std.normal() function (takes no arguments), which plots a standard normal distribution from -4 to 4, filled with light blue.
Lines 30-34: define the shade.std.normal() function, which shades the area under the curve based on the lb argument, the ub argument, or both. The color can be chosen.
Lines 37-40: define the mark.z.value() function, which draws a vertical line at a user-specified z-value.
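The course script itself is not reproduced here. As a rough idea of what such definitions could look like, here is a minimal sketch in base R graphics. The function bodies, the default values, and the argument name color are assumptions; only the function names and the lb, ub, and z-value inputs come from the description above.

```r
# Sketch only: NOT the course script, just one way to get similar behavior.

# Plot the standard normal curve from -4 to 4, filled with light blue
plot.std.normal <- function() {
  x <- seq(-4, 4, length.out = 400)
  y <- dnorm(x)
  plot(x, y, type = "l", xlab = "z", ylab = "Density")
  polygon(c(-4, x, 4), c(0, y, 0), col = "lightblue")
}

# Shade the area under the curve between lb and ub
# (a missing lb means "from the left end", a missing ub means "to the right end")
shade.std.normal <- function(lb = -4, ub = 4, color = "red") {
  lb <- max(lb, -4)
  ub <- min(ub, 4)
  x <- seq(lb, ub, length.out = 200)
  polygon(c(lb, x, ub), c(0, dnorm(x), 0), col = color)
}

# Draw a dashed vertical line at a user-specified z-value
mark.z.value <- function(z) {
  abline(v = z, lty = 2)
}
```

After running these definitions, a call sequence such as plot.std.normal(), then shade.std.normal(lb = 1, color = "orange"), then mark.z.value(1) would draw the curve, shade the area to the right of 1, and mark z = 1.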
plot.std.normal() function
You cannot type anything within the parentheses
You just need to type plot.std.normal()
Then, a standard normal distribution will be drawn (filled with light blue color)
shade.std.normal() function
It can be typed in the following ways:
Option in shade.std.normal() command:
Note: you can only type this function after typing plot.std.normal()
The first way of typing shade.std.normal():
The shaded area will be from X to the right end:
The second way of typing shade.std.normal():
The shaded area will be from the left end to Y:
The third way of typing shade.std.normal():
The color option works like the color option in other R graphs
mark.z.value() function:
Note: you can only type this function after typing plot.std.normal()
Note for shade.std.normal() and mark.z.value() functions:
You can run these functions with any specifications as many times as you want, as long as they are run after the plot.std.normal() function
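The shaded areas correspond to probabilities under the standard normal curve, which pnorm() returns directly. A small sketch, where the cut-offs X = -1 and Y = 1.96 are arbitrary example values:

```r
X <- -1    # example lower bound (assumed value)
Y <- 1.96  # example upper bound (assumed value)

area_right_of_X <- pnorm(X, lower.tail = FALSE)  # area from X to the right end
area_left_of_Y <- pnorm(Y)                       # area from the left end to Y
area_between <- pnorm(Y) - pnorm(X)              # area between X and Y (both bounds given)

print(round(c(area_right_of_X, area_left_of_Y, area_between), 4))
```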