HW 3 IS DUE SATURDAY (SEP.30)!
| Variable | Quantitative | Categorical | 
|---|---|---|
| Quantitative | Scatter plot | Box plot, bar chart | 
| Categorical | Box plot, bar chart | Table or bar chart | 
Today we will cover bivariate graphs for quantitative-quantitative variables only.
To visualize the relationship between two quantitative variables, use the plot() command with the two variables, e.g., plot(x, y).
The plot() command in R is a generic plotting command that can produce many kinds of plots (e.g., a distribution function).
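A minimal sketch of the basic scatter plot call. The vectors x and y are simulated here purely for illustration (they are not course data), so treat the numbers as an assumption:

```r
# Hypothetical data: 50 simulated x values, and y built from x plus random noise
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)
y <- 3 + 0.5 * x + rnorm(50)

# Scatter plot: x on the horizontal axis, y on the vertical axis
plot(x, y)
```

With real data you would replace x and y with two quantitative columns, for example two columns of a data frame.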
There are many options in plot():
par(mfrow=c(2,3))
plot(x, y, main = "This is the main title") #set the main title
plot(x, y, xlab = "This is the x-axis title") #set the x-axis title
plot(x, y, ylab = "This is the y-axis title" ) #set the y-axis title
plot(x, y, col = "red" ) #Color of dots in the scatter plot
plot(x, y, pch = 0) #Different dot shapes in the scatter plot; 
#This should be an integer from 0 to 25. If you don’t type the pch option, 
#R would use the default value of 1 in pch option
To combine multiple options, you just need to put all the options you want within the parentheses:
Example: I want to specify the main title, x-axis title, y-axis title, dot color, and dot shape in the scatter plot
plot(x, y, main = "This is the main title", 
     xlab = "This is the x-axis title", 
     ylab = "This is the y-axis title", 
     col = "red", 
     pch = 0)
Right-click the plot –> copy and paste it into another program (e.g., Microsoft Word) –> save as a new file
Use the png() function and the dev.off() function:
Example: I want to save the scatter plot as a separate png file called scatter.png
#Three separate lines of code (they have to be in this order!)
#It asks R to create a new png file called scatter.png in the working directory 
#with the dimension of 1200X800. All the graphs produced after this line of code
#will be put into the scatter.png file. 
#(you can think of this as a “start recording button”)
png('scatter.png', width = 1200, height = 800) 
#The second line is just the usual scatter plot function
plot(x,y) 
#The third line is the dev.off() function. 
#You don’t need to put anything within the parentheses. 
#Any graphs produced between the first line (png() function) 
#and this line will be put into the scatter.png file (“stop recording button”)
dev.off()
Why sample?
We cannot ask every single person in our population (say we are interested in the income distribution of all USA residents, ~350 million people).
So we need to find a feasible way to do it.
We are going to draw a sample that resembles the whole population.
Why? Due to the properties of the Central Limit Theorem!
One condition: the sample must be random to ensure that it represents (= resembles) the population.
Population: The whole group of people you want to know about
Sample: A part of the group of people you want to know about
Sampling: The process of selecting a sample
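These three terms can be illustrated with a small simulation. This is only a sketch: the population below is made up (100,000 simulated values), and the names population and my_sample are hypothetical:

```r
set.seed(42)

# Population: the whole group (here, 100,000 simulated values, not real data)
population <- rnorm(100000, mean = 50, sd = 10)

# Sampling: randomly select 1,000 members of the population without replacement
my_sample <- sample(population, size = 1000)

# Sample: the part of the population we actually look at
mean(population)  # population mean (unknown in real life)
mean(my_sample)   # sample mean (what we can actually compute)
```

Because sample() picks members at random, the sample mean should land close to the population mean.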
| Properties | Population | Sample | Sampling distribution | 
|---|---|---|---|
| Mean | 12.01 | 12.23 | 12.04 | 
| Standard Deviation | 5.3 | 5.52 | 0.17 | 
As you can see, if the sample is random, its mean is close to the population mean.
How close?
• According to the law of large numbers, if you have a large sample size, the sample mean should be close to the population mean
• The larger the sample size, the closer the sample mean is to the population mean
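The law of large numbers can be checked with a quick simulation. This is a sketch on a made-up population (uniform values between 0 and 100), and the sample sizes are arbitrary choices:

```r
set.seed(123)

# Hypothetical population of 100,000 values
population <- runif(100000, min = 0, max = 100)

# Draw one random sample at each size and record its mean
sizes <- c(10, 100, 1000, 10000)
sample_means <- sapply(sizes, function(n) mean(sample(population, size = n)))

# Distance between each sample mean and the population mean
abs_error <- abs(sample_means - mean(population))
data.frame(n = sizes, sample_mean = sample_means, abs_error = abs_error)
```

The error tends to shrink as n grows, although in any single run a smaller sample can get lucky and land closer than a larger one.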
As the number of samples increases, the sampling distribution of the mean becomes approximately normal, and its mean approaches the population mean.
Note: the sample size n should be large enough (30+)
No matter how skewed or weird the original distribution is, the sampling distribution would still be approximately normal:
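A sketch of this idea in R, assuming a made-up, strongly right-skewed population (exponential values). The sample size of 50 and the 2,000 repetitions are arbitrary choices for illustration:

```r
set.seed(2023)

# Hypothetical, strongly right-skewed population
population <- rexp(100000, rate = 1)

# Draw 2,000 random samples of size n = 50 and keep the mean of each one
n <- 50
sample_means <- replicate(2000, mean(sample(population, size = n)))

# Compare the two shapes side by side
par(mfrow = c(1, 2))
hist(population, main = "Population (skewed)")
hist(sample_means, main = "Sampling distribution of the mean")

# The center of the sampling distribution is close to the population mean
mean(population)
mean(sample_means)

# Its spread is close to the population SD divided by sqrt(n)
sd(population) / sqrt(n)
sd(sample_means)
```

The first histogram is heavily skewed, while the histogram of sample means should look roughly bell-shaped.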
In reality:
What we don’t know:
What we know:
\(\text{Standard Error} = \frac{\text{Sample SD}}{\sqrt{n}}\)
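The formula translates directly into R. A minimal sketch, assuming a simulated sample of 400 values (my_sample and standard_error are hypothetical names, and the data are made up):

```r
set.seed(7)

# Hypothetical sample of 400 observations
my_sample <- rnorm(400, mean = 20, sd = 4)

# Standard error = sample SD divided by the square root of the sample size
n <- length(my_sample)
standard_error <- sd(my_sample) / sqrt(n)
standard_error
```

Only quantities we can compute from the sample (its SD and its size) are needed, which is why the standard error is usable even when the population is unknown.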
auto.sampler in your Environment pane:
If you pass only a number within the parentheses, it will give you one sample:
## Generating a sample of size 10000
Using this sample, write code in your HW 3.R script that will create:
Histogram of this sample (hist() function)
Mean of this sample (mean() function)
Standard deviation of this sample (sd() function)
Standard deviation of the sampling distribution (Hint: you will need the sqrt() function)