A Simulating sampling distributions
In these appendices we will explore how we can use R to carry out statistical simulations. Such simulations can be useful for a variety of purposes, such as understanding classical results in statistical inference, deriving novel results for unusual statistics or underlying distributions, or to help design experiments. More generally, simulation is a useful tool for honing our intuition or revealing key properties about complex systems.
In this chapter we will focus on how we can use simulation to derive well known and not so well known results regarding sampling distributions for a variety of statistics.
A.1 Sampling distributions
Usually when we collect biological data, it’s because we’re trying to learn about some underlying “population” of interest. Population here could refer to an actual population (e.g. all males over 20 in the United States; brushtail possums in the state of Victoria, Australia), an abstract population (e.g. corn plants grown from Monsanto “round up ready” seed; yeast cells with genotypes identical to the reference strain S288c), outcomes of a stochastic process we can observe and measure (e.g. meiotic recombination in flies; rainfall per square meter in the Brazilian amazon), etc.
It is often impractical or impossible to measure all objects/individuals/instances of a population of interest, so we take a sample (ideally a random sample) from the population and make measurements on the variables of interest in that sample. We do so with the hope that the various statistics we calculate on the variables of interest in that sample will be useful estimates of those same statistics in the underlying population.
However, we must always keep in mind that the statistics we calculate from our sample will almost never exactly match those of the underlying population. That is when we collect a sample, and measure a statistic (e.g. mean or standard deviation or skew or…) on variable \(X\) in the sample, there is a degree of uncertainty about how well our estimate matches the true value of that statistic in the underlying population. We know that if we took another random sample from the same population, and re-calculated the statistic of interest, we’re likely to obtain an estimate that differs from that in our first sample. We can imagine repeating this procedure again and again a large number of times, recording our estimate each time which would yield a new distribution, which we’ll call a sampling distribution. A sampling distribution is the probability distribution of a given statistic for samples of a given size. Sampling distributions are a linchpin at the heart of statistical inference, how we quantify the uncertainty associated with statistics and use that information to test hypotheses and evaluate models.
Traditionally sampling distributions were derived analytically. In this class session we’ll see how to approximate sampling distributions for any statistic using computer simulation.
A.2 Simulating sampling from an underlying population
To illustrate the concept of sampling distributions, we’ll simulate drawing samples for an underlying population that we’re trying to estimate statistics about. This will allow us to compare the various statistics we calculate and their sampling distributions to their “true” values.
Let’s simulate a population consisting of 25,000 individuals with a single trait of interest – height (measured in centimeters). We will simulate this data set based on information about the distribution of the heights of adult males in the US from a study carried out from 2011-2014 by the US Department of Health and Human Services1.
# male mean height and sd in centimeters from USDHHS report
<- 175.7
true.mean <- 15.19 true.sd
A.3 Seeding the pseudo-random number generator
When carrying out simulations, we employ random number generators (e.g. to choose random samples). Most computers can not generate true random numbers – instead they use algorithms that approximate the generation of random numbers (pseudo-random number generators). One important difference between a true random number generator and a pseudo-random number generator is that we can regenerate a series of pseudo-random numbers if we know the “seed” value that initialized the algorithm. We can specifically set this seed value, so that we can guarantee that two different people evaluating this notebook get the same results, even though we’re using (pseudo)random numbers in our simulation.
# make our simulation repeatable by seeding RNG
set.seed(20190409)
A.4 Libraries
library(tidyverse)
library(magrittr)
library(stringr)
A.4.1 Properties of the underlying population
Heights in human populations are approximately normally distributed, so we’ll assume that the distribution of our simulated variable is also normally distributed. Let’s take a moment to visualize the probability distribution of of a normal distribution with a mean and standard deviation as given above.
<-
pop.distn data_frame(height = seq(100, 250, 0.5),
density = dnorm(height,mean = true.mean, sd = true.sd))
ggplot(pop.distn) +
geom_line(aes(height, density)) +
# vertical line at mean
geom_vline(xintercept = true.mean, color="red", linetype="dashed") +
# vertical line at mean + 1SD
geom_vline(xintercept = true.mean + true.sd,
color = "blue", linetype="dashed") +
# vertical line at mean - 1SD
geom_vline(xintercept = true.mean - true.sd,
color = "blue", linetype="dashed") +
labs(x = "Height (cm)", y = "Density",
title = "Distribution of Heights in the Population of Interest",
subtitle = "Red and blue lines indicate the mean \nand ±1 standard deviation respectively.")
A.5 Random sampling from the simulated population
Let’s simulate the process of taking a single sample of 30 individuals from our population, using the rnorm()
function which takes samples if size n
from a normal distribution with the given mean and standard deviation:
<-
sample.a data_frame(height = rnorm(n = 30, mean = true.mean, sd = true.sd))
Now we’ll create a histogram of the height variable in our sample. For reference we’ll also plot the probability for the true population (but remember, in the typical case you don’t know what the true population looks like)
%>%
sample.a ggplot(aes(x = height)) +
geom_histogram(aes(y = ..density..),
fill = 'steelblue', alpha=0.75, bins=10) +
geom_line(data=pop.distn, aes(x = height, y = density),
alpha=0.25,size=1.5) +
geom_vline(xintercept = true.mean, linetype = "dashed", color="red") +
geom_vline(xintercept = mean(sample.a$height), linetype = "solid") +
labs(x = "Height (cm)", y = "Density",
title = "Distribution of heights in the underlying population (line)\nand a single sample of size 30 (blue)")
The dashed vertical line represent the true mean of the population, the solid line represents the sample mean. Comparing the two distributions we see that while our sample of 30 observations is relatively small,its location (center) and spread that are roughly similar to those of the underlying population.
Let’s create a table giving the estimates of the mean and standard deviation in our sample:
%>%
sample.a summarize(sample.mean = mean(height),
sample.sd = sd(height))
## # A tibble: 1 × 2
## sample.mean sample.sd
## <dbl> <dbl>
## 1 176. 12.4
Based on our sample, we estimate that the mean height of males in our population of interest is 175.5215685cm with a standard deviation of 12.4101548cm.
A.5.1 Another random sample
Let’s step back and think about our experiment. We took a random sample of 30 individuals from the population. The very nature of a “random sample” means we could just as well have gotten a different collection of individuals in our sample. Let’s take a second random sample of 25 individuals and see what the data looks like this time:
<-
sample.b data_frame(height = rnorm(30, mean = true.mean, sd = true.sd))
%>%
sample.b ggplot(aes(x = height)) +
geom_histogram(aes(y = ..density..),
fill = 'steelblue', alpha=0.75, bins=10) +
geom_line(data=pop.distn, aes(x = height, y = density),
alpha=0.25,size=1.5) +
geom_vline(xintercept = true.mean, linetype = "dashed", color="red") +
geom_vline(xintercept = mean(sample.a$height), linetype = "solid") +
labs(x = "Height (cm)", y = "Density",
title = "Distribution of heights in the underlying population (line)\nand a single sample of size 30 (blue)")
%>%
sample.b summarize(sample.mean = mean(height),
sample.sd = sd(height))
## # A tibble: 1 × 2
## sample.mean sample.sd
## <dbl> <dbl>
## 1 176. 17.2
This time we estimated the mean height to be 175.651398 cm and the standard deviation to be 17.1780732 cm.
A.5.2 Simulating the generation of many random samples
When we estimate population parameters, like the mean and standard deviation, based on a sample, our estimates will differ from the true population values by some amount. For any given sample we can’t know how close our estimates of statistics like the mean and standard deviation are to the true population values, but we we can study the behavior of such estimates across many simulated samples and learn something about how well our estimates do on average, as well the spread of these estimates.
A.5.3 A function to estimate statistics of interest in a random sample
First we’re going to write a function called rnorm.stats
that carries out the following steps:
- Take a random sample of size
n
from a normal distribution with a given mean (mu
) and standard deviation (sigma
) - Calculates the mean and standard deviation of the random sample
- Return a table giving the sample size, sample mean, and sample standard deviation, represented as a data frame
Take a moment to make sure you understand how this function works.
<- function(n, mu, sigma) {
rnorm.stats <- rnorm(n, mu, sigma)
the.sample data_frame(sample.size = n,
sample.mean = mean(the.sample),
sample.sd = sd(the.sample))
}
Let’s test rsample.stats()
for a sample of size 30, drawn from a popultion with a mean and standard deviation corresponding to our height exmaple:
rnorm.stats(30, true.mean, true.sd)
## # A tibble: 1 × 3
## sample.size sample.mean sample.sd
## <dbl> <dbl> <dbl>
## 1 30 176. 17.3
A.5.4 Generating statistics for many random samples
Now we’ll see how to combine rnorm.stats
with two additional functions to repeatedly run the rsample.stats
function:
.30 <-
df.samples.ofrerun(2500, rnorm.stats(30, true.mean, true.sd)) %>%
bind_rows()
The function rerun
is defined in the purrr
library (automatically loaded with tidyverse). purrr:rerun()
re-runs an expression(s) multiple times. The first argument to rerun()
is the number of times you want to re-run, and the following arguments are the expressions to be re-run. Thus the second line of the code block above re-runs the rnorm.stats
function 2500 times, generating sample statistics for samples of size 30 each time it’s run. rerun()
returns a list whose length is the specified number of runs.
The third line includes a call the dplyr::bind_rows()
. This simply takes the list that rerun
returns and collapses the list into a single data frame. df.samples.of.30
is thus a data frame in which each row gives the sample size, sample mean, and sample standard deviation for a random sample of 30 individuals drawn from our underlying population with a normally distributed variable.
head(df.samples.of.30)
## # A tibble: 6 × 3
## sample.size sample.mean sample.sd
## <dbl> <dbl> <dbl>
## 1 30 174. 12.4
## 2 30 178. 15.4
## 3 30 178. 13.3
## 4 30 179. 14.4
## 5 30 175. 12.8
## 6 30 177. 16.5
A.6 Simulated sampling distribution of the mean
Let’s review what we just did:
- We generated 2500 samples of size 30 sampled from a population for which the variable of interest has a normal distribution
- For each of the samples we calculated the mean and standard deviation in that sample
- We combined each of those estimates of the mean and standard deviation into a data frame
The 2500 estimates of the mean we generated represents a new distribution – what we will call a sampling distribution of the mean for samples of size 30. Let’s plot this sampling distribution:
ggplot(df.samples.of.30, aes(x = sample.mean, y = ..density..)) +
geom_histogram(bins=25, fill = 'firebrick', alpha=0.5) +
geom_vline(xintercept = true.mean, linetype = "dashed", color="red") +
labs(x = "Sample means", y = "Density",
title = "Distribution of mean heights for 2500 samples of size 30")
A.6.1 Differences between sampling distributions and sample/population distributions
Note that this is not a sample distribution of the variable of interest (“heights”), but rather the distribution of means of the variable of interest (“mean heights”) you would get if you took many random samples (in one sample you’d estimate the mean height as 180cm, in another you’d estimate it as 172 cm, etc). To emphasize this point, let’s compare the simulated sampling distribution of the mean (red histogram) to the population distribution of the variable (grey line) and the distributions of heights in a single sample (blue histogram):
ggplot(df.samples.of.30, aes(x = sample.mean, y = ..density..)) +
geom_histogram(bins=50, fill = 'firebrick', alpha=0.5) +
geom_histogram(data=sample.a,
aes(x = height, y = ..density..),
bins=11, fill='steelblue', alpha=0.25) +
geom_vline(xintercept = true.mean, linetype = "dashed", color='red') +
geom_line(data=pop.distn, aes(x = height, y = density), alpha=0.25,size=1.5) +
xlim(125,225) +
labs(x = "height or mean(height) in cm", y = "Density",
title = "Distribution of mean heights for 2500 samples\nof size 30 (red) compared to the distribution of a single sample (blue)\nand the population distribution of heights (gray line)")
A.6.2 Using sampling distributions to understand the behavior of statistics of interest
The particular sampling distribution of the mean, as simulated above, is a probability distribution that we can use to estimate the probability that a sample mean falls within a given interval, assuming our sample is a random sample of size 30 drawn from our underlying population.
From our visualization, we see that the distribution of sample mean heights is approximately centered around the true mean height. Most of the sample estimates of mean height are within 5 cm of the true population mean height, but a small number of estimates of the sample mean as off by nearly 10cm.
Let’s make this more precise by calculating the mean and standard deviation of the sampling distribution of means:
.30 %>%
df.samples.ofsummarize(mean.of.means = mean(sample.mean),
sd.of.means = sd(sample.mean))
## # A tibble: 1 × 2
## mean.of.means sd.of.means
## <dbl> <dbl>
## 1 176. 2.74
IMPORTANT! – the values above are estimates of the mean and standard deviation of the sampling distribution of means for samples of size 30, they are not estimates of the mean or standard deviation of the variable of interest (though they are related to these statistics as we’ll see below).
A.6.3 Sampling distributions for different sample sizes
In the example above we simulated the sampling distribution of the mean for samples of size 30. How would the sampling distribution change if we increased the sample size?
In the next code block we generate sampling distributions of the mean (and standard deviation) for samples of size 50, 100, 250, and 500.
.50 <-
df.samples.ofrerun(2500, rnorm.stats(50, true.mean, true.sd)) %>%
bind_rows()
.100 <-
df.samples.ofrerun(2500, rnorm.stats(100, true.mean, true.sd)) %>%
bind_rows()
.250 <-
df.samples.ofrerun(2500, rnorm.stats(250, true.mean, true.sd)) %>%
bind_rows()
.500 <-
df.samples.ofrerun(2500, rnorm.stats(500, true.mean, true.sd)) %>%
bind_rows()
To make plotting and comparison easier we will combine each of the individual data frames, representing the different sampling distributions for samples of a given size, into a single data frame.
<-
df.combined bind_rows(df.samples.of.30,
.50,
df.samples.of.100,
df.samples.of.250,
df.samples.of.500) %>%
df.samples.of# create a factor version of sample size to facilitate plotting
mutate(sample.sz = as.factor(sample.size))
We then plot each of the individual sampling distributions, faceting on sample size.
ggplot(df.combined, aes(x = sample.mean, y = ..density.., fill = sample.sz)) +
geom_histogram(bins=25, alpha=0.5) +
geom_vline(xintercept = true.mean, linetype = "dashed") +
facet_wrap(~ sample.sz, nrow = 1) +
scale_fill_brewer(palette="Set1") + # change color palette
labs(x = "Sample means", y = "Density",
title = "Distribution of mean heights for samples of varying size")
A.6.4 Discussion of trends for sampling distributions of different sample sizes
The key trend we see when comparing the sampling distributions of the mean for samples of different size is that as the sample size gets larger, the spread of the sampling distribution of the mean becomes narrower around the true mean. This means that as sample size increases, the uncertainty associated with our estimates of the mean decreases.
Let’s create a table, grouped by sample size, to help quantify this pattern:
<-
sampling.distn.mean.table %>%
df.combined group_by(sample.size) %>%
summarize(mean.of.means = mean(sample.mean),
sd.of.means = sd(sample.mean))
sampling.distn.mean.table## # A tibble: 5 × 3
## sample.size mean.of.means sd.of.means
## <dbl> <dbl> <dbl>
## 1 30 176. 2.74
## 2 50 176. 2.20
## 3 100 176. 1.49
## 4 250 176. 0.970
## 5 500 176. 0.671
A.7 Standard Error of the Mean
We see from the graph and table above that our estimates of the mean cluster more tightly about the true mean as our sample size increases. This is obvious when we compare the standard deviation of our mean estimates as a function of sample size.
The standard deviation of the sampling distribution of a statistic of interest is called the Standard Error of that statistic. Here, through simulation, we are approximating the Standard Error of the Mean.
A well known mathematical results that your already familiar with is that that the expected Standard Error of the Mean can be related to sample size and the standard deviation of the underlying population distribution as follows
\[ \mbox{Standard Error of Mean} = \frac{\sigma}{\sqrt{n}} \] where \(\sigma\) is the population standard deviation (i.e. the “true” standard deviation), and \(n\) is the sample size. This result for the standard error of the mean is true regardless of the form of the underlying population distribution.
Let’s compare that theoretical expectation to our simulated results:
<- sapply(seq(10,500,10),
se.mean.theory function(n){ true.sd/sqrt(n) })
<- data_frame(sample.size = seq(10,500,10),
df.se.mean.theory std.error = se.mean.theory)
ggplot(sampling.distn.mean.table, aes(x = sample.size, y = sd.of.means)) +
# plot standard errors of mean based on our simulations
geom_point() +
# plot standard errors of the mean based on theory
geom_line(aes(x = sample.size, y = std.error), data = df.se.mean.theory, color="red") +
labs(x = "Sample size", y = "Std Error of Mean",
title = "A comparison of theoretical (red line)\nand simulated (points) estimates of the \nstandard error of the mean for samples of different size")
We see that as sample sizes increase, the standard error of the mean decreases. This means that as our samples get larger, our uncertainty in our sample estimate of the mean (our best guess for the population mean) gets smaller.
A.8 Sampling Distribution of the Standard Deviation
Above we explored how the sampling distribution of the mean changes with sample size. We can similarly explore the sampling distribution of any other statistic, such as the standard deviation, or the median, or the the range, etc.
Recall that when we drew random samples we calculated the standard deviation of each of those samples in addition to the mean. We can look at the location and spread of the estimates of the standard deviation:
<-
sampling.distn.sd.table %>%
df.combined group_by(sample.size) %>%
summarize(mean.of.sds = mean(sample.sd),
sd.of.sds = sd(sample.sd))
sampling.distn.sd.table## # A tibble: 5 × 3
## sample.size mean.of.sds sd.of.sds
## <dbl> <dbl> <dbl>
## 1 30 15.1 2.02
## 2 50 15.1 1.51
## 3 100 15.2 1.08
## 4 250 15.2 0.678
## 5 500 15.2 0.476
As we did for the sampling distribution of the mean, we can visualize the sampling distribution of the standard deviation as shown below:
ggplot(df.combined, aes(x = sample.sd, y = ..density.., fill = sample.sz)) +
geom_histogram(bins=25, alpha=0.5) +
geom_vline(xintercept = true.sd, linetype = "dashed") +
facet_wrap(~ sample.sz, nrow = 1) +
scale_fill_brewer(palette="Set1") +
labs(x = "Sample standard deviations", y = "Density",
title = "Sampling distribution of standard deviation of height for samples of varying size")
The key trend we saw when examining the sampling distribution of the mean is also apparent for standard deviation – bigger samples lead to tighter sampling distributions and hence less uncertainty in the sample estimates of the standard deviation.
A.8.1 Standard error of standard deviations
For normally distributed data the expected Standard Error of the Standard Deviation (i.e. the standard deviation of standard deviations!) is approximately:
\[ \mbox{Standard Error of Standard Deviation} \approx \frac{\sigma}{\sqrt{2(n-1)}} \] where \(\sigma\) is the population standard deviation, and \(n\) is the sample size.
As before, let’s visually compare the theoretical expectation to our simulated estimates.
<- sapply(seq(10, 500, 5),
se.sd.theory function(n){ true.sd/sqrt(2*(n-1))})
<- data_frame(sample.size = seq(10,500,5),
df.se.sd.theory std.error = se.sd.theory)
ggplot(sampling.distn.sd.table, aes(x = sample.size, y = sd.of.sds)) +
# plot standard errors of mean based on our simulations
geom_point() +
# plot standard errors of the mean based on theory
geom_line(aes(x = sample.size, y = std.error), data = df.se.sd.theory, color="red") +
labs(x = "Sample size", y = "Std Error of Standard Deviation",
title = "A comparison of theoretical (red line) and\nsimulated (points) estimates of\nthe standard error of the standard deviation\n for samples of different size")
A.9 What happens to the sampling distribution of the mean and standard deviation when our sample size is small?
We would hope that, regardless of sample size, the sampling distributions of both the mean and standard deviation should be centered around the true population value, \(\mu\) and \(\sigma\) respectively. That seemed to be the case for the modest to large sample sizes we’ve looked at so far (30 to 500 observations). Does this also hold for small samples? Let’s use simulation to explore how well this is expectation is met for small samples.
As we’ve done before, we simulate the sampling distribution of the mean and standard deviation for samples of varying size. Because we’re dealing with small samples we’ll use larger simulations (5000 samples) so we get better estimates of their behavior (in general, the noisier the process you’re simulating the more simulations you should do)
# A new version of rnorm.stats that includes the sample estimate of the
# standard error of the mean
.2 <- function(n, mu, sigma) {
rnorm.stats<- rnorm(n, mu, sigma)
the.sample data_frame(sample.size = n,
sample.mean = mean(the.sample),
sample.sd = sd(the.sample),
sample.estimate.SE = sample.sd/sqrt(sample.size))
}
# sample sizes we'll consider
<- c(2, 3, 4, 5, 7, 10, 20, 30, 50)
ssizes
# number of samples to draw *for each sample size*
<- 2500
nsamples
# create a data frame with empty columns
<- data_frame(sample.size = double(),
df.combined.small sample.mean = double(),
sample.sd = double(),
sample.estimate.SE = double())
for (i in ssizes) {
<-
df.samples.of.size.i rerun(nsamples, rnorm.stats.2(i, true.mean, true.sd)) %>%
bind_rows()
<-
df.combined.small bind_rows(df.combined.small, df.samples.of.size.i)
}
%<>%
df.combined.small mutate(sample.sz = as.factor(sample.size))
A.9.1 For small samples, sample standard deviations systematically underestimate the population standard deviation
Let’s examine how the well centered the sampling distributions of the mean and standard deviation are around their true values, as a function of sample size.
First a table summarizing this information:
<-
by.sample.size %>%
df.combined.small group_by(sample.size) %>%
summarize(mean.of.means = mean(sample.mean),
mean.of.sds = mean(sample.sd))
by.sample.size## # A tibble: 9 × 3
## sample.size mean.of.means mean.of.sds
## <dbl> <dbl> <dbl>
## 1 2 176. 12.4
## 2 3 176. 13.5
## 3 4 176. 14.1
## 4 5 176. 14.1
## 5 7 176. 14.6
## 6 10 176. 14.8
## 7 20 176. 14.9
## 8 30 176. 15.1
## 9 50 176. 15.1
We see that the sampling distributions of means are well centered around the true mean for both small and medium, and there is no systematic bias one way or the other. By contrast the sampling distribution of standard deviations tends to underestimate the true standard deviation when the samples are small (less than 30 observations).
We can visualize this bias as shown here:
ggplot(by.sample.size, aes(x = sample.size, y = mean.of.sds)) +
geom_point(color = 'red') +
geom_line(color = 'red') +
geom_hline(yintercept = true.sd, color = 'black', linetype='dashed') +
labs(x = "Sample Size", y = "Mean of Sampling Distn of Std Dev")
The source of this bias is clear if we look at the sampling distribution of the standard deviation for samples of size 3, 5, and 30.
<-
filtered.df %>%
df.combined.small filter(sample.size %in% c(3, 5, 30))
ggplot(filtered.df, aes(x = sample.sd, y = ..density.., fill = sample.sz)) +
geom_histogram(bins=50, alpha=0.65) +
facet_wrap(~sample.size, nrow = 1) +
geom_vline(xintercept = true.sd, linetype = 'dashed') +
labs(x = "Std Deviations", y = "Density",
title = "Sampling distributions of the standard deviation\nas a function of sample size")
There’s very clear indication that the the sampling distribution of standard deviations is not centered around the true value for \(n=3\) and for \(n=5\), however with samples of size 30 the sampling distribution of the standard deviation appears fairly well centered around the true value of the underlying population.
A.9.2 Underestimates of the standard deviation for small \(n\) lead to understimates of the SE of the mean
When sample sizes are small, sample estimates of the standard deviation, \(s_x\), tend to underestimate the true standard deviation, \(\sigma\). Given this, it follows that sample estimates of the standard error of the mean, \(SE_{\overline{x}} = \frac{s_x}{\sqrt{n}}\), must tend to understimate the true standard error of the mean, \(SE_\mu = \frac{\sigma}{\sqrt{n}}\).
A.9.3 The t-distribution is the appropriate distribution for describing the sampling distribution of the mean when \(n\) is small
The problem associated with estimating the standard error of the mean for small sample sizes was recognized in the early 20th century by William Gosset, an employee at the Guinness Brewing Company. He published a paper, under the pseudonym “Student”, giving the appropriate distribution for describing the standard error of the mean as a function of the sample size \(n\). Gosset’s distribution is known as the “t-distribution” or “Student’s t-distribution”.
The t-distribution is specified by a single parameter, called degrees of freedom (\(df\)) where \({df} = n - 1\). As \(df\) increases, the t-distribution becomes more and more like the normal such that when \(n \geq 30\) it’s nearly indistinguishable from the standard normal distribution.
In the figures below we compare the t-distribution and the standard normal distribution for sample sizes \(n = {3, 5, 30}\).
<- seq(-6, 6, length.out = 200)
x <- c(3, 5, 30) # sample sizes
n
<- data_frame(sample.size = double(), z.or.t = double(),
distns.df norm.density = double(), t.density = double())
for (i in n) {
<- dnorm(x, mean = 0, sd = 1)
norm.density <- dt(x, df = i - 1)
t.density <- data_frame(sample.size = i, z.or.t = x,
df.temp norm.density = norm.density, t.density = t.density)
<- bind_rows(distns.df, df.temp)
distns.df
}
%<>% mutate(df = as.factor(sample.size - 1))
distns.df
ggplot(distns.df, aes(x = z.or.t, y = t.density, color = df)) +
geom_line() +
geom_line(aes(y = norm.density), color='black', linetype="dotted") +
labs(x = "z or t value", y = "Probablity density",
title = "Standard normal distribution (black dotted line)\nversus t-distributions for different degrees of freedom")
US Dept. of Health and Human Services; et al. (August 2016). “Anthropometric Reference Data for Children and Adults: United States, 2011–2014” (PDF). National Health Statistics Reports. 11. https://www.cdc.gov/nchs/data/series/sr_03/sr03_039.pdf↩︎