We use data to do statistical inference, meaning we either estimate some parameters with confidence, or test some theories or hypotheses about those parameters.
Significance Level
Consider an example. Suppose we have two hypotheses: the Null and the Alternative. We have also come up with a decision rule: after taking one observation of sample data, if the value of the observation is greater than a certain cut-off (here, the value 5), we reject the Null hypothesis (i.e. when the value is ≥ 5); otherwise we accept the Null hypothesis (i.e. when the value is ≤ 4):
(Figure: two sampling distributions. Under the Null hypothesis, the distribution sits over smaller values, the region where we accept the Null; under the Alternative hypothesis, it sits over larger values, the region where we reject the Null.)
However, there is a chance of making mistakes:
| Error | Description |
| --- | --- |
| Type 1 Error | We rejected the Null hypothesis, but the Null is true. The probability is denoted by α, also called the significance level. |
| Type 2 Error | We accepted the Null hypothesis, but the Alternative is true. The probability is denoted by β. |
Picking the cut-off value (in this example, 5) is effectively picking our significance level right up front. The sample data we get has to be so unusual that it would occur α (as a proportion) of the time or less if the Null were true; only then would we reject the Null hypothesis.
Once you’ve made a decision, though, that decision is either correct or an error. There is no more probability about those mistakes being made; it’s either right or wrong.
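To make this concrete, here is a minimal Python sketch (not from the course) that simulates the toy decision rule above. It assumes, purely for illustration, that the observation under the Null hypothesis follows a Poisson distribution with mean 2; the fraction of simulated observations falling at or above the cut-off of 5 then estimates the Type 1 error rate α.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical Null model: observations ~ Poisson(2).
# (This distribution is an assumption for illustration only.)
null_samples = rng.poisson(lam=2, size=100_000)

# Decision rule from the example: reject the Null when the observation is >= 5.
cutoff = 5
type1_rate = np.mean(null_samples >= cutoff)

# alpha is the probability of rejecting a true Null hypothesis.
print(f"Estimated significance level alpha: {type1_rate:.4f}")
```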
P-Value Approach
We could instead pick the α that we would like to have right up front, in other words set the probability of a Type 1 Error, and then look at our data and calculate a probability, the p-value, from it. According to the p-value approach, we look at the data sample we got and calculate how likely it would be to get that sample, or something more extreme, if the Null hypothesis were really true.
- Set your significance level α up front.
- Take your data and calculate from it the corresponding probability, called the p-value.
- If that p-value is small, your results are not compatible with the Null hypothesis, so you’re going to reject the Null hypothesis.
- If your p-value is somewhat large, larger than what you required, you’re not going to reject the Null hypothesis.
So, a universal decision rule that could be used is: reject the Null hypothesis if p-value ≤ α.
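As a sketch of this decision rule in Python, suppose (hypothetical numbers, not from the course) we observe 62 successes in 100 trials and the Null hypothesis says the true proportion is 0.5. scipy.stats.binomtest computes the p-value, which we then compare with α:

```python
from scipy.stats import binomtest

alpha = 0.05  # significance level chosen up front

# Hypothetical data: 62 successes out of 100 trials; Null hypothesis: p = 0.5.
result = binomtest(k=62, n=100, p=0.5, alternative="two-sided")

print(f"p-value: {result.pvalue:.4f}")
if result.pvalue <= alpha:
    print("Reject the Null hypothesis.")
else:
    print("Do not reject the Null hypothesis.")
```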
Frequentist vs Bayesian
In the field of statistics, there are two primary frameworks: frequentist and Bayesian. The frequentist approach assigns probabilities to data, not to hypotheses, whereas the Bayesian approach assigns probabilities to hypotheses.
In frequentist statistics, probabilities are statements about the world. Before an event happens, you can calculate and update probabilities about it; once the event has happened, however, updates aren’t allowed anymore. Because the event has already happened, your guess is either correct or incorrect: you have either a 0% chance of being right or a 100% chance of being right.
In Bayesian statistics, probabilities are statements about your state of knowledge. You can incorporate prior knowledge into the analysis and update the probability of a hypothesis as more data become available.
Confidence Interval
A confidence interval is a range of reasonable values for a parameter. We look at a confidence interval as a best estimate (an unbiased point estimate) plus or minus a margin of error (a few estimated standard errors). Here “a few” means a multiplier z* from the appropriate distribution, based on the desired confidence level and the sample design.
Confidence interval = Best estimate ± Margin of error
In order to get a reliable best estimate, we need a simple random sample. A simple random sample is a representative subset of the population made up of observations (or subjects) that have an equal probability of being chosen. We also need to make sure we have a large enough sample size. If we have a large enough sample size, then we can approximate our sampling distribution with a normal curve.
The margin of error depends on two things: the chosen confidence level, which determines the multiplier z*, and the sample size n. With a 95% confidence level, the z* multiplier is 1.96, which corresponds to a 5% significance level in a hypothesis test. If you want to be more confident, the multiplier has to be adjusted accordingly.
| Confidence level | 90% | 95% | 98% | 99% |
| --- | --- | --- | --- | --- |
| z* | 1.645 | 1.96 | 2.326 | 2.576 |
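These multipliers come from the standard normal distribution, so a quick Python check with scipy.stats.norm.ppf reproduces the table:

```python
from scipy.stats import norm

# z* is the (1 - alpha/2) quantile of the standard normal distribution,
# where alpha = 1 - confidence level.
for level in (0.90, 0.95, 0.98, 0.99):
    z_star = norm.ppf(1 - (1 - level) / 2)
    print(f"{level:.0%} confidence -> z* = {z_star:.3f}")
```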
Estimate a Population Proportion
In the case of one population proportion, the confidence interval is:
p̂ ± z* × √( p̂(1-p̂) / n )
where:
p̂ : our best estimate = x / n
x : the count of observations of interest
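A minimal Python sketch of this interval, using hypothetical data of 520 observations of interest out of a sample of 1000 (numbers invented for illustration):

```python
import numpy as np
from scipy.stats import norm

x, n = 520, 1000                  # hypothetical counts
p_hat = x / n                     # best estimate of the population proportion

confidence = 0.95
z_star = norm.ppf(1 - (1 - confidence) / 2)   # 1.96 for 95%

margin = z_star * np.sqrt(p_hat * (1 - p_hat) / n)
print(f"{confidence:.0%} CI: ({p_hat - margin:.4f}, {p_hat + margin:.4f})")
```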
We say that this interval is a confidence interval for a population proportion, even though we use our sample proportion to construct it; we don’t refer to it as a confidence interval for a sample proportion.
It is incorrect to interpret the confidence level as a chance or probability. For instance, don’t think that there’s a 95% chance the population proportion is in the interval we just made. The confidence level refers instead to the confidence we have in the process we used to make the interval: not a probability after the interval was made, but how confident we are in the statistical procedure that was used.
So, this 95% confidence level is what we would expect in the long run: about 95% of the intervals made with this method are expected to contain the true value.
Estimate a Difference in Population Proportions
When we want to calculate the difference between two population proportions, the confidence interval is:
(p̂₁ - p̂₂) ± z* × √( p̂₁(1-p̂₁) / n₁ + p̂₂(1-p̂₂) / n₂ )
where:
p̂₁ : our best estimate for population 1 = x₁ / n₁
p̂₂ : our best estimate for population 2 = x₂ / n₂
x₁, x₂ : the counts of observations of interest
However, this is based on the following assumptions:
- We have two independent random samples
- We need large enough sample sizes to assume that the distribution of our estimate is normal. (In other words, we need at least 10 yes’s and at least 10 no’s for each sample).
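Under those assumptions, a sketch of the interval in Python, with invented counts for two independent samples:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical counts for two independent samples.
x1, n1 = 240, 500
x2, n2 = 195, 450

p1_hat, p2_hat = x1 / n1, x2 / n2
diff = p1_hat - p2_hat

z_star = norm.ppf(0.975)          # 95% confidence

# Estimated standard error of the difference in sample proportions.
se = np.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

print(f"95% CI for p1 - p2: ({diff - z_star * se:.4f}, {diff + z_star * se:.4f})")
```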
Estimate a Population Mean
Consider the original population that we took our data from. If the model for the population of responses is approximately bell-shaped (normal), or the sample size is “large” enough, then the distribution of the resulting sample means will be bell-shaped (normal) too.
Even though the sample means vary a bit from one study to the next, they vary around μ and on average equal μ. The standard deviation of the sampling distribution of a statistic is called the standard error of that statistic; for the sample mean it is σ / √n, where σ is the true standard deviation of the population and n is the sample size. So, the variability (or accuracy) of our sample mean depends on two features:
- how much variation was in the original population that we took our data from, and
- how much data we take from that population
If we were able to have a larger sample size in a repeated study, that standard error would be smaller, so our sample means would vary less around the true mean μ. However, we can never calculate it exactly: the σ in the numerator is the true standard deviation of the population, and we don’t know what that is unless we can look at the entire population of measurements.
So, we have to come up with an estimate of this measure of variability. We can measure the variability in the sample with the sample standard deviation s, so our estimated standard error of the sample mean is s / √n.
The confidence interval for a population mean can be calculated as below. The best estimate is the sample mean x̄, and the margin of error is the “a few” multiplier t* times the estimated standard error of the sample mean.
x̄ ± t* × (s / √n)
where:
x̄ : sample mean
s / √n : estimated standard error
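A short Python sketch of this interval, with a small invented sample of measurements (the t* multiplier is explained just below):

```python
import numpy as np
from scipy.stats import t

# Hypothetical sample of measurements.
data = np.array([12.1, 11.4, 13.2, 12.8, 10.9, 12.5, 11.7, 13.0])
n = len(data)

x_bar = data.mean()                  # best estimate: the sample mean
se = data.std(ddof=1) / np.sqrt(n)   # estimated standard error s / sqrt(n)

t_star = t.ppf(0.975, df=n - 1)      # 95% confidence, n - 1 degrees of freedom

print(f"95% CI: ({x_bar - t_star * se:.3f}, {x_bar + t_star * se:.3f})")
```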
Unlike the case of proportions, a large sample size is not required in order to construct a confidence interval for a population mean. The multiplier t* comes from a different distribution, the T-distribution with n - 1 degrees of freedom, so the value of t* is determined by the sample size n.
The underlying assumptions needed for the confidence interval for a population mean to be reasonable are:
- The data needs to be representative of that larger population, and be considered a random sample.
- The model for the population of responses is bell-shaped (normal).
Estimate Difference in Population Means for Paired Data
When we have paired data, we want to treat the two sets of values simultaneously. We can’t look at them separately because they’re matched in some way. You might have:
- two measurements collected on the same individual
- two measurements collected on matched individuals (say, twins)
We’ll look at the difference of measurements within pairs. The confidence interval is calculated as below:
x̄_d ± t* × (s_d / √n)
where:
x̄_d : sample mean of the differences
s_d / √n : estimated standard error of the differences
This formula looks familiar: it is the same general formula as the confidence interval for a single mean. The subscript d on both the sample mean and the sample standard deviation indicates that we use the within-pair differences to calculate them.
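A sketch in Python, using hypothetical before/after measurements on the same individuals:

```python
import numpy as np
from scipy.stats import t

# Hypothetical paired measurements (e.g. before/after on the same subjects).
before = np.array([140, 132, 128, 150, 145, 138, 126, 133])
after = np.array([135, 130, 129, 142, 139, 136, 122, 130])

d = before - after                 # work with the within-pair differences
n = len(d)

d_bar = d.mean()                   # sample mean of the differences
se = d.std(ddof=1) / np.sqrt(n)    # estimated standard error of the differences
t_star = t.ppf(0.975, df=n - 1)    # 95% confidence

print(f"95% CI for the mean difference: "
      f"({d_bar - t_star * se:.2f}, {d_bar + t_star * se:.2f})")
```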
We do need to assume that we have a random sample, and that the population of differences is normal. However, a large sample can help us bypass the latter: the assumption of normality isn’t quite as important with a large sample size.
Estimate Difference in Population Means for Independent Groups
If the models for both populations of responses are approximately normal (or both sample sizes are large enough), the distribution of the difference in sample means x̄₁ - x̄₂ is approximately normal, centered on μ₁ - μ₂.
The standard error √(σ₁²/n₁ + σ₂²/n₂) tells us, on average, how far our sample statistic will fall from the true parameter. Unfortunately, we don’t know the true values of the population variances σ₁² and σ₂². So the best we can do is estimate the standard error using the sample variances s₁² and s₂².
Estimated standard error = √(s₁²/n₁ + s₂²/n₂)
When calculating the confidence interval, there are actually two approaches:

| Approach | Assumption |
| --- | --- |
| Pooled approach | If the variances of the two populations are close enough, we assume them to be equal: σ₁² = σ₂². We don’t know these true variances, but by looking at the sample variances and the spread of each sample, we can see whether this assumption holds. |
| Unpooled approach | The assumption of equal variances is dropped. |
Using the unpooled approach to calculate the confidence interval:
(x̄₁ - x̄₂) ± t* × √(s₁²/n₁ + s₂²/n₂)
where:
x̄₁, x̄₂ : sample means
√(s₁²/n₁ + s₂²/n₂) : estimated standard error
t* : multiplier from a T-distribution
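A Python sketch of the unpooled interval with invented samples. The text above doesn’t specify the degrees-of-freedom rule for the unpooled t*, so this sketch uses the Welch-Satterthwaite approximation, a common choice:

```python
import numpy as np
from scipy.stats import t

# Hypothetical independent samples.
sample1 = np.array([23.1, 25.4, 22.8, 26.0, 24.3, 25.1, 23.7])
sample2 = np.array([20.9, 22.5, 21.8, 23.4, 22.0, 21.2])

n1, n2 = len(sample1), len(sample2)
v1, v2 = sample1.var(ddof=1), sample2.var(ddof=1)

se = np.sqrt(v1 / n1 + v2 / n2)    # unpooled estimated standard error

# Welch-Satterthwaite approximation for the degrees of freedom
# (an assumption here; the text does not state the unpooled df rule).
df = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)

t_star = t.ppf(0.975, df=df)       # 95% confidence
diff = sample1.mean() - sample2.mean()
print(f"95% CI (unpooled): ({diff - t_star * se:.3f}, {diff + t_star * se:.3f})")
```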
Using the pooled approach to calculate the confidence interval, the estimated standard error is going to change:
(x̄₁ - x̄₂) ± t* × √( [(n₁-1)s₁² + (n₂-1)s₂²] / (n₁+n₂-2) ) × √(1/n₁ + 1/n₂)
where:
x̄₁, x̄₂ : sample means
√( [(n₁-1)s₁² + (n₂-1)s₂²] / (n₁+n₂-2) ) : pooled estimate of the common standard deviation
t* : multiplier from a T-distribution with n₁+n₂-2 degrees of freedom
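The same invented samples with the pooled interval, where the common variance is estimated by weighting each sample variance by its degrees of freedom:

```python
import numpy as np
from scipy.stats import t

# Hypothetical independent samples (same as the unpooled sketch).
sample1 = np.array([23.1, 25.4, 22.8, 26.0, 24.3, 25.1, 23.7])
sample2 = np.array([20.9, 22.5, 21.8, 23.4, 22.0, 21.2])

n1, n2 = len(sample1), len(sample2)
v1, v2 = sample1.var(ddof=1), sample2.var(ddof=1)

# Pooled estimate of the common variance.
pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
se = np.sqrt(pooled_var) * np.sqrt(1 / n1 + 1 / n2)

t_star = t.ppf(0.975, df=n1 + n2 - 2)   # 95% confidence
diff = sample1.mean() - sample2.mean()
print(f"95% CI (pooled): ({diff - t_star * se:.3f}, {diff + t_star * se:.3f})")
```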
There are a couple of assumptions we need to check:
- we have a simple random sample for both of our populations. We need these two samples to be independent from one another.
- models for both populations of responses are approximately normal (or sample sizes are both large enough).
- if we have enough evidence to assume equal variances between the two populations, we can use the pooled approach.
My Certificate
For more on Confidence Intervals, please refer to the wonderful course here https://www.coursera.org/learn/inferential-statistical-analysis-python/
I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai