In the 1930s, Jerzy Neyman and others made important breakthroughs in this area. His work enabled researchers to use random sampling as a technique to measure populations: we no longer have to measure every single unit in the population to make statements about the population (inference).



The first important step is to define the target population of interest in concrete terms:

  1. Who are we measuring?
  2. What time frame are we interested in?
  3. Where is the population located?

How can we make inferential statements?

Conduct a census: measure every single person in that concretely defined population.

Easier for smaller target populations.
Incredibly expensive for larger populations.
Select a scientific probability sample: measure all units in the sample drawn from the population.

a. Construct a list of all units in the population (the sampling frame).
b. Determine the probability of selection for every unit on the list.
c. Select units from the list at random, with sampling rates for different subgroups determined by their probabilities of selection.
d. Attempt to measure the randomly selected units.
Select a non-probability sample: measure all units in the sample drawn from the population.

Generally does not involve random selection of individuals according to probabilities of selection.
The probabilities of selection cannot be determined for the population units, which makes it more difficult to make representative inferential statements.
– opt-in web surveys
– quota sampling
– snowball sampling
– convenience sampling

Main problems:
There is no statistical basis for making inference about the target population.
Very high potential for bias.

Probability Sampling

With probability sampling, the known probabilities of selection for all units in the population allow us to make unbiased statements about both population features (the quantities we try to estimate) and the uncertainty in survey estimates (how uncertain we are about an estimate).

The random selection of population units from this predefined sampling frame protects us against bias from the sample selection mechanism, and it allows us to make population inferences based on the sampling distribution.

With very careful sample design, probability samples yield representative, realistic, random samples from larger populations; such samples have important statistical properties.



Simple Random Sampling (SRS)

With SRS we start with a known list (sampling frame) of N population units and randomly select n units from the list, where N is the size of the population and n is the size of the sample. Every unit has an equal probability of selection, n / N (the sampling fraction: the fraction of the population that ultimately gets selected into the sample).

All possible samples of size n are equally likely. Estimates of means, proportions, and totals based on SRS are unbiased, which means these estimates are equal to the population values on average. SRS can be done with or without replacement; in both cases the probability of selection is still n / N.

With replacement: when we select somebody from the list, we place them back on the list, giving them a chance of being selected again in the sample.
Without replacement: more commonly, SRS is done without replacement. Once a unit is sampled, it cannot be sampled again.
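
As a minimal sketch in Python (assuming NumPy and a hypothetical frame of N = 1000 unit IDs), both variants can be drawn with numpy.random.choice:

    import numpy as np

    rng = np.random.default_rng(42)

    N = 1000                 # population (sampling frame) size
    n = 50                   # sample size
    frame = np.arange(N)     # hypothetical list of unit IDs

    # SRS without replacement (the more common case): each unit appears at most once
    srs_without = rng.choice(frame, size=n, replace=False)

    # SRS with replacement: a selected unit goes back on the list and can be drawn again
    srs_with = rng.choice(frame, size=n, replace=True)

    # In both designs every unit's probability of selection is n / N
    print(n / N)             # 0.05 sampling fraction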

SRS is rarely used in practice. It can be prohibitively expensive. It is generally used when populations are smaller, and it is easier and less expensive to ultimately collect the data.

SRS will generate i.i.d. data for a given variable in theory. All randomly sampled units will yield observations that are independent (not correlated with each other), and identically distributed (representative, in theory).

Complex Sampling

With larger populations, complex samples are often selected, where each sampled unit still has a known probability of selection. We use specific features of the probability sample design to save on costs and make our samples more efficient.

  1. The population is divided into different strata, and part of the sample is allocated to each stratum; this ensures sample representation from each stratum and reduces the variance of survey estimates (stratification). Stratification allows us to ensure that we allocate some of our sample to each of these divisions of the population.
  2. Within strata, clusters of population units are randomly sampled first (with known probability).
  3. Within each cluster, units are randomly sampled according to some probability of selection and measured.

So a unit’s probability of selection is determined by several things:

  1. a: number of clusters sampled from each stratum
  2. A: total number of clusters in the population in each stratum
  3. b: number of units ultimately sampled from each cluster
  4. B: total number of units in the population in each cluster

Probability of Selection = (a / A) × (b / B)


At that second stage (sampling within clusters), it is OK to have oversampling, which means different people will have different probabilities of being selected. At all stages we know what the probabilities of selection are, and we maintain those probabilities of selection throughout the entire design.

The inverse of a person’s probability of selection is then their sampling weight, which is used in the actual data analysis to compute representative population estimates.
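
For example, with hypothetical numbers (a = 10 of A = 200 clusters sampled in a stratum, and b = 20 of B = 500 units sampled within a selected cluster), the probability of selection and the corresponding sampling weight work out as:

    a, A = 10, 200     # clusters sampled / total clusters in the stratum (hypothetical)
    b, B = 20, 500     # units sampled per cluster / total units in the cluster (hypothetical)

    prob_selection = (a / A) * (b / B)      # 0.05 * 0.04 = 0.002
    sampling_weight = 1 / prob_selection    # 500: each sampled unit represents 500 population units
    print(prob_selection, sampling_weight)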

Why Probability Sampling?

Every unit in the population has a known, non-zero probability of selection, and the subsequent random sampling ensures all units have a chance of being sampled, according to the probability sampling design.

We can use those probabilities of selection to compute unbiased estimates using sampling weights. We can also estimate the sampling distribution we would see if we repeated that complex sampling over and over; in effect, we can simulate what that sampling distribution would look like based on selecting only one sample.

Most importantly, probability sampling provides a statistical basis for making inferences about certain quantities in larger populations.

Non-Probability Sampling

The probability of selection cannot be determined “a priori” for the sampled units, so we cannot compute each unit's probability of being selected into the sample. We do not control the random selection mechanism, so there is neither random selection of individual units nor random selection of clusters at the early stages. The good news is that non-probability sampling is much cheaper than probability sampling.

There is really no statistical basis for making inference about the larger population from which the sample was selected. Sampled units are not selected at random, so there is a strong risk of sampling bias.

“Big data” often arise out of convenience and frequently come from non-probability samples. Be careful! You can still make decent inferences using two possible approaches:

  1. Pseudo-Randomization: with a little work, we treat the non-probability sample like a probability sample.
  2. Calibration: weight your non-probability sample to look more like the population that you are interested in.


Pseudo-Randomization Approach

In this approach:

  1. We combine the non-probability sample with a probability sample. The key thing here is that both samples need to collect similar measurements; we literally append them to each other.
  2. Estimate the probability of being included in the non-probability sample as a function of auxiliary information available in both samples. We use all that auxiliary information to determine, for each individual in the non-probability sample, their probability of being included in the non-probability sample when considering both datasets simultaneously.
  3. Treat the estimated probabilities of being selected into the non-probability sample as “known” for the non-probability sample, and use probability sampling methods for analysis.

We could use logistic regression to estimate the probability of a unit being in the non-probability sample as a function of all the other variables. Once we have those probabilities, we can ask: “What would have been the probability of being in this combined data set for an individual who showed up in the non-probability sample, given all of their features?” Then we pretend that the estimated probability of selection is known, and we use methods for probability samples, combining these two data sources together.

Calibration Approach

Compute weights for the responding units in the non-probability sample that allow the weighted sample to mirror a known population. This is especially important if the weighting characteristic is correlated with the variable we are interested in. If our sample looks more like the population in terms of a characteristic that has a strong correlation with our variables of interest, we get closer and closer to making unbiased statements about the population.

If the weighting factor (the one we use to develop weights and make our non-probability sample look more like the population) is not actually related to the variable of interest, we will not reduce the sampling bias that might come from the non-probability sampling.
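
A minimal sketch of one simple calibration method, post-stratification, assuming hypothetical known population shares by age group and the shares observed in a non-probability sample:

    # Known population shares (e.g. from a census) and observed sample shares, by age group
    population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}
    sample_share     = {"18-34": 0.50, "35-54": 0.30, "55+": 0.20}

    # Post-stratification weight per group: make the weighted sample shares match the population
    weights = {g: population_share[g] / sample_share[g] for g in population_share}
    print(weights)   # {'18-34': 0.6, '35-54': ~1.17, '55+': 1.75}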

Sampling distribution

Recall that we usually talk about the distribution of values on a variable of interest, and we generally assume the values on a variable would follow a certain distribution if we could measure the entire population.

But when we select probability samples to make inferential statements about a larger population, we make those statements based on something called the sampling distribution. The sampling distribution is the distribution of survey estimates we would see if we selected many random samples using the same probability sampling design and computed an estimate from each of those probability samples.

This is a hypothetical idea: it describes what we would see if we had the luxury of drawing thousands of probability samples. In practice we are not going to do that; we try to estimate the features of that sampling distribution based on one sample only.

Generally it has a very different appearance from the distribution of values on a variable. With a large enough probability sample size, the sampling distribution will look like a normal distribution, regardless of what estimates we are computing (the Central Limit Theorem). Sampling variance is the variability in the estimates described by the sampling distribution.
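
A minimal simulation sketch of this idea in Python (assuming NumPy, a hypothetical skewed population, and the sample mean as the estimate):

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical skewed "population" of 100,000 values (far from normal)
    population = rng.exponential(scale=10.0, size=100_000)

    def sample_means(n, reps=5000):
        """Sample means from repeated SRS draws of size n."""
        return np.array([rng.choice(population, size=n, replace=False).mean()
                         for _ in range(reps)])

    means_50, means_500 = sample_means(50), sample_means(500)

    # Both sets of means center on the population mean (unbiasedness), their histograms
    # look roughly normal, and the spread (sampling variance) shrinks as n grows.
    print(population.mean(), means_50.mean(), means_500.mean())
    print(means_50.std(), means_500.std())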

Because we selected a sample, and we are not measuring everyone in the population, a survey estimate based on a single sample is not going to be exactly equal to the population quantity of interest. This is sampling error. We just want the estimates across these hypothetical repeated samples to be equal to that population quantity on average: no single estimate will be exactly equal to the population quantity, but the average of all of them will be.



Across hypothetical repeated samples, the sampling errors will randomly vary. The variability of these sampling errors describes the variance of the sampling distribution. The bigger the sample size becomes, the less sampling error we will have and the less variable our estimates will be.

Larger samples -> Less sampling variance -> More precise estimates -> More confidence

Fewer clusters with more units in each cluster -> More sampling variance

So a simple random sample (SRS) with a small sample size can have a sampling distribution almost identical to that of a complex cluster sample with a large sample size, where there is a large number of units in each cluster. This is another key trade-off between the probability sampling design and how variable the estimates are.
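
A rough simulation sketch of that trade-off (the population, cluster structure, and sample sizes here are assumptions chosen to make the cluster effect visible):

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical population: 200 clusters of 100 units with strong within-cluster similarity
    n_clusters, cluster_size = 200, 100
    cluster_effects = rng.normal(50, 10, size=n_clusters)
    population = cluster_effects[:, None] + rng.normal(0, 2, size=(n_clusters, cluster_size))

    def srs_mean(n):
        """Mean from an SRS of n units drawn across the whole population."""
        return rng.choice(population.ravel(), size=n, replace=False).mean()

    def cluster_sample_mean(k):
        """Mean from k whole clusters sampled at random (all 100 units per cluster)."""
        picked = rng.choice(n_clusters, size=k, replace=False)
        return population[picked].mean()

    # Same total sample size (400 units) under the two designs
    srs_estimates     = np.array([srs_mean(400) for _ in range(2000)])
    cluster_estimates = np.array([cluster_sample_mean(4) for _ in range(2000)])

    # The cluster design shows far more sampling variance for the same number of units
    print(srs_estimates.std(), cluster_estimates.std())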

But when we collect data in the real world, we usually only have the resources to collect one sample, measure that sample, and compute the estimate. Jerzy Neyman's important sampling theory allows us to estimate features of the sampling distribution based on one sample.

The magic of probability sampling is this: one probability sample and the features of the probability sample design (probabilities of selection, stratification, clustering, etc.) are all we need to estimate the expected sampling distribution (mainly its variance and mean).

An interesting result is that, given a large enough sample from some population, the sampling distribution of most statistics of interest will tend to normality, regardless of how the input variables are actually distributed. This bell-shaped curve comes from the Central Limit Theorem and drives design-based statistical inference (also called frequentist inference): we look at the sampling distribution, assume it is normal, and make inferences about the value of a given statistic in a larger population.

Pearson correlation

Pearson correlation describes the linear association between two continuous variables. We can apply the same idea to draw the sampling distribution of correlations. Regardless of the sample size, all of the distributions are approximately normal and centered at the true correlation, but as the sample size increases, the distribution becomes more symmetric and less spread out.
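
A minimal simulation sketch (assuming NumPy and a hypothetical bivariate normal population with a true correlation of 0.5):

    import numpy as np

    rng = np.random.default_rng(2)
    true_corr = 0.5
    cov = [[1.0, true_corr], [true_corr, 1.0]]

    def corr_estimates(n, reps=2000):
        """Pearson correlations computed from repeated samples of size n."""
        estimates = []
        for _ in range(reps):
            x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
            estimates.append(np.corrcoef(x, y)[0, 1])
        return np.array(estimates)

    for n in (25, 100, 400):
        r = corr_estimates(n)
        # Centered near the true correlation of 0.5; the spread shrinks as n grows
        print(n, round(r.mean(), 3), round(r.std(), 3))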

Regression

Regression allows us to assess the relations between multiple variables. In the case of two variables, we focus on the estimated slope (the change in some dependent variable y for a one-unit change in some predictor variable x) that describes the linear relationship between the two variables. We can also draw the sampling distribution of regression estimates.

Regardless of the sample size, all of the distributions are approximately normal and fairly symmetric. As we increase the sample size, the spread shrinks.
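
A similar sketch for the estimated slope (assuming a hypothetical linear population model with a true slope of 2.0):

    import numpy as np

    rng = np.random.default_rng(3)

    def slope_estimates(n, reps=2000, true_slope=2.0):
        """Estimated slopes from repeated samples of y = 1 + true_slope * x + noise."""
        estimates = []
        for _ in range(reps):
            x = rng.normal(size=n)
            y = 1.0 + true_slope * x + rng.normal(scale=3.0, size=n)
            estimates.append(np.polyfit(x, y, deg=1)[0])   # fitted slope
        return np.array(estimates)

    for n in (25, 100, 400):
        b = slope_estimates(n)
        # Centered near the true slope of 2.0; the spread shrinks as n grows
        print(n, round(b.mean(), 3), round(b.std(), 3))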

Non-normal sampling distribution

Not all statistics have normal sampling distributions. In these cases, more specialized procedures are needed to make population inference, e.g. Bayesian methodology.

Inference for Probability Samples

There are two general approaches to making population inference based on estimated features of the sampling distribution:

  1. Estimate a confidence interval for the parameters of interest in the population.
  2. Test hypothesis about parameters of interest in the population.

Both of these approaches are valid if probability sampling was used.

A parameter of interest is really the ultimate product of our data analysis; it could be a mean, a proportion, a regression coefficient, an odds ratio, etc. Both approaches assume the sampling distributions of the estimates are approximately normal, so our inference is driven by what the sampling distribution looks like.



Step 1: compute an unbiased point estimate of the parameter of interest.
An unbiased point estimate means the average (or expected value) of all possible values of the point estimate (again, across hypothetical repeated samples) is equal to the true parameter value: the sampling distribution is centered at the true population parameter we are interested in. This is why it is so important to design samples that give us unbiased estimates.

Step 2: compute the sampling variance associated with that point estimate.
An unbiased variance estimate correctly describes the variance of the sampling distribution under the sample design used. The standard error really describes the standard deviation of the sampling distribution.

Standard error of point estimate = square root of variance
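
A minimal sketch of these two steps for a mean under SRS without replacement (the (1 - n/N) finite population correction is the standard SRS variance formula; the data values here are hypothetical):

    import numpy as np

    N = 10_000                                        # population size
    sample = np.array([12.0, 15.5, 9.0, 14.2, 11.8, 13.6, 10.9, 12.7])   # hypothetical SRS
    n = len(sample)

    point_estimate = sample.mean()                    # Step 1: unbiased point estimate of the mean
    sampling_variance = (1 - n / N) * sample.var(ddof=1) / n   # Step 2: estimated sampling variance
    standard_error = np.sqrt(sampling_variance)

    print(point_estimate, standard_error)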

Confidence Interval

An “xx% confidence level” means we expect xx% of the confidence intervals (constructed using the steps below) to cover the true population value. To form a confidence interval:

Confidence interval = Best estimate ± Margin of error

Best estimate = Unbiased point estimate
Margin of error = "A few" × estimated standard error
"A few" = multiplier from appropriate distribution, based on desired confidence level and sample design

Hypothesis Testing

The hypothesis could read: “Could the value of the parameter be the hypothesized (null) value?” We really answer the question: “Is the point estimate for the parameter close to the null value or far away?” We use the standard error of the point estimate as a yardstick:

Test statistic = (Unbiased point estimate - Null value) / Standard error

If the null is true, what is the probability of seeing a test statistic this extreme or even more extreme? If that probability is small, we reject the null.
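
A minimal sketch with hypothetical numbers, assuming a normal sampling distribution and a two-sided test:

    from scipy.stats import norm

    point_estimate = 12.46      # hypothetical unbiased point estimate
    null_value = 10.0           # hypothesized value under the null
    standard_error = 0.72       # hypothetical estimated standard error

    test_statistic = (point_estimate - null_value) / standard_error
    p_value = 2 * norm.sf(abs(test_statistic))        # P(|Z| >= observed) if the null were true

    print(test_statistic, p_value)   # about 3.42 with p around 0.0006, so we would reject the null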

Inference for Non-Probability Samples

Inference approaches for non-probability samples generally rely on modeling: fitting statistical models and drawing conclusions about estimates without relying on the probability sampling framework or knowing what the sampling distribution is expected to look like. We might also combine the data with other probability samples and make inferences from the combined data.

The essential problem is that non-probability samples do not let us rely on sampling theory for making population inferences based on an expected sampling distribution. There are two approaches to this problem:

  1. Quasi-randomization / Pseudo-randomization
  2. Population Modeling

For either of these estimation techniques for non-probability samples, we need common variables in the two data sets. Simple estimation of the mean based on the non-probability sample may lead to a biased estimate, and we cannot estimate the sampling variance from the non-probability sample alone.

Quasi-randomization / Pseudo-randomization

Combine data from the non-probability sample with data from a probability sample that collected the same types of measures. We stack the two data sets together, then fit a logistic regression model to predict the probability of being in the non-probability sample from the common variables. In fitting that model, we weight the non-probability cases by 1 and the probability cases by their survey weights (which reflect their probabilities of selection).

Once we have the predicted probabilities of being in the non-probability sample from that logistic regression model, we invert them and treat them as survey weights in a standard weighted survey analysis.

Survey weight = 1 / Predicted probability
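
A rough sketch of these steps with pandas and scikit-learn (the column names, the toy data, and the single survey weight of 250 for the probability cases are assumptions for illustration, not an exact recipe):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Stacked data: common auxiliary variables measured in both samples
    # (hypothetical columns; in_np_sample = 1 marks the non-probability cases)
    stacked = pd.DataFrame({
        "age":          [23, 35, 47, 52, 61, 29, 44, 58, 33, 70],
        "educ_years":   [12, 16, 14, 12, 10, 18, 16, 11, 15, 12],
        "in_np_sample": [1,  1,  1,  1,  1,  0,  0,  0,  0,  0],
    })

    # Non-probability cases get weight 1; probability cases keep their survey weights
    # (a single hypothetical weight of 250 is used here for simplicity)
    case_weights = np.where(stacked["in_np_sample"] == 1, 1.0, 250.0)

    model = LogisticRegression()
    model.fit(stacked[["age", "educ_years"]], stacked["in_np_sample"],
              sample_weight=case_weights)

    # Predicted probability of being in the non-probability sample -> pseudo survey weight
    p_include = model.predict_proba(stacked[["age", "educ_years"]])[:, 1]
    stacked["pseudo_weight"] = 1.0 / p_include

    print(stacked.loc[stacked["in_np_sample"] == 1, ["age", "pseudo_weight"]])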

The issue with this approach is estimating the sampling variance: there is not really an entirely clear method for estimating a sampling distribution using these new pseudo-weights.

Population Modeling

The big idea is that we use predictive modeling to predict aggregate quantities (usually totals) on key variables of interest for the population units not included in the non-probability sample. We fit a regression model derived from the non-probability sample and use it to predict what all the other people in the population might have said, had they been part of the non-probability sample. Then we compute the estimate of interest using the estimated totals.
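
A rough sketch of the idea (a hypothetical auxiliary variable, a simple linear model, and a made-up frame of 10,000 population units; a real application would use richer auxiliary data):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(4)

    # Non-probability sample: auxiliary variable x observed together with the survey variable y
    x_sample = rng.uniform(20, 70, size=200).reshape(-1, 1)    # e.g. age
    y_sample = 5.0 + 0.3 * x_sample.ravel() + rng.normal(0, 2, size=200)

    model = LinearRegression().fit(x_sample, y_sample)

    # The auxiliary variable is known for the 9,800 population units NOT in the sample
    x_rest = rng.uniform(20, 70, size=9_800).reshape(-1, 1)
    y_rest_predicted = model.predict(x_rest)

    # Estimated population total = observed total in the sample + predicted total for the rest
    estimated_total = y_sample.sum() + y_rest_predicted.sum()
    print(estimated_total)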




For more on Probability, Sampling & Inference, please refer to the wonderful course here https://www.coursera.org/learn/understanding-visualization-data



I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai

