Logistic Regression

Logistic regression is an extension to linear regression. When there are only only 2 options for the predicted values, like failed (0) or success (1), linear regression model is not appropriate. Instead, we will predict a transformed version of the probability of success.

It is call the logit transformation where the logit function (or log odds) is used:

logit = ln( p / 1-p )

p: the probability of success

In general though, the higher the value of the logit function (goes to the +♾️), the higher the probability of success, and the lower the value of the logit function (goes to the -♾️), the lower the probability of success. We then could choose to reframe what a success is, say using the sign of the logit odds:

  • A success when log odds is positive ln(p / 1-p) > 0,
  • A failure when log odds is negative ln(p / 1-p) < 0

So using that log function, what we will do is we will fit a model using the logit of the probability of successful.

logit(y^) = b0 + b1 x

Note that we make assumption that this model is correct. We can check the model using the residual plots. If we had a large enough sample size, we might be able to identify discrepancies with those residual plots. If x takes a wide range of values, or if we have additional covariates, it will be more informative.

Multilevel Models

Dependent data are datasets where the observations are correlated with each other for one reason or another. These datasets are generated by study designs that introduce dependencies in the data:

  1. The first case is that several observations might be collected at one time point from different sampled clusters of analytic units, for instance from neighborhoods, or schools, or clinics. Observations nested within these clusters may be correlated with each other.
  2. Another case is where several observations are collected over time from the same individuals in a longitudinal study. The repeated measurements collected are likely to be correlated with each other.

The models that we fit to these types of datasets need to reflect the correlations.

Multilevel Linear Regression

Multilevel models are a general class of statistical models that can be used to model dependent data, where the observations that arise from a randomly sampled cluster maybe correlated with each other. In different fields, multilevel models are also known as:

  • Random coefficient models
  • Varying coefficient models
  • Subject-specific models
  • Hierarchical linear models
  • Mixed-effects models

Multilevel models are unique in that the regression coefficients are allowed to randomly vary across these randomly sampled clusters. So the regression coefficients no longer have to be fixed constants that we’re trying to estimate, instead we can estimate the amount of variability in these coefficients.

For instance, a longitudinal study where time might be a predictor of some outcome of interests. So we’re interested in studying trends in different outcomes over time. We could fit a model where the intercept and the slope in our model are allowed to randomly vary across the randomly sampled subjects that we’ve included in our study. So each subject in our study could have their own unique intercept and their own unique slope, instead of assuming that every subject follows the same general relationship or the same general pattern. A multilevel models allow us to estimate the variability among subjects or clusters in terms of these coefficients of interest.

What this means is that multilevel models allow us to expand the types of inferences that we can make from fitting models to the data:

  1. We can make inference about the relationships between predictor variables and outcomes, that doesn’t change.
  2. We can make inferences about how variable the coefficients are in the larger population from which these clusters were randomly sampled.
  3. We can try to explain that variability among these clusters, using cluster level predictor variables. So we can use some feature of those randomly sampled clusters to try to explain that variability in the coefficients.


Formally, random effects are defined as random variables that allow the coefficients in a regression model to randomly vary depending on the randomly selected cluster. In other words, the random effects of these higher-level randomly sampled clusters, are the reason why the coefficients randomly vary in the models.

(Level 1: observation level)
yij = β0j + β1j x1ij + eij

(Level 2: cluster level)
β0j = β0 + u0j
β1j = β1 + u1j

i: observation
j: cluster

Note that coefficients have a subscript denoted by j, which means that those regression coefficients randomly vary depending on the cluster j. So we refer these coefficients with subscript j now as random coefficients, these are not parameters.

β0 and β1 are the parameters that we are trying to estimate. u0j and u1j are still random variables, which shows how multilevel models add these random effects that allow each cluster j to have its own unique coefficient. A very common assumption that we make is that those random variables u0j and u1j follow a normal distribution with an average of zero, but then there’s variability in those u0j and u1j, and it’s that variance that we’re trying to estimate, i.e how variable are those coefficients around the overall coefficients β0 and β1.

By including these random effects, we’re saying that observations coming from the same cluster are correlated with each other statistically. This is how we model the correlation of these observations. If we didn’t include random effects in our model, we’re making the assumption that observations from the same cluster are independent of each other, and that correlation of observations within clusters is zero. Accounting for these correlations in our modeling, often substantially improves model fit when we work with dependent data.

All these conditions need to be true to warrant multilevel modeling:

  1. Data organized into cluster where several correlated observations are collected from each other.
  2. Clusters themselves are randomly sampled from larger population of clusters and wish to make inference about that larger population.
  3. Wish to explicitly model correlation of observations within same cluster.
  4. Have explicit research interest in estimating between-cluster variance in selected regression coefficients in our model.

Unexplained Variance

Multilevel models also allow us to decompose the unexplained variance in a given outcome into between-cluster variance and/or within-cluster variance, which isn’t accounted for by the predictors.

Between-cluster variancecaptured by the random effects uxj
Within-cluster variancecaptured by the random error terms eij

An important research question is: How much of unexplained variance, due to between-cluster variance, arises in the intercepts β0j or the slopes β1j for a given model? We need explicit research interest in estimating the variances of these random coefficients. If we’re not interested in estimating that variance, we can easily consider other models for dependent data.

The error term eij means what’s not predicted by the regression coefficients and the random effects.

We assume the common distributions for random effects uxj are normal with mean 0 and specified variances and co-variances, where D is defined as variance-covariance matrix of random effect.

We also assume that the error term eij within clusters follow a normal distribution with a mean of zero and a variance of sigma squared.

⎡ u0j ⎤ ~ N ⎡ ⎡0⎤, ⎡σ20, σ01⎤ ≡ D ⎤
⎣ u1j ⎦     ⎣ ⎣0⎦, ⎣σ01, σ21⎦     ⎦

eij ~ N(0, σ2)

Overall, in multilevel models we are estimating 6 parameters:

Fixed effectsβ0, β1
Random effectsσ20, σ21 (variances)
σ01 (covariance)
Error effectsσ2 (variance)

Estimating and Testing Parameters

We could use Maximum Likelihood Estimation (MLE) to estimate the model parameters. The basic idea of MLE is we want to know what values of the model parameters make the observed data the most likely.

We can also make inference about the model parameters using specialized tests for these kinds of models. For instance, compute confidence intervals or test null hypothesis that certain parameters are zero, using a technique called Likelihood Ratio Testing.

The idea of Likelihood Ratio Testing is “Does the probability or the likelihood of the observed data change substantially if we remove a parameter or parameters out of the model?” If removing certain parameters significantly reduce the likelihood of the observed data, we’re making the fit of the model worse.


Multilevel models also offer advantages over other approaches for dependent data:

When we fit these models, we estimate one parameter that represents the variance of a given random coefficient across the clusters, instead of estimating unique regression coefficients for every possible clusters. This is a much more efficient approach especially when we have a large number of clusters.

Clusters with small sample sizes do not have as pronounced of an effect on variance estimate as larger clusters; their effects shrink toward overall mean of outcome when using random effects.

We can further add cluster-level predictors Tj to those level 2 equations for random coefficients. We can use Tj to explain variability in those uxj, just like we would in any other linear regression model.

(Level 1: observation level)
yij = β0j + β1j x1ij + eij

(Level 2: cluster level)
β0j = β00 + β01 Tj + u0j
β1j = β10 + β11 Tj + u1j

i: observation
j: cluster

Multilevel Logistic Regression

Multilevel logistic regression models are focusing on multilevel models for binary dependent variables. Using single level equation incorporating the fixed effects and random effects, the multilevel model can be expressed like this:

logit[P(yij = 1)]
= ln[ P(yij = 1) / (1 - P(yij = 1)) ]
= β0j + β1j x1ij + eij
= β0 + u0j + β1 x1ij + u1j x1ij + eij

We make the same distributional assumptions that we did in multilevel linear regression models about these random effects: normally distributed, mean vector is zeros, and unique variances and co0-variances.

However the outcome is non-normal. When we fit these kinds of generalized linear regression models to non-normal outcomes and we include random effects, estimation becomes a lot more difficult mathematically. It’s a much more difficult computational problem, and it takes longer to estimate these models from non-normal dependent variables.

We try to fit multilevel models to non-normal outcomes, in many cases it’s difficult to write down the likelihood function and in some cases we may not even be able to write it down, because there may not be what’s called a closed form solution for that likelihood function.

What we have to do in practice in many cases is:

  1. First of all, approximate that likelihood function
  2. Then find parameter estimates that maximize approximate likelihood

One possible approach is called Adaptive Gaussian Quadrature, which is just an estimation method that involves approximating the likelihood, and then maximize net likelihood.

For testing parameters, we would use the same Likelihood Ratio Testing approach used in multilevel linear models, for multilevel logistic models.

Marginal Models

The key distinction in Marginal Models is that we do not include random effects because we’re not interested in estimating between-cluster variance in the coefficients as part of the analysis, but we still want to accurately model the within-cluster correlation.

Marginal Models refer to a general class of statistical models that are used to model dependent data where observations within a randomly sampled cluster may in fact be correlated with each other for one reason or another.

What we’re interested in is the estimation of overall, population-averaged relationships between independent variables of interest (IVs) and dependent variables of interest (DVs), across all of the different clusters. So, in marginal models we don’t allow the coefficients of our model to randomly vary across clusters. Our goal with fitting marginal models is to make inference about these overall marginal relationships, and make sure that the standard errors of our estimates reflect the cluster sampling inherent to the study design.

If we fail to account for the fact that observations are correlated within the same cluster, we run the risk of understating our standard errors. We need to make sure that our standard errors reflect the sampling variance of the regression coefficients and the correlations within these higher-level clusters. Marginal models can do that without the use of random effects.

What steps do we follow when fitting marginal models?

  1. Explicitly select a structure for the mean of the dependent variables, usually defined by regression coefficients and predictor variables.
  2. Select a structure that makes sense for the variances and covariances of observations that are coming from the same cluster based on our study design. The structure needs to reflect the variances and covariances of these observations that are not explained by the selected predictor variables, that we’ve used to define the mean of the dependent variable.
  3. Compare the fits of different marginal models with different choices for the variance-covariance structure and we choose which model has the best fit.

When can we fit marginal models?

  • We have data from a longitudinal or clustered study design that introduces dependencies in the collected data (being different from assuming i.i.d data), and we need to model those dependencies to obtain accurate inference about relationship of interest.
  • We have no interest in estimating between-subject or between-cluster variance in the coefficients of interest.
  • We wish to make inference about overall, marginal relationship between IVs and DVs in the target population. We do not wish to condition on random effects of subjects or clusters in the modeling.

Marginal models do offer some advantages over other approaches for dependent data:

  1. Quicker computational times, faster estimation.
  2. Robust standard errors that reflect the specified correlation structure.
  3. Easier accommodation of non-normal outcomes.

Marginal Linear Regression

When fitting marginal models to normal dependent variables, we estimate parameters defined by this model, in which there fixed effects β0 to βp and error term eti, but there is no random effects:

yti = β0 + β1 x1ti + ... + βp xpti + eti

The vector of observations of cluster i can be expressed as this, when there are n observations in total, assuming the vector of observations on yi follows a normal distribution with a mean μi = Xiβ, and the variance-covariance matrix Vi:

yi = (y1i, y2i, ..., yni)' ~ N(Xiβ, Vi)

Xi : predictor variables
β  : fixed effects
Vi : variance-covariance matrix

An important point here we don’t explicitly estimate the variances and covariances of random effects because there simply aren’t any in the model. The correlations of observations within the same cluster are being captured entirely by the correlations of those error terms, and those correlations are being captured in Vi matrix. So an important part of fitting these models is choosing a structure for that Vi matrix.

Generalized Estimating Equations

One of the methods for fitting marginal models is called Generalized Estimating Equations (GEE). When we fit models to dependent data using GEE, we are actually trying to find estimates of the parameters that define our model that solved the following score function (or estimating equation):

S(β) = Σni=1 DTi V-1i (yi - μi) = 0

Di : ∂μi/∂β, a n-by-p matrix
Vi : variance-covariance matrix, a n-by-n matrix
yi : vector of outcome measures collected for cluster i
μi : vector of expected means for outcome measures

The score function S is a function of the regression parameters β, we want to find the estimates of the parameters that solve this particular equation where the summation equal to zero. Di represents taking the derivative of that mean function μi with respect to the regression parameters β. The variance covariance matrix Vi is for those observations on cluster i, and this is something that’s user’s specified.

We solve the equation using something called Iteratively Weighted Least Squares or other algorithms such as the Fisher-scoring algorithm. These are all iterative techniques that attempt to identify the estimates of those β parameters that solved this particular score equation.

Working Correlation Matrix

We like to specify a variance-covariance Vi matrix, but we seldom know what it actually is. So, in practice, we attempt to define a working correlation matrix. This is like a plausible guess for what we believe that true correlation structure is for observations coming from cluster i.

We typically specify expected correlations of these observations and not covariances because for many non-normal outcomes, which are what GEE is often used for, the variances of the observations are actually defined by the mean structure of the model. Remember: good specification of mean structure is very important.

So, in practice, we’re more focused on choosing a correlation structure, and not worrying as much about the variances of the dependent variable, that comes with the definition of the mean for many different types of models. This general approach allows this methodology to be easily adapted to non-normal dependent variables (Poisson distributions, Binomial distributions, etc).

Choices for working correlation matrix include:

IndependenceZero correlation, independent observations
ExchangeableConstant correlation of observations in the same cluster
AR(1)First-order auto-regressive, decaying correlation over time
UnstructuredCompletely general correlation of observations


Variances of parameter estimates are computed using Sandwich Estimator, which is based on specified working correlation matrix and variance-covariance matrix of observations based on expected means from fitted model.

Estimators of fixed effect parameters β are consistent even if specified working correlation matrix is incorrect. Bad choices of that working correlation matrix effect the standard errors, not the estimates of the fixed effect parameters. With pretty large data sets, alternative choices of working correlation matrices aren’t going to make a huge difference for our inference.

We can use Adapted Information Criterion (QIC) to compare fits of competing models. We would try to identify the model that has the smallest value of QIC, which would suggest that that’s the correlation structure that represents the best fit.

When fitting marginal models, we can no longer make inference about between cluster-variance, we were just controlling for that nuisance correlation within clusters when making inferences about our fixed effects of interest.

Marginal Logistic Regression

Marginal logistic regression models are the models for binary dependent variables where observations on that binary dependent variables are correlated within a cluster that was introduced by the study design. GEE methods were specifically designed to readily accommodate non-normal outcome variables that were measured longitudinally.

The mean structure in the estimating equation is defined by a given type of generalized linear models. Recall in marginal linear regression, the mean was defined by that linear combination of the predictor variables and the regression coefficients, i.e. Xiβ.

However in the case of logistic regression for binary dependent variables, the mean of the dependent variables is the probability that the dependent variable equals to 1, which is also the expectation value of dependent variable yi conditional on the predictor variables Xi. In the specific case of logistic regression where we’re using that logit function to fit the actual model, we write this probability that the dependent variable is equal to one as the inverse logit function:

μi = πi = E(yi | Xi) = exp(Xiβ) / [1+exp(Xiβ)]

μi : the mean of the DV
πi : the probability that DV equals to 1

this is where the β parameters enter into the estimating equation. Then we try to solve for the values of those parameters that solve the score function S(β). The mean and the variance of the binary dependent variable are both defined by the specified model. With GEE, our job is only to choose a working correlation structure because we’ve already specified the mean and the variance of the dependent variable with the model that we’ve indicated.

Survey Weights

When we talked about the notion of a probability sample, survey weights may be available in data sets that are collected from complex probability samples, and that these weights account for, at the very least, unequal probabilities of selection into the sample for different cases. Cases with a lower probability of being selected into the sample will get a higher weight in the overall analysis.

In theory, these weights are designed to enable unbiased estimation of selected parameters for a finite target population. We use these weights to make inference about that finite target population, and the weights enable unbiased estimation, meaning that, on average, our estimates will be equal to the true finite population parameters of interest.

When fitting a regression model to a dependent variable collected from a probability sample, we know that there are weights in that data set which represent, at the very least, different probabilities of selection. Now the question is “do we need to account for those weights when actually fitting the regression model and making inference?”

Quick answer is “Yes”. Using survey weights to fit regression models ensures that the estimated regression parameters will be unbiased with respect to the sample design. However, those survey weights do not protect from poor model specification. So in practice, we want both:

  • Unbiased estimates of the regression parameters, and
  • A correct model for the finite population of inference

The drawback is that using the weights in estimation, even though you’ve done a good job of specifying the model describing this relationship, you can unnecessarily inflate the standard errors of your regression parameter estimates, and that increases the sampling variance of the estimates.

If survey weights are available for a probability sample, and you wish to fit a regression model, the best practice is:

  1. Do the best you can do to specify the model correctly.
  2. Fit the model with and without using the survey weights.
    • If estimated coefficients remain similar, but weighted estimates have larger standard error, model has likely been specified correctly → weights are unnecessary.
    • If estimated coefficients change substantially, the model may have been misspecified. Weighted estimates should be reported to ensure they are at least unbiased with respect to sample design.

Bayesian Approaches

Suppose we have an original belief, which is some kind of a distribution, say a normal distribution, which is called a prior. Then when we see more and more data, we may change our belief, in other words, we may change the distribution that is in our mind.

This process that we went through is called Bayesian Updating. Finally when we finished processing all the data, this process will provides a distribution, called a posterior, which allows us to update our beliefs and answer questions about the quantity of interest.

Belief about             Collect data
 the world      - - →      & model
     ↑                        |
     ⎣        Bayesian        ↲

The steps of doing a Bayesian Update:

  1. Belief: establish a belief about the world (includes a prior and likelihood functions)
  2. Model: use data and probability, in accordance with your belief, to update your model
  3. Update: update your belief using data and model, and get a posterior
  4. Repeat step 2 and 3 using posterior from step 3 as a new prior for step 2

OK, why would someone want to do this? The first thing is all questions about our beliefs of our quantity of interest can be found via the posterior. We can also combine this with loss function to get a better model for optimal decision-making.

Bayesian updating is flipping probability on its head:

FrequentistsA long-run limiting frequencyFixed, unknowable constantsRandom
BayesiansA “degree of belief”Random variables (distributions)Fixed

However we have to rewrite our entire definition of probability to make this work. In mathematically, the process can be quite difficult and even be intractable, sometimes.

In a Bayesian setting, we need priors on each of the parameters. Two different people may use two different priors and conclude two different things! It is totally OK if model seems subjective. Often, it’s a great idea to ask experts to incorporate prior data, and this is where Bayesian analysis really shines.

My Certificate

For more on Multilevel Models vs Marginal Models, please refer to the wonderful course here https://www.coursera.org/learn/fitting-statistical-models-data-python

Related Quick Recap

I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai

Don't forget to sign up newsletter, don't miss any chance to learn.

Or share what you've learned with friends!