Generative models are a kind of statistical model that aims to learn the underlying data distribution itself. If a generative model captures that distribution well, it can produce new instances that could plausibly have come from the same dataset. You could also use it for anomaly detection, since it tells you how likely a given instance is under the model.

The data distribution can be very complex and difficult to model. One approach to this problem is to take an initial, simple density and transform it, possibly through a series of parameterized transformations, into a rich and complex distribution. If these transformations are smooth and invertible, then we are able to evaluate the density of the complex transformed distribution. This property is important because it allows us to train such a model using maximum likelihood. This is the idea behind normalizing flows.

Change of Variables Formula

Normalizing flows are a class of models that exploit the change of variables formula to estimate an unknown target data density. Suppose we have a dataset with n samples D := {x^(1), x^(2), …, x^(n)}, with each x^(i) ∈ ℝ^d, and assume that these samples are generated i.i.d. from the underlying distribution p_X.

A normalizing flow models the distribution p_X using a random variable Z (also of dimension d) with a simple distribution p_Z, such that the random variable X can be written as a change of variables, i.e. X = f_θ(Z), where θ is a parameter vector that parameterizes the smooth invertible function f_θ.

The function f_θ is modeled using a neural network with parameters θ, which we want to learn from the data. An important point is that this neural network must be designed to be invertible, which is not the case in general with deep learning models. In practice, we often construct the neural network by composing multiple simpler blocks together. In TensorFlow Probability, these simpler blocks are the bijectors.

In order to learn the optimal parameters θ, we apply the principle of maximum likelihood and search for θ_ML such that θ_ML = argmax_θ p(D; θ) = argmax_θ log p(D; θ). In order to calculate p(D; θ), we can use the change of variables formula.

p_X(x) = p_Z(z) · |det J_{f_θ}(z)|^{-1},  where z = f_θ^{-1}(x)
⟹ p_X(x) = p_Z(f_θ^{-1}(x)) · |det J_{f_θ^{-1}}(x)|
⟹ p(D; θ) = ∏_{x∈D} p_Z(f_θ^{-1}(x)) · |det J_{f_θ^{-1}}(x)|
⟹ log p(D; θ) = ∑_{x∈D} [ log p_Z(f_θ^{-1}(x)) + log |det J_{f_θ^{-1}}(x)| ]

The term p_Z(f_θ^{-1}(x)) can be computed for a given data point x ∈ D, since the neural network f_θ is designed to be invertible and the distribution p_Z is known. The term det J_{f_θ^{-1}}(x) is also computable, but for training to be practical the determinant of the Jacobian J must be cheap to compute.
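As a quick sanity check of the log-likelihood formula above, here is a minimal sketch in plain Python (no TensorFlow). It assumes a toy one-dimensional flow x = scale · z + shift with a standard normal base distribution; the function names and the three data points are made up for illustration. For this flow the transformed density is known in closed form, N(shift, scale²), so the two results should agree.

```python
import math

def std_normal_log_prob(z):
    """Log density of the standard normal base distribution p_Z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def flow_log_prob(x, scale, shift):
    """log p_X(x) for the toy flow x = f(z) = scale * z + shift."""
    z = (x - shift) / scale                   # z = f^{-1}(x)
    inverse_log_det = -math.log(abs(scale))   # log |det J_{f^{-1}}(x)|
    return std_normal_log_prob(z) + inverse_log_det

def normal_log_prob(x, mean, std):
    """Closed-form log density of N(mean, std^2), used only as a reference."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))

data = [0.5, 1.0, 3.2]  # hypothetical dataset D

# log p(D; theta) = sum over x in D of [log p_Z(f^{-1}(x)) + log |det J_{f^{-1}}(x)|]
log_likelihood = sum(flow_log_prob(x, scale=2.0, shift=1.0) for x in data)
reference = sum(normal_log_prob(x, mean=1.0, std=2.0) for x in data)
print(log_likelihood, reference)
```

In a real flow, scale and shift would be replaced by a neural network and the sum would be maximized with respect to its parameters.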

The Bijector Objects

Bijector objects from the TensorFlow Probability library form the basis for normalizing flow models. Bijectors are used to transform tensor objects, and they have methods to apply both the forward and the inverse transformation.

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

# one-dimensional tensor of length 3
z = tf.constant([1., 2., 3.])

# create the bijector object
scale = tfb.Scale(2.)

# apply the forward transformation
x = scale.forward(z)
# tf.Tensor([2. 4. 6.], shape=(3,), dtype=float32)

# apply the inverse transformation
scale.inverse(tf.constant([5., 3., 1.]))
# tf.Tensor([2.5 1.5 0.5], shape=(3,), dtype=float32)

Now, a simple bijective transformation could be built by adding a shift as well as a scale operation. Any chain of smooth and invertible transformations will again be smooth and invertible.

scale = tfb.Scale(2.) # scale bijector
shift = tfb.Shift(1.) # shift bijector

# bijective transformation
# equivalent to scale_and_shift = shift(scale)
scale_and_shift = tfb.Chain([shift, scale]) # NOTE the reverse order

# apply the forward transformation
# equivalent to scale_and_shift(z)
x = scale_and_shift.forward(z)
# tf.Tensor([3. 5. 7.], shape=(3,), dtype=float32)

# apply the inverse transformation
scale_and_shift.inverse(tf.constant([2., 5., 8.]))
# tf.Tensor([0.5 2. 3.5], shape=(3,), dtype=float32)
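The ordering convention of the chain can be easy to get wrong, so here is a small plain-Python sketch of the idea. The `chain` helper below is hypothetical, not TFP code: it composes (forward, inverse) function pairs so that, as with tfb.Chain, the last entry in the list is applied first in the forward direction and the inverses are applied in the opposite order.

```python
def chain(bijectors):
    """Compose (forward, inverse) pairs; the LAST entry is applied first."""
    def forward(z):
        for fwd, _ in reversed(bijectors):
            z = fwd(z)
        return z

    def inverse(x):
        # Undo the transformations in the opposite order.
        for _, inv in bijectors:
            x = inv(x)
        return x

    return forward, inverse

scale = (lambda z: 2.0 * z, lambda x: x / 2.0)   # scale by 2
shift = (lambda z: z + 1.0, lambda x: x - 1.0)   # shift by 1

forward_fn, inverse_fn = chain([shift, scale])   # scale first, then shift

print([forward_fn(z) for z in [1.0, 2.0, 3.0]])  # [3.0, 5.0, 7.0]
print([inverse_fn(x) for x in [2.0, 5.0, 8.0]])  # [0.5, 2.0, 3.5]
```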

Bijectors can also be used to transform random variables, and to compute log probabilities of events under the transformed distribution.

normal = tfd.Normal(loc=0., scale=1.)

z = normal.sample(3)
# tf.Tensor([-0.32 1.40 0.42], shape=(3,), dtype=float32)

x = scale_and_shift.forward(z)
# tf.Tensor([0.35 3.81 1.84], shape=(3,), dtype=float32)

We’d like to be able to evaluate the density of this transformed distribution at x, and that’s exactly what the change of variables formula tells us how to do.

log_prob_x = (normal.log_prob(z)
              - scale_and_shift.forward_log_det_jacobian(z, event_ndims=0))

The event_ndims argument specifies the number of event-space dimensions present in the input tensor z, which has shape semantics given by sample_shape, batch_shape, and event_shape. The log Jacobian determinant is reduced (summed) over the event dimensions. In the example above, the sample_shape is one-dimensional, but batch_shape and event_shape are both empty, hence event_ndims=0.
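To make the reduction over event dimensions concrete, here is a plain-Python sketch that mimics the idea for an elementwise scaling x = scale * z. The helper `forward_log_det_jacobian` below is a hypothetical stand-in written for this illustration, not TFP's implementation, and it only supports event_ndims of 0 or 1 on a batch of vectors.

```python
import math

def forward_log_det_jacobian(scale, z, event_ndims):
    """Toy log-det for x = scale * z, applied to a batch `z` of vectors.

    Each element contributes log|scale[j]|. The contributions are then summed
    over the trailing `event_ndims` dimensions.
    """
    per_element = [[math.log(abs(s)) for s in scale] for _ in z]
    if event_ndims == 0:
        return per_element                        # every scalar is its own event
    if event_ndims == 1:
        return [sum(row) for row in per_element]  # each row is one vector event
    raise ValueError("this sketch only supports event_ndims 0 or 1")

scale = [1.0, 2.0, 4.0]
z = [[0.1, 0.2, 0.3], [1.0, 2.0, 3.0]]

per_scalar = forward_log_det_jacobian(scale, z, event_ndims=0)  # shape (2, 3)
per_vector = forward_log_det_jacobian(scale, z, event_ndims=1)  # shape (2,)
print(per_scalar)
print(per_vector)  # each entry is log(1 * 2 * 4) = log 8
```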

However, we can also invert the change of variables formula to express it in terms of the inverse of the bijective transformation. The log probability of x is equal to the log probability of z plus the log of the Jacobian determinant of the inverse transformation evaluated at x. Of course, we can just replace z with the result of the bijector’s inverse method applied to x.

log_prob_x = (normal.log_prob(z)
              + scale_and_shift.inverse_log_det_jacobian(x, event_ndims=0))
# equivalently, using only the tensor x
log_prob_x = (normal.log_prob(scale_and_shift.inverse(x))
              + scale_and_shift.inverse_log_det_jacobian(x, event_ndims=0))

In the second expression here, we’re computing the same log probability of x, but only using the tensor x. In practice, we’re mostly going to be using this second form of the change of variables formula.
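The equivalence of the two forms can be checked numerically. The sketch below is plain Python rather than TFP, and it assumes a different toy bijector, x = sinh(z), chosen only because its inverse (asinh) and both Jacobians are available in closed form.

```python
import math

def std_normal_log_prob(z):
    """Log density of the standard normal base distribution."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

z = 0.8
x = math.sinh(z)  # forward transformation

# Form 1: uses z and the forward log-det, log |d sinh(z) / dz| = log cosh(z).
forward_form = std_normal_log_prob(z) - math.log(math.cosh(z))

# Form 2: uses only x and the inverse log-det,
# log |d asinh(x) / dx| = -0.5 * log(1 + x^2).
inverse_form = std_normal_log_prob(math.asinh(x)) - 0.5 * math.log1p(x * x)

print(forward_form, inverse_form)
```

Both numbers agree up to floating-point rounding, because the inverse log-det at x is the negative of the forward log-det at z.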

TransformedDistribution Objects

A TransformedDistribution object is a distribution defined by a base distribution and a bijector object. It offers the same API as other distribution objects, so we can use the same methods and properties.

A normalizing flow is a generative model of the data. The model assumes:

  1. a latent variable z, which is distributed according to some base distribution. Typically we will assume it to be something simple, like a diagonal Gaussian.
  2. the data-generating process first samples z from this base distribution and then transforms it according to a function f to produce the data sample x.

# Base dist.             Transformation            Data dist.
# z ~ P0       ⟺        x = f(z)          ⟺      x ~ P1

For a normalizing flow, the function f will be bijective or invertible. It’ll also be parameterized and we learn the best parameters with maximum likelihood. In the training process, we’ll have sample data points x and we’ll want to compute the log probability of x under the model.

We’re using the tensor x to compute the log probability of x under the model and this uses the inverse_log_det_jacobian method of the bijector. This log probability is what we will aim to maximize within a training loop. The bijector objects will contain the parameters that we’re trying to optimize.

log_prob_x = (base_dist.log_prob(bijector.inverse(x))
              + bijector.inverse_log_det_jacobian(x, event_ndims=0))

Once the model is trained, we can then sample from the model by:

  1. first sampling from the base distribution, and then
  2. passing that sample through the bijective transformation using the forward method of the bijector.
x_sample = bijector.forward(base_dist.sample())

By convention, we think of the forward transformation as being used for sampling, and of the inverse transformation, together with the inverse_log_det_jacobian, as being used for computing log probabilities.

The TransformedDistribution object is useful to directly define the data distribution P1 with the distribution object. The TransformedDistribution comes from the distributions module, and the constructor has two required arguments, which are the base distribution and the bijector.

normal = tfd.Normal(loc=0, scale=1)
z = normal.sample(3)

exp = tfb.Exp()
x = exp.forward(z)

log_normal = tfd.TransformedDistribution(normal, exp)
# shortcut, call the bijector on the base distribution
log_normal = exp(normal)

Once we have the TransformedDistribution, we can use the sample and log_prob methods as usual.
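As a cross-check of what log_prob computes for this log-normal example, here is a plain-Python sketch (no TFP). It evaluates the density through the inverse form of the change of variables formula and compares it with a finite-difference derivative of the CDF, P(X ≤ x) = Φ(log x). The helper names, the evaluation point x = 1.7, and the step size h are arbitrary choices made for this illustration.

```python
import math

def std_normal_log_prob(z):
    """Log density of the standard normal base distribution."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def std_normal_cdf(z):
    """CDF of the standard normal, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def log_normal_log_prob(x):
    """log p_X(x) for X = exp(Z), Z ~ N(0, 1), via the inverse form."""
    z = math.log(x)                   # inverse of the Exp bijector
    inverse_log_det = -math.log(x)    # log |d log(x) / dx|
    return std_normal_log_prob(z) + inverse_log_det

x, h = 1.7, 1e-5
density_from_flow = math.exp(log_normal_log_prob(x))

# Independent check: the density is the derivative of the CDF of X.
density_from_cdf = (std_normal_cdf(math.log(x + h))
                    - std_normal_cdf(math.log(x - h))) / (2.0 * h)

print(density_from_flow, density_from_cdf)
```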

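The TransformedDistribution constructor also has optional keyword arguments for event_shape and batch_shape. These keyword arguments can be used to override the batch shape or event shape of the base distribution in the case where the base distribution’s shape is empty. Note that newer TFP releases have deprecated these two arguments; there, wrap the base distribution instead (for example with tfd.Sample) to give it the desired event shape.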

Subclassing Bijectors

In case you need a very particular or customized transformation, you can write your own bijector to get that extra flexibility. Here is an example of subclassing the base Bijector class from the bijectors module.

class MySigmoid(tfb.Bijector):
  def __init__(self, validate_args=False, name='sigmoid'):
    # pass the arguments on to the base class initializer
    super(MySigmoid, self).__init__(validate_args=validate_args,
      forward_min_event_ndims=0, name=name)

  def _forward(self, x):
    return tf.math.sigmoid(x)

  # inverse func is logit func
  def _inverse(self, y):
    return tf.math.log(y) - tf.math.log(1-y)

  def _inverse_log_det_jacobian(self, y):
    return -tf.math.log(y) - tf.math.log(1-y)
    # alternatively
    # return -self._forward_log_det_jacobian(self._inverse(y))

  def _forward_log_det_jacobian(self, x):
    return -tf.math.softplus(-x) - tf.math.softplus(x)
    # alternatively
    # return -self._inverse_log_det_jacobian(self._forward(x))

The initializer __init__ of this class takes the standard keyword arguments validate_args and name. The validate_args keyword argument can be used to check that any inputs passed to the class methods are valid.

When you’re creating a bijector, it’s necessary to set the minimum number of event dimensions the bijector needs to act upon. In our case of the sigmoid bijector, this is a function that can operate on scalars, so the forward_min_event_ndims argument is set to zero.
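When writing a custom bijector, it is worth checking the log-det-Jacobian formulas numerically. The sketch below is plain Python, independent of TFP, with helper names chosen for this illustration. It compares the two sigmoid formulas used above against a central finite difference and confirms that the forward and inverse log-dets cancel.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    # log(1 + exp(x)); adequate for the moderate inputs used here
    return math.log1p(math.exp(x))

def forward_log_det_jacobian(x):
    """log |d sigmoid(x) / dx| = log sigmoid(x) + log(1 - sigmoid(x))."""
    return -softplus(-x) - softplus(x)

def inverse_log_det_jacobian(y):
    """log |d logit(y) / dy| = -log(y) - log(1 - y)."""
    return -math.log(y) - math.log(1.0 - y)

x, h = 0.7, 1e-6
y = sigmoid(x)

# Central finite-difference estimate of the derivative of the forward map.
numeric_fldj = math.log((sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h))

print(forward_log_det_jacobian(x), numeric_fldj)
print(forward_log_det_jacobian(x) + inverse_log_det_jacobian(y))  # close to 0
```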

Masked Autoregressive Flows

The invertibility of the transformations is important because it allows us to compute the likelihood of the data under the model, which in turn means that we can train the model using the principle of maximum likelihood. In practice though, we often need the computation of the log Jacobian determinants to be efficient. For large models with many layers and high-dimensional transformations, this computation can be expensive and simply too slow for practical applications.

In an autoregressive model, assuming that our data is D-dimensional, we express the joint distribution of all data features as a product of conditional distributions, where each conditional probability of the feature x_i depends only on the features x_0 to x_{i-1}.

# z ~ N(0, I)
# x[i] = z[i] * scale(x[0: i]) + loc(x[0: i]), i = 0, ..., D-1
# (Python-style slices: x[0: i] holds the features x[0], ..., x[i-1])

In the example shown here, we are modeling the conditional distribution for each feature x_i as a Gaussian distribution, where the mean and standard deviation are functions of the features x_0 to x_{i-1}. This autoregressive model is called a Masked Autoregressive Flow.

One important property of this flow is that the Jacobian determinant is easy to compute: the Jacobian matrix is lower triangular, so its determinant is the product of the diagonal entries. The log determinant of the Jacobian of the forward transformation (z to x) is just the sum of the log standard deviations for each feature, and that of the inverse transformation is the negative of that sum.

There are many possible choices for the functions loc and scale shown above. One particular choice of model is implemented as the AutoregressiveNetwork class, an implementation of the Masked Autoencoder for Distribution Estimation (MADE) architecture.

The AutoregressiveNetwork class is a feed-forward neural network. It isn’t actually a bijector, but we’re going to use it to define an autoregressive flow bijector. For example:

made = tfb.AutoregressiveNetwork(
  params=2, # two outputs per feature: mean and log standard deviation
  event_shape=[3], # input size of the network
  hidden_units=[16, 16] # 2 hidden layers each of size 16
)

# for example
made(tf.random.normal([2, 3]))
# tf.Tensor(
# [[[ aaa, bbb ]
#   [ ccc, ddd ]
#   [ eee, fff ]]

#  [[ ggg, hhh ]
#   [ iii, jjj ]
#   [ kkk, lll ]]], shape=(2, 3, 2), dtype=float32)

The network returns an output that has the same shape as the input (given by event_shape) plus an extra dimension, whose size is given by the params argument. In this example, for each feature of each element in the batch, the network outputs two parameters: we can use the first for the mean and the second for the log of the standard deviation.

In other words, we have a mean parameter and a log standard deviation parameter for each data feature. The parameters for feature i are computed using only the features up to i - 1. The network achieves this by zeroing out (masking) a number of its weights.
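The masking idea can be illustrated with a deliberately simplified sketch: a single linear layer in plain Python whose weight matrix is restricted to its strictly lower-triangular part. This is an assumption made for illustration only. The actual MADE architecture builds masks for several hidden layers, but the effect on the dependencies is the same: output i never sees features i, i+1, and so on.

```python
def masked_linear(x, weights):
    """Output i sums weights[i][j] * x[j] over j < i only."""
    n = len(x)
    return [sum(weights[i][j] * x[j] for j in range(i)) for i in range(n)]

# Hypothetical 3x3 weight matrix; entries with j >= i are effectively masked out.
weights = [[0.2, -0.5, 0.1],
           [0.7, 0.3, -0.4],
           [-0.6, 0.9, 0.8]]

base = masked_linear([1.0, 2.0, 3.0], weights)

# Changing the last feature affects no output at all.
changed_last = masked_linear([1.0, 2.0, 99.0], weights)

# Changing feature 1 affects only output 2.
changed_middle = masked_linear([1.0, 50.0, 3.0], weights)

print(base, changed_last, changed_middle)
```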

Now that we have our AutoregressiveNetwork, we can use it to define the Masked Autoregressive Flow.

maf_bijector = tfb.MaskedAutoregressiveFlow(shift_and_log_scale_fn=made)

The code below gives an idea of the computation that is being done in the forward and inverse transformation.

def forward(z):
  x = tf.zeros_like(z)
  for _ in range(D):
    shift, log_scale = shift_and_log_scale_fn(x)
    x = z * tf.math.exp(log_scale) + shift
  return x

def inverse(x):
  shift, log_scale = shift_and_log_scale_fn(x)
  return (x - shift) / tf.math.exp(log_scale)

Remember that for data with a large number of features and a large autoregressive network, the forward and inverse transformations have very different computational costs.

Forward transformation: invoked by sampling from the distribution; involves a loop over all the features; can be very slow.
Inverse transformation: invoked by computing the log probability; done in parallel in a single pass; very fast.
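The asymmetry between the two directions can be seen in a small runnable sketch. It is plain Python with D = 3 and a hand-written toy conditioner standing in for the autoregressive network; the function `shift_and_log_scale` and its coefficients are made up for this illustration. The forward pass needs D sweeps, the inverse needs one, and the forward log-det is the sum of the log scales.

```python
import math

D = 3

def shift_and_log_scale(x):
    """Toy autoregressive conditioner: outputs for feature i use only x[:i]."""
    shift = [0.5 * sum(x[:i]) for i in range(D)]
    log_scale = [0.1 * sum(x[:i]) for i in range(D)]
    return shift, log_scale

def forward(z):
    """z -> x. Feature i needs x[:i] first, so D sequential passes are required."""
    x = [0.0] * D
    for _ in range(D):
        shift, log_scale = shift_and_log_scale(x)
        x = [z[i] * math.exp(log_scale[i]) + shift[i] for i in range(D)]
    return x

def inverse(x):
    """x -> z. All of x is known, so a single pass is enough."""
    shift, log_scale = shift_and_log_scale(x)
    return [(x[i] - shift[i]) * math.exp(-log_scale[i]) for i in range(D)]

z = [0.3, -1.0, 0.8]
x = forward(z)
z_recovered = inverse(x)

# Forward log-det: the Jacobian is triangular, so sum the log scales.
forward_log_det = sum(shift_and_log_scale(x)[1])

# Numeric check using the diagonal entries d x[i] / d z[i].
h = 1e-4
numeric_log_det = 0.0
for i in range(D):
    z_plus, z_minus = list(z), list(z)
    z_plus[i] += h
    z_minus[i] -= h
    numeric_log_det += math.log((forward(z_plus)[i] - forward(z_minus)[i]) / (2.0 * h))

print(z_recovered, forward_log_det, numeric_log_det)
```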

You could also use an entirely different model for the shift_and_log_scale_fn function, for example, using standard Keras layers. However, you need to be sure that any model you are using for the masked autoregressive flow object has the autoregressive property. This property isn’t checked by the masked autoregressive flow class.

We can now define our transformed distribution as before using the TransformedDistribution object, passing in the base distribution (a standard normal) and the masked autoregressive flow bijector.

normal = tfd.Normal(loc=0, scale=1)
maf = tfd.TransformedDistribution(normal, maf_bijector, event_shape=[3])

Inverse Autoregressive Flows

We can use the masked autoregressive flow above to easily define an Inverse Autoregressive Flow, by applying the Invert bijector to a masked autoregressive flow object.

# z ~ N(0, I)
# x[i] = (z[i] - loc(z[0: i])) / scale(z[0: i]), i = 0, ..., D-1
# (the parameters now depend on z, so all x[i] can be computed in one pass)

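made = tfb.AutoregressiveNetwork(
  params=2, # two outputs per feature: mean and log standard deviation
  event_shape=[3], # input size of the network
  hidden_units=[16, 16] # 2 hidden layers each of size 16
)

iaf_bijector = tfb.Invert(
  tfb.MaskedAutoregressiveFlow(shift_and_log_scale_fn=made)
)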

normal = tfd.Normal(loc=0, scale=1)
iaf = tfd.TransformedDistribution(normal, iaf_bijector, event_shape=[3])

Sampling from this model is fast, but computing log probabilities is slow.

Real NVP

Real NVP stands for Real-valued Non-Volume Preserving. The Real NVP architecture is a special case of the autoregressive flow. We will partition the vector x in two:

  1. The first d elements of x are set to be equal to the first d elements of the random variable z.
  2. These first d elements of z are also used as input to functions that produce mean and standard deviation parameters for the Gaussian distributions that model the remaining features in the vector x.
# z ~ N(0, I)
# x[0: d] = z[0: d]
# x[d: D] = z[d: D] * scale(z[0: d]) + loc(z[0: d])

The scale and loc functions can be arbitrary functions. They do not need to satisfy any autoregressive property themselves, yet the resulting Real NVP flow is still autoregressive, because x[d: D] depends only on z[0: d]. Inside the bijectors module, there’s the real_nvp_default_template function that can be used to build these functions for the RealNVP bijector.

shift_and_log_scale_fn = tfb.real_nvp_default_template(
  hidden_layers=[32, 32],  # number and size of hidden layers
  activation=tf.nn.relu,   # activation function
  shift_only=False         # scale=1 if True
)

The shift_and_log_scale_fn function wraps an implementation of a feedforward neural network. Its input and output can have different sizes, so the function signature takes two arguments: the network’s input and the size of the output. In the example below the input has size d = 2 and the output has size 1, so in total D = 3.

shift_and_log_scale_fn(
  tf.random.normal([2]),  # input, d = 2
  1                       # size of the output
)
# (<tf.Tensor: shape=(1,), dtype=float32, numpy=array([aaa], dtype=float32)>,
# <tf.Tensor: shape=(1,), dtype=float32, numpy=array([bbb], dtype=float32)>)

The output of the network is a length 2 tuple, containing the mean and log scale parameters for the second partition x[d: D] above. In this example, that second partition has length 1. The real_nvp_default_template is set up so that the network is created the first time the returned function shift_and_log_scale_fn is called. Subsequent calls need to be passed arguments of the same sizes, otherwise an error is raised.

The shift_only=True version of Real NVP is precisely the Nonlinear Independent Components Estimation (NICE) model that Real NVP builds on. In this case, the Jacobian is triangular with ones on its diagonal, so its determinant is 1, which means that the bijective transformation is volume preserving.
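The volume-preserving property can be verified numerically. The following plain-Python sketch assumes a toy shift-only coupling layer with D = 2 and d = 1, where the shift function is arbitrarily taken to be sin. It estimates the Jacobian with central finite differences and shows that its determinant is 1 even though one off-diagonal entry is not zero.

```python
import math

def shift_only_coupling(z):
    """Toy NICE-style layer with D = 2, d = 1: x0 = z0, x1 = z1 + shift(z0)."""
    z0, z1 = z
    return [z0, z1 + math.sin(z0)]

def numeric_jacobian(f, z, h=1e-6):
    """Central finite-difference Jacobian, J[i][j] = d f_i / d z_j."""
    n = len(z)
    columns = []
    for j in range(n):
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += h
        z_minus[j] -= h
        f_plus, f_minus = f(z_plus), f(z_minus)
        columns.append([(a - b) / (2.0 * h) for a, b in zip(f_plus, f_minus)])
    return [[columns[j][i] for j in range(n)] for i in range(n)]

J = numeric_jacobian(shift_only_coupling, [0.3, -1.1])
det = J[0][0] * J[1][1] - J[0][1] * J[1][0]

print(J)    # approximately [[1, 0], [cos(0.3), 1]]
print(det)  # approximately 1
```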

Once we have the function to get the mean and scale parameters, we can define the Real NVP bijector by using the RealNVP class in the bijectors module.

realnvp_bijector = tfb.RealNVP(
  num_masked=2,    # the d value above, defines the first partition
  shift_and_log_scale_fn=shift_and_log_scale_fn
)

The code below gives an idea of the computation that is being done in the forward and inverse transformation.

def forward(z):
  z_a, z_b = z[..., :d], z[..., d:]   # split into the two partitions
  shift, log_scale = shift_and_log_scale_fn(z_a, D - d)
  x_b = z_b * tf.math.exp(log_scale) + shift
  return tf.concat([z_a, x_b], axis=-1)   # first partition passes through

def inverse(x):
  x_a, x_b = x[..., :d], x[..., d:]   # x_a equals z_a, so it can be reused
  shift, log_scale = shift_and_log_scale_fn(x_a, D - d)
  z_b = (x_b - shift) * tf.math.exp(-log_scale)
  return tf.concat([x_a, z_b], axis=-1)

The inverse transformation is just as straightforward as the forward transformation, and the two are roughly equal in computational cost. The trade-off, though, is that the Real NVP model is not as expressive as the masked autoregressive flow.

So now we have the Real NVP bijector, we can use it to create our TransformedDistribution object as before.

mvn = tfd.MultivariateNormalDiag(loc=[0., 0., 0.])
realnvp = tfd.TransformedDistribution(
  distribution=mvn,
  bijector=realnvp_bijector
)

Notice that this time the event_shape of the base distribution already matches what the bijector requires, so we don’t need to set the event_shape argument in the TransformedDistribution.

Slightly Extended Version

The Real NVP bijector leaves part of the input vector unchanged (x[0: d] = z[0: d]). In practice, we’re going to want to combine multiple Real NVP layers together to produce a bijector that can transform all of the vector components.

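# change the order of the features in the vector
permute = tfp.bijectors.Permute(permutation=[1, 2, 0])

# NOTE: the shift_and_log_scale_fn arguments below are assumed for
# illustration; each layer gets its own network
realnvp1 = tfb.RealNVP(
  fraction_masked=0.5, # masking half of the input vector
  shift_and_log_scale_fn=tfb.real_nvp_default_template(hidden_layers=[32, 32])
)
realnvp2 = tfb.RealNVP(
  fraction_masked=0.5,
  shift_and_log_scale_fn=tfb.real_nvp_default_template(hidden_layers=[32, 32])
)
realnvp3 = tfb.RealNVP(
  fraction_masked=0.5,
  shift_and_log_scale_fn=tfb.real_nvp_default_template(hidden_layers=[32, 32])
)

# mix the vector components, so all components can be transformed
chained_bijector = tfb.Chain(
  [realnvp3, permute, realnvp2, permute, realnvp1]
)

mvn = tfd.MultivariateNormalDiag(loc=[0., 0., 0.])
realnvp = tfd.TransformedDistribution(
  distribution=mvn,
  bijector=chained_bijector
)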

In this example, a TransformedDistribution was created to use a bijector consisting of three Real NVP layers.
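To see why the permutations matter, here is a plain-Python sketch of the same layout: three coupling layers with a permutation in between, D = 3, and one masked component. The coupling functions are toy stand-ins for the neural networks (their tanh and linear forms are arbitrary assumptions), and the helper names are hypothetical. A single layer leaves its first component untouched, while the stacked flow transforms every component and remains exactly invertible.

```python
import math

PERM = [1, 2, 0]  # same permutation as above: y[i] = v[PERM[i]]

def coupling_forward(z, d=1):
    """Toy Real NVP layer: keep z[:d], affinely transform z[d:] given z[:d]."""
    s = sum(z[:d])
    log_scale, shift = 0.3 * math.tanh(s), 0.5 * s
    return z[:d] + [v * math.exp(log_scale) + shift for v in z[d:]]

def coupling_inverse(x, d=1):
    s = sum(x[:d])  # x[:d] equals z[:d], so the same parameters are recovered
    log_scale, shift = 0.3 * math.tanh(s), 0.5 * s
    return x[:d] + [(v - shift) * math.exp(-log_scale) for v in x[d:]]

def permute(v):
    return [v[i] for i in PERM]

def unpermute(v):
    out = [0.0] * len(v)
    for new_pos, old_pos in enumerate(PERM):
        out[old_pos] = v[new_pos]
    return out

def flow_forward(z):
    # layer 1, permute, layer 2, permute, layer 3
    x = coupling_forward(z)
    x = coupling_forward(permute(x))
    return coupling_forward(permute(x))

def flow_inverse(x):
    # undo the steps of flow_forward in reverse order
    z = coupling_inverse(x)
    z = coupling_inverse(unpermute(z))
    return coupling_inverse(unpermute(z))

z = [0.4, -1.2, 0.7]
single_layer = coupling_forward(z)
x = flow_forward(z)
z_recovered = flow_inverse(x)

print(single_layer)  # first component is still 0.4
print(x)
print(z_recovered)
```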

For more on TensorFlow: Normalizing Flow Models, please refer to the wonderful course here

I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at
