# TensorFlow: Normalizing Flow Models

Generative models are a kind of statistical model that aims to learn the underlying data distribution itself. If a generative model is able to capture the underlying distribution of the data well, then it’s able to produce new instances that could plausibly have come from the same dataset. You could also use it for anomaly detection, since it tells you whether a given instance is likely under the learned distribution.

The data distribution can be very complex and difficult to model. One approach to this problem is to take an initial, simple density and transform it, possibly using a series of parameterized transformations, to produce a rich and complex distribution. If these transformations are smooth and invertible, then we are able to evaluate the density of the complex transformed distribution. This property is important because it then allows us to train such a model using maximum likelihood. This is the idea behind normalizing flows.

## Change of Variables Formula

Normalizing flows are a class of models that exploit the change of variables formula to estimate an unknown target data density. Suppose we have a dataset with $n$ samples $D := \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$, with each $x^{(i)} \in \mathbb{R}^d$, and assume that these samples are generated i.i.d. from the underlying distribution $p_X$.

A normalizing flow models the distribution $p_X$ using a random variable $Z$ (also of dimension $d$) with a simple distribution $p_Z$, such that the random variable $X$ can be written as a change of variables, i.e. $X = f_\theta(Z)$, where $\theta$ is a parameter vector that parameterizes the smooth invertible function $f_\theta$.

The function $f_\theta$ is modeled using a neural network with parameters $\theta$, which we want to learn from the data. An important point is that this neural network must be designed to be invertible, which is not the case in general with deep learning models. In practice, we often construct the neural network by composing multiple simpler blocks together. In TensorFlow Probability, these simpler blocks are the bijectors.

In order to learn the optimal parameters $\theta$, we apply the principle of maximum likelihood and search for $\theta_{ML}$ such that $\theta_{ML} = \arg\max_\theta p(D; \theta) = \arg\max_\theta \log p(D; \theta)$. To calculate $p(D; \theta)$, we use the change of variables formula.

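$$
\begin{aligned}
p_X(x) &= p_Z(z) \cdot \left|\det J_{f_\theta}(z)\right|^{-1} \\
\implies p_X(x) &= p_Z(z) \cdot \left|\det J_{f_\theta^{-1}}(x)\right| \\
\implies p(D; \theta) &= \prod_{x \in D} p_Z\!\left(f_\theta^{-1}(x)\right) \cdot \left|\det J_{f_\theta^{-1}}(x)\right| \\
\implies \log p(D; \theta) &= \sum_{x \in D} \left[ \log p_Z\!\left(f_\theta^{-1}(x)\right) + \log \left|\det J_{f_\theta^{-1}}(x)\right| \right]
\end{aligned}
$$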

The term $p_Z(f_\theta^{-1}(x))$ can be computed for a given data point $x \in D$, since the neural network $f_\theta$ is designed to be invertible and the distribution $p_Z$ is known. The term $\det J_{f_\theta^{-1}}(x)$ is also computable, but the flow has to be designed so that the determinant of the Jacobian $J$ can be computed efficiently.
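
As a hedged, framework-free illustration of the log-likelihood above, the sketch below uses a one-dimensional affine flow x = a·z + b with a standard normal base. The names `flow_log_likelihood`, `a`, `b` and the small dataset are assumptions made up for this example; they are not part of TensorFlow Probability. For this particular flow, the change of variables result has to coincide with the log-likelihood of a Gaussian with mean b and standard deviation a, which gives a simple reference to check against.

```python
import math

def std_normal_log_prob(z):
    """Log-density of the base distribution p_Z = N(0, 1)."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def flow_log_likelihood(data, a, b):
    """log p(D; theta) for the affine flow x = a * z + b, with theta = (a, b)."""
    total = 0.0
    for x in data:
        z = (x - b) / a                      # f_theta^{-1}(x)
        inverse_log_det = -math.log(abs(a))  # log |det J_{f^-1}(x)| = log |1 / a|
        total += std_normal_log_prob(z) + inverse_log_det
    return total

def gaussian_log_likelihood(data, mean, std):
    """Closed-form log-likelihood under N(mean, std^2), used as a reference."""
    return sum(
        -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)
        for x in data
    )

# made-up data, only for illustration
data = [1.2, 0.7, 2.9, 1.8, 2.4]
mean = sum(data) / len(data)
std = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))

# theta = (std, mean) is the Gaussian maximum likelihood fit
ll_fit = flow_log_likelihood(data, a=std, b=mean)
# an arbitrary other parameter choice gives a lower log-likelihood
ll_other = flow_log_likelihood(data, a=1.0, b=0.0)
```

In a real flow the parameters would be found by gradient ascent on this log-likelihood rather than in closed form; the closed form is only available here because the flow is so simple.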

## The `Bijector` Objects

The `Bijector` objects from the TensorFlow Probability library form the basis for normalizing flow models. Bijectors are used to transform tensor objects, and they have methods to apply the forward as well as the inverse transformation.

``````import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

# one-dimensional tensor of length 3
z = tf.constant([1., 2., 3.])

# create the bijector object
scale = tfb.Scale(2.)

# apply the forward transformation
x = scale.forward(z)
# tf.Tensor([2. 4. 6.], shape=(3,), dtype=float32)

# apply the inverse transformation
scale.inverse(tf.constant([5., 3., 1.]))
# tf.Tensor([2.5 1.5 0.5], shape=(3,), dtype=float32)``````

Now, a simple bijective transformation could be built by adding a `shift` as well as a `scale` operation. Any chain of smooth and invertible transformations will again be smooth and invertible.

``````scale = tfb.Scale(2.) # scale bijector
shift = tfb.Shift(1.) # shift bijector

# bijective transformation
# equivalent to scale_and_shift = shift(scale)
scale_and_shift = tfb.Chain([shift, scale]) # NOTE the reverse order

# apply the forward transformation
# equivalent to scale_and_shift(z)
scale_and_shift.forward(z)
# tf.Tensor([3. 5. 7.], shape=(3,), dtype=float32)

# apply the inverse transformation
scale_and_shift.inverse(tf.constant([2., 5., 8.]))
# tf.Tensor([0.5 2. 3.5], shape=(3,), dtype=float32)``````
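
To make the ordering inside a chain explicit, here is a minimal plain-Python sketch of the same composition. The classes `Scale`, `Shift` and `Chain` below are toy stand-ins written for this example (they only borrow the names of the TensorFlow Probability bijectors), and they operate on Python lists instead of tensors.

```python
class Scale:
    """Toy bijector: multiplies every element by a fixed factor."""
    def __init__(self, factor):
        self.factor = factor

    def forward(self, z):
        return [self.factor * v for v in z]

    def inverse(self, x):
        return [v / self.factor for v in x]

class Shift:
    """Toy bijector: adds a fixed offset to every element."""
    def __init__(self, offset):
        self.offset = offset

    def forward(self, z):
        return [v + self.offset for v in z]

    def inverse(self, x):
        return [v - self.offset for v in x]

class Chain:
    """Composes bijectors; the LAST bijector in the list is applied first."""
    def __init__(self, bijectors):
        self.bijectors = bijectors

    def forward(self, z):
        for bijector in reversed(self.bijectors):
            z = bijector.forward(z)
        return z

    def inverse(self, x):
        # the inverse undoes the steps in the opposite order
        for bijector in self.bijectors:
            x = bijector.inverse(x)
        return x

scale_and_shift = Chain([Shift(1.0), Scale(2.0)])
forward_out = scale_and_shift.forward([1.0, 2.0, 3.0])   # [3.0, 5.0, 7.0]
inverse_out = scale_and_shift.inverse([2.0, 5.0, 8.0])   # [0.5, 2.0, 3.5]
```

The list order mirrors function composition: `Chain([shift, scale])` reads as shift(scale(z)), so the scaling happens first.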

Bijectors can also be used to transform random variables, and to compute log probabilities of events under the transformed distribution.

``````normal = tfd.Normal(loc=0., scale=1.)

z = normal.sample(3)
# tf.Tensor([-0.32 1.40 0.42], shape=(3,), dtype=float32)

x = scale_and_shift.forward(z)
# tf.Tensor([0.35 3.81 1.84], shape=(3,), dtype=float32)``````

We’d like to be able to evaluate the density of this transformed distribution at x, and that’s what the change of variables formula tells us how to do.

``````
# log p_X(x) = log p_Z(z) - log |det J_f(z)|
log_prob_x = (normal.log_prob(z)
              - scale_and_shift.forward_log_det_jacobian(z, event_ndims=0))
``````

The `event_ndims` argument specifies the number of event-space dimensions present in the input tensor `z`, which has shape semantics given by sample_shape, batch_shape, and event_shape. The log Jacobian determinant is reduced (summed) over the event dimensions. In the example above, the sample_shape is one-dimensional, but batch_shape and event_shape are both empty.
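
The effect of this reduction can be sketched in plain Python. The function below is a hypothetical helper written for this note (it is not the TensorFlow Probability implementation) and only handles a flat list with `event_ndims` equal to 0 or 1, for an elementwise scaling by `factor`.

```python
import math

def scale_forward_log_det_jacobian(z, factor, event_ndims):
    """Toy log-det-Jacobian of an elementwise scaling with an event_ndims reduction.

    z is a flat list (rank 1); only event_ndims of 0 or 1 is handled here.
    """
    # every element is scaled independently, so each contributes log |factor|
    per_element = [math.log(abs(factor)) for _ in z]
    if event_ndims == 0:
        return per_element         # each element is its own scalar event
    if event_ndims == 1:
        return sum(per_element)    # the whole vector is one event: sum over it
    raise ValueError("this sketch only supports event_ndims of 0 or 1")

z = [1.0, 2.0, 3.0]
per_scalar_event = scale_forward_log_det_jacobian(z, 2.0, event_ndims=0)
per_vector_event = scale_forward_log_det_jacobian(z, 2.0, event_ndims=1)
```

With `event_ndims=0` there is one value per sample, matching the three scalar samples in the example above; with `event_ndims=1` the same three numbers would be treated as a single three-dimensional event and collapse to one value.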

However, we can also invert the change of variables formula to express it in terms of the inverse of the bijective transformation. We can write that the log probability of x is equal to the log probability of z plus the log of the Jacobian determinant of the inverse transformation evaluated at x. Of course, we can just replace z with the result of the bijector’s inverse method applied to x.

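``````
# first form: uses both z and x
log_prob_x = (normal.log_prob(z)
              + scale_and_shift.inverse_log_det_jacobian(x, event_ndims=0))

# second form: uses only x, since z = scale_and_shift.inverse(x)
log_prob_x = (normal.log_prob(scale_and_shift.inverse(x))
              + scale_and_shift.inverse_log_det_jacobian(x, event_ndims=0))
``````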

In the second form here, we’re computing the same log probability of x, but only using the tensor x. In practice, we’re mostly going to be using this second form of the change of variables formula.

## `TransformedDistribution` Objects

The `TransformedDistribution` objects are types of distributions defined by another base distribution and a bijector object. The `TransformedDistribution` object gives us a consistent API that means we can use the same methods and properties of other distribution objects.

A normalizing flow is a generative model of the data. The model assumes:

1. a latent variable z, which is distributed according to some base distribution. Typically we will assume it to be something simple, like a diagonal Gaussian.
2. a data-generating process that first samples z from this base distribution and then transforms it according to a function f to produce the data sample x.

``````
# Base dist.             Transformation            Data dist.
# z ~ P0       →        x = f(z)          →      x ~ P1
``````

For a normalizing flow, the function f will be bijective or invertible. It’ll also be parameterized and we learn the best parameters with maximum likelihood. In the training process, we’ll have sample data points x and we’ll want to compute the log probability of x under the model.

We’re using the tensor x to compute the log probability of x under the model and this uses the `inverse_log_det_jacobian` method of the `bijector`. This log probability is what we will aim to maximize within a training loop. The bijector objects will contain the parameters that we’re trying to optimize.

``````
log_prob_x = (base_dist.log_prob(bijector.inverse(x))
              + bijector.inverse_log_det_jacobian(x, event_ndims=0))
``````

Once the model is trained, we can then sample from the model by:

1. first sampling from the base distribution, and then
2. passing that sample through the bijective transformation using the `forward` method of the `bijector`.
``x_sample = bijector.forward(base_dist.sample())``

By convention, we think of the `forward` transformation as being used for sampling, while the `inverse` transformation together with the `inverse_log_det_jacobian` is used for computing log probabilities.

The `TransformedDistribution` object is useful to directly define the data distribution P1 with the distribution object. The `TransformedDistribution` comes from the distributions module, and the constructor has two required arguments, which are the base distribution and the bijector.

``````
normal = tfd.Normal(loc=0, scale=1)
z = normal.sample(3)

exp = tfb.Exp()
x = exp.forward(z)

log_normal = tfd.TransformedDistribution(normal, exp)
# shortcut, call the bijector on the base distribution
log_normal = exp(normal)
``````

Once we have the `TransformedDistribution`, we can use the `sample` and `log_prob` methods as usual.
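
What `sample` and `log_prob` do behind the scenes can be sketched without TensorFlow. The classes `StandardNormal`, `Exp` and `Transformed` below are simplified scalar stand-ins invented for this illustration, not the library classes. The result is compared against the closed-form log-density of a standard log-normal distribution.

```python
import math
import random

class StandardNormal:
    """Toy base distribution N(0, 1) on scalars."""
    def sample(self, rng):
        return rng.gauss(0.0, 1.0)

    def log_prob(self, z):
        return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

class Exp:
    """Toy bijector x = exp(z)."""
    def forward(self, z):
        return math.exp(z)

    def inverse(self, x):
        return math.log(x)

    def inverse_log_det_jacobian(self, x):
        return -math.log(x)   # log |d/dx log(x)| = log(1 / x)

class Transformed:
    """Toy transformed distribution built from a base distribution and a bijector."""
    def __init__(self, base, bijector):
        self.base = base
        self.bijector = bijector

    def sample(self, rng):
        # sampling uses the forward transformation
        return self.bijector.forward(self.base.sample(rng))

    def log_prob(self, x):
        # log probabilities use the inverse and its log-det-Jacobian
        z = self.bijector.inverse(x)
        return self.base.log_prob(z) + self.bijector.inverse_log_det_jacobian(x)

def log_normal_log_prob(x):
    """Closed-form log-density of a standard log-normal, for comparison."""
    return -math.log(x) - 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(x) ** 2

log_normal = Transformed(StandardNormal(), Exp())
x_sample = log_normal.sample(random.Random(0))
log_prob_at_2_5 = log_normal.log_prob(2.5)
```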

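The `TransformedDistribution` constructor also has optional keyword arguments for `event_shape` and `batch_shape`. These keyword arguments can be used to override the batch shape or event shape of the base distribution in the case where the base distribution’s shape is empty. Note that newer TensorFlow Probability releases deprecated these two arguments; there, the usual alternative is to wrap the base distribution in `tfd.Sample` to give it the required event shape.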

## Subclassing Bijectors

In case you need to make a very particular or customized transformation, you can make your own bijectors to have that extra flexibility. Here is an example of subclassing the base `Bijector` class from the bijectors module.

``````
class MySigmoid(tfb.Bijector):
    def __init__(self, validate_args=False, name='sigmoid'):
        # passing args into the base class initializer
        super().__init__(validate_args=validate_args,
                         forward_min_event_ndims=0, name=name)

    def _forward(self, x):
        return tf.math.sigmoid(x)

    # inverse func is logit func
    def _inverse(self, y):
        return tf.math.log(y) - tf.math.log(1 - y)

    def _inverse_log_det_jacobian(self, y):
        return -tf.math.log(y) - tf.math.log(1 - y)
        # alternatively
        # return -self._forward_log_det_jacobian(self._inverse(y))

    def _forward_log_det_jacobian(self, x):
        return -tf.math.softplus(-x) - tf.math.softplus(x)
        # alternatively
        # return -self._inverse_log_det_jacobian(self._forward(x))
``````

The initializer `__init__` of this class takes the standard keyword arguments `validate_args` and `name`. The `validate_args` keyword argument can be used to check that any inputs passed to the class methods are valid.

When you’re creating a bijector, it’s necessary to set the minimum number of event dimensions the bijector needs to act upon. In our case of the sigmoid bijector, this is a function that can operate on scalars, so the `forward_min_event_ndims` argument is set to zero.
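
When writing the log-det-Jacobian methods by hand, it is easy to get a sign wrong, so a quick numerical check can help. The following is a minimal sketch in plain Python (scalar inputs, helper names chosen for this example) that compares the sigmoid formulas used above against a finite-difference derivative.

```python
import math

def softplus(x):
    return math.log1p(math.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_log_det_jacobian(x):
    # log sigmoid'(x) = log sigmoid(x) + log (1 - sigmoid(x))
    return -softplus(-x) - softplus(x)

def inverse_log_det_jacobian(y):
    # derivative of the logit function is 1 / (y * (1 - y))
    return -math.log(y) - math.log(1.0 - y)

def numerical_log_derivative(f, x, eps=1e-6):
    """Log of the absolute central finite-difference derivative of f at x."""
    return math.log(abs((f(x + eps) - f(x - eps)) / (2.0 * eps)))

x = 0.8
y = sigmoid(x)
fldj = forward_log_det_jacobian(x)
fldj_numeric = numerical_log_derivative(sigmoid, x)
ildj = inverse_log_det_jacobian(y)
```

The forward and inverse log-det-Jacobians should be negatives of each other when evaluated at matching points x and y = sigmoid(x), which is exactly what the "alternatively" comments in the class above rely on.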

## Autoregressive Flows

The invertibility of the transformations is important because it allows us to compute the likelihood of the data under the model, which in turn means that we can train the model using the principle of maximum likelihood. In practice though, we often need computation of the log Jacobian determinants to be efficient. For large models with many layers and high-dimensional transformations, this computation can be expensive and too slow for practical applications.

In an autoregressive model, assuming that our data is D-dimensional, we express the joint distribution of all data features as a product of conditional distributions, where each conditional probability of the feature `xi` depends only on the features `x0` to `xi-1`.

``````
# z ~ N(0, I)
# x[i] = z[i] * scale(x[0: i]) + loc(x[0: i]), i = 0, ..., D-1
# (Python-style slices: x[0: i] holds the features x[0], ..., x[i-1])
``````

In the example shown here, we are modeling the conditional distribution for each feature `xi` as a Gaussian distribution, where the mean and standard deviation are functions of the features `x0` to `xi-1`. This autoregressive model is called a Masked Autoregressive Flow.

One important property of this flow is that the Jacobian determinant is easy to compute because the Jacobian matrix is lower triangular. The log determinant of the Jacobian of the forward transformation (from z to x) is just the sum of the log standard deviations for each feature, and for the inverse transformation it is the negative of that sum.
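
The triangular structure can be checked numerically on a small example. The sketch below is plain Python with hand-written conditioner functions (`loc_and_log_scale` is a hypothetical stand-in for a neural network). It builds a three-dimensional autoregressive affine map from z to x, estimates its Jacobian with finite differences, and compares the log determinant read off the diagonal with the sum of the log scales.

```python
import math

D = 3

def loc_and_log_scale(x, i):
    """Hypothetical conditioners for feature i; they only look at x[0], ..., x[i-1]."""
    prev = x[:i]
    loc = 0.5 * sum(prev)
    log_scale = 0.1 * sum(p * p for p in prev) - 0.2
    return loc, log_scale

def forward(z):
    """x[i] = z[i] * scale(x[:i]) + loc(x[:i]), built one feature at a time."""
    x = []
    for i in range(D):
        loc, log_scale = loc_and_log_scale(x, i)
        x.append(z[i] * math.exp(log_scale) + loc)
    return x

def jacobian(f, z, eps=1e-6):
    """Central finite-difference Jacobian, J[i][j] = d f_i / d z_j."""
    J = [[0.0] * D for _ in range(D)]
    for j in range(D):
        upper = list(z)
        upper[j] += eps
        lower = list(z)
        lower[j] -= eps
        f_upper, f_lower = f(upper), f(lower)
        for i in range(D):
            J[i][j] = (f_upper[i] - f_lower[i]) / (2.0 * eps)
    return J

z = [0.3, -1.1, 0.7]
x = forward(z)
J = jacobian(forward, z)

# for a triangular matrix the determinant is the product of the diagonal
forward_log_det = sum(math.log(abs(J[i][i])) for i in range(D))
sum_log_scales = sum(loc_and_log_scale(x, i)[1] for i in range(D))
```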

There are many possible choices for the functions `loc` and `scale` shown above. One particular choice of model is implemented as the `AutoregressiveNetwork` class, an implementation of the Masked Autoencoder for Distribution Estimation (MADE) architecture.

The `AutoregressiveNetwork` class is a feed-forward neural network. It isn’t actually a bijector, but we’re going to use it to define an autoregressive flow bijector. For example:

``````
made = tfb.AutoregressiveNetwork(
    params=2,
    event_shape=[3],        # input size of the network
    hidden_units=[16, 16],  # 2 hidden layers each of size 16
    activation='sigmoid'
)

# for example, calling the network on a batch of 2 inputs
made(tf.random.normal([2, 3]))
# tf.Tensor(
# [[[ aaa, bbb ]
#   [ ccc, ddd ]
#   [ eee, fff ]]

#  [[ ggg, hhh ]
#   [ iii, jjj ]
#   [ kkk, lll ]]], shape=(2, 3, 2), dtype=float32)
``````

The network returns an output that has the same shape as the `event_shape` input plus an extra dimension, which has size given by the `params` argument. In this example, for each feature of each element in the batch, the network outputs two parameters: we can use the first for the mean and the second for the log of the standard deviation.

In other words, we have a mean parameter and a log standard deviation parameter for each data feature. Each parameter for feature `i` is computed using only the features up to `i - 1`. The network does this by zeroing out a number of its weights.
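
The weight masking can be illustrated with a small sketch in plain Python. It follows the general degree-based mask construction described for MADE, for a network with a single hidden layer. The way degrees are assigned here is an assumption made for illustration, and the actual `AutoregressiveNetwork` implementation may assign them differently.

```python
D = 3        # number of features
HIDDEN = 4   # hidden units in the single hidden layer of this sketch

# feature i gets degree i + 1; hidden degrees cycle over 1, ..., D - 1
input_degree = [i + 1 for i in range(D)]
hidden_degree = [1 + (k % (D - 1)) for k in range(HIDDEN)]

# input -> hidden: hidden unit k may see input j if degree(k) >= degree(j)
mask_in = [[1 if hidden_degree[k] >= input_degree[j] else 0 for j in range(D)]
           for k in range(HIDDEN)]

# hidden -> output: output i may see hidden unit k if degree(i) > degree(k)
mask_out = [[1 if input_degree[i] > hidden_degree[k] else 0 for k in range(HIDDEN)]
            for i in range(D)]

# connectivity[i][j] counts the paths from input j to output i;
# a zero means output i cannot depend on input j
connectivity = [[sum(mask_out[i][k] * mask_in[k][j] for k in range(HIDDEN))
                 for j in range(D)]
                for i in range(D)]
```

For these sizes the connectivity matrix is strictly lower triangular, so the parameters for feature `i` can only depend on features with a smaller index, and the parameters for the first feature depend on no input at all.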

Now that we have our `AutoregressiveNetwork`, we can use it to define the Masked Autoregressive Flow.

``maf_bijector = tfb.MaskedAutoregressiveFlow(shift_and_log_scale_fn=made)``

The code below gives an idea of the computation that is being done in the `forward` and `inverse` transformation.

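``````
# D is the number of features; shift_and_log_scale_fn is the autoregressive network
def forward(z):
    x = tf.zeros_like(z)
    # D sequential passes: pass i fixes the value of feature i
    for _ in range(D):
        shift, log_scale = shift_and_log_scale_fn(x)
        x = z * tf.math.exp(log_scale) + shift
    return x

def inverse(x):
    # all of x is known, so a single pass is enough
    shift, log_scale = shift_and_log_scale_fn(x)
    return (x - shift) / tf.math.exp(log_scale)
``````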

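Remember that for data with a large number of features and a large autoregressive network, the forward and inverse transformations have very different costs: the forward transformation (used for sampling) needs D sequential passes through the network, while the inverse transformation (used for computing log probabilities) needs only one.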
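
The difference in cost can be made concrete by counting how often the network is evaluated. The sketch below is plain Python with a toy conditioner; `shift_and_log_scale_fn` here is a hand-written stand-in with the autoregressive property, not a MADE network, and `calls` is just a counter introduced for this example.

```python
import math

D = 4
calls = {"count": 0}

def shift_and_log_scale_fn(x):
    """Toy autoregressive conditioner: output i depends only on x[0], ..., x[i-1]."""
    calls["count"] += 1
    shift = [0.3 * sum(x[:i]) for i in range(D)]
    log_scale = [0.1 * sum(x[:i]) for i in range(D)]
    return shift, log_scale

def forward(z):
    x = [0.0] * D
    for _ in range(D):   # D sequential passes through the conditioner
        shift, log_scale = shift_and_log_scale_fn(x)
        x = [z[i] * math.exp(log_scale[i]) + shift[i] for i in range(D)]
    return x

def inverse(x):
    shift, log_scale = shift_and_log_scale_fn(x)   # a single pass
    return [(x[i] - shift[i]) / math.exp(log_scale[i]) for i in range(D)]

z = [0.5, -0.2, 1.0, 0.1]

x = forward(z)
forward_calls = calls["count"]

calls["count"] = 0
z_recovered = inverse(x)
inverse_calls = calls["count"]
```

After pass k of the forward loop, the first k features have reached their final values, which is why D passes are needed before the whole vector is correct.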

You could also use an entirely different model for the `shift_and_log_scale_fn` function, for example, using standard Keras layers. However, you need to be sure that any model you are using for the masked autoregressive flow object has the autoregressive property. This property isn’t checked by the masked autoregressive flow class.

We can now define our transformed distribution as before using the `TransformedDistribution` object, passing in the base distribution (a standard normal) and the masked autoregressive flow bijector.

``````normal = tfd.Normal(loc=0, scale=1)
maf = tfd.TransformedDistribution(normal, maf_bijector, event_shape=[3])``````

### Inverse Autoregressive Flows

We can use the masked autoregressive flow above to easily define an Inverse Autoregressive Flow, by applying the `Invert` bijector to a masked autoregressive flow object.

``````
# forward pass of Invert(MAF) = inverse pass of the MAF, so with z ~ N(0, I):
# x[i] = (z[i] - loc(z[0: i])) / scale(z[0: i]), i = 0, ..., D-1

made = tfb.AutoregressiveNetwork(
    params=2,
    event_shape=[3],        # input size of the network
    hidden_units=[16, 16],  # 2 hidden layers each of size 16
    activation='sigmoid'
)

iaf_bijector = tfb.Invert(
    tfb.MaskedAutoregressiveFlow(shift_and_log_scale_fn=made)
)

normal = tfd.Normal(loc=0, scale=1)
iaf = tfd.TransformedDistribution(normal, iaf_bijector, event_shape=[3])
``````

Sampling from this model is fast, but computing log probabilities is slow.

## Real NVP

Real NVP stands for Real-valued Non-Volume Preserving. The Real NVP architecture is a special case of the autoregressive flow. We will partition the vector x in two:

1. The first d elements of x are set to be equal to the first d elements of the random variable z.
2. These first d elements of z are also used as input to functions that produce mean and standard deviation parameters for the Gaussian distributions that model the remaining features in the vector x.
``````# z ~ N(0, I)
# x[0: d] = z[0: d]
# x[d: D] = z[d: D] * scale(z[0: d]) + loc(z[0: d])``````

The `scale` and `loc` functions can be any functions. There’s no autoregressive property that they need to fulfill here, but the Real NVP flow as a whole is still autoregressive. Inside the `bijectors` module, there’s the Real NVP default template that can be used to build the `shift_and_log_scale_fn` for the Real NVP bijector.

``````shift_and_log_scale_fn = tfb.real_nvp_default_template(
hidden_layers=[32, 32],  # number and size of hidden layers
activation=tf.nn.relu,   # activation function
shift_only=False         # scale=1 if True
)``````

The `shift_and_log_scale_fn` function wraps an implementation of a feedforward neural network. The input and output of this network can have different sizes, so the function signature takes two arguments: the network’s input and the size of the output. In the example below the input has size d = 2 and the output has size 1, so in total D = 3.

``````shift_and_log_scale_fn(
tf.random.normal([2]),  # input, d = 2
1                       # size of the output
)
# (<tf.Tensor: shape=(1,), dtype=float32, numpy=array([aaa], dtype=float32)>,
# <tf.Tensor: shape=(1,), dtype=float32, numpy=array([bbb], dtype=float32)>)``````

The output of the network is a length 2 tuple, containing the mean and log scale parameters for the second partition `x[d: D]` above. In this example, that second partition has length 1. The `real_nvp_default_template` is set up so that the network is created the first time the returned function `shift_and_log_scale_fn` is called. Subsequent calls need to be passed arguments of the same size, otherwise an error is raised.

The `shift_only=True` version of Real NVP is precisely the Nonlinear Independent Components Estimation (NICE) model that Real NVP builds on. In this case, the Jacobian is triangular with ones on its diagonal, so its determinant is 1, which means that the bijective transformation is volume preserving.

Once we have the function to get the mean and scale parameters, we can define the Real NVP bijector by using the `RealNVP` class in the `bijectors` module.

``````
realnvp_bijector = tfb.RealNVP(
    num_masked=2,    # the d value above, defines the first partition
    shift_and_log_scale_fn=shift_and_log_scale_fn
)
``````

The code below gives an idea of the computation that is being done in the `forward` and `inverse` transformation.

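``````
# d, D and shift_and_log_scale_fn are assumed to be defined as above
def forward(z):
    # the first d components pass through unchanged
    shift, log_scale = shift_and_log_scale_fn(z[..., :d], D - d)
    x_rest = z[..., d:] * tf.math.exp(log_scale) + shift
    return tf.concat([z[..., :d], x_rest], axis=-1)

def inverse(x):
    # the first d components are again copied as they are
    shift, log_scale = shift_and_log_scale_fn(x[..., :d], D - d)
    z_rest = (x[..., d:] - shift) * tf.math.exp(-log_scale)
    return tf.concat([x[..., :d], z_rest], axis=-1)
``````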

The `inverse` transformation is just as straightforward as the `forward` transformation: both need a single pass through the network, so their computational cost is roughly equal. The trade-off though is that the Real NVP model is not as expressive as the masked autoregressive flow.
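
A small plain-Python sketch of such a coupling layer shows the round trip and the log-det-Jacobian, including the shift-only (NICE-style) case. The conditioner `shift_and_log_scale_fn` below is a hand-written hypothetical function, chosen only so that the example is deterministic. It stands in for the neural network.

```python
import math

d, D = 2, 3   # the first partition has d components, the vector has D in total

def shift_and_log_scale_fn(first_part, shift_only=False):
    """Hypothetical conditioner; any function of the first d components works."""
    shift = [math.sin(sum(first_part))] * (D - d)
    if shift_only:
        log_scale = [0.0] * (D - d)   # scale = 1, as in the shift-only variant
    else:
        log_scale = [0.5 * math.tanh(first_part[0])] * (D - d)
    return shift, log_scale

def forward(z, shift_only=False):
    shift, log_scale = shift_and_log_scale_fn(z[:d], shift_only)
    rest = [z[d + k] * math.exp(log_scale[k]) + shift[k] for k in range(D - d)]
    return z[:d] + rest   # the first d components are copied unchanged

def inverse(x, shift_only=False):
    shift, log_scale = shift_and_log_scale_fn(x[:d], shift_only)
    rest = [(x[d + k] - shift[k]) * math.exp(-log_scale[k]) for k in range(D - d)]
    return x[:d] + rest

def forward_log_det_jacobian(z, shift_only=False):
    # triangular Jacobian: only the scaled components contribute
    _, log_scale = shift_and_log_scale_fn(z[:d], shift_only)
    return sum(log_scale)

z = [0.4, -0.9, 1.3]
x = forward(z)
z_back = inverse(x)
log_det = forward_log_det_jacobian(z)
log_det_shift_only = forward_log_det_jacobian(z, shift_only=True)
```

Both directions call the conditioner exactly once, and in the shift-only case the log-det-Jacobian is zero, which is the volume-preserving property mentioned above.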

So now we have the Real NVP bijector, we can use it to create our TransformedDistribution object as before.

``````mvn = tfd.MultivariateNormalDiag(loc=[0., 0., 0])
realnvp = tfd.TransformedDistribution(
distribution=mvn,
bijector=realnvp_bijector
)``````

Notice that this time, the `event_shape` of the base distribution is as required for the bijector. So I don’t need to set the `event_shape` argument in the `TransformedDistribution`.

### Slightly Extended Version

The Real NVP bijector leaves part of the input vector unchanged `x[0: d] = z[0: d]`. In practice, we’re going to want to combine multiple Real NVP layers together to produce a bijector that can transform all of the vector components.

``````
# change the order of the features in the vector
permute = tfb.Permute(permutation=[1, 2, 0])

# num_masked=2 is assumed here, matching the d = 2, D = 3 example above
realnvp1 = tfb.RealNVP(
    num_masked=2,
    shift_and_log_scale_fn=tfb.real_nvp_default_template(hidden_layers=[32, 32]))
realnvp2 = tfb.RealNVP(
    num_masked=2,
    shift_and_log_scale_fn=tfb.real_nvp_default_template(hidden_layers=[32, 32]))
realnvp3 = tfb.RealNVP(
    num_masked=2,
    shift_and_log_scale_fn=tfb.real_nvp_default_template(hidden_layers=[32, 32]))

# mix the vector components, so all components can be transformed
chained_bijector = tfb.Chain(
    [realnvp3, permute, realnvp2, permute, realnvp1]
)

mvn = tfd.MultivariateNormalDiag(loc=[0., 0., 0.])
realnvp = tfd.TransformedDistribution(
    distribution=mvn,
    bijector=chained_bijector
)
``````

In this example, a `TransformedDistribution` was created to use a bijector consisting of three Real NVP layers.
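
Why the permutations matter can be seen in a small plain-Python sketch. The `coupling` function below is a toy layer with a hand-written conditioner (an assumption for illustration), and `permute` mimics a permutation that gathers the components in the given order. A single layer leaves the first two components untouched, while the chain of three layers with permutations in between transforms all of them.

```python
import math

d, D = 2, 3

def coupling(z):
    """Toy Real NVP style layer: keeps z[:d] and transforms z[d:]."""
    log_scale = 0.3 * math.tanh(z[0] + z[1])
    shift = 0.5 * (z[0] - z[1])
    return z[:d] + [v * math.exp(log_scale) + shift for v in z[d:]]

def permute(z, permutation=(1, 2, 0)):
    """Reorders the components: output[i] = z[permutation[i]]."""
    return [z[p] for p in permutation]

def chained(z):
    # same order of application as Chain([layer3, permute, layer2, permute, layer1]):
    # the last element of the list is applied first
    return coupling(permute(coupling(permute(coupling(z)))))

z = [0.4, -0.9, 1.3]
single = coupling(z)   # only the last component changes
stacked = chained(z)   # every original component has been transformed once
```

With the permutation `[1, 2, 0]` and d = 2, each layer ends up transforming a different original component, so three layers are enough to cover a three-dimensional vector.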

## My Certificate

For more on TensorFlow: Normalizing Flow Models, please refer to the wonderful course here https://www.coursera.org/learn/probabilistic-deep-learning-with-tensorflow2

I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai