We’ll make extensive use of the TensorFlow Probability (TFP) library to develop probabilistic deep learning models. Its distribution objects are the vital building blocks, because they capture the essential operations on probability distributions, such as drawing samples and evaluating probabilities and log probabilities.

Univariate Distributions

Within the tfp library, there are several modules that we’ll use a lot, one of them being the distributions module. The code below creates a standard normal distribution, draws samples from it, and evaluates its density at a point.

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

normal = tfd.Normal(loc=0., scale=1.)   # mean = 0, std deviation = 1

normal.sample()   # sample from the dist, returning a Tensor object
normal.sample(3)  # draw multiple independent samples from the dist

normal.prob(0.5)  # evaluate the prob density function at a point
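
As a sanity check on what prob returns, the standard normal density can also be evaluated by hand. This is only a sketch using Python's math module rather than TFP, and the helper name normal_pdf is made up for illustration.

```python
import math

def normal_pdf(x, loc=0.0, scale=1.0):
    """Density of a normal distribution with mean `loc` and std deviation `scale`."""
    z = (x - loc) / scale
    return math.exp(-0.5 * z * z) / (scale * math.sqrt(2.0 * math.pi))

# Density of the standard normal at 0.5, roughly 0.352.
density = normal_pdf(0.5)
print(density)
```

The value should agree with normal.prob(0.5) up to floating-point precision.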

Below is an example of a discrete univariate distribution object:

bernoulli = tfd.Bernoulli(probs=0.7)    # prob that the random var takes 1
bernoulli = tfd.Bernoulli(logits=0.847) # sigmoid(0.847) ~= 0.7

bernoulli.sample(3)  # draw multiple independent samples from the dist

bernoulli.prob(1)     # evaluate the prob of event 1, which ~= 0.7
bernoulli.log_prob(1) # evaluate using log prob
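
The relationship between the probs and logits parameterizations can be checked with a few lines of plain Python. This sketch does not use TFP, and the helper name sigmoid is an assumption made for illustration.

```python
import math

def sigmoid(t):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-t))

p = sigmoid(0.847)                 # probability of the event 1, close to 0.7
logit = math.log(0.7 / 0.3)        # the inverse map: the logit of 0.7, close to 0.847
log_p = math.log(p)                # the quantity log_prob(1) corresponds to
print(p, logit, log_p)
```

Working with log probabilities is usually preferred numerically, since products of small probabilities become sums.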

The event_shape property of these objects captures the dimensionality of the random variable itself. For a univariate distribution, which describes a single random variable, the event_shape is empty.

Another powerful feature of distribution objects is that a single object can represent a batch of distributions of the same type. By designing distribution objects in this way, the TensorFlow Probability library can exploit the performance gains from vectorizing computations.

# 1 object holding a batch of 2 distributions
batched_bernoulli = tfd.Bernoulli(probs=[0.4, 0.5])

batched_bernoulli.batch_shape    # returns (2,)
batched_bernoulli.sample(3)      # returns a Tensor with shape (3, 2)
batched_bernoulli.prob([1, 1])   # eval the prob of event 1 for both distributions in the batch
batched_bernoulli.log_prob([1, 1])
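
To see why the batched object behaves like two distributions evaluated side by side, here is a rough NumPy imitation of the same computations. It is a sketch of the idea, not of how TFP is implemented, and the variable names are chosen for illustration.

```python
import numpy as np

probs = np.array([0.4, 0.5])       # one parameter per distribution in the batch
rng = np.random.default_rng(0)

# Three draws from each of the two distributions give shape (3, 2).
samples = (rng.random((3, 2)) < probs).astype(int)

# Probability of the event 1 under each distribution, and its log.
events = np.array([1, 1])
prob = np.where(events == 1, probs, 1.0 - probs)
log_prob = np.log(prob)
print(samples.shape, prob, log_prob)
```

Each column of samples comes from one distribution in the batch, and prob has one entry per distribution.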

Multivariate Distributions

Multivariate distributions can be constructed and used in much the same way as univariate distributions. The code below instantiates a 2-dimensional diagonal Gaussian:

mv_normal = tfd.MultivariateNormalDiag(loc=[-1., 0.5], scale_diag=[1., 1.5])

mv_normal.event_shape    # returns (2,)
mv_normal.sample(3)      # returns a Tensor of shape (3, 2)

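Note that a 2-dimensional multivariate distribution (batch_shape is empty and event_shape = 2) is totally different from a batch of 2 univariate distributions (batch_shape = 2 and event_shape is empty). The difference becomes clear when we compute log_prob for a given input: the multivariate distribution returns a single value for the whole 2-dimensional event, while the batched univariate distribution returns one value per distribution in the batch.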

Multivariate distributions can also be batched. The MultivariateNormalDiag distribution below has an event shape of two and a batch shape of three. In other words, it contains a batch of three multivariate Gaussians, each of which is a distribution over a two-dimensional random variable.

batched_mv_normal = tfd.MultivariateNormalDiag(
  loc=[[-1., 0.5], [2., 0.], [-0.5, 1.5]],
  scale_diag=[[1., 1.5], [2., 0.5], [1., 1.]] )

# batch_shape = [3], event_shape = [2]

batched_mv_normal.sample(2)   # returns a Tensor of shape (2, 3, 2)
# (sample_size, batch_size, event_size)
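
The (sample, batch, event) ordering can be reproduced with NumPy by drawing standard normal noise and then shifting and scaling it. This is only an illustrative sketch of the shape rule for a diagonal Gaussian, not TFP's implementation.

```python
import numpy as np

loc = np.array([[-1., 0.5], [2., 0.], [-0.5, 1.5]])        # shape (3, 2): batch 3, event 2
scale_diag = np.array([[1., 1.5], [2., 0.5], [1., 1.]])    # shape (3, 2)

rng = np.random.default_rng(0)
sample_shape = (2,)

# Noise of shape sample_shape + (3, 2), shifted and scaled per dimension.
noise = rng.standard_normal(sample_shape + loc.shape)
samples = loc + scale_diag * noise
print(samples.shape)
```

The leading axis indexes the draws, the middle axis the three Gaussians in the batch, and the last axis the two coordinates of each event.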

The Independent Distribution

Sometimes we might want to reinterpret a batch of independent distributions over an event space as a single joint distribution over a product of event spaces. For example, our model might assume that the features of our data are independent given a class label. In this case, we could set up a separate class conditional distribution for each feature in a batch.

But this batch of distributions is really a joint distribution over all the features, and we’d like that to be reflected in the batch_shape and event_shape properties, and the outputs of the log_prob method.

In the distributions module, there is the Independent distribution class, which is designed especially for this purpose. First, let’s compare a multivariate distribution with a batched univariate one:

# Multivariate
mv_normal = tfd.MultivariateNormalDiag(loc=[-1., 0.5], scale_diag=[1., 1.5])
# batch_shape = [], event_shape = [2]

mv_normal.log_prob([-0.2, 1.8])
# tf.Tensor(-2.9388978, shape=(), ...)

# Univariate (note: tfd.Normal takes scale, not scale_diag)
batched_normal = tfd.Normal(loc=[-1., 0.5], scale=[1., 1.5])
# batch_shape = [2], event_shape = []

batched_normal.log_prob([-0.2, 1.8])
# tf.Tensor([-1.2389386, -1.699959], shape=(2,), ...)

The Independent distribution gives us a way to absorb some or all of the batch dimensions into the event_shape. In the example above, we could use the Independent distribution to transform our batched_normal distribution so that it’s equivalent to the multivariate diagonal normal distribution mv_normal.

independent_normal = tfd.Independent(
  batched_normal,                 # the batched distribution to reinterpret
  reinterpreted_batch_ndims=1 )   # how many batch dims absorbed to event
# batch_shape = [], event_shape = [2]

independent_normal.log_prob([-0.2, 1.8])
# tf.Tensor(-2.9388978, shape=(), ...)

Mathematically, this independent distribution is now equivalent to the multivariate diagonal normal distribution we had before.
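
The equivalence is easy to verify by hand: the joint log probability of independent components is the sum of the per-component log probabilities. The NumPy sketch below recomputes the numbers from the example; the helper normal_log_prob is a made-up name, not a TFP function.

```python
import numpy as np

def normal_log_prob(x, loc, scale):
    """Elementwise log density of a normal distribution."""
    z = (x - loc) / scale
    return -0.5 * z**2 - np.log(scale) - 0.5 * np.log(2.0 * np.pi)

x = np.array([-0.2, 1.8])
loc = np.array([-1., 0.5])
scale = np.array([1., 1.5])

per_dim = normal_log_prob(x, loc, scale)   # one value per distribution, like the batched Normal
joint = per_dim.sum()                      # one value for the whole event, like Independent
print(per_dim, joint)
```

Summing over the reinterpreted dimension is exactly what turns the two batched values into the single multivariate value.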

Higher Rank of batch_shape

Here is one more example, in which the batch_shape has a rank greater than one.

batched_normal = tfd.Normal(
  loc=[[-1., 0.5], [2., 0.], [-0.5, 1.5]],
  scale=[[1., 1.5], [2., 0.5], [1., 1.]] )

# batch_shape = [3, 2], event_shape = []

independent_normal = tfd.Independent(
  batched_normal,
  reinterpreted_batch_ndims=1 )   # how many batch dims absorbed to event

# batch_shape = [3], event_shape = [2]

independent_normal = tfd.Independent(
  batched_normal,
  reinterpreted_batch_ndims=2 )   # how many batch dims absorbed to event

# batch_shape = [], event_shape = [3, 2]
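
The shape bookkeeping follows a simple rule: Independent moves the rightmost batch dimensions to the front of the event shape. The pure-Python helper below is a hypothetical illustration of that rule (reinterpret_shapes is not part of TFP).

```python
def reinterpret_shapes(batch_shape, event_shape, reinterpreted_batch_ndims):
    """Return (new_batch_shape, new_event_shape) after moving the rightmost batch dims."""
    batch_shape, event_shape = tuple(batch_shape), tuple(event_shape)
    n = reinterpreted_batch_ndims
    if not 0 <= n <= len(batch_shape):
        raise ValueError("cannot reinterpret more batch dims than exist")
    split = len(batch_shape) - n
    # The last n batch dims become the leading event dims.
    return batch_shape[:split], batch_shape[split:] + event_shape

one_dim = reinterpret_shapes((3, 2), (), 1)    # batch (3,), event (2,)
two_dims = reinterpret_shapes((3, 2), (), 2)   # batch (), event (3, 2)
print(one_dim, two_dims)
```

With a batch_shape of [3, 2], absorbing both batch dimensions therefore gives an event_shape of [3, 2].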

Sampling and log_prob

Just as the batch_shape and the event_shape can have a rank greater than one, so can the sample_shape. Suppose we already have an independent distribution object with batch_shape = [2, 1] and event_shape = [2, 3], and we now draw samples from it.

ind_exp = tfd.Independent(exp, ...)
ind_exp.sample([4, 2])

The resulting Tensor object will have rank 6, with shape (4, 2, 2, 1, 2, 3). Again, remember that the order is sample_shape, then batch_shape, and then event_shape.
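
The source does not show how exp was built, so the NumPy sketch below assumes it stands for a batch of exponential distributions with hypothetical rate parameters of shape (2, 1, 2, 3). It only illustrates how the sample shape is prepended to the parameter shape.

```python
import numpy as np

# Hypothetical rates: batch_shape (2, 1) followed by event_shape (2, 3).
rate = np.full((2, 1, 2, 3), 2.0)

rng = np.random.default_rng(0)
sample_shape = (4, 2)

# NumPy parameterizes the exponential by scale = 1 / rate.
samples = rng.exponential(scale=1.0 / rate, size=sample_shape + rate.shape)
print(samples.shape, samples.ndim)
```

The resulting array has the sample dimensions first, followed by the batch and event dimensions.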

Now let us consider log_prob. A simple example that uses broadcasting is evaluating the log probability at the scalar value 0.5, i.e. ind_exp.log_prob(0.5).

The distribution computes the log probability for each event in the batch, so the value 0.5 is broadcast against both the event_shape of [2, 3] and the batch_shape of [2, 1], giving an input in which every entry is equal to 0.5. The log probability of this event is computed for each distribution in the batch, and the result is a (2, 1) tensor.

As a general rule, the log_prob method will broadcast its input against the batch and event shape, which in this example is (2, 1, 2, 3). It will collapse the event_shape in the computation and the shape of the resulting tensor will be whatever is left, which here is the batch_shape of (2, 1).
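
The broadcast-then-collapse rule can be imitated in NumPy. As before, this sketch assumes exponential distributions with made-up rate values, since the original construction is not shown; the exponential log density is log(rate) - rate * x.

```python
import numpy as np

# Hypothetical rates with batch_shape (2, 1) and event_shape (2, 3).
rate = np.array([[[[1.0, 2.0, 3.0], [0.5, 1.5, 2.5]]],
                 [[[2.0, 2.0, 2.0], [1.0, 1.0, 1.0]]]])

x = 0.5
x_b = np.broadcast_to(x, rate.shape)           # scalar broadcast to (2, 1, 2, 3)

elementwise = np.log(rate) - rate * x_b        # log density of every component
log_prob = elementwise.sum(axis=(-2, -1))      # collapse the two event dims
print(elementwise.shape, log_prob.shape)
```

Only the event dimensions are summed out, so what remains has the batch_shape of (2, 1).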

Make Distribution Objects Trainable

Recall that in TensorFlow, variable objects are used to capture the values of parameters of our deep learning models. These variables are objects that persist in our program once created, but can change their values during the course of the program, say by using an optimizer object to apply gradients obtained from a loss function and data.

For example, we can learn the mean of a normal distribution by passing a tf.Variable as its loc parameter. The distribution object also has a trainable_variables attribute, which then contains this variable.

normal = tfd.Normal(
  loc=tf.Variable(0., name='loc'),
  scale=1. )

The mean value of this normal distribution is now trainable and can be updated according to some learning principle. The learning principle that we often use when training deep learning models is maximum likelihood, which is the same as finding the parameters that minimize the negative log likelihood. The function below computes the negative log likelihood, averaged over the training data:

def nll(x_train):
  return -tf.reduce_mean(normal.log_prob(x_train))

Let’s continue with the implementation of a training loop to learn the mean parameter from the data:

def get_loss_and_grads(x_train):
  with tf.GradientTape() as tape:
    loss = nll(x_train)
  grads = tape.gradient(loss, normal.trainable_variables)
  return loss, grads

optimizer = tf.keras.optimizers.SGD(learning_rate=0.05)

# num_steps and x_samples (the training data) are assumed to be defined elsewhere
for _ in range(num_steps):
  loss, grads = get_loss_and_grads(x_samples)
  optimizer.apply_gradients(zip(grads, normal.trainable_variables))
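
To see what this loop converges to, the same update can be written out in NumPy with the gradient derived by hand. For a normal distribution with fixed scale, the gradient of the mean negative log likelihood with respect to loc is (loc - mean(x)) / scale^2. The data, step count, and helper names below are assumptions made for illustration, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)
x_samples = rng.normal(loc=3.0, scale=1.0, size=1000)   # hypothetical training data

def nll(loc, x, scale=1.0):
    """Mean negative log likelihood of x under Normal(loc, scale)."""
    z = (x - loc) / scale
    return np.mean(0.5 * z**2 + np.log(scale) + 0.5 * np.log(2.0 * np.pi))

def nll_grad(loc, x, scale=1.0):
    """Derivative of nll with respect to loc."""
    return (loc - np.mean(x)) / scale**2

loc = 0.0
learning_rate = 0.05
num_steps = 300
for _ in range(num_steps):
    loc -= learning_rate * nll_grad(loc, x_samples)   # plain gradient descent step

print(loc, x_samples.mean())
```

The learned loc approaches the sample mean, which is the maximum likelihood estimate of the mean of a normal distribution.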

For more on Distribution Objects in TensorFlow Probability, please refer to the wonderful course here https://www.coursera.org/learn/probabilistic-deep-learning-with-tensorflow2

I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai
