Unfortunately, deep learning models aren't always accurate, especially when asked to make predictions on new data points that are dissimilar to the data they were trained on. The key insight is that it's important for models to be able to assign higher levels of uncertainty to incorrect predictions: we want our deep learning models to know what they don't know. The probabilistic approach to deep learning provides a means of dealing with the uncertainty involved in the modeling process.



The two categories of uncertainty are often referred to as aleatoric and epistemic uncertainty.

Aleatoric Uncertainty (Irreducible)

This is the uncertainty in the data itself: measurement error or noise in the labels of a dataset, or inherent stochasticity in the data-generating process. It comes in two flavors, homoscedastic and heteroscedastic, and the distinction comes down to whether or not the noise depends on the input variable.
1) Homoscedastic: the data uncertainty is the same for all target variables, regardless of the input.
2) Heteroscedastic: the data uncertainty varies according to the input variable.
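
To make the distinction concrete, here is a minimal sketch (not from the original notes; the linear function and noise scales are arbitrary) that generates a toy 1-D regression dataset under each noise assumption:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)

# Homoscedastic: the noise scale is a constant, independent of x
y_homoscedastic = 2.0 * x + rng.normal(loc=0.0, scale=0.1, size=x.shape)

# Heteroscedastic: the noise scale grows with x, so the uncertainty depends on the input
y_heteroscedastic = 2.0 * x + rng.normal(loc=0.0, scale=0.1 + 0.5 * x, size=x.shape)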
Epistemic Uncertainty

Epistemic uncertainty is model uncertainty. A model might well have the capacity to model the underlying data distribution accurately, but if there isn't enough data available, then the model won't be able to learn effectively. Epistemic uncertainty is something that will decrease as we gather more data, since the model gets more information about which parameters accurately explain the data.

The DistributionLambda Layer

The DistributionLambda layer is the most direct way of incorporating distribution objects into a deep learning model. There are many layers that allow us to model aleatoric uncertainty (data uncertainty), for example by learning the variance of a normal distribution output by the network. One way to think about these probabilistic layers is that they define stochastic activation units in our model.

Here is an example – an implementation of a linear regression model using the sequential API:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# y = <x, w> + b + epsilon
# epsilon ~ N(0, sigma^2)

model = Sequential([
  Dense(1, input_shape=(2,))
])

model.compile(loss='mse', optimizer='rmsprop')
model.fit(x_train, y_train, epochs=10)

The algorithm is motivated by an underlying modeling assumption and the learning principle of maximum likelihood. The modeling assumption is that our data was generated by a linear function of the input x: the observations y are given by the inner product of the input x and the weights w, plus some bias b, plus some noise denoted by the random variable ε, which is normally distributed with zero mean and variance σ². Minimizing the mean squared error loss, as we did above, is equivalent to maximizing the likelihood of the data under these statistical modeling assumptions.
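
To see this equivalence concretely, here is a small numerical check (a sketch, not from the original notes): the negative log-likelihood of a unit-variance normal distribution is just half the squared error plus a constant, so minimizing one minimizes the other.

import numpy as np
import tensorflow_probability as tfp
tfd = tfp.distributions

y_true, y_pred = 1.3, 0.7

nll = -tfd.Normal(loc=y_pred, scale=1.0).log_prob(y_true).numpy()
half_sq_error_plus_const = 0.5 * (y_true - y_pred) ** 2 + 0.5 * np.log(2 * np.pi)

print(np.isclose(nll, half_sq_error_plus_const))  # True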

However, we can use probabilistic layers to make our modeling assumptions more explicit by including the normal distribution within the model itself.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import tensorflow_probability as tfp
tfd  = tfp.distributions
tfpl = tfp.layers

# y = <x, w> + b + epsilon
# epsilon ~ N(0, sigma^2)

model = Sequential([
  Dense(1, input_shape=(2,)),
  tfpl.DistributionLambda(
    lambda t: tfd.Normal(loc=t, scale=1),
    convert_to_tensor_fn=tfd.Distribution.sample
    # specifies how to extract a tensor from the distribution object
  )
])

The final layer of the model includes a normal distribution. The output of the Dense layer is the mean of this normal distribution (let's just assume for now that the variance is fixed). The argument to the DistributionLambda constructor should be a function that takes the output of the previous layer as input (in this case, the variable t) and returns a distribution object (in this case, tfd.Normal); here we use a lambda function for this. With this construction, the model itself now returns a distribution object when it's called.

model(x_sample)  # x_sample is of shape (16, 2)

# the model returns:
# tfp.distributions.Normal("sequential_distribution_lambda_Normal",
#   batch_shape=[16, 1], event_shape=[], dtype=float32)

The normal distribution is a univariate distribution, so it always has a scalar event shape (event_shape=[]). The output tensor of the Dense layer has shape (16, 1): 16 for the batch size and 1 for the number of units in the Dense layer, and this becomes the batch shape of the distribution.
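
As a small illustration (not from the original notes, reusing the x_sample batch from above): the object returned by the model is a full distribution with its usual methods, and whenever TensorFlow needs a plain tensor it coerces the distribution using the convert_to_tensor_fn we specified, which here draws a sample.

dist = model(x_sample)

dist.mean()                 # the distribution's methods are all available
dist.stddev()

tf.convert_to_tensor(dist)  # coerced to a tensor via convert_to_tensor_fn, i.e. a sample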



What we want to do is minimize the negative log likelihood of the data:

# All loss functions in Keras have the signature loss_fn(y_true, y_pred)

def nll(y_true, y_pred):
  return -y_pred.log_prob(y_true)

model.compile(loss=nll, optimizer='rmsprop')
model.fit(x_train, y_train, epochs=10)

Remember that now the output of our model is a distribution object, so here y_pred is the normal distribution object. We can easily compute the negative log-likelihood using the log_prob method of the distribution.

Now our model is probabilistic. When we use our model on test inputs, it returns a distribution object. We can then get predictions by sampling from this distribution. Or we could also get the mean of this distribution.

model(x_test).sample()  # draw a sample from the predictive distribution
model(x_test).mean()    # or take the mean of the distribution

Some Other Probabilistic Layers

The DistributionLambda layer is a base class for several of the other probabilistic layers inside the layers module of the TensorFlow Probability library.

The IndependentNormal Layer

model = Sequential([
  Dense(16, activation='relu', input_shape=(2,)),
  Dense(2), # the output tensor is of shape (batch_size, 2)
  tfpl.DistributionLambda(
    lambda t: tfd.Independent(
      tfd.Normal(
        loc=t[..., :1], # ... stands for the batch dimension; the first column is the mean
        scale=tf.math.softplus(t[..., 1:]) # the second column parameterizes the std deviation
      ),
      reinterpreted_batch_ndims=1 # fold the last batch dimension into the event shape
    )
  )
])

model(x_sample) # suppose the shape of x_sample is (16, 2)

# tfp.distributions.Independent("sequential_distribution_...",
#   batch_shape=[16], event_shape=[1], dtype=float32)

Instead of the DistributionLambda layer, we could use the IndependentNormal layer as a drop-in replacement.

model = Sequential([
  Dense(16, activation='relu', input_shape=(2,)),
  Dense(2), # the output tensor is of shape (batch_size, 2)
  tfpl.IndependentNormal(1) # the argument 1 is the event_shape
])

model(x_sample) # suppose the shape of x_sample is (16, 2)

# tfp.distributions.Independent("sequential_distribution_...",
#   batch_shape=[16], event_shape=[1], dtype=float32)

The IndependentNormal layer internally creates a batch of normal distributions wrapped in an Independent distribution, just like the DistributionLambda layer in our previous example.

However, when the event shape passed to IndependentNormal is some value other than 1, for example:

model = Sequential([
  Dense(16, activation='relu', input_shape=(2,)),
  Dense(4), # needs 4 units, since the event shape of IndependentNormal is 2 now
  tfpl.IndependentNormal(2) # the argument 2 is the event_shape
])

You can think of tfpl.IndependentNormal(2) as a 2-dimensional multivariate normal with a diagonal covariance matrix. Now that the mean is 2-dimensional, we also need 2 parameters to specify the standard deviation of each component, which is why the previous Dense layer has 4 units.

It’s also possible to define an event shape with a rank that is greater than one.

model = Sequential([
  Dense(16, activation='relu', input_shape=(2,)),
  Dense(8), # needs 8 units, since the event shape of IndependentNormal is [2, 2] now
  tfpl.IndependentNormal([2, 2]) # the argument [2, 2] is the event_shape
])

model(x_sample) # suppose the shape of x_sample is (16, 2)

# tfp.distributions.Independent("sequential_distribution_...",
#   batch_shape=[16], event_shape=[2, 2], dtype=float32)

In any case, many of these probabilistic layers have a convenient static method called params_size, which takes the event shape as an argument and returns the number of parameters needed for this probabilistic layer with this event shape.

event_shape = 2  # or 1, [2, 2], etc.

model = Sequential([
  Dense(16, activation='relu', input_shape=(2,)),
  Dense(tfpl.IndependentNormal.params_size(event_shape)),
  tfpl.IndependentNormal(event_shape)
])
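
As a quick check (assuming the imports above), the parameter counts for the event shapes used in this section work out as expected:

tfpl.IndependentNormal.params_size(1)       # 2 -> 1 mean + 1 std deviation
tfpl.IndependentNormal.params_size(2)       # 4 -> 2 means + 2 std deviations
tfpl.IndependentNormal.params_size([2, 2])  # 8 -> 4 means + 4 std deviations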


The OneHotCategorical Layer

Here is a simple, standard convolutional neural network that uses the probabilistic layer called OneHotCategorical. Previously, we might have designed this model with a final Dense layer of 10 units and a softmax activation function, but now the OneHotCategorical layer directly outputs a OneHotCategorical distribution object.

from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

num_classes = 10

model = Sequential([
  Conv2D(16, (3, 3), activation='relu', input_shape=(32, 32, 3)),
  MaxPooling2D((3, 3)),
  Flatten(),
  Dense(64, activation='relu'),
  Dense(tfpl.OneHotCategorical.params_size(num_classes)),
  tfpl.OneHotCategorical(num_classes)
])

model.compile(loss=lambda y_true, y_pred: -y_pred.log_prob(y_true))
model.fit(x_train, y_train, epochs=20)

This model can be trained using the same principle of maximum likelihood that we’ve used before. In the compile method, we’re using a lambda function to define the negative log likelihood.
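
One useful consequence (a sketch, assuming x_test is a batch of images and that the labels passed to fit are one-hot encoded): the model returns a OneHotCategorical distribution, so class probabilities, sampled predictions and label likelihoods are all available directly.

dist = model(x_test)

dist.mean()            # per-class probabilities (the mean of a one-hot categorical)
dist.sample()          # one-hot samples drawn from the predicted distribution
dist.log_prob(y_test)  # log-likelihood of the true one-hot labels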

The DenseVariational Layer

Remember that the second type of uncertainty is epistemic uncertainty (model uncertainty). This is uncertainty about which model parameter values explain the data. Normally, when we optimize a deep learning model, we end up with a single value for each weight in the network.

But in reality, especially since our dataset is finite, there will most likely be many parameter values that do a good job of modeling the relationship between data inputs and outputs. If we went out and collected more data, then we would have more information about that relationship, and the likely set of model parameters would probably narrow down.

This likely set of model parameter values given a data set is represented as a distribution over all possible parameter values and is called the posterior distribution. A common interpretation is that the posterior distribution captures our belief over which parameter values are most likely, given the data that we’ve seen.

  1. We will start with some prior distribution over the weights. That is a belief about which weights are likely before we’ve even seen any data.
  2. We would then update our belief about the weights once we’ve seen the data set, to obtain the posterior distribution.

Unfortunately though, learning the true posterior distribution is hard, and in practice we have to approximate it. We could use variational inference to give an approximation of the posterior over the model weights. We can implement this algorithm for a feed-forward network using the DenseVariational layer.
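
As a brief refresher (standard variational inference, not spelled out in the original notes): variational inference fits an approximate posterior q(w) by minimizing the negative evidence lower bound, KL(q(w) || p(w)) - E_{q(w)}[log p(D | w)], with respect to the parameters of q. The first term pulls the approximate posterior towards the prior p(w), and the second term is the expected negative log-likelihood of the data; the DenseVariational layer described below optimizes exactly this kind of objective.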

Here we have a simple feed-forward network with a probabilistic layer output:

model = Sequential([
  Dense(16, activation='relu', input_shape=(8,)),
  Dense(2),
  tfpl.IndependentNormal(1)
])

model.compile(loss=lambda y_true, y_pred: - y_pred.log_prob(y_true))
model.fit(x_train, y_train, epochs=20)

The two Dense layers contain weights and biases that are learned during training, and it's these weights and biases that define the mean and variance of the output normal distribution. After training this model, we will obtain point estimates for these Dense layer weights and biases. But in order to capture the epistemic uncertainty, we would like instead to learn the posterior distribution over these parameters.

The Prior

We need to start by defining the prior distribution over these parameters. This distribution represents our belief about which model parameters are likely before we've seen any data. The standard assumption is that the prior is a spherical Gaussian, or in other words an independent normal distribution for each weight and bias, all with equal variance.

We define the prior to be the same independent normal distribution regardless of the dense layer input. The prior also has no trainable variables, so it won't change during the optimization procedure.

def prior(kernel_size, bias_size, dtype=None):
  # kernel_size is the number of parameters in the dense layer's weight matrix
  n = kernel_size + bias_size

  # Return a callable (a lambda). Its input t is the input tensor to the dense
  # layer that this prior is defined for, but it is not used by the distribution.
  return lambda t: tfd.Independent(
    tfd.Normal(loc=tf.zeros(n, dtype=dtype), scale=1),
    reinterpreted_batch_ndims=1
  )
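
As a quick usage check (a sketch with hypothetical sizes, not from the original notes), calling prior gives back a function that ignores its input and always returns the same independent normal over all n parameters:

prior_fn = prior(kernel_size=4, bias_size=1)  # 5 parameters in total
prior_dist = prior_fn(None)                   # the input is ignored by the prior

prior_dist.event_shape  # [5]
prior_dist.sample()     # one draw of all 5 parameters from the prior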


The Posterior

We also need to define the posterior distribution over the dense layer parameters. The function below returns a Sequential model, which is also a callable object: it takes a tensor as input and again returns a distribution object.

def posterior(kernel_size, bias_size, dtype=None):
  n = kernel_size + bias_size
  return Sequential([
    # VariableLayer is a simple layer that returns a TensorFlow variable
    # when called, regardless of the input
    tfpl.VariableLayer(tfpl.IndependentNormal.params_size(n), dtype=dtype),

    # the event size equals the number of parameters in the dense layer
    tfpl.IndependentNormal(n, convert_to_tensor_fn=tfd.Distribution.sample)
  ])

So the posterior distribution will also be independent of the tensor inputs to the dense layer; the object returned by the callable Sequential model is a distribution over the dense layer parameters.

The Model

Now that we've defined the prior and posterior, we're ready to redefine our deep learning model, this time using distributions for the model weights instead of point estimates. We replace each Dense layer we had before with a DenseVariational layer from the TensorFlow Probability layers module.

DenseVariational layers also require you to provide functions that define the posterior and prior distributions. The layer then adds a KL divergence term to the loss during optimization, in a similar way to how a weight regularizer is added. The kl_weight argument rescales this term; normally 1/N is the value you should use, where N is the size of the dataset.

model = Sequential([
  tfpl.DenseVariational(16, posterior, prior, kl_weight=1/N,
    kl_use_exact=True, activation='relu', input_shape=(8,)),
  tfpl.DenseVariational(2, posterior, prior, kl_weight=1/N,
    kl_use_exact=True),
  tfpl.IndependentNormal(1)
])

Additionally, the kl_use_exact option controls how the KL divergence is computed. Depending on the choice of distributions used for the posterior and the prior, it may be possible to compute the KL divergence analytically; otherwise it is estimated from samples of the posterior.
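
After compiling and fitting this model with the same negative log-likelihood loss as before, the epistemic uncertainty shows up in a simple way (a sketch, assuming x_test is a batch of held-out inputs): each forward pass samples a fresh set of weights from the learned posterior, so repeated calls on the same input give different predictive distributions.

model.compile(loss=lambda y_true, y_pred: -y_pred.log_prob(y_true), optimizer='rmsprop')
model.fit(x_train, y_train, epochs=20)

# Each call samples different weights from the posterior,
# so the predicted means vary from call to call.
for _ in range(3):
  print(model(x_test).mean().numpy()[:2])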

Reparameterization Layers

The DenseVariational layer is an updated version of an earlier class in TensorFlow Probability called the DenseReparameterization layer. Here is a convolutional neural network architecture where the convolutional and dense layers have been replaced with their reparameterization variants from the layers module of the TFP library.

model = Sequential([
  tfpl.Convolution2DReparameterization(16, [3, 3],
    activation='relu', input_shape=(28, 28, 1)),
  MaxPool2D(3),
  Flatten(),
  tfpl.DenseReparameterization(tfpl.OneHotCategorical.params_size(10)),
  tfpl.OneHotCategorical(10)
])

Behind these reparameterization layers, the underlying algorithm and theoretical background are the same as what we discussed for the DenseVariational layer: the posterior distributions are learned using variational inference, by assuming a parameterized form for the posterior and then maximizing a lower bound on the log evidence (the ELBO).
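
As a sketch of how this model could be trained (assuming x_train contains images and y_train one-hot labels; not spelled out in the original notes), the reparameterization layers add their KL divergence terms to the model's losses automatically, so we only need to supply the negative log-likelihood. In practice these KL terms are usually rescaled by the dataset size, analogous to kl_weight above.

def nll(y_true, y_pred):
  # y_pred is the OneHotCategorical distribution returned by the model
  return -y_pred.log_prob(y_true)

# Keras adds the layers' KL divergence terms (collected in model.losses)
# to this loss automatically during training.
model.compile(loss=nll, optimizer='adam')
model.fit(x_train, y_train, epochs=20)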




For more on TensorFlow: Probabilistic Deep Learning Models, please refer to the wonderful course here https://www.coursera.org/learn/probabilistic-deep-learning-with-tensorflow2

