Sequential data is data that has a natural sequential structure built into it, like text data or audio data. There are various network architectures and layers that we can use to make predictions from sequence data. However, sequence data is often unstructured and less uniform than other datasets.

Preprocessing Sequential Data

Sequence examples often differ in length. We can pad or truncate them to a common length so that several examples can be stacked together for batch processing.

from tensorflow.keras.preprocessing.sequence import pad_sequences

test_input = [
  [4, 12, 33, 18],
  [63, 23, 54, 30, 19, 3],
  [43, 37, 11, 33, 15]
]

test_input_2 = [
  [ [2, 1], [3, 3] ],
  [ [4, 3], [2, 4], [1, 1] ]
]

preprocessed_data = pad_sequences(test_input, padding='post', maxlen=5, truncating='post', value=-1)

# preprocessed_data
# [[4, 12, 33, 18, -1],
#  [63, 23, 54, 30, 19],
#  [43, 37, 11, 33, 15]]

preprocessed_data_2 = pad_sequences(test_input_2, padding='post')

# preprocessed_data_2
# [[ [2, 1], [3, 3], [0, 0] ],
#  [ [4, 3], [2, 4], [1, 1] ]]
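
To see what the padding and truncating options do, here is a minimal NumPy sketch of the same idea. The helper pad_post is a hypothetical name used only for illustration, it handles only the 'post' case for flat integer sequences, and it is not how pad_sequences is implemented.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Hypothetical helper: truncate, then pad, each sequence at the end."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for i, seq in enumerate(sequences):
        trimmed = seq[:maxlen]            # like truncating='post': keep the first maxlen items
        out[i, :len(trimmed)] = trimmed   # like padding='post': fill value stays at the end
    return out

test_input = [
    [4, 12, 33, 18],
    [63, 23, 54, 30, 19, 3],
    [43, 37, 11, 33, 15],
]

padded = pad_post(test_input, maxlen=5, value=-1)
print(padded)
```

Note that pad_sequences itself defaults to padding='pre' and truncating='pre', so the 'post' behaviour has to be requested explicitly, as in the call above.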

Padding does lead to a complication, though: you will not want to train your model on the parts of the input sequences that are padding values. Fortunately, this is easy to handle using masking in your network. The masking layer expects a three-dimensional input, i.e. (batch_size, seq_length, features), so you will probably need to add a new dimension by using [..., np.newaxis].

The tensor returned by the masking layer has an extra attribute called _keras_mask, a boolean tensor that signals which values in the input are part of the original data and which should be ignored. This mask is used to make sure the loss function is calculated correctly and ignores any parts of the input that are padding.

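import numpy as np
from tensorflow.keras.layers import Masking

# Add a features axis and cast to float: (3, 5) -> (3, 5, 1)
preprocessed_data = preprocessed_data[..., np.newaxis].astype('float32')

masking_layer = Masking(mask_value=-1)
masked_input = masking_layer(preprocessed_data)

# masked_input (masked time steps are zeroed in the output)
# [[ [4.], [12.], [33.], [18.], [0.] ],
#  [ [63.], [23.], [54.], [30.], [19.] ],
#  [ [43.], [37.], [11.], [33.], [15.] ]]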

# masked_input._keras_mask
# [[ True, True, True, True, False ],
#  [ True, True, True, True, True ],
#  [ True, True, True, True, True ]]
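
The effect of the mask on the loss can be illustrated with plain NumPy. This is only a sketch of the idea: the per-step loss values are made up, and the exact reduction Keras applies may differ from the simple average used here.

```python
import numpy as np

# Hypothetical per-time-step losses for a batch of 3 sequences of length 5.
per_step_loss = np.array([
    [0.5, 0.1, 0.3, 0.1, 9.0],   # the last step is padding, so its loss is meaningless
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.4, 0.0, 0.4, 0.0, 0.2],
])

# Same layout as the _keras_mask above: False marks a padded time step.
mask = np.array([
    [True, True, True, True, False],
    [True, True, True, True, True],
    [True, True, True, True, True],
])

naive_mean = per_step_loss.mean()                         # padding step included
masked_mean = (per_step_loss * mask).sum() / mask.sum()   # padding step excluded

print(naive_mean, masked_mean)
```

Without the mask, the single padded step dominates the average; with it, only the 14 real time steps contribute.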

The Embedding Layer

The embedding layer takes in a tokenized sequence and maps each token to a point in a high-dimensional embedding space. This allows the network to learn its own representation of each token in a sequence input.

from tensorflow.keras.layers import Embedding
import numpy as np

embedding_layer = Embedding(1000,  # input dimension
                            32)    # embedding dimension

dummy_input = np.random.randint(1000, size=(16, 64)) # (batch_size, input_len)

embedding_inputs = embedding_layer(dummy_input) # (16, 64, 32)

The first argument is the input dimension, which you might find easier to think of as the vocabulary size. It’s just the total number of unique tokens or words in the sequence data inputs. The second argument is the embedding dimension. Each input token is mapped to a point in the embedding space in such a way as to make a useful representation for the network to accomplish its task. The embedding layer is also able to handle padded sequence inputs correctly.

With mask_zero=True, the embedding layer interprets any zeros in the input as padding values, so the network ignores them.
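
As a small sketch of both points, the snippet below builds an embedding layer with mask_zero=True and inspects the mask it produces. The token ids are made up for illustration, and the last lines check that the layer behaves as a lookup into its weight matrix.

```python
import numpy as np
from tensorflow.keras.layers import Embedding

# Token id 0 is reserved for padding when mask_zero=True.
embedding_layer = Embedding(1000, 32, mask_zero=True)

# Hypothetical batch of two tokenized sequences, the first padded with zeros.
tokens = np.array([
    [7, 42, 5, 0, 0],
    [3, 9, 8, 6, 1],
], dtype="int32")

embedded = embedding_layer(tokens)             # shape (2, 5, 32)
mask = embedding_layer.compute_mask(tokens)    # True wherever the token id is not 0

# Row i of the weight matrix is the vector for token i.
table = embedding_layer.get_weights()[0]       # shape (1000, 32)
same_as_lookup = np.allclose(np.array(embedded), table[tokens])

print(embedded.shape, np.array(mask), same_as_lookup)
```

Because index 0 now means padding, it cannot also stand for a real token, so the input dimension should be the vocabulary size plus one.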

Recurrent Neural Network

Recurrent neural networks are an important class of models for working with sequence data. They are designed to capture the temporal dependencies in the data. Here’s an example of a simple recurrent neural network using the SimpleRNN layer.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

model = Sequential([
  Embedding(1000, 32, input_length=64), #output (None, 64, 32)
  SimpleRNN(64, activation='tanh'),     #output (None, 64)
  Dense(5, activation='softmax')        #output (None, 5)
])

In general, an RNN layer expects a three-dimensional tensor input with shape

(batch_size, sequence_length, num_features)

In the example above, the SimpleRNN layer is a plain recurrent neural network with a hidden state of size 64. The RNN processes the sequence input and its output is the final hidden state of the network, i.e. a two-dimensional tensor with shape

(batch_size, num_hidden_states)
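
To make the recurrence concrete, here is a minimal NumPy sketch of what a plain RNN layer computes. The function simple_rnn_forward and the weight names W, U and b are hypothetical names for illustration. The sketch assumes a zero initial state and a tanh activation, and it is not how Keras implements SimpleRNN internally.

```python
import numpy as np

def simple_rnn_forward(x, W, U, b):
    """Return the final hidden state for x of shape (batch, steps, features)."""
    batch_size, steps, _ = x.shape
    h = np.zeros((batch_size, U.shape[0]))        # initial hidden state (assumed zero)
    for t in range(steps):
        # New state depends on the current input and the previous state.
        h = np.tanh(x[:, t, :] @ W + h @ U + b)
    return h

rng = np.random.default_rng(0)
features, units = 32, 64
W = rng.normal(scale=0.1, size=(features, units))   # input-to-hidden weights
U = rng.normal(scale=0.1, size=(units, units))      # hidden-to-hidden weights
b = np.zeros(units)

x = rng.normal(size=(16, 64, features))             # (batch_size, sequence_length, num_features)
final_state = simple_rnn_forward(x, W, U, b)
print(final_state.shape)                            # (batch_size, num_hidden_states)
```

Only the last value of h is returned, which is why the sequence dimension disappears from the output shape.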

One of the strengths of recurrent neural nets is their ability to take sequences of flexible length, so it is OK to omit input_length when using the Embedding layer. This enables the network to take a batch of sequences of any length: both the batch_size and the sequence_length are flexible. That’s possible because the RNN layer only returns its hidden state at the final time step.

model = Sequential([
  Embedding(1000, 32),                  #output (None, None, 32)
  SimpleRNN(64, activation='tanh'),     #output (None, 64)
  Dense(5, activation='softmax')        #output (None, 5)
])

LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) layers can be used in the same way as SimpleRNN. You might want to experiment with these different RNN layers to see which one works best for your application.

For example, using the Functional API to define a model:

from tensorflow.keras.layers import Input, Masking, LSTM, Dense
from tensorflow.keras.models import Model

# flexible sequence_length, 10 features
inputs = Input(shape=(None, 10))             # (None, None, 10)
h = Masking(mask_value=0)(inputs)            # (None, None, 10)
h = LSTM(64)(h)                              # (None, 64)
outputs = Dense(5, activation='softmax')(h)  # (None, 5)

model = Model(inputs, outputs)
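
As a quick check of the flexible shapes, the sketch below rebuilds the same model and calls it on two batches that differ in both batch size and sequence length. The random inputs are placeholders, not real data.

```python
import numpy as np
from tensorflow.keras.layers import Input, Masking, LSTM, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(None, 10))
h = Masking(mask_value=0)(inputs)
h = LSTM(64)(h)
outputs = Dense(5, activation='softmax')(h)
model = Model(inputs, outputs)

rng = np.random.default_rng(0)
batch_a = rng.random((2, 7, 10)).astype("float32")    # 2 sequences of 7 steps
batch_b = rng.random((3, 20, 10)).astype("float32")   # 3 sequences of 20 steps

pred_a = np.array(model(batch_a))   # shape (2, 5)
pred_b = np.array(model(batch_b))   # shape (3, 5)
print(pred_a.shape, pred_b.shape)
```

The output always has five class probabilities per sequence, regardless of how many time steps went in.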

Stacked RNN

So far the RNN layers have returned only the output at the final time step. Sometimes, however, you would like an RNN layer to return an output at every time step in the sequence. These outputs can then be used as:

  1. the final model predictions, or
  2. an input for another recurrent layer further downstream.

Each of the recurrent neural network layers has an optional argument return_sequences, which is set to False by default. This is why the RNN layer only returns the output at the final time step. If this option is set to True, the layer will return an output for each time step.

With return_sequences=True, the output of the first LSTM layer is still a three-dimensional tensor of the form (batch_size, sequence_length, num_features), which can be used as an input to another recurrent neural network layer. This is how we can create stacked LSTMs.

h = LSTM(32, return_sequences=True)(h)       # (None, None, 32)
h = LSTM(64)(h)                              # (None, 64)
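
The shapes can be checked by calling the layers directly on a dummy batch. This is a small sketch with random placeholder input of 4 sequences, 12 time steps and 10 features.

```python
import numpy as np
from tensorflow.keras.layers import LSTM

x = np.random.default_rng(0).random((4, 12, 10)).astype("float32")

all_steps = LSTM(32, return_sequences=True)(x)   # one output per time step: (4, 12, 32)
last_step = LSTM(64)(all_steps)                  # final time step only:     (4, 64)

print(all_steps.shape, last_step.shape)
```

The first layer keeps the time axis, so the second layer still receives a valid three-dimensional input.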

Bidirectional Wrapper

Bidirectional recurrent layers are often used when we’d like the network to take into account future context as well as past context. We can create a bidirectional layer by using the Bidirectional wrapper and calling it on a regular recurrent layer.

h = Bidirectional(LSTM(32, return_sequences=True))(h)     # (None, None, 64)
h = Bidirectional(LSTM(64))(h)                            # (None, 128)

Because each LSTM is now wrapped in a bidirectional layer, we effectively have two LSTM networks per layer: one running forwards in time and one running backwards. The outputs are a combination of the outputs from those two recurrent networks. For the first bidirectional layer above:

  1. The first dimension is the batch_size, as always.
  2. The second dimension is the sequence_length, which in this model is flexible.
  3. The third dimension is the feature dimension, which is the concatenation of the outputs of the two LSTMs running forwards and backwards in time.

We can also change the behavior of the bidirectional wrapper through the merge_mode option. If we set it to 'sum', the forward and backward RNN outputs will be added together instead of concatenated.
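
The difference between the two merge modes shows up directly in the output shapes. The sketch below uses random placeholder input of 4 sequences, 12 time steps and 10 features.

```python
import numpy as np
from tensorflow.keras.layers import LSTM, Bidirectional

x = np.random.default_rng(0).random((4, 12, 10)).astype("float32")

# Default merge_mode='concat': forward and backward outputs sit side by side.
concat_out = Bidirectional(LSTM(32, return_sequences=True))(x)                    # (4, 12, 64)

# merge_mode='sum': forward and backward outputs are added element-wise.
summed_out = Bidirectional(LSTM(32, return_sequences=True), merge_mode='sum')(x)  # (4, 12, 32)

# Without return_sequences, only the final outputs of both directions are merged.
final_out = Bidirectional(LSTM(64))(x)                                            # (4, 128)

print(concat_out.shape, summed_out.shape, final_out.shape)
```

Concatenation doubles the feature dimension, while summing keeps it equal to the number of units in the wrapped layer.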

For more on Sequential Data and Recurrent Neural Networks, please refer to the wonderful course here

I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at
