Data pipelines load, transform, and filter a wide range of data for use in your models. Broadly speaking, there are two ways to handle a data pipeline:

  1. from within Keras,
  2. from within the tensorflow.data module.

Together, these two approaches offer some powerful methods for managing data and building an effective workflow.



Keras Datasets

The Keras datasets give us a really convenient way to access different types of datasets, which are available as modules within tensorflow.keras.datasets. For example, you could import the mnist module and load the dataset itself.

from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

The dataset is downloaded and cached in a hidden folder (~/.keras/datasets) on your local machine. Other datasets can be loaded in the same way, although the load_data() method of some datasets supports additional parameters.
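
For example, the load_data() method of the imdb module accepts a num_words argument that keeps only the most frequent words in the reviews. A minimal sketch (the exact parameters vary from dataset to dataset):

from tensorflow.keras.datasets import imdb

# Keep only the 10,000 most frequently occurring words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)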

Dataset Generators

In practice datasets are often a lot bigger and won't fit into memory. One way to handle this is to use dataset generators. This is a way to feed data into the model without loading it all into memory at once.

A generator in Python is a function that returns a generator object that you can iterate over. It yields a finite or infinite series of values, but it doesn't store all those values in memory. Each time we iterate the generator, it yields the next value in the series. In this way, we can use generators to feed data into our model when the data doesn't fit into memory.

Here is an example of a generator that generates an infinite series of values:

import numpy as np

def get_data(batch_size):
  while True:
    # Random binary labels for each example in the batch
    y_train = np.random.choice([0, 1], (batch_size, 1))
    # Features drawn from a Gaussian whose mean depends on the label (-1 or +1)
    x_train = np.random.randn(batch_size, 1) + (2 * y_train - 1)
    yield x_train, y_train

datagen = get_data(32)

x, y = next(datagen)

Note: when you want to fit a model using this data generator, you should call the fit_generator method of the model. Because the generator is infinite, we need to tell the model explicitly how much data makes up an epoch by specifying the steps_per_epoch and epochs parameters.
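
A minimal sketch of what this could look like, assuming a simple (hypothetical) binary classification model that matches the generator above:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# A toy model for the 1-dimensional inputs produced by get_data
model = Sequential([Dense(1, activation='sigmoid', input_shape=(1,))])
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])

# The generator is infinite, so steps_per_epoch defines what counts as one epoch
model.fit_generator(datagen, steps_per_epoch=1000, epochs=5)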

Another example of training a model using a generator is to use the train_on_batch method, which just performs one optimizer update for a single batch of training data. In most cases though, it’s unlikely that you’ll really need to use the train_on_batch method, because this is a slightly lower level way of handling model training.
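
As a rough sketch, a manual training loop built around train_on_batch could look like this (reusing the model and generator assumed above):

# One optimizer update per batch drawn from the generator
for step in range(1000):
  x_batch, y_batch = next(datagen)
  loss, accuracy = model.train_on_batch(x_batch, y_batch)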

Finally, when it comes to evaluation and prediction, there are corresponding evaluate_generator and predict_generator methods, to which we can pass other data generators. Remember, if a generator produces an infinite series of data, we need to specify how many steps to run by setting the steps parameter.
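
For example, assuming a separate validation generator built from the same get_data function:

val_datagen = get_data(32)

# The generator is infinite, so the number of batches must be given explicitly
val_loss, val_accuracy = model.evaluate_generator(val_datagen, steps=100)

# In tf.keras, labels yielded by the generator are ignored when predicting
predictions = model.predict_generator(get_data(32), steps=100)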



Image Data Augmentation

Another benefit of using generators is that we can add data pre-processing as part of the workflow. The ImageDataGenerator object in Keras offers a useful way of handling image data. It allows us to define a generator that we can pass to our model for training with fit_generator, and gives us a really easy way of pre-processing and augmenting our image data set on the fly during training.

Data augmentation techniques such as:

  1. shifting or rotating the image slightly
  2. flipping an image horizontally

effectively produce more data for our model to train on. This is really useful when training on image data, especially when the data set is quite small. The ImageDataGenerator provides options to implement these and other data augmentation or pre-processing techniques:

rescale: the values of all pixels are multiplied by a factor.
horizontal_flip: if set to True, this will randomly choose whether to flip each training image horizontally or not.
height_shift_range: each image will be randomly shifted up or down.
fill_mode: chooses the method by which any missing pixels (created by a shift) are filled in.
featurewise_center: if set to True, this will standardize the data set so that the mean of each individual feature (say, the RGB channels) in the input image over the whole data set is equal to zero.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

image_data_gen = ImageDataGenerator(rescale=1/255., horizontal_flip=True, height_shift_range=0.2,
                                    fill_mode='nearest', featurewise_center=True)

The ImageDataGenerator needs to calculate the data set feature means first, before it can start generating pre-processed samples. So if we’re using a standardization technique, we then need to run the fit method of the ImageDataGenerator on the training data set.

image_data_gen.fit(x_train)

Finally, we can then get the generator itself by using the flow method and passing in the training images and labels. This method also has some training options. Here we’re setting the batch size to be 16.

train_datagen = image_data_gen.flow(x_train, y_train, batch_size=16)

And then we are ready to train the model like before:

model.fit_generator(train_datagen, epochs=20)

Similar objects are available in Keras for pre-processing sequence data and text data.
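
For instance, the text pre-processing utilities can be used like this (a quick sketch with a made-up toy corpus):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ['the cat sat on the mat', 'the dog ate my homework']

# Map words to integer indices, then pad the sequences to a common length
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=10)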

Introducing the tensorflow.data Module

The tensorflow.data API gives us a unified way of handling datasets of any type, including data that might come from a range of different sources. The module contains a powerful set of tools for managing complex and flexible data pipelines. The main object that we're going to be using is the Dataset class, which is the main abstraction for our data. It lets us:

  1. handle datasets whatever form or size they might come in,
  2. apply any necessary preprocessing, and
  3. filter out examples that we might not want to use.

Creating Dataset Objects From Different Sources

The Dataset class comes with a few static methods that we can use to create Dataset objects. Here is a simple example that passes a tuple of tensors to the from_tensor_slices method. This returns a TensorSliceDataset object. Passing in a tuple of tensors is very common whenever we want to create a dataset with input and output data.

The first dimension of each tensor is interpreted as the dataset size, so it's important that every tensor in the tuple has the same first dimension, otherwise we'd get an error. We can directly inspect the type specification of the dataset elements by looking at the element_spec property.

import tensorflow as tf
dataset = tf.data.Dataset.from_tensor_slices(
    ( tf.random.uniform([256, 4], minval=1, maxval=10, dtype=tf.int32),
      tf.random.normal([256]) )
)

print(dataset.element_spec)
# (TensorSpec(shape=(4,), dtype=tf.int32, name=None),
#  TensorSpec(shape=(), dtype=tf.float32, name=None) )

The object is iterable, so we can easily access each element in the dataset by writing a simple for loop. The take method will just take the first few elements of the dataset. This is a useful way to take a quick look at the data to check that everything looks right.

for x, y in dataset.take(2):
  print(x.numpy(), y.numpy())

We could use the same approach to load an actual dataset into a Dataset object. Remember, the inputs and outputs returned by load_data are NumPy arrays.

import tensorflow as tf
from tensorflow.keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))


We can also use the Dataset class to wrap generators. Suppose we've set up an image data generator with a couple of options for random data augmentation. Then we can use the from_generator method to create a Dataset object.

image_datagen = ImageDataGenerator(horizontal_flip=True, width_shift_range=0.2)

dataset = tf.data.Dataset.from_generator(image_datagen.flow,
                                         args=[x_train, y_train],
                                         output_types=(tf.float32, tf.int32),
                                         output_shapes=([32, 32, 32, 3], [32, 1]))

There are many more ways of creating instances of the same Dataset class from different types of data sources. But whatever the method and the data source, the point is that the end result is an instance of the Dataset class. This abstraction provides a nice unified way of dealing with different types and sources of data. Once we've created a Dataset object for our data, we can use its flexible preprocessing and filtering methods to prepare the data for training.
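
For example, text files can be read line by line with tf.data.TextLineDataset, and simple numeric ranges can be created with Dataset.range (a sketch; 'my_data.txt' is just a hypothetical file path):

# Each element of this dataset is one line of the text file
text_dataset = tf.data.TextLineDataset(['my_data.txt'])

# A simple dataset containing the integers 0 to 9
range_dataset = tf.data.Dataset.range(10)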

Training Model with Datasets

Recall that if we iterate over the data set object we extract each data element one by one. But we can also batch data examples together by using the dataset.batch method, and choose to drop the remainder. This makes sure that every batch really does have 16 images and labels.

Now we want to train our model. And that’s as simple as calling the model.fit method as usual where this time we don’t need to pass in the inputs and outputs separately. We just pass in the data set object.

dataset = dataset.batch(16, drop_remainder=True)
model.fit(dataset)

Running this model.fit call will train the model for one epoch, that is, one complete pass through the dataset object. If we want the model to see the data 10 times, we can repeat the dataset.

dataset = dataset.repeat(10)
model.fit(dataset)

Using the repeat method without an argument means that the dataset will repeat indefinitely. Then, to train for a certain number of epochs, we can set the steps_per_epoch argument, so that the training process knows when an epoch has ended, along with the epochs argument.

dataset = dataset.repeat()
history = model.fit(dataset, steps_per_epoch=x_train.shape[0]//16, epochs=10)

You might also want to randomly shuffle the dataset by calling the shuffle method and specifying a shuffle buffer size. The buffer will stay filled with 100 data examples, and each batch will be sampled from that buffer.

dataset = dataset.shuffle(100)

Pre-Processing and Filtering

You can also apply transformations to Dataset objects for pre-processing, or filter out certain examples in the data that you don't want to use for training.

We're going to define a function that will be applied to each element in the dataset. The function takes an image and label as arguments, which is precisely what each element in the dataset object consists of. The function itself just normalizes the pixel values of the image. We then apply this function to every dataset element with the map method, simply by passing in the function itself as the argument.

def rescale(image, label):
  return image/255, label

dataset = dataset.map(rescale)

We can also filter out certain dataset examples that we don't want by using the filter method. Again we need to define a function, but this time it returns a boolean. The label is actually a one-dimensional tensor in this dataset, so we use tf.squeeze to convert it to a scalar. In the example below, any data examples with a label equal to 9 will be filtered out and not used in training.

def label_filter(image, label):
  return tf.squeeze(label) != 9

dataset = dataset.filter(label_filter)

The map and filter functions can be very complex, and we can even apply several map or filter transformations to the same dataset.
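
As a brief recap sketch, several of these transformations can simply be chained into a single pipeline (reusing the rescale and label_filter functions defined above):

dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .filter(label_filter)
           .map(rescale)
           .shuffle(100)
           .batch(16, drop_remainder=True))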



My Certificate

For more on Keras and TensorFlow Datasets, please refer to the wonderful course here: https://www.coursera.org/learn/customising-models-tensorflow2


I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai

Don't forget to sign up for the newsletter, so you don't miss any chance to learn.

Or share what you've learned with friends!
