Data pipelines are for loading, transforming, and filtering a wide range of different data for use in your models. Broadly speaking there are two ways to handle our data pipeline:
- from within Keras,
- from within the tf.data module.
These two modules together offer some powerful methods for managing data and building an effective workflow.
The Keras datasets give us a convenient way to access several standard datasets, which are available as modules within tensorflow.keras.datasets. For example, you could import the mnist module and load the dataset itself.
    from tensorflow.keras.datasets import mnist

    (x_train, y_train), (x_test, y_test) = mnist.load_data()
The dataset is downloaded and stored in a hidden folder on your local machine (by default ~/.keras/datasets). Other datasets can be downloaded in the same way, although their load_data() methods may support additional parameters.
In practice, datasets are often a lot bigger and won't fit into memory. One way to handle this is to use dataset generators, which feed data into the model without loading it all into memory at once.
A generator in Python is a function that returns a generator object that you can iterate over. It yields a finite or infinite series of values, but it doesn't store all those values in memory. Each time we iterate the generator, it yields the next value in the series. In this way, we can use generators to feed data into our model when the data doesn't fit into memory.
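To make the "no values stored in memory" point concrete, here is a minimal sketch of a finite generator. The batch_generator name and the choice of range as the data source are assumptions made only for illustration.

```python
def batch_generator(values, batch_size):
    """Yield consecutive batches from `values`, one slice at a time."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]

# `range` is itself lazy, so a list of a million numbers is never built.
batches = batch_generator(range(1_000_000), batch_size=4)

first = list(next(batches))   # only these four values are materialized
second = list(next(batches))  # the generator resumes where it left off
print(first, second)
```

Each call to next runs the function body only up to the next yield, which is why only one batch exists in memory at a time.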
Here is an example of a generator that generates an infinite series of values:
    import numpy as np

    def get_data(batch_size):
        while True:
            # Random binary labels; inputs are drawn around -1 or +1 depending on the label
            y_train = np.random.choice([0, 1], (batch_size, 1))
            x_train = np.random.randn(batch_size, 1) + (2 * y_train - 1)
            yield x_train, y_train

    datagen = get_data(32)
    x, y = next(datagen)
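Note: to fit a model using this data generator, call the fit_generator method of the model (in TensorFlow 2.1 and later, fit_generator is deprecated and model.fit accepts generators directly). Because the generator never runs out, we need to explicitly tell the model how much data makes up one epoch by specifying the steps_per_epoch argument.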
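As a rough sketch of training on the infinite generator, the snippet below calls model.fit directly. It assumes TensorFlow 2.1 or later, where fit accepts Python generators. The tiny one-layer model, the optimizer, and the step counts are illustrative assumptions, not part of the original example.

```python
import numpy as np
import tensorflow as tf

def get_data(batch_size):
    # Same infinite generator as above: it yields (inputs, labels) forever.
    while True:
        y_train = np.random.choice([0, 1], (batch_size, 1))
        x_train = np.random.randn(batch_size, 1) + (2 * y_train - 1)
        yield x_train, y_train

# Hypothetical minimal model, just enough to consume the generator's batches.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")

# steps_per_epoch tells Keras when an epoch ends, since the generator never does.
history = model.fit(get_data(32), steps_per_epoch=20, epochs=2, verbose=0)
print(history.history["loss"])
```

With steps_per_epoch=20 and a batch size of 32, each epoch here consumes 640 generated examples.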
Another way to train a model with a generator is to use the train_on_batch method, which performs just one optimizer update for a single batch of training data. In most cases, though, it's unlikely that you'll really need train_on_batch, because it is a slightly lower-level way of handling model training.
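For comparison, a manual loop with train_on_batch might look like the sketch below. The model and the number of steps are assumptions chosen only to keep the example small.

```python
import numpy as np
import tensorflow as tf

def get_data(batch_size):
    while True:
        y_train = np.random.choice([0, 1], (batch_size, 1))
        x_train = np.random.randn(batch_size, 1) + (2 * y_train - 1)
        yield x_train, y_train

# Hypothetical minimal model for illustration.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")

datagen = get_data(32)
losses = []
for step in range(50):
    x_batch, y_batch = next(datagen)               # pull one batch by hand
    loss = model.train_on_batch(x_batch, y_batch)  # exactly one optimizer update
    losses.append(float(np.ravel(loss)[0]))
```

Here we decide ourselves when to stop, which is the bookkeeping that fit normally handles for us.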
Finally, when it comes to evaluation and prediction, there are corresponding evaluate_generator and predict_generator methods. We can pass in other data generators for the purpose of evaluation and prediction. Remember, if the generator produces an infinite series of data, we need to specify how many steps should run by specifying the steps argument.
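A minimal sketch of evaluation and prediction from a generator follows. It assumes TensorFlow 2.1 or later, where model.evaluate and model.predict accept generators directly. The untrained model is a hypothetical stand-in.

```python
import numpy as np
import tensorflow as tf

def get_data(batch_size):
    while True:
        y = np.random.choice([0, 1], (batch_size, 1))
        x = np.random.randn(batch_size, 1) + (2 * y - 1)
        yield x, y

# Hypothetical minimal model for illustration.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")

# `steps` bounds how many batches are drawn from the infinite generator.
test_loss = model.evaluate(get_data(32), steps=10, verbose=0)
predictions = model.predict(get_data(32), steps=10, verbose=0)
print(predictions.shape)
```

Ten steps of 32 examples give 320 predictions; the labels yielded by the generator are ignored during prediction.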
Image Data Augmentation
Another benefit of using generators is that we can add data pre-processing as part of the workflow. The ImageDataGenerator object in Keras offers a useful way of handling image data. It lets us define a generator that we pass to our model to train on using fit_generator, and it gives us an easy way of pre-processing and augmenting our image dataset on the fly during training.
Data augmentation techniques such as:
- shifting or rotating the image slightly
- flipping an image horizontally
effectively produce more data for our model to train on. This is really useful when training on image data, especially when the data set is quite small. There are options to implement certain data augmentation or pre-processing techniques.
| Option | Effect |
|---|---|
| rescale | the values of all pixels are multiplied by a factor. |
| horizontal_flip | if set to True, this will randomly choose whether to flip each training image horizontally or not. |
| height_shift_range | each image will be randomly shifted up or down. |
| fill_mode | chooses the method by which the pixels left empty after a shift are filled in. |
| featurewise_center | if set to True, this will standardize the dataset so that the mean of each individual feature (say, each RGB channel) in the input images over the whole dataset is equal to zero. |
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    image_data_gen = ImageDataGenerator(
        rescale=1/255.,
        horizontal_flip=True,
        height_shift_range=0.2,
        fill_mode='nearest',
        featurewise_center=True,
    )
If we're using a standardization technique such as featurewise_center, the ImageDataGenerator needs to calculate the dataset feature means before it can start generating pre-processed samples. So we first need to run the fit method of the ImageDataGenerator on the training dataset.
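As a small sketch of that fit step, the snippet below uses random arrays as a stand-in for real training images. The array shape and values are assumptions for illustration only.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stand-in "images": 64 random 8x8 RGB arrays instead of a real training set.
x_train = np.random.uniform(0, 255, size=(64, 8, 8, 3)).astype("float32")

image_data_gen = ImageDataGenerator(featurewise_center=True)

# fit computes the dataset statistics that featurewise_center needs.
image_data_gen.fit(x_train)

# One mean per channel, stored in a broadcastable shape.
print(image_data_gen.mean.shape)
```

The stored mean is subtracted from each image the generator produces later.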
Finally, we can then get the generator itself by using the
flow method and passing in the training images and labels. This method also has some training options. Here we’re setting the batch size to be 16.
train_datagen = image_data_gen.flow(x_train, y_train, batch_size=16)
And then we are ready to train the model as before, by passing train_datagen to the fit_generator method together with a steps_per_epoch value.
Similar objects are available in Keras for pre-processing sequence data and text data.
The tf.data API gives us a unified way of handling datasets of any type, and data that might come from a range of different sources. The module contains a powerful set of tools for building complex and flexible data pipelines. The main object that we're going to be using in the module is the Dataset class. It is the main abstraction for our data: whatever form or size the data might come in, we wrap it in a Dataset and then apply any necessary preprocessing or filtering.
Creating Dataset Objects From Different Sources
The Dataset class comes with a few static methods that we can use to create the Dataset object. Here is a simple example of passing in a list of data elements to the method
from_tensor_slices. This returns a
TensorSliceDataset object. We’re passing in a tuple of tensors to create the dataset, which is very common whenever we want to create a dataset with input and output data.
The first dimension of each tensor is interpreted as the dataset size, so it's important that every tensor in this tuple has the same first dimension; otherwise we'd get an error here. We can directly inspect the type specification of the dataset elements by looking at the element_spec property.
    import tensorflow as tf

    dataset = tf.data.Dataset.from_tensor_slices(
        (
            tf.random.uniform([256, 4], minval=1, maxval=10, dtype=tf.int32),
            tf.random.normal([256]),  # a shape is required; its first dimension must match
        )
    )
    print(dataset.element_spec)
    # (TensorSpec(shape=(4,), dtype=tf.int32, name=None),
    #  TensorSpec(shape=(), dtype=tf.float32, name=None))
The object is iterable, so we can easily access each element in the dataset by writing a simple for loop. The take method just takes the first few elements of the dataset. This is a useful way to take a quick look at the data to check that everything looks right.
    # Each element is a tuple of tensors, so unpack it before calling numpy()
    for x, y in dataset.take(2):
        print(x.numpy(), y.numpy())
We could use the dataset objects to load an actual dataset. Remember each input and output are numpy arrays.
    import tensorflow as tf
    from tensorflow.keras.datasets import cifar10

    (x_train, y_train), (x_test, y_test) = cifar10.load_data()
    dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
We can also use the dataset class to wrap the generators. Suppose we’ve set up an image data generator with a couple of options for random data augmentation. Then we could use the
from_generator method to create a dataset object.
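    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    image_datagen = ImageDataGenerator(horizontal_flip=True, width_shift_range=0.2)

    # flow yields batches of 32 images by default; the batch dimension is left as
    # None because the last batch of a pass may hold fewer than 32 examples.
    dataset = tf.data.Dataset.from_generator(
        image_datagen.flow,
        args=[x_train, y_train],
        output_types=(tf.float32, tf.int32),
        output_shapes=([None, 32, 32, 3], [None, 1]),
    )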
There are many more ways of creating instances of the same Dataset class from different types of data sources. Whichever method and data source we use, the point is that the end result is an instance of the Dataset class. This abstraction provides a nice unified way of dealing with different types and sources of data. Once we've created a dataset object for our data, we can use the flexible data preprocessing and filtering methods to prepare our data for training.
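As a small sketch of two other constructors, the snippet below uses Dataset.range and Dataset.from_tensors, with from_tensor_slices shown alongside for contrast. The variable names and tensor shapes are illustrative assumptions.

```python
import tensorflow as tf

# A dataset of consecutive integers.
range_ds = tf.data.Dataset.range(5)

# from_tensors wraps the whole tensor as ONE element...
whole_ds = tf.data.Dataset.from_tensors(tf.zeros([3, 2]))

# ...while from_tensor_slices splits along the first dimension.
sliced_ds = tf.data.Dataset.from_tensor_slices(tf.zeros([3, 2]))

range_values = [int(v) for v in range_ds]
print(range_values, len(list(whole_ds)), len(list(sliced_ds)))
```

All three results are Dataset objects, so the batching, mapping, and filtering methods described in the rest of this post apply to each of them in the same way.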
Training Model with Datasets
Recall that if we iterate over the dataset object, we extract each data element one by one. But we can also batch data examples together by using the dataset.batch method, and choose to drop the remainder. With a batch size of 16, this makes sure that every batch really does have 16 images and labels.
Now we want to train our model. And that’s as simple as calling the
model.fit method as usual where this time we don’t need to pass in the inputs and outputs separately. We just pass in the data set object.
    dataset = dataset.batch(16, drop_remainder=True)
    model.fit(dataset)
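To see what drop_remainder changes, here is a minimal sketch on a toy dataset of 50 integers. The toy dataset is an assumption standing in for the image data.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(50)

# Without drop_remainder, the last batch holds the 2 leftover elements.
kept = [int(batch.shape[0]) for batch in dataset.batch(16)]

# With drop_remainder=True, the incomplete final batch is discarded.
dropped = [int(batch.shape[0]) for batch in dataset.batch(16, drop_remainder=True)]

print(kept)     # [16, 16, 16, 2]
print(dropped)  # [16, 16, 16]
```

Dropping the remainder costs a few examples per pass but keeps every batch the same shape.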
The model.fit call will train the model for one epoch, that is, one complete pass through the dataset object. If we want to train for 10 epochs, we could do this:
    dataset = dataset.repeat(10)
    model.fit(dataset)
Calling the repeat method without an argument means that the dataset will repeat indefinitely. Then, to train for a certain number of epochs, we can set the steps_per_epoch argument, so that the training process knows when an epoch has ended, and the epochs argument, so that it knows when to stop.
    dataset = dataset.repeat()
    # shape is a tuple, so take its first entry (the number of examples)
    history = model.fit(dataset, steps_per_epoch=x_train.shape[0] // 16, epochs=10)
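The behaviour of repeat and the arithmetic behind steps_per_epoch can be checked on a toy dataset. The sketch below assumes a three-element dataset and, for the arithmetic, the 50,000 training images of CIFAR-10.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(3)

# repeat(2) yields two passes and then stops.
twice = [int(v) for v in ds.repeat(2)]

# repeat() never stops, so we bound it with take for inspection.
first_seven = [int(v) for v in ds.repeat().take(7)]

# One epoch should cover the data once: examples divided by batch size.
num_examples, batch_size = 50000, 16
steps_per_epoch = num_examples // batch_size

print(twice, first_seven, steps_per_epoch)
```

With these numbers, 3125 steps of 16 examples make up exactly one pass over the training set.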
You might also want to randomly shuffle the dataset. To do so, call the shuffle method, specifying the shuffle buffer size. In the example below, the buffer stays filled with 100 data examples, and each next element is sampled at random from the buffer.
dataset = dataset.shuffle(100)
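The effect of the buffer size is easier to see on a tiny dataset. In this sketch (a toy range dataset with a buffer of 4, both assumptions for illustration), the first element drawn can only come from the first four items, because those are all the buffer holds at that point.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10).shuffle(buffer_size=4)

order = [int(v) for v in ds]
print(order)

# Shuffling reorders elements but never drops or duplicates them.
same_elements = sorted(order) == list(range(10))
```

A buffer at least as large as the dataset gives a uniform shuffle; a smaller buffer only mixes elements locally.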
Pre-Processing and Filtering
You can also apply transformations to data set objects for pre-processing or filter out certain examples in the data that you don’t want to use for training.
We're going to define a function that will be applied to each element in the dataset. The function takes an image and a label as arguments, which is precisely what each element in the dataset object consists of. The function itself just normalizes the pixel values of the image. We then apply this function to every dataset element with the map method, just by passing the function itself as the argument.
    def rescale(image, label):
        return image/255, label

    dataset = dataset.map(rescale)
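Here is a self-contained sketch of the same idea on a few synthetic images. The explicit cast to float32 is an assumption added for clarity about the output type, and the tiny arrays stand in for real image data.

```python
import numpy as np
import tensorflow as tf

# Four fake 2x2 RGB "images" with every pixel at the maximum value 255.
images = np.full((4, 2, 2, 3), 255, dtype=np.uint8)
labels = np.arange(4).reshape(4, 1)
ds = tf.data.Dataset.from_tensor_slices((images, labels))

def rescale(image, label):
    # Cast first so the result is float32 regardless of the input type.
    return tf.cast(image, tf.float32) / 255.0, label

ds = ds.map(rescale)

image, label = next(iter(ds))
print(image.dtype, float(tf.reduce_max(image)))
```

The label passes through unchanged, so the dataset elements keep their (image, label) structure.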
We can also filter out certain dataset examples that we don't want using the filter method. We again need to define a function, but this time it returns a boolean. The label is actually a one-dimensional tensor in this dataset, so we use tf.squeeze to convert it to a scalar. In the example below, any data examples with label equal to 9 will be filtered out and not used in training.
    def label_filter(image, label):
        return tf.squeeze(label) != 9

    dataset = dataset.filter(label_filter)
Filter functions can be quite complex, and we can even apply several filter functions to the same dataset.
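Chaining several filters might look like the sketch below. The is_odd predicate and the synthetic labels are hypothetical, added only to show that each filter call narrows the result of the previous one.

```python
import tensorflow as tf

images = tf.zeros([6, 2, 2, 3])
labels = tf.constant([[0], [9], [3], [9], [7], [2]])
ds = tf.data.Dataset.from_tensor_slices((images, labels))

def not_nine(image, label):
    return tf.squeeze(label) != 9

def is_odd(image, label):
    # Hypothetical second predicate, for illustration only.
    return tf.squeeze(label) % 2 == 1

filtered = ds.filter(not_nine).filter(is_odd)

kept_labels = [int(tf.squeeze(label)) for _, label in filtered]
print(kept_labels)  # [3, 7]
```

Only examples that pass every predicate survive, in their original order.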
For more on Keras and Tensorflow Datasets, please refer to the wonderful course here https://www.coursera.org/learn/customising-models-tensorflow2
I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai