TensorFlow is an open-source, high-performance library for any numerical computation, not just machine learning. For example, you could use TensorFlow to solve partial differential equations. To use TensorFlow, you create a directed graph that represents the computation you want to perform.
In the graph, nodes represent mathematical operations and edges represent arrays of data. A tensor is an N-dimensional array of data, and tensors flow through the graph. Portability is the reason TensorFlow uses a directed graph to represent computation.
Layers of API
TensorFlow contains multiple abstraction layers, from highest to lowest:
- High-level API for distributed training
- Components useful when building custom neural networks
- Core Python API
- TensorFlow C++ API (for implementing your own custom operations)
- The lowest layer, which deals with the various hardware platforms
Tensors and Variables
Recall that a tensor is an N-dimensional array of data. When you create a tensor, you specify its shape. The first dimension of a shape can be of variable length (for example, the batch size). tf.constant produces a constant tensor; you can stack such tensors, slice them, and reshape them to a new size. tf.Variable produces a tensor that can be modified, and it typically holds model weights that need to be updated in a training loop.
TensorFlow can compute the derivative of a function with respect to any parameter. During training, weights are updated using the partial derivative of the loss with respect to each individual weight. To differentiate automatically, TensorFlow needs to remember which operations happened in which order during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute the gradients.
The dimensions of a tensor are often ordered from global to local, and you should keep track of what each of them means. The batch dimension comes first, then the spatial dimensions, and the features for each location last. This way the features of each location occupy a contiguous region of memory.
Broadcasting means that, under certain conditions, smaller tensors are stretched to fit larger tensors when combined operations are run on them. For example, when adding a scalar to a vector, the scalar is stretched to the same shape as the other argument (the vector). You can see what broadcasting looks like using tf.broadcast_to.
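As a minimal sketch of the operations just described (the values and variable names are made up for illustration), constants can be stacked, sliced, and reshaped, while a variable can be updated in place:

```python
import tensorflow as tf

# A constant tensor of shape (2, 3); its values cannot be changed in place.
x = tf.constant([[1, 2, 3], [4, 5, 6]])

# Stack two copies along a new leading axis: shape (2, 2, 3).
stacked = tf.stack([x, x])

# Slice out the second column of every row: shape (2,).
second_col = x[:, 1]

# Reshape the six elements into shape (3, 2).
reshaped = tf.reshape(x, [3, 2])

# A variable holds mutable state, such as a model weight.
w = tf.Variable([1.0, 2.0])
w.assign_add([0.5, 0.5])  # in-place update: w is now [1.5, 2.5]
```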
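One way to see this recording and replaying in action is tf.GradientTape. The sketch below uses a toy loss and a hypothetical learning rate of 0.1; it is an illustration, not a full training loop:

```python
import tensorflow as tf

w = tf.Variable(3.0)

# Operations executed inside the tape context are recorded in order.
with tf.GradientTape() as tape:
    loss = w * w + 2.0 * w  # toy loss: w^2 + 2w

# The tape walks the recorded operations in reverse to get d(loss)/dw = 2w + 2.
grad = tape.gradient(loss, w)

# One gradient descent step with a hypothetical learning rate of 0.1.
w.assign_sub(0.1 * grad)
```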
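A small sketch of the scalar-plus-vector case, with made-up values:

```python
import tensorflow as tf

vector = tf.constant([1, 2, 3])
scalar = tf.constant(10)

# The scalar is implicitly stretched to shape (3,) before the addition.
summed = vector + scalar

# tf.broadcast_to makes the stretching explicit.
stretched = tf.broadcast_to(scalar, [3])
```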
Input Data Pipeline
After you define the business use case and the success criteria, the process of delivering an ML model to production includes these steps:
- data: extraction, analysis, preparation
- model: training, evaluation, validation, serving, monitoring
There are 2 phases:
- Training phase: Labeled data -> ML algorithm -> Trained model
- Inference phase: New data -> Served model -> Prediction
The tf.data API is an efficient way to build a data pipeline, which is really just a series of data processing steps. Use tf.data.Dataset.from_tensors or tf.data.Dataset.from_tensor_slices to create datasets from tensors in memory. Use tf.data.TextLineDataset to load data from a CSV file; its map method is then used to parse each row of the CSV file. Its shuffle, repeat, and batch methods allow data to be fed into the training loop iteratively.
Finally, when we need to load a large dataset spread over several files, we can use tf.data.Dataset.list_files to scan the disk and list all the dataset file names, then use the flat_map method with tf.data.TextLineDataset to load each file in turn into the dataset, and lastly use the map method as in the case above.
A Dataset allows data to be prefetched. Without prefetching, the GPU can sit idle while the CPU prepares the next batch; prefetching solves this problem. By combining prefetching with multi-threaded loading and processing, you can achieve very good performance.
“Feature columns” take care of packing input data into the input vectors of a model; they bridge the gap between the columns in CSV files and the features used to train a model. Feature columns provide methods to properly transform input data before sending it to a model for training. Use the tf.feature_column API to define the features.
Large datasets likely won’t fit in memory; the data may be spread across many files or may come from an input pipeline. The tf.data API can help build these pipelines from simple reusable pieces. A tf.data.Dataset represents a sequence of elements, in which each element consists of 1 or more components. For example, in an image pipeline, an element might be a single training example with a pair of tensor components representing its image and label. (The label itself is a tensor!)
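The CSV pipeline described above might be sketched as follows. The file contents, the column names (sqft, price), and the parse_row helper are all made up for illustration; the temporary file only exists to keep the example self-contained.

```python
import os
import tempfile
import tensorflow as tf

# Write a tiny, made-up CSV file so the example is self-contained.
csv_path = os.path.join(tempfile.mkdtemp(), "toy.csv")
with open(csv_path, "w") as f:
    f.write("sqft,price\n1000,200.0\n1500,310.0\n2000,405.0\n2500,500.0\n")

def parse_row(line):
    # Parse one CSV line into a (features, label) pair of float tensors.
    sqft, price = tf.io.decode_csv(line, record_defaults=[[0.0], [0.0]])
    return {"sqft": sqft}, price

dataset = (
    tf.data.TextLineDataset(csv_path)
    .skip(1)                                   # skip the header row
    .map(parse_row, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=4)
    .batch(2)
    .prefetch(tf.data.AUTOTUNE)                # prepare the next batch early
)

batches = list(dataset)
```

For training over several epochs, a repeat call would typically be added; it is left out here so that the dataset stays finite.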
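To make the idea of elements and components concrete, here is a small sketch using in-memory tensors. The all-zero "images" and the labels are placeholders, not real data:

```python
import tensorflow as tf

# Made-up stand-ins for four 2x2 grayscale images and their integer labels.
images = tf.zeros([4, 2, 2])
labels = tf.constant([0, 1, 0, 1])

# Slicing along the first axis yields one (image, label) element per example.
dataset = tf.data.Dataset.from_tensor_slices((images, labels))

first_image, first_label = next(iter(dataset))
```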
Activation functions are critical in neural networks: you need them to capture the non-linearities in your data. Recall that a linear model can be represented as nodes and edges, often with a bias term added in. Merely adding new hidden layers does not change anything, because the result is just another representation of the same linear model. The solution is to add non-linear transformation layers, which are provided by activation functions. You can imagine each neuron as having 2 nodes: 1) the weighted sum and 2) the activation function. Adding this non-linear transformation is the only way to prevent a neural network from condensing back down into a shallow network.
Usually a neural network has non-linear transformations in its first n-1 layers, while the final output layer is linear for regression, or softmax or sigmoid for classification. However, sigmoid and tanh tend to saturate, which leads to the problem called vanishing gradients, where gradients shrink toward zero. ReLU, by contrast, works well and is a common favorite. A network with ReLU often trains about 10 times faster than the same network with sigmoid. There are many ReLU variants:
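The collapse of stacked linear layers can be checked with a tiny pure-Python sketch. The weights and biases below are hypothetical, and each "layer" is a single scalar neuron to keep the arithmetic visible:

```python
def linear(w, b, x):
    # A single neuron without activation: weighted input plus bias.
    return w * x + b

def relu(z):
    return max(0.0, z)

# Hypothetical weights and biases for two stacked layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    return linear(w2, b2, linear(w1, b1, x))

# The same function rewritten as one linear layer.
w_eq, b_eq = w2 * w1, w2 * b1 + b2

def one_linear_layer(x):
    return linear(w_eq, b_eq, x)

def two_layers_with_relu(x):
    # Inserting a non-linearity between the layers breaks the collapse.
    return linear(w2, b2, relu(linear(w1, b1, x)))

inputs = [-2.0, -1.0, 0.0, 1.0]
collapsed = all(two_linear_layers(x) == one_linear_layer(x) for x in inputs)
```

With the ReLU in between, the outputs for these inputs no longer lie on a single straight line, so no single linear layer can reproduce them.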
- Softplus (smooth ReLU)
- Leaky ReLU / Parametric ReLU
- Exponential Linear Unit
- Gaussian Error Linear Unit
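For reference, the variants listed above can be written as scalar functions. This is a sketch using the textbook definitions; the default alpha values are common choices, not values taken from the course:

```python
import math

def relu(z):
    return max(0.0, z)

def softplus(z):
    # Smooth approximation of ReLU: log(1 + e^z).
    return math.log1p(math.exp(z))

def leaky_relu(z, alpha=0.01):
    # Small fixed slope for negative inputs; in Parametric ReLU alpha is learned.
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    # Exponential Linear Unit: smooth saturation toward -alpha for negative inputs.
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def gelu(z):
    # Gaussian Error Linear Unit: z times the standard normal CDF at z.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))
```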
There are 3 common failure modes for gradient descent:
- Vanishing gradients
  - Each additional layer can reduce the signal relative to the noise.
  - Fix this by using non-saturating non-linear activation functions (such as ReLU) instead of sigmoid or tanh.
- Exploding gradients
  - Weights grow so large that they overflow.
  - This is especially true for sequence layers with long sequence lengths.
  - The learning rate might be the cause.
  - Fix this with:
    - weight regularization
    - smaller batch size
    - weight clipping
    - batch normalization (ideally keeping gradients as close to 1 as possible)
- Dead ReLU layers
  - ReLU units stop working when their inputs keep pushing them into the negative domain (which results in an activation value of zero).
  - Use TensorBoard to monitor the fraction of zeros in each hidden layer.
  - Lower your learning rate.
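A rough numeric illustration of the vanishing-gradient case: backpropagation multiplies one local derivative per layer, so a saturating activation shrinks the product quickly. The depth and the pre-activation value are hypothetical, and the weights are ignored to isolate the effect of the activation function:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid; it never exceeds 0.25.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if z > 0 else 0.0

depth = 10            # hypothetical number of layers
pre_activation = 2.0  # hypothetical pre-activation value at every layer

# Product of the local derivatives along the backward pass.
sigmoid_chain = sigmoid_grad(pre_activation) ** depth
relu_chain = relu_grad(pre_activation) ** depth
```

The same relu_grad function also shows the dead ReLU case: for a negative pre-activation it returns 0, so no gradient flows through that unit.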
Keras Sequential API
tf.keras is TensorFlow’s high-level API for building and training neural networks. A sequential model is a plain stack of layers in which each layer has exactly 1 input tensor and 1 output tensor. A linear model with a single Dense layer is able to perform logistic regression. With additional Dense layers, the model becomes a neural network that can model non-linearity. More layers lead to a deeper neural network but possibly also to over-fitting. Regularization helps to mitigate over-fitting.
Once we have defined the model, we compile it. A few parameters are passed: the optimizer, the loss function, and the evaluation metrics. The loss function is the guide to the terrain, telling the optimizer whether it is moving in the right or wrong direction for reducing the loss. The optimizer actually updates the model parameters in response to the output of the loss function. A few useful optimizers include:
- Stochastic gradient descent
- FTRL (Follow the regularized leader)
Adam and FTRL make good defaults for neural networks as well as linear models.
We train the model by calling the fit method. An epoch is one complete pass over the training dataset. Steps per epoch is the number of batch iterations before an epoch is considered finished. Batch size determines the number of samples in each mini-batch. A callback (usually to TensorBoard) is used for logging and visualization.
Once trained, the model can be used for prediction (inference). The steps parameter is the total number of steps (batches) before a prediction round is declared finished.
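Putting the define, compile, fit, and predict steps together, a minimal sketch might look like this. The synthetic data, layer sizes, and epoch count are arbitrary choices for illustration:

```python
import numpy as np
import tensorflow as tf

# Synthetic regression data (made up for illustration): y = 3x + 1.
x_train = np.linspace(-1.0, 1.0, 64, dtype="float32").reshape(-1, 1)
y_train = 3.0 * x_train + 1.0

# A plain stack of layers: one hidden ReLU layer and a linear output.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Choose the optimizer, the loss function, and the evaluation metrics.
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

# Five passes over the data in mini-batches of 16 samples.
history = model.fit(x_train, y_train, epochs=5, batch_size=16, verbose=0)

# Inference on the first four samples.
predictions = model.predict(x_train[:4], verbose=0)
```

A TensorBoard callback could be passed to fit through its callbacks argument; it is omitted here to keep the sketch free of log directories.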
Keras Functional API
Jointly training a wide linear model and a deep neural network can combine the power of memorization and generalization. This is called Wide and Deep Learning. Linear models are good for sparse and independent features. Deep neural networks are good for dense, highly correlated features; a DNN de-correlates the inputs and maps them to lower dimensions. The Functional API gives a model the ability to have multiple inputs and outputs. It also allows models with non-linear topology and shared layers. Models are created by specifying their inputs and outputs in a graph of layers. A single graph of layers can be used to generate multiple models. You can treat a model as if it were a layer.
If we use a model that is too complicated (for example, one with many synthetic features or feature crosses), we give the model the opportunity to over-fit itself to the training data at the cost of performing badly on test data. Generalization theory defines the statistical framework. When training a model, we apply the principle of Ockham’s razor: we favor simpler models that make fewer assumptions about the training data.
The idea is to penalize model complexity. Given a model, we need to balance the loss on the data against the model complexity. Over-simplified models are useless, so we need to find the balance between simplicity and an accurate fit to the training data. The optimal model complexity is data-dependent, so hyperparameter tuning is required.
Regularization refers to any technique that helps a model generalize. It is a major field of research, and there are various methods:
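A wide-and-deep model with two inputs could be sketched with the Functional API as follows. The input widths, layer sizes, and layer names are hypothetical:

```python
import tensorflow as tf

# Hypothetical input widths: 5 sparse "wide" features, 3 dense "deep" features.
wide_in = tf.keras.Input(shape=(5,), name="wide")
deep_in = tf.keras.Input(shape=(3,), name="deep")

# Deep path: a small stack of non-linear layers.
hidden = tf.keras.layers.Dense(16, activation="relu")(deep_in)
hidden = tf.keras.layers.Dense(8, activation="relu")(hidden)

# Join the wide inputs with the output of the deep path, then predict.
combined = tf.keras.layers.Concatenate()([wide_in, hidden])
output = tf.keras.layers.Dense(1, name="prediction")(combined)

# The model is defined by its inputs and outputs in the graph of layers.
model = tf.keras.Model(inputs=[wide_in, deep_in], outputs=output)

# Calling the model like a layer on a batch of two all-zero examples.
sample_out = model([tf.zeros([2, 5]), tf.zeros([2, 3])])
```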
- early stopping
- parameter norm penalty
  - L1 regularization (adds the sum of the absolute values of the weights to the loss function)
  - L2 regularization (adds the sum of the squared weights to the loss function)
  - max-norm regularization
- dataset augmentation
- noise robustness
- sparse representation
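The L1 and L2 penalty terms can be written out directly. In this sketch the weights, the data loss, and the two regularization strengths (l1, l2) are made-up values:

```python
def l1_penalty(weights):
    # Sum of the absolute values of the weights.
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    # Sum of the squared weights.
    return sum(w * w for w in weights)

def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    # Total loss = data loss + weighted complexity penalties.
    return data_loss + l1 * l1_penalty(weights) + l2 * l2_penalty(weights)

weights = [0.5, -2.0, 0.0, 1.5]  # hypothetical model weights
total = regularized_loss(1.0, weights, l1=0.1, l2=0.01)
```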
L1, L2 and Early-Stopping
L2 is great at keeping weights small, and it provides stability and a unique solution. But it can leave the model unnecessarily large and complex, since all of the features may still remain (L2 only makes weights small, not zero).
L1 tends to force the weights of useless features to zero, killing off bad features and leaving only strong ones, which results in a sparse model. There are fewer coefficients to store and load, which saves memory, and fewer multiplications are needed, which speeds up prediction.
To counteract over-fitting we usually use both regularization and early stopping. Model complexity increases with large weights. As the weights get larger and larger, we end up increasing the loss, so we stop. To find the optimal L1 / L2 hyperparameters, you search for the point where the validation loss reaches its lowest value.
- To the left -> less regularization -> more variance -> starts overfitting -> hurts generalization
- To the right -> more regularization -> more bias -> starts under-fitting -> hurts generalization
Early stopping stops training when over-fitting begins. As you train your model, you should evaluate it on the validation dataset often. As training continues, both the training error and the validation error should decrease. But at some point, the validation error begins to increase. At this point, the model begins to memorize the training dataset and starts to lose its ability to generalize to the validation dataset. Using early stopping, we stop at this point.
Early stopping is approximately equivalent to L2 regularization, but in practice we always use explicit L1 / L2 regularization together with early stopping.
Dropout is a regularization method that works by adding a dropout layer to a neural network. Dropout effectively creates an ensemble of models, because each forward pass uses a different network. This way the network uses more of its capacity, and so it generalizes better.
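The stopping rule can be sketched in a few lines. The validation losses below are invented, and the patience parameter (how many non-improving epochs to tolerate) is an assumption that the text does not mention:

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch seen before stopping.

    Scanning stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Hypothetical validation losses: decreasing at first, then rising again.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best = early_stopping_epoch(val_losses)
```

In Keras, the tf.keras.callbacks.EarlyStopping callback offers similar behavior during fit.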
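A short sketch of the difference between training and inference with a Keras dropout layer; the rate of 0.5 and the all-ones input are arbitrary choices:

```python
import tensorflow as tf

dropout = tf.keras.layers.Dropout(rate=0.5)
x = tf.ones([1, 8])

# Training mode: each unit is zeroed with probability 0.5 and the
# survivors are scaled by 1 / (1 - rate), so every pass sees a different network.
train_out = dropout(x, training=True)

# Inference mode: dropout does nothing and the input passes through unchanged.
infer_out = dropout(x, training=False)
```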
For more on TensorFlow Essentials, please refer to the wonderful course here https://www.coursera.org/learn/intro-tensorflow
I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai