Image Data Pipelines in TensorFlow and Keras


Overview

Machine Learning and Deep Learning models are data-hungry: without data, there is no model. Before training a good model, we therefore need a reliable data pipeline, i.e., a series of steps that prepares the data before it is fed into the model. An efficient pipeline handles data well and reduces the training time of any machine learning or deep learning model. In this blog, I will show you how to build a reliable image data pipeline with Keras, and we will compare its performance with pre-existing techniques.

Introduction

Before making a Deep Learning model, we must prepare our data pipeline; we will build it with Keras. Before feeding the data into the model, we must ensure that the data is shuffled (in supervised learning scenarios), that it is appropriately batched, and that the next batch is ready before the current training iteration finishes.

The ImageDataGenerator Class

The ImageDataGenerator class generates batches of images with different augmentations. The catch is that ImageDataGenerator takes the original data, randomly transforms it, and returns only the transformed data for training; in other words, the augmentation happens 'in place', or 'on the fly'. As of TensorFlow 2.10, ImageDataGenerator is deprecated. You can head to this link for reference. Using tf.data to build data pipelines in Keras is advised instead.

How to Create the Data Pipeline?

Introduction to tf.data

In this section, we will use tf.data to build our image data pipeline. The tf.data module helps us create complex and efficient data pipelines effortlessly. It is not only easy to use but also faster than the ImageDataGenerator class, so the training time of our model will be shorter. Furthermore, by using tf.data, we unlock TensorFlow's multi-threading/multi-processing implementation and the concept of autotuning.

Features of tf.data

We will go through the following features of tf.data (a short sketch chaining them together follows the list):

  1. shuffle: It randomly samples elements from a fixed-size buffer, replacing each element it picks with the next one from the dataset. For example, if the dataset has 50,000 samples and buffer_size is set to 5000, shuffle draws each element at random from a buffer holding the next 5000 samples.
  2. cache: It caches the dataset in memory the first time it is iterated, which speeds up all subsequent iterations.
  3. repeat: It repeats the dataset once we run out of data, preventing TensorFlow from throwing an out-of-range error while training.
  4. batch: It groups consecutive elements into batches of a fixed size. Setting drop_remainder to True prevents the creation of a smaller final batch.
  5. prefetch: It is an essential feature of tf.data: it prepares upcoming batches in the background while the current one is being consumed. It uses more RAM, but it's all worth it as it removes latency while training the model. Passing tf.data.AUTOTUNE lets TensorFlow tune the prefetch buffer size dynamically at runtime.
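
To make these features concrete, here is a minimal sketch of how they chain together on a toy dataset (the buffer size and batch size are illustrative choices, not values from the article):

```python
import tensorflow as tf

# A toy dataset of 50,000 integers standing in for samples.
ds = tf.data.Dataset.range(50000)

ds = (
    ds.cache()                         # keep elements in memory after the first pass
      .shuffle(buffer_size=5000)       # sample randomly from a 5,000-element buffer
      .repeat()                        # restart the dataset instead of running out
      .batch(64, drop_remainder=True)  # fixed-size batches; drop the short final one
      .prefetch(tf.data.AUTOTUNE)      # prepare upcoming batches in the background
)
```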

Download the Data

We will use the CIFAR-10 dataset from the Keras library. The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. In addition, there are 50,000 training images and 10,000 test images.

Here are 10 random samples.

[Figure: 10 random samples from the CIFAR-10 dataset]

Code Sample
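
The original code sample is not reproduced here, so below is a minimal sketch of what it might look like: we load CIFAR-10 via Keras and list the ten class names (the class_names variable is our own naming, not from the article):

```python
from tensorflow.keras.datasets import cifar10

# Download the dataset (cached locally after the first call).
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print("Train:", x_train.shape, "Test:", x_test.shape)

# The ten CIFAR-10 classes, in label order.
class_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]
print(class_names)
```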

Output:

The classes present in the CIFAR-10 dataset.

Prepare Auxiliary Functions

In this section, we will load and preprocess the images. How you load them inside your auxiliary functions is up to you; if you are replicating a research paper, it is advised to use the functions described in that paper.

In this case, we will pass a flag indicating whether the data is training or validation data. For the training data, we will first resize the image to a bigger spatial resolution, take a random crop, and flip it left or right at random. For the validation data, we will only resize the image to the target dimension. A sketch of these functions follows.
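
A minimal sketch of such auxiliary functions, assuming we resize the 32x32 CIFAR-10 images up to 40x40 before cropping (the sizes and function names are our own choices):

```python
import tensorflow as tf

IMG_SIZE = 32      # target spatial resolution
BIGGER_SIZE = 40   # intermediate resolution before random cropping

def preprocess_train(image, label):
    # Resize to a bigger resolution, take a random crop, then flip at random.
    image = tf.image.resize(image, (BIGGER_SIZE, BIGGER_SIZE))
    image = tf.image.random_crop(image, (IMG_SIZE, IMG_SIZE, 3))
    image = tf.image.random_flip_left_right(image)
    return image / 255.0, label

def preprocess_val(image, label):
    # Validation data is only resized to the target resolution.
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    return image / 255.0, label
```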

Build the Train and Test Datasets

We will be using from_tensor_slices to prepare the dataset. from_tensor_slices builds a dataset by slicing the input tensors along their first dimension, so each input tensor acts like a column of your data and each slice like a row.

The library also has from_tensors. It builds a dataset containing a single element that wraps the entire input tensor, rather than slicing it into rows.

Another method in the library is list_files, which builds a dataset by matching a file glob pattern. However, this process may result in poor performance with remote storage systems.

If you want to know more about tf.data.Dataset, you can head over to this link.
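
Putting it together, here is a sketch of building the train and test datasets with from_tensor_slices, reusing the arrays and preprocessing functions defined above:

```python
AUTOTUNE = tf.data.AUTOTUNE

# Each (image, label) pair becomes one dataset element.
train_ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .map(preprocess_train, num_parallel_calls=AUTOTUNE))
val_ds = (tf.data.Dataset.from_tensor_slices((x_test, y_test))
          .map(preprocess_val, num_parallel_calls=AUTOTUNE))
```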

Visualize the Training and Validation Data

Now it's time to visualize our data. With a few lines of code, we can easily display the images and their corresponding labels. We will use matplotlib for visualizing the images, as shown below.
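
A small sketch of the kind of visualization code we have in mind, assuming the train_ds and class_names defined earlier:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 8))
for i, (image, label) in enumerate(train_ds.take(9)):
    plt.subplot(3, 3, i + 1)
    plt.imshow(image.numpy())              # images are floats in [0, 1]
    plt.title(class_names[int(label[0])])  # labels have shape (1,)
    plt.axis("off")
plt.show()
```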

[Figure: training images and their labels visualized with matplotlib]

Configure Datasets for Training & Testing

Let's see how we can configure our dataset for building our pipeline effectively in Keras.

Batch Size

Choosing a batch_size for training can be intimidating. Avoid batch sizes that are disproportionately large for the dataset, such as 2048 for 10,000 samples; values like 32, 64, or 128 are preferred at that scale. Increasing the batch_size speeds up training, but setting it too high can hurt the model's performance.

Configure for performance

To get good model performance, we can add augmentation to the pipeline. There are batch-level augmentations such as MixUp, CutOut, and CutMix that can improve results; a MixUp sketch follows.
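
As an illustration, here is a sketch of MixUp applied as a batch-level augmentation, combined with the shuffle/batch/prefetch configuration discussed earlier. MixUp mixes labels as well as images, so the labels are one-hot encoded first; the alpha value and batch size are our own choices:

```python
def mixup(images, labels, alpha=0.2):
    # tf.random has no Beta sampler, but if g1, g2 ~ Gamma(alpha, 1),
    # then g1 / (g1 + g2) ~ Beta(alpha, alpha).
    g1 = tf.random.gamma([], alpha)
    g2 = tf.random.gamma([], alpha)
    lam = g1 / (g1 + g2)
    # Mix every example with a shuffled partner from the same batch.
    idx = tf.random.shuffle(tf.range(tf.shape(images)[0]))
    images = lam * images + (1.0 - lam) * tf.gather(images, idx)
    labels = lam * labels + (1.0 - lam) * tf.gather(labels, idx)
    return images, labels

train_ds = (
    train_ds.map(lambda x, y: (x, tf.one_hot(tf.cast(y[0], tf.int32), 10)))
            .shuffle(1024)
            .batch(64)
            .map(mixup, num_parallel_calls=tf.data.AUTOTUNE)
            .prefetch(tf.data.AUTOTUNE)
)
```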

Comparison of Performance

Let's check which data pipeline is faster. We will again use the CIFAR-10 dataset; the difference becomes even more visible with a larger dataset. Now let's jump into the code.

In the following code block, we import the required libraries and download the CIFAR-10 dataset.
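
The original block is not shown in this version of the article; a minimal equivalent would be (BATCH_SIZE is our own choice for the benchmark):

```python
import time

import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.preprocessing.image import ImageDataGenerator

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
BATCH_SIZE = 64
```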

Let's create an image data generator object.
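
A sketch of that generator, with rescaling as the only transformation (the variable names are ours):

```python
datagen = ImageDataGenerator(rescale=1.0 / 255)
idg_ds = datagen.flow(x_train, y_train, batch_size=BATCH_SIZE)
```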

Let's create the data pipeline using tf.data.
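
An equivalent tf.data pipeline, sketched under the same assumptions:

```python
AUTOTUNE = tf.data.AUTOTUNE

def scale(image, label):
    # Match the generator's rescaling step.
    return tf.cast(image, tf.float32) / 255.0, label

tf_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .map(scale, num_parallel_calls=AUTOTUNE)
    .shuffle(1024)
    .batch(BATCH_SIZE)
    .prefetch(AUTOTUNE)
)
```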

I have created a function to benchmark the performance.
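
Since the original function is not shown here, below is a plausible version that simply iterates each pipeline for a fixed number of steps and times it:

```python
def benchmark(dataset, num_epochs=2, steps_per_epoch=len(x_train) // BATCH_SIZE):
    start = time.perf_counter()
    for _ in range(num_epochs):
        for step, _ in enumerate(dataset):
            # ImageDataGenerator yields batches forever, so stop manually.
            if step >= steps_per_epoch:
                break
    elapsed = time.perf_counter() - start
    print(f"Elapsed: {elapsed:.2f} s")
    return elapsed
```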

Now, let's benchmark the performance.
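
Running it on both pipelines and computing the speed-up:

```python
t_idg = benchmark(idg_ds)
t_tf = benchmark(tf_ds)
print(f"tf.data speed-up: {t_idg / t_tf:.2f}x")
```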

Results:

We used the same dataset to create the data pipeline in two ways. We can easily observe that tf.data is about 8.04x faster than ImageDataGenerator.

Note: I have created a Colab Notebook for experimenting purposes if you want to perform more experiments. Refer here.

Conclusion

  • This article discussed data pipelines, ImageDataGenerator, and tf.data. We also went through how to create a pipeline using Keras.
  • We went through the features of tf.data.
  • We also benchmarked ImageDataGenerator against tf.data and concluded that tf.data is the more efficient way to create a pipeline in Keras.