Text Data Pipelines in TensorFlow and Keras

Overview

This article will teach us how to create a Text Data pipeline in TensorFlow and Keras and train/test models using the Data Input Pipeline. Data Input Pipelines are a basic building block of Machine Learning Operations (MLOps), used in model training and deployment. There are three kinds of Data Input Pipelines we can build: tf.data, Keras Utils Sequence, and native Python pipelines. In this article, we will discuss how to build a Text Data Pipeline using tf.data.

Introduction

A Data Input pipeline enables us to utilize the underlying capabilities of the Keras API, such as parallel processing on the GPU, which reduces training time. Most importantly, if we train the model using a Data Input Pipeline, we can easily move the model to the production or UAT environment.

In other words, a data pipeline is a set of tools used to extract, transform, and load data from one or more sources into a target system. It’s broken down into three steps: data sourcing, processing, and delivery. Data Input Pipelines are the backbone of any Deep Learning (DL) or Machine Learning (ML) model.

Every model accepts input data in a specified format for the Training, Test, and Validation sets (the data augmentation step is applied only to the Training set and excluded from the Test and Validation sets). If the input data on which the model makes its predictions is not preprocessed in the same way as the training data, the model's output can differ, or the model can start throwing errors/exceptions.

A Machine Learning pipeline can be explained as a way to codify and automate the workflow that produces a Machine Learning (ML) or Deep Learning (DL) model. The Data Input Pipeline is crucial when any ML/DL model goes from development to production.

tf.data and Its Functions

tf.data is one of the most important components when we discuss data pipeline creation with TensorFlow and Keras; below, I have discussed some of the most important functions of tf.data.

batch()

Batch size is a term used in machine learning and refers to the number of training examples utilized in one iteration. The batch function is used to group the data samples into batches which will be fed to the .fit function while training the model.

shuffle()

The shuffle function is used to shuffle the dataset.

map()

The map function is used to preprocess the samples in the dataset. We can use a lambda function or a custom TensorFlow function inside the map function. The called function will be applied to all the samples in the dataset.

filter()

As we know, the map function is applied to all the samples, but what if we want to keep or drop samples based on their values? In this case, the filter function comes into action. It is called like the map function, but instead of transforming every sample, it keeps only the samples that satisfy the specified condition.

prefetch()

The prefetch function keeps the next batch (or batches) ready so that once the GPU finishes the forward and backward propagation on the current batch, the next batch is immediately available. We can use this concept where the batched dataset is produced by the CPU and consumed by the GPU. We need to add `object.prefetch(1)` at the end of the pipeline (after batching) so that at least one batch is ready at any point in time. We can even prefetch more than one batch.

cache()

The cache function caches a dataset in memory or local storage. It saves repeating operations, such as file opening and data reading, from being executed during each epoch.
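To make these functions concrete, below is a minimal, hypothetical tf.data pipeline that chains them together on a small in-memory list of sentences; the sample data and the lambda transformations are placeholders for illustration only.

```python
import tensorflow as tf

# A tiny in-memory text dataset used only to illustrate the pipeline functions.
sentences = ["great food", "terrible service", "average experience", "loved it"]
ds = tf.data.Dataset.from_tensor_slices(sentences)

ds = (
    ds.filter(lambda text: tf.strings.length(text) > 8)  # keep only longer reviews
      .map(lambda text: tf.strings.lower(text),          # preprocess every sample
           num_parallel_calls=tf.data.AUTOTUNE)
      .cache()                                           # cache after cheap transforms
      .shuffle(buffer_size=4)                            # shuffle the samples
      .batch(2)                                          # group samples into batches
      .prefetch(1)                                       # keep the next batch ready
)

for batch in ds:
    print(batch)
```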

Download Data using TensorFlow Datasets(TFDS)

TensorFlow Datasets (TFDS) is a collection of dozens of ready-to-use machine learning datasets. The data is loaded as tf.data.Dataset objects, and formatting the datasets requires only a few lines of code. In this section, I will load the dataset named yelp_polarity_reviews.

Importing Libraries

In this section, I will import the libraries that will be used to download the dataset from TensorFlow Datasets, along with the supporting libraries for creating the Text Data Input Pipeline.
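A minimal import block for this purpose; the exact set of libraries is an assumption based on the preprocessing steps used later in this article (NLTK for tokenization and stopword removal):

```python
import string

import tensorflow as tf
import tensorflow_datasets as tfds

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# NLTK resources used by the preprocessing functions later in the article.
nltk.download('punkt')
nltk.download('stopwords')
```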

Downloading the Dataset

tfds.load will download the dataset from the TensorFlow Datasets repository. It can split the dataset, download the metadata about the dataset, and batch it with a specified batch_size. The syntax of tfds.load() is explained below:

Syntax
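A simplified sketch of the signature, showing only the commonly used arguments discussed below:

```python
tfds.load(
    name,                   # registered dataset name, e.g. 'yelp_polarity_reviews'
    split=None,
    data_dir=None,
    batch_size=None,
    shuffle_files=False,
    as_supervised=False,
    with_info=False,
)
```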

Argument

  • name: Name of the dataset we want to download. Type: String
  • split: Which split of the dataset to load, e.g., 'test', 'train', or ['test','train']. If None, it returns the dataset in the format Dict[split, tf.data.Dataset]. For more detail, please visit here.
  • data_dir: Path where the downloaded dataset will be saved. Type: String
  • batch_size: Returns the dataset in batches of this size. If batch_size is -1, the full dataset is returned as tf.Tensors. Type: Integer
  • shuffle_files: Whether to shuffle the input files. Type: Boolean
  • as_supervised: If set to True, the dataset is returned as a two-tuple (input, label); otherwise, it is returned as a dictionary. Type: Boolean
  • with_info: Also returns the metadata related to the dataset we are downloading. Type: Boolean

In the above section, I have explained the most frequently used arguments for downloading a dataset with tfds.load; for more detail, kindly refer here.

I have downloaded the yelp_polarity_reviews dataset using tfds.load(); shuffle_files is set to True, and I have also downloaded the metadata about the dataset by setting with_info to True.

The Yelp reviews polarity dataset is constructed by considering stars 1 and 2 as negative and stars 3 and 4 as positive. 280,000 training samples and 19,000 testing samples are randomly selected for each polarity. In total, there are 560,000 training samples and 38,000 testing samples. Class 1 is negative polarity, and class 2 is positive polarity. The samples are provided as comma-separated values in train.csv and test.csv. They have two columns, one for the class index and the other for the review text. The review texts are escaped using double quotes ("), and any internal double quotes are escaped by two double quotes (""). New lines are escaped by a backslash followed by an "n" character, i.e., "\n".

This is a binary sentiment classification dataset: 560,000 highly polar Yelp reviews are provided for training and 38,000 for testing. Origin: the Yelp reviews dataset consists of reviews from Yelp and is derived from the 2015 Yelp Dataset Challenge data. Please visit here for additional information. The Yelp reviews polarity dataset was created from the original dataset by Xiang Zhang (xiang.zhang@nyu.edu) and was first used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

The below code snippets depict the code for downloading and loading the dataset.
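A minimal sketch of the download step, assuming the dataset is loaded in its default dictionary form (without as_supervised) so that the text and label can be separated later:

```python
import tensorflow_datasets as tfds

# Download the Yelp polarity reviews dataset together with its metadata.
ds, info = tfds.load(
    'yelp_polarity_reviews',
    shuffle_files=True,
    with_info=True,
)
```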

After downloading the dataset, we will print the info variable to show the full description of the dataset we downloaded. The code snippets are shown below:
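For example:

```python
# The DatasetInfo object describes the features, splits, sizes, and citation.
print(info)
```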

If you print the info object, you can see the description of the dataset.

Load the Dataset from the Directory

We can load the dataset either by the above method, i.e., with tfds, or from a directory. In this section, I will discuss tf.keras.preprocessing.text_dataset_from_directory in depth. This utility creates a tf.data.Dataset from text files in a directory. For text_dataset_from_directory to read the files, the directory structure should be as follows.

Structure of Directory
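The expected layout has one sub-directory per class; the class and file names below are placeholders:

```
main_directory/
...class_a/
......a_text_1.txt
......a_text_2.txt
...class_b/
......b_text_1.txt
......b_text_2.txt
```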

Let's look at the example below to understand text_dataset_from_directory. In this example, we will use text_dataset_from_directory to create a labeled tf.data dataset for the IMDB movie review dataset.

Downloading the Dataset

I will download the IMDB movie review sentiment dataset in this section. The code below downloads the dataset, known as the IMDB movie sentiment dataset, from the Stanford AI dataset repository, saves it in our current directory as aclImdb_v1, and extracts the archive.
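A sketch of the download step using tf.keras.utils.get_file; the URL is the public Stanford AI download link, and the exact extraction location may differ slightly depending on your environment:

```python
import tensorflow as tf

url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# Download the archive into the current directory and extract it.
dataset_path = tf.keras.utils.get_file(
    "aclImdb_v1",
    url,
    untar=True,      # extract the tar.gz archive after downloading
    cache_dir='.',   # save it in the current directory
    cache_subdir='',
)
```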

The output of the code snippets is shown below:

Output

Loading the Dataset

I will load the downloaded IMDB movie review dataset in this section using text_dataset_from_directory. The code snippets below are used to read the dataset from the directory. In text_dataset_from_directory, I have specified the path from which to read the dataset, a batch_size of 64, and a validation_split of 0.3. I have performed a similar step for the test set, except that the validation_split and subset arguments were not specified.
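A sketch of the loading step; the directory paths, the seed, and the variable names (raw_train_ds, raw_val_ds, raw_test_ds) are assumptions, so adjust the paths to wherever the archive was extracted:

```python
import tensorflow as tf

seed = 42  # fixed seed so the training/validation split is reproducible

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',           # adjust to the extracted train folder
    batch_size=64,
    validation_split=0.3,
    subset='training',
    seed=seed,
)
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=64,
    validation_split=0.3,
    subset='validation',
    seed=seed,
)
raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/test',            # no validation_split or subset for the test set
    batch_size=64,
)
```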

Below I have shown the output after executing the code snippets.

Output

The below code snippets will show the class name along with their values.
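For example, assuming the training dataset created above is named raw_train_ds:

```python
# Display each class name together with its integer label index.
for index, name in enumerate(raw_train_ds.class_names):
    print(f"Label {index} corresponds to {name}")
```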

Output:

Finally, I am displaying the samples from one batch of the Training dataset.
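A small sketch that prints a few reviews and their labels from the first training batch:

```python
# Take one batch from the training set and print the first few samples.
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(3):
        print("Review:", text_batch.numpy()[i][:100])  # first 100 characters
        print("Label :", label_batch.numpy()[i])
```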

Output:

Split the Dataset Into Training and Validation Set

In the above section, i.e., tfds.load(), I have downloaded the dataset; the ds will be composed of two dictionary keys, i.e., train and test, which can be used separately. This can be seen by executing the below code.
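For example:

```python
# The object returned by tfds.load (without a split) is a dictionary of datasets.
print(ds.keys())   # e.g. dict_keys(['train', 'test'])
print(ds)
```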

Output

Since the dataset is already separated into train and test splits, each of which is a tf.data.Dataset, I have created two pipelines by separating them: one for the Test data and one for the Train data. The code snippets are shown below:
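A minimal sketch, using the ds dictionary returned by tfds.load above:

```python
# One input pipeline per split.
train_ds = ds['train']
test_ds = ds['test']
```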

Prepare the Dataset for Training

Text cleaning or text pre-processing is mandatory when working with text in Natural Language Processing (NLP). In real life, human-written text data contains misspelled words, short forms, special symbols, emojis, etc. We must clean this noisy text data before feeding it to the machine learning model. Deep Learning algorithms accept only numbers, as floating-point values or integers. But our dataset is in text form, so we need to pre-process it. This section will discuss the basic pre-processing steps applied to the text data.

Separating the Text and Label from the Dataset

The below function will separate the text and label from each sample of the dataset. The code snippets are shown below:
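A minimal sketch, assuming each example is a feature dictionary with 'text' and 'label' keys (as in the TFDS yelp_polarity_reviews dataset):

```python
import tensorflow as tf

def separate_text_label(sample):
    # Pull the review text and its label out of the feature dictionary.
    return sample['text'], sample['label']

train_ds = train_ds.map(separate_text_label, num_parallel_calls=tf.data.AUTOTUNE)
test_ds = test_ds.map(separate_text_label, num_parallel_calls=tf.data.AUTOTUNE)
```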

Tokenization

The most important step in any NLP pipeline is tokenization, and it has a significant impact on the remaining pipeline components. A tokenizer breaks unstructured data and natural language text into discrete elements. The tokens in a document can then be used directly as a vector representing that document.

The below function will tokenize the dataset.
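A sketch of such a function using NLTK's word tokenizer; the function name tokenise_data is an assumption, and the function operates on plain Python strings before being wired into the tf.data pipeline later:

```python
from nltk.tokenize import word_tokenize   # requires nltk.download('punkt')

def tokenise_data(text):
    # Split a raw review string into a list of word tokens.
    return word_tokenize(text)
```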

Stopwords Removal

A commonly used word like "the," "a," "an," or "in" is a stop word that a search engine is programmed to ignore when indexing or retrieving entries as a result of a search query. We don't want these words to take up valuable processing time or space.

The below function will remove any stopwords found in the English stopword list from NLTK and convert the text to lowercase.
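A sketch of the stopword-removal step, again as a plain Python helper:

```python
from nltk.corpus import stopwords          # requires nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    # Lowercase each token and drop the ones found in the English stopword list.
    return [token.lower() for token in tokens if token.lower() not in stop_words]
```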

Preprocessing the Dataset

In this section, I will apply all the individual functions to the dataset pipeline using the .map function. The map function is called on the dataset input pipeline and invokes the preprocess function for every sample, sending one sample at a time. I will apply remove_punctuation, tokenise_data, and remove_stopwords. The snippets are shown below:
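A sketch of how these helpers can be combined and applied to the pipeline; because they are plain Python/NLTK functions, they are wrapped in tf.py_function so they can run inside Dataset.map. The helper names remove_punctuation, tokenise_data, and remove_stopwords follow the earlier sections:

```python
import string

import tensorflow as tf

def remove_punctuation(text):
    # Strip punctuation characters from the raw review text.
    return text.translate(str.maketrans('', '', string.punctuation))

def preprocess(text, label):
    def _clean(text_tensor):
        raw = text_tensor.numpy().decode('utf-8')
        tokens = tokenise_data(remove_punctuation(raw))
        return ' '.join(remove_stopwords(tokens))

    # Wrap the Python helpers so they can run inside the tf.data pipeline.
    cleaned = tf.py_function(_clean, [text], tf.string)
    cleaned.set_shape([])   # the result is a single scalar string
    return cleaned, label

train_ds = train_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
test_ds = test_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
```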

Vectorization

The term "vectorization" refers to a traditional method of transforming input data into vectors of real numbers, which is the format that ML models can work with. This method has been around since computers were first made, it works well in many different areas and is now used in Natural Language Pre-processing (NLP). We know that most Deep Learning architectures and many Machine Learning algorithms cannot process strings or plain text in its raw form. To perform any task, including classification, regression, clustering, and so on, they require numerical numbers as inputs. Additionally, extracting knowledge from the text format's vast amount of data and developing useful applications is essential. To put it briefly, any machine learning or deep learning model must be built using numerical data at the final level because models cannot directly comprehend text or image data like humans can.

We need methods, known as vectorization or word embeddings in the NLP world, to transform text data into numerical data. Vectorization or word embedding is thus the process of converting text data into numerical vectors. These vectors are then used in a variety of machine-learning models and for text-based feature extraction when building natural language processing models. The text data can be transformed into numerical vectors in various ways.

To create a vectorization object, I will use a TensorFlow pre-processing layer known as tf.keras.layers.TextVectorization. The syntax is shown below:

Syntax
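A simplified view of the constructor with its default arguments (only the commonly used ones are shown):

```python
tf.keras.layers.TextVectorization(
    max_tokens=None,
    standardize='lower_and_strip_punctuation',
    split='whitespace',
    ngrams=None,
    output_mode='int',
    output_sequence_length=None,
    pad_to_max_tokens=False,
    vocabulary=None,
)
```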

For details of the above-mentioned pre-processing layer, kindly visit here.

This layer accepts text data and converts it into a Bag of Words, TF-IDF, or another specified representation. It also provides options such as ngrams, max_tokens, and standardize, which pre-process the data using a built-in feature or a callable function. The code snippets for the text vectorization layer are shown below:
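A sketch of the layer initialization; the vocabulary size and the count output mode are choices made for this example:

```python
import tensorflow as tf

max_tokens = 10000   # assumed vocabulary size

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode='count',   # produce Bag-of-Words style count vectors
)
```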

As we can see in the above code snippets, we have initialized the TextVectorization layer object. The next step is to call the .adapt function of the TextVectorization layer to create the vocabulary. This pre-processing layer will be added to the model after the input layer.
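A sketch of the adapt step, building the vocabulary from the preprocessed training texts only (the labels are dropped first):

```python
# Batch the raw strings for the adapt pass and drop the labels.
train_text = train_ds.map(lambda text, label: text).batch(128)
vectorizer.adapt(train_text)
```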

Testing the Vectoriser Component

The vectorizer component is created, but we need to test whether it functions as expected or has bugs or errors. For this, we create a simple Keras Sequential model, add the pre-processing layer to it, and test it by making predictions on the test data. The below code snippets depict the code for testing the vectorizer component of the pipeline.

In the code snippets below, we create a Sequential model, add an input layer with a shape of (1,) and a data type of tf.string, and then add the pre-processing layer. If we wanted, we could add further layers, such as an Embedding layer and a CNN or RNN (LSTM or GRU), but since we are just testing the pipeline component, I am creating only two layers: one Input Layer and the Vectorizer Layer.
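A minimal sketch of that two-layer test model:

```python
import tensorflow as tf

# A string Input layer followed by the vectorizer layer created above.
test_model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(1,), dtype=tf.string),
    vectorizer,
])
```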

The below code snippets give the Vectorized value of the test dataset, and we are displaying the Count Vectorized output of the first sample of the test dataset.
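A sketch of the prediction step on one batch of the (preprocessed, unbatched) test pipeline:

```python
# Run one batch of test reviews through the test model and inspect the first sample.
for text_batch, label_batch in test_ds.batch(32).take(1):
    vectorized = test_model.predict(tf.expand_dims(text_batch, -1))
    print(vectorized[0])   # count-vectorized representation of the first review
```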

Below I have depicted the Count Vectorized output of the first sample of the test dataset.

Output:

Configure the Dataset for Performance

This section will explain how to optimize the Text Data Pipeline by implementing the prefetch and cache functions, as shown below.

Dataset.cache

The cache function is used to cache a dataset in memory or local storage. It saves repeating operations, such as file opening and data reading, from being executed during each epoch. Instead, the first epoch reads from the original dataset, and subsequent epochs read from the cached dataset, saving time and resources.

Dataset.prefetch

Prefetch keeps at least one batch ready at any point so that there is no delay while feeding batches into the training phase of the model. The prefetch function is used when we want to keep the next batch (or batches) ready so that once the GPU finishes the forward and backward propagation on the current batch, the next batch is immediately available. We can use this concept where the batched dataset is produced by the CPU and consumed by the GPU. We need to add object.prefetch(1) at the end of the pipeline (after batching) so that at least one batch is ready at any point in time. We can even prefetch more than one batch.
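A sketch of this final configuration step applied to both pipelines; the batch size of 64 is an assumption for this example:

```python
AUTOTUNE = tf.data.AUTOTUNE
batch_size = 64

# Cache the preprocessed samples, batch them, and prefetch batches ahead of training.
train_ds = train_ds.cache().batch(batch_size).prefetch(AUTOTUNE)
test_ds = test_ds.cache().batch(batch_size).prefetch(AUTOTUNE)
```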

Conclusion

In this article, we have studied the Text data pipeline in Tensorflow. The following are the key takeaways:

  • The Data Input Pipeline is one of the basic building blocks of the MLOps procedure.
  • By integrating Data Input Pipelines, we can easily ship our model from the development environment to the production or UAT environment.
  • The tf.data pipeline functions batch, map, shuffle, cache, and prefetch are the core of the tf.data Pipeline, where map can be used to incorporate native Python functions into the pipeline.
  • With tfds.load, we can download a dataset and split it into test, train, and validation sets.
  • tf.keras.layers.TextVectorization is one of the most crucial layers for text preprocessing. It accepts only text data and acts as a layer placed after the Input Layer.