Tabular Data Pipelines in TensorFlow and Keras

Overview

In this article, we will learn how to create a tabular data pipeline in TensorFlow and Keras and how to train and test models using that data input pipeline. Data input pipelines are a basic building block of Machine Learning Operations (MLOps) and are used in both model training and deployment. There are three kinds of data input pipelines we can build: tf.data, Keras utils Sequence, and native Python pipelines. In this article, we will discuss the components used to build a tabular data pipeline.

Tabular Files

Since this article is dedicated to tabular data pipelines, we will first have a brief discussion of CSV and Excel files.

CSV Files : A Comma Separated Values (CSV) file is a plain text file that stores a list of data entries separated by commas. Because of this fairly simple structure, CSV files are often used to exchange data between different applications. Text editors, spreadsheet programs such as Excel, and many other specialized applications can open CSV files.

XLS Files : An XLS file is a spreadsheet file that can be created by Excel or other programs. The file type represents the Excel Binary File format. An XLS file stores data as binary streams in a compound file; streams and substreams in the file contain information about the content and structure of an Excel workbook. Although Excel may be the most recognizable spreadsheet program, other vendors offer competing products, including Google Sheets, Apple's Numbers, and Apache OpenOffice Calc.

Download and Load Dataset

This section discusses how we can download and load the dataset. There are numerous ways to download and load datasets for use with TensorFlow.

We can read tabular data files in many different ways, and numerous Python packages exist for this. Still, Pandas is one of the most reliable and fastest ways to read tabular data files. Pandas is an open-source Python package that is most commonly used for data science, data analysis, and machine learning tasks. It is built on top of another package, NumPy, which provides support for multi-dimensional arrays. Pandas allows us to read files from the internet or from a local directory. The syntax for loading a CSV file with pandas can be found here.

For this article, I will read a file known as the Titanic Survival dataset, which is composed of 10 columns: 9 of them are independent variables (IV) and 1 is the dependent variable (DV). For more detail on dependent and independent variables, please refer to Model Evaluation. The code snippet below loads the dataset from the internet:
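A minimal sketch of this step, assuming pandas is installed and using the copy of the Titanic training CSV hosted for the TensorFlow tutorials (the exact URL and variable names are my assumptions, not the article's original code):

```python
import pandas as pd

# Titanic training CSV hosted for the TensorFlow tutorials (assumed URL).
TITANIC_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"

# Read the CSV straight from the internet into a DataFrame.
titanic_df = pd.read_csv(TITANIC_URL)

# Display the top 5 rows.
print(titanic_df.head())
```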

After loading the dataset, I am displaying the dataset's top 5 rows using the head() function. The output is shown below:

Output

You can also download the dataset and load it locally; in that case, replace the URL with the local file path.

Basic Preprocessing

Preparing raw data for use in a Machine Learning (ML) model is the goal of data preprocessing. It is the first and most important step when creating a Machine Learning (ML) model. When starting an ML project, clean and properly formatted data is not always available, and data must be cleaned and formatted before it can be used. This is what the data preprocessing step is for.

A real-world dataset is usually full of noise and missing values and cannot be used directly in a Machine Learning (ML) model. Data preprocessing is therefore a crucial task: the inferences of an ML model depend on the quality of the data, following the principle of GIGO - Garbage In, Garbage Out. In this section, I will discuss a basic preprocessing technique, namely normalization.

Normalization

In machine learning, normalization is a scaling technique applied during data preparation to bring the values of numerical columns onto a common scale. It is not necessary for every dataset; it is only needed when the features of the dataset have very different ranges. Normalization improves both the performance and the reliability of a machine learning model. We will briefly discuss normalization in this article.

Formula

X_n = \frac{X - X_{minimum}}{X_{maximum} - X_{minimum}}

The formula above defines the normalization technique, where:

  • Xn = the normalized value
  • Xmaximum = the maximum value of the feature
  • Xminimum = the minimum value of the feature

This technique scales the value of Xn to the range between 0 and 1 and is known as Min-Max scaling. There are many normalization techniques, but the most frequently used ones are Min-Max scaling and standardization scaling (Z-score standardization).

Normalization Implementation

In this section, I will implement normalization using a TensorFlow/Keras layer as well as the scikit-learn Min-Max Scaler.

Keras Layer

In this section, I will implement data normalization using the Keras Normalization layer, which acts as a preprocessing layer. This layer shifts and scales the inputs into a distribution centered around 0 with a standard deviation of 1. It does so by precomputing the mean and variance of the data and applying the formula below:

Formula

X_n = \frac{input - mean}{\sqrt{variance}}

The layer's mean and variance must be learned through adapt() or supplied during construction. The data's mean and variance will be calculated and saved as the layer's weights by adapt(). Before calling fit(), evaluate(), or predict(), adapt() should be called.

Syntax
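The constructor looks roughly as follows in recent TensorFlow 2.x releases (defaults shown for reference; check your installed version's documentation):

```python
tf.keras.layers.Normalization(
    axis=-1,         # axis (or axes) that get their own mean and variance
    mean=None,       # optional precomputed mean; otherwise learned via adapt()
    variance=None,   # optional precomputed variance; otherwise learned via adapt()
    invert=False     # if True, the layer de-normalizes instead of normalizing
)
```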

Arguments

  • axis : The axis (or axes) that should have a separate mean and variance; for tabular data this is usually the feature axis, -1. Type : Integer, tuple of integers, or None
  • mean : A precomputed mean with which to normalize the data. Type : Float
  • variance : A precomputed variance with which to normalize the data. Type : Float
  • invert : Whether to invert the normalization, i.e., de-normalize the input. Type : Boolean

In the section above, I downloaded the Titanic dataset, so we will reuse it in this section. First, we will create a Normalization layer and adapt it to the fare column so that it standardizes the fare values (zero mean, unit variance); a layer adapted this way can then be added as the first layer of a model.

In the below code, I am creating a normalization layer and adapting it to the fare column of the dataset.
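A sketch of this step, reusing the `titanic_df` DataFrame loaded earlier ('fare' is the column name in the TensorFlow-hosted Titanic CSV; variable names are my own):

```python
import numpy as np
import tensorflow as tf

# Pull the fare column out as a float array.
fare = np.array(titanic_df['fare'], dtype=np.float32)

# axis=None: compute a single (scalar) mean and variance for the whole column.
fare_normalizer = tf.keras.layers.Normalization(axis=None)
fare_normalizer.adapt(fare)

# After adapt(), the learned statistics are stored as the layer's weights.
print("mean:", fare_normalizer.mean.numpy())
print("variance:", fare_normalizer.variance.numpy())
```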

After the layer has adapted, we can see the mean and variance of the layer as shown below:

Output

Model Training with Normalization Layer

In this section, we will train a simple shallow neural network with a normalization layer. I will be using the Abalone dataset, which consists of a small number of samples with limited-range floating-point features.

Dataset Loading

In the code below, I load the dataset and display 5 samples using the pandas head() function.
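A minimal sketch, assuming the Abalone training CSV hosted for the TensorFlow tutorials (the URL and column names are assumptions; the file has no header row, so the names are supplied explicitly):

```python
import pandas as pd

abalone_train = pd.read_csv(
    "https://storage.googleapis.com/download.tensorflow.org/data/abalone_train.csv",
    names=["Length", "Diameter", "Height", "Whole weight", "Shucked weight",
           "Viscera weight", "Shell weight", "Age"])

# Display the first 5 samples.
print(abalone_train.head())
```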

Output

Below, I separate the dataset into features and labels. The label is the Age column, and all other columns in the CSV are treated as feature columns. The code snippets for separating the features and labels are shown below:
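One way to do this, reusing the `abalone_train` DataFrame from above:

```python
# Copy the DataFrame, then pop the 'Age' column off as the label.
abalone_features = abalone_train.copy()
abalone_labels = abalone_features.pop('Age')
```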

In the below code, I am creating a normalization layer which will be implemented as one of the layers in the Keras Sequential model.
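A sketch of the layer creation and adaptation (variable names are my own):

```python
import numpy as np
import tensorflow as tf

# Convert the feature DataFrame to a float array.
abalone_features = np.array(abalone_features, dtype=np.float32)

# Default axis=-1: one mean/variance per feature column.
normalize = tf.keras.layers.Normalization()
normalize.adapt(abalone_features)
```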

Model Creation

I will create a very simple shallow neural network which will consist of the following layers:

  • Normalization Layer: Above, we have created a Normalization layer. So, we will add the Normalization layer into the Keras sequential model as a first layer.

  • Dense Layer : I have created a dense layer with 64 neurons and the ReLU activation function, which accepts the output of the Normalization layer as its input. ReLU stands for rectified linear unit; it outputs 0 for negative inputs and passes positive inputs through unchanged, and is defined by the following formula: f(x) = \max(0, x)

  • Output Layer : I have created a dense layer with a single neuron and the Sigmoid activation function, whose output ranges between 0 and 1. It is defined by the following formula:

\sigma(x) = \frac{1}{1+e^{-x}}
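A minimal sketch of the model built from the layers listed above, reusing the adapted `normalize` layer (the sigmoid output follows the description above; for predicting age, a linear output would be the more common choice):

```python
from tensorflow.keras import layers

model = tf.keras.Sequential([
    normalize,                             # adapted Normalization layer
    layers.Dense(64, activation='relu'),   # hidden layer with 64 neurons
    layers.Dense(1, activation='sigmoid')  # single-neuron output layer
])
```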

Below I am displaying the model architecture.
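For example, the layer stack can be printed with Keras's summary() method:

```python
model.summary()
```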

Model Architecture

Model Compilation and Training

In this section, we are going to compile our model. To compile the model, we need to specify the loss function, the optimizer, and a metric. I have used Mean Squared Error as the loss. Adam was selected as the optimizer to propagate the error backward; Adam is an extension of Stochastic Gradient Descent that combines Root Mean Square Propagation (RMSProp) and the Adaptive Gradient Algorithm (AdaGrad). For the metric, we use accuracy for simplicity; you can use any metric suited to your problem statement. The snippet below depicts the code for model compilation.
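A sketch of the compilation step with the loss, optimizer, and metric described above:

```python
model.compile(loss='mean_squared_error',  # Mean Squared Error loss
              optimizer='adam',           # Adam optimizer
              metrics=['accuracy'])       # accuracy metric, for simplicity
```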

After successfully compiling the model, our final step is to train it. The dataset will be split into two sets, i.e., a training set and a testing set. The validation_split argument denotes the ratio by which the dataset is split; in our case it is 0.1, which means ten percent of the dataset is used for validation and the remaining ninety percent is used for training, with a batch size of 128. The snippet below depicts the code for model training.
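A sketch of the training call; the number of epochs is my assumption, since the text only specifies the validation split and the batch size:

```python
history = model.fit(abalone_features, abalone_labels,
                    validation_split=0.1,  # 10% of the data held out for validation
                    batch_size=128,
                    epochs=10)             # assumed epoch count
```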

Finally, when executing the training code snippets, we will see the output shown in the image below.

Output

Sklearn Normalization

In the section above, I downloaded the Titanic dataset; in this section, I will use the scikit-learn Min-Max Scaler to scale it. For demonstration purposes, I will scale the fare column. The code snippets are shown below:
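A sketch using scikit-learn's MinMaxScaler on the `titanic_df` DataFrame loaded earlier ('fare' is the column name in the TensorFlow-hosted CSV):

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# fit_transform expects a 2-D input, hence the double brackets.
titanic_df[['fare']] = scaler.fit_transform(titanic_df[['fare']])

print(titanic_df['fare'].head())
```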

In the above code, I apply the Min-Max scaler to the fare column and then replace the original column values with the scaled values, which are shown below.

Output

Building Pipeline Using tf.data

In this section, I will discuss the components with which we can create a tabular data pipeline in TensorFlow.

In-Memory Data

In this section, we will create a dataset from in-memory data. We will use the Titanic dataset discussed above, read it into memory with the pandas read_csv function, and then build a tf.data.Dataset from the resulting DataFrame. The code is shown below:
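A sketch of this pattern, reusing the `titanic_df` DataFrame read with pandas earlier; the dictionary-of-arrays form is one common way to feed a DataFrame into tf.data:

```python
import numpy as np
import tensorflow as tf

# Split the in-memory DataFrame into features and the 'survived' label.
titanic_features = titanic_df.copy()
titanic_labels = titanic_features.pop('survived')

# Convert each column to a NumPy array and build a tf.data.Dataset from the dict.
titanic_features_dict = {name: np.array(value) for name, value in titanic_features.items()}
titanic_ds = tf.data.Dataset.from_tensor_slices((titanic_features_dict, titanic_labels))

# Print a single example to verify the dataset.
for example, label in titanic_ds.take(1):
    print(example)
    print("label:", label)
```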

Output

From a Single File

In this section, we will create a dataset from a single CSV file. We will again use the Titanic dataset discussed above. The code below downloads the CSV from the internet using tf.keras.utils.get_file:
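A sketch of the download step (the URL is the TensorFlow-hosted Titanic CSV, which is my assumption about the source used here):

```python
import tensorflow as tf

# get_file downloads the CSV once, caches it locally, and returns the local path.
titanic_file_path = tf.keras.utils.get_file(
    "train.csv",
    "https://storage.googleapis.com/tf-datasets/titanic/train.csv")

print(titanic_file_path)
```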

Output

After downloading the file, it is time to create the dataset from the CSV file. For that, we will use tf.data.experimental.make_csv_dataset with the arguments batch_size=5, label_name='survived', num_epochs=1, and ignore_errors=True. The code snippet is shown below:
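A sketch of the dataset creation with the arguments named above:

```python
titanic_csv_ds = tf.data.experimental.make_csv_dataset(
    titanic_file_path,
    batch_size=5,           # small batches keep the printed samples readable
    label_name='survived',  # column split off as the label
    num_epochs=1,           # read the file once per pass over the dataset
    ignore_errors=True)     # skip malformed lines instead of failing
```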

After creating the dataset, it is good practice to print a few samples; this gives us assurance that the code is working as expected. Therefore, in this part, I am printing a few dataset samples as shown below:
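One way to print a single batch:

```python
for batch, label in titanic_csv_ds.take(1):
    for key, value in batch.items():
        print(f"{key:20s}: {value}")
    print(f"{'survived':20s}: {label}")
```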

Output

Caching

A cache is a high-speed data storage layer in computing that stores a subset of data, usually transient data, so that future requests for that data can be served faster than from the data's primary storage location. Using a cache, you can efficiently reuse data that has already been retrieved or computed. TensorFlow has a dedicated function that can cache the pre-processed data locally. This section will discuss how we can cache a dataset in TensorFlow.

The code below downloads the Titanic dataset from the internet, exactly as in the previous section, using tf.keras.utils.get_file:

Output

After downloading the file, we again create the dataset from the CSV file using tf.data.experimental.make_csv_dataset with the arguments batch_size=5, label_name='survived', num_epochs=1, and ignore_errors=True, just as in the previous section:

After creating the dataset, we again print a few samples to make sure the code works as expected:

Output

The code below displays the time required to read the Titanic dataset without caching.
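A sketch of the timing, using a small hypothetical helper (`time_dataset` is my own, not a TensorFlow function) that iterates over the dataset several times:

```python
import time

def time_dataset(ds, passes=20):
    """Hypothetical helper: iterate over `ds` `passes` times and report the wall time."""
    start = time.perf_counter()
    for _ in range(passes):
        for _ in ds:
            pass
    print(f"{time.perf_counter() - start:.2f} seconds")

# Without caching, every pass re-reads and re-parses the CSV file.
time_dataset(titanic_csv_ds)
```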

Output

I have implemented the caching with the Titanic dataset in the code below. Caching is a high-speed data storage layer in computing that stores a subset of data, usually temporary data, so future requests can be processed faster than if the data were stored in its primary location.

In the code below, I cache the Titanic dataset using the cache() function and then iterate over the cached dataset, displaying the time required to read the Titanic dataset with caching.
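A sketch of the cached version, reusing the `time_dataset` helper defined above; the first pass fills the cache and subsequent passes read from memory:

```python
# cache() stores the parsed examples after the first full pass.
cached_ds = titanic_csv_ds.cache()

time_dataset(cached_ds)
```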

Output

As we can see, there is a huge time difference between reading the cached and the non-cached dataset. Such time differences and resource management play a very crucial role when we are dealing with huge datasets.

Multiple File

In this section, I will read a dataset that is split across multiple files. The process is almost identical to reading a single file, with small differences that we will discuss below. This time we will use the Fonts dataset. The code below downloads and extracts it from the internet using tf.keras.utils.get_file:
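A sketch of the download, assuming the UCI "fonts" archive used in the TensorFlow CSV tutorial (the URL and extraction paths are assumptions):

```python
import tensorflow as tf

# Download and extract the zip into a local ./fonts directory.
fonts_zip = tf.keras.utils.get_file(
    'fonts.zip',
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00417/fonts.zip",
    cache_dir='.', cache_subdir='fonts',
    extract=True)
```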

Output

After downloading the dataset known as Fonts, I am just displaying the list of files in the code snippets shown below:
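One way to list the extracted CSV files:

```python
import pathlib

font_csvs = sorted(str(p) for p in pathlib.Path('fonts').glob("*.csv"))

print(font_csvs[:10])
print("number of CSV files:", len(font_csvs))
```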

Output

This is the main focus of this section: loading multiple files at once. After downloading the files, it is time to create the dataset from the CSV files. We will again use tf.data.experimental.make_csv_dataset, but this time with a wildcard file_pattern that tells the function to read every CSV file under the fonts directory, together with arguments such as batch_size and num_epochs. The code snippets are shown below:
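A sketch of the multi-file dataset; the batch size, parallel-read, and shuffle-buffer values are illustrative assumptions:

```python
fonts_ds = tf.data.experimental.make_csv_dataset(
    file_pattern="fonts/*.csv",   # wildcard: read every CSV under fonts/
    batch_size=10,
    num_epochs=1,
    num_parallel_reads=20,        # interleave reads from several files at once
    shuffle_buffer_size=10000)
```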

After creating the dataset, it is again good practice to print a few samples to make sure the code works as expected:
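One way to print a single (feature-only) batch; the font CSVs have many columns, so only the first few are shown:

```python
for features in fonts_ds.take(1):
    for i, (name, value) in enumerate(features.items()):
        if i > 15:
            break
        print(f"{name:20s}: {value}")
```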

Output

Lower-Level Functions

In this section, I will discuss lower-level TensorFlow functions, which are useful for readers who are comfortable with low-level coding.

tf.io.decode_csv

The tf.io.decode_csv function parses lines of text into a list of CSV column tensors. It decodes a list of strings into columns, and it does not try to guess the column data types; instead, we have to specify the data type of each column manually.

We will again use the Titanic dataset discussed above. As in the earlier sections, the CSV is downloaded with tf.keras.utils.get_file, and the returned local path is reused here:

Output

The code snippet below specifies the data types for the Titanic dataset. It must be noted that the number of specified types must equal the number of columns in the file. The code snippet is shown below:
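A sketch of the column defaults; the type of each default value tells the parser which dtype to produce, and the Titanic CSV has 10 columns:

```python
# One default per column: survived, sex, age, n_siblings_spouses, parch,
# fare, class, deck, embark_town, alone (column order assumed from the CSV).
titanic_types = [int(), str(), float(), int(), int(),
                 float(), str(), str(), str(), str()]
```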

In the code below, I read the file using tf.io.decode_csv by passing the file's lines and the column data types, and then display each column's data type and shape.
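A sketch of the parsing step, reusing `titanic_file_path` and `titanic_types` from above:

```python
import pathlib

# Read the raw text, drop the header row and the trailing empty line.
text = pathlib.Path(titanic_file_path).read_text()
lines = text.split('\n')[1:-1]

# Parse all lines at once; each entry of `features` is one column tensor.
features = tf.io.decode_csv(lines, record_defaults=titanic_types)

for f in features:
    print(f"type: {f.dtype.name}, shape: {f.shape}")
```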

The output of the above code is shown below.

Output

tf.data.experimental.CsvDataset

The tf.data.experimental.CsvDataset class provides a minimal CSV dataset interface without the convenience features of the tf.data.experimental.make_csv_dataset function, i.e., column header parsing, column type inference, automatic shuffling, and file interleaving. It performs the same role as tf.io.decode_csv, but as a dataset.

As in the previous sections, we will use the Titanic dataset, downloading the CSV with tf.keras.utils.get_file and reusing the returned local path:

Output

The code snippet below specifies the column data types for the Titanic dataset, exactly as in the previous section; the number of specified types must equal the number of columns in the file:

In the code below, I read the file using tf.data.experimental.CsvDataset by specifying the file path and the column data types, and display one dataset sample.
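A sketch of the CsvDataset usage, reusing `titanic_file_path` and `titanic_types`; header=True skips the column-name row:

```python
simple_titanic = tf.data.experimental.CsvDataset(
    titanic_file_path, record_defaults=titanic_types, header=True)

# Print one parsed record.
for example in simple_titanic.take(1):
    print([e.numpy() for e in example])
```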

Output

Conclusion

In this article, we have studied the Tabular data pipeline in Tensorflow. The following are the key takeaways:

  • Pandas is one of the most efficient ways to work with tabular data files.
  • Keras has a Normalization layer, which can be used as a preprocessing layer.
  • tf.data has various components, such as loading data from memory, loading a single file, and loading multiple files at once, which are widely used to construct tabular data pipelines.
  • The cache() function can be used to optimize the performance of the input pipeline.
  • For low-level work, TensorFlow provides the tf.io.decode_csv function and the tf.data.experimental.CsvDataset class.