Training a Semantic Segmentation Model in Keras

Overview

In this article, you will learn to train a semantic segmentation model in Keras using U-Net. For demonstration purposes, I will use the Oxford-IIIT Pet dataset, composed of 37 pet breeds with roughly 200 images each, split between the training and test sections. The Oxford-IIIT Pet dataset is freely available online. We will implement a U-Net-like structure, a state-of-the-art model architecture for segmentation tasks.

What Are We Building?

In the modern era, the applications of deep learning models are expanding at an exponential rate. One such area of artificial intelligence is computer vision, in which we train models to understand real-world images. Thanks to deep learning architectures like U-Net and CANet, we can accomplish complex computer vision tasks with high-quality results. While computer vision is a huge field with a great deal to offer and a wide range of novel problems to tackle, our focus in this and the following articles will be on one model, U-Net, which is designed for image segmentation.

Image segmentation aims to divide an image into smaller, meaningful regions. Producing these fragments, or multiple segments, makes downstream image analysis tasks easier to compute. Masks are another essential requirement for image segmentation: a mask is essentially a binary image whose zero and non-zero values mark the regions of interest, and it lets us express the desired outcome of a segmentation task. With images and their associated masks describing the most important regions, we can perform a wide range of subsequent tasks.

Before you dive into this article, take a look at a few optional prerequisites to follow along. As we will use these deep learning frameworks to build the U-Net architecture, I suggest reading the TensorFlow and Keras guides to learn more about them.

Pre-Requisites

This article's prerequisites are in-depth knowledge of Convolutional Neural Networks, data input pipelines, and a basic idea of TensorFlow datasets.

CNN

A Convolutional Neural Network (CNN or convnet) is a type of deep learning network architecture used for image recognition and other tasks that require processing pixel data. While deep learning includes other kinds of neural networks, CNNs are the preferred architecture for identifying and recognizing objects. This makes them excellent candidates for computer vision (CV) tasks and for applications that require object recognition, such as self-driving cars and facial recognition.

TensorFlow and Keras

Keras is a compact, easy-to-learn, high-level Python library that runs on top of the TensorFlow framework. It focuses on making deep learning techniques approachable, such as creating layers for neural networks, keeping track of tensor shapes, and handling the mathematical details.

What is Image Segmentation?

Image segmentation is based on the properties of an image's pixels. It is a widely used method in digital image processing and analysis to divide an image into various parts or regions. For example, foreground and background can be distinguished by segmenting an image, or pixels can be grouped based on their similarity in color or shape. A common application in medical imaging is to detect and label the pixels of an image, or the voxels of a 3D volume, that represent a tumor in a patient's brain or other organs.

Several methods and approaches for image segmentation have been developed over time to efficiently address the segmentation issues of particular application areas, including machine vision, automated driving, video surveillance, and medical imaging.

How Are We Going to Build It?

The U-Net architecture, first released in 2015, has been a major development in deep learning. The architecture won the International Symposium on Biomedical Imaging (ISBI) cell tracking challenge of 2015 in several categories. The segmentation of neuronal structures in electron microscopy stacks and of transmission light microscopy images are two examples of its applications.

With this U-Net architecture, the segmentation of 512 x 512 images can be computed quickly on a modern GPU. Due to its phenomenal success, the architecture has been adapted in numerous ways: LadderNet, U-Net with attention, R2-UNet (recurrent and residual convolutional U-Net), and U-Net with residual blocks or blocks with dense connections are a few examples.

While U-Net is a significant development in deep learning, it is just as important to understand the previous approaches used to solve similar problems. The sliding window approach, which won the ISBI EM segmentation challenge in 2012 by a significant margin, is one of the primary examples that comes to mind. In addition to the initial training dataset, the sliding window method generated many sample patches.

The sliding window network achieved this result by providing a local region (patch) around each pixel to predict the class label of that pixel. Another accomplishment of this architecture was its ability to localize easily on any given training dataset for the respective task.

On the other hand, the U-Net architecture overcame the two main drawbacks of the sliding window method. First, since each pixel was considered separately, there was a lot of overlap between the produced patches, resulting in a lot of redundant work overall. Second, the training process was slow and consumed a lot of time and resources. These factors raise doubts about the network's operational viability.

The U-Net architecture is elegant and addresses most of these issues. The strategy is based on the idea of fully convolutional networks. The goal of U-Net is to capture both the characteristics of the context and the localization. The main idea behind the implementation is to use a sequence of contracting layers followed by upsampling operators to get higher-resolution outputs on the input images.

[Figure: U-Net architecture]

A quick look at the image explains why this design is referred to as the U-Net architecture: the resulting structure has the shape of a "U". Simply by examining the structure and the numerous components involved in its construction, we can see that the network is fully convolutional; no other kinds of layers, such as dense or flatten layers, are used. The visual representation shows an initial contracting path followed by an expanding path.

According to the architecture, an input image first passes through a few convolutional layers with the ReLU activation function. The image resolution decreases from 572 x 572 to 570 x 570, and then to 568 x 568. The cause of this reduction is the use of unpadded ("valid") convolutions, which shrink the overall dimensions. Besides the convolution blocks, we also notice an encoder block on the left side and a decoder block on the right.

Max-pooling layers with strides of two steadily reduce the image size in the encoder block. The encoder architecture also has repeated convolutional layers with an increasing number of filters. Once we reach the decoder side, the number of filters in the convolutional layers decreases, and the subsequent layers keep upsampling up to the top. Additionally, skip connections link the decoder blocks' layers to the corresponding encoder outputs.

The skip connection is a crucial idea for preserving features from earlier layers so that they reflect more strongly on the overall output values. It has also been demonstrated empirically that skip connections speed up model convergence and deliver superior results. The final convolution block contains a few convolutional layers followed by the final convolution layer. This last layer has two filters, producing the two-channel output map; the project's requirements can alter this final layer.

Final Output

Image segmentation makes it easier to work with computer vision applications. It is mostly applied in biomedical imaging, object detection, object classification, and identification. For example, the image below displays the input image, the actual masked image, and the predicted masked image from the trained U-Net image segmentation model.

[Figure: Input image, actual mask, and predicted mask from the trained U-Net image segmentation model]

Required Libraries

Google's TensorFlow 2.x offers a low-level set of tools to create and train neural networks, while Keras lets you stack layers of neurons and work with various neural network topologies. We also use additional supporting packages like OpenCV and NumPy for data pre-processing. For the dataset, we will use the Oxford-IIIT Pet dataset, which is available in the TensorFlow Datasets repository and can easily be accessed via tfds.load.
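As a minimal sketch of the setup, the imports below cover the packages mentioned above; the exact set you need depends on which parts of the pipeline you run:

```python
import numpy as np                  # numerical operations during pre-processing
import cv2                          # OpenCV, optional helpers for image handling
import matplotlib.pyplot as plt     # visualizing images and masks
import tensorflow as tf             # core deep learning framework
import tensorflow_datasets as tfds  # gives access to the Oxford-IIIT Pet dataset
```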

Training a Semantic Segmentation Model in Keras

Dataset

Data input pipelines are crucial when any Machine Learning (ML) / Deep Learning (DL) model goes from development to production. Every model expects data in a predefined format. In this section, I will download the Oxford-IIIT Pet dataset from the TensorFlow Datasets repository using tfds.load. It has options to split the dataset, download the metadata of the dataset, and specify the batch_size, but we will only load the dataset along with its metadata. The code is shown below:
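A minimal sketch of the loading step, assuming the oxford_iiit_pet build from the TensorFlow Datasets catalog:

```python
# Load the dataset together with its metadata (info object).
dataset, info = tfds.load('oxford_iiit_pet:3.*.*', with_info=True)
print(info)
```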

The dataset's metadata is shown below.

Output

After loading the data, we will create the input data pipelines to pre-process the dataset. The preprocessing function accepts an image and its label as arguments. The image pixels are then divided by 255 so that each pixel value lies between 0 and 1; this process is also known as pixel scaling. Finally, the images are resized to 128 x 128 and returned along with the labels.

The map function is called on the dataset input pipeline, invoking the load_image function, which in turn calls normalize for every sample to preprocess the dataset. The map function sends one sample at a time to the preprocessing function. The next step is shuffling the dataset, i.e., randomly rearranging the sample order to rule out associations between consecutive samples. Finally, we batch the dataset with a batch size of 32 and apply prefetch with AUTOTUNE. Prefetch keeps at least one batch ready at any point so that there is no delay while feeding batches into the training phase of the model.

An almost identical data input pipeline is applied to the test data, except that we neither shuffle the dataset samples nor prefetch any batches.

The code snippet below depicts the function that the map function will call to normalize the images and their respective masks.
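A sketch of the normalization step, assuming pixel scaling to the [0, 1] range as described above; shifting the mask labels down by one is an added assumption so they work with the sparse categorical cross-entropy loss used later:

```python
def normalize(input_image, input_mask):
    # Scale pixel values from [0, 255] to [0, 1].
    input_image = tf.cast(input_image, tf.float32) / 255.0
    # The pet masks use labels {1, 2, 3}; shift them to {0, 1, 2}
    # (assumed adjustment, matching the loss function used later).
    input_mask -= 1
    return input_image, input_mask
```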

The snippet below depicts the function that the map function will call to resize the images and their respective masks.
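A sketch of the resizing step; the 'image' and 'segmentation_mask' keys follow the Oxford-IIIT Pet layout in TensorFlow Datasets, and nearest-neighbor interpolation keeps the mask labels discrete:

```python
def load_image(datapoint):
    # Resize both image and mask to 128 x 128 before normalizing.
    input_image = tf.image.resize(datapoint['image'], (128, 128))
    input_mask = tf.image.resize(datapoint['segmentation_mask'], (128, 128),
                                 method='nearest')
    return normalize(input_image, input_mask)
```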

The code snippets below build the train and test data pipelines used to train the image segmentation model.
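A sketch of both pipelines under the choices described above (batch size 32, shuffling and prefetching only on the training side); the BUFFER_SIZE value is an assumption:

```python
BATCH_SIZE = 32
BUFFER_SIZE = 1000  # shuffle buffer size (assumed value)

train_images = dataset['train'].map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
test_images = dataset['test'].map(load_image)

# Train pipeline: shuffle, batch, and prefetch.
train_batches = (train_images.shuffle(BUFFER_SIZE)
                             .batch(BATCH_SIZE)
                             .prefetch(tf.data.AUTOTUNE))

# Test pipeline: batch only (no shuffling, no prefetching).
test_batches = test_images.batch(BATCH_SIZE)
```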

Visualizing the Dataset

In this section, I will visualize the dataset by displaying the images and masks of two samples from the train set, using the Matplotlib library to plot them. The code snippet below shows how Matplotlib will plot the images and their respective masks.
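A sketch of a display helper, assuming a list of image-like tensors in the order input image, true mask, predicted mask:

```python
def display(display_list):
    plt.figure(figsize=(15, 5))
    title = ['Input Image', 'True Mask', 'Predicted Mask']
    for i in range(len(display_list)):
        plt.subplot(1, len(display_list), i + 1)
        plt.title(title[i])
        # Convert the tensor to a PIL image for plotting.
        plt.imshow(tf.keras.utils.array_to_img(display_list[i]))
        plt.axis('off')
    plt.show()
```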

The code snippet below extracts two samples from the train dataset. I use the take function from the TensorFlow pipeline and pass the image and label from the train dataset to the display function, which plots the images and their respective masks. The code is shown below.
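A minimal usage sketch, taking two unbatched samples from the train pipeline defined earlier:

```python
for image, mask in train_images.take(2):
    display([image, mask])
```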

[Figure: Sample images and masks from the train dataset]

U-Net Architecture Implementation

In this section of the article, we will look at the TensorFlow implementation of the U-Net architecture for image segmentation. We will import the required libraries and build our U-Net image segmentation model from scratch.

Importing Libraries

For building the U-Net design, we will use the TensorFlow deep learning framework, as discussed already. We will therefore import the TensorFlow library and the Keras framework, which is now included in TensorFlow. As we have already covered in the U-Net architecture, the essential imports for the fundamental model structure are the convolutional layer, the max-pooling layer, an input layer, and the ReLU activation function. We will use a few additional layers as well: the Conv2DTranspose layer to upsample the decoder blocks, Concatenate layers to combine the necessary skip connections, and Batch Normalization layers to stabilize the training process.
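A sketch of these imports:

```python
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D, Activation,
                                     Conv2DTranspose, Concatenate,
                                     BatchNormalization)
from tensorflow.keras.models import Model
```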

Convolution Block

After importing the necessary libraries, we can continue constructing the U-Net architecture. This can be done in a single class by defining all of the parameters and values in the correct order until the very end, or it can be done in a few iterative blocks. Because it is easier for most users to comprehend the U-Net model architecture with a small number of blocks, I will use the latter approach. As depicted in the architecture's representation, the convolution operation block, the encoder block, and the decoder block will serve as our three iterative blocks. With these three blocks, we can construct the U-Net design effortlessly. Let's go through each of these function code blocks one at a time.

The convolution block carries out the primary operation of applying a double layer of convolution operations to the given input. The function has two arguments: the input to the convolution layer and the number of filters, which defaults to 64. In contrast to valid or unpadded convolutions, we use "same" padding to maintain identical shapes. Each convolutional layer is followed by a Batch Normalization layer; this modification to the original model is made to achieve the best possible results. Finally, a ReLU activation layer is added, as defined in the research paper. Let's look at the code block used to construct the convolution block.
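A sketch of the convolution block as described above, i.e., two Conv2D layers with "same" padding, each followed by Batch Normalization and ReLU:

```python
def convolution_block(inputs, num_filters=64):
    # First convolution, followed by Batch Normalization and ReLU.
    x = Conv2D(num_filters, 3, padding='same')(inputs)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    # Second convolution, followed by Batch Normalization and ReLU.
    x = Conv2D(num_filters, 3, padding='same')(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    return x
```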

The encoder and decoder blocks

The construction of the encoder and decoder blocks will be our next step. Both functions are quite simple. Starting from the topmost layer, the encoder architecture takes consecutive inputs. In the encoder function we define, the convolutional block, i.e., two convolutional layers with their respective Batch Normalization and ReLU layers, is followed by downsampling, as in the research paper. We employ a max-pooling layer and adhere to the paper's parameters, with strides equal to 2. Since we need the pre-pooling output for the skip connections, we return both the initial and the max-pooled outputs.
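A sketch of the encoder block, returning the skip-connection tensor alongside the downsampled output:

```python
def encoder_block(inputs, num_filters):
    # Convolution block output is kept for the skip connection.
    skip = convolution_block(inputs, num_filters)
    # Downsample with max pooling (stride 2), as in the paper.
    pooled = MaxPooling2D(pool_size=(2, 2), strides=2)(skip)
    return skip, pooled
```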

The decoder block takes three arguments: the receiving input, the input from the skip connection, and the number of filters for the particular building block. We upsample the input with our model's Conv2DTranspose layers. The merged value for the skip connection is obtained by concatenating the receiving input with the newly upsampled layers. Then, we apply our convolutional block operation to this combined tensor, move on to the next layer, and return the result.
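A sketch of the decoder block as described: transposed convolution, concatenation with the skip features, then a convolution block:

```python
def decoder_block(inputs, skip_features, num_filters):
    # Upsample the incoming feature map.
    x = Conv2DTranspose(num_filters, (2, 2), strides=2, padding='same')(inputs)
    # Merge with the matching encoder output via the skip connection.
    x = Concatenate()([x, skip_features])
    # Apply the double convolution block to the combined tensor.
    x = convolution_block(x, num_filters)
    return x
```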

The U-Net architecture

Many different blocks need to be processed, so if you try to design the full U-Net architecture from scratch in a single pass, you will find that the structure is rather large. However, by separating our respective functions into three independent code blocks (the convolutional operation, the encoder structure, and the decoder structure), we can quickly construct the U-Net architecture in just a few lines of code. The input layer we employ will hold the shape of our input image.

After this step, we will collect all primary and skip outputs and pass them on to the subsequent blocks. Then, we will construct the decoder architecture block by block until we reach the output, whose dimensions will correspond to our desired output. Finally, we will call the Keras functional API's Model to create our final model and return it, ready to perform any task with the U-Net architecture.
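A sketch of the full build, assuming four encoder/decoder stages as in the figure; the default output channel count of 3 is an assumption matching the three classes in the Oxford-IIIT Pet masks (pet, border, background):

```python
def build_unet(input_shape=(160, 160, 3), num_classes=3):
    inputs = Input(input_shape)

    # Contracting path: collect skip outputs and downsampled outputs.
    s1, p1 = encoder_block(inputs, 64)
    s2, p2 = encoder_block(p1, 128)
    s3, p3 = encoder_block(p2, 256)
    s4, p4 = encoder_block(p3, 512)

    # Bottleneck at the bottom of the "U".
    b = convolution_block(p4, 1024)

    # Expanding path: upsample and merge with the matching skip outputs.
    d1 = decoder_block(b, s4, 512)
    d2 = decoder_block(d1, s3, 256)
    d3 = decoder_block(d2, s2, 128)
    d4 = decoder_block(d3, s1, 64)

    # Per-pixel class probabilities.
    outputs = Conv2D(num_classes, 1, padding='same', activation='softmax')(d4)
    return Model(inputs, outputs, name='U-Net')
```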

Initiating the U-Net Model

Ensure that the shapes of your images are divisible by 16, or a multiple of 16. Because we use four max-pooling layers, each halving the spatial dimensions, we want to avoid running into odd-numbered shapes during the down-sampling process. As a result, it's best to use input sizes such as (48, 48), (80, 80), (160, 160), (256, 256), or (512, 512). Let's put our model structure to the test with an input shape of (160, 160, 3) and see what happens. The model summary and its associated plot can both be examined in the attached Jupyter Notebook. To illustrate the plot of the entire architecture, I will also include the model.png.

You can run the code snippet below to see the model plot and the model parameters. I have not executed it here because the output would take a lot of space to display.
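A minimal sketch; plot_model writes the architecture diagram to model.png:

```python
model = build_unet(input_shape=(160, 160, 3))
model.summary()
tf.keras.utils.plot_model(model, to_file='model.png', show_shapes=True)
```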

Compile and Train the Segmentation Model

We built the model in the sections above; in this stage, we will compile and train it. Before compiling the model, we must specify the loss function, the optimizer, and the metrics. Since our dataset is a multi-class dataset, I used sparse categorical cross-entropy as the loss function. Adam was chosen as the optimizer to propagate the error backward; Adam combines ideas from Root Mean Square Propagation (RMSProp) and the Adaptive Gradient Algorithm (AdaGrad). We chose accuracy as the metric for simplicity, but you can choose any metric that fits your problem statement. To save the model and generate predictions, we also use a checkpoint. The snippet below depicts the code for model compilation.
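A sketch of the compilation step. Since the data pipeline resizes images to 128 x 128, the model is rebuilt here with that input shape; the checkpoint filename is a hypothetical choice:

```python
# Rebuild the model with the pipeline's 128 x 128 input shape.
model = build_unet(input_shape=(128, 128, 3))

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=['accuracy'])

# Save the best weights seen during training (filename assumed).
checkpoint = tf.keras.callbacks.ModelCheckpoint('unet_pet_segmentation.h5',
                                                save_best_only=True)
```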

Our final step, after a successful compilation, is to train the model. The dataset has already been divided into training and testing sets.
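A sketch of the training call over the pipelines built earlier; the epoch count is an assumption:

```python
EPOCHS = 20  # assumed value; tune for your setup

history = model.fit(train_batches,
                    epochs=EPOCHS,
                    validation_data=test_batches,
                    callbacks=[checkpoint])
```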

Output

Test the U-Net Segmentation Model

In this section, we will generate predictions for the test dataset, which the model has not seen. We will feed the test set as input to the U-Net image segmentation model, predict, and finally display the result as shown below:
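A sketch of the prediction and display step; the create_mask helper (an added name) picks the most likely class per pixel so the prediction can be drawn like a mask:

```python
def create_mask(pred_mask):
    # Select the class with the highest probability for each pixel.
    pred_mask = tf.argmax(pred_mask, axis=-1)
    return pred_mask[..., tf.newaxis]

for images, masks in test_batches.take(1):
    pred_masks = model.predict(images)
    display([images[0], masks[0], create_mask(pred_masks)[0]])
```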

The image below shows the model output:

[Figure: Test output of the U-Net segmentation model]

What's Next?

In this article, we have trained U-Net on the Oxford-IIIT Pet dataset. You can try U-Net and CANet on different datasets, or go further and implement instance segmentation.

Conclusion

In this article, we learned how to train an image segmentation model using U-Net. The following are the takeaways from this article:

  • A brief description of the U-Net modeling technique, which is extremely useful for most current image segmentation tasks.
  • The main methods used to construct the U-Net architecture and its steps.
  • We tested our constructed U-Net architecture on a simple image segmentation problem.
  • U-Net can address some of the most complex segmentation problems in deep learning.
  • U-Net was initially designed to solve biomedical image segmentation problems.