Building a Neural Machine Translation Model in Keras


Overview

A Neural Machine Translation (NMT) model is a deep learning model used to translate text from one natural language to another. NMT models typically use an encoder-decoder architecture with attention. They are trained on large parallel corpora of text in the source and target languages and are widely used in applications such as machine translation and text-to-speech. They have been shown to produce more accurate translations than traditional machine translation models. In this article, we will learn how to translate text from one language to another by implementing a neural machine translation model in Keras.

What are We Building?

A Neural Machine Translation (NMT) model is a deep learning model used to translate text from one natural language to another. NMT models are based on neural networks, machine learning models loosely inspired by how the human brain processes information.

Neural Machine Translation (NMT) models typically use an encoder-decoder architecture, where the encoder takes the source sentence as input and generates a fixed-length representation (or context vector) that encodes the meaning of the input sentence. This context vector is then passed to the decoder, which generates the target sentence word by word, using the attention mechanism over the encoded source sentence and the previously generated target sentence.

Neural Machine Translation (NMT) models are trained on large parallel corpora of text in the source and target languages and use this training data to learn the relationships between words and phrases in the two languages. Once trained, the model can translate new sentences from the source to the target language.

Neural Machine Translation (NMT) models are widely used in various applications such as machine translation, text summarization, and text-to-speech. They have been shown to produce more accurate translations than traditional statistical machine translation models.

Pre-requisites

To build a Neural Machine Translation (NMT) model with Transformers in Keras, you will need the following:

  1. A good understanding of the transformer architecture and its components, such as the encoder, decoder, and attention mechanism.
  2. A dataset of parallel sentences in the source and target languages, which will be used to train and evaluate the model.
  3. A programming environment with Keras and TensorFlow installed.
  4. An understanding of the tokenization, indexing, padding, and masking techniques applied to the input dataset.
  5. Familiarity with the evaluation metrics used for NMT models, such as BLEU and ROUGE.

It's also recommended to have some experience with machine learning and deep learning concepts, as well as experience with Python programming.

How Are We Going to Build This?

To create and train a sequence-to-sequence Transformer in Keras, we will need to follow these general steps:

  • Prepare your dataset: Preprocess the source and target languages, such as tokenizing and indexing the words and creating training and test sets.
  • Import the required libraries: Import the transformer model from the TensorFlow or Keras library. You can use a pre-trained transformer model such as T5 or BERT, or create your transformer model from scratch.
  • Define your model's architecture: This includes creating the encoder, decoder, and attention layers. You can use the pre-trained transformer model as the encoder and create a custom decoder with a linear layer on top of the encoder to predict the target sequence.
  • Compile the model: Specify the loss function, optimizer, and evaluation metrics you want to use.
  • Train the model: Use the training dataset to fine-tune the transformer model and update the model's parameters until the performance on the validation dataset stops improving.
  • Evaluate the model: Use the test dataset to evaluate the performance of the fine-tuned transformer model.
  • Use the model: Use the fine-tuned transformer model to translate new sentences from the source to the target language.

It's important to note that training a transformer model with large datasets and long sequence lengths can be computationally expensive and require a significant amount of memory, so using powerful GPUs for training is recommended. It also helps to have a solid understanding of the transformer architecture before implementing it.

Final Output

The final output of a Neural Machine Translation (NMT) model with Transformers in Keras is a predicted sentence in the target language. In our case, the final output of the model is a sequence of words in English, translated from Portuguese.

The encoder part of the model takes the source sentence as input and generates a fixed-length representation (or context vector) that encodes the meaning of the input sentence. This context vector is then passed to the decoder, which generates the target sentence word by word using the attention mechanism over the encoded source sentence and the previously generated target sentence.

Output:

It's important to note that the output of the transformer-based NMT model can be post-processed for better results, for example by using beam search to generate multiple candidate translations with different scores and select the best one, rather than simple greedy decoding.

Requirements

This section will import several libraries, including logging, time, numpy, matplotlib, tensorflow_datasets, tensorflow and tensorflow_text.

  • logging is used for logging messages and events.
  • time is used to measure the execution time of code.
  • numpy is a library for working with arrays and matrices of numerical data.
  • matplotlib is a plotting library for creating static, animated, and interactive visualizations in Python.
  • tensorflow_datasets is a collection of datasets ready to use with TensorFlow.
  • tensorflow is an open-source machine learning framework developed by Google.
  • tensorflow_text is a library for preprocessing, tokenizing, and encoding natural language text within TensorFlow.
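
The import cell this section describes likely looks like the following sketch; note that tensorflow_text is imported for its side effect of registering the custom ops the tokenizer model needs:

```python
import logging
import time

import numpy as np
import matplotlib.pyplot as plt

import tensorflow_datasets as tfds
import tensorflow as tf
import tensorflow_text  # noqa: F401 -- registers the ops used by the tokenizer model
```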

Building a Neural Machine Translation Model in Keras

In this section, we will build a Neural Machine Translation model in Keras. An NMT model is a type of deep learning model used to translate text from one language to another. It consists of two main components: an encoder and a decoder. The encoder takes in the input text in the source language and converts it into a fixed-size representation called a "context vector". The decoder then takes this context vector and generates the output text in the target language, word by word. We will implement all the sub-parts of the NMT model in a step-by-step process.

Dataset and Preprocessing

Neural Machine Translation (NMT) is a machine learning approach used to translate text from one language to another. The quality of the translation depends largely on the quality of the training data and the preprocessing steps performed on the data. In this section, we will load and preprocess the TED talks dataset, which contains parallel Portuguese and English sentences.

Loading the Dataset:

The below code loads the TED talks dataset from the TensorFlow Datasets library. The dataset is called ted_hrlr_translate/pt_to_en and it contains Portuguese-English translations of TED talks. The with_info=True argument returns metadata about the dataset along with the examples, and as_supervised=True returns the dataset as a tuple of inputs and labels.

The dataset is split into two sets: a training set of roughly 52,000 examples and a validation set of roughly 1,200 examples. The train_examples variable holds the training set, which will be used to train the model, and the val_examples variable holds the validation set, which will be used to evaluate the performance of the trained model.
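
A minimal sketch of the loading code consistent with this description:

```python
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en',
                               with_info=True,
                               as_supervised=True)

train_examples, val_examples = examples['train'], examples['validation']
```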

Dataset Visualization:

The below code iterates over the training examples in the train_examples variable in batches of 3, and takes the first batch of examples. Then, for each batch, it prints the examples in Portuguese and English.

The batch(3) method splits the dataset into groups of 3 examples, and the take(1) method takes the first batch of examples.

For each batch, the loop for pt_examples, en_examples in train_examples.batch(3).take(1): yields the Portuguese and English examples separately, assigning them to the pt_examples and en_examples variables respectively.

Then, the code iterates over each example in the batch, converts it to a numpy array, decodes the bytes to a string using .decode('utf-8') and prints it. It then prints a new line to separate the Portuguese and English examples.
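
A sketch of the visualization loop being described:

```python
for pt_examples, en_examples in train_examples.batch(3).take(1):
  print('> Examples in Portuguese:')
  for pt in pt_examples.numpy():
    print(pt.decode('utf-8'))  # Decode the raw bytes to a readable string.
  print()

  print('> Examples in English:')
  for en in en_examples.numpy():
    print(en.decode('utf-8'))
```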

Output:

Tokenizer:

A tokenizer is a tool used to segment a text string into individual words, phrases, symbols, or other meaningful elements, known as tokens. Tokenization is breaking down a sentence or a piece of text into its words or symbols. This process is typically the first step in natural language processing (NLP) tasks, such as text classification, language translation, and text generation. For example, a word-based tokenizer would segment the sentence "I am an AI model" into the tokens ["I", "am", "an", "AI", "model"] and a character-based tokenizer would segment the sentence into the tokens ["I", " ", "a", "m", " ", "a", "n", " ", "A", "I", " ", "m", "o", "d", "e", "l"].

The below code downloads a pre-trained model called ted_hrlr_translate_pt_en_converter from a specified URL and saves it to the local file system with the file name ted_hrlr_translate_pt_en_converter.zip.

The model_name variable is set to the name of the model being downloaded.
tf.keras.utils.get_file() is a utility function provided by TensorFlow that downloads a file from a specified URL and saves it to the local file system.

The first argument passed to this function is the local file path where the downloaded file will be saved. In this case, it is set to {model_name}.zip The second argument is the URL from where the model will be downloaded.

The cache_dir argument is set to ., which means that the downloaded file will be saved in the current working directory. The cache_subdir argument is set to '', which means that the file will be saved directly in the cache directory and not in a subdirectory. The extract argument is set to True, which means that the downloaded file will be extracted (unzipped) into the cache_dir.
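
A sketch of the download call; the storage.googleapis.com URL below is the one used by the official TensorFlow tutorial this walkthrough mirrors, so treat it as an assumption:

```python
model_name = 'ted_hrlr_translate_pt_en_converter'
tf.keras.utils.get_file(
    f'{model_name}.zip',
    f'https://storage.googleapis.com/download.tensorflow.org/models/{model_name}.zip',
    cache_dir='.', cache_subdir='', extract=True
)
```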

Output:

This line of code is loading a saved model called ted_hrlr_translate_pt_en_converter using the tf.saved_model.load() function.

The tf.saved_model.load() function takes one argument, the path to the saved model, and returns the model's Python object. In this case, it's loading the tokenizer from the model, which is already downloaded and saved on the system.

The tokenizer object returned by this function contains the tokenizer for Portuguese and English, which can be used to tokenize text into subwords. The tokenizer object will have two attributes: en for English tokenizer and pt for Portuguese tokenizer.

The tokenizer can be used to tokenize text into subwords, which can be used as input for the translation model. This is necessary because the translation model was trained on subwords, not on whole words.
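
The load call itself is a single line (assuming model_name from the download step above):

```python
tokenizers = tf.saved_model.load(model_name)
```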

Data Pipeline:

Below is a function prepare_batch that takes two inputs, pt and en, which are the Portuguese and English text, respectively. The function tokenizes the text, trims it to MAX_TOKENS (which is set to 128 in this case), and converts it to a dense Tensor.

The function starts by tokenizing the Portuguese text, pt, using the tokenizer object tokenizers.pt.tokenize(pt). This returns a ragged tensor, which is a tensor with variable-length subword sequences.

It then trims the Portuguese text to MAX_TOKENS using pt = pt[:, :MAX_TOKENS], and converts it to a 0-padded dense Tensor using pt = pt.to_tensor().

The function then tokenizes the English text, en, using the tokenizer object tokenizers.en.tokenize(en), trims it to MAX_TOKENS+1 using en = en[:, :(MAX_TOKENS+1)].

Then the function makes a distinction between the input and label of the English text. It takes the input as the English text's first MAX_TOKENS subwords, dropping the last subword token, which represents the [END] token.

It takes the labels as the next MAX_TOKENS subwords of the English text, dropping the first subword token, which represents the [START] token.

Both inputs and labels are then converted to tensors so that they can be used as input for a model and returned as a tuple.

The returned value is a tuple pairing the model inputs, the Portuguese tensor and the English input tensor, with the English label tensor.
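
A sketch of prepare_batch consistent with the description above:

```python
MAX_TOKENS = 128

def prepare_batch(pt, en):
    pt = tokenizers.pt.tokenize(pt)      # Output is a ragged tensor.
    pt = pt[:, :MAX_TOKENS]              # Trim to MAX_TOKENS.
    pt = pt.to_tensor()                  # Convert to a 0-padded dense tensor.

    en = tokenizers.en.tokenize(en)
    en = en[:, :(MAX_TOKENS + 1)]
    en_inputs = en[:, :-1].to_tensor()   # Drop the [END] tokens.
    en_labels = en[:, 1:].to_tensor()    # Drop the [START] tokens.

    return (pt, en_inputs), en_labels
```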

The below function make_batches takes a dataset as input and returns a new dataset. This function is used to prepare the dataset for training.

The function starts by shuffling the dataset using ds.shuffle(BUFFER_SIZE), where BUFFER_SIZE is set to 20000. This randomizes the order of the elements in the dataset.

It then groups the elements of the dataset into batches of size BATCH_SIZE, which is set to 64. It does this using the .batch(BATCH_SIZE) method.

It then applies the prepare_batch function to each batch to tokenize and pad the text, and format it as inputs and labels. It does this using the .map(prepare_batch, tf.data.AUTOTUNE) method.

These two lines of code use the make_batches function to create two new datasets: train_batches and val_batches. The train_examples dataset is passed as an argument to the make_batches function to create the train_batches dataset, and the val_examples dataset is passed as an argument to create the val_batches dataset. Both of these new datasets are shuffled, batched, and ready for training or validation.
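
A sketch of make_batches and the two dataset definitions; the trailing prefetch call is an optional extra that overlaps preprocessing with training:

```python
BUFFER_SIZE = 20000
BATCH_SIZE = 64

def make_batches(ds):
  return (
      ds
      .shuffle(BUFFER_SIZE)
      .batch(BATCH_SIZE)
      .map(prepare_batch, tf.data.AUTOTUNE)
      .prefetch(buffer_size=tf.data.AUTOTUNE))

train_batches = make_batches(train_examples)
val_batches = make_batches(val_examples)
```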

Positional Encoding

The below function defines a positional encoding for the input sequences.

A positional encoding is added to the input embeddings to indicate the position of each token in the sequence. This is because, in the Transformer model, the self-attention mechanism doesn't take the order of tokens into account. So, the model needs some way to know the position of each token in the sequence.

The function takes two arguments: length and depth. The length argument is the length of the input sequence, and the depth argument is the dimension of the embedding.

The function first calculates the positions of each token in the sequence, an array of integers from 0 to length - 1. Then it calculates the depths, an array of floats running from 0 to 1, obtained by dividing the integers from 0 to depth/2 - 1 by depth/2.

The function then calculates the angle rates and angle radians, which are used to compute the sine and cosine values for each position and depth. The sine and cosine values are concatenated together and returned as a 2D array of shape (length, depth).

This encoding adds position information to the embeddings before they are passed to the model.
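
A sketch of the positional_encoding function as described:

```python
def positional_encoding(length, depth):
  depth = depth / 2

  positions = np.arange(length)[:, np.newaxis]      # (length, 1)
  depths = np.arange(depth)[np.newaxis, :] / depth  # (1, depth/2), values in [0, 1)

  angle_rates = 1 / (10000**depths)                 # (1, depth/2)
  angle_rads = positions * angle_rates              # (length, depth/2)

  # Concatenate sines and cosines along the depth axis: (length, depth).
  pos_encoding = np.concatenate([np.sin(angle_rads), np.cos(angle_rads)], axis=-1)

  return tf.cast(pos_encoding, dtype=tf.float32)
```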

The PositionalEmbedding class is a custom layer for the Transformer model. It takes in two parameters, vocab_size and d_model, which are used to create an embedding layer and a positional encoding. The embedding layer converts the input tokens (integers) into dense vectors (embeddings), and the positional encoding adds information about the position of the tokens in the input sequence to the embeddings. The class also has a compute_mask method and a call method. The compute_mask method propagates the input mask to the output of the layer. The call method applies the embedding and positional encoding to the input: it multiplies the embeddings by the square root of d_model to set the relative scale and then adds the positional encoding to the embeddings.

embed_pt and embed_en are instances of the PositionalEmbedding class that we have defined earlier. These instances are being created with the vocabulary size of the Portuguese and English tokenizers and a depth of 512. The call method of these instances will take in input tensors and will add positional encodings to the embeddings of each token in the input tensors.
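
A sketch of the PositionalEmbedding layer and the two instances; the maximum sequence length of 2048 used to precompute the encoding is an assumption:

```python
class PositionalEmbedding(tf.keras.layers.Layer):
  def __init__(self, vocab_size, d_model):
    super().__init__()
    self.d_model = d_model
    self.embedding = tf.keras.layers.Embedding(vocab_size, d_model, mask_zero=True)
    self.pos_encoding = positional_encoding(length=2048, depth=d_model)

  def compute_mask(self, *args, **kwargs):
    # Propagate the padding mask produced by the embedding layer.
    return self.embedding.compute_mask(*args, **kwargs)

  def call(self, x):
    length = tf.shape(x)[1]
    x = self.embedding(x)
    # Set the relative scale of the embedding and the positional encoding.
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x = x + self.pos_encoding[tf.newaxis, :length, :]
    return x

embed_pt = PositionalEmbedding(vocab_size=tokenizers.pt.get_vocab_size().numpy(), d_model=512)
embed_en = PositionalEmbedding(vocab_size=tokenizers.en.get_vocab_size().numpy(), d_model=512)
```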

The Base Attention Layer

The below code defines a custom layer called BaseAttention that wraps a Multi-Head Attention (MHA) layer, a Layer Normalization (LN) layer, and an Add layer from the TensorFlow Keras library. The Multi-Head Attention (MHA) layer is a powerful attention mechanism that allows the model to focus on different parts of the input when processing it. The Add layer adds the output of the MHA layer and the input of the layer together (a residual connection), and the LN layer normalizes the result. The kwargs argument is used to pass any additional parameters to the MHA layer, such as the number of heads or the key dimension.
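
A minimal sketch of the BaseAttention wrapper:

```python
class BaseAttention(tf.keras.layers.Layer):
  def __init__(self, **kwargs):
    super().__init__()
    self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
    self.layernorm = tf.keras.layers.LayerNormalization()
    self.add = tf.keras.layers.Add()
```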

The Global Self-attention Layer

The GlobalSelfAttention class is a subclass of the BaseAttention class, and it defines a call method that applies multi-head self-attention to the input tensor x. In this case, the query, key, and value inputs to the multi-head attention layer are all set to x. The attention output is then added to the input tensor and passed through a layer normalization layer before being returned as the output of the call method. This class applies self-attention to the input tensor globally, meaning the attention is applied across all the positions in the input tensor. The code is shown below.
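```python
class GlobalSelfAttention(BaseAttention):
  def call(self, x):
    attn_output = self.mha(query=x, value=x, key=x)
    x = self.add([x, attn_output])  # Residual connection.
    x = self.layernorm(x)
    return x
```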

The Feed Forward Network

The feed-forward network in the Transformer is a sub-component of each encoder and decoder layer that operates on the intermediate representation obtained from the self-attention mechanism. It has two fully-connected (dense) layers, one with a ReLU activation function and one without, and is used to add non-linearity to the intermediate representation before it is fed onward. The feed-forward network captures and models complex relationships between the input features.

This custom Tensorflow 2.x layer implements a feed-forward network, which is commonly used in Transformer models. It takes as input the number of neurons in the hidden dense layer and the number of neurons in the dense output layer (d_model), and an optional dropout rate. Then, it applies a sequence of operations, which include a ReLU activation, a dropout layer, and layer normalization, to the input x and returns the final output.
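
A sketch of the FeedForward layer as described (it also includes the residual add alongside the layer normalization):

```python
class FeedForward(tf.keras.layers.Layer):
  def __init__(self, d_model, dff, dropout_rate=0.1):
    super().__init__()
    self.seq = tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),
        tf.keras.layers.Dense(d_model),
        tf.keras.layers.Dropout(dropout_rate),
    ])
    self.add = tf.keras.layers.Add()
    self.layer_norm = tf.keras.layers.LayerNormalization()

  def call(self, x):
    x = self.add([x, self.seq(x)])  # Residual connection around the dense stack.
    x = self.layer_norm(x)
    return x
```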

The Encoder Layer

This custom Tensorflow 2.x layer implements an encoder layer of a Transformer model.

An encoder layer contains two sub-components: A self-attention mechanism and a feed-forward network.

The self-attention mechanism, represented by self_attention, is an instance of the GlobalSelfAttention layer. It computes self-attention over the input x and returns the attention-weighted representation.

The feed-forward network, represented by ffn, is an instance of the previously defined FeedForward layer. It takes the output from the self-attention mechanism and applies a sequence of dense layers and normalization to produce the final output.

The call method of this layer applies the self-attention mechanism and feed-forward network in sequence to the input x and returns the final output.
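
A sketch of the EncoderLayer combining the two sub-components; using d_model as the attention key dimension is an assumption:

```python
class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self, *, d_model, num_heads, dff, dropout_rate=0.1):
    super().__init__()
    self.self_attention = GlobalSelfAttention(
        num_heads=num_heads, key_dim=d_model, dropout=dropout_rate)
    self.ffn = FeedForward(d_model, dff)

  def call(self, x):
    x = self.self_attention(x)
    x = self.ffn(x)
    return x
```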

This is a custom Tensorflow 2.x layer that implements the encoder part of a Transformer model. It takes as input the number of layers, the number of neurons in the hidden dense layer (d_model), the number of heads in the multi-head attention mechanism (num_heads), the number of neurons in the feed-forward network, the vocabulary size of the input data (vocab_size), and an optional dropout rate.

It consists of several sub-components: a positional embedding layer, multiple encoder layers, and a dropout layer.

The positional embedding layer, represented by pos_embedding, is an instance of the PositionalEmbedding layer. It adds position information to the input token IDs and returns the embedded representation.

The multiple encoder layers, represented by enc_layers, are instances of the previously defined EncoderLayer layer. They apply self-attention and feed-forward networks to the input to obtain the intermediate representations.

The dropout layer, represented by dropout, is a dropout layer that adds dropout regularization to the intermediate representations.

The call method of this layer applies the positional embedding layer, multiple encoder layers, and the dropout layer in sequence to the input token IDs and returns the final output.
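
A sketch of the full Encoder as described:

```python
class Encoder(tf.keras.layers.Layer):
  def __init__(self, *, num_layers, d_model, num_heads, dff, vocab_size, dropout_rate=0.1):
    super().__init__()
    self.d_model = d_model
    self.num_layers = num_layers

    self.pos_embedding = PositionalEmbedding(vocab_size=vocab_size, d_model=d_model)
    self.enc_layers = [
        EncoderLayer(d_model=d_model, num_heads=num_heads,
                     dff=dff, dropout_rate=dropout_rate)
        for _ in range(num_layers)]
    self.dropout = tf.keras.layers.Dropout(dropout_rate)

  def call(self, x):
    x = self.pos_embedding(x)  # Shape: (batch, seq_len, d_model).
    x = self.dropout(x)
    for i in range(self.num_layers):
      x = self.enc_layers[i](x)
    return x
```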

The Decoder Layer

CausalSelfAttention
Causal Self-Attention is a mechanism used in Transformer models to model the dependencies between elements in a sequence in a causally aware manner. Unlike standard self-attention, which can attend to all positions in a sequence, causal self-attention only attends to positions at or before the current position in the sequence. This is typically achieved by masking the attention scores so that the attention mechanism cannot attend to future elements in the sequence. The masking can be implemented by precomputing the attention scores or by adding a mask to the attention layer during computation. Causal self-attention is used in the Transformer decoder to model the dependencies between elements in a sequence for NLP tasks such as machine translation, text generation, and language modeling.

This code defines a CausalSelfAttention class that implements a self-attention mechanism with a causal mask. The call method computes self-attention over the input x using a multi-head attention (MHA) layer with the use_causal_mask option set to True. This enforces the causality constraint, allowing the attention mechanism to attend only to positions at or before the current position in the sequence. The attention-weighted output is then added to the original x and normalized using a layer normalization layer. The resulting representation captures the dependencies between elements in the sequence in a causally aware manner.
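
A sketch of the class being described:

```python
class CausalSelfAttention(BaseAttention):
  def call(self, x):
    attn_output = self.mha(query=x, value=x, key=x,
                           use_causal_mask=True)  # Mask out future positions.
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x
```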

CrossAttention

Cross-attention in Transformers is a mechanism used to compute attention scores between two sets of inputs, typically between the decoder's query representation and the encoder's key/value representations. The attention scores are used to compute context-aware representations that capture relationships between the query and key/value inputs. In the Transformer architecture, cross-attention is implemented as a separate module, such as the CrossAttention class shown below.

This code defines a CrossAttention class that implements a multi-head attention mechanism. The call method computes the attention scores between the inputs "x" and "context" using a multi-head attention (MHA) layer. The attention scores are then used to compute the attention-weighted output, which is added to the original "x" and normalized using a layer normalization layer. The attention scores are stored in the last_attn_scores attribute for later use.
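
A sketch of the CrossAttention class:

```python
class CrossAttention(BaseAttention):
  def call(self, x, context):
    attn_output, attn_scores = self.mha(
        query=x, key=context, value=context,
        return_attention_scores=True)
    # Cache the attention scores for later inspection/plotting.
    self.last_attn_scores = attn_scores

    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x
```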

Decoder Layer

This is a custom implementation of a single decoder layer in a transformer-based neural network.

It has three sub-layers:

CausalSelfAttention: A self-attention layer to calculate attention scores between each time step in the input sequence (x) and apply the attention mechanism.

CrossAttention: a cross-attention layer to calculate attention scores between the input sequence (x) and a context sequence (context) and apply the attention mechanism.

FeedForward (FFN): a fully connected feed-forward neural network layer.

The call method takes two inputs, the input sequence (x) and the context sequence (context), applies the three sub-layers and returns the final output after the FFN layer. The last attention scores from the cross-attention layer are also stored for later use.
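
A sketch of the DecoderLayer; as in the encoder, using d_model as the attention key dimension is an assumption:

```python
class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self, *, d_model, num_heads, dff, dropout_rate=0.1):
    super().__init__()
    self.causal_self_attention = CausalSelfAttention(
        num_heads=num_heads, key_dim=d_model, dropout=dropout_rate)
    self.cross_attention = CrossAttention(
        num_heads=num_heads, key_dim=d_model, dropout=dropout_rate)
    self.ffn = FeedForward(d_model, dff)

  def call(self, x, context):
    x = self.causal_self_attention(x=x)
    x = self.cross_attention(x=x, context=context)
    # Keep the last cross-attention scores for later use.
    self.last_attn_scores = self.cross_attention.last_attn_scores
    x = self.ffn(x)
    return x
```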

This is a custom implementation of the decoder part in a transformer-based neural network.

It has several sub-layers:

PositionalEmbedding: a layer to add position information to the input tokens.

Dropout: a dropout layer to prevent overfitting by randomly dropping out a certain percentage of the input units.

DecoderLayer: a stack of decoder layer instances. Each layer is an instance of the DecoderLayer class described above.

The call method takes two inputs, the input sequence (x) and the context sequence (context), applies the positional embedding and dropout layers, and then applies the stack of decoder layers. The final output after all the decoder layers is returned. The last attention scores from the final decoder layer are also stored for later use.
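
A sketch of the Decoder as described:

```python
class Decoder(tf.keras.layers.Layer):
  def __init__(self, *, num_layers, d_model, num_heads, dff, vocab_size, dropout_rate=0.1):
    super().__init__()
    self.d_model = d_model
    self.num_layers = num_layers

    self.pos_embedding = PositionalEmbedding(vocab_size=vocab_size, d_model=d_model)
    self.dropout = tf.keras.layers.Dropout(dropout_rate)
    self.dec_layers = [
        DecoderLayer(d_model=d_model, num_heads=num_heads,
                     dff=dff, dropout_rate=dropout_rate)
        for _ in range(num_layers)]
    self.last_attn_scores = None

  def call(self, x, context):
    x = self.pos_embedding(x)  # Shape: (batch, target_seq_len, d_model).
    x = self.dropout(x)
    for i in range(self.num_layers):
      x = self.dec_layers[i](x, context)
    # Keep the attention scores from the final decoder layer.
    self.last_attn_scores = self.dec_layers[-1].last_attn_scores
    return x
```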

Transformers

This code defines a Transformer model for NLP tasks. It has two sub-models, an encoder and a decoder, which are instances of the Encoder and Decoder classes. The input to the model is a tuple of context and target text. The context is passed through the encoder to obtain the context representation, and then the target text and the context representation are passed through the decoder to obtain the final output. The final output is then processed through a final linear layer to produce the logits (unnormalized prediction scores), which the model returns.
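
A sketch of the Transformer model; deleting the Keras mask from the logits is an extra safeguard so the mask doesn't interfere with the custom masked loss defined later:

```python
class Transformer(tf.keras.Model):
  def __init__(self, *, num_layers, d_model, num_heads, dff,
               input_vocab_size, target_vocab_size, dropout_rate=0.1):
    super().__init__()
    self.encoder = Encoder(num_layers=num_layers, d_model=d_model,
                           num_heads=num_heads, dff=dff,
                           vocab_size=input_vocab_size, dropout_rate=dropout_rate)
    self.decoder = Decoder(num_layers=num_layers, d_model=d_model,
                           num_heads=num_heads, dff=dff,
                           vocab_size=target_vocab_size, dropout_rate=dropout_rate)
    self.final_layer = tf.keras.layers.Dense(target_vocab_size)

  def call(self, inputs):
    # Keras expects a single positional argument, so unpack (context, target).
    context, x = inputs
    context = self.encoder(context)  # (batch, context_len, d_model)
    x = self.decoder(x, context)     # (batch, target_len, d_model)
    logits = self.final_layer(x)     # (batch, target_len, target_vocab_size)
    try:
      # Drop the keras mask so it doesn't scale the losses/metrics.
      del logits._keras_mask
    except AttributeError:
      pass
    return logits
```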

Training Transformer

In this section, we will be training the Neural Translation Transformer, which will translate one language to another. The code below sets up some hyperparameters for training a Transformer model using TensorFlow's tf.keras API. Specifically:

  • num_layers specifies the number of layers in the Transformer encoder/decoder stack.
  • d_model specifies the size of the model's hidden state.
  • dff specifies the number of neurons in the feed-forward layer in each encoder/decoder block.
  • num_heads specifies the number of attention heads in each multi-head attention block.
  • dropout_rate sets the dropout rate to prevent overfitting.
  • optimizer sets the optimizer to use during training. In this case, it is the Adam optimizer with a specified learning rate and beta values of 0.9 and 0.98. The epsilon value sets the numerical stability term in the Adam optimizer.
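
The concrete values below (4 layers, a d_model of 128, and so on) are not given in the text; they are illustrative, and the warmup learning-rate schedule implementing the "specified learning rate" is likewise an assumption borrowed from the original Transformer paper:

```python
num_layers = 4
d_model = 128
dff = 512
num_heads = 8
dropout_rate = 0.1

transformer = Transformer(
    num_layers=num_layers,
    d_model=d_model,
    num_heads=num_heads,
    dff=dff,
    input_vocab_size=tokenizers.pt.get_vocab_size().numpy(),
    target_vocab_size=tokenizers.en.get_vocab_size().numpy(),
    dropout_rate=dropout_rate)

# Learning rate with linear warmup followed by inverse-square-root decay.
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
  def __init__(self, d_model, warmup_steps=4000):
    super().__init__()
    self.d_model = tf.cast(d_model, tf.float32)
    self.warmup_steps = warmup_steps

  def __call__(self, step):
    step = tf.cast(step, dtype=tf.float32)
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)
    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

learning_rate = CustomSchedule(d_model)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)
```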

The below code defines two functions for evaluating the performance of a sequence-to-sequence model, such as a Neural Machine Translation (NMT) model:

  • masked_loss: This function calculates the loss between the target and predicted sequences. It uses the SparseCategoricalCrossentropy loss, which is a suitable loss function for multi-class classification problems where the classes are mutually exclusive (i.e., each sequence position can only belong to one class). The mask variable creates a binary mask to exclude padding tokens (represented by 0) from the loss calculation. The final loss is calculated as the average loss per non-padding token.
  • masked_accuracy: This function calculates the accuracy of the predicted sequences compared to the target sequences. It converts the predicted sequences into class indices (using tf.argmax) and compares them to the target sequences. The matching variable holds the binary values indicating whether each position in the target and predicted sequences match. The mask variable creates a binary mask to exclude padding tokens from the accuracy calculation. The final accuracy is calculated as the sum of correct predictions divided by the sum of non-padding tokens.
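
A sketch of the two functions consistent with the descriptions above:

```python
def masked_loss(label, pred):
  mask = label != 0  # True for non-padding positions.
  loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
      from_logits=True, reduction='none')
  loss = loss_object(label, pred)

  mask = tf.cast(mask, dtype=loss.dtype)
  loss *= mask
  # Average loss per non-padding token.
  return tf.reduce_sum(loss) / tf.reduce_sum(mask)


def masked_accuracy(label, pred):
  pred = tf.argmax(pred, axis=2)
  label = tf.cast(label, pred.dtype)
  match = label == pred

  mask = label != 0
  match = match & mask

  match = tf.cast(match, dtype=tf.float32)
  mask = tf.cast(mask, dtype=tf.float32)
  # Correct predictions divided by the number of non-padding tokens.
  return tf.reduce_sum(match) / tf.reduce_sum(mask)
```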

The code below compiles the Transformer model using the tf.keras.Model compile method.

The arguments passed to the compile method are:

  • loss: The loss function to use during training is the masked_loss function defined earlier.
  • optimizer: The optimizer to use during training, which is the optimizer variable defined earlier.
  • metrics: A list of metrics to use during training and evaluation, which is only the masked_accuracy function defined earlier.

After calling compile, the model is ready for training on the parallel corpora using the fit method.
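
The compile call itself:

```python
transformer.compile(
    loss=masked_loss,
    optimizer=optimizer,
    metrics=[masked_accuracy])
```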

The below code trains the compiled Transformer model on the parallel corpora (train_batches) for 5 epochs using the fit method. In addition, the validation_data argument is set to val_batches to enable the model to be evaluated on validation data after each epoch.

The fit method trains the model by iterating over the training data, making predictions using the current model parameters, computing the loss between the predicted and actual sequences, and updating the model parameters to minimize the loss. This process is repeated for the specified number of epochs.

You may need to adjust the number of epochs and other hyperparameters to perform well on your problem and data.
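
The training call as described:

```python
transformer.fit(train_batches,
                epochs=5,
                validation_data=val_batches)
```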

Outputs:

Run Inference

Below is a TensorFlow (tf) implementation of a Translator module that translates a sentence from Portuguese to English. The Translator module is initialized with two tokenizers (one for Portuguese and one for English) and a Transformer model. The input sentence is first tokenized using the Portuguese tokenizer. Then, the [START] and [END] tokens are added to the sentence and passed through the Transformer model to translate the sentence into English. The resulting tokens are then detokenized using the English tokenizer, and the attention weights of the last iteration of the loop are also calculated. The output is a tuple of the translated sentence (text), the tokenized form of the translated sentence (tokens), and the attention weights.
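
The listing is reconstructed below as a sketch. It performs greedy decoding with a tf.TensorArray so that the loop can be traced by tf.function; the exact structure (names like output_array included) is an assumption:

```python
class Translator(tf.Module):
  def __init__(self, tokenizers, transformer):
    super().__init__()
    self.tokenizers = tokenizers
    self.transformer = transformer

  def __call__(self, sentence, max_length=MAX_TOKENS):
    # The input sentence is Portuguese; tokenizing adds the [START]/[END] tokens.
    assert isinstance(sentence, tf.Tensor)
    if len(sentence.shape) == 0:
      sentence = sentence[tf.newaxis]
    encoder_input = self.tokenizers.pt.tokenize(sentence).to_tensor()

    # The output language is English; initialize it with the English [START] token.
    start_end = self.tokenizers.en.tokenize([''])[0]
    start = start_end[0][tf.newaxis]
    end = start_end[1][tf.newaxis]

    # A tf.TensorArray (rather than a Python list) lets tf.function trace the loop.
    output_array = tf.TensorArray(dtype=tf.int64, size=0, dynamic_size=True)
    output_array = output_array.write(0, start)

    for i in tf.range(max_length):
      output = tf.transpose(output_array.stack())
      predictions = self.transformer([encoder_input, output], training=False)
      predictions = predictions[:, -1:, :]            # Logits for the last position.
      predicted_id = tf.argmax(predictions, axis=-1)  # Greedy choice.
      output_array = output_array.write(i + 1, predicted_id[0])
      if predicted_id == end:
        break

    output = tf.transpose(output_array.stack())
    text = self.tokenizers.en.detokenize(output)[0]  # The translated sentence.
    tokens = self.tokenizers.en.lookup(output)[0]

    # Re-run the model to recalculate the attention weights of the last iteration.
    self.transformer([encoder_input, output[:, :-1]], training=False)
    attention_weights = self.transformer.decoder.last_attn_scores

    return text, tokens, attention_weights
```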

The print_translation function takes 3 parameters as input: sentence, tokens, and ground_truth. It then prints the input sentence, the predicted tokens, and the ground truth in a readable format. The tokens are first decoded from UTF-8 encoding before being printed.
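
A sketch of print_translation, followed by a hypothetical example call (the sample sentence and ground truth are illustrative):

```python
def print_translation(sentence, tokens, ground_truth):
  print(f'{"Input:":15s}: {sentence}')
  print(f'{"Prediction":15s}: {tokens.numpy().decode("utf-8")}')
  print(f'{"Ground truth":15s}: {ground_truth}')

translator = Translator(tokenizers, transformer)

sentence = 'este é um problema que temos que resolver.'
ground_truth = 'this is a problem we have to solve .'

translated_text, translated_tokens, attention_weights = translator(tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)
```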

Output:

Export Model

The ExportTranslator class is a subclass of tf.Module that wraps a Translator object and exports it as a TensorFlow serving model. It has one method, __call__, which takes a single string tensor argument and returns the translated text as a string tensor. The method is decorated with @tf.function, which means it will be converted into a TensorFlow graph and executed as a tensor computation. The input_signature argument to the tf.function decorator defines the expected shape and type of the input tensor.
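
A sketch of the wrapper class:

```python
class ExportTranslator(tf.Module):
  def __init__(self, translator):
    super().__init__()
    self.translator = translator

  @tf.function(input_signature=[tf.TensorSpec(shape=[], dtype=tf.string)])
  def __call__(self, sentence):
    # Only the translated text is returned in the serving signature.
    (result, tokens, attention_weights) = self.translator(sentence,
                                                          max_length=MAX_TOKENS)
    return result
```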

This code will save a TensorFlow SavedModel of the ExportTranslator class to the directory translator. The saved model will contain a single signature for the method call that takes in a string tensor of shape [] and returns a string tensor. The method inside ExportTranslator is decorated with tf.function, so the model will be optimized for deployment and can be executed as a TensorFlow graph.
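
A sketch of the export call, with a hypothetical reload showing how the served model is invoked (the sample sentence is illustrative):

```python
translator = ExportTranslator(translator)
tf.saved_model.save(translator, export_dir='translator')

# The saved model can be reloaded and called like a function:
reloaded = tf.saved_model.load('translator')
reloaded('este é o primeiro livro que eu fiz.').numpy()
```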

Conclusion

In this article, we have learned how to build the Neural Machine Translation Model in Keras and Tensorflow.

The below points summarize the article:

  • Neural Machine Translation is a machine translation approach that uses deep neural networks to translate natural language text. The main components of an NMT system are the encoder, decoder, and attention mechanism.
  • The encoder converts the source language sentence into a continuous representation, and the decoder generates the target language sentence based on the encoder's output.
  • The attention mechanism lets the decoder selectively focus on different parts of the encoder's output while generating the target sentence.
  • NMT systems are trained on large parallel corpora of sentence pairs to learn the translation between languages.
  • NMT systems have significantly improved over traditional machine translation systems, achieving state-of-the-art performance on several benchmark datasets.