Introduction to Model Optimization in Deep Learning

Overview

Model optimization is one of the most important parts of training a Machine Learning (ML) or Deep Learning (DL) model. Optimization aims to minimize the difference between the Model's predicted output and the actual output, as measured by the loss function. Optimization can be carried out through various algorithms such as gradient descent, stochastic gradient descent, etc. Additionally, regularization can prevent overfitting, which occurs when a model is too complex and performs well on the training data but poorly on unseen data. In this article, we will discuss Model Optimization.

Introduction

Model Optimization is one of the crucial steps after Model training. However, in general, we cannot know in advance how the Model will respond if we change the value of a hyperparameter. In this article, we will study Model Optimization Frameworks and hyperparameters.

What is a Hyper-Parameter?

In machine learning and deep learning, a model is represented by its parameters. In contrast, the best/optimal hyperparameters are chosen for the training process so that the learning algorithm can produce the best results. What exactly are these hyperparameters, then? Hyperparameters are user-defined settings that explicitly regulate the learning process.

The word "hyper" in this context denotes top-level settings that are utilized to regulate the learning process. Before the learning algorithm starts training the Model, the machine learning engineer chooses and sets the value of the Hyperparameter. As a result, they are independent of the Model and their values cannot be altered during training.

Some examples of hyper-parameters are the batch size, the number of neurons, the number of layers, the train-test split, the optimizer, the activation function, etc.

What is Hyper-Parameter Tuning?

Hyperparameter tuning is the process of selecting the optimal set of hyperparameters for a machine learning model. Hyperparameters are parameters not learned from the data during training but set by the practitioner before training. Examples of hyperparameters include the learning rate, the number of hidden layers, and the batch size in neural networks. Hyperparameter tuning can be done manually or through techniques such as grid search or random search. The goal is to find the set of hyperparameters that results in the best performance of the Model on a validation set.

Model Parameters and Model Hyperparameters

Hyperparameters (also known as model hyperparameters) and model parameters are frequently confused. Let's examine the differences between the two and how they relate to one another to eliminate this misconception.

Model Parameters

A model automatically learns its model parameters, which are internal configuration variables. Examples include the weights or coefficients of the independent variables in a linear regression model, the weights and biases of a neural network, or the centroids of the clusters in a clustering algorithm. The following are some crucial considerations about model parameters:

  • The Model employs them while making predictions.
  • They are typically not established manually and are learned by the Model from the data.
  • These comprise the Model's core and are essential to a machine learning algorithm.

Model Hyperparameters

Hyperparameters are parameters that the user has explicitly set to regulate the learning process. The following are some crucial considerations about model hyperparameters:

  • The machine learning engineer normally defines these manually.
  • The precise optimal hyperparameter values for a given problem cannot be known in advance. However, general rules of thumb or a process of trial and error can be used to identify good values.
  • The rate at which a neural network learns during training is one example of a hyperparameter.

Hyperparameter Bifurcation

Hyperparameters can be broadly split into the following two categories. These categories are discussed below:

Hyperparameters for Optimization

These hyperparameters are used to optimize the Model. A few of the most important hyperparameters for model optimization are summarized below:

Learning Rate: The learning rate is a hyperparameter of the optimization algorithm that regulates how much the Model's weights are changed in response to the estimated error each time they are updated. If the learning rate is too high, the Model may overshoot the optimum and fail to converge; if it is too low, the training process may be slowed down considerably. Choosing a good learning rate is therefore a balance between the two.

Batch Size: The training set is divided into multiple subsets, called batches, to speed up the learning process. The batch size determines how many training samples are processed before the Model's weights are updated.

Number of Epochs: One complete pass of the training data through the machine learning model is called an epoch, and training proceeds as an incremental learning process over many epochs. The appropriate number of epochs differs from model to model and is chosen by monitoring the validation error: the number of epochs can be increased as long as the validation error keeps decreasing, and it is recommended to stop increasing the number of epochs if the validation error does not improve over successive epochs.
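
To make these concrete, here is a minimal Keras sketch (the data arrays x_train and y_train are hypothetical placeholders) showing where the learning rate, batch size, and number of epochs are set:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A small fully connected network on hypothetical 20-feature data.
model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),
    layers.Dense(1, activation="sigmoid"),
])

# The learning rate is a hyperparameter of the optimizer.
optimizer = keras.optimizers.Adam(learning_rate=1e-3)
model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])

# The batch size and the number of epochs are hyperparameters of the training loop.
# x_train and y_train are assumed to exist (e.g., NumPy arrays).
model.fit(x_train, y_train, batch_size=32, epochs=10, validation_split=0.2)
```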

Hyperparameters Specific to Models

Model-specific hyperparameters are integral to the Model's structure. Two of the most important ones are discussed below:

Number of Hidden Neurons: Hidden units are the processing units in the layers that sit between a neural network's input and output layers. The number of hidden units is a hyperparameter that must be specified. A common rule of thumb is to choose a value somewhere between the sizes of the input and output layers; another is to use roughly the sum of the input and output layer sizes as an upper bound.

Number of Layers: Layers are the vertically stacked building blocks of a neural network. There are primarily input layers, hidden layers, and output layers. For many problems, a 3-layer neural network performs better than a 2-layer one.

What is Model Optimisation?

Deep learning is a branch of machine learning used to carry out difficult tasks like text classification and speech recognition. A deep learning model is composed of an input layer, hidden layers, an output layer, activation functions, a loss function, and other components. Any deep learning model tries to generalize from the training data and make predictions on previously unseen data. To map examples of inputs to examples of outputs, we require both a model and an optimization method: the optimization method determines the values of the parameters (weights) that minimize the error of that mapping. Hyperparameters significantly impact the effectiveness of the deep learning model and also affect its training speed. Model Optimization can therefore be used interchangeably with hyperparameter tuning, where we find the best values for the Model's hyperparameters.

Why Model Optimisation?

The hyperparameters sampled from each defined distribution are used to create the Model. These values control the number of layers, the number of nodes in each layer, the dropout probability of a layer, the learning rate, the optimizer, the learning-rate decay of the optimization method, and the activation function. While the deep learning model is being trained, the weights must be adjusted each epoch so that the loss function is minimized.

The loss depends on the model architecture, and the model architecture depends on hyperparameters such as the activation function, the number of nodes in a layer, the number of layers, the loss function, the optimizer, the learning rate, and many others. Technically, no one can know in advance what will result from altering specific hyperparameters. Therefore, we need to select the best and most efficient hyperparameter values to minimize the loss and improve the model metrics relevant to the problem statement. The final goal of Model Optimization is to improve the quality of the Model's predictions.

How to Do Model Optimisation?

Understanding the hyperparameters and the business use case is necessary to select the best set. There are two ways to establish them.

Manual Optimization

When performing manual hyperparameter tuning, you carry out each trial with a set of hyperparameters yourself. This method calls for a robust experiment tracker that can monitor a wide range of data, including images, logs, and system metrics.

The benefits of manually optimizing hyperparameters are as follows:

  • Manually adjusting hyperparameters gives you more control over the procedure.
  • Doing it manually would make sense if you were researching or examining tuning and how it affects the network weights.

Manual hyperparameter tuning has the following drawbacks:

  • Manual tuning might involve numerous attempts, and keeping track of them can be expensive and time-consuming.
  • This strategy is not particularly useful when there are numerous hyperparameters to consider.

Automated Optimization

Automated Hyperparameter tuning carries out the operation using pre-existing algorithms. You take the following actions:

  • Initially, define a collection of hyperparameters and constraints on their values (note: every algorithm requires this set to be a specific data structure; e.g., dictionaries are common when working with these algorithms).
  • The algorithm then takes care of the labor-intensive work. It performs those tests and provides the optimum set of hyperparameters for the best outcomes.

Model Parameter vs. Hyper-parameter

This section discusses the differences between model parameters and hyperparameters.

  • Model parameters are necessary to make a prediction; hyperparameters are necessary for estimating the model parameters.
  • Model parameters are estimated by optimization algorithms such as Gradient Descent, Adam, and Adagrad; hyperparameters are estimated by tweaking their values and re-evaluating the Model.
  • Model parameters are not adjusted manually; hyperparameters are adjusted manually.
  • The final parameters found after training determine how well the Model performs on new data; the selection of hyperparameters determines how effective the training itself is. For example, the learning rate in gradient descent determines how effective and precise the optimization process is in estimating the parameters.

Model Optimization Frameworks in Deep Learning

In this section, we will give an overview of the various state-of-the-art Model Optimization Frameworks that are widely used. A few of the most important Frameworks are mentioned below:

  • Ray-Tune

    Ray Tune is an open-source library for distributed hyperparameter tuning of machine learning models. It is built on top of Ray, a library for distributed computing. Ray Tune provides an easy-to-use interface for specifying the hyperparameters you want to tune and their possible ranges of values, and then uses a search algorithm, such as random search, grid search, or Bayesian Optimization, to find the best set of hyperparameters for a given model and dataset.

    Ray Tune supports several optimization methods such as random search, grid search, and Bayesian optimization algorithms like TPE and Hyperband. It also allows you to define your own custom search algorithm if needed. The library also lets you handle constraints and variable types such as continuous, categorical, and integer.

    Ray Tune also allows parallel execution of trials and uses Ray to manage resources and perform distributed execution. It lets you run experiments on a local machine or a cluster, supports early stopping, and allows you to save and resume trials. Here is an example of how to use Ray Tune to perform Optimization for a Keras model:
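
    The snippet below is a minimal sketch rather than a definitive implementation; it assumes hypothetical arrays x_train, y_train, x_val, y_val and uses the classic tune.run interface:

    ```python
    from ray import tune
    from tensorflow import keras
    from tensorflow.keras import layers

    def create_model(config):
        # Build a small network whose hidden size and optimizer come from the trial config.
        model = keras.Sequential([
            layers.Dense(config["hidden_size"], activation="relu", input_shape=(20,)),
            layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer=config["optimizer"],
                      loss="binary_crossentropy", metrics=["accuracy"])
        # x_train, y_train, x_val, y_val are assumed to exist.
        history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                            epochs=5, verbose=0)
        # Report the metric that Ray Tune should optimize.
        tune.report(val_accuracy=history.history["val_accuracy"][-1])

    analysis = tune.run(
        create_model,
        config={
            "optimizer": tune.grid_search(["adam", "sgd"]),
            "hidden_size": tune.grid_search([32, 64, 128]),
        },
    )
    print(analysis.get_best_config(metric="val_accuracy", mode="max"))
    ```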

    In this example, we define a function create_model that builds our neural network with the given hyperparameters. Then we call the tune.run function of Ray Tune and pass it the create_model function along with the configuration of the hyperparameters. Finally, we use grid search to try different values of optimizer and hidden_size.

  • Optuna

    Optuna is an open-source library for the hyperparameter tuning of machine-learning models. First, it provides an easy-to-use interface for specifying the hyperparameters you want to tune and their possible range of values. Then it uses a powerful tree-structured Parzen Estimator (TPE) algorithm to find the best set of hyperparameters for a given model and dataset.

    Optuna supports several optimization methods, such as random search, grid search, and the TPE algorithm. TPE is a Bayesian optimization algorithm that uses a probabilistic model to predict the performance of different combinations of hyperparameters and selects the next set of hyperparameters to try based on the predicted improvement. It also allows you to define your custom search algorithm if needed. The library also allows you to handle constraints and variables such as continuous, categorical, and integer.

    Optuna also provides built-in integration with several machine learning libraries such as Keras, TensorFlow, XGBoost, LightGBM, and more. It also supports parallelization, early stopping (pruning), and saving and resuming trials. Here is an example of how to use Optuna to perform Optimization for a Keras model:
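
    A minimal sketch under stated assumptions (hypothetical arrays x_train, y_train, x_val, y_val, x_test, y_test, and the optuna.integration.KerasPruningCallback integration being available) could look like this:

    ```python
    import keras
    from keras import layers
    import optuna
    from optuna.integration import KerasPruningCallback

    def objective(trial):
        # Sample the hyperparameters for this trial.
        hidden_size = trial.suggest_int("hidden_size", 32, 256, step=32)
        optimizer = trial.suggest_categorical("optimizer", ["adam", "sgd"])

        model = keras.Sequential([
            layers.Dense(hidden_size, activation="relu", input_shape=(20,)),
            layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])

        # KerasPruningCallback stops unpromising trials early based on the validation loss.
        model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10, verbose=0,
                  callbacks=[KerasPruningCallback(trial, "val_loss")])

        # Return the metric to maximize: accuracy on the test set.
        _, accuracy = model.evaluate(x_test, y_test, verbose=0)
        return accuracy

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100)
    print(study.best_params, study.best_value)
    ```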

    In this example, we define an objective function, which Optuna will optimize. The function creates the Model, trains it, and evaluates the Model's performance on the test set. KerasPruningCallback stops training trials that perform poorly based on the validation loss. Then we create a study, set the optimization direction as maximizing, and run the Optimization with 100 trials.

    Once the Optimization is complete, the best combination of hyperparameters can be accessed by calling study.best_params, and the best value of the objective function can be accessed by calling study.best_value. Optuna also provides a web-based visualization tool to analyze the results of the Optimization, which can be used to understand the relationship between the hyperparameters and the objective function.

  • Hyperopt

    HyperOpt is an open-source library for hyperparameter tuning of machine learning models. HyperOpt provides an easy-to-use interface for specifying the hyperparameters you want to tune and their possible ranges of values. It then uses advanced optimization algorithms such as the Tree-structured Parzen Estimator (TPE) and Random Search to find the best set of hyperparameters for a given model and dataset.

    HyperOpt allows you to define custom search spaces, objectives, and constraints and provides several built-in optimization algorithms like TPE, Random Search, and Annealing. HyperOpt also allows you to define your own custom optimization algorithm if needed.

    Here is an example of how to use HyperOpt to perform TPE optimization for a Keras model:
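
    A minimal sketch (the arrays x_train, y_train, x_val, y_val are hypothetical placeholders) might look like this:

    ```python
    import keras
    from keras import layers
    from hyperopt import fmin, tpe, hp, STATUS_OK, Trials, space_eval

    # Search space: the possible values for optimizer and hidden_size.
    space = {
        "optimizer": hp.choice("optimizer", ["adam", "sgd"]),
        "hidden_size": hp.choice("hidden_size", [32, 64, 128]),
    }

    def create_model(params):
        model = keras.Sequential([
            layers.Dense(params["hidden_size"], activation="relu", input_shape=(20,)),
            layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer=params["optimizer"],
                      loss="binary_crossentropy", metrics=["accuracy"])
        model.fit(x_train, y_train, epochs=5, verbose=0)  # x_train, y_train assumed to exist
        _, accuracy = model.evaluate(x_val, y_val, verbose=0)
        # Hyperopt minimizes the objective, so return the negative accuracy.
        return {"loss": -accuracy, "status": STATUS_OK}

    trials = Trials()
    best = fmin(fn=create_model, space=space, algo=tpe.suggest, max_evals=10, trials=trials)
    print(space_eval(space, best))  # map hp.choice indices back to actual values
    ```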

    In this example, we define a function create_model that builds and evaluates our neural network with the given hyperparameters, and a space that specifies the possible values for optimizer and hidden_size. Then we call the fmin function of Hyperopt and pass it the create_model function, the search space, the TPE algorithm, and max_evals of ten. Finally, we extract the best hyperparameters using the hyperopt.space_eval() function.

  • Bayesian Optimisation

    Bayesian Optimization is a global optimization method well-suited for optimizing expensive black-box functions, such as the performance of a machine learning model with a set of hyperparameters. It is a probabilistic method that uses a Bayesian model to predict the performance of different combinations of hyperparameters and selects the next set of hyperparameters to try based on the predicted improvement.

    The process starts by building a probabilistic model (usually a Gaussian process) of the objective function (the Model's performance) based on an initial set of evaluations. This surrogate model is then used to predict the performance of new sets of hyperparameters, and an acquisition function is used to decide the next set of hyperparameters to evaluate. The process is repeated until a satisfactory set of hyperparameters is found.

    The main advantage of Bayesian Optimization over other optimization methods is its ability to effectively balance exploration (trying new sets of hyperparameters) and exploitation (trying sets of hyperparameters predicted to perform well). Furthermore, it has been shown that Bayesian Optimization can find good solutions faster than other optimization methods like grid search or random search for high-dimensional and expensive-to-evaluate problems. It is implemented in libraries such as GPyOpt, Optuna, SHERPA, and many more.

  • SHERPA

    SHERPA is a library for hyperparameter tuning of machine learning models. SHERPA provides an easy-to-use interface for specifying the hyperparameters you want to tune and their possible range of values. It then uses a powerful algorithm to find the best set of hyperparameters for a given model and dataset.

    SHERPA supports several optimization methods such as grid search, random search, and advanced optimization algorithms like Bayesian Optimization and Multi-Objective Optimization. It also allows you to define your custom search algorithm if needed.

    It also provides a parallelization mechanism that allows you to run multiple trials simultaneously, which can significantly reduce the time required to find the optimal set of hyperparameters.

    Here is an example of how to use SHERPA to perform a grid search for a Keras model:
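
    A minimal sketch using SHERPA's Study interface (with hypothetical arrays x_train, y_train, x_val, y_val) might look like this:

    ```python
    import sherpa
    import keras
    from keras import layers

    # Parameter space: the possible values for optimizer and hidden_size.
    parameters = [
        sherpa.Choice("optimizer", ["adam", "sgd"]),
        sherpa.Choice("hidden_size", [32, 64, 128]),
    ]
    algorithm = sherpa.algorithms.GridSearch()
    study = sherpa.Study(parameters=parameters, algorithm=algorithm,
                         lower_is_better=False, disable_dashboard=True)

    for trial in study:
        # Build and train a model with this trial's hyperparameters.
        model = keras.Sequential([
            layers.Dense(trial.parameters["hidden_size"], activation="relu", input_shape=(20,)),
            layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer=trial.parameters["optimizer"],
                      loss="binary_crossentropy", metrics=["accuracy"])
        model.fit(x_train, y_train, epochs=5, verbose=0)  # x_train, y_train assumed to exist
        _, accuracy = model.evaluate(x_val, y_val, verbose=0)
        # Report the result back to the study and close out the trial.
        study.add_observation(trial=trial, iteration=1, objective=accuracy)
        study.finalize(trial)

    print(study.get_best_result())
    ```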

    In this example, we define a parameter space that specifies the possible values for optimizer and hidden_size, create a GridSearch algorithm, and pass both to a SHERPA Study. We then loop over the trials suggested by the study, build and train a model with each trial's hyperparameters, and report the resulting accuracy back to the study with add_observation.

    Once the Optimization is complete, the best combination of hyperparameters can be accessed by calling study.get_best_result(). SHERPA can also be used with other machine learning libraries such as scikit-learn, XGBoost, etc.

  • GPyOpt

    GPyOpt is an open-source library for Bayesian Optimization that can be used to optimize the hyperparameters of machine learning models. GPyOpt is built on GPy, a Python library for Gaussian process modeling. It allows you to perform efficient global Optimization of expensive black-box functions.

    GPyOpt provides several acquisition functions, such as Expected Improvement, Probability of Improvement, and Upper Confidence Bound. It also allows you to define your custom acquisition function if needed. The library also allows you to handle constraints and variables such as continuous, categorical, and integer. Here is an example of how to use GPyOpt to perform Bayesian Optimization for a Keras model:
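
    A minimal sketch (with hypothetical arrays x_train, y_train, x_val, y_val; the categorical optimizer is encoded as an index, since GPyOpt works on numeric domains) could look like this:

    ```python
    import GPyOpt
    import keras
    from keras import layers

    optimizer_names = ["adam", "sgd"]  # categorical values encoded as indices 0 and 1

    def create_model(optimizer, hidden_size):
        model = keras.Sequential([
            layers.Dense(hidden_size, activation="relu", input_shape=(20,)),
            layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
        return model

    def model_performance(x):
        # GPyOpt passes a 2D array of candidate points; evaluate the first (only) row.
        optimizer = optimizer_names[int(x[0, 0])]
        hidden_size = int(x[0, 1])
        model = create_model(optimizer, hidden_size)
        model.fit(x_train, y_train, epochs=5, verbose=0)  # x_train, y_train assumed to exist
        _, accuracy = model.evaluate(x_val, y_val, verbose=0)
        return -accuracy  # GPyOpt minimizes, so return the negative accuracy

    bounds = [
        {"name": "optimizer", "type": "categorical", "domain": (0, 1)},
        {"name": "hidden_size", "type": "discrete", "domain": (32, 64, 128)},
    ]

    optimizer = GPyOpt.methods.BayesianOptimization(
        f=model_performance, domain=bounds, acquisition_type="EI")
    optimizer.run_optimization(max_iter=10)
    print(optimizer.x_opt)  # best hyperparameter combination found
    ```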

    In this example, we define a function create_model that builds our neural network with the given hyperparameters and a function model_performance that trains the Model and returns the negative of the accuracy as the performance metric. We then specify the bounds for the optimizer and hidden_size hyperparameters in the bounds variable. Then we create an instance of the BayesianOptimization class and pass the model_performance function, the bounds, and the acquisition type as EI. The run_optimization method is then called on the optimizer object with the maximum number of iterations.

    Once the Optimization is complete, the best combination of hyperparameters can be accessed by calling optimizer.x_opt. GPyOpt can also optimize the hyperparameters of other models such as SVM, Random Forest, and many more.

  • Metric Optimisation Engine (MOE)

    A Metric Optimization Engine is a tool or system that can optimize a specific metric or set of metrics of a machine learning model. The goal of a metric optimization engine is to find the set of hyperparameters or other configurations that results in the best performance of the Model on a given metric or set of metrics. These engines work by letting you specify the possible values for each hyperparameter and other configurations; the algorithm then trains and evaluates a model for different combinations of these values. Finally, the combination of values that results in the best performance on the specified metric or set of metrics is chosen as the optimal set of values.

    Some examples of Metric Optimization Engines are Optuna, Hyperopt, and Spearmint. These engines use Bayesian Optimization, an efficient global optimization method well-suited for high-dimensional and expensive-to-evaluate problems such as hyperparameter tuning. These engines can be used for model selection, hyperparameter tuning, and with different machine learning libraries and frameworks such as TensorFlow, PyTorch, scikit-learn, and many more.

  • Keras Tuner

    Keras Tuner is an open-source library for hyperparameter tuning for Keras models. It allows you to perform efficient hyperparameter searches using a simple and flexible interface. Keras Tuner provides several built-in tuning methods, such as random search, grid search, and Bayesian Optimization. It also allows you to define your custom search algorithm if needed. Here is an example of how to use the Keras Tuner to perform a random search for a Keras model:
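
    A minimal sketch (assuming the keras_tuner package and hypothetical arrays x_train, y_train, x_val, y_val) might look like this:

    ```python
    import keras
    import keras_tuner as kt
    from keras import layers

    def build_model(hp):
        # The tuner samples hidden_size and optimizer for each trial.
        model = keras.Sequential([
            layers.Dense(hp.Int("hidden_size", min_value=32, max_value=256, step=32),
                         activation="relu", input_shape=(20,)),
            layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer=hp.Choice("optimizer", ["adam", "sgd"]),
                      loss="binary_crossentropy", metrics=["accuracy"])
        return model

    tuner = kt.RandomSearch(
        build_model,
        objective="val_accuracy",
        max_trials=5,
        directory="tuner_results",   # where the trial results are saved
        project_name="demo",
    )

    # x_train, y_train, x_val, y_val are assumed to exist.
    tuner.search(x_train, y_train, epochs=5, validation_data=(x_val, y_val))

    best_model = tuner.get_best_models(num_models=1)[0]
    ```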

    In this example, we define a function build_model that builds our neural network, and we use the RandomSearch class from Keras Tuner to perform a random search on this Model. First, we pass the build_model function, the objective metric we want to optimize ('val_accuracy'), the maximum number of trials (5), and a directory for saving the trial results. Then we call the search method on the tuner object with the training and validation data and the number of epochs.

    Once the search is complete, the best Model can be accessed by calling tuner.get_best_models(num_models=1)[0].

    Keras Tuner also provides a more advanced tuning method called Hyperband. It is a simple algorithm that can quickly identify high-performing configurations while using compute more efficiently than traditional grid search or random search.

  • SigOpt

    SigOpt is an optimization platform that can optimize the hyperparameters of machine learning models. It uses Bayesian Optimization, an efficient global optimization method well-suited for high-dimensional and expensive-to-evaluate problems such as hyperparameter tuning.

    SigOpt allows you to specify the hyperparameters you want to optimize and their possible range of values. Then it uses a probabilistic model to predict the performance of different combinations of hyperparameters and selects the next set of hyperparameters to try based on the predicted improvement. This process is repeated until a satisfactory set of hyperparameters is found.

    SigOpt also provides a Python client library that can be easily integrated with popular machine learning libraries such as Keras, TensorFlow, PyTorch, and scikit-learn to perform the Optimization. An example of how to use SigOpt with a Keras model would be:
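
    A minimal sketch assuming the classic Connection-based SigOpt client, a valid API token, and hypothetical arrays x_train, y_train, x_val, y_val:

    ```python
    import keras
    from keras import layers
    from sigopt import Connection

    conn = Connection(client_token="YOUR_SIGOPT_API_TOKEN")  # placeholder token

    experiment = conn.experiments().create(
        name="keras-hyperparameter-tuning",
        parameters=[
            dict(name="hidden_size", type="int", bounds=dict(min=32, max=256)),
            dict(name="optimizer", type="categorical", categorical_values=["adam", "sgd"]),
        ],
        observation_budget=20,
    )

    def create_model(optimizer, hidden_size):
        model = keras.Sequential([
            layers.Dense(hidden_size, activation="relu", input_shape=(20,)),
            layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
        return model

    for _ in range(experiment.observation_budget):
        # Ask SigOpt for the next set of hyperparameters to try.
        suggestion = conn.experiments(experiment.id).suggestions().create()
        params = suggestion.assignments
        model = create_model(params["optimizer"], params["hidden_size"])
        model.fit(x_train, y_train, epochs=5, verbose=0)  # x_train, y_train assumed to exist
        _, accuracy = model.evaluate(x_val, y_val, verbose=0)
        # Report the result back so SigOpt can update its probabilistic model.
        conn.experiments(experiment.id).observations().create(
            suggestion=suggestion.id, value=accuracy)
    ```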

    In this example, we create an experiment on SigOpt and define the optimizer and hidden_size hyperparameters. Then we define a function that creates the Model with the given hyperparameters. Then we loop over several iterations and call the SigOpt API to get the next set of hyperparameters to try. Finally, we train the Model with those hyperparameters, evaluate the performance and send the result back to SigOpt.

    SigOpt will use this feedback to update its probabilistic Model and select new sets of hyperparameters to try. The optimization process will continue until the desired number of iterations is reached.

  • Grid Search

    In the grid search approach, we build a grid of potential hyperparameter values. Each iteration tries one combination of hyperparameters from the grid in a fixed sequence, fits the Model with it, and records the resulting performance. The Model with the best hyperparameters is then returned.

    In Keras, grid search can be implemented using the KerasClassifier or KerasRegressor wrapper classes from the keras.wrappers.scikit_learn module. These wrapper classes allow you to use Keras models in scikit-learn's grid search and other machine learning workflows. Here's an example of how to use GridSearchCV to perform grid-search for a Keras neural network:
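
    A minimal sketch (using the keras.wrappers.scikit_learn wrapper mentioned above, with hypothetical arrays x_train and y_train) might look like this:

    ```python
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.wrappers.scikit_learn import KerasClassifier
    from sklearn.model_selection import GridSearchCV

    def create_model(optimizer="adam", hidden_size=32):
        model = Sequential()
        model.add(Dense(hidden_size, activation="relu", input_dim=20))
        model.add(Dense(1, activation="sigmoid"))
        model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
        return model

    # Wrap the Keras model so it can be used in scikit-learn's grid search.
    model = KerasClassifier(build_fn=create_model, epochs=5, batch_size=32, verbose=0)

    # Hyperparameters to tune and their possible values.
    param_grid = {"optimizer": ["adam", "sgd"], "hidden_size": [32, 64, 128]}

    grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
    grid_result = grid.fit(x_train, y_train)  # x_train, y_train assumed to exist

    print(grid_result.best_params_)
    ```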

    In this example, we define a function create_model that builds our neural network. Then we use KerasClassifier to wrap our Model to use it in scikit-learn's grid search. Next, we define a dictionary param_grid specifying the hyperparameters we want to tune and their possible values, optimizer, and hidden_size. Then we pass the Model and the parameter grid to GridSearchCV, specifying a 3-fold cross-validation. The fit method is then called on the GridSearchCV object with the training data, and it will train and evaluate a model for each combination of hyperparameter values.

    Once the grid search is complete, the best combination of hyperparameters can be accessed by calling grid_result.best_params_.

Conclusion

In this article, we have learned about Model Optimization and the Frameworks available to optimize any model in Keras. The following is the takeaway from this article:

  • Model Parameter and a Hyperparameter are two different terminologies.
  • How Model Optimization helps the Machine Learning practitioner select the best hyperparameters for their Model.
  • The different types of approaches/Frameworks available to optimize Models in Keras and TensorFlow.
  • The different hyperparameter optimization frameworks, such as Optuna, SHERPA, SigOpt, Keras Tuner, and others.