Hidden Markov Model

Overview

The Hidden Markov Model in Artificial Intelligence is a statistical framework for representing systems that are both sequential and probabilistic in nature. Such real-life systems generally have some parameters that are not visible to an outside observer, yet these hidden parameters affect the visible behavior of the system. A Hidden Markov Model exploits this probabilistic relationship between the hidden parameters and the visible observations of a system and helps infer the crucial hidden parameters. As a result, it finds extensive applications in real life, including Natural Language Processing, Time Series Analysis, and many more.

Introduction

In real life, various processes are sequential in nature and evolve over time. Moreover, these changes are not entirely deterministic and can be associated with probabilities. The Hidden Markov Model in Artificial Intelligence is a mathematical technique for modeling such sequential, stochastic processes and identifying the probabilistic relationship between the sequence of hidden states and the sequence of observations. Before plunging into the details of the Hidden Markov Model in Artificial Intelligence, we briefly present the Markov Property and the Markov Chain.

Markov Property: It states that the future state of an environment depends only on the present state. In other words, the current state already captures all the relevant information from the past states. Therefore, we can write

$$P(s_{t+1} \vert s_1, s_2, \ldots, s_{t-1}, s_t) = P(s_{t+1} \vert s_t)$$

where $s_i$ denotes the state at time step $i$ and $P(\cdot)$ denotes probability. The LHS represents the conditional probability of state $s_{t+1}$ given that the previous states are $s_1, \ldots, s_t$. The RHS represents the conditional probability of the future state $s_{t+1}$ given only the current state $s_t$.

Markov Chain (or) Markov Process: A Markov Chain/Process comprises a set of states (that obey the Markov Property) and the probabilities of transitioning from one state to another. We use it for modeling the probability of a sequence of these states. An example Markov Chain can be seen in Figure-1.

Figure-1: An example Markov Chain.
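As a minimal sketch (with assumed, purely illustrative states and probabilities rather than the values in Figure-1), the following Python snippet defines a two-state transition matrix and samples a state sequence in which each next state depends only on the current one.

```python
import numpy as np

# Illustrative two-state Markov Chain: 0 = Sunny, 1 = Rainy (assumed values)
states = ["Sunny", "Rainy"]
transition_matrix = np.array([
    [0.7, 0.3],  # P(next state | current state = Sunny)
    [0.4, 0.6],  # P(next state | current state = Rainy)
])

rng = np.random.default_rng(seed=0)

def sample_chain(start_state, num_steps):
    """Sample a state sequence; the next state depends only on the current one."""
    sequence = [start_state]
    for _ in range(num_steps):
        current = sequence[-1]
        next_state = rng.choice(len(states), p=transition_matrix[current])
        sequence.append(next_state)
    return [states[s] for s in sequence]

print(sample_chain(start_state=0, num_steps=5))
# e.g. ['Sunny', 'Sunny', 'Rainy', 'Rainy', 'Sunny', 'Sunny']
```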

Adding to the complexity of a Markov Chain, a Hidden Markov Model in Artificial Intelligence has some or all of its states hidden, i.e., we cannot see the exact state of the environment. However, the hidden state of the system emits observations with certain probabilities. We can use a sequence of these observations to determine the most probable sequence of hidden states the environment has been in. In the next section, we formally define a Hidden Markov Model in Artificial Intelligence.

What is the Hidden Markov Model?

A Hidden Markov Model models a Markov Process with unobservable states using the following components.

  • The initial probability distribution of the hidden states, denoted as $\pi$.
  • The probabilities of transitioning from one state to another. These are represented by the matrix $A$, where $A_{ij}$ denotes the probability of transitioning from state $x_i$ to state $x_j$. We call these the transition probabilities.
  • The probabilities of obtaining various observations from a state. These are denoted by the matrix $B$, where $B_{ij}$ indicates the likelihood of observation $y_j$ being emitted from state $x_i$. We term these the emission probabilities.

The matrices $A$ and $B$, together with $\pi$, constitute the Hidden Markov Model $\lambda$. In line with the Markov Property, the Hidden Markov Model in Artificial Intelligence is based on two assumptions:

  • The next hidden state depends only on the current hidden state. Mathematically, we can write
$$P(x_t \vert x_1, x_2, \ldots, x_{t-1}) = P(x_t \vert x_{t-1}) \qquad \text{(Markov Process)}$$
  • The observation depends only on the current state. Mathematically, we can write
$$P(y_t \vert x_1, x_2, \ldots, x_t) = P(y_t \vert x_t) \qquad \text{(Output Independence)}$$

We can construct the following algorithm to define a Hidden Markov Model in Artificial Intelligence for the concerned problems.

  • Step-1: Define the state space and the observation state space of the given problem.
  • Step-2: Define the initial probability distribution π\pi of the hidden states of the problem.
  • Step-3: Define the transition probabilities AA from one state to the other.
  • Step-4: Define the probability BB of obtaining observations from different hidden states.
  • Step-5: After defining all the above parameters of the Hidden Markov Model, train it to find the solution.
  • Step-6: Decode the most probable sequence of hidden states from the trained model.
  • Step-7: Evaluate the trained model.

The implementation of the above algorithm can be found in the subsequent sections. Now, let us take an example to illustrate the Hidden Markov Model in Artificial Intelligence. Bob's mood depends on the weather outside. If the weather is sunny, there is a 90% chance that Bob will be happy and a 10% chance that Bob will be sad. However, if the weather is rainy, Bob will be happy with a probability of 30% and sad with a probability of 70%. Bob's friend Alice is sitting inside her house and cannot see the weather outside. However, she can observe the mood of her friend Bob and can accordingly guess the weather outside. This situation can be modeled using a Hidden Markov Model where the weather acts as the hidden state and Bob's mood acts as the observation emitted from the hidden states. This can be visualized as shown in Figure-2.

Figure-2: Hidden Markov Model for the weather and Bob's mood example.

In the next section, we present the various algorithms for solving different components of the Hidden Markov Model in Artificial Intelligence.

Problems to be Solved in a Hidden Markov Model

There are three problems associated with a Hidden Markov Model in Artificial Intelligence, and they are as follows.

  • Problem 1 (Likelihood): Given a Hidden Markov Model denoted by $\lambda = (A, B)$ and an observation sequence of $T$ time steps $Y = \{ y_i \}, 1 \leq i \leq T$, we have to determine the probability of the observation sequence, i.e., $P(Y \vert \lambda)$.
  • Problem 2 (Decoding): Given a Hidden Markov Model represented by $\lambda = (A, B)$ and an observation sequence of $T$ time steps $Y = \{ y_i \}, 1 \leq i \leq T$, we have to determine the most probable sequence of hidden states $X = \{ x_i \}, 1 \leq i \leq T$.
  • Problem 3 (Learning): Given an observation sequence of $T$ time steps $Y = \{ y_i \}, 1 \leq i \leq T$ and a set of hidden states $X = \{ x_i \}, 1 \leq i \leq T$, we have to learn the parameters $A$ and $B$ of the Hidden Markov Model while determining the optimal model that maximizes the probability of $Y$.

Problem 1 (Likelihood)

There are two algorithms to solve this problem of the Hidden Markov Model in Artificial Intelligence: forward algorithm and backward algorithm. Here, we present the details of the forward algorithm, which is based on dynamic programming. The forward algorithm has three phases: initialization, recursion, and termination.

Initialization

We compute the first forward variable $\alpha_1(i)$ for the $i^{th}$ state by multiplying the initial probability $\pi_i$ of state $x_i$ and the emission probability of the first observation $y_1$ from state $x_i$. Mathematically, we can write the following equation for all $N$ hidden states.

$$\alpha_1(i) = \pi_i B_{i1} \qquad \forall \; 1 \leq i \leq N$$

Recursion

For computing the $(t+1)^{th}$ forward variable $\alpha_{t+1}(j)$ of the $j^{th}$ state, we sum over all $i$ the product of the $t^{th}$ forward variable of the $i^{th}$ state, the transition probability $A_{ij}$ from state $x_i$ to $x_j$, and the emission probability $B_{j,t+1}$ of observation $y_{t+1}$ from state $x_j$. We can write the following recursive expression for this step.

$$\alpha_{t+1}(j) = \sum_{i=1}^{N} \alpha_t(i) A_{ij} B_{j,t+1} \qquad \forall \; 1 \leq j \leq N$$

Termination

After applying the recursive step for $T$ time steps, we compute the final desired output $P(Y \vert \lambda)$ by summing the $T^{th}$ forward variables of all states. Therefore, we can write the following expression.

$$P(Y \vert \lambda) = \sum_{i=1}^{N} \alpha_T(i)$$
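The three phases above translate almost directly into code. Below is a minimal sketch of the forward algorithm in Python; the parameter values are assumed toy values for illustration, and B is indexed here by state and observation symbol, so B[j, obs[t]] plays the role of $B_{j,t+1}$ in the equations.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Compute P(Y | lambda) with the forward algorithm.

    pi : (N,)   initial state distribution
    A  : (N, N) transition probabilities, A[i, j] = P(x_j | x_i)
    B  : (N, M) emission probabilities,  B[i, k] = P(y_k | x_i)
    obs: (T,)   observation sequence encoded as integer symbols
    """
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):                             # recursion
        for j in range(N):
            alpha[t, j] = np.sum(alpha[t - 1] * A[:, j]) * B[j, obs[t]]
    return alpha[-1].sum()                            # termination

# Toy example (assumed values): 2 hidden states, 2 observation symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.5, 0.5], [0.3, 0.7]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
print(forward(pi, A, B, obs=[0, 1, 0]))  # likelihood of the observation sequence
```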

Problem 2 (Decoding)

We use a dynamic programming-based solution called the Viterbi algorithm for this problem. This algorithm also has three phases, namely initialization, recursion, and termination, which are presented below.

Initialization

We compute the first Viterbi variable $\delta_1(i)$ for the $i^{th}$ state by multiplying the initial probability $\pi_i$ of state $x_i$ and the emission probability of the first observation $y_1$ from state $x_i$. Mathematically, we can write the following equation for all $N$ hidden states.

$$\delta_1(i) = \pi_i B_{i1} \qquad \forall \; 1 \leq i \leq N$$

Recursion

For computing the $(t+1)^{th}$ Viterbi variable $\delta_{t+1}(j)$ of the $j^{th}$ state, we take the maximum over all $i$ of the product of the $t^{th}$ Viterbi variable of the $i^{th}$ state, the transition probability $A_{ij}$ from state $x_i$ to $x_j$, and the emission probability $B_{j,t+1}$ of observation $y_{t+1}$ from state $x_j$. We can write the following recursive expression for this step.

$$\delta_{t+1}(j) = \max_{1 \leq i \leq N} \delta_t(i) A_{ij} B_{j,t+1} \qquad \forall \; 1 \leq j \leq N$$

Termination

After applying the recursive step for $T$ time steps, the most probable hidden state at time step $T$ is the index $i$ with the highest $T^{th}$ Viterbi value $\delta_T(i)$. To recover the rest of the sequence, we also record, during each recursion step, the index $i$ that achieved the maximum (a backpointer) and then trace these backpointers from time step $T$ back to time step $1$. In this manner, we can determine the most probable sequence of hidden states $X = \{ x_i \}, 1 \leq i \leq T$.
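A minimal sketch of the Viterbi algorithm is shown below, again with assumed toy parameter values. It differs from the forward algorithm only in replacing the sum with a max and in keeping backpointers so that the best path can be traced back.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most probable hidden state sequence for the observations."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))      # best path probabilities
    psi = np.zeros((T, N), int)   # backpointers
    delta[0] = pi * B[:, obs[0]]                       # initialization
    for t in range(1, T):                              # recursion
        for j in range(N):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]
    path = [int(np.argmax(delta[-1]))]                 # termination
    for t in range(T - 1, 0, -1):                      # backtracking
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy example (assumed values), same parameters as the forward sketch
pi = np.array([0.6, 0.4])
A = np.array([[0.5, 0.5], [0.3, 0.7]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
print(viterbi(pi, A, B, obs=[0, 1, 1]))  # prints [0, 1, 1]
```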

Problem-3 (Learning)

This problem can also be solved using a dynamic programming-based approach called the Forward-Backward or Baum-Welch Algorithm. It comprises four phases, namely initialization, forward phase, backward phase, and update phase. The forward phase involves the following recursive function.

$$\alpha(X_k) = P[Y_{0:k}, X_k] = \sum_{X_{k-1}} \alpha(X_{k-1}) \, P(X_k \vert X_{k-1}) \, P(Y_k \vert X_k)$$

The backward phase is executed using the following recursive function.

$$\beta(X_k) = P[Y_{k+1:T} \vert X_k] = \sum_{X_{k+1}} \beta(X_{k+1}) \, P(X_{k+1} \vert X_k) \, P(Y_{k+1} \vert X_{k+1})$$
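Below is a minimal sketch of just the forward and backward passes with assumed toy parameters; the update phase of Baum-Welch, which re-estimates A and B from these quantities, is omitted. Multiplying and normalizing $\alpha$ and $\beta$ gives the posterior probability of each hidden state at every time step.

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Compute alpha, beta, and the posterior state probabilities gamma."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                 # forward pass
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0                               # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)    # posterior P(x_t | Y)
    return alpha, beta, gamma

# Toy example (assumed values)
pi = np.array([0.6, 0.4])
A = np.array([[0.5, 0.5], [0.3, 0.7]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
_, _, gamma = forward_backward(pi, A, B, obs=[0, 1, 1])
print(gamma)  # per-time-step posterior over the hidden states
```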


The following section describes a hands-on implementation of the Hidden Markov Model in Artificial Intelligence using Python.

Implementation of the Hidden Markov Model in Python

We will use Python to implement the Hidden Markov Model of weather-driven Bob's mood presented earlier. In this problem, the weather denotes the hidden state. There are two hidden states, namely Sunny and Rainy. Bob's mood is based on the weather outside. In this problem, Bob's mood is the observation we make. If the weather is Sunny, the probability of Bob being Happy is 80%, and Sad is 20%. On the other hand, if the weather is Rainy, the probability of Bob being Happy is 30%, and Sad is 70%.

Alice is staying inside her home and cannot see the weather outside. However, she can observe Bob's mood and guess the most probable weather outside. For Alice, the initial probability of the weather being Sunny is 60%, and Rainy is 40%. Having stated the problem, we model it as a Hidden Markov Model in the following steps.

  • Step-1: Import the required libraries. Our script uses the numpy, seaborn, matplotlib, and hmmlearn libraries.

  • Step-2: Define the model parameters. In this part, we describe the hidden states, the observations, the initial probability distribution of the hidden states, the transition probabilities, and the emission probabilities.

  • Step-3: Create a Hidden Markov Model instance and feed it the model parameters. In this step, we create an instance of the Hidden Markov Model using the hmmlearn library and feed it the model parameters defined in the previous step.

  • Step-4: Take an observation sequence and use the Hidden Markov Model instantiated in the previous step to decode it, i.e., compute the most probable sequence of hidden states.

    The entire script for generating the most probable sequence of hidden states is sketched just after this list.

    The final output of the script is described below the code.
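The following sketch puts these steps together (the seaborn and matplotlib visualizations mentioned in Step-1 are omitted for brevity). The initial and emission probabilities come from the problem statement above; the transition probabilities are assumed illustrative values, since they are not specified in the text, and the sketch assumes a recent hmmlearn version in which CategoricalHMM models discrete observation symbols.

```python
import numpy as np
from hmmlearn import hmm

# Hidden states and observation symbols
states = ["Sunny", "Rainy"]
observations = ["Happy", "Sad"]

# Initial distribution and emission probabilities (from the problem statement)
start_probability = np.array([0.6, 0.4])
emission_probability = np.array([
    [0.8, 0.2],  # Sunny -> P(Happy), P(Sad)
    [0.3, 0.7],  # Rainy -> P(Happy), P(Sad)
])

# Transition probabilities (assumed illustrative values, not given in the text)
transition_probability = np.array([
    [0.5, 0.5],  # Sunny -> P(Sunny), P(Rainy)
    [0.3, 0.7],  # Rainy -> P(Sunny), P(Rainy)
])

# Create the model instance and feed it the parameters
model = hmm.CategoricalHMM(n_components=len(states))
model.startprob_ = start_probability
model.transmat_ = transition_probability
model.emissionprob_ = emission_probability

# Observation sequence: Happy, Sad, Happy, Sad, Sad, Happy (encoded as 0 = Happy, 1 = Sad)
observation_sequence = np.array([[0], [1], [0], [1], [1], [0]])

# Decode the most probable sequence of hidden states using the Viterbi algorithm
log_probability, hidden_state_sequence = model.decode(observation_sequence, algorithm="viterbi")
print("Most probable weather sequence:", [states[s] for s in hidden_state_sequence])
```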

The observation sequence was (Happy, Sad, Happy, Sad, Sad, Happy). As a result, the most probable hidden state sequence came out to be (Sunny, Rainy, Rainy, Rainy, Rainy, Sunny). In the next section, we describe the application of the Hidden Markov Model in Natural Language Processing.

Hidden Markov Models in NLP

As is evident from the definition of the Hidden Markov Model in Artificial Intelligence, it deals with sequential stochastic data. Examples of such time series data include audio, video, and text data. The analysis of text data is done under the purview of Natural Language Processing. Therefore, Hidden Markov Models in Artificial Intelligence are suitable for application in various tasks of Natural Language Processing. One such task is Part Of Speech Tagging, abbreviated as POS Tagging. In the following sub-section, we describe the details of POS Tagging.

What is POS Tagging?

In any English sentence, each word belongs to a specific part of speech. There are nine commonly used parts of speech in the English Language, namely nouns, pronouns, verbs, adverbs, articles, adjectives, prepositions, conjunctions, and interjections. Now, one of the tasks in Natural Language Processing is POS Tagging, which involves classifying each word in a sentence as one of the nine parts of speech, as shown in Figure-3. In other words, we have to tag each word with its corresponding POS. POS tagging is essential for establishing communication between a machine and a human. One of the most common ways to perform POS Tagging is using the NLTK library in Python, as shown in the sketch after Figure-3. In the next section, we illustrate how we can perform POS Tagging using the Hidden Markov Model in Artificial Intelligence.

Figure-3: POS Tagging.
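For reference, a minimal sketch of POS Tagging with NLTK's built-in tagger is shown below; the exact resource names to download and the tags produced can vary slightly across NLTK versions.

```python
import nltk

# One-time downloads of the tokenizer and tagger models
# (resource names may differ slightly across NLTK versions)
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "Will Jane spot Mary?"
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Prints (word, tag) pairs using the Penn Treebank tag set, e.g. NNP, MD, VB
```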

POS Tagging with Hidden Markov Model

The Hidden Markov Model in Artificial Intelligence is one of the solutions for performing POS Tagging. In this model, the nine parts of speech form the hidden states. Naturally, the transition probabilities represent the probabilities of one part of speech occurring after another. For instance, we need to know the probabilities of a verb occurring after a noun, a noun after an article, and so on. Further, we cannot observe the underlying parts of speech, but we can observe the words. Therefore, the emission probabilities denote the probabilities of the words given the hidden parts of speech. As discussed in previous sections, POS Tagging reduces to Problem-2, i.e., decoding the sequence of hidden states (parts of speech) when the observation sequence (the sequence of words in the sentence) is given.

For instance, let us say that we have the following corpus of sentences or observation sequences.

  • Mary Jane can see Will.
  • Will Jane spot Mary?
  • The spot will see Mary.
  • Mary will pat Spot.

Now, we can build the emission probability matrix as shown below.

| Words | Noun | Modal | Verb |
| ----- | ---- | ----- | ---- |
| mary  | 4/9  | 0     | 0    |
| jane  | 2/9  | 0     | 0    |
| will  | 1/9  | 3/4   | 0    |
| spot  | 2/9  | 0     | 1/4  |
| can   | 0    | 1/4   | 0    |
| see   | 0    | 0     | 2/4  |
| pat   | 0    | 0     | 1/4  |

The word "will" appears 1 time as a noun and 3 times as a modal. Moreover, a noun appears 9 times in the corpus, whereas a modal appears 4 times. Hence, the emission probability of the observation word "will" when the hidden state part of speech is a noun is 1/9. Similarly, the emission probability of "will" when the part of speech is a modal is 3/4. In the same way, we can populate the rest of the matrix. Further, we can construct the transition probability matrix, shown below.

|       | Noun | Modal | Verb | End |
| ----- | ---- | ----- | ---- | --- |
| Start | 3/4  | 1/4   | 0    | 0   |
| Noun  | 1/9  | 3/9   | 1/9  | 4/9 |
| Modal | 1/4  | 0     | 3/4  | 0   |
| Verb  | 4/4  | 0     | 0    | 0   |

Here, "Start" and "End" refer to the Start and End tokens added at the ends of each sentence to denote its beginning and end. In the corpus, the Start token appears 3 times before a noun and 1 time before a modal. Moreover, the Start token appears 4 times in the corpus. Therefore, the transition probability of the next hidden state being a noun when the current hidden state is the Start token is 3/4. Similarly, the transition probability to a modal when the current state is the Start token is 1/4. In the same way, we can populate the rest of the matrix.

Now, let's take the observation sequence of words "Will Jane spot Mary?". Further, let us take the sequence of hidden states: Start, Modal, Noun, Verb, Noun, and End. We can calculate the likelihood of this sequence by multiplying the respective transition and emission probabilities, as shown in Figure-4.

Figure-4: POS Tagging with the Hidden Markov Model.

The likelihood of this sequence will come out as

$$(1/4) \times (3/4) \times (3/4) \times (2/9) \times (1/1) \times (1/4) \times (1/9) \times (4/9) \times (4/9) = 0.0001714678$$

Since the likelihood of the hidden sequence is positive, this sequence of parts of speech is possible. Had it been 0, we would have concluded that the hidden state sequence is impossible. A larger corpus leads to transition and emission probability matrices that are closer to the real-life scenario. We can accordingly model the problem of POS Tagging as a Hidden Markov Model and perform decoding to get the sequence of parts of speech with the highest likelihood. This was a real-life application of the Hidden Markov Model in Artificial Intelligence, and we present more such applications in brief in the next section.
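As a quick sanity check, the same product can be reproduced with exact fractions in Python, reading the nine factors off the transition and emission tables along the path Start, Modal, Noun, Verb, Noun, End.

```python
from fractions import Fraction
from math import prod

# Hidden path: Start -> Modal -> Noun -> Verb -> Noun -> End
# Observations: "Will Jane spot Mary"
factors = [
    Fraction(1, 4),  # Start -> Modal        (transition)
    Fraction(3, 4),  # "will" given Modal    (emission)
    Fraction(3, 4),  # Modal -> Noun         (transition)
    Fraction(2, 9),  # "jane" given Noun     (emission)
    Fraction(1, 9),  # Noun  -> Verb         (transition)
    Fraction(1, 4),  # "spot" given Verb     (emission)
    Fraction(4, 4),  # Verb  -> Noun         (transition)
    Fraction(4, 9),  # "mary" given Noun     (emission)
    Fraction(4, 9),  # Noun  -> End          (transition)
]
likelihood = prod(factors)
print(likelihood, float(likelihood))  # 1/5832 ≈ 0.0001714678
```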

Applications of the Hidden Markov Model

The Hidden Markov Model in Artificial Intelligence has a multitude of real-life applications due to its ability to model stochastic sequential processes while also handling hidden states of the system. Some of those applications are listed as follows.

  • Computational Finance: A Hidden Markov Model can be used to figure out various financial indicators and can help investors adjust their portfolios to maximize their profits.

  • Speech Recognition: Given sequential time-series audio data, a Hidden Markov Model can be used to identify the most probable sequence of underlying words leading to speech recognition.


  • Speech Synthesis: This involves the creation of the most probable speech audio when a text is given. This is useful in various smart devices used in our day-to-day life.

  • Part of Speech Tagging: This involves figuring out the most likely sequence of parts of speech annotated to each of the words of a sentence.


  • Machine Translation: Given a text in one language, a Hidden Markov Model can be used to translate it into another language.

  • Handwriting Recognition: Given an image of a handwritten text, a machine can use a Hidden Markov Model to figure out the most probable sequence of characters written in the text.


  • Activity Recognition: A Hidden Markov Model can be used to identify the most probable activity from a time-series data stream.

Conclusion

In this article, we learned about the following aspects of the Hidden Markov Model in Artificial Intelligence.

  • Definition of Hidden Markov Model along with an example
  • The problems of Likelihood, Decoding, and Learning, along with their solutions
  • Implementation of a Hidden Markov Model using Python
  • Real-life application of the Hidden Markov Model in Part of Speech Tagging in Natural Language Processing
  • Various other real-life applications of the Hidden Markov Model in Artificial Intelligence