Partially Observable Markov Decision Processes (POMDPs)

Overview

A Partially Observable Markov Decision Process (POMDP) is a mathematical framework for modelling sequential decision-making under the constraint that the decision maker has incomplete information about the environment. POMDPs are useful in various domains like Artificial Intelligence, Reinforcement Learning, robotics, game theory, and many others.

Prerequisites

  • Concepts of conditional probability
  • Markov Decision Processes
  • Basic Programming knowledge

Introduction

In Artificial Intelligence, an agent (decision maker) must make a sequence of decisions and take appropriate actions in an environment so as to maximize the reward it accumulates. A Partially Observable Markov Decision Process (POMDP) is a mathematical framework for modelling such sequential decision-making in real-life scenarios where the decision maker does not have complete information about the environment in which it operates. Therefore, POMDPs find applications in various domains like AI, operations research, economics, robotics, and many others. We formally define POMDPs in the next section.

What are POMDPs?

A POMDP can be represented by the tuple $(S, A, T, R, \Omega, O, \gamma)$ where,

  • $S$: refers to the set of states in the environment.
  • $A$: denotes the set of possible actions taken by the decision maker.
  • $T$: refers to the conditional transition probability distribution over the next state given the current state and the chosen action.
  • $R$: is a function that takes a state and an action as input and provides the corresponding reward as output.
  • $\Omega$: refers to the set of possible observations made by the decision maker.
  • $O$: is the conditional probability distribution over observations given the next state and the chosen action.
  • $\gamma$: represents the discount factor used to compute the expected sum of discounted future rewards.

Initially, the environment is in some state $s \in S$. The decision maker chooses an action $a \in A$, leading to a transition of the environment to the next state $s' \in S$ with conditional probability $T(s' \vert s, a)$. However, unlike in a Markov Decision Process (MDP), the decision maker cannot observe the actual state of the environment and can only make observations. The decision maker therefore receives an observation $o \in \Omega$, which depends on the new state $s'$ and the chosen action $a$ with conditional probability $O(o \vert s', a)$. Finally, the decision maker receives a reward denoted by $R(s, a)$. This entire process is repeated over multiple iterations, producing the sequence of actions taken by the decision maker.

Given the above framework, the goal of the decision maker, in a similar fashion to the Markov Decision Process (MDP), is to maximize the expected sum of discounted future rewards, denoted by $E[\sum_{t=0}^{\infty}\gamma^{t}r_{t}]$, where $r_{t}$ indicates the reward obtained at time step $t$. Now, let us take a simple example of a problem that can be formulated as a POMDP.

Suppose there is a robot that can move in a $3 \times 3$ grid environment. At each time step, the robot can take one of four actions: move up, move down, move left, or move right. However, the robot's movements are not entirely deterministic. If it tries to move in a specific direction, it has a 70% chance of succeeding and a 30% chance of moving in a random direction. The robot also has a sensor that can detect whether it is in the top row, middle row, or bottom row of the grid world. However, the sensor is not entirely reliable. It has an 80% chance of giving the correct reading and a 20% chance of giving a random reading.

The goal of the robot is to navigate to a particular cell (denoted as the "target cell") in the grid world. However, the robot doesn't know the location of the target cell initially. It only knows that it is in one of the nine cells in the grid world. At each time step, the robot can take action to move and observe the sensor reading, and based on this information, it must decide what action to take next to maximize its chances of reaching the target cell.

This problem can be formulated as a POMDP, where the state of the system is the robot's location (unknown to the robot), and the belief state is a probability distribution over the possible locations of the target cell. The observation is the reading from the sensor, and the action is the robot's movement. The objective is to find a policy that maximizes the expected cumulative reward, which could be defined as the negative Manhattan distance between the robot's current location and the target cell. This example is a simplified version of a classic problem in robotics called the "grid localization" problem, which is often used as a benchmark for testing POMDP algorithms.
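To make the transition model concrete, here is a minimal R sketch (not from the original article) of how the transition probabilities $T(s' \vert s, a)$ could be constructed for this grid robot, assuming the 30% "random direction" is spread uniformly over the four moves and that a move off the grid leaves the robot in place:

```r
# States are the 9 cells of the 3x3 grid, indexed by (row, col)
cells <- expand.grid(row = 1:3, col = 1:3)
n_states <- nrow(cells)
actions <- c("up", "down", "left", "right")
moves <- list(up = c(-1, 0), down = c(1, 0), left = c(0, -1), right = c(0, 1))

# Helper: index of the cell reached from state s by a move (stay in place if off-grid)
next_cell <- function(s, mv) {
  rr <- cells$row[s] + mv[1]; cc <- cells$col[s] + mv[2]
  if (rr < 1 || rr > 3 || cc < 1 || cc > 3) return(s)
  which(cells$row == rr & cells$col == cc)
}

# T_prob[[a]][s, s'] = probability of landing in s' after taking action a in s:
# 70% intended direction + 30% spread uniformly over the four directions (an assumption)
T_prob <- lapply(actions, function(a) {
  m <- matrix(0, n_states, n_states)
  for (s in 1:n_states) {
    j <- next_cell(s, moves[[a]])
    m[s, j] <- m[s, j] + 0.7
    for (d in actions) {
      k <- next_cell(s, moves[[d]])
      m[s, k] <- m[s, k] + 0.3 / 4
    }
  }
  m
})
names(T_prob) <- actions
rowSums(T_prob$up)  # each row sums to 1
```

The observation model $O$ (the noisy row sensor) and the reward (the negative Manhattan distance to the target cell) can be encoded analogously.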

Solving POMDPs

Formulation of POMDP as Belief State MDP

Let us define $b$ as a belief, which is a probability distribution over the states that the decision maker thinks the environment might be in. In other words, for a state $s$, $b(s)$ represents the decision maker's probability/belief that the environment is in state $s$. After choosing an action $a$ and making an observation $o$, the decision maker needs to update the belief to $b'$. Assuming that the belief follows the Markov property, i.e., the updated belief depends only on the previous belief (together with the latest action and observation), we can write $b'$ as a function of $b$, $a$, and $o$. Therefore, we write $b' = \tau(b, a, o)$.

Now, once the decision maker chooses action $a$ and makes observation $o$, with the current belief in a state $s$ being $b(s)$, we can write the updated belief in a state $s'$ as

$$b'(s') = \eta \, O(o \vert s', a) \sum_{s \in S} T(s' \vert s, a) b(s)$$

where,

$$\eta = \frac{1}{\text{Pr}(o \vert b, a)} \quad \text{and} \quad \text{Pr}(o \vert b, a) = \sum_{s' \in S} O(o \vert s', a) \sum_{s \in S} T(s' \vert s, a) b(s).$$

Stating this informally for intuition: $b(s)$ gives the probability that the environment was in state $s$. Multiplying by $T(s' \vert s, a)$ gives the probability of transitioning to state $s'$ given that the previous state and chosen action were $s$ and $a$, respectively; summing over all possible previous states accounts for every way of arriving at $s'$. We then multiply by $O(o \vert s', a)$, the probability that the state $s'$ and the chosen action $a$ produce the observation $o$. Finally, to turn this quantity into a proper probability/belief over $s'$, we normalize by dividing by the sum of these values over every possible next state $s'$, which is exactly $\text{Pr}(o \vert b, a)$.
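As a concrete illustration (not from the original article), here is a minimal R sketch of this belief update, assuming the transition and observation models are stored as lists of matrices indexed by action, with T_prob[[a]][s, s'] = $T(s' \vert s, a)$ and O_prob[[a]][s', o] = $O(o \vert s', a)$:

```r
# Belief update b' = tau(b, a, o) for a discrete POMDP.
# b:      numeric vector over states (sums to 1)
# a:      action name, o: observation index
# T_prob: list of |S| x |S| matrices, T_prob[[a]][s, s'] = T(s'|s, a)
# O_prob: list of |S| x |O| matrices, O_prob[[a]][s', o] = O(o|s', a)
belief_update <- function(b, a, o, T_prob, O_prob) {
  pred <- as.vector(b %*% T_prob[[a]])   # sum_s T(s'|s,a) b(s) for every s'
  unnorm <- O_prob[[a]][, o] * pred      # multiply by O(o|s', a)
  unnorm / sum(unnorm)                   # normalize by eta = 1 / Pr(o | b, a)
}

# Pr(o | b, a), the normalizing constant used above
pr_obs <- function(b, a, o, T_prob, O_prob) {
  sum(O_prob[[a]][, o] * as.vector(b %*% T_prob[[a]]))
}
```

The quantity returned by pr_obs() is exactly the $\text{Pr}(o \vert b, a)$ appearing in the formulas above and is reused in the later sketches.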

Now, we can formulate the POMDP as a belief state MDP, which can be defined as the tuple $(B, A, \tau, r, \gamma)$ where

  • $B$: is the set of belief states.
  • $A$: corresponds to the set of actions of the original POMDP.
  • $\tau$: denotes the transition function, which takes the current belief and chosen action as input and renders the new belief state as output.
  • $r$: is the reward function that takes the current belief and chosen action as input and generates the corresponding reward.
  • $\gamma$: refers to the same discount factor used in the original POMDP.

Before proceeding, we must define $\tau$ and $r$ in terms of the original POMDP. We can write

$$\tau(b, a, b') = \sum_{o \in \Omega} \text{Pr}(b' \vert b, a, o)\,\text{Pr}(o \vert b, a)$$

The value of $\text{Pr}(b' \vert b, a, o)$ will be $1$ if we get $b'$ after performing the belief update using $b$, $a$, and $o$. Otherwise, the value will be $0$. Further, we can define the reward function as

$$r(b, a) = \sum_{s \in S} b(s) R(s, a)$$

The intuition behind these formulations can be derived similarly to the one behind the belief update formula. One crucial observation to make here is that the belief MDP has a continuous state space, unlike the original POMDP, which had discrete states. The reason is that the belief states are probability distributions whose components can take any value between $0$ and $1$. In the subsequent sub-sections, we present techniques to solve the belief MDP and hence the POMDP as well.
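As a small companion to the belief-update sketch above (again an illustrative assumption rather than code from the article), the belief-MDP reward is simply the belief-weighted average of the original rewards, with a hypothetical matrix R_mat[s, a] standing for $R(s, a)$:

```r
# r(b, a) = sum_s b(s) * R(s, a), where R_mat is an |S| x |A| reward matrix
belief_reward <- function(b, a, R_mat) {
  sum(b * R_mat[, a])
}
```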

Exact Value Iteration

Before proceeding to the algorithm, we define the notation $V^{k}(s_{i})$ as the maximum possible expected sum of discounted future rewards if the decision maker makes decisions for $k$ time steps starting from state $s_i$, in the case of a discrete MDP. Further, we write $V^{1}(s_i) = r(s_i)$, the reward gained at state $s_i$. To compute $V^{k+1}(s_{i})$, we can make use of $V^{k}(s_{j})$ via the following equation

$$V^{k+1}(s_{i}) = \max_{a \in A} \bigg[ r(s_i) + \gamma \sum_{s_j \in S} T_{ij}^{a} V^{k}(s_j) \bigg]$$

Now, as $k \rightarrow \infty$, $V^{k}(s_{i})$ converges to $V^{*}(s_{i})$, the value function, which is a solution to the Bellman optimality equation and denotes the maximum possible expected discounted future reward.
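Before moving on to the continuous belief space, here is a compact sketch of this discrete value iteration (illustrative only; the state rewards r_vec and the transition matrices T_prob are assumed inputs in the same format as the earlier sketches):

```r
# Value iteration for a discrete MDP.
# r_vec:  length-|S| vector of state rewards r(s)
# T_prob: list over actions of |S| x |S| matrices, T_prob[[a]][s, s'] = T(s'|s, a)
value_iteration <- function(r_vec, T_prob, gamma = 0.95, tol = 1e-6) {
  V <- r_vec                               # horizon-1 value function V^1(s) = r(s)
  repeat {
    # One-step lookahead for each action: r(s) + gamma * sum_s' T(s'|s,a) V(s')
    Q <- sapply(T_prob, function(Tm) r_vec + gamma * as.vector(Tm %*% V))
    V_new <- apply(Q, 1, max)              # maximize over actions
    if (max(abs(V_new - V)) < tol) break   # stop at (approximate) convergence
    V <- V_new
  }
  V_new
}
```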

However, in the case of a belief state MDP, we have a continuous state space. Therefore, we need to define the value function over our continuous belief state space. We can write

$$V^{*}(b) = \max_{a \in A} \bigg[ r(b, a) + \gamma \sum_{o \in \Omega} \text{Pr}(o \vert b, a) V^{*}(\tau(b, a, o)) \bigg]$$

Now, if we plot this value function over the belief space, we get a piecewise linear and convex graph. In other words, the value function of a POMDP can be represented as the maximum over a set of linear segments (often called alpha vectors). The linear segments correspond to the various possible actions. For simplicity, if we take two states and two actions, $a_1$ and $a_2$, then one possible plot of the value function is as follows.

[Figure: piecewise linear and convex value function over the belief space, with one linear segment for each of the actions $a_1$ and $a_2$]
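To make the "maximum of linear segments" idea concrete, here is a small illustrative sketch (an assumption made for exposition, not code from the article) that evaluates a value function represented by a set of alpha vectors at a given belief:

```r
# Each alpha vector is a length-|S| vector of coefficients; the value function is
# V(b) = max over alpha of sum_s alpha(s) * b(s), i.e., the upper surface of the segments.
value_at_belief <- function(b, alphas) {
  max(sapply(alphas, function(alpha) sum(alpha * b)))
}

# Two states, two hypothetical alpha vectors (e.g., one per action)
alphas <- list(a1 = c(10, 0), a2 = c(2, 8))
b <- c(0.4, 0.6)                 # belief: 40% in state 1, 60% in state 2
value_at_belief(b, alphas)       # max(10*0.4 + 0*0.6, 2*0.4 + 8*0.6) = 5.6
```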

A crucial point to consider is that horizon-$k$ segments are linear combinations of horizon-$(k-1)$ segments. Horizon $k$ refers to the value function obtained after $k$ iterations. Therefore, the value iteration algorithm can be broadly stated as follows:

  1. Calculate value function vectors for each action
  2. Transform value function with observations
  3. Repeat steps 1 and 2 for the next horizon value function until convergence

Exact Value Iteration is an exact algorithm for finding the optimal solution of a POMDP. However, its time complexity is exponential, and the dimensionality of the belief space grows with the number of states.

Policy Iteration

Let us define $V^{\pi}(b)$ as the expected discounted future reward obtained by following policy $\pi$ from belief state $b$ in the belief state MDP. A policy tells us the action that should be chosen when the decision maker holds a particular belief. Now, a policy $\pi^*$ is an optimal policy if it yields the maximum expected discounted future reward.

Therefore, our goal is to find an optimal policy, which can be done using policy iteration. The policy iteration algorithm can be stated as follows, with a small illustrative sketch of the improvement step given after the list:

  1. Initialise a policy $\pi^0$.

  2. Perform policy evaluation by computing the value function $V^{\pi^{k}}$ and updating it using Bellman's equation.

  3. Perform policy improvement by

    $$\pi^{k+1}(b) = \arg\max_{a \in A} \bigg[ r(b, a) + \gamma \sum_{o \in \Omega} \text{Pr}(o \vert b, a) V^{\pi^{k}}(\tau(b, a, o)) \bigg]$$
  4. Repeat steps 2 and 3 until $\pi^{k}$ and $\pi^{k+1}$ converge.
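As an illustration of the improvement step (step 3), the following sketch reuses the belief_update, pr_obs, and belief_reward helpers from the earlier sketches and assumes some approximation V_fun of $V^{\pi^{k}}$ over beliefs (for example, the alpha-vector representation shown above); it is a sketch under those assumptions, not a complete policy iteration implementation:

```r
# One-step greedy (policy improvement) action at belief b:
# argmax_a [ r(b, a) + gamma * sum_o Pr(o | b, a) * V(tau(b, a, o)) ]
greedy_action <- function(b, actions, n_obs, T_prob, O_prob, R_mat, V_fun, gamma = 0.95) {
  q <- sapply(actions, function(a) {
    future <- sum(sapply(1:n_obs, function(o) {
      p_o <- pr_obs(b, a, o, T_prob, O_prob)
      if (p_o == 0) return(0)                      # skip impossible observations
      p_o * V_fun(belief_update(b, a, o, T_prob, O_prob))
    }))
    belief_reward(b, a, R_mat) + gamma * future
  })
  actions[which.max(q)]
}
```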

Witness Algorithm

The Witness algorithm searches for belief points ("witnesses") at which the current estimate of the value function is not yet optimal and operates action by action and observation by observation to build up the set of value vectors. The algorithm can be broadly described as follows:

  1. Start with value vectors at the corners of the belief space, where the underlying state is known.
  2. Define a linear program, based on Bellman's equation, that finds a point in the belief space where the current value function is incorrect.
  3. Add a new vector, which is a linear combination of the segments of the old value function.
  4. Iterate until no point can be found where the value function is incorrect.

There are various other methods for solving POMDPs, such as HSVI, and greedy solutions like dual-mode control, Q-MDP, and many others. In the next section, we show a tool available in the R programming language to solve POMDPs through programming.

The Functionality of the "pomdp" Package in R Language

There are two steps to using the pomdp package of the R language to solve a POMDP, and they are as follows:

  • Defining the POMDP problem using POMDP()
  • Solving the problem defined above by using solve_POMDP()

Defining a POMDP Problem

To define a POMDP problem using POMDP(), we need to pass the elements of the problem as arguments to the function, and they are listed below

  • states: Set of states $S$
  • actions: Set of actions $A$
  • observations: Set of observations $\Omega$
  • transition_prob: Conditional transition probabilities of states $T(s' \vert s, a)$
  • observation_prob: Conditional observation probabilities $O(o \vert s', a)$
  • reward: Reward function $R(s, a)$
  • discount: Discount factor $\gamma \in [0, 1]$
  • horizon: Number of periods over which to iterate the POMDP
  • terminal_values: A vector of state utilities at the end of the horizon
  • start: Initial probability distribution $b_0$ over the system states $S$
  • max: Denotes whether the POMDP is a maximization or a minimization problem
  • name: Assigns a name to the POMDP problem

For more details on the input format, please refer to the documentation of POMDP().

Solving a POMDP Problem

To solve a POMDP problem defined using POMDP(), we can use the function solve_POMDP(). This function takes an argument model, which is the POMDP model created by POMDP(). Another important argument is method, which specifies the algorithm used to solve the POMDP. The available options for the solving algorithm are grid (the default), enum, twopass, witness, and incprune. In the next section, we present a renowned POMDP problem and solve it using pomdp.

The Tiger Problem

The Tiger Problem is a renowned POMDP problem wherein a decision maker stands between two closed doors, one on the left and one on the right. A tiger is put behind one of the doors with equal probability, and treasure is placed behind the other. Therefore, the problem has two states, tiger-left and tiger-right. The decision maker can either listen for the roars of the tiger or open one of the doors, so the possible actions are listen, open-left, and open-right. However, the decision maker's observation capability is not 100% accurate: there is a 15% probability that, after listening, the decision maker misidentifies the side on which the tiger is present. Naturally, if the decision maker opens the door with the tiger, they will get hurt, so we assign a reward of -100 in this case. On the other hand, the decision maker gains a reward of 10 if they choose the door with the treasure. Once a door is opened, the tiger is again randomly assigned to one of the doors, and the decision maker gets to choose again.

Now, we can formulate this POMDP with POMDP(). The sketch below follows the lines of the pomdp package documentation; note that the discount factor of 0.75, the small cost of -1 for listening, and the R_() helper used to build the reward table are taken from the package's example rather than from the description above.
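```r
library(pomdp)

Tiger <- POMDP(
  name = "Tiger Problem",
  discount = 0.75,                       # assumed, as in the package's example
  states = c("tiger-left", "tiger-right"),
  actions = c("listen", "open-left", "open-right"),
  observations = c("tiger-left", "tiger-right"),
  start = "uniform",                     # tiger behind either door with equal probability

  # Listening does not move the tiger; opening a door resets the problem
  transition_prob = list(
    "listen"     = "identity",
    "open-left"  = "uniform",
    "open-right" = "uniform"),

  # Listening identifies the correct side with probability 0.85 (15% error)
  observation_prob = list(
    "listen"     = rbind(c(0.85, 0.15),
                         c(0.15, 0.85)),
    "open-left"  = "uniform",
    "open-right" = "uniform"),

  # -100 for opening the tiger's door, +10 for the treasure, -1 for listening (assumed)
  reward = rbind(
    R_("listen",                    v =   -1),
    R_("open-left",  "tiger-left",  v = -100),
    R_("open-left",  "tiger-right", v =   10),
    R_("open-right", "tiger-left",  v =   10),
    R_("open-right", "tiger-right", v = -100)
  )
)
```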

Now, to solve the Tiger Problem above using the default grid method, which uses a point-based value iteration, we can execute code along the following lines (the policy inspection helpers shown after the solver call are assumed from the pomdp package).
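```r
# Solve the Tiger POMDP with the default point-based (grid) method
sol <- solve_POMDP(model = Tiger, method = "grid")
sol

# Inspect the computed policy (alpha vectors and the action prescribed for each)
policy(sol)

# Plot the policy graph, similar to the one shown below
plot_policy_graph(sol)
```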

The computed policy for the Tiger Problem can be represented as a graph shown below.

[Figure: policy graph computed for the Tiger Problem]

As can be interpreted from the above graph, the decision maker starts at the "Initial Belief" state, wherein there is an equal probability of the tiger being behind the left or right door. Once an observation of "tiger-left" is made, the decision maker again chooses to listen. If the second observation is different, then the decision maker returns to the "Initial Belief" state. However, if the same observation is made twice, the decision maker chooses to open the right door. A similar case happens for the observation "tiger-right". Once the reward is obtained, the problem is reset, and the decision maker returns to the "Initial Belief" state. In the subsequent sections, we present extensions to the POMDP problem and its applications.

Extensions of POMDP

Monte Carlo POMDPs

Monte Carlo POMDPs handle continuous state spaces as well as continuous action spaces. They approximate the belief state and the value function using Monte Carlo methods.

POMDPs with Belief State Compression

This extension reduces the dimensionality of the belief space using Exponential Principal Component Analysis (E-PCA) and hence finds applications in mobile robot navigation.

Applications of POMDP

There are numerous applications of POMDPs and their extended versions in real life. Some of them are listed below.

  • Sensor placement
  • Mobile robot navigation
  • Games like poker, hidden object search, and many others
  • Transportation and logistics
  • Financial portfolio optimization

Conclusion

In this article, we learned about

  • Mathematical definition of POMDP and its similarity to real-life problems
  • Formulation of POMDP as a belief state MDP and various techniques to solve it
  • The pomdp package in R language and its use in the Tiger Problem
  • Extensions and applications of POMDP