At first, Markov Decision Processes (MDPs) can be a bit overwhelming. They are a fundamental concept in the field of reinforcement learning and you may encounter them in many papers, blog posts and tutorials. The notation, unfortunately, often invites a form of proof by intimidation whereby you have to figure out what the authors mean when they drop expectation subscripts or summation variables when writing equations 🫤

In this post, I want to build a step-by-step understanding of MDPs by starting from the basics, writing out the equations needed and, most importantly, building an intuition for what is going on.

Step 1: States

When we talk about sequential processes in computer science, we often iterate over some set of states. The set of states, denoted by $\mathcal{S}$, can be really large for almost all real-world problems. A single state usually means the current configuration of a process, like the state of a chess board. Let’s define a really simple set of states covering a possible commute to work:

graph TD
  Home
  Work
  Late

which has three states, $|\mathcal{S}| = 3$, in total. We haven’t done much yet, but thinking about the number of states and how to define them in the problem domain is really important. You’ll often hear things like “the state space is exponentially large”, which refers to the size of this set.
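If you prefer code to notation, the state set really is just a set (a minimal Python sketch; the names mirror the diagram above):

# The set of states S for the commute example.
STATES = {"Home", "Late", "Work"}

print(len(STATES))  # |S| = 3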

Sometimes what makes a problem environment hard is the number of states it may have. For example, think about the number of possible board configurations in chess versus Go. The reason this state space size comes up so often is that we end up searching across it to find potentially good or bad states. In other words, like many things in computer science and artificial intelligence, it is a search problem that brings our attention to states and their total number. What are we searching for? A way to win. Some states, like checkmate in chess, are more desirable. In our example above, we might like to find ways to not be Late to Work.

Step 2: Transitions

The second building block is transitions. Life would be very boring if we were always stuck in a single state, which we denote using a lower case $s$ such that $s \in \mathcal{S}$. Imagine we were always at $s = \text{Home}$, or perhaps even better, always at $s = \text{Work}$. Luckily, we can transition between states. For example, our life could look like:

graph LR
  Home --> Late
  Late --> Work
  Work --> Home

This is deterministic, in the sense that every day we are always late to work, such is life. We also know for certain that if we are at $s = \text{Work}$, we’ll go back home. But sometimes we might get stuck at work, though not always. Hence, there is a probability attached to what will happen to us next.

To capture this probability, we need to say that our current state is a random variable $S_t$. Suppose after we wake up at home, whether we are late or not is at the mercy of the London Underground 🚆. Most of the time it works, but sometimes, let’s say a 1 in 10 chance, things become perfectly miserable:

graph LR
  Home -->|0.1| Late
  Home -->|0.9| Work

We can formalise this situation with transition probabilities:

$$Pr(S_{t+1} = s' | S_t = s)$$

which asks the question: What is the probability of ending up in state $s'$ given we are at state $s$ at time $t$? I’m using $Pr$ in this case to be super clear about probabilities but $P$ is also a common symbol.

In university-level lecture slides, you’ll often start with manageable notation like the above. But at some point, notation will go on holiday 🏖

Let’s be very clear and write out the probabilities for the example above:

$$ \begin{align} Pr(S_{t+1} &= \text{Late} | S_t = \text{Home}) = 0.1 \\ Pr(S_{t+1} &= \text{Work} | S_t = \text{Home}) = 0.9 \end{align} $$
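In code, this is nothing more than a conditional distribution keyed on where we can end up (a tiny sketch; the dictionary layout is my choice):

# Pr(S_{t+1} = s' | S_t = "Home") as a plain dictionary.
from_home = {"Late": 0.1, "Work": 0.9}

# The probabilities of where we might end up must sum to 1.
assert abs(sum(from_home.values()) - 1.0) < 1e-9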

And we can go ahead and complete the picture:

graph LR
  Home -->|0.1| Late
  Home -->|0.9| Work
  Late -->|1.0| Work
  Work -->|0.05| Work
  Work -->|0.95| Home

assuming that when we are late, we always end up at work, and that when we are at work, there is a 5 percent chance (so 5 out of every 100 visits to the office) we end up working a bit more. If we collect all these probabilities, we get a transition matrix:

$$ P_{ss'} = \begin{bmatrix} 0.0 & 0.1 & 0.9 \\ 0.0 & 0.0 & 1.0 \\ 0.95 & 0.0 & 0.05 \end{bmatrix} $$

where $P_{ss'}$ is the probability of transitioning from state $s$ to state $s'$, with the states ordered as Home, Late, Work along both the rows and the columns. Notice how the rows sum to 1, which is a requirement for a valid probability distribution.
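If you want to poke at this with code, here is a minimal NumPy sketch that builds the matrix, checks the rows, and takes a short random walk through the chain (the variable names and the fixed Home, Late, Work ordering are my choices):

import numpy as np

# States in a fixed order so we can index the matrix: Home, Late, Work.
states = ["Home", "Late", "Work"]

P = np.array([
    [0.00, 0.10, 0.90],  # from Home
    [0.00, 0.00, 1.00],  # from Late
    [0.95, 0.00, 0.05],  # from Work
])

# Every row must sum to 1 to be a valid probability distribution.
assert np.allclose(P.sum(axis=1), 1.0)

# A short random walk: sample the next state from the current state's row.
rng = np.random.default_rng(0)
s = states.index("Home")
for t in range(5):
    s_next = rng.choice(len(states), p=P[s])
    print(f"t={t}: {states[s]} -> {states[s_next]}")
    s = s_next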

We have successfully created a discrete-time Markov chain: some states and stochastic (meaning random) transitions between them. This is also sometimes referred to as a Markov process. But we made one key assumption to make all this happen.

Step 3: Markov Property

Just like Bayes, the name Markov gets thrown around a lot in the machine learning community, especially if you are dealing with reinforcement learning. The term Markov, or Markovian, usually refers to the Markov property.

The Markov property means that the future is independent of the past given the present.

  • Future: This is our random variable one time-step ahead, $S_{t+1}$
  • Present: The current state $S_t$.
  • Past: What happened since the dawn of time $S_0, S_1, \ldots, S_{t-1}$.

Putting these together, the Markov property describes the following relationship:

$$ Pr(S_{t+1} | S_t) = Pr(S_{t+1} | S_0, S_1, \ldots, S_t) $$

so the probability of the next state (the future) given the current state is equal to the probability of the next state given all the states that came before it (the past). The past doesn’t matter. Practically, what does this mean? It means that when you are at work, you don’t think about whether you were late or not. You could continue on with your day (if you were a Markov chain, of course) even if you were magically teleported to work.

A little side note on how things are tackled in machine learning and reinforcement learning. When we talk about how the future evolves based solely on the current state, we are describing the process itself (the environment) and not necessarily the agent. The environment is memoryless but the actors may store past experiences in their internal representation. For example, a computer program simulating chess only needs to know the current state of the board to determine what happens next when an action is taken (like moving a piece) while the actors may remember the past moves to strategise better.

Step 4: Actions

If we are navigating the world as a conscious agent, then we are going to make decisions. Recall that the title of this post was Markov Decision Processes. Actions are those decisions. We can choose to take the bus or the underground when commuting to work. Our actions have consequences and so on. How do we bake this into our current formalism?

Similar to states, all the actions we can take form a set $\mathcal{A}$, which could include $\text{underground} \in \mathcal{A}$ and $\text{bus} \in \mathcal{A}$. These actions will influence what happens to us next, i.e. what our next state is. If I take the bus, there is a higher chance of getting in late because the traffic in London is terrible. So it seems natural that how we transition between states would depend on the action $a$ we take. Similar to states, which action I take might be random, hence the current action at time-step $t$ is a random variable $A_t$. I’m saying random in the probabilistic sense rather than truly random in the colloquial sense; an unbiased coin is regarded as truly random, but a biased coin’s output is still random. We now update our definition of the transition probabilities to include actions:

$$Pr(S_{t+1} = s' | S_t = s, A_t = a)$$

which reads: given a state $s$ and action $a$ at time-step $t$, what is the probability of ending up in state $s'$?

The picture we had from before has now grown in size:

graph LR
  Home --> BusW(Bus)
  Home --> UndW(Underground)
  BusW -->|0.8| Late
  BusW -->|0.2| Work
  UndW -->|0.1| Late
  UndW -->|0.9| Work
  Late --> Arrive(Arrive)
  Arrive -->|1.0| Work
  Work --> Stay(Stay)
  Work --> BusH(Bus)
  Work --> UndH(Underground)
  Stay -->|1.0| Work
  BusH -->|1.0| Home
  UndH -->|1.0| Home

which captures some properties of commuting in London:

  • If we take the bus to work, there is a higher chance (80 percent) of being late.
  • If we decide to stay at work, we for sure (100 percent) stay at work.
  • When going back home, we always make it regardless of which transport option we took.
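One way to hold all of this in code is a nested mapping from state and action to a distribution over next states (a sketch mirroring the diagram; the layout is my choice, not a standard API):

# Pr(S_{t+1} = s' | S_t = s, A_t = a), keyed as transitions[s][a][s'].
transitions = {
    "Home": {
        "Bus":         {"Late": 0.8, "Work": 0.2},
        "Underground": {"Late": 0.1, "Work": 0.9},
    },
    "Late": {
        "Arrive": {"Work": 1.0},
    },
    "Work": {
        "Stay":        {"Work": 1.0},
        "Bus":         {"Home": 1.0},
        "Underground": {"Home": 1.0},
    },
}

# Every (state, action) pair must give a valid distribution over next states.
for s, actions in transitions.items():
    for a, dist in actions.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9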

So which action should we take? Ah, yes, that is indeed the question; in fact, the only question reinforcement learning is concerned with. The policy describes which actions an agent takes. We also don’t have to be strict (deterministic) about which actions we take:

$$ \pi(a | s) = Pr(A_t = a | S_t = s) $$

hence, the policy describes a distribution over actions given a state. We can say each action is equally likely (a uniform distribution), or something nonsensical like actions that start with the letter ‘b’ are twice as likely. It gives us the freedom to decide. But how do we know if an action or a policy is better than any other?
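Before answering that, here is what the simplest possible policy, a uniform one, looks like in code (a minimal sketch; the dictionary of available actions is read off the diagram above):

# Which actions are available in each state, read off the diagram above.
available_actions = {
    "Home": ["Bus", "Underground"],
    "Late": ["Arrive"],
    "Work": ["Stay", "Bus", "Underground"],
}

def uniform_policy(state):
    """pi(a | s): a uniform distribution over the actions available in state."""
    actions = available_actions[state]
    return {a: 1.0 / len(actions) for a in actions}

print(uniform_policy("Home"))  # {'Bus': 0.5, 'Underground': 0.5}
print(uniform_policy("Work"))  # each of Stay, Bus, Underground gets 1/3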

Step 5: Rewards

How do we know if an action or a policy is better than others? In this formalisation, the world / environment tells us. Each transition is associated with a reward, which could be positive or negative. The reward just tells us good or bad, nothing more and nothing less. Like with states and actions, the reward we get might not be fixed. It may vary: for example, if we decide to take the bus to work, it has cleaner air than the underground, unless it is super busy and smells of sweat. Let’s call this random reward variable $R_{t+1}$; it sits one time-step in the future because it is the immediate reward we receive after taking an action at time $t$. So how much reward do we expect to get if we take an action?

$$\mathbb{E}[R_{t+1} | S_t = s, A_t = a] = \sum_{r} r Pr(R_{t+1} = r | S_t = s, A_t = a)$$

which reads: the expected reward after taking action $a$ in state $s$. I assumed the rewards are discrete, hence $R_{t+1}$ is a discrete random variable and the expectation is a sum. If it were continuous, the expectation would involve an integral, but the underlying principle is the same. If the reward is deterministic, then all this expectation stuff is unnecessary. The reward functions in most reinforcement learning settings are in fact deterministic, but occasionally you might get a game like Elden Ring in which taking the same action from the same spot (state) might not always yield the same outcome (reward).
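To see the arithmetic, here is a tiny sketch with a made-up discrete reward distribution (the -1 / -4 numbers are chosen to match the bus example coming up below):

# A made-up distribution Pr(R_{t+1} = r | S_t = s, A_t = a) for one (s, a) pair.
reward_dist = {-1: 0.5, -4: 0.5}  # e.g. half the time the bus is smelly

# E[R_{t+1} | s, a] is just the probability-weighted sum over rewards.
expected_reward = sum(r * p for r, p in reward_dist.items())
print(expected_reward)  # -2.5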

Let’s now complete the full Markov Decision Process with our running example:

graph LR
  Home --> BusW(Bus)
  Home --> UndW(Underground)
  BusW -->|0.8, R: -1 or -4| Late
  BusW -->|0.2, R: 5| Work
  UndW -->|0.1, R: -5| Late
  UndW -->|0.9, R: 1| Work
  Late --> Arrive(Arrive)
  Arrive -->|1.0, R: 0| Work
  Work --> Stay(Stay)
  Work --> BusH(Bus)
  Work --> UndH(Underground)
  Stay -->|1.0, R: -1 or 1| Work
  BusH -->|1.0, R: 5| Home
  UndH -->|1.0, R: 1| Home

As you can see, we added a reward to each action-to-state transition, some of which are probabilistic:

  • While taking the bus to work makes us more likely to be late, being late carries less of a penalty than taking the underground and being late. But there is a 50 percent chance the reward will be -1 or -4, i.e. a 50 percent chance the bus will be stinking. I assumed it’s 50 percent, but the frequentist in me would put that probability at 0.8 👀
  • Similarly, staying at work has a probabilistic reward of -1 or 1. If you are a workaholic, you might get a reward of 1, otherwise, you get a penalty of -1.
  • All other actions have deterministic rewards which are roughly what you would expect. Going home is always a good thing but taking the bus home is better than the underground.
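To make these numbers concrete, here is a small sketch that computes the expected immediate reward $\mathbb{E}[R_{t+1} | S_t = s, A_t = a]$ for every state-action pair in the diagram, averaging the probabilistic rewards first (the 50/50 splits are the assumptions stated above):

# Probabilistic rewards averaged first:
#   bus-and-late:  0.5 * (-1) + 0.5 * (-4) = -2.5
#   stay-at-work:  0.5 * (-1) + 0.5 * ( 1) =  0.0
expected_reward = {
    ("Home", "Bus"):         0.8 * (-2.5) + 0.2 * 5.0,  # -1.0
    ("Home", "Underground"): 0.1 * (-5.0) + 0.9 * 1.0,  #  0.4
    ("Late", "Arrive"):      1.0 * 0.0,                 #  0.0
    ("Work", "Stay"):        1.0 * 0.0,                 #  0.0
    ("Work", "Bus"):         1.0 * 5.0,                 #  5.0
    ("Work", "Underground"): 1.0 * 1.0,                 #  1.0
}

for (s, a), r in expected_reward.items():
    print(f"E[R | {s}, {a}] = {r:+.1f}")

These are only single-step expectations, of course; they say nothing yet about what happens after the transition.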

According to this MDP, would you always take the bus back home?

Conclusion

In conclusion, the Markov Decision Process is a collection of states, transitions, actions and rewards. It describes the environment in which actions are taken, rewards are given out and the system transitions between states. This formalisation is the one used in reinforcement learning.

You can imagine how slowly adding more states, actions and rewards just to model a simple commute to work can get out of hand. The ultimate MDP, some might argue, is the real world, but what are the rewards? People value different things and the world changes over time. I’d like to invite you to think about psychology as a field of study and take a break from the rigid formulations of computer science.