My personal notes from the Reinforcement Learning course at UCL by DeepMind. I wrote these notes while reviewing the course to “really” understand it. Much appreciation to David Silver for uploading the lecture videos on YouTube.

✏️ Table of Contents

  • Recommended Textbooks
  • What differentiates RL from Supervised & Unsupervised learning?
  • Reward
  • RL is a unified framework for sequential decision making
  • Agent & Environment
  • History & State
  • Environment State
  • Agent State
  • Information State (Markov state)
  • Rat Example
  • Fully Observable Environments
  • Partially Observable Environments
  • Atari Example
  • Major Components of an RL Agent
  • Policy, Value function & Model in terms of the Maze Example
  • Categorizing RL agents
  • Problems within RL
  • Planning
  • Exploration and Exploitation
  • Prediction & Control
  • Gridworld Example
  • Conclusion

In the picture below, you can see the recommended textbooks. The more practical and concise of the two is the second one, which you can read here.

🆚 What differentiates RL from Supervised & Unsupervised learning?

  • No supervision: No one tells us what the right action to take is. Instead, it is a trial & error paradigm. There is just a reward signal saying whether a move was good (worth 3 points) or bad (worth -10 points). No one informs us what the best action is in a given situation; there is no supervisor.
  • No instant feedback: The second major distinction is that the feedback (good or bad) doesn't come instantaneously; it may be delayed by many steps. In the RL paradigm you make a decision now, and it may be many steps later that you actually see whether it was a good or bad decision. The decisions you make unfold over time, and it may be only retrospectively that you realize a move was bad. At the time, it might have seemed like a good move (you got some positive reward for a couple of steps) before something catastrophic happened, giving a high negative reward.
  • Time really matters in RL: It is a sequential decision-making process where the agent makes decisions (one step after another it picks actions) and sees how much reward it gets. Its ultimate goal is to optimise those rewards. For example, as a robot moves through an environment, what it sees now will be highly correlated with what it sees one second later.
  • In RL, the agent's actions (moving around the environment) affect the subsequent data that it receives.

🎁 Reward

One of the most fundamental quantities in RL is the idea of reward. A reward is just a number (a scalar feedback signal). The agent's goal is to maximize the cumulative sum of those rewards over time into the future.

If there is no intermediate reward what happens?

In that case, we define the end of an episode and give a reward at the end of the episode. The sum of the rewards is then exactly how well you did in that episode, and the goal of the agent is to pick actions that maximize that expected sum of rewards at the end of the episode.

What about if it is a time-based goal?

The goal now is to try to do something in the shortest amount of time. In that case we simply define the reward to be -1 per time step. It is a well-defined objective which forces the agent to minimize the time it takes to achieve its goal.
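As a toy sketch of this objective (the episode lengths are made up): with a reward of -1 per time step, the episodic return is just minus the number of steps, so maximizing the return is the same as minimizing time.

```python
def episodic_return(rewards):
    """Sum of the rewards collected over one episode."""
    return sum(rewards)

# Hypothetical episodes: the agent reached its goal in 20 vs. 8 steps.
slow_episode = [-1] * 20
fast_episode = [-1] * 8

# The faster episode has the higher return, so a return-maximizing
# agent is pushed toward finishing as quickly as possible.
assert episodic_return(fast_episode) > episodic_return(slow_episode)
```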

🧠 RL is a unified framework for sequential decision making

The goal for all these problems is always the same: select actions so as to maximize future reward. We basically want to pick a sequence of actions in order to get the most total reward. That means we have to plan ahead; the reward might not come now, but it may come some steps in the future. In some cases, it might be wise to sacrifice some good rewards now so as to get more reward later. So you can't be greedy when you do RL; you have to think ahead.

For example:

  • In a financial investment you have to spend some money now, but you believe that later on you will get more money back (it may take months to mature).
  • When flying a helicopter it is wise to land to refuel, sacrificing some reward now in order to avoid a disaster (a crash) later on.

🤖 Agent & Environment

Our goal is to build that brain (the agent; this is what we are controlling). The agent is responsible for taking actions (what investments to make, what moves to play, etc.). At each step, an action-decision is taken based on the observation and the reward the agent is receiving.

The agent is always interacting with the environment (receiving snapshots of the world). The environment generates the next observation and the reward based on the action it has received. Note that we do not have any control over the environment except through the actions we decide on.

This interaction continues on and on (a trial-and-error loop). The RL agent receives a stream of data (observations, actions & rewards). These experiences are the data that we use to train the agent.
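The interaction loop above can be sketched as follows; `ToyEnv` and `RandomAgent` are hypothetical stand-ins, not part of any real library.

```python
import random

class ToyEnv:
    """Environment: emits the next observation and reward given an action."""
    def step(self, action):
        reward = 1.0 if action == 1 else 0.0      # toy reward rule
        observation = random.random()             # toy next observation
        return observation, reward

class RandomAgent:
    """Agent: maps observations to actions and records its experience."""
    def __init__(self):
        self.history = []                         # (observation, action, reward)
    def act(self, observation):
        return random.choice([0, 1])

env, agent = ToyEnv(), RandomAgent()
obs = 0.0
for t in range(100):                              # the trial-and-error loop
    action = agent.act(obs)
    obs, reward = env.step(action)
    agent.history.append((obs, action, reward))   # the stream of experience
```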

📜 History & State

This stream of experiences (observations, actions, rewards) that comes through is what we call the history. The history is what the agent has seen so far. It is all the knowledge that the agent has about the environment. There might be other things going on, but the agent doesn't know about them. Thus, the algorithms we build are based only on the experiences the agent has been exposed to.

The algorithm is just a mapping from this history to actions. From the agent's point of view, what happens next depends on the history.

The environment looks at the action it has received, and also at the history, to decide what it is going to do next (what observation to produce).

This history actually governs the nature of how things proceed; it determines what happens next. But it is not very useful, because it is massive and we don't want to go back through it every time. What is useful instead is the state, which is like a summary of the information used to determine what will happen next. In other words, if we can replace the history with some concise summary that captures all the information needed to determine what happens next, then we have a much better chance of doing RL.

There are many different definitions of the state but we are going to explore three of them.

🌳 Environment State

It is basically the information used within the environment to determine what happens next (the internal process of an emulator). There is some state internally that summarizes everything the environment has seen and spits out the very next observation and reward.

The environment state is not visible to the agent; it is more a formalism that helps us understand what an environment is, which in turn helps us build the algorithm. The algorithm doesn't see what happens internally (inside the environment); it just receives a reward and an observation as output from the environment. Even if we had access, it might not be helpful, and therefore having our own subjective representation may be a good thing.

Thus, the environment state doesn't tell us anything useful for building an algorithm. We don't see it.

Multi-Agent System ?

From the perspective of one individual agent, all the other agents can be considered part of the environment. The fact that there are other agents wandering around the environment doesn't really change the formalism.

🦾 Agent State

Within our algorithm we will have a set of numbers that we use to capture what has happened to the agent so far (everything it has seen), and we use those numbers to pick the next action. The agent state is whatever information has been used to pick our next action; it reflects our decision of how to process the observations, what to remember and what to throw away. When building an algorithm we are always referring to the agent state. Our goal is to really understand how to pick actions given a state that summarizes everything we have seen so far.

🗿Information State (Markov state)

It is a more mathematical definition of state: a state representation that contains all useful information from the history. We can throw away all the previous states and just retain the current one, and we will get the same characterisation of the future, with the advantage of working with something more compact.

The state fully characterises:

  • the distribution over future actions, observations and rewards,
  • the way the environment will evolve going into the future,
  • the distribution of the events that will happen.

What this tells us is that we can make decisions based on our current state alone, and those decisions will still be optimal, because we haven't thrown away any useful information. A Markov state contains enough information to characterize all future rewards.

If only we had access to the environment state, that would be great, because the environment state is Markov: it fully characterises what will happen next (the next observation and reward). In addition, the full history is a Markov state, just not a useful one.

How does that fit with an agent controlling a helicopter?

A Markov state for the helicopter would be:

  • current position
  • velocity
  • wind direction and speed

If you know all of this, it doesn't matter what the position of the helicopter was ten minutes ago (its previous trajectory), since it makes no difference to where the helicopter will be the next moment. The only thing that matters is where it is right now, the wind direction and speed, and so on.

In contrast, let's assume you pick an imperfect state representation that is not Markov, containing just the position but not the velocity. Now where the helicopter is doesn't fully determine where it will go next; because you don't know the velocity, you have to look back at previous positions to determine it.
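This can be made concrete with a toy 1-D point mass (a stand-in for the helicopter, with made-up numbers): the pair (position, velocity) is a Markov state, while the position alone is not, because two trajectories can share a position yet evolve differently.

```python
def step(position, velocity, dt=1.0):
    """The next (position, velocity) depends only on the current pair."""
    return position + velocity * dt, velocity

# Two trajectories pass through the same position...
moving_right = (10.0, +2.0)
moving_left = (10.0, -2.0)

# ...but the full Markov state predicts different futures, which the
# position alone (10.0 in both cases) cannot distinguish.
assert step(*moving_right)[0] != step(*moving_left)[0]
```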

🐀 Rat Example

  • What if agent state = last 3 items in the sequence?
    As in the first row, you should expect to get electrocuted.
  • What if agent state = counts for lights, bells and levers?
    As in the second row, you should expect to get cheese.
  • What if agent state = complete sequence?
    We don't know. It might be that what happens next depends on the entire sequence, and you need to see all three things before knowing what will happen next. In that case we don't have the data to decide yet.

So, our job is to build an agent state that will do the best job of predicting what will happen next.
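The three candidate agent states can be written down as functions of the observed sequence; the cue labels below are hypothetical stand-ins for the lights, bells and levers the rat sees.

```python
from collections import Counter

sequence = ["light", "bell", "lever", "bell", "light", "lever"]

def last_three(seq):
    """Agent state = last 3 items in the sequence."""
    return tuple(seq[-3:])

def cue_counts(seq):
    """Agent state = counts for lights, bells and levers."""
    return dict(Counter(seq))

def full_history(seq):
    """Agent state = the complete sequence."""
    return tuple(seq)

# The same experience, summarized three different ways; each summary
# may predict a different outcome.
assert last_three(sequence) == ("bell", "light", "lever")
assert cue_counts(sequence) == {"light": 2, "bell": 2, "lever": 2}
```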

👀 Fully Observable Environments

That's the nice case where we get to see everything: the agent really gets the chance to look inside the environment state. The observation we see is the same as the agent state, which is the same as the environment state.

🔍 Partially Observable Environments

For example:

  • A robot with a camera might not know where exactly it is. Unless it has GPS, it gets to see just the camera stream; it has to figure out on its own where it is within the room.
  • A trading agent might only observe the current prices, without knowing the trends or the origin of those prices. It has to decide what to do next based on partial information.
  • A poker-playing agent only observes the public cards and doesn't know what is hiding in other players' hands.

We can build the agent state in several ways:

  • The naive approach: just remember everything (the full history).
  • Build beliefs (a probabilistic, Bayesian approach) by keeping a probability distribution over where the agent thinks it is in the environment. That vector of probabilities defines the state we are going to use to actually decide what to do next.
  • Instead of using probabilities we can use a neural network. You just apply a linear transformation and some nonlinearity to your old state, taking into account the latest observation, to produce a new state.
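The recurrent (neural-network-style) option can be sketched in scalar form; the weights below are arbitrary toy values, not learned ones.

```python
import math

def update_state(state, observation, w_s=0.5, w_o=0.5, b=0.0):
    """New state = nonlinearity(linear function of old state and observation)."""
    return math.tanh(w_s * state + w_o * observation + b)

state = 0.0
for obs in [1.0, -0.5, 2.0]:
    state = update_state(state, obs)   # fold each observation into the state
# `state` now summarizes the whole observation stream in one number.
```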

🕹 Atari Example

DeepMind used the last four frames (pictures) to decide what to do next. Keep in mind that the emulator (the environment state) decides both the reward and the observation, based on the action received from the joystick. No one tells us how this emulator works or what the rules of the game are. So it is a pure RL problem.

❗️Major Components of an RL Agent

  • Policy: how the agent picks its actions; a map from state to action. We want the policy to be such that we get the maximum reward.
  • Value function: it tells us how good it is to be in a particular state, how good it is to take a particular action, i.e. how much reward we expect to get if we take that action in that particular state. A key idea of RL is estimating how well we are doing in a particular situation. Basically, it tells us how much total reward we expect to get from some state onward if we follow this particular behavior. If we can compare those quantities, we have the basis to make good decisions. Gamma is just a discounting factor, which tells us that we care more about immediate rewards than later ones.

Is there a horizon of how far ahead we are looking?

The horizon is given by the discount factor (gamma): with every step into the future we discount the reward a little bit more, so essentially we do not care about things very far in the future.
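A minimal numeric illustration (with a hypothetical reward stream): each step's reward is weighted by gamma**t, so the return is effectively bounded and distant rewards barely count.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the episode."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0] * 100          # a reward of 1 at every step

# With gamma = 0.9 the return approaches 1 / (1 - 0.9) = 10 no matter
# how long the episode runs: the far future is discounted away.
assert abs(discounted_return(rewards, 0.9) - 10.0) < 0.01
```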

Is there a risk trade-off that you can make?

Can you trade off variance in those reward predictions? We define an objective in which we only care about the achieved reward, so risk is already factored in: the behavior that maximizes reward is the one that correctly trades off the agent's risk. In trading there are risk-sensitive Markov Decision Processes in which you explicitly try to account for risk.

  • Model: the agent's view of the environment; the agent can use this model of the environment to help make a plan and figure out what to do next. However, a model is not a requirement: there are model-free methods that never build a model of the environment.

A policy, a value function and a model are not all required for an agent; they are pieces that may or may not be used.

📍Policy, Value function & Model in terms of the Maze Example

Policy: mapping from state to action

Value Function: the cell before the goal has a value of -1, as we need only one step to reach the goal. At the first intersection we know we have to go up, as the value is -14 that way instead of -16 going down. Using this information we can build our optimal policy.

Model: the agent tries to build its own map of the environment, its own understanding of the dynamics of the environment so far. It's not reality; it is just the agent's model of reality. The -1 is the agent's model of how much immediate reward it gets at every step, since the reward was -1 per step no matter where it moved. It is a one-step prediction of reward, not a long-term prediction of reward.

✂️ Categorizing RL agents

We can now taxonomise our RL agents according to those three quantities:

A value-based agent contains a value function, so the policy is kind of implicit: it looks at the value function and picks the best action. It picks the greedy action with respect to the value function.

In a policy-based agent, instead of representing the value function inside the agent, we explicitly represent the policy, without storing the value function.

An actor-critic combines both the value function and the policy, aiming to get the best of both worlds.
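A sketch of that implicit policy, with made-up action values for a single state:

```python
Q = {"left": -1.2, "right": 0.7, "up": 2.5}   # hypothetical action values

def greedy_action(action_values):
    """The policy is implicit: pick the action with the highest value."""
    return max(action_values, key=action_values.get)

assert greedy_action(Q) == "up"
```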

The fundamental distinction in RL is between model-free and model-based methods.

Model-free: we do not explicitly try to understand the environment. We analyze the experiences and, based on them, determine the policy, avoiding the indirect step of determining how the environment works.

Model-based: we first build a model of the environment and then plan by looking ahead in it.

❌ Problems within RL

RL: the agent isn't told how the environment works; it has to figure things out for itself. It manages to do that by interacting with the environment (trial & error) and figuring out the policy that will maximize its reward.

Planning: the model of the environment is fully known; the agent knows all the rules of the game (e.g. in the helicopter example, the dynamics (differential equations) describing the wind are given). The agent has access to this perfect model, and computation time to interact with it, and as a result of those interactions it improves its policy. It doesn't actually have to do anything in the real world.

In RL the environment is unknown; in planning the environment is known. One way to do RL is to first learn how the environment works and then do planning.

📝 Planning

The agent knows how the emulator works and is able to look ahead and plan accordingly. It can simply query the emulator, without having to interact with the real environment, and build the whole search tree. It is a different problem setup, as the agent knows the rules of the game in advance.

⚖️ Exploration and Exploitation

How to balance exploration & exploitation efficiently is another problem in RL. RL is trial-and-error learning, as we don't know what the environment looks like. We have to discover which parts of the space are bad and which are good, but by doing that we might lose reward along the way. The goal is to find the optimal policy without giving up too many opportunities to exploit the things discovered along the way.

Exploration: giving up some reward in order to find out more about the environment. In the long run you might end up getting more reward.

Exploitation: exploiting the information you have already found in order to maximize reward.

For example, if you keep going to your favorite restaurant, you give up the opportunity to explore other restaurants, so you might end up with something sub-optimal. At the same time, you do not want to explore other restaurants all the time and eat bad food. So you need to balance those things out: find the best restaurant while eating delicious food along the way.
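One common way to balance the two (a sketch, not an algorithm from this lecture) is epsilon-greedy selection: with probability epsilon try a random restaurant, otherwise go to the best one found so far.

```python
import random

def epsilon_greedy(estimated_values, epsilon=0.1, rng=random):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimated_values))   # explore: random choice
    return max(range(len(estimated_values)),          # exploit: greedy choice
               key=lambda a: estimated_values[a])

# With epsilon = 0 the choice is purely greedy.
assert epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0) == 1
```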

🔧 Prediction & Control

It is another distinction in RL:

  • Prediction: how well will I do if I follow my current policy; how much reward will I get?
  • Control: what's the optimal policy; which direction should I go to get the most reward?

In RL we have to solve the prediction problem first in order to solve the control problem afterwards. We have to be able to evaluate our policies in order to figure out the optimal one.
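The prediction problem can be sketched with iterative policy evaluation on a tiny 4-state chain (my own toy example, not the course's gridworld): the reward is -1 per step, the last state is the goal, and the fixed policy always moves right.

```python
def evaluate_policy(n_states=4, gamma=1.0, sweeps=100):
    """Estimate V(s) for the fixed 'always move right' policy."""
    V = [0.0] * n_states                      # V[-1] is the terminal goal
    for _ in range(sweeps):
        for s in range(n_states - 1):
            V[s] = -1.0 + gamma * V[s + 1]    # Bellman expectation backup
    return V

# Each state's value is minus its distance to the goal.
assert evaluate_policy() == [-3.0, -2.0, -1.0, 0.0]
```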

🎮 Gridworld Example

The agent gets -1 per time step, but if it lands on point A it gets a +10 reward and is teleported to A'. A similar thing happens from B to B', but with a +5 reward. Finding out what reward we will get by moving around the grid under some policy is a prediction problem.

Previously, in the best state we were getting 8.8; now, by behaving optimally in the gridworld, we get a very different answer. The optimal value function, under the best behavior, gives us 24 units of reward in the best state. The reason is that we will follow the optimal strategy and move along these links (A to A').

So the optimization problem is very different from the prediction problem.

🤖 Conclusion

This brings us to the end of this article. I hope you liked my notes and learned something new.

Thanks for reading; if you liked this article, please consider subscribing to my blog. That way I get to know that my work is valuable to you, and I can also notify you about future articles.

💪💪💪💪 As always keep studying, keep creating 🔥🔥🔥🔥