Reinforcement Learning Tutorial | Reinforcement Learning Example | Reinforcement Learning Algorithms

What is Reinforcement Learning ?


  • Reinforcement Learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and observing their results. For each good action the agent gets positive feedback, and for each bad action it gets negative feedback or a penalty.
  • In Reinforcement Learning, the agent learns automatically from feedback without any labeled data, unlike supervised learning.
  • Since there is no labeled data, the agent is bound to learn from its own experience only.
  • RL solves a specific kind of problem where decision making is sequential and the goal is long-term, such as game playing, robotics, etc.
  • The agent interacts with the environment and explores it by itself.
  • The primary goal of an agent in reinforcement learning is to improve its performance by collecting the maximum positive reward.
  • The agent learns by trial and error, and based on that experience it learns to perform the task in a better way.
  • We can say that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within it." How a robotic dog learns to move its limbs is an example of reinforcement learning.
  • It is a core part of AI, and many AI agents build on the concept of reinforcement learning. We do not need to pre-program the agent, because it learns from its own experience without human intervention.


  • Suppose there is an AI agent inside a maze environment, and its goal is to find the diamond.
  • The agent interacts with the environment by performing actions. Based on those actions, the state of the agent changes, and it also receives a reward or penalty as feedback.
  • The agent keeps doing these three things (take an action, change state or remain in the same state, and receive feedback), and by doing so it learns and explores the environment.
  • The agent learns which actions lead to positive feedback or rewards and which actions lead to negative feedback or penalties.
  • As a positive reward, the agent gets a positive point, and as a penalty, it gets a negative point.
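The interaction loop described above can be sketched in Python. This is only a minimal illustration under stated assumptions: the one-dimensional corridor, the `step` function, and the reward values are hypothetical stand-ins for the maze, and the agent here acts purely at random without learning anything yet.

```python
import random

# Hypothetical 1-D corridor standing in for the maze: positions 0..4,
# with the diamond at position 4. All names here are illustrative.
GOAL = 4
ACTIONS = (-1, +1)  # move left, move right


def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), GOAL)  # the ends act as walls
    if next_state == GOAL:
        return next_state, 1, True   # positive point for finding the diamond
    return next_state, 0, False      # no feedback yet


def run_episode(rng, max_steps=1000):
    """Repeat: take an action, observe the new state, collect the feedback."""
    state, total_reward, steps = 0, 0, 0
    for _ in range(max_steps):
        action = rng.choice(ACTIONS)  # explore by trial and error
        state, reward, done = step(state, action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


total, steps = run_episode(random.Random(0))
print(f"reward={total}, steps={steps}")
```

A learning agent would replace the random choice of action with a policy that improves as feedback accumulates.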

Terms used in Reinforcement Learning


  • Agent :
    • An entity that can perceive/explore the environment and act upon it.
  • Environment :
    • The situation in which an agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
  • Action :
    • Actions are the moves taken by an agent within the environment.
  • State :
    • A state is the situation returned by the environment after each action taken by the agent.
  • Reward :
    • Feedback returned to the agent from the environment to evaluate the agent's action.
  • Policy :
    • A policy is the strategy the agent applies to choose the next action based on the current state.
  • Value :
    • The expected long-term return with the discount factor applied, as opposed to the short-term reward.
  • Value Function :
    • It specifies the value of a state, that is, the total amount of reward an agent can expect to collect starting from that state.
  • Model of the environment :
    • This mimics the behavior of the environment. It lets you make inferences about how the environment will behave.
  • Model based methods :
    • Methods for solving reinforcement learning problems that make use of a model of the environment.
  • Q-value :
    • It is mostly similar to the value, but it takes one additional parameter, the current action (a).

Key Features of Reinforcement Learning

  • In RL, the agent is not instructed about the environment or which actions need to be taken.
  • It is based on the trial-and-error process.
  • The agent takes the next action and changes state according to the feedback from the previous action.
  • The agent may get a delayed reward.
  • The environment is stochastic, and the agent needs to explore it to obtain the maximum positive reward.

Approaches to Implement Reinforcement Learning

  • There are mainly three ways to implement reinforcement learning in ML, which are:
    • Value Based
    • Policy Based
    • Model Based


  • The value-based approach is about finding the optimal value function, which is the maximum value at a state under any policy.
  • The agent therefore expects the long-term return at any state s under policy π.


  • The policy-based approach is to find the optimal policy for the maximum future reward without using the value function.
  • In this approach, the agent tries to apply a policy such that the action performed at each step helps to maximize the future reward.
  • The policy-based approach has mainly two kinds of policy:
    • Deterministic : The same action is produced by the policy (π) at any given state.
    • Stochastic : In this policy, probability determines the produced action.


  • In the model-based approach, a virtual model is created for the environment, and the agent explores that environment to learn it.
  • There is no single solution or algorithm for this approach, because the model representation differs from environment to environment.

Elements of Reinforcement Learning

  • There are four main elements of Reinforcement Learning, which are given below:
    • Policy
    • Reward Signal
    • Value Function
    • Model of the environment


  • A policy can be defined as the way an agent behaves at a given time.
  • It maps the perceived states of the environment to the actions to be taken in those states.
  • A policy is the core element of RL, because it alone can define the behavior of the agent.
  • In some cases it may be a simple function or a lookup table, whereas in other cases it may involve general computation such as a search process.
  • It can be a deterministic or a stochastic policy:
    • For Deterministic Policy : a = π(s)
    • For Stochastic Policy : π(a | s) = P[At = a | St = s]
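The two kinds of policy can be illustrated with a short Python sketch. The state names, the action probabilities, and the helper functions `act_deterministic` and `act_stochastic` are hypothetical and chosen only for illustration: the deterministic policy is a lookup table implementing a = π(s), while the stochastic policy samples an action from π(a | s).

```python
import random

# Hypothetical states and actions, chosen only for illustration.
ACTIONS = ["up", "down", "left", "right"]

# Deterministic policy a = π(s): a lookup table from state to one action.
deterministic_policy = {"S9": "up", "S5": "up", "S1": "right"}

# Stochastic policy π(a | s): for each state, one probability per action.
stochastic_policy = {
    "S9": {"up": 0.7, "down": 0.0, "left": 0.0, "right": 0.3},
}


def act_deterministic(state):
    """Return the single action the policy prescribes for this state."""
    return deterministic_policy[state]


def act_stochastic(state, rng):
    """Sample an action according to the probabilities π(a | s)."""
    probs = stochastic_policy[state]
    weights = [probs[a] for a in ACTIONS]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]


rng = random.Random(42)
print(act_deterministic("S9"))    # always "up"
print(act_stochastic("S9", rng))  # "up" or "right", chosen by probability
```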

Reward Signal

  • The goal of reinforcement learning is defined by the reward signal.
  • At each state, the environment sends an immediate signal to the learning agent, and this signal is known as the reward signal.
  • These rewards are given according to the good and bad actions taken by the agent.
  • The agent's main objective is to maximize the total reward it receives for good actions.
  • The reward signal can change the policy: for example, if an action selected by the agent leads to a low reward, the policy may change to select other actions in the future.

Value Function

  • The value function gives information about how good a state and action are and how much reward an agent can expect.
  • A reward is the immediate signal for each good or bad action, whereas a value function specifies which states and actions are good in the long run.
  • The value function depends on the reward, since without reward there can be no value. The goal of estimating values is to obtain more reward.
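The difference between immediate reward and long-term value can be made concrete with a small Python sketch. The function name `discounted_return` and the sample reward sequence are assumptions made for illustration; the discount factor 0.9 is the one used later in this tutorial.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each discounted by gamma once per time step."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# The immediate rewards are zero until the last step...
rewards = [0, 0, 1]
# ...yet the value seen from the first step is already positive.
print(discounted_return(rewards))
```

Here the first two steps earn nothing immediately, but the state at the start still has a positive value because it leads to a reward two steps later.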


  • The last element of reinforcement learning is the model, which mimics the behavior of the environment.
  • With the help of the model, one can make inferences about how the environment will behave. For example, given a state and an action, a model can predict the next state and reward.
  • The model is used for planning, which means it provides a way to choose a course of action by considering future situations before actually experiencing them. Approaches that solve RL problems with the help of a model are called model-based approaches. By comparison, an approach that does not use a model is called a model-free approach.

How does Reinforcement Learning Work ?

  • To understand how RL works, we need to consider two main things:
    • Environment:
      • It can be anything, such as a room, a maze, a football ground, etc.
    • Agent:
      • An intelligent agent, such as an AI robot.

Let's take the example of a maze environment that the agent needs to explore. Consider the image below:

 Reinforcement Learning Bellman Equation

  • In the above image, the agent is at the very first block of the maze. The maze contains an S6 block, which is a wall, an S8 block, which is a fire pit, and an S4 block, which is a diamond block.
  • The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can take four actions: move up, move down, move left, and move right.
  • The agent can take any path to reach the final point, but it needs to do so in as few steps as possible. Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the +1 reward.
  • The agent will try to remember the preceding steps it has taken to reach the final step. To memorize the steps, it assigns the value 1 to each previous step. Consider the step below:
 Reinforcement Learning Bellman Equation2

  • Now the agent has successfully stored the previous steps by assigning the value 1 to each previous block. But what will the agent do if it starts moving from a block that has a block of value 1 on both sides? Consider the diagram below:
 Reinforcement Learning Bellman Equation3

  • It will be difficult for the agent to decide whether to go up or down, as each block has the same value. So the above approach is not suitable for the agent to reach the destination. To solve this problem, we will use the Bellman equation, which is the main concept behind reinforcement learning.

The Bellman Equation

  • The Bellman equation was introduced by the mathematician Richard Ernest Bellman in 1953, and hence it is called the Bellman equation. It is associated with dynamic programming and is used to calculate the value of a decision problem at a given point in terms of the values of the states that follow it.
  • It is a way of calculating value functions in dynamic programming, and it leads to modern reinforcement learning.
  • The key elements used in the Bellman equation are:
    • The action performed by the agent is referred to as "a".
    • The state in which the action is performed is "s", and the state it leads to is "s'".
    • The reward/feedback obtained for each good and bad action is "R".
    • The discount factor is gamma, "γ".

The Bellman equation can be written as:

V(s) = max_a [R(s, a) + γ V(s')]

V(s) = the value calculated at a particular state s.
R(s, a) = the reward obtained at state s by performing action a.
γ = the discount factor.
V(s') = the value of the next state s', reached from s by performing action a.

  • In the above equation, we take the maximum over all actions because the agent always tries to find the optimal solution.
  • Using the Bellman equation, we can now find the value at each state of the given environment. We will start from the block next to the target block.

For 1st block

V(s3) = max [R(s, a) + γV(s')], here V(s') = 0 because there is no further state to move to.
V(s3) = max[R(s, a)] => V(s3) = max[1] => V(s3) = 1.

For 2nd block

V(s2) = max [R(s, a) + γV(s')], here we take γ = 0.9, V(s') = 1, and R(s, a) = 0, because there is no reward at this state.
V(s2) = max[0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9

For 3rd block

V(s1) = max [R(s, a) + γV(s')], here γ = 0.9, V(s') = 0.9, and R(s, a) = 0, because there is no reward at this state either.
V(s1) = max[0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81

For 4th block

V(s5) = max [R(s, a) + γV(s')], here γ = 0.9, V(s') = 0.81, and R(s, a) = 0, because there is no reward at this state either.
V(s5) = max[0.9(0.81)] => V(s5) = max[0.729] => V(s5) ≈ 0.73

For 5th block

V(s9) = max [R(s, a) + γV(s')], here γ = 0.9, V(s') = 0.73, and R(s, a) = 0, because there is no reward at this state either.
V(s9) = max[0.9(0.73)] => V(s9) = max[0.657] => V(s9) ≈ 0.66
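The five calculations above follow one pattern, which a short Python sketch can reproduce. The variable names are illustrative; the sketch considers only the single action along the path S9-S5-S1-S2-S3, and unlike the hand calculation it does not round intermediate values, although both give the same results to two decimal places.

```python
GAMMA = 0.9  # discount factor used in the worked example

# Blocks in the order their values are computed: from the block next
# to the diamond back to the starting block.
path = ["s3", "s2", "s1", "s5", "s9"]

values = {}
next_value = 0.0  # V(s') for s3: there is no further state beyond the goal
reward = 1.0      # R(s, a) for the move from s3 into the diamond block

for state in path:
    # Only one action per block is considered here, so the max over
    # actions in V(s) = max_a [R(s, a) + γ V(s')] reduces to one term.
    values[state] = reward + GAMMA * next_value
    next_value = values[state]
    reward = 0.0  # moves that do not reach the diamond earn no reward

for state, value in values.items():
    print(f"V({state}) = {value:.2f}")
```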
 Reinforcement Learning Bellman Equation

  • Now we move on to the 6th block. Here the agent may change its route, because it always tries to find the optimal path. So let us consider the block next to the fire pit.
 Reinforcement Learning Bellman Equation2

  • Now the agent has three options to move: if it moves to the blue box, it will feel a bump; if it moves to the fire pit, it will get the -1 reward. Since we are taking only positive rewards here, it will move upwards only. All the block values are calculated using this formula. Consider the image below:
 Reinforcement Learning Bellman Equation
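Applying the same formula to every block can be sketched as a small value-iteration program in Python. The grid layout is an assumption, since the original image is not reproduced here: 3 rows by 4 columns, numbered S1 to S12 row by row, which is consistent with the wall at S6, the fire pit at S8, the diamond at S4, and the path S9-S5-S1-S2-S3. Moves are assumed deterministic, and bumping into the wall or the edge leaves the agent where it is.

```python
GAMMA = 0.9

# Assumed layout: 3 rows x 4 columns, numbered S1..S12 row by row.
GRID = [
    ["S1", "S2", "S3", "S4"],
    ["S5", "S6", "S7", "S8"],
    ["S9", "S10", "S11", "S12"],
]
WALL, DIAMOND, FIRE = "S6", "S4", "S8"
TERMINAL_REWARD = {DIAMOND: 1.0, FIRE: -1.0}  # reward for entering the block
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
POSITION = {name: (r, c) for r, row in enumerate(GRID) for c, name in enumerate(row)}


def next_state(state, action):
    """Deterministic move; hitting the wall or the edge keeps the agent in place."""
    r, c = POSITION[state]
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and GRID[nr][nc] != WALL:
        return GRID[nr][nc]
    return state


def backup(values, state):
    """One Bellman backup: V(s) = max_a [R(s, a) + γ V(s')]."""
    best = float("-inf")
    for action in MOVES:
        nxt = next_state(state, action)
        best = max(best, TERMINAL_REWARD.get(nxt, 0.0) + GAMMA * values[nxt])
    return best


def value_iteration(sweeps=50):
    # The diamond, the fire pit, and the wall keep value 0: no moves start there.
    values = {name: 0.0 for name in POSITION}
    free = [s for s in POSITION if s not in TERMINAL_REWARD and s != WALL]
    for _ in range(sweeps):
        for state in free:
            values[state] = backup(values, state)
    return values


values = value_iteration()
for row in GRID:
    print("  ".join(f"{name}:{values[name]:5.2f}" for name in row))
```

Under these assumptions the block next to the fire pit (S7) gets the value 0.9, because moving up towards S3 is better than moving into the fire pit or bumping into the wall.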

Types of Reinforcement Learning

  • There are mainly two types of reinforcement, which are:
    • Positive Reinforcement
    • Negative Reinforcement

Positive Reinforcement

  • Positive reinforcement means adding something to increase the tendency that the expected behavior will occur again.
  • It has a positive impact on the behavior of the agent and increases the strength of the behavior.
  • This type of reinforcement can sustain changes for a long time, but too much positive reinforcement may lead to an overload of states that can reduce the results.

Negative Reinforcement

  • Negative reinforcement is the opposite of positive reinforcement in that it increases the tendency that a specific behavior will occur again by avoiding a negative condition.
  • It can be more effective than positive reinforcement depending on the situation and behavior, but it provides reinforcement only to meet the minimum behavior.

How to represent the agent state ?

  • We can represent the agent state using a Markov state, which contains all the required information from the history. The state St is a Markov state if it satisfies the following condition:
P[St+1 | St] = P[St+1 | S1, ..., St]
  • The Markov state follows the Markov property, which says that the future is independent of the past and can be defined by the present alone. The RL setting described here assumes a fully observable environment, where the agent can observe the environment and act on the new state. The complete process is known as a Markov Decision Process, which is explained below:

Markov Decision Process

  • A Markov Decision Process, or MDP, is used to formalize reinforcement learning problems. If the environment is completely observable, then its dynamics can be modeled as a Markov process. In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state.
  • An MDP is used to describe the environment for RL, and almost all RL problems can be formalized using an MDP.
  • An MDP is a tuple of four elements (S, A, Pa, Ra):
    • A finite set of states S
    • A finite set of actions A
    • Pa, the probability of transitioning from state s to state s' due to action a
    • Ra, the reward received after transitioning from state s to state s' due to action a
  • An MDP relies on the Markov property, so to better understand MDPs, we need to learn about it.
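The four-element tuple can be written down as a small data structure. The following Python sketch is one possible representation, not a standard API: the class name `MDP`, its field names, and the two-state example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: set       # S
    actions: set      # A
    transitions: dict  # Pa: (s, a) -> {s': probability of reaching s'}
    rewards: dict      # Ra: (s, a, s') -> reward; missing entries mean 0

    def validate(self):
        """Check that every transition distribution sums to 1."""
        for (s, a), dist in self.transitions.items():
            if s not in self.states or a not in self.actions:
                raise ValueError(f"unknown state or action: {(s, a)}")
            if abs(sum(dist.values()) - 1.0) > 1e-9:
                raise ValueError(f"probabilities for {(s, a)} do not sum to 1")
        return True

    def expected_reward(self, s, a):
        """Average reward for taking action a in state s."""
        return sum(
            p * self.rewards.get((s, a, s2), 0.0)
            for s2, p in self.transitions[(s, a)].items()
        )


# Hypothetical two-state example: "go" from A usually reaches B.
toy = MDP(
    states={"A", "B"},
    actions={"stay", "go"},
    transitions={
        ("A", "stay"): {"A": 1.0},
        ("A", "go"): {"B": 0.8, "A": 0.2},
        ("B", "stay"): {"B": 1.0},
        ("B", "go"): {"A": 1.0},
    },
    rewards={("A", "go", "B"): 1.0},
)
toy.validate()
print(toy.expected_reward("A", "go"))
```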

Markov Property

  • It says that "If the agent is in the current state s1, performs an action a1, and moves to the state s2, then the state transition from s1 to s2 depends only on the current state and the chosen action; future actions and states do not depend on past actions, rewards, or states."
  • In other words, as per the Markov property, the current state transition does not depend on any past action or state. Hence, an MDP is an RL problem that satisfies the Markov property. For example, in a game of chess, the players need only focus on the current position and do not need to remember past moves or positions.

Finite MDP

  • A finite MDP is one with finitely many states, rewards, and actions. In this tutorial, we consider only finite MDPs.

Markov Process

  • A Markov process is a memoryless process with a sequence of random states S1, S2, ..., St that satisfies the Markov property. A Markov process is also known as a Markov chain, which is a tuple (S, P) of a state set S and a transition function P. These two components (S and P) define the dynamics of the system.
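A Markov chain (S, P) can be simulated in a few lines of Python. The two weather states and their transition probabilities below are hypothetical values used only to illustrate the idea: each new state is sampled from the row of P belonging to the current state, with no reference to earlier states.

```python
import random

# Hypothetical two-state chain (S, P); the numbers are illustrative only.
STATES = ["sunny", "rainy"]
P = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def sample_chain(start, length, rng):
    """Sample S1, S2, ..., St; each step looks only at the current state."""
    state, sequence = start, [start]
    for _ in range(length - 1):
        row = P[state]
        state = rng.choices(STATES, weights=[row[s] for s in STATES], k=1)[0]
        sequence.append(state)
    return sequence


print(sample_chain("sunny", 10, random.Random(0)))
```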

Reinforcement Learning Algorithms

  • Reinforcement learning algorithms are mainly used in AI and gaming applications. The most commonly used algorithms are:



  • Q-learning is an off-policy RL algorithm used for temporal difference learning. Temporal difference learning methods work by comparing temporally successive predictions.
  • It learns the value function Q(s, a), which indicates how good it is to take action "a" in a particular state "s".
  • The flowchart below explains the working of Q-learning:
 Reinforcement Learning Algorithms

State Action Reward State action (SARSA)

  • SARSA stands for State Action Reward State Action, and it is an on-policy temporal difference learning method. An on-policy control method selects the action for each state while learning, using a specific policy.
  • The goal of SARSA is to calculate Qπ(s, a) for the currently selected policy π and all (s, a) pairs.
  • The main difference between Q-learning and SARSA is that, unlike Q-learning, SARSA does not need the maximum Q-value of the next state to update the Q-value in the table.
  • In SARSA, the new action and reward are selected using the same policy that determined the original action.
  • SARSA is so named because it uses the quintuple (s, a, r, s', a'), where:
    • s: original state
    • a: original action
    • r: reward observed while following the states
    • s' and a': new state-action pair.
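The difference between the two update rules can be seen side by side in a short Python sketch. The function names, the dictionary-based Q-table, and the learning rate `alpha` are assumptions made for illustration (the text above does not specify a learning rate); the update formulas themselves are the commonly used textbook forms.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy target: uses the best Q-value available in the next state."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy target: uses the action a' that the policy actually chose."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


# Hypothetical situation: in state "s2", going right is known to be good,
# but the policy happens to pick "left" as the next action.
actions = ["left", "right"]
q1 = {(s, a): 0.0 for s in ("s1", "s2") for a in actions}
q1[("s2", "right")] = 1.0
q2 = dict(q1)

q_learning_update(q1, "s1", "left", 0.0, "s2", actions)
sarsa_update(q2, "s1", "left", 0.0, "s2", "left")

print(q1[("s1", "left")])  # moved towards 0.9 * 1.0, the best next value
print(q2[("s1", "left")])  # unchanged, since Q(s2, left) is 0
```

The quintuple (s, a, r, s', a') appears directly as the arguments of `sarsa_update`, whereas `q_learning_update` ignores which action is actually taken next.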

Deep Q Neural Network (DQN)

  • As the name suggests, DQN is Q-learning using neural networks.
  • For an environment with a very large state space, defining and updating a Q-table is a challenging and complex task.
  • To solve such a problem, we can use the DQN algorithm, where instead of defining a Q-table, a neural network approximates the Q-values for each action and state.

Now, we will explain Q-learning in more detail.

Q-Learning Explanation

  • Q-learning is a popular model-free reinforcement learning algorithm based on the Bellman equation.
  • The main objective of Q-learning is to learn a policy that tells the agent which action to take under which circumstances to maximize the reward.
  • It is an off-policy RL algorithm that attempts to find the best action to take in the current state.
  • The goal of the agent in Q-learning is to maximize the value of Q.
  • The Q-values can be derived from the Bellman equation.
 Reinforcement Learning q Learning Explanation

  • In the equation, we have various components, including the reward, the discount factor (γ), the transition probability, and the end state s'. However, no Q-value appears yet, so first consider the image below:
 Reinforcement Learning q Learning Explanation

  • In the above image, we can see an agent that has three value options: V(s1), V(s2), and V(s3). As this is an MDP, the agent only cares about the current state and the future state. The agent can go in any direction (up, left, or right), so it needs to decide where to go for the optimal path. Here the agent moves on a probability basis and changes state. If we want exact moves instead, we need to make some changes in terms of the Q-value. Consider the image below:
 Reinforcement Learning q Learning Explanation

  • Q represents the quality of the actions at each state. So instead of using a value at each state, we use a pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the others, and the agent takes its next move according to the best Q-value. The Bellman equation can be used to derive the Q-value.
  • For performing any action, the agent gets a reward R(s, a), and it also ends up in a certain state, so the Q-value equation will be:
 Reinforcement Learning q Learning Explanation
  • Hence, we can say that V(s) = max_a [Q(s, a)]
 Reinforcement Learning q Learning Explanation

  • The above formula is used to estimate the Q-values in Q-learning.
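The equation images from the original page are not reproduced in this text. For reference, the relations this passage describes are usually written in the following standard textbook form, where P(s, a, s') denotes the probability of reaching state s' from state s under action a. The notation is an assumption and may differ from the missing figures.

```latex
\begin{aligned}
V(s)    &= \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s, a, s')\, V(s') \Big] \\
Q(s, a) &= R(s, a) + \gamma \sum_{s'} P(s, a, s') \max_{a'} Q(s', a') \\
V(s)    &= \max_{a} Q(s, a)
\end{aligned}
```

When every action leads to a single next state with probability 1, the first line reduces to V(s) = max_a [R(s, a) + γ V(s')], the form used in the maze example.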

What is 'Q' in Q-learning ?

  • The Q in Q-learning stands for quality, which means it specifies the quality of an action taken by the agent.


  • A Q-table, or matrix, is created while performing Q-learning.
  • The table is indexed by state-action pairs, i.e., [s, a], and its values are initialized to zero.
  • After each action, the table is updated, and the Q-values are stored in the table.
  • The RL agent uses this Q-table as a reference table to select the best action based on the Q-values.
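The Q-table procedure described above can be sketched as a small tabular Q-learning program in Python. The corridor environment, the learning rate `ALPHA`, the exploration rate `EPSILON`, the episode count, and the epsilon-greedy action selection are all illustrative assumptions rather than values given in the text.

```python
import random

# Hypothetical corridor: states 0..4 with the goal at state 4.
N_STATES, GOAL = 5, 4
ACTIONS = [0, 1]  # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1


def step(state, action):
    """Return (next_state, reward, done) for one move in the corridor."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    # Q-table indexed by [state][action], initialized to zero.
    q_table = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: usually take the best known action, sometimes
            # explore; ties between equal Q-values are broken at random.
            tied = q_table[state][0] == q_table[state][1]
            if tied or rng.random() < EPSILON:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q_table[state][a])
            next_state, reward, done = step(state, action)
            # Update the table entry after each action.
            best_next = 0.0 if done else max(q_table[next_state])
            target = reward + GAMMA * best_next
            q_table[state][action] += ALPHA * (target - q_table[state][action])
            state = next_state
    return q_table


q_table = train()
# Read the best action for each non-goal state off the table.
greedy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy)
```

After training, the greedy action read from the table is 1 (move right) in every non-goal state, which is the shortest route to the goal in this corridor.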

Difference between Reinforcement Learning and Supervised Learning

  • Reinforcement Learning and Supervised Learning are both part of machine learning, but the two kinds of learning are quite different from each other. RL agents interact with the environment, explore it, take actions, and get rewarded, whereas supervised learning algorithms learn from a labeled dataset and, on the basis of that training, predict the output.

Reinforcement Learning Vs Supervised Learning

Reinforcement Learning | Supervised Learning
RL works by interacting with the environment. | Supervised learning works on an existing dataset.
RL helps to take decisions sequentially. | In supervised learning, a decision is made when an input is given.
No labeled dataset is present. | A labeled dataset is present.
No previous training is provided to the learning agent. | Training is provided to the algorithm so that it can predict the output.
The RL algorithm works the way the human brain works when making decisions. | Supervised learning works the way a human learns under the supervision of a guide.

Reinforcement Learning Application



  • RL can be used for adaptive control, such as factory processes and admission control in telecommunication; piloting a helicopter is another example.

Game Playing

  • RL can be used in game playing, such as tic-tac-toe, chess, etc.


  • RL is used in Robot navigation, Robo-soccer, walking, juggling, etc.


  • In various automobile manufacturing companies, robots use deep reinforcement learning to pick goods and place them in containers.

Finance Sector

  • RL is currently used in the finance sector for evaluating trading strategies.


  • RL can be used for optimizing chemical reactions.


  • RL is now used for business strategy planning.


  • From the above discussion, we can say that Reinforcement Learning is one of the most interesting and useful parts of machine learning. In RL, the agent explores the environment without any human intervention. It is one of the main learning approaches used in AI. But there are some cases where it should not be used: for example, if you have enough data to solve the problem, other ML algorithms can be used more efficiently. The main issue with RL algorithms is that some factors, such as delayed feedback, may affect the speed of learning.
