
Reinforcement Learning Cheat Sheet

Agent-Environment Interface

The Agent at each step t receives a representation of the environment's state, S_t ∈ S, and selects an action A_t ∈ A(s). Then, as a consequence of its action, the agent receives a reward, R_{t+1} ∈ R ⊂ ℝ.

Policy

A policy is a mapping from a state to an action:

    π_t(a|s)    (1)

That is, the probability of selecting an action A_t = a if S_t = s.

Reward

The total reward is expressed as:

    G_t = Σ_{k=0}^{H} γ^k r_{t+k+1}    (2)

Where γ is the discount factor and H is the horizon, which can be infinite.
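As a concrete illustration of equation (2), here is a minimal Python sketch that computes the discounted return from a finite list of rewards; the names discounted_return, rewards and gamma are assumptions of this sketch, not part of the cheat sheet.

def discounted_return(rewards, gamma):
    # G_t = sum_{k=0}^{H} gamma^k * r_{t+k+1}, with H = len(rewards) - 1
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example: three rewards of 1 with gamma = 0.9 -> 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], 0.9))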
Markov Decision Process

A Markov Decision Process, MDP, is a 5-tuple (S, A, P, R, γ) where:

    finite set of states: s ∈ S
    finite set of actions: a ∈ A
    state transition probabilities: p(s'|s, a) = Pr{S_{t+1} = s' | S_t = s, A_t = a}
    expected reward for state-action-next-state: r(s', s, a) = E[R_{t+1} | S_{t+1} = s', S_t = s, A_t = a]    (3)
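To make the planning algorithms below concrete, the following Python sketch defines a toy tabular MDP that the later sketches reuse; the dictionary layout P[s][a] = [(probability, next_state, reward), ...] for p(s', r|s, a) is an assumption of this sketch, not notation from the cheat sheet.

# A toy 2-state MDP: states 0 and 1, actions 0 ("stay") and 1 ("move").
# P[s][a] lists (probability, next_state, reward) triples, i.e. the outcomes of p(s', r | s, a).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 0.5)]},
}
states = list(P.keys())
actions = [0, 1]
gamma = 0.9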
Value Function

The value function describes how good it is to be in a specific state s under a certain policy π. For an MDP:

    v_π(s) = E[G_t | S_t = s]    (4)

Informally, it is the expected return (expected cumulative discounted reward) when starting from s and following π.

Optimal

    v_*(s) = max_π v_π(s)    (5)

Action-Value (Q) Function

We can also denote the expected return for state-action pairs:

    q_π(s, a) = E_π[G_t | S_t = s, A_t = a]    (6)

Optimal

The optimal action-value function:

    q_*(s, a) = max_π q_π(s, a)    (7)

Clearly, using this new notation we can redefine v_*, equation 5, using q_*(s, a), equation 7:

    v_*(s) = max_{a ∈ A(s)} q_{π_*}(s, a)    (8)

Intuitively, the above equation expresses the fact that the value of a state under the optimal policy must be equal to the expected return from the best action from that state.
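A one-line Python sketch of equation (8), recovering the optimal state value from an (assumed already computed) optimal action-value table; q_star and its (state, action)-keyed layout are placeholders of this sketch.

def v_star_from_q(q_star, s, actions):
    # v_*(s) = max_a q_*(s, a)
    return max(q_star[(s, a)] for a in actions)

# e.g. q_star = {(0, 0): 0.0, (0, 1): 1.2}; v_star_from_q(q_star, 0, [0, 1]) -> 1.2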
Bellman Equation

An important recursive property emerges for both the Value (4) and Q (6) functions if we expand them.

Value Function

    v_π(s) = E_π[G_t | S_t = s]
           = E_π[Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s]
           = E_π[R_{t+1} + γ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_t = s]
           = Σ_a π(a|s) Σ_{s'} Σ_r p(s', r|s, a) [r + γ E_π[Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_{t+1} = s']]
           = Σ_a π(a|s) Σ_{s', r} p(s', r|s, a) [r + γ v_π(s')]    (9)

(The outer sums run over all actions and over all possible next states s' and rewards r; the inner expectation is the expected return from S_{t+1}, i.e. v_π(s').)

Similarly, we can do the same for the Q function:

    q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
              = E_π[Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a]
              = E_π[R_{t+1} + γ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_t = s, A_t = a]
              = Σ_{s', r} p(s', r|s, a) [r + γ E_π[Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_{t+1} = s']]
              = Σ_{s', r} p(s', r|s, a) [r + γ v_π(s')]    (10)
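A minimal Python sketch of the Bellman expectation backup in equation (9), reusing P, actions and gamma from the toy MDP above and assuming a stochastic policy stored as pi[s][a] = π(a|s).

def bellman_backup(V, pi, s):
    # One application of eq. (9): sum over actions, then over (s', r) outcomes.
    return sum(
        pi[s][a] * sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a])
        for a in actions
    )

# e.g. V = {0: 0.0, 1: 0.0}; pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
# bellman_backup(V, pi, 0) gives the backed-up value of state 0 under pi.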
Dynamic Programming

Taking advantage of the subproblem structure of the V and Q functions, we can find the optimal policy by just planning.

Policy Iteration

We can now find the optimal policy.

    1. Initialisation
       V(s) ∈ ℝ (e.g. V(s) = 0) and π(s) ∈ A for all s ∈ S
    2. Policy Evaluation
       Δ ← ∞
       while Δ ≥ θ (a small positive number) do
           Δ ← 0
           foreach s ∈ S do
               v ← V(s)
               V(s) ← Σ_a π(a|s) Σ_{s', r} p(s', r|s, a) [r + γ V(s')]
               Δ ← max(Δ, |v − V(s)|)
           end
       end
    3. Policy Improvement
       policy-stable ← true
       foreach s ∈ S do
           old-action ← π(s)
           π(s) ← argmax_a Σ_{s', r} p(s', r|s, a) [r + γ V(s')]
           if old-action ≠ π(s) then policy-stable ← false
       end
       if policy-stable, return V ≈ v_* and π ≈ π_*, else go to 2

Algorithm 1: Policy Iteration
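A compact Python sketch of Algorithm 1 on the toy MDP defined earlier (P, states, actions, gamma); the deterministic policy representation and the threshold theta are assumptions of the sketch.

def policy_iteration(theta=1e-6):
    V = {s: 0.0 for s in states}
    pi = {s: actions[0] for s in states}      # deterministic policy: state -> action

    def q(s, a):
        # One-step lookahead: sum_{s', r} p(s', r | s, a) * (r + gamma * V(s'))
        return sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a])

    while True:
        # 2. Policy evaluation (for a deterministic policy, pi(a|s) is 1 for a = pi[s])
        delta = float("inf")
        while delta >= theta:
            delta = 0.0
            for s in states:
                v = V[s]
                V[s] = q(s, pi[s])
                delta = max(delta, abs(v - V[s]))
        # 3. Policy improvement
        policy_stable = True
        for s in states:
            old_action = pi[s]
            pi[s] = max(actions, key=lambda a: q(s, a))
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return V, pi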

Value Iteration

We can avoid waiting until V(s) has converged and instead do the policy improvement and truncated policy evaluation steps in one operation:

    Initialise V(s) ∈ ℝ, e.g. V(s) = 0
    Δ ← ∞
    while Δ ≥ θ (a small positive number) do
        Δ ← 0
        foreach s ∈ S do
            v ← V(s)
            V(s) ← max_a Σ_{s', r} p(s', r|s, a) [r + γ V(s')]
            Δ ← max(Δ, |v − V(s)|)
        end
    end

Algorithm 2: Value Iteration
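A minimal Python sketch of Algorithm 2 on the same toy MDP (P, states, actions, gamma from above); the final greedy-policy extraction is an addition of the sketch, not part of Algorithm 2.

def value_iteration(theta=1e-6):
    V = {s: 0.0 for s in states}

    def q(s, a):
        # sum_{s', r} p(s', r | s, a) * (r + gamma * V(s'))
        return sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a])

    delta = float("inf")
    while delta >= theta:
        delta = 0.0
        for s in states:
            v = V[s]
            V[s] = max(q(s, a) for a in actions)      # greedy backup
            delta = max(delta, abs(v - V[s]))
    pi = {s: max(actions, key=lambda a: q(s, a)) for s in states}
    return V, pi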
Monte Carlo Methods

Monte Carlo (MC) is a model-free method: it does not require complete knowledge of the environment. It is based on averaging sample returns for each state-action pair. The following algorithm gives the basic implementation.

    Initialise, for all s ∈ S, a ∈ A(s):
        Q(s, a) ← arbitrary
        π(s) ← arbitrary
        Returns(s, a) ← empty list
    while forever do
        Choose S_0 ∈ S and A_0 ∈ A(S_0) such that all pairs have probability > 0
        Generate an episode starting at S_0, A_0 following π
        foreach pair s, a appearing in the episode do
            G ← return following the first occurrence of s, a
            Append G to Returns(s, a)
            Q(s, a) ← average(Returns(s, a))
        end
        foreach s in the episode do
            π(s) ← argmax_a Q(s, a)
        end
    end

Algorithm 3: Monte Carlo first-visit

For non-stationary problems, the Monte Carlo estimate for, e.g., V is:

    V(S_t) ← V(S_t) + α [G_t − V(S_t)]    (11)

Where α is the learning rate, which controls how much we want to forget about past experiences.
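A Python sketch of Algorithm 3; the episode generator generate_episode(pi), assumed to return a list of (state, action, reward) triples for one episode (with exploring starts), and the reuse of gamma and actions from the toy MDP above are assumptions of this sketch.

from collections import defaultdict

def mc_first_visit_control(generate_episode, num_episodes):
    Q = defaultdict(float)
    returns = defaultdict(list)
    pi = {}
    for _ in range(num_episodes):
        episode = generate_episode(pi)        # [(s, a, r), ...]
        g = 0.0
        # Walk the episode backwards, accumulating the return G_t
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = r + gamma * g
            if all((s, a) != (s2, a2) for s2, a2, _ in episode[:t]):   # first visit of (s, a)
                returns[(s, a)].append(g)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                pi[s] = max(actions, key=lambda a2: Q[(s, a2)])
    return Q, pi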
Sarsa

Sarsa (state-action-reward-state-action) is an on-policy TD control method. The update rule is:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]

    Initialise Q(s, a) arbitrarily and Q(terminal-state, ·) = 0
    foreach episode ∈ episodes do
        Choose a from s using a policy derived from Q (e.g., ε-greedy)
        while s is not terminal do
            Take action a, observe r, s'
            Choose a' from s' using a policy derived from Q (e.g., ε-greedy)
            Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]
            s ← s'
            a ← a'
        end
    end

Algorithm 4: Sarsa
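A Python sketch of Algorithm 4, reusing actions and gamma from the toy MDP sketch; the environment interface env_step(s, a) -> (next_state, reward, done), the start_state argument and the ε-greedy helper are assumptions of this sketch (the cheat sheet only says "a policy derived from Q").

import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    # Behaviour policy shared by the Sarsa and Q-learning sketches.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env_step, start_state, num_episodes, alpha=0.1, eps=0.1):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = start_state
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s2, r, done = env_step(s, a)
            a2 = epsilon_greedy(Q, s2, actions, eps)
            # On-policy update: bootstrap from the action actually selected next
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] * (not done) - Q[(s, a)])
            s, a = s2, a2
    return Q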
n-step Sarsa

Define the n-step Q-return:

    q_t^(n) = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n Q(S_{t+n})

n-step Sarsa updates Q(s, a) towards the n-step Q-return:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [q_t^(n) − Q(s_t, a_t)]

Forward View Sarsa(λ)

    q_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} q_t^(n)

Forward-view Sarsa(λ):

    Q(s_t, a_t) ← Q(s_t, a_t) + α [q_t^λ − Q(s_t, a_t)]
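A Python sketch of the two returns above, computed from one recorded episode and reusing gamma from the toy MDP sketch; the list layout (rewards[k] holds R_{k+1}, values[k] holds the bootstrap value of the state-action pair at step k) and the truncation of the infinite sum at the end of the episode are assumptions of this sketch.

def n_step_return(rewards, values, t, n):
    # q_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n} + gamma^n * Q(S_{t+n})
    steps = min(n, len(rewards) - t)
    g = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if t + n < len(values):
        g += gamma ** n * values[t + n]
    return g

def lambda_return(rewards, values, t, lam=0.9):
    # q_t^lambda = (1 - lam) * sum_{n>=1} lam^(n-1) * q_t^(n), truncated at episode end
    N = len(rewards) - t
    return (1 - lam) * sum(lam ** (n - 1) * n_step_return(rewards, values, t, n)
                           for n in range(1, N + 1))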
Temporal Difference - Q Learning

Temporal Difference (TD) methods learn directly from raw experience without a model of the environment's dynamics. TD substitutes the expected discounted return G_t from the episode with an estimate:

    V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]    (12)

The following algorithm gives a generic implementation.

    Initialise Q(s, a) arbitrarily and Q(terminal-state, ·) = 0
    foreach episode ∈ episodes do
        while s is not terminal do
            Choose a from s using a policy derived from Q (e.g., ε-greedy)
            Take action a, observe r, s'
            Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
            s ← s'
        end
    end

Algorithm 5: Q Learning
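A Python sketch of Algorithm 5, reusing env_step, epsilon_greedy, actions and gamma from the Sarsa sketch above; the off-policy difference from Sarsa is that the bootstrap uses the max over actions rather than the action actually taken next.

from collections import defaultdict

def q_learning(env_step, start_state, num_episodes, alpha=0.1, eps=0.1):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = start_state
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, eps)
            s2, r, done = env_step(s, a)
            target = r + gamma * max(Q[(s2, a2)] for a2 in actions) * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q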
Deep Q Learning

Created by DeepMind, Deep Q Learning (DQL) substitutes the Q function with a deep neural network called a Q-network. It also keeps track of some observations in a replay memory, in order to use them to train the network.

    L_i(θ_i) = E_{(s,a,r,s') ∼ U(D)} [(r + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θ_i))^2]    (13)

Here r + γ max_{a'} Q(s', a'; θ_{i−1}) is the target and Q(s, a; θ_i) is the prediction; θ are the weights of the network and U(D) is the experience replay history.

    Initialise replay memory D with capacity N
    Initialise Q(s, a) arbitrarily
    foreach episode ∈ episodes do
        while s is not terminal do
            With probability ε select a random action a ∈ A(s),
            otherwise select a = argmax_a Q(s, a; θ)
            Take action a, observe r, s'
            Store transition (s, a, r, s') in D
            Sample a random minibatch of transitions (s_j, a_j, r_j, s'_j) from D
            Set y_j ← r_j                                for terminal s'_j
                y_j ← r_j + γ max_{a'} Q(s'_j, a'; θ)    for non-terminal s'_j
            Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))^2
            s ← s'
        end
    end

Algorithm 6: Deep Q Learning
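A minimal PyTorch sketch of the loss in equation (13) and the gradient step of Algorithm 6; PyTorch itself, the layer sizes, and the separate target network standing in for θ_{i−1} are assumptions of this sketch, not details given in the cheat sheet.

import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q(s, .; theta_i)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(s, .; theta_{i-1})
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_training_step(s, a, r, s2, done):
    # Minibatch sampled uniformly from the replay memory D:
    # s, s2: float tensors [B, 4]; a: long tensor [B]; r, done: float tensors [B].
    prediction = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)               # Q(s_j, a_j; theta)
    with torch.no_grad():
        target = r + gamma * target_net(s2).max(dim=1).values * (1 - done)  # y_j
    loss = nn.functional.mse_loss(prediction, target)                        # (y_j - Q(s_j, a_j; theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()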

Double Deep Q Learning

Copyright © 2018 Francesco Saverio Zuppichini
https://github.com/FrancescoSaverioZuppichini/Reinforcement-Learning-Cheat-Sheet