Agent-Environment Interface

The agent at each step t receives a representation of the environment's state, S_t \in \mathcal{S}, and selects an action A_t \in \mathcal{A}(s). Then, as a consequence of its action, the agent receives a reward, R_{t+1} \in \mathcal{R} \subset \mathbb{R}.

Policy

A policy is a mapping from a state to an action:

\pi_t(a|s)    (1)

That is, the probability of selecting an action A_t = a if S_t = s.

Reward

The total reward is expressed as:

G_t = \sum_{k=0}^{H} \gamma^k r_{t+k+1}    (2)

where \gamma is the discount factor and H is the horizon, which can be infinite.
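As a quick illustration of equation 2, the snippet below computes the discounted return of a finite reward sequence; the reward list and discount value are made-up examples, not part of the original text.

```python
# Discounted return G_t = sum_k gamma^k * r_{t+k+1} (equation 2),
# computed over a finite horizon H = len(rewards) - 1.
def discounted_return(rewards, gamma):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example with made-up rewards and gamma = 0.9:
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0 + 0.81*2 = 2.62
```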
Markov Decision Process

A Markov Decision Process, MDP, is a 5-tuple (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma) where:

- finite set of states: s \in \mathcal{S}
- finite set of actions: a \in \mathcal{A}
- state transition probabilities: p(s'|s, a) = \Pr\{S_{t+1} = s' \mid S_t = s, A_t = a\}
- expected reward for state-action-next-state: r(s', s, a) = \mathbb{E}[R_{t+1} \mid S_{t+1} = s', S_t = s, A_t = a]    (3)
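A minimal sketch of how such a finite MDP could be represented in code; the tabular dictionary layout and the tiny two-state example are illustrative assumptions, not part of the original text.

```python
# A finite MDP stored as a table:
# P[s][a] is a list of (probability, next_state, reward) triples,
# i.e. the dynamics p(s', r | s, a) from equation 3.
states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9

P = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "move": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "stay": [(1.0, "s1", 0.0)],
        "move": [(1.0, "s0", 0.0)],
    },
}
```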
Value Function

The value function describes how good it is to be in a specific state s under a certain policy \pi. For an MDP:

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]    (4)

Informally, it is the expected return (expected cumulative discounted reward) when starting from s and following \pi.

Optimal

v_*(s) = \max_\pi v_\pi(s)    (5)

Action-Value (Q) Function

We can also denote the expected return for state-action pairs:

q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]    (6)

Optimal

The optimal action-value function:

q_*(s, a) = \max_\pi q_\pi(s, a)    (7)

Clearly, using this new notation we can redefine v_*, equation 5, using q_*(s, a), equation 7:

v_*(s) = \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s, a)    (8)

Intuitively, the above equation expresses the fact that the value of a state under the optimal policy must equal the expected return for the best action from that state.
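The relation in equation 8 is just a maximisation over actions; the fragment below, which assumes an already-computed Q table stored as a dict of dicts (an illustrative assumption), shows the corresponding greedy action choice.

```python
# Greedy action with respect to a Q table: argmax_a q(s, a) (cf. equation 8).
# Q is assumed to be a dict of dicts: Q[s][a] -> float.
def greedy_action(Q, s, actions):
    return max(actions, key=lambda a: Q[s][a])
```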
Bellman Equation

An important recursive property emerges for both the value (4) and Q (6) functions if we expand them.

Value Function

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]
         = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\Big]
         = \mathbb{E}_\pi\Big[R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \mid S_t = s\Big]
         = \underbrace{\sum_a \pi(a|s) \sum_{s'} \sum_r p(s', r \mid s, a)}_{\text{sum of all probabilities } \forall \text{ possible } r} \Big[r + \gamma \underbrace{\mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \mid S_{t+1} = s'\Big]}_{\text{expected return from } s_{t+1}}\Big]
         = \sum_a \pi(a|s) \sum_{s', r} p(s', r \mid s, a) \big[r + \gamma v_\pi(s')\big]    (9)

Action-Value (Q) Function

q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]
            = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\Big]
            = \mathbb{E}_\pi\Big[R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \mid S_t = s, A_t = a\Big]
            = \sum_{s', r} p(s', r \mid s, a) \Big[r + \gamma \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \mid S_{t+1} = s'\Big]\Big]
            = \sum_{s', r} p(s', r \mid s, a) \big[r + \gamma v_\pi(s')\big]    (10)
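A sketch of how the Bellman expectation equation (9) turns into a concrete update, reusing the tabular P dictionary assumed above; here pi is taken to be a dict mapping each state to a dict of action probabilities, which is an illustrative assumption.

```python
# One Bellman backup for v_pi (equation 9):
# V(s) <- sum_a pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma * V(s')]
def bellman_backup(s, V, pi, P, gamma):
    total = 0.0
    for a, prob_a in pi[s].items():
        for prob_sr, s_next, r in P[s][a]:
            total += prob_a * prob_sr * (r + gamma * V[s_next])
    return total
```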
Dynamic Programming

Taking advantage of the subproblem structure of the V and Q functions, we can find the optimal policy by just planning.

Policy Iteration

We can now find the optimal policy.

1. Initialisation
   V(s) \in \mathbb{R} (e.g. V(s) = 0) and \pi(s) \in \mathcal{A} for all s \in \mathcal{S}

2. Policy Evaluation
   repeat
       \Delta \leftarrow 0
       foreach s \in \mathcal{S} do
           v \leftarrow V(s)
           V(s) \leftarrow \sum_a \pi(a|s) \sum_{s', r} p(s', r \mid s, a) \big[r + \gamma V(s')\big]
           \Delta \leftarrow \max(\Delta, |v - V(s)|)
       end
   until \Delta < \theta (a small positive number)

3. Policy Improvement
   policy-stable \leftarrow true
   foreach s \in \mathcal{S} do
       old-action \leftarrow \pi(s)
       \pi(s) \leftarrow \arg\max_a \sum_{s', r} p(s', r \mid s, a) \big[r + \gamma V(s')\big]
       if old-action \neq \pi(s) then policy-stable \leftarrow false
   end
   if policy-stable, return V \approx v_* and \pi \approx \pi_*; else go to 2

Algorithm 1: Policy Iteration
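A compact Python sketch of Algorithm 1, assuming the tabular MDP dictionaries (states, actions, P, gamma) introduced above; theta and the deterministic-policy representation are illustrative choices, not part of the original text.

```python
# Policy iteration (Algorithm 1) over the tabular MDP defined above.
def q_value(s, a, V, P, gamma):
    # One-step lookahead: sum_{s', r} p(s', r | s, a) [r + gamma * V(s')]
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def policy_iteration(states, actions, P, gamma, theta=1e-6):
    # 1. Initialisation: V(s) = 0 and an arbitrary deterministic policy.
    V = {s: 0.0 for s in states}
    pi = {s: actions[0] for s in states}
    while True:
        # 2. Policy Evaluation: sweep until the value change is below theta.
        while True:
            delta = 0.0
            for s in states:
                v = V[s]
                V[s] = q_value(s, pi[s], V, P, gamma)
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # 3. Policy Improvement: act greedily with respect to V.
        policy_stable = True
        for s in states:
            old_action = pi[s]
            pi[s] = max(actions, key=lambda a: q_value(s, a, V, P, gamma))
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return V, pi  # V approximates v*, pi approximates pi*

# Example usage with the toy MDP sketched earlier:
# V, pi = policy_iteration(states, actions, P, gamma)
```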