
ML @ ICL

Reinforcement learning
In a nutshell

1
Supervised learning

Given:
● objects and answers: $(x, y)$
● algorithm family: $a_\theta(x) \to y$
● loss function: $L(y, a_\theta(x))$
Find:

$\theta' \leftarrow \arg\min_\theta L(y, a_\theta(x))$


2
Supervised learning

Given:
● objects and answers: $(x, y)$ (e.g. [banner, page] features and CTR)
● algorithm family: $a_\theta(x) \to y$ (linear model / tree / NN)
● loss function: $L(y, a_\theta(x))$ (MSE, crossentropy)
Find:

$\theta' \leftarrow \arg\min_\theta L(y, a_\theta(x))$


3
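(Not on the slides: a minimal sketch of this supervised setup in scikit-learn, with made-up features and labels; LogisticRegression stands in for the algorithm family $a_\theta$ and .fit performs the argmin over $\theta$.)

# Hypothetical example: predicting clicks from (banner, page) features.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(1000, 8)                    # made-up object features x
y = (np.random.rand(1000) < 0.1).astype(int)   # made-up click labels y

model = LogisticRegression()                   # algorithm family a_theta(x)
model.fit(X, y)                                # ~ argmin_theta of the log-loss L(y, a_theta(x))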
Supervised learning

Great... except if we have no reference answers

4
Online Ads
Great... except if we have no reference answers

We have:
● YouTube at your disposal
● Live data stream
(banner & video features, #clicked)
● (insert your favorite ML toolkit)

We want:
● Learn to pick relevant ads

Ideas?

5
Duct tape approach

Common idea:
● Initialize with naïve solution

● Get data by trial and error and error and error and error

● Learn (situation) → (optimal action)

● Repeat

6
Giant Death Robot (GDR)
Great... except if we have no reference answers

We have:
● Evil humanoid robot
● A lot of spare parts
to repair it :)

We want:
● Enslave humanity
● Learn to walk forward

Ideas?

7
Duct tape approach (again)

Common idea:
● Initialize with naïve solution

● Get data by trial and error and error and error and error

● Learn (situation) → (optimal action)

● Repeat

8
Duct tape approach

9
Problems

Problem 1:
● What exactly does the “optimal action” mean?

Extract as much money as you can right now
VS
Make the user happy so that they visit you again

10
Problems

Problem 2:
● If you always follow the “current optimal” strategy, you may never discover something better.
● If you show the same banner to 100% of users, you will never learn how other ads affect them.

Ideas?

11
Duct tape approach

12
Reinforcement learning

13
What-what learning?
Supervised learning:
● Learning to approximate reference answers
● Needs correct answers
● Model does not affect the input data

Reinforcement learning:
● Learning optimal strategy by trial and error
● Needs feedback on agent's own actions
● Agent can affect its own observations
What-what learning?
Unsupervised learning:
● Learning underlying data structure
● No feedback required
● Model does not affect the input data

Reinforcement learning:
● Learning optimal strategy by trial and error
● Needs feedback on agent's own actions
● Agent can affect its own observations
What is: bandit

[diagram: the agent receives an observation, picks an action, and gets feedback]

Examples:
– banner ads (RTB)
– recommendations
– medical treatment
16
What is: bandit

[diagram: the agent receives an observation, picks an action, and gets feedback]

Examples:
– banner ads (RTB)
– recommendations
– medical treatment

Q: what's the observation, action and feedback in the banner ads problem?
18
What is: bandit

[diagram: observation = user features, time of year, trends; action = show banner; feedback = click, money]

Examples:
– banner ads (RTB)
– recommendations
– medical treatment
19
What is: bandit

[diagram: observation = user features, time of year, trends; action = show banner; feedback = click, money]

Q: You're Yandex/Google/YouTube. There's a kind of banner that would have great click rates: the “clickbait”.

Is it a good idea to show clickbait?
20
What is: bandit

[diagram: the agent receives an observation, picks an action, and gets feedback]

Q: You're Yandex/Google/YouTube. There's a kind of banner that would have great click rates: the “clickbait”.

Is it a good idea to show clickbait?

A: No, no one will trust you after that!

21
What is: decision process

[diagram: Agent and Environment in a loop; the agent observes the environment and sends actions back]

22
What is: decision process

[diagram: Agent (we can make it do anything) and Environment (in the worst case we can't even see inside it), connected by observations and actions]

23
Reality check: web
● Cases:
  ● Pick ads to maximize profit
  ● Design landing page to maximize user retention
  ● Recommend movies to users
  ● Find pages relevant to queries
● Example:
  ● Observation – user features
  ● Action – show banner #i
  ● Feedback – did user click?

24
Reality check: dynamic systems

25
Reality check: dynamic systems

● Cases:
  ● Robots
  ● Self-driving vehicles
  ● Pilot assistant
  ● More robots!
● Example:
  ● Observation: sensor feed
  ● Action: voltage sent to motors
  ● Feedback: how far did it move forward before falling

26
Reality check: videogames

● Q: What are observations, actions and feedback?

27
Reality check: videogames

● Q: What are observations, actions and feedback?

28
Other use cases
● Personalized medical treatment

● Even more games (Go, chess, etc)

● Q: What are observations, actions and feedback?

29
Other use cases
● Conversation systems
– learning to make user happy
● Quantitative finance
– portfolio management
● Deep learning
– optimizing non-differentiable loss
– finding optimal architecture

30
The MDP formalism

Markov Decision Process


● Environment states: $s \in S$
● Agent actions: $a \in A$
● Rewards: $r \in \mathbb{R}$
● Dynamics: $P(s_{t+1} \mid s_t, a_t)$

31
The MDP formalism

[diagram: agent picks actions $a \in A$, environment is in states $s \in S$ and evolves according to $P(s_{t+1} \mid s_t, a_t)$]

Markov Decision Process

Markov assumption:

$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}) = P(s_{t+1} \mid s_t, a_t)$

32
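(Not on the slides: a minimal sketch of the agent-environment loop behind this formalism, assuming a Gym-style API; the environment name and the random policy are placeholders.)

import gymnasium as gym   # assumption: the gymnasium package; classic gym differs slightly

env = gym.make("CartPole-v1")                 # states s in S, actions a in A
s, _ = env.reset()
total_reward, done = 0.0, False
while not done:
    a = env.action_space.sample()                    # placeholder policy pi(a|s)
    s, r, terminated, truncated, _ = env.step(a)     # environment samples s' ~ P(s'|s,a) and reward r
    total_reward += r
    done = terminated or truncated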
Total reward

Total reward for a session:

$R = \sum_t r_t$

Agent's policy:

$\pi(a \mid s) = P(\text{take action } a \mid \text{in state } s)$

Problem: find the policy with the highest expected reward:

$\pi(a \mid s): \quad E_\pi[R] \to \max$

34
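(A sketch, not from the slides: $E_\pi[R]$ is typically estimated by Monte Carlo, i.e. by averaging the total reward over several sampled sessions. The play_session helper and the Gym-style env are assumptions.)

import numpy as np

def play_session(env, policy, t_max=1000):
    # Plays one session with `policy` and returns its total reward R = sum_t r_t.
    s, _ = env.reset()
    R = 0.0
    for _ in range(t_max):
        a = policy(s)                                    # a ~ pi(a|s)
        s, r, terminated, truncated, _ = env.step(a)
        R += r
        if terminated or truncated:
            break
    return R

# E_pi[R] is approximated by the mean return over, say, 100 sessions:
# print(np.mean([play_session(env, policy) for _ in range(100)]))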
Objective

The easy way:

$E_\pi[R]$ is the expected sum of rewards that an agent with policy $\pi$ earns per session.

The hard way:

$E_{s_0 \sim p(s_0)} \; E_{a_0 \sim \pi(a \mid s_0)} \; E_{s_1, r_0 \sim P(s', r \mid s, a)} \cdots E_{s_T, r_T \sim P(s', r \mid s_{T-1}, a_{T-1})} \left[ r_0 + r_1 + r_2 + \ldots + r_T \right]$

35
How do we solve it?

General idea:

Play a few sessions

Update your policy

Repeat

37
Crossentropy method
Initialize policy

Repeat:
– Sample N [e.g. 100] sessions
– Pick the M [e.g. 25] best sessions, called elite sessions
– Change the policy so that it prioritizes actions from the elite sessions
38
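(A compact sketch of this loop, not taken from the slides: the helpers generate_session and update_policy are hypothetical, sessions are assumed to be (states, actions, total_reward) tuples, and the 75th-percentile threshold is one way to keep roughly the 25 best of 100 sessions.)

import numpy as np

for iteration in range(50):                                         # "Repeat"
    sessions = [generate_session(env, policy) for _ in range(100)]  # sample N = 100 sessions
    rewards = np.array([r for (_, _, r) in sessions])
    threshold = np.percentile(rewards, 75)                          # ~ M = 25 elite sessions
    elite = [(s, a) for (s, a, r) in sessions if r >= threshold]
    elite_states  = np.concatenate([s for s, a in elite])
    elite_actions = np.concatenate([a for s, a in elite])
    policy = update_policy(policy, elite_states, elite_actions)     # make elite actions more likely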
Step-by-step view

[figures, slides 39-49: step-by-step animation of the crossentropy method iterations, with the elite sessions highlighted]
Tabular crossentropy method
● Policy is a matrix

$\pi(a \mid s) = A_{s, a}$

● Sample N games with that policy
● Get the M best sessions (elites)

$\text{Elite} = [(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots, (s_k, a_k)]$
50
Tabular crossentropy method
● Policy is a matrix

$\pi(a \mid s) = A_{s, a}$

● Sample N games with that policy
● Take the M best sessions (elites)
● Aggregate by states:

$\pi(a \mid s) = \dfrac{\sum_{s_t, a_t \in \text{Elite}} [s_t = s][a_t = a]}{\sum_{s_t, a_t \in \text{Elite}} [s_t = s]}$

51
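(A numpy sketch of this aggregation, assuming integer state and action indices; falling back to a uniform distribution for states never visited by the elites is my addition, not something the slide specifies.)

import numpy as np

def update_tabular_policy(elite_states, elite_actions, n_states, n_actions):
    # pi(a|s) = (# times a was taken in s among elites) / (# times s occurred among elites)
    counts = np.zeros((n_states, n_actions))
    for s, a in zip(elite_states, elite_actions):
        counts[s, a] += 1
    visited = counts.sum(axis=1, keepdims=True)
    return np.where(visited > 0, counts / np.maximum(visited, 1), 1.0 / n_actions)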
Tabular crossentropy method
● Policy is a matrix

$\pi(a \mid s) = A_{s, a}$

● Sample N games with that policy
● Take the M best sessions (elites)
● Aggregate by states:

$\pi(a \mid s) = \dfrac{\text{how often the agent took } a \text{ at } s}{\text{how often it was at } s} \quad \text{(in the M best games)}$

52
Grim reality

If your environment has infinite/large state space

53
Approximate crossentropy method
● Policy is approximated
– Neural network predicts $\pi_W(a \mid s)$ given $s$
– Linear model / Random Forest / ...

Can't set $\pi(a \mid s)$ explicitly.

All state-action pairs from the M best sessions:

$\text{Elite} = [(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots, (s_k, a_k)]$
54
Approximate crossentropy method
Neural network predicts $\pi_W(a \mid s)$ given $s$

All state-action pairs from the M best sessions:

$\text{Elite} = [(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots, (s_k, a_k)]$

Maximize the likelihood of actions taken in the “best” games:

$\pi = \arg\max_\pi \sum_{s_i, a_i \in \text{Elite}} \log \pi(a_i \mid s_i)$
55
Approximate crossentropy method
● Initialize NN weights: $W_0 \leftarrow \text{random}$
● Loop:
– Sample N sessions
– $\text{Elite} = [(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots, (s_k, a_k)]$
– $W_{i+1} = W_i + \alpha \nabla \left[ \sum_{s_i, a_i \in \text{Elite}} \log \pi_W(a_i \mid s_i) \right]$
56
Approximate crossentropy method
● Initialize NN weights: nn = MLPClassifier(...)
● Loop:
– Sample N sessions
– $\text{Elite} = [(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots, (s_k, a_k)]$
– nn.fit(elite_states, elite_actions)
57
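(An end-to-end sketch of this loop with scikit-learn, in the spirit of the slide but not verbatim from it: it assumes a Gym-style discrete-action environment, uses warm_start=True with max_iter=1 so repeated fit calls keep updating the same network, and does one dummy fit so predict_proba works before the first update.)

import numpy as np
import gymnasium as gym
from sklearn.neural_network import MLPClassifier

env = gym.make("CartPole-v1")
n_actions = env.action_space.n
obs_dim = env.observation_space.shape[0]

nn = MLPClassifier(hidden_layer_sizes=(32, 32), warm_start=True, max_iter=1)
nn.fit(np.zeros((n_actions, obs_dim)), np.arange(n_actions))   # dummy fit to register all classes

def generate_session(t_max=1000):
    states, actions, total_reward = [], [], 0.0
    s, _ = env.reset()
    for _ in range(t_max):
        p = nn.predict_proba(s.reshape(1, -1))[0]               # pi_W(a|s)
        a = np.random.choice(n_actions, p=p / p.sum())
        states.append(s); actions.append(a)
        s, r, terminated, truncated, _ = env.step(a)
        total_reward += r
        if terminated or truncated:
            break
    return states, actions, total_reward

for _ in range(50):
    sessions = [generate_session() for _ in range(100)]          # sample N sessions
    rewards = [r for (_, _, r) in sessions]
    threshold = np.percentile(rewards, 75)                       # keep the elite sessions
    elite_states  = np.concatenate([s for s, a, r in sessions if r >= threshold])
    elite_actions = np.concatenate([a for s, a, r in sessions if r >= threshold])
    # quirk of this warm_start approach: assumes every action appears among the elites
    nn.fit(elite_states, elite_actions)                          # maximize likelihood of elite actions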
Continuous action spaces
● Continuous state space
● Model $\pi_W(a \mid s) = N(\mu(s), \sigma^2)$
– $\mu(s)$ is a neural network output
– $\sigma$ is a parameter or yet another network output
● Loop:
– Sample N sessions
– elite = take the M best sessions and concatenate them
– $W_{i+1} = W_i + \alpha \nabla \left[ \sum_{s_i, a_i \in \text{Elite}} \log \pi_W(a_i \mid s_i) \right]$

What changed?

58
Approximate crossentropy method
● Initialize NN weights: nn = MLPRegressor(...)
● Loop:
– Sample N sessions
– $\text{Elite} = [(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots, (s_k, a_k)]$
– nn.fit(elite_states, elite_actions)

Almost nothing!

59
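(For the continuous case, a minimal sketch of the pieces that change, assuming $\mu(s)$ is an MLPRegressor and $\sigma$ is a fixed scalar; fitting the regressor on elite (s, a) pairs with squared error is equivalent to maximizing the Gaussian log-likelihood when $\sigma$ is held fixed.)

import numpy as np
from sklearn.neural_network import MLPRegressor

mu_net = MLPRegressor(hidden_layer_sizes=(32, 32))   # predicts mu(s)
sigma = 0.5                                          # fixed std; could be learned instead

def sample_action(s):
    # a ~ N(mu(s), sigma^2); mu_net must have been fit at least once before this is called
    mu = mu_net.predict(s.reshape(1, -1))[0]
    return np.random.normal(mu, sigma)

# after selecting elites exactly as in the discrete case:
# mu_net.fit(elite_states, elite_actions)            # regression on elite (s, a) pairs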
Tricks
● Remember sessions from the 3-5 past iterations (see the sketch after this list)
– Threshold and use all of them when training
– May converge more slowly if the env is easy to solve
● Regularize with entropy
– to prevent premature convergence
● Parallelize sampling
● Use RNNs if the environment is partially observable (later)

60
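(A sketch of the first trick, not from the slides: keep a small pool of recent sessions and threshold over the whole pool when picking elites; the pool size and percentile below are arbitrary.)

from collections import deque
import numpy as np

pool = deque(maxlen=5 * 100)          # roughly the sessions of the last 5 iterations

def select_elites(new_sessions, percentile=75):
    # each session is assumed to be a (states, actions, total_reward) tuple
    pool.extend(new_sessions)
    rewards = np.array([r for (_, _, r) in pool])
    threshold = np.percentile(rewards, percentile)
    return [(s, a) for (s, a, r) in pool if r >= threshold]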
