
ML @ ICL

Reinforcement learning
In a nutshell

1
Supervised learning

Given:
● objects and answers: $(x, y)$
● algorithm family: $a_\theta(x) \to y$
● loss function: $L(y, a_\theta(x))$
Find:

$\theta' \leftarrow \arg\min_\theta L(y, a_\theta(x))$


2
Supervised learning

Given:
● objects and answers: $(x, y)$ (e.g. [banner, page] features and CTR)
● algorithm family: $a_\theta(x) \to y$ (linear model / tree / NN)
● loss function: $L(y, a_\theta(x))$ (MSE, crossentropy)
Find:

$\theta' \leftarrow \arg\min_\theta L(y, a_\theta(x))$


3
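(Not on the slides: a minimal sketch of this supervised setup in scikit-learn, with made-up features and labels; LogisticRegression stands in for the algorithm family $a_\theta$ and .fit performs the argmin over $\theta$.)

# Hypothetical example: predicting clicks from (banner, page) features.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(1000, 8)                    # made-up object features x
y = (np.random.rand(1000) < 0.1).astype(int)   # made-up click labels y

model = LogisticRegression()                   # algorithm family a_theta(x)
model.fit(X, y)                                # ~ argmin_theta of the log-loss L(y, a_theta(x))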
Supervised learning

Great... except if we have no reference answers

4
Online Ads
Great... except if we have no reference answers

We have:
● YouTube at your disposal
● Live data stream
(banner & video features, #clicked)
● (insert your favorite ML toolkit)

We want:
● Learn to pick relevant ads

Ideas?

5
Duct tape approach

Common idea:
● Initialize with naïve solution

● Get data by trial and error and error and error and error

● Learn (situation) → (optimal action)

● Repeat

6
Giant Death Robot (GDR)
Great... except if we have no reference answers

We have:
● Evil humanoid robot
● A lot of spare parts
to repair it :)

We want:
● Enslave humanity
● Learn to walk forward

Ideas?

7
Duct tape approach (again)

Common idea:
● Initialize with naïve solution

● Get data by trial and error and error and error and error

● Learn (situation) → (optimal action)

● Repeat

8
Duct tape approach

9
Problems

Problem 1:
● What exactly does the “optimal action” mean?

Extract as much money as you can right now
VS
Make the user happy so that they visit you again

10
Problems

Problem 2:
● If you always follow the “current optimal” strategy, you may never discover something better.
● If you show the same banner to 100% of users, you will never learn how other ads affect them.

Ideas?

11
Duct tape approach

12
Reinforcement learning

13
What-what learning?
Supervised learning:
● Learning to approximate reference answers
● Needs correct answers
● Model does not affect the input data

Reinforcement learning:
● Learning optimal strategy by trial and error
● Needs feedback on agent's own actions
● Agent can affect its own observations
What-what learning?
Unsupervised learning:
● Learning underlying data structure
● No feedback required
● Model does not affect the input data

Reinforcement learning:
● Learning optimal strategy by trial and error
● Needs feedback on agent's own actions
● Agent can affect its own observations
What is: bandit

[diagram: the agent receives an observation, picks an action, and gets feedback]

Examples:
– banner ads (RTB)
– recommendations
– medical treatment
16
What is: bandit

[diagram: the agent receives an observation, picks an action, and gets feedback]

Examples:
– banner ads (RTB)
– recommendations
– medical treatment

Q: what's the observation, action and feedback in the banner ads problem?
18
What is: bandit

[diagram: observation = user features, time of year, trends; action = show banner; feedback = click, money]

Examples:
– banner ads (RTB)
– recommendations
– medical treatment
19
What is: bandit

[diagram: observation = user features, time of year, trends; action = show banner; feedback = click, money]

Q: You're Yandex/Google/YouTube. There's a kind of banner that would have great click rates: the “clickbait”.

Is it a good idea to show clickbait?
20
What is: bandit

[diagram: the agent receives an observation, picks an action, and gets feedback]

Q: You're Yandex/Google/YouTube. There's a kind of banner that would have great click rates: the “clickbait”.

Is it a good idea to show clickbait?

A: No, no one will trust you after that!

21
What is: decision process

[diagram: Agent and Environment in a loop; the agent observes the environment and sends actions back]

22
What is: decision process

[diagram: Agent (we can make it do anything) and Environment (in the worst case we can't even see inside it), connected by observations and actions]

23
Reality check: web
● Cases:
  ● Pick ads to maximize profit
  ● Design landing page to maximize user retention
  ● Recommend movies to users
  ● Find pages relevant to queries
● Example:
  ● Observation – user features
  ● Action – show banner #i
  ● Feedback – did user click?

24
Reality check: dynamic systems

25
Reality check: dynamic systems

● Cases:
  ● Robots
  ● Self-driving vehicles
  ● Pilot assistant
  ● More robots!
● Example:
  ● Observation: sensor feed
  ● Action: voltage sent to motors
  ● Feedback: how far did it move forward before falling

26
Reality check: videogames

● Q: What are observations, actions and feedback?

27
Reality check: videogames

● Q: What are observations, actions and feedback?

28
Other use cases
● Personalized medical treatment

● Even more games (Go, chess, etc)

● Q: What are observations, actions and feedback?

29
Other use cases
● Conversation systems
– learning to make user happy
● Quantitative finance
– portfolio management
● Deep learning
– optimizing non-differentiable loss
– finding optimal architecture

30
The MDP formalism

Markov Decision Process


● Environment states: $s \in S$
● Agent actions: $a \in A$
● Rewards: $r \in \mathbb{R}$
● Dynamics: $P(s_{t+1} \mid s_t, a_t)$

31
The MDP formalism

[diagram: agent picks actions $a \in A$, environment is in states $s \in S$ and evolves according to $P(s_{t+1} \mid s_t, a_t)$]

Markov Decision Process

Markov assumption:

$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}) = P(s_{t+1} \mid s_t, a_t)$

32
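(Not on the slides: a minimal sketch of the agent-environment loop behind this formalism, assuming a Gym-style API; the environment name and the random policy are placeholders.)

import gymnasium as gym   # assumption: the gymnasium package; classic gym differs slightly

env = gym.make("CartPole-v1")                 # states s in S, actions a in A
s, _ = env.reset()
total_reward, done = 0.0, False
while not done:
    a = env.action_space.sample()                    # placeholder policy pi(a|s)
    s, r, terminated, truncated, _ = env.step(a)     # environment samples s' ~ P(s'|s,a) and reward r
    total_reward += r
    done = terminated or truncated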
Total reward

Total reward for a session:

$R = \sum_t r_t$

Agent's policy:

$\pi(a \mid s) = P(\text{take action } a \mid \text{in state } s)$

Problem: find the policy with the highest expected reward:

$\pi(a \mid s): \quad E_\pi[R] \to \max$

34
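(A sketch, not from the slides: $E_\pi[R]$ is typically estimated by Monte Carlo, i.e. by averaging the total reward over several sampled sessions. The play_session helper and the Gym-style env are assumptions.)

import numpy as np

def play_session(env, policy, t_max=1000):
    # Plays one session with `policy` and returns its total reward R = sum_t r_t.
    s, _ = env.reset()
    R = 0.0
    for _ in range(t_max):
        a = policy(s)                                    # a ~ pi(a|s)
        s, r, terminated, truncated, _ = env.step(a)
        R += r
        if terminated or truncated:
            break
    return R

# E_pi[R] is approximated by the mean return over, say, 100 sessions:
# print(np.mean([play_session(env, policy) for _ in range(100)]))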
Objective

The easy way:

$E_\pi[R]$ is the expected sum of rewards that an agent with policy $\pi$ earns per session.

The hard way:

$E_{s_0 \sim p(s_0)} \; E_{a_0 \sim \pi(a \mid s_0)} \; E_{s_1, r_0 \sim P(s', r \mid s, a)} \cdots E_{s_T, r_T \sim P(s', r \mid s_{T-1}, a_{T-1})} \left[ r_0 + r_1 + r_2 + \ldots + r_T \right]$

35
How do we solve it?

General idea:

Play a few sessions

Update your policy

Repeat

37
Crossentropy method
Initialize policy

Repeat:
– Sample N [e.g. 100] sessions
– Pick the M [e.g. 25] best sessions, called elite sessions
– Change the policy so that it prioritizes actions from the elite sessions
38
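(A compact sketch of this loop, not taken from the slides: the helpers generate_session and update_policy are hypothetical, sessions are assumed to be (states, actions, total_reward) tuples, and the 75th-percentile threshold is one way to keep roughly the 25 best of 100 sessions.)

import numpy as np

for iteration in range(50):                                         # "Repeat"
    sessions = [generate_session(env, policy) for _ in range(100)]  # sample N = 100 sessions
    rewards = np.array([r for (_, _, r) in sessions])
    threshold = np.percentile(rewards, 75)                          # ~ M = 25 elite sessions
    elite = [(s, a) for (s, a, r) in sessions if r >= threshold]
    elite_states  = np.concatenate([s for s, a in elite])
    elite_actions = np.concatenate([a for s, a in elite])
    policy = update_policy(policy, elite_states, elite_actions)     # make elite actions more likely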
Step-by-step view

[figures, slides 39-49: step-by-step animation of the crossentropy method iterations, with the elite sessions highlighted]
Tabular crossentropy method
● Policy is a matrix

$\pi(a \mid s) = A_{s, a}$

● Sample N games with that policy
● Get the M best sessions (elites)

$\text{Elite} = [(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots, (s_k, a_k)]$
50
Tabular crossentropy method
● Policy is a matrix

$\pi(a \mid s) = A_{s, a}$

● Sample N games with that policy
● Take the M best sessions (elites)
● Aggregate by states:

$\pi(a \mid s) = \dfrac{\sum_{s_t, a_t \in \text{Elite}} [s_t = s][a_t = a]}{\sum_{s_t, a_t \in \text{Elite}} [s_t = s]}$

51
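(A numpy sketch of this aggregation, assuming integer state and action indices; falling back to a uniform distribution for states never visited by the elites is my addition, not something the slide specifies.)

import numpy as np

def update_tabular_policy(elite_states, elite_actions, n_states, n_actions):
    # pi(a|s) = (# times a was taken in s among elites) / (# times s occurred among elites)
    counts = np.zeros((n_states, n_actions))
    for s, a in zip(elite_states, elite_actions):
        counts[s, a] += 1
    visited = counts.sum(axis=1, keepdims=True)
    return np.where(visited > 0, counts / np.maximum(visited, 1), 1.0 / n_actions)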
Tabular crossentropy method
● Policy is a matrix

$\pi(a \mid s) = A_{s, a}$

● Sample N games with that policy
● Take the M best sessions (elites)
● Aggregate by states:

$\pi(a \mid s) = \dfrac{\text{how often the agent took } a \text{ at } s}{\text{how often it was at } s} \quad \text{(in the M best games)}$

52
Grim reality

If your environment has infinite/large state space

53
Approximate crossentropy method
● Policy is approximated
– Neural network predicts $\pi_W(a \mid s)$ given $s$
– Linear model / Random Forest / ...

Can't set $\pi(a \mid s)$ explicitly.

All state-action pairs from the M best sessions:

$\text{Elite} = [(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots, (s_k, a_k)]$
54
Approximate crossentropy method
Neural network predicts $\pi_W(a \mid s)$ given $s$

All state-action pairs from the M best sessions:

$\text{Elite} = [(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots, (s_k, a_k)]$

Maximize the likelihood of actions taken in the “best” games:

$\pi = \arg\max_\pi \sum_{s_i, a_i \in \text{Elite}} \log \pi(a_i \mid s_i)$
55
Approximate crossentropy method
● Initialize NN weights: $W_0 \leftarrow \text{random}$
● Loop:
– Sample N sessions
– $\text{Elite} = [(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots, (s_k, a_k)]$
– $W_{i+1} = W_i + \alpha \nabla \left[ \sum_{s_i, a_i \in \text{Elite}} \log \pi_W(a_i \mid s_i) \right]$
56
Approximate crossentropy method
● Initialize NN weights: nn = MLPClassifier(...)
● Loop:
– Sample N sessions
– $\text{Elite} = [(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots, (s_k, a_k)]$
– nn.fit(elite_states, elite_actions)
57
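(An end-to-end sketch of this loop with scikit-learn, in the spirit of the slide but not verbatim from it: it assumes a Gym-style discrete-action environment, uses warm_start=True with max_iter=1 so repeated fit calls keep updating the same network, and does one dummy fit so predict_proba works before the first update.)

import numpy as np
import gymnasium as gym
from sklearn.neural_network import MLPClassifier

env = gym.make("CartPole-v1")
n_actions = env.action_space.n
obs_dim = env.observation_space.shape[0]

nn = MLPClassifier(hidden_layer_sizes=(32, 32), warm_start=True, max_iter=1)
nn.fit(np.zeros((n_actions, obs_dim)), np.arange(n_actions))   # dummy fit to register all classes

def generate_session(t_max=1000):
    states, actions, total_reward = [], [], 0.0
    s, _ = env.reset()
    for _ in range(t_max):
        p = nn.predict_proba(s.reshape(1, -1))[0]               # pi_W(a|s)
        a = np.random.choice(n_actions, p=p / p.sum())
        states.append(s); actions.append(a)
        s, r, terminated, truncated, _ = env.step(a)
        total_reward += r
        if terminated or truncated:
            break
    return states, actions, total_reward

for _ in range(50):
    sessions = [generate_session() for _ in range(100)]          # sample N sessions
    rewards = [r for (_, _, r) in sessions]
    threshold = np.percentile(rewards, 75)                       # keep the elite sessions
    elite_states  = np.concatenate([s for s, a, r in sessions if r >= threshold])
    elite_actions = np.concatenate([a for s, a, r in sessions if r >= threshold])
    # quirk of this warm_start approach: assumes every action appears among the elites
    nn.fit(elite_states, elite_actions)                          # maximize likelihood of elite actions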
Continuous action spaces
● Continuous state space
● Model $\pi_W(a \mid s) = N(\mu(s), \sigma^2)$
– $\mu(s)$ is a neural network output
– $\sigma$ is a parameter or yet another network output
● Loop:
– Sample N sessions
– elite = take the M best sessions and concatenate them
– $W_{i+1} = W_i + \alpha \nabla \left[ \sum_{s_i, a_i \in \text{Elite}} \log \pi_W(a_i \mid s_i) \right]$

What changed?

58
Approximate crossentropy method
● Initialize NN weights: nn = MLPRegressor(...)
● Loop:
– Sample N sessions
– $\text{Elite} = [(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots, (s_k, a_k)]$
– nn.fit(elite_states, elite_actions)

Almost nothing!

59
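(For the continuous case, a minimal sketch of the pieces that change, assuming $\mu(s)$ is an MLPRegressor and $\sigma$ is a fixed scalar; fitting the regressor on elite (s, a) pairs with squared error is equivalent to maximizing the Gaussian log-likelihood when $\sigma$ is held fixed.)

import numpy as np
from sklearn.neural_network import MLPRegressor

mu_net = MLPRegressor(hidden_layer_sizes=(32, 32))   # predicts mu(s)
sigma = 0.5                                          # fixed std; could be learned instead

def sample_action(s):
    # a ~ N(mu(s), sigma^2); mu_net must have been fit at least once before this is called
    mu = mu_net.predict(s.reshape(1, -1))[0]
    return np.random.normal(mu, sigma)

# after selecting elites exactly as in the discrete case:
# mu_net.fit(elite_states, elite_actions)            # regression on elite (s, a) pairs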
Tricks
● Remember sessions from the 3-5 past iterations (see the sketch after this list)
– Threshold and use all of them when training
– May converge more slowly if the env is easy to solve
● Regularize with entropy
– to prevent premature convergence
● Parallelize sampling
● Use RNNs if the environment is partially observable (later)

60
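(A sketch of the first trick, not from the slides: keep a small pool of recent sessions and threshold over the whole pool when picking elites; the pool size and percentile below are arbitrary.)

from collections import deque
import numpy as np

pool = deque(maxlen=5 * 100)          # roughly the sessions of the last 5 iterations

def select_elites(new_sessions, percentile=75):
    # each session is assumed to be a (states, actions, total_reward) tuple
    pool.extend(new_sessions)
    rewards = np.array([r for (_, _, r) in pool])
    threshold = np.percentile(rewards, percentile)
    return [(s, a) for (s, a, r) in pool if r >= threshold]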
