Reinforcement learning
In a nutshell
Supervised learning
Given:
● objects and answers $(x, y)$ — e.g. x = [banner, page] features, y = CTR
● algorithm family $a_\theta(x) \to y$ — linear model / tree / NN
● loss function $L(y, a_\theta(x))$ — MSE, cross-entropy
Find:
● parameters $\theta$ that minimize the loss: $\theta = \arg\min_\theta L(y, a_\theta(x))$
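A minimal sketch of this recipe in scikit-learn, with hypothetical toy data standing in for the (banner, page) features and click labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for objects x and answers y (illustrative, not real ad data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                              # objects x
y = (X @ np.array([0.5, -0.2, 0.1, 0.3]) > 0).astype(int)   # answers y

model = LogisticRegression()   # algorithm family a_theta(x)
model.fit(X, y)                # finds theta minimizing log-loss L(y, a_theta(x))
ctr_estimates = model.predict_proba(X)[:, 1]  # predicted click probabilities
```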
Online Ads
Great... except if we have no reference answers
We have:
● YouTube at your disposal
● Live data stream (banner & video features, #clicked)
● (insert your favorite ML toolkit)
We want:
● Learn to pick relevant ads
Ideas?
Duct tape approach
Common idea:
● Initialize with naïve solution
● Get data by trial and error and error and error and error
● Repeat
Giant Death Robot (GDR)
Great... except if we have no reference answers
We have:
● Evil humanoid robot
● A lot of spare parts to repair it :)
We want:
● Enslave humanity
● Learn to walk forward
Ideas?
Duct tape approach (again)
Common idea:
● Initialize with naïve solution
● Get data by trial and error and error and error and error
● Repeat
Problems
Problem 1:
● What exactly does the “optimal action” mean?
Extract as much money as you can right now
VS
Make the user happy so that they come back again
Problems
Problem 2:
● If you always follow the “current optimal” actions, you never gather data about the alternatives — one of which might be better
Ideas?
Reinforcement learning
What-what learning?
Supervised / unsupervised learning:
● Model does not affect the input data
Reinforcement learning:
● Agent can affect its own observations
What is: bandit
[diagram: the Agent receives an observation, picks an action, and gets feedback]
Examples:
– banner ads (RTB)
– recommendations
– medical treatment
Q: what are the observation, action and feedback in the banner ads problem?
A: observation — user features, time of year, trends; action — show a banner; feedback — click, money.
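To make the loop concrete, here is a toy, self-contained sketch of the banner-ads bandit with a simple epsilon-greedy agent. The banner count, click rates, and 10% exploration rate are illustrative assumptions (and observations are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
true_ctr = rng.uniform(0.01, 0.10, size=5)  # hidden click rate of each banner
shows = np.zeros(5)
clicks = np.zeros(5)

for t in range(10_000):
    if rng.random() < 0.1:                           # explore: random banner
        banner = rng.integers(5)
    else:                                            # exploit: best estimated CTR
        banner = int(np.argmax(clicks / np.maximum(shows, 1)))
    feedback = float(rng.random() < true_ctr[banner])  # did the user click?
    shows[banner] += 1                               # update agent statistics
    clicks[banner] += feedback
```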
What is: bandit
Q: You're Yandex/Google/YouTube. There's a kind of banner that would have great click rates: the “clickbait”. Should your agent show it?
What is: decision process
[diagram: the Agent and the Environment in a loop — the agent observes, then acts; the environment responds with new observations]
● Agent: can do anything
● Environment: in the worst case, the agent can't even see its internals
Reality check: web
● Cases:
● Pick ads to maximize profit
● Example:
● Observation — user features
● Action — which banner to show
● Feedback — clicks, money
Reality check: dynamic systems
● Cases:
● Robots
● Self-driving vehicles
● Pilot assistant
● More robots!
● Example:
● Observation: sensor feed
Reality check: videogames
● Q: What are the observations, actions and feedback?
Other use cases
● Personalized medical treatment
● Conversation systems
– learning to make the user happy
● Quantitative finance
– portfolio management
● Deep learning
– optimizing non-differentiable loss
– finding optimal architecture
The MDP formalism
● actions: $a \in A$
● states: $s \in S$
● transition dynamics: $P(s_{t+1} \mid s_t, a_t)$
● policy: $\pi(a \mid s)$, trained so that $\mathbb{E}_\pi[R] \to \max$
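A sketch of the MDP interaction loop, assuming the Gymnasium API: the environment hides $P(s_{t+1} \mid s_t, a_t)$, and a placeholder random policy stands in for $\pi(a \mid s)$.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
s, _ = env.reset()
total_reward = 0.0
while True:
    a = env.action_space.sample()                 # a ~ pi(a|s), here uniform
    s, r, terminated, truncated, _ = env.step(a)  # s' ~ P(s'|s,a), reward r
    total_reward += r
    if terminated or truncated:
        break
```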
Objective
● Total session reward: $R = \sum_t r_t$
● Find a policy $\pi(a \mid s)$ maximizing the expected reward: $\mathbb{E}_\pi[R] \to \max$
How do we solve it?
General idea:
● Start with some policy
● Repeat: play sessions with it, then adjust it toward what the best sessions did
Crossentropy method
Initialize policy
Repeat:
– Sample N [~100] sessions
– Pick the M best of them: the “elites”
– Update the policy so that elite actions become more likely
Step-by-step view
[figure sequence over several iterations: sampled sessions plotted by reward, the best of them marked as “elites”, and the policy shifting toward the elites]
Tabular crossentropy method
● Policy is a matrix: $\pi(a \mid s) = A_{s,a}$
● Update on the elites:
$\pi(a \mid s) = \dfrac{[\text{took } a \text{ at } s \text{ in } M \text{ best games}]}{[\text{was at } s \text{ in } M \text{ best games}]}$
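A minimal sketch of one tabular iteration, assuming a discrete Gymnasium-style env; sample_session and update_policy are hypothetical helper names, not from the slides:

```python
import numpy as np

def sample_session(env, policy, t_max=1000):
    """Play one episode, sampling actions from the policy matrix pi(a|s)."""
    states, actions, total_reward = [], [], 0.0
    s, _ = env.reset()
    for _ in range(t_max):
        a = np.random.choice(policy.shape[1], p=policy[s])
        states.append(s)
        actions.append(a)
        s, r, terminated, truncated, _ = env.step(a)
        total_reward += r
        if terminated or truncated:
            break
    return states, actions, total_reward

def update_policy(elites, n_states, n_actions):
    """pi(a|s) = (#times a taken at s in elite games) / (#visits to s)."""
    counts = np.zeros((n_states, n_actions))
    for states, actions, _ in elites:
        for s, a in zip(states, actions):
            counts[s, a] += 1
    visits = counts.sum(axis=1, keepdims=True)
    # fall back to uniform probabilities for states the elites never visited
    return np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / n_actions)
```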
Grim reality
● For any realistic problem there are far too many states to store the policy as a table
Approximate crossentropy method
● Policy is approximated
– Neural network predicts $\pi_W(a \mid s)$ given $s$
– Linear model / Random Forest / ...
● Fit on elite (state, action) pairs:
$\pi = \operatorname{argmax}_\pi \sum_{s_i, a_i \in \text{Elite}} \log \pi(a_i \mid s_i)$
Approximate crossentropy method
● Initialize NN weights $W_0 \leftarrow$ random
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
– nn.fit(elite_states, elite_actions)
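A sketch of that full loop, assuming scikit-learn's MLPClassifier as the policy network and CartPole as the environment (the layer sizes, N = 100 and the top-25% elite cutoff are illustrative):

```python
import numpy as np
import gymnasium as gym
from sklearn.neural_network import MLPClassifier

env = gym.make("CartPole-v1")
n_actions = env.action_space.n
agent = MLPClassifier(hidden_layer_sizes=(20, 20))
s0, _ = env.reset()
agent.partial_fit([s0], [0], classes=list(range(n_actions)))  # W_0 ~ random

def sample_session(t_max=1000):
    """Play one episode, sampling actions from pi_W(a|s)."""
    states, actions, total_reward = [], [], 0.0
    s, _ = env.reset()
    for _ in range(t_max):
        probs = agent.predict_proba([s])[0]          # pi_W(a|s)
        a = np.random.choice(n_actions, p=probs)
        states.append(s)
        actions.append(a)
        s, r, terminated, truncated, _ = env.step(a)
        total_reward += r
        if terminated or truncated:
            break
    return states, actions, total_reward

for _ in range(30):                                    # training iterations
    sessions = [sample_session() for _ in range(100)]  # sample N = 100 sessions
    rewards = [r for _, _, r in sessions]
    threshold = np.percentile(rewards, 75)             # elites = top 25%
    elite_states = np.concatenate([s for s, _, r in sessions if r >= threshold])
    elite_actions = np.concatenate([a for _, a, r in sessions if r >= threshold])
    agent.partial_fit(elite_states, elite_actions)     # nn.fit(elite_states, elite_actions)
```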
Continuous action spaces
● Continuous action space: model $\pi_W(a \mid s) = \mathcal{N}(\mu(s), \sigma^2)$
– $\mu(s)$ is a neural network output
– $\sigma$ is a parameter, or yet another network output
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
– $W_{i+1} = W_i + \alpha \nabla \left[\sum_{s_i, a_i \in \text{Elite}} \log \pi_W(a_i \mid s_i)\right]$
What changed?
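A sketch of that gradient step, assuming PyTorch with a 4-dimensional state, a 1-dimensional action, and a fixed $\sigma$ (all sizes and the learning rate are illustrative):

```python
import torch

mu_net = torch.nn.Sequential(               # mu(s): state -> action mean
    torch.nn.Linear(4, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
opt = torch.optim.SGD(mu_net.parameters(), lr=1e-2)   # alpha
sigma = 0.5                                 # fixed exploration scale

def cem_step(elite_states, elite_actions):
    """One ascent step on the sum over elites of log pi_W(a_i | s_i)."""
    dist = torch.distributions.Normal(mu_net(elite_states), sigma)
    loss = -dist.log_prob(elite_actions).sum()  # negative log-likelihood
    opt.zero_grad()
    loss.backward()
    opt.step()
```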
Approximate crossentropy method
● Initialize NN weights: nn = MLPRegressor(...)
● Loop:
– Sample N sessions
– nn.fit(elite_states, elite_actions)
Almost nothing!
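To illustrate how little changes, a sketch of the continuous-action variant: the regressor plays the role of $\mu(s)$, and a fixed $\sigma$ (an assumption here) adds exploration noise.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

agent = MLPRegressor(hidden_layer_sizes=(32, 32))
sigma = 0.3

def act(s):
    mu = agent.predict([s])[0]                # mu(s) from the network
    return mu + np.random.normal(0.0, sigma)  # a ~ N(mu(s), sigma^2)

# Then, exactly as in the discrete case: sample sessions with act(),
# pick the M best, and call agent.fit(elite_states, elite_actions).
```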
Tricks
● Remember sessions from the 3-5 past iterations (see the sketch below)
– Threshold and use all of them when training
– May converge slower if the env is easy to solve
● Parallelize sampling
● Use RNNs if the environment is partially observable (more on this later)
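A sketch of the first trick, assuming each session is a (states, actions, reward) tuple and N = 100 sessions per iteration (buffer size and percentile are illustrative):

```python
from collections import deque
import numpy as np

buffer = deque(maxlen=500)   # sessions from the last ~5 iterations (N = 100 each)

def select_elites(new_sessions, percentile=75):
    """Pool fresh sessions with remembered ones, then threshold over the pool."""
    buffer.extend(new_sessions)
    rewards = [r for _, _, r in buffer]
    threshold = np.percentile(rewards, percentile)
    return [(s, a, r) for s, a, r in buffer if r >= threshold]
```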