7. Repeated Games
Dana Nau, University of Maryland
Repeated Games
!! Used by game theorists, economists, social and behavioral scientists
Roshambo
Prisoner's Dilemma (row player 1, column player 2):

              C       D
    C        3, 3    0, 5
    D        5, 0    1, 1
!! Each repetition of the stage game is called an iteration or a round
!! Usually each agent knows what all the agents did in the previous iterations, but not what they're doing in the current iteration
!! Thus, a repeated game is an imperfect-information game
!! Iterated Prisoner's Dilemma with 2 iterations:
   Agent 1: Round 1: C, Round 2: D, total payoff 3 + 5 = 8
   Agent 2: Round 1: C, Round 2: C, total payoff 3 + 0 = 3
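The two-round bookkeeping above can be checked with a short script. This is an illustrative sketch: the `PAYOFF` table encodes the stage-game matrix, and `total_payoffs` is a hypothetical helper name.

```python
# Payoffs for the Prisoner's Dilemma stage game, as (row, column):
# mutual cooperation 3 each, mutual defection 1 each,
# lone defector 5, lone cooperator 0.
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def total_payoffs(moves1, moves2):
    """Sum each agent's stage-game payoffs over all iterations."""
    t1 = t2 = 0
    for m1, m2 in zip(moves1, moves2):
        p1, p2 = PAYOFF[(m1, m2)]
        t1 += p1
        t2 += p2
    return t1, t2

# Agent 1 plays C then D; Agent 2 plays C in both rounds.
print(total_payoffs(['C', 'D'], ['C', 'C']))  # (8, 3)
```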
Strategies
!! The repeated game has a much bigger strategy space than the stage game !! One kind of strategy is a stationary strategy:
!! Use the same strategy at every iteration
!! More generally, a strategy can choose the action at each iteration based on what all the agents did in the previous iterations
Backward Induction
!! If the number of iterations is finite and known, we can use backward induction: D strictly dominates C in the last round; given that, D dominates in the next-to-last round, and so on
!! Thus both agents defect at every round:
   Agent 1: D D D D
   Agent 2: D D D D
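The backward-induction argument can be sketched in a few lines. This is a minimal illustration, not the slides' own code: since the continuation play is fixed regardless of today's actions, each round reduces to the one-shot stage game, in which D strictly dominates C.

```python
# Stage-game payoffs as (row, column) pairs.
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def spe_play(rounds):
    """Backward induction on the finitely repeated PD.

    In the last round, play is a one-shot PD and D strictly dominates C.
    Given that the remaining play is fixed, the same argument applies to
    every earlier round, so both agents defect throughout."""
    for opp in 'CD':  # check: D dominates C against either opponent move
        assert PAYOFF[('D', opp)][0] > PAYOFF[('C', opp)][0]
    return [('D', 'D')] * rounds

print(spe_play(4))  # [('D', 'D'), ('D', 'D'), ('D', 'D'), ('D', 'D')]
```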
!! Agent i's average reward is the limiting average payoff per iteration:

   lim_{k → ∞} Σ_{j=1}^{k} r_i(j) / k

!! Agent i's future discounted reward is the discounted sum of the payoffs, i.e.,

   Σ_{j=1}^{∞} β^j r_i(j),   where 0 ≤ β < 1 is the discount factor

!! Two ways to interpret the discount factor:
1.! The agent cares more about the present than the future
2.! The agent cares about the future, but the game ends at any round with probability 1 − β
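Assuming the definitions above (payoff r_i(j) at iteration j, discount factor β), the two reward criteria can be sketched as follows; `discounted_reward` and `average_reward` are illustrative helper names.

```python
def discounted_reward(payoffs, beta):
    """Future discounted reward: sum over j of beta**j * r(j)."""
    return sum(beta**j * r for j, r in enumerate(payoffs, start=1))

def average_reward(payoffs):
    """Average payoff over the first k iterations; the limit as
    k -> infinity, when it exists, is the agent's average reward."""
    return sum(payoffs) / len(payoffs)

# An agent receiving the mutual-cooperation payoff of 3 forever has
# discounted reward 3*beta/(1 - beta); with beta = 0.9 that's 27.
print(round(discounted_reward([3] * 1000, 0.9), 6))  # 27.0
print(average_reward([3] * 1000))                    # 3.0
```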
Nau: Game Theory 7
Example
!! Some well-known strategies for the Iterated Prisoner's Dilemma:
! AllD: always defect
! Grim: cooperate until the other agent defects, then defect forever
! Tit-for-Tat (TFT): cooperate on the first move; on the nth move, repeat the other agent's (n−1)th move
! Tester: defect on move 1. If the other agent retaliates, play TFT. Otherwise, randomly intersperse cooperation and defection
!! If the discount factor is large enough, each of the following is a Nash equilibrium:
!! (TFT, TFT), (TFT, Grim), and (Grim, Grim)
!! The infinitely repeated game has a Nash equilibrium whose average payoffs are (p1, p2, …, pn) if and only if
!! G has a mixed-strategy profile (s1, s2, …, sn) with the following property:
! For each i, i's payoff would be ≤ pi if the other agents used minimax strategies against i
!! Proof idea: use the notion of best response to show that in every equilibrium, an agent's average payoff ≥ the agent's minimax value
!! Show how to construct an equilibrium that gives each agent i the average payoff pi, given certain constraints on (p1, p2, …, pn)
! In this equilibrium, the agents cycle in lock-step through a sequence of game outcomes that achieve (p1, p2, …, pn)
! If any agent i deviates, then the others punish i forever, by playing their minimax strategies against i
!! There's a large family of such theorems, known collectively as folk theorems
!! For two-player zero-sum games, the theorem becomes vacuous:
!! Suppose we iterate a two-player zero-sum game G
!! Let V be the value of G (from the Minimax Theorem)
!! If agent 2 uses a minimax strategy against 1, then 1's maximum payoff is V
! If agent 1 plays a non-minimax strategy s1 and agent 2 plays his/her best response, 2's expected payoff will be higher than −V
www.cs.ualberta.ca/~darse/rsbpc1.html
!! Round-robin tournament:
! 55 programs, 1000 iterations for each pair of programs
! Lowest possible score = −55000, highest possible score = 55000
!! Average over 25 tournaments:
!! Widely used to study the emergence of cooperation

                    P2
                 Cooperate   Defect
  P1 Cooperate     3, 3       0, 5
     Defect        5, 0       1, 1
!! TFT did well in Axelrod's tournaments:
! It could establish and maintain cooperation with many other agents
! It could prevent malicious agents from taking advantage of it
TFT vs. AllD:    TFT: C D D D D D D    AllD:   D D D D D D D
TFT vs. Grim:    TFT: C C C C C C C    Grim:   C C C C C C C
TFT vs. TFT:     TFT: C C C C C C C    TFT:    C C C C C C C
TFT vs. Tester:  TFT: C D C C C C C    Tester: D C C C C C C
Example
!! A real-world example of the IPD, described in Axelrod's book: World War I trench warfare
!! Incentive to cooperate:
!! If I attack the other side, then they'll retaliate and I'll get hurt
!! If I don't attack, maybe they won't either
Noise
!! With some probability, a noise gremlin will change some of the actions
! Cooperate (C) becomes Defect (D), and vice versa
! e.g., an agent's C C C C may come out as C C D C
!! Can use this to model accidents
! Compute the score using the changed action
!! Can also model misinterpretations
! Compute the score using the original action
Example of Noise
"… out to investigate. We found our men and the Germans standing on their respective parapets. Suddenly a salvo arrived but did no damage. Naturally both sides got down and our men started swearing at the Germans, when all at once a brave German got onto his parapet and shouted out: 'We are very sorry about that; we hope no one was hurt. It is not our fault. It is that damned Prussian artillery.'"
!! The salvo wasn't the German infantry's intention
!! They didn't expect it, nor desire it
!! Consider two agents who both use TFT
!! One accident or misinterpretation can cause a long string of retaliations:

   Agent 1: C C C C D C D C ...
   Agent 2: C C C D C D C D ...
                  ↑
            noise flips Agent 2's move from C to D, and the two agents then retaliate against each other in alternation
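A quick simulation (illustrative names, not the slides' code) reproduces the echo: one flipped move between two TFT agents turns into an endless alternation of retaliations.

```python
def tft(their_hist):
    """Tit-for-Tat: cooperate first, then copy the other's last move."""
    return their_hist[-1] if their_hist else 'C'

def play_with_noise(rounds, flip):
    """Two TFT agents; the move at (round, agent) given by `flip`
    is inverted, as if changed by the noise gremlin."""
    h1, h2 = [], []
    for t in range(rounds):
        m1, m2 = tft(h2), tft(h1)
        if flip == (t, 1):
            m1 = 'D' if m1 == 'C' else 'C'
        if flip == (t, 2):
            m2 = 'D' if m2 == 'C' else 'C'
        h1.append(m1)
        h2.append(m2)
    return ''.join(h1), ''.join(h2)

# Flip Agent 2's 4th move (t = 3): the defection echoes back and
# forth for the rest of the game.
print(play_with_noise(8, (3, 2)))  # ('CCCCDCDC', 'CCCDCDCD')
```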
Discussion
!! The British army officer's story:
! A German shouted, "We are very sorry about that; we hope no one was hurt"
!! The apology was consistent with the German infantry's past behavior
!! The British had ample evidence that the German infantry wanted to keep the peace, so they attributed the salvo to the noise
!! IPD agents often behave deterministically
! For others to cooperate with you, it helps if you're predictable
!! From the other agent's recent behavior, build a model π of the other agent's strategy
!! Use the model to filter noise
!! Use the model to help plan our next move
Au & Nau. Accident or intention: That is the question (in the iterated prisoner's dilemma). AAMAS, 2006.
Au & Nau. Is it accidental or intentional? A symbolic approach to the noisy iterated prisoner's dilemma. In G. Kendall (ed.), The Iterated Prisoner's Dilemma: 20 Years On. World Scientific, 2007.
!! Each rule has the form: if our last move was m and their last move was m', then P[their next move will be C] = p
!! Four rules: one for each of (C,C), (C,D), (D,C), and (D,D)
! e.g., p = 1 for (C,D); p = 1 for (D,C); p = 0 for (D,D)
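One plausible way to estimate the four rule probabilities from an observed history is simple frequency counting, sketched below; `build_model` is a hypothetical helper, a simplified stand-in for the learning scheme described in the cited papers.

```python
from collections import defaultdict

def build_model(my_moves, their_moves):
    """Estimate the four rules (m, m') -> P[their next move is C]
    from the observed history, by frequency counts."""
    seen = defaultdict(lambda: [0, 0])  # (m, m') -> [count of C, total]
    for t in range(1, len(my_moves)):
        key = (my_moves[t - 1], their_moves[t - 1])  # last joint move
        seen[key][1] += 1
        if their_moves[t] == 'C':
            seen[key][0] += 1
    return {k: c / n for k, (c, n) in seen.items()}

# History against a TFT-like agent: they play C after we played C,
# and D after we played D.
model = build_model('CCDCD', 'CCCDC')
print(model)  # {('C', 'C'): 1.0, ('D', 'C'): 0.0, ('C', 'D'): 1.0}
```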
!! The model π tracks the other agent's recent behavior
!! If an agent's behavior changes, then the probabilities in π will change
! e.g., after Grim defects a few times, the rules will give a very low probability that Grim will cooperate
Noise Filtering
!! Suppose the applicable rule is deterministic:
! P[their next move will be C] = 0 or 1
!! If the observed move contradicts a deterministic rule, treat the move as noise: "The other agent cooperates when I do. I think these defections are actually noise, so I won't retaliate here."

   Observed moves: C C C C C C C D C C C C D C C C C C ...
   (the isolated D's are filtered out as noise)
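The filtering idea can be sketched in a few lines; `filter_noise` is a hypothetical helper, not DBS's actual code.

```python
def filter_noise(rule_prob, observed):
    """If the applicable rule is deterministic and the observed move
    contradicts it, treat the observation as noise and keep the
    predicted move instead."""
    if rule_prob == 1.0 and observed == 'D':
        return 'C'   # they always cooperate here, so the D was noise
    if rule_prob == 0.0 and observed == 'C':
        return 'D'   # they always defect here, so the C was noise
    return observed  # rule not deterministic: take the move at face value

print(filter_noise(1.0, 'D'))  # 'C' -- an isolated defection is discounted
print(filter_noise(0.5, 'D'))  # 'D'
```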
Change of Behavior
!! Anomalies in observed behavior can be due to noise, or to a genuine change of behavior, which can happen anytime
!! E.g., if noise affects one of Agent 1's actions, this may trigger a change in Agent 2's behavior
! Agent 1 does not know this has happened

   Agent 1: C C C→D C C C ...    (noise flips Agent 1's 3rd move)
   Agent 2: C C C D D D D D ...  (Agent 2 now always defects)

!! How to distinguish noise from a real change of behavior?
! "The other agent cooperates when I do. The defections might be accidents, so I shouldn't lose my temper too soon."
! "I think the other agent's behavior has really changed, so I'll change mine too."

   Agent 1: C C C C C C D D ...
   Agent 2: C C C D D D D D ...
Move generation
!! Modified version of game-tree search
!! Use the policy π to predict probabilities of the other agent's moves
!! Compute the expected utility of move x as

   u1(x) = Σ_{y ∈ {C,D}} u1(x, y) · P(y | π, previous moves)

   where x = my move, y = the other agent's move
!! Choose the move with the highest expected utility
Example

Suppose we have the rules:
1. (C,C) → 0.7
2. (C,D) → 0.4
3. (D,C) → 0.1
4. (D,D) → 0.1
Previous moves:
  Agent 1: C C D C
  Agent 2: C D C C
  Agent 2's next move: ?
u1(C) = 0.7·u1(C,C) + 0.3·u1(C,D) = 2.1 + 0 = 2.1
u1(D) = 0.7·u1(D,C) + 0.3·u1(D,D) = 3.5 + 0.3 = 3.8
! So D looks better
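The depth-1 computation can be reproduced directly from the rules; `PAYOFF_ME`, `RULES`, and `expected_utility` below are illustrative names encoding the example's numbers.

```python
# Agent 1's stage-game payoffs and the example's four rules,
# keyed by (my move, their move).
PAYOFF_ME = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}
RULES = {('C', 'C'): 0.7, ('C', 'D'): 0.4, ('D', 'C'): 0.1, ('D', 'D'): 0.1}

def expected_utility(my_move, p_c):
    """u1(x) = sum over y of u1(x, y) * P(y), where P(C) = p_c."""
    return (p_c * PAYOFF_ME[(my_move, 'C')]
            + (1 - p_c) * PAYOFF_ME[(my_move, 'D')])

p_c = RULES[('C', 'C')]  # last joint move was (C, C), so rule 1 applies
print(round(expected_utility('C', p_c), 6))  # 2.1
print(round(expected_utility('D', p_c), 6))  # 3.8
```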
!! Is D really what we should choose?
!! The rules say that after I defect, the other agent will retaliate with P = 0.9
! The depth-1 search didn't see this
!! But if we search to depth d > 1, we'll see it
!! C will look better, and we'll choose it instead
!! In general, it's best to look far ahead
! e.g., 60 moves
!! Treating the other agent's moves as a stochastic environment governed by π:
! Makes the search polynomial in the search depth
! Can easily search to depth 60
! Equivalent to solving an acyclic MDP of depth 60
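A depth-d search of this kind can be sketched as a memoized recursion over the last joint move, using the example's four rules; since the other agent's move is drawn from the model rather than chosen adversarially, there are only four states per level, so the cost grows linearly with depth (this is the acyclic-MDP view; all names are illustrative).

```python
from functools import lru_cache

# Agent 1's payoffs and the example's rules, keyed by (my move, their move).
PAYOFF_ME = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}
RULES = {('C', 'C'): 0.7, ('C', 'D'): 0.4, ('D', 'C'): 0.1, ('D', 'D'): 0.1}

@lru_cache(maxsize=None)
def value(last, depth):
    """Best expected utility over the next `depth` moves, given the
    last joint move `last`.  Their move is random under the model, so
    this is a small acyclic MDP solved by dynamic programming."""
    if depth == 0:
        return 0.0
    p_c = RULES[last]  # P[they play C | last joint move]
    return max(
        sum(p * (PAYOFF_ME[(x, y)] + value((x, y), depth - 1))
            for y, p in (('C', p_c), ('D', 1 - p_c)))
        for x in 'CD')

def best_move(last, depth):
    """The move that maximizes expected utility at the given depth."""
    p_c = RULES[last]
    return max('CD', key=lambda x: sum(
        p * (PAYOFF_ME[(x, y)] + value((x, y), depth - 1))
        for y, p in (('C', p_c), ('D', 1 - p_c))))

print(best_move(('C', 'C'), 1))  # 'D': depth 1 ignores the retaliation
print(best_move(('C', 'C'), 2))  # 'C': depth 2 already sees it
```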
!! This generates fairly good moves
!! Versions of DBS took most of the top 10 places
!! Two agents scored higher than DBS; both used master-and-slaves strategies
! 1 master, 19 slaves
!! When a slave plays with its master
! The slave cooperates while the master defects => maximizes the master's payoff
!! When a slave plays with an agent not in its team
! It defects => minimizes the other agent's payoff
Comparison
!! Analysis
!! Each master-and-slaves team's average score was much lower than DBS's
!! If BWIN and IMM01 had each been restricted to ≤ 10 slaves, DBS would have outscored them
Summary
!! Finitely repeated games: backward induction
!! Infinitely repeated games
!! average reward, future discounted reward
!! equilibrium payoffs
!! Non-equilibrium strategies
!! opponent modeling in roshambo
!! iterated prisoner's dilemma with noise
! opponent models based on observed behavior
! detection and removal of noise
! game-tree search against the opponent model
!! 20th-anniversary IPD competition