Reinforcement Learning - Algorytmy · Institute of Computing Science Poznan University of...

Institute of Computing Science Poznan University of Technology Reinforcement Learning Algorytmy Michal Kempka April 9, 2018


Institute of Computing Science, Poznan University of Technology

Reinforcement Learning: Algorithms

Michał Kempka

April 9, 2018


Recap


Recap: MDP

How do we formalize reinforcement learning?


Recap: MDP formulas

- $S$: a finite set of states,
- $A$: a finite set of actions,
- $P_a(s, s') = P(s_{t+1} = s' \mid s_t = s, a_t = a)$: the transition model, i.e. the probability that taking action $a$ in state $s$ leads to state $s'$,
- $R_a(s, s')$: the reward received for moving from state $s$ to state $s'$ by taking action $a$,
- $\gamma \in [0, 1]$: the discount factor, i.e. how far into the future we look.


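MDP: the solution

Choose actions that maximize the discounted sum of rewards:

$$\max \sum_{t=0}^{\infty} \gamma^t R_{a_t}(s_t, s_{t+1})$$

Analogies to supervised learning:
- actions → classes
- loss → negative reward
- agent → classifier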
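The objective above can be made concrete for a finite episode. The sketch below is only an illustration: the function name `discounted_return` and the example reward list are assumptions, not part of the slides.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite list of rewards r_0, r_1, ..."""
    total = 0.0
    # Walk backwards so that each step computes G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

A smaller γ shrinks the contribution of late rewards, which is what "how far into the future we look" means in practice.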


Bellman equation [1]

State Value

$$V(s) = \max_{a \in A} \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V(s') \right)$$

Action-State Value (Q-value)

$$Q(s, a) = \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma \max_{a' \in A} Q(s', a') \right)$$
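To see what the right-hand side of the state-value equation computes, here is a minimal sketch on a tiny MDP. The two-state MDP, the nested-list layout `P[a][s][s2]` / `R[a][s][s2]`, and the function names are all assumptions made for illustration.

```python
# Hypothetical 2-state, 2-action MDP used only for illustration.
# P[a][s][s2] is the transition probability, R[a][s][s2] the reward.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay in the current state
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch to the other state
]
R = [
    [[0.0, 0.0], [0.0, 1.0]],  # staying in state 1 pays 1
    [[0.0, 0.0], [0.0, 0.0]],  # switching pays nothing
]


def q_from_v(V, s, a, gamma):
    """Expected one-step reward plus discounted value for one (s, a) pair."""
    return sum(P[a][s][s2] * (R[a][s][s2] + gamma * V[s2]) for s2 in range(len(V)))


def bellman_backup(V, gamma):
    """Apply the state-value Bellman equation once to every state."""
    return [max(q_from_v(V, s, a, gamma) for a in range(len(P))) for s in range(len(V))]


V1 = bellman_backup([0.0, 0.0], gamma=0.9)
V2 = bellman_backup(V1, gamma=0.9)
```

Each call to `bellman_backup` propagates reward information one step further back through the state space.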


Full Disclosure

I borrowed a bit from Wojtek Jaśkowski's slides for the MiSIO (ISWD) course.


Policy

A policy is a mapping $\pi$:

$$\pi : S \to A$$

State Value

$$a = \pi(s) = \arg\max_{a \in A} \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V_\pi(s') \right)$$

Action-State Value (Q-value)

$$a = \pi(s) = \arg\max_{a \in A} \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma \max_{a' \in A} Q_\pi(s', a') \right)$$


Action-State Value (Q-value), model-free

$$a = \pi(s) = \arg\max_{a \in A} Q_\pi(s, a)$$


Value Iteration

Algorithm: Value Iteration
  given P(s, a, s') and R(s, a, s'); initialize V[.] arbitrarily
  repeat
    V_{t-1} = V
    for s ∈ S do
      V(s) = -∞
      for a ∈ A do
        q = 0
        for s' ∈ S do
          q += P(s, a, s') (R(s, a, s') + γ V_{t-1}(s'))
        end for
        V(s) = max(V(s), q)
      end for
    end for
  until V(.) converges
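The value iteration loop can be sketched in plain Python as follows. This is an illustrative version under stated assumptions: the two-state MDP, the `P[a][s][s2]` layout, the stopping tolerance `tol`, and the iteration cap `max_iters` are not from the slides.

```python
def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Repeat Bellman backups until the largest change in V drops below tol."""
    n_actions, n_states = len(P), len(P[0])
    V = [0.0] * n_states
    for _ in range(max_iters):
        V_prev = V[:]
        for s in range(n_states):
            # Best expected one-step reward plus discounted previous value.
            V[s] = max(
                sum(P[a][s][s2] * (R[a][s][s2] + gamma * V_prev[s2])
                    for s2 in range(n_states))
                for a in range(n_actions)
            )
        if max(abs(v - vp) for v, vp in zip(V, V_prev)) < tol:
            break
    return V


# Hypothetical MDP: action 0 stays, action 1 switches; staying in state 1 pays 1.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
R = [[[0.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
V = value_iteration(P, R, gamma=0.9)
```

For this toy problem the fixed point can be checked by hand: staying in state 1 forever is worth 1 / (1 − 0.9) = 10, and state 0 is one discounted step away from it.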


Temporal Difference Learning (TD)

Algorithm: Temporal Difference Learning
  initialize V[.] arbitrarily
  const η ∈ (0, ∞): learning rate
  s = s0 (the initial state, whatever that is)
  repeat
    if s is a terminal state then
      s = s0
    end if
    a = choose an action somehow
    observe <s, a, s', r>
    V_target = r + γ V(s')
    TD_error = V(s) - V_target
    V(s) = V(s) - η TD_error
    s = s'
  until V(.) converges
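A minimal sketch of the TD update on a toy problem, assuming a hypothetical deterministic chain with a single action (so "choose an action somehow" is trivial here) and a fixed number of episodes instead of a convergence check:

```python
def td0_chain(n_episodes, eta=0.5, gamma=0.9):
    """TD learning on a hypothetical chain 0 -> 1 -> 2, where state 2 is terminal."""
    terminal = 2
    V = [0.0, 0.0, 0.0]  # the terminal state's value is never updated and stays 0
    for _ in range(n_episodes):
        s = 0  # s0
        while s != terminal:
            s_next = s + 1                          # the only available action
            r = 1.0 if s_next == terminal else 0.0  # reward 1 on entering state 2
            v_target = r + gamma * V[s_next]
            td_error = V[s] - v_target
            V[s] -= eta * td_error
            s = s_next
    return V


V = td0_chain(n_episodes=100)
```

No transition model is used anywhere: the update only needs the observed tuple (s, a, s', r).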


Temporal Difference Learning (TD)

Algorithm: Temporal Difference Learning (compact form)
  initialize V[.] arbitrarily
  const η ∈ (0, ∞): learning rate
  s = s0 (the initial state, whatever that is)
  repeat
    if s is a terminal state then
      s = s0
    end if
    a = choose an action somehow
    observe <s, a, s', r>
    V(s) = V(s) - η (V(s) - (r + γ V(s')))
    s = s'
  until V(.) converges

What happens if η = 1?
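One way to approach the question above is to substitute η = 1 into the update. This is a worked substitution, not something stated on the slides:

```latex
V(s) \leftarrow V(s) - 1 \cdot \bigl( V(s) - (r + \gamma V(s')) \bigr) = r + \gamma V(s')
```

With η = 1 the old estimate is discarded and replaced by the latest sampled target. That is harmless when transitions and rewards are deterministic, but with stochastic ones the estimate keeps jumping with every sample instead of averaging over them.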


Exploration

How do we choose actions?

ε-greedy policy! Start with ε ≈ 1:

- with probability ε take a random action,
- with probability 1 − ε take the best action according to your current policy π(s),
- decrease ε as you wish (unless ε = 0).
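The three rules above can be sketched as follows. The multiplicative decay schedule and its `rate` and `floor` parameters are assumptions; the slides leave the schedule open.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay(epsilon, rate=0.99, floor=0.05):
    """One hypothetical schedule: shrink epsilon geometrically down to a floor."""
    return max(floor, epsilon * rate)


greedy_choice = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
random_choice = epsilon_greedy([0.1, 0.9, 0.3], epsilon=1.0, rng=random.Random(0))
```

Keeping a small positive floor is a common way to make sure the agent never stops exploring entirely.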


But how do we choose the best action?

How do we choose $a = \pi(s)$ given $V_\pi(s)$?


$$a = \pi(s) = \arg\max_{a \in A} \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V_\pi(s') \right)$$

We need to have P and R!
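The dependence on the model is visible in code: the greedy action cannot be computed from V alone. A minimal sketch, assuming a hypothetical two-state MDP stored as nested lists `P[a][s][s2]` and `R[a][s][s2]`:

```python
def greedy_action_from_v(V, P, R, s, gamma):
    """One-step lookahead: needs both the transition model P and the rewards R."""
    def lookahead(a):
        return sum(P[a][s][s2] * (R[a][s][s2] + gamma * V[s2]) for s2 in range(len(V)))
    return max(range(len(P)), key=lookahead)


# Hypothetical MDP: action 0 stays, action 1 switches; staying in state 1 pays 1.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
R = [[[0.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
V = [9.0, 10.0]  # values assumed to be already learned

policy = [greedy_action_from_v(V, P, R, s, gamma=0.9) for s in range(2)]
```

Here the agent switches out of state 0 and stays in state 1, but it could only work that out because `P` and `R` were handed to it.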


Q-learning to the rescue!

Algorithm: Q-learning
  initialize Q[.,.] arbitrarily
  const η ∈ (0, ∞): learning rate
  s = s0 (the initial state, whatever that is)
  repeat
    if s is a terminal state then
      s = s0
    end if
    a = choose an action (ε-greedy)
    observe <s, a, s', r>
    Q(s, a) = Q(s, a) - η (Q(s, a) - (r + γ max_{a' ∈ A} Q(s', a')))
    s = s'
  until Q(.,.) converges
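A tabular sketch of the loop above. The four-state chain environment, the fixed ε, the seed, and the fixed episode count (instead of a convergence test) are all assumptions chosen to keep the example small.

```python
import random


def q_learning_chain(n_episodes, eta=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Q-learning on a hypothetical chain 0 - 1 - 2 - 3 with terminal state 3.

    Action 0 moves left (state 0 is a wall), action 1 moves right;
    entering state 3 pays reward 1, everything else pays 0.
    """
    rng = random.Random(seed)
    terminal, n_actions = 3, 2
    Q = [[0.0] * n_actions for _ in range(terminal + 1)]
    for _ in range(n_episodes):
        s = 0  # s0
        while s != terminal:
            # epsilon-greedy action choice
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[s][b])
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == terminal else 0.0
            # Q[terminal] is never updated, so the bootstrap term is 0 there.
            target = r + gamma * max(Q[s_next])
            Q[s][a] -= eta * (Q[s][a] - target)
            s = s_next
    return Q


Q = q_learning_chain(n_episodes=500)
greedy_policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(3)]
```

The environment's dynamics appear only inside the simulated step; the update itself touches nothing but the observed (s, a, s', r).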


How do we choose the best action?

How do we choose $a = \pi(s)$ given $Q_\pi(s, \cdot)$?

$$a = \pi(s) = \arg\max_{a \in A} Q_\pi(s, a)$$

We do not need P and R, that is, a model of the world! That is why Q-learning is called model-free and everybody loves it (or at least loved it for many years).


Approximators

Storing Q in look-up tables (LUTs) is a killer for real-world problems, so we need parameterized approximators of Q-values, i.e. functions:

$$Q : (s, a, \Theta) \mapsto Q(s, a, \Theta) \in \mathbb{R}$$


Q-learning to the rescue!

Algorithm: Q-learning
  initialize Θ arbitrarily
  const η ∈ (0, ∞): learning rate
  s = s0 (the initial state, whatever that is)
  repeat
    if s is a terminal state then
      s = s0
    end if
    a = choose an action (ε-greedy)
    observe <s, a, s', r>
    loss = (Q(s, a, Θ) - (r + γ max_{a' ∈ A} Q(s', a', Θ)))^2
    Θ = Θ - η ∇_Θ loss
    s = s'
  until Θ converges
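A single gradient step of this algorithm can be sketched with a linear approximator. Several things here are assumptions rather than content from the slides: Q is taken to be linear with one weight vector per action, states arrive as feature lists, and the target is held constant when differentiating (the usual "semi-gradient" convention).

```python
def q_value(theta, features, a):
    """Linear approximator: Q(s, a, theta) = theta[a] . features(s)."""
    return sum(w * x for w, x in zip(theta[a], features))


def q_learning_step(theta, features, a, r, next_features, gamma, eta, terminal=False):
    """One in-place gradient step on the squared TD error; returns the loss."""
    if terminal:
        target = r
    else:
        target = r + gamma * max(q_value(theta, next_features, b) for b in range(len(theta)))
    error = q_value(theta, features, a) - target
    # With the target held fixed, d(error**2)/d(theta[a][i]) = 2 * error * features[i].
    for i, x in enumerate(features):
        theta[a][i] -= eta * 2.0 * error * x
    return error ** 2


theta = [[0.0, 0.0], [0.0, 0.0]]  # two actions, two features
loss = q_learning_step(theta, [1.0, 0.0], a=1, r=1.0,
                       next_features=[0.0, 1.0], gamma=0.9, eta=0.25)
```

Only the weights of the action actually taken change, and only along the features that were active in state s.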


SARSA: on-policy learning

Algorithm: SARSA
  initialize Θ arbitrarily
  const η ∈ (0, ∞): learning rate
  s = s0 (the initial state, whatever that is)
  repeat
    if s is a terminal state then
      s = s0
    end if
    a = choose an action (ε-greedy)
    observe <s, a, s', r, a'>
    loss = (Q(s, a, Θ) - (r + γ Q(s', a', Θ)))^2
    Θ = Θ - η ∇_Θ loss
    s = s'
  until Θ converges
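The only difference from Q-learning is the target. The sketch below puts the two side by side; the function names and the example numbers are illustrative assumptions.

```python
def q_learning_target(r, gamma, q_next):
    """Off-policy target: bootstrap from the best action in s'."""
    return r + gamma * max(q_next)


def sarsa_target(r, gamma, q_next, a_next):
    """On-policy target: bootstrap from the action a' actually chosen in s'."""
    return r + gamma * q_next[a_next]


q_next = [0.2, 1.0]  # hypothetical Q(s', .) for two actions
off_policy = q_learning_target(0.0, 0.5, q_next)
on_policy = sarsa_target(0.0, 0.5, q_next, a_next=0)  # an exploratory a'
```

When the ε-greedy policy happens to explore in s', SARSA's target reflects that weaker action, while Q-learning's target ignores it.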


Feature extraction

The state space is often huge and complicated, so people did what is done in other areas of ML, namely feature engineering, and our function becomes

$$Q : (\mathrm{extract\_features}(s), a, \Theta) \mapsto \mathbb{R}$$
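As an illustration of hand-made features, consider a hypothetical grid world in which the raw state is the agent's position together with the goal's position. The state layout, the grid size parameters, and the choice of features below are all assumptions.

```python
def extract_features(state, width, height):
    """Map a raw grid state to a short feature vector.

    state is (x, y, goal_x, goal_y); the features are a constant bias term
    and the signed, size-normalized offsets to the goal.
    """
    x, y, goal_x, goal_y = state
    return [1.0, (goal_x - x) / width, (goal_y - y) / height]


features = extract_features((1, 2, 3, 2), width=4, height=4)
```

Many different raw states now share the same feature vector, so whatever Θ learns for one of them carries over to the others.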


Deep Learning

But with the arrival of deep neural networks, things have changed a bit!


Atari and DeepMind

Human-level control through deep reinforcement learning


Atari Games


DQN Code


DQN: key ideas

- replay memory
- network freezing
- deep networks
- frame stacking
- frame skipping
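Of the ideas listed above, replay memory is the easiest to show in isolation. The class below is a minimal sketch, not DeepMind's implementation; the class name, the tuple layout, and the `seed` parameter are assumptions.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall out first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive transitions.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3, seed=0)
for step in range(5):
    memory.push(step, 0, 0.0, step + 1, False)
batch = memory.sample(2)
```

Training on random minibatches from such a buffer, rather than on the most recent transition, reduces the correlation between successive updates and lets each experience be reused.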


What next?

- the AI course from Berkeley on edX (lectures also on YouTube)
- David Silver's lectures (from DeepMind)
- take a look at AIGym


References I

[1] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
