Reinforcement Learning: Algorithms

Michał Kempka

Institute of Computing Science, Poznan University of Technology

April 9, 2018

Recap

MDP recap

How do we formalize reinforcement learning?


MDP recap: formulas

- $S$: a finite set of states,
- $A$: a finite set of actions,
- $P_a(s, s') = P(s_{t+1} = s' \mid s_t = s, a_t = a)$: the transition model, i.e. the probability that taking action $a$ in state $s$ leads to state $s'$,
- $R_a(s, s')$: the reward granted for moving from state $s$ to state $s'$ by taking action $a$,
- $\gamma \in [0, 1]$: the discount factor, i.e. how much we care about the future.
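To make the five ingredients concrete, here is a minimal Python sketch of an MDP stored in plain dictionaries. The two-state example, its state and action names, and all numbers are invented for illustration; only the roles of `S`, `A`, `P`, `R` and `gamma` come from the definitions above.

```python
# A hypothetical two-state MDP; every name and number here is invented.
S = ["low", "high"]
A = ["wait", "charge"]

# P[a][s][s2] plays the role of P_a(s, s').
P = {
    "wait":   {"low": {"low": 1.0, "high": 0.0}, "high": {"low": 0.5, "high": 0.5}},
    "charge": {"low": {"low": 0.25, "high": 0.75}, "high": {"low": 0.0, "high": 1.0}},
}

# R[a][s][s2] plays the role of R_a(s, s').
R = {
    "wait":   {"low": {"low": 0.0, "high": 0.0}, "high": {"low": 1.0, "high": 1.0}},
    "charge": {"low": {"low": -1.0, "high": -1.0}, "high": {"low": 0.0, "high": -1.0}},
}

gamma = 0.9  # discount factor


def is_valid_transition_model(P, S, A, tol=1e-9):
    """Check that P_a(s, .) is a probability distribution for every (s, a)."""
    for a in A:
        for s in S:
            row = [P[a][s][s2] for s2 in S]
            if any(p < 0.0 for p in row) or abs(sum(row) - 1.0) > tol:
                return False
    return True
```

The sanity check is worth running on any hand-written model, since a row that does not sum to 1 silently distorts every value computed from it.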


MDP: the solution

$$\max \sum_{t=0}^{\infty} \gamma^t R_{a_t}(s_t, s_{t+1})$$

Analogies to supervised learning:

- actions → classes
- loss → negative reward
- agent → classifier
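The discounted sum that the agent maximizes can be evaluated directly for a finite sequence of rewards. The sketch below is only an illustration; the function name `discounted_return` is an assumption, not something defined in the slides.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite list of rewards r_0, r_1, ..."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25.
example = discounted_return([1.0, 1.0, 1.0], 0.5)  # 1.75
```

With `gamma = 0` only the immediate reward counts, and with `gamma = 1` all rewards count equally, which is the sense in which the discount factor controls how far the agent looks ahead.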


Bellman equation [1]

State value:

$$V(s) = \max_{a \in A} \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V(s') \right)$$

Action-state value (Q-value):

$$Q(s, a) = \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma \max_{a' \in A} Q(s', a') \right)$$


Full disclosure

I borrowed a bit from the slides of Wojtek Jaśkowski's MiSIO (ISWD) course:

http://www.cs.put.poznan.pl/wjaskowski/pub/teaching/wmio/lectures/

http://www.cs.put.poznan.pl/wjaskowski/

Policy

A policy is a mapping $\pi$:

$$\pi : S \to A$$

State value:

$$a = \pi(s) = \arg\max_{a \in A} \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V_\pi(s') \right)$$

Action-state value (Q-value), model-free:

$$a = \pi(s) = \arg\max_{a \in A} Q_\pi(s, a)$$

where

$$Q_\pi(s, a) = \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma \max_{a' \in A} Q_\pi(s', a') \right)$$


Value Iteration

Algorithm 1: Value Iteration

    given the model P(s, a, s'), R(s, a, s') and the discount γ
    initialize V(.) arbitrarily
    repeat
        V_prev = V
        for s ∈ S do
            for a ∈ A do
                q(a) = 0
                for s' ∈ S do
                    q(a) += P(s, a, s') (R(s, a, s') + γ V_prev(s'))
                end for
            end for
            V(s) = max over a ∈ A of q(a)
        end for
    until V(.) converges
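A minimal Python sketch of value iteration, under the assumption that the model is stored as nested dictionaries `P[a][s][s2]` and `R[a][s][s2]`. It takes the maximum over actions, as in the Bellman equation for $V(s)$. The two-state deterministic MDP and the stopping tolerance `tol` are invented for illustration.

```python
def value_iteration(S, A, P, R, gamma, tol=1e-10):
    """Repeat Bellman backups until no state value changes by more than tol."""
    V = {s: 0.0 for s in S}
    while True:
        V_prev = dict(V)
        for s in S:
            V[s] = max(
                sum(P[a][s][s2] * (R[a][s][s2] + gamma * V_prev[s2]) for s2 in S)
                for a in A
            )
        if max(abs(V[s] - V_prev[s]) for s in S) < tol:
            return V


# Hypothetical deterministic MDP: "stay" keeps the state, "go" switches it.
S = [0, 1]
A = ["stay", "go"]
P = {
    "stay": {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
    "go":   {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}},
}
R = {
    "stay": {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 2.0}},
    "go":   {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 0.0}},
}

V = value_iteration(S, A, P, R, gamma=0.5)
```

For this toy model the fixed point can be checked by hand: staying in state 1 forever yields $V(1) = 2 / (1 - 0.5) = 4$, and going from state 0 to state 1 yields $V(0) = 1 + 0.5 \cdot 4 = 3$.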


Temporal Difference Learning (TD)

Algorithm 2: Temporal Difference Learning

    initialize V(.) arbitrarily
    const η ∈ (0, ∞): learning rate
    s = s0 (the initial state, whatever that is)
    repeat
        if s is a terminal state then
            s = s0
        end if
        a = choose an action somehow
        observe <s, a, s', r>
        V_target = r + γ V(s')
        TD_error = V(s) − V_target
        V(s) = V(s) − η TD_error
        s = s'
    until V(.) converges


Temporal Difference Learning (TD), compact form

The target and the error can be folded into a single update:

    V(s) = V(s) − η (V(s) − (r + γ V(s')))

What happens if η = 1?
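One way to explore this question is to run the single TD update by hand. The sketch below assumes V is stored in a dictionary; the function name `td_update` and the two-state example are hypothetical.

```python
def td_update(V, s, r, s_next, eta, gamma):
    """Apply V(s) = V(s) - eta * (V(s) - (r + gamma * V(s'))) in place."""
    target = r + gamma * V[s_next]
    td_error = V[s] - target
    V[s] = V[s] - eta * td_error
    return V[s]


# The target is 1 + 0.5 * 10 = 6 in both cases.
V_half = {"a": 0.0, "b": 10.0}
half = td_update(V_half, "a", r=1.0, s_next="b", eta=0.5, gamma=0.5)  # 3.0

V_full = {"a": 0.0, "b": 10.0}
full = td_update(V_full, "a", r=1.0, s_next="b", eta=1.0, gamma=0.5)  # 6.0
```

With η = 1 the old estimate is discarded and V(s) simply becomes the latest target, so each noisy sample overwrites everything learned before; smaller values of η average over samples instead.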


Exploration

How do we choose actions?

With an ε-greedy policy! Start with ε ≈ 1, then:

- with probability ε take a random action,
- with probability 1 − ε take the best action according to your current policy π(s),
- decrease ε as you wish (unless ε = 0).
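The ε-greedy rule fits in a few lines of Python. This is a sketch under the assumption that the current value estimates for one state are available as a dictionary from action to value; the names `epsilon_greedy` and `decay_epsilon`, and the multiplicative decay schedule, are illustrative choices rather than anything prescribed by the slides.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-valued one."""
    if rng.random() < epsilon:
        return rng.choice(list(action_values))
    return max(action_values, key=action_values.get)


def decay_epsilon(epsilon, rate=0.99, minimum=0.05):
    """One possible schedule: shrink epsilon geometrically down to a floor."""
    return max(minimum, epsilon * rate)
```

Setting `epsilon=0.0` gives a purely greedy agent and `epsilon=1.0` a purely random one; the floor in `decay_epsilon` keeps a little exploration alive for the whole run.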




But how do we choose the best action?

How do we pick $a = \pi(s)$ when all we have is $V_\pi(s)$?

$$a = \pi(s) = \arg\max_{a \in A} \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V_\pi(s') \right)$$

We need to have P and R!
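The dependence on the model is visible as soon as the argmax is written out in code. In this sketch the function `greedy_action_from_v` and the small deterministic MDP are hypothetical; the point is only that the function cannot be called without `P` and `R`.

```python
def greedy_action_from_v(s, V, S, A, P, R, gamma):
    """One-step lookahead over the model: argmax_a sum_s' P (R + gamma * V)."""
    def lookahead(a):
        return sum(P[a][s][s2] * (R[a][s][s2] + gamma * V[s2]) for s2 in S)
    return max(A, key=lookahead)


# Hypothetical deterministic MDP: "stay" keeps the state, "go" switches it.
S = [0, 1]
A = ["stay", "go"]
P = {
    "stay": {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
    "go":   {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}},
}
R = {
    "stay": {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 2.0}},
    "go":   {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 0.0}},
}
V = {0: 3.0, 1: 4.0}  # assumed state values for this toy model

best_in_0 = greedy_action_from_v(0, V, S, A, P, R, gamma=0.5)  # "go"
best_in_1 = greedy_action_from_v(1, V, S, A, P, R, gamma=0.5)  # "stay"
```

In state 0 the lookahead scores are 1.5 for "stay" and 3.0 for "go"; in state 1 they are 4.0 for "stay" and 1.5 for "go".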


Q-learning to the rescue!

Algorithm 5: Q-learning

    initialize Q(., .) arbitrarily
    const η ∈ (0, ∞): learning rate
    s = s0 (the initial state, whatever that is)
    repeat
        if s is a terminal state then
            s = s0
        end if
        a = choose an action (ε-greedy)
        observe <s, a, s', r>
        Q(s, a) = Q(s, a) − η (Q(s, a) − (r + γ max over a' ∈ A of Q(s', a')))
        s = s'
    until Q(., .) converges
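Below is a small runnable sketch of tabular Q-learning. The environment is a hypothetical four-state chain (states 0 to 3, reward 1 for reaching state 3) invented for this example, and the hyperparameter values are arbitrary; the update line itself follows the algorithm above.

```python
import random

ACTIONS = ["left", "right"]
GOAL = 3  # terminal state of the hypothetical chain 0-1-2-3


def step(s, a):
    """Toy deterministic environment: reward 1 only when the goal is reached."""
    s_next = min(s + 1, GOAL) if a == "right" else max(s - 1, 0)
    return s_next, (1.0 if s_next == GOAL else 0.0)


def q_learning(episodes=200, eta=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        s = 0  # every episode restarts in s0
        while s != GOAL:
            # epsilon-greedy action choice
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])
            s_next, r = step(s, a)
            # Q of the terminal state is never updated, so it stays 0
            best_next = max(Q[(s_next, act)] for act in ACTIONS)
            Q[(s, a)] -= eta * (Q[(s, a)] - (r + gamma * best_next))
            s = s_next
    return Q


Q = q_learning()
greedy_policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(GOAL)}
```

After training, the greedy policy read off the table moves right in every non-terminal state, and it was obtained without ever consulting a transition or reward model.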




How do we choose the best action?

How do we pick $a = \pi(s)$ when we have $Q_\pi(s, \cdot)$?

$$a = \pi(s) = \arg\max_{a \in A} Q_\pi(s, a)$$

We do not need P and R, that is, a model of the world! This is why Q-learning is called model-free and why everybody loves it (or at least loved it for many years).


Approximators

Storing Q in LUTs (look-up tables) is a killer for real-world problems, so we need parameterized approximators of the Q-values, i.e. functions:

$$Q : (s, a, \Theta) \mapsto Q(s, a, \Theta) \in \mathbb{R}$$


Algorithm 7: Q-learning (with a parameterized approximator)

    initialize Θ arbitrarily
    const η ∈ (0, ∞): learning rate
    s = s0 (the initial state, whatever that is)
    repeat
        if s is a terminal state then
            s = s0
        end if
        a = choose an action (ε-greedy)
        observe <s, a, s', r>
        loss = (Q(s, a, Θ) − (r + γ max over a' ∈ A of Q(s', a', Θ)))²
        Θ = Θ − η ∇loss
        s = s'
    until Θ converges
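As an illustration of the gradient step, here is a sketch with the simplest possible approximator: one linear weight vector per action, so that Q(s, a, Θ) is a dot product of `theta[a]` with a feature vector of s. Two things are assumptions of this example and not statements from the slides: the linear form itself, and the common practice of treating the target as a constant when differentiating the loss (a semi-gradient).

```python
def q_value(theta, features, a):
    """Linear approximator: Q(s, a, theta) = theta[a] . features(s)."""
    return sum(w * x for w, x in zip(theta[a], features))


def q_learning_step(theta, feats, a, r, feats_next, actions, eta, gamma,
                    terminal=False):
    """One gradient step on the squared error; returns the loss before the step."""
    if terminal:
        target = r
    else:
        target = r + gamma * max(q_value(theta, feats_next, b) for b in actions)
    error = q_value(theta, feats, a) - target
    # d(loss)/d(theta[a]) = 2 * error * feats, with the target held constant
    theta[a] = [w - eta * 2.0 * error * x for w, x in zip(theta[a], feats)]
    return error ** 2


actions = ["left", "right"]
theta = {"left": [0.0, 0.0], "right": [0.0, 0.0]}
loss = q_learning_step(theta, [1.0, 0.0], "right", 1.0, [0.0, 1.0],
                       actions, eta=0.25, gamma=0.9)
```

Starting from zero weights, the target is 1 and the error is -1, so the step moves `theta["right"]` to `[0.5, 0.0]` and the prediction for that state-action pair from 0 to 0.5.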


SARSA: on-policy learning

Algorithm 8: SARSA

    initialize Θ arbitrarily
    const η ∈ (0, ∞): learning rate
    s = s0 (the initial state, whatever that is)
    a = choose an action in s (ε-greedy)
    repeat
        if s is a terminal state then
            s = s0
            a = choose an action in s (ε-greedy)
        end if
        take action a, then choose a' in s' (ε-greedy)
        observe <s, a, s', r, a'>
        loss = (Q(s, a, Θ) − (r + γ Q(s', a', Θ)))²
        Θ = Θ − η ∇loss
        s = s', a = a'
    until Θ converges
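The only difference between the SARSA and Q-learning updates is the target. The sketch below puts the two side by side for a table-based Q; the function names and the example values are hypothetical.

```python
def q_learning_target(Q, r, s_next, actions, gamma):
    """Off-policy target: bootstrap from the best action in s'."""
    return r + gamma * max(Q[(s_next, b)] for b in actions)


def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action a' actually chosen in s'."""
    return r + gamma * Q[(s_next, a_next)]


Q = {("s1", "left"): 1.0, ("s1", "right"): 3.0}
off_policy = q_learning_target(Q, 0.0, "s1", ["left", "right"], gamma=0.5)  # 1.5
on_policy = sarsa_target(Q, 0.0, "s1", "left", gamma=0.5)                   # 0.5
```

When the ε-greedy policy happens to explore (here by picking "left"), SARSA learns the value of that exploratory behaviour, while Q-learning keeps estimating the value of the greedy policy.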


Feature extraction

The state space is often huge and complicated, so people did what is done in other areas of ML, namely feature engineering, and our function turns into:

$$Q : (\mathrm{extract\_features}(s), a, \Theta) \mapsto \mathbb{R}$$
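What a hand-written feature extractor looks like depends entirely on the task. As a purely hypothetical example, assume the state is a tuple `(agent_x, agent_y, goal_x, goal_y)` on a grid; a few engineered features might then be computed like this:

```python
def extract_features(state):
    """Map a raw grid state to a short vector of hand-crafted features."""
    agent_x, agent_y, goal_x, goal_y = state
    dx = goal_x - agent_x
    dy = goal_y - agent_y
    return [
        1.0,                       # bias term
        float(dx),                 # signed horizontal offset to the goal
        float(dy),                 # signed vertical offset to the goal
        float(abs(dx) + abs(dy)),  # Manhattan distance to the goal
    ]


features = extract_features((1, 2, 4, 0))  # [1.0, 3.0, -2.0, 5.0]
```

The approximator then sees four numbers regardless of how large the grid is, which is the whole point of feature engineering.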


Deep Learning

But with the arrival of deep neural networks, things changed a bit!



Atari i DeepMind

Human-level control through deep reinforcement learning



Atari Games



DQN Code


DQN: key ideas

- replay memory
- network freezing
- deep networks
- frame stacking
- frame skipping
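Of these ideas, replay memory is the easiest to show in isolation: transitions are stored in a bounded buffer and training batches are drawn from it at random instead of using only the latest transition. The class below is a minimal sketch using only the Python standard library; the class name, the method names and the tuple layout are assumptions of this example, not the code of DQN itself.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        """Draw batch_size distinct transitions uniformly at random."""
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from past experience breaks the strong correlation between consecutive transitions and lets each transition be reused for several updates.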


What next?

- the Berkeley AI course on edX (lectures also on YouTube): https://courses.edx.org/courses/BerkeleyX/CS188.1x-4/1T2015/course/
- David Silver's lectures (DeepMind): https://www.youtube.com/watch?v=2pWv7GOvuf0
- take a look at OpenAI Gym: https://gym.openai.com/envs

References

[1] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.