Reinforcement Learning - Algorithms · Institute of Computing Science Poznan University of...

Institute of Computing Science, Poznan University of Technology

Reinforcement Learning: Algorithms

Michał Kempka

April 9, 2018

Recap

MDP recap

How do we formalize reinforcement learning?

MDP recap: the formulas

- S: a finite set of states,
- A: a finite set of actions,
- P_a(s, s') = P(s_{t+1} = s' | s_t = s, a_t = a): the transition model, i.e. the probability that taking action a in state s leads to state s',
- R_a(s, s'): the reward granted for moving from state s to state s' by taking action a,
- γ ∈ [0, 1]: the "discount factor", i.e. how far into the future we look.
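The five components above could be held in code roughly as follows. This is only a sketch: the `MDP` container, the `TINY` example and all of its numbers are hypothetical and do not come from the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MDP:
    states: tuple    # S
    actions: tuple   # A
    P: dict          # P[(s, a)] -> {s': probability}
    R: dict          # R[(s, a, s')] -> reward; missing keys mean 0
    gamma: float     # discount factor


# Hypothetical two-state example; every number here is made up.
TINY = MDP(
    states=("left", "right"),
    actions=("stay", "move"),
    P={
        ("left", "stay"): {"left": 1.0},
        ("left", "move"): {"right": 0.9, "left": 0.1},
        ("right", "stay"): {"right": 1.0},
        ("right", "move"): {"left": 0.9, "right": 0.1},
    },
    R={("left", "move", "right"): 1.0},
    gamma=0.9,
)


def is_valid(mdp):
    """Check that each P[(s, a)] is a distribution over S and gamma is in [0, 1]."""
    for s in mdp.states:
        for a in mdp.actions:
            dist = mdp.P[(s, a)]
            if abs(sum(dist.values()) - 1.0) > 1e-9:
                return False
            if any(s2 not in mdp.states for s2 in dist):
                return False
    return 0.0 <= mdp.gamma <= 1.0
```

Storing P as a dictionary of sparse distributions keeps the example readable; a dense array indexed by (s, a, s') would be an equally reasonable choice.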

MDP: the solution

\max \sum_{t=0}^{\infty} \gamma^t R_{a_t}(s_t, s_{t+1})

Analogies to supervised learning:
- actions → classes
- loss → negative reward
- agent → classifier
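The objective on this slide is a discounted sum of rewards. For a finite episode it can be computed as below; the function name `discounted_return` is an illustrative choice, and truncating the infinite sum at the end of the episode is an assumption of this sketch.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite reward list."""
    total = 0.0
    # Work backwards so each step is one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```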

Bellman equation [1]

State Value

V(s) = \max_{a \in A} \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V(s') \right)

Action-State Value (Q-value)

Q(s, a) = \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma \max_{a' \in A} Q(s', a') \right)
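The two forms are linked by a simple identity, written out here as a reading aid. It follows from comparing the two Bellman equations term by term and is not stated on the slide.

```latex
V(s) = \max_{a \in A} Q(s, a),
\qquad
Q(s, a) = \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V(s') \right)
```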

Full Disclosure

I borrowed a bit from Wojtek Jaśkowski's slides for the MiSIO (ISWD) course.

Policy

A policy is a mapping π:

π : S → A

State Value

a = \pi(s) = \arg\max_{a \in A} \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V_\pi(s') \right)

Action-State Value (Q-value) (model-free)

a = \pi(s) = \arg\max_{a \in A} Q_\pi(s, a)

where

Q_\pi(s, a) = \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma \max_{a' \in A} Q_\pi(s', a') \right)

Value Iteration

Algorithm 1 Value Iteration
initialize V[.] arbitrarily; P(s,a,s′) and R(s,a,s′) are the known model
repeat
    Vt−1 = V
    for s ∈ S do
        for a ∈ A do
            q(a) = 0
            for s′ ∈ S do
                q(a) += P(s,a,s′)(R(s,a,s′) + γVt−1(s′))
            end for
        end for
        V(s) = max over a ∈ A of q(a)
    end for
until V(.) converges
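A runnable sketch of value iteration follows. The dictionary layout of `P` and `R`, the stopping tolerance `tol`, and the two-state example are assumptions made for illustration, not part of the lecture.

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-8, max_sweeps=10_000):
    """P[(s, a)] -> {s': prob}; R[(s, a, s')] -> reward (missing keys mean 0)."""
    V = {s: 0.0 for s in states}
    for _ in range(max_sweeps):
        V_prev = dict(V)  # freeze the previous sweep, like V_{t-1} in the slide
        for s in states:
            # Bellman backup: best expected one-step return over all actions.
            V[s] = max(
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V_prev[s2])
                    for s2, p in P[(s, a)].items())
                for a in actions
            )
        # Stop once no state value moved by more than tol.
        if max(abs(V[s] - V_prev[s]) for s in states) < tol:
            break
    return V


# Hypothetical deterministic MDP: moving from "left" to "right" pays 1.
states = ("left", "right")
actions = ("stay", "move")
P = {
    ("left", "stay"): {"left": 1.0},
    ("left", "move"): {"right": 1.0},
    ("right", "stay"): {"right": 1.0},
    ("right", "move"): {"left": 1.0},
}
R = {("left", "move", "right"): 1.0}
V = value_iteration(states, actions, P, R, gamma=0.5)
```

For this toy model the fixed point can be checked by hand: V(left) = 1 + 0.5·V(right) and V(right) = 0.5·V(left), which gives V(left) = 4/3 and V(right) = 2/3.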

Temporal Difference Learning (TD)

Algorithm 2 Temporal Difference Learning
initialize V[.] arbitrarily
const η ∈ (0,∞) - learning rate
s = s0 - whatever that is
repeat
    if s is a terminal state then
        s = s0
    end if
    a = choose an action somehow
    observe < s, a, s′, r >
    Vtarget = r + γV(s′)
    TDerror = V(s) − Vtarget
    V(s) = V(s) − η(TDerror)
    s = s′
until V(.) converges

The three update steps collapse into a single line:

    V(s) = V(s) − η(V(s) − (r + γV(s′)))

What happens if η = 1?
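A small TD(0) sketch makes the question concrete. Algebraically, η = 1 turns the update into V(s) = r + γV(s′), so the old estimate is overwritten by the latest sample. The `td0` function, the environment interface `step(s, a, rng)` and the four-state chain are hypothetical choices made for this example.

```python
import random


def td0(step, start, is_terminal, policy, gamma, eta, episodes, seed=0):
    """Tabular TD(0); step(s, a, rng) -> (s_next, r) is a hypothetical env."""
    rng = random.Random(seed)
    V = {}  # unseen states (including terminal ones) count as 0
    for _ in range(episodes):
        s = start
        while not is_terminal(s):
            a = policy(s, rng)
            s_next, r = step(s, a, rng)
            target = r + gamma * V.get(s_next, 0.0)
            td_error = V.get(s, 0.0) - target
            V[s] = V.get(s, 0.0) - eta * td_error
            s = s_next
    return V


# Hypothetical chain 0 -> 1 -> 2 -> 3 (terminal); entering state 3 pays 1.
def chain_step(s, a, rng):
    s_next = s + 1
    return s_next, (1.0 if s_next == 3 else 0.0)


V_eta1 = td0(chain_step, start=0, is_terminal=lambda s: s == 3,
             policy=lambda s, rng: "right", gamma=0.5, eta=1.0, episodes=5)
V_slow = td0(chain_step, start=0, is_terminal=lambda s: s == 3,
             policy=lambda s, rng: "right", gamma=0.5, eta=0.5, episodes=200)
```

In this deterministic chain η = 1 is harmless and the reward simply propagates back one state per episode. With noisy rewards or transitions the same setting would make V(s) jump to whatever the last sample happened to be.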

Exploration

How do we choose actions?

ε-greedy policy! Start with ε ≈ 1:
- with probability ε take a random action,
- with probability 1 − ε take the best action according to your current policy π(s),
- decrease ε as you wish (unless ε = 0).
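The ε-greedy rule above might look like this in code. Representing "the best action according to the current policy" as an argmax over a dictionary of action values is an assumption, and the linear decay schedule with its default numbers is just one hypothetical way to decrease ε.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """q_values: dict mapping each action to its estimated value in this state."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))       # explore: uniform random action
    return max(q_values, key=q_values.get)      # exploit: highest estimate


def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """Linear decay from start to end, then held at end (never reaching 0)."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```

Keeping `end` above zero mirrors the slide's warning: with ε = 0 the agent stops exploring altogether.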


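But how do we choose the best action?

How do we choose a = π(s) given V_π(s)?

a = \pi(s) = \arg\max_{a \in A} \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V_\pi(s') \right)

We need P and R!

Q-learning to the rescue!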

Algorithm 5 Q-learning
initialize Q[..] arbitrarily
const η ∈ (0,∞) - learning rate
s = s0 - whatever that is
repeat
    if s is a terminal state then
        s = s0
    end if
    a = choose action (ε-greedy)
    observe < s, a, s′, r >
    Q(s,a) = Q(s,a) − η(Q(s,a) − (r + γ max over a′ ∈ A of Q(s′,a′)))
    s = s′
until Q(..) converges
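A tabular Q-learning sketch under stated assumptions: the environment interface `step(s, a, rng)`, the five-state corridor, the random tie-breaking and all hyperparameter values are hypothetical and chosen only to keep the example small.

```python
import random
from collections import defaultdict


def q_learning(step, start, is_terminal, actions, gamma, eta, epsilon,
               episodes, seed=0):
    """Tabular Q-learning; step(s, a, rng) -> (s_next, r) is a hypothetical env."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(s, a)]; unseen pairs default to 0.0
    for _ in range(episodes):
        s = start
        while not is_terminal(s):
            if rng.random() < epsilon:
                a = rng.choice(actions)                      # explore
            else:
                best = max(Q[(s, b)] for b in actions)       # exploit, with
                a = rng.choice([b for b in actions           # ties broken
                                if Q[(s, b)] == best])       # at random
            s_next, r = step(s, a, rng)
            # Terminal states have no future value.
            future = 0.0 if is_terminal(s_next) else max(
                Q[(s_next, b)] for b in actions)
            Q[(s, a)] -= eta * (Q[(s, a)] - (r + gamma * future))
            s = s_next
    return Q


# Hypothetical corridor: states 0..4, start in 2, both ends terminal,
# reward 1 only for stepping into state 4.
def corridor_step(s, a, rng):
    s_next = s + (1 if a == "right" else -1)
    return s_next, (1.0 if s_next == 4 else 0.0)


Q = q_learning(corridor_step, start=2, is_terminal=lambda s: s in (0, 4),
               actions=("left", "right"), gamma=0.9, eta=0.5, epsilon=0.2,
               episodes=500)
```

After training, `Q[(2, "right")]` should exceed `Q[(2, "left")]`, so the greedy policy walks toward the rewarding end of the corridor.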


How do we choose the best action?

How do we choose a = π(s) given Q_π(s, ·)?

a = \pi(s) = \arg\max_{a \in A} Q_\pi(s, a)

We do not need P and R, i.e. a model of the world! That is why Q-learning is called model-free and why everybody loves it (or at least loved it for many years).
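The difference between the two ways of picking an action can be seen side by side. In this sketch `greedy_from_v` needs the model (P and R) while `greedy_from_q` only reads a table; both function names, the data layout and the toy numbers are assumptions made for illustration.

```python
def greedy_from_v(s, actions, V, P, R, gamma):
    """Model-based: needs P[(s, a)] -> {s': prob} and R[(s, a, s')] -> reward."""
    def backup(a):
        return sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                   for s2, p in P[(s, a)].items())
    return max(actions, key=backup)


def greedy_from_q(s, actions, Q):
    """Model-free: only the table Q[(s, a)] is consulted."""
    return max(actions, key=lambda a: Q[(s, a)])


# Hypothetical two-state example with made-up values.
actions = ("stay", "move")
P = {("left", "stay"): {"left": 1.0}, ("left", "move"): {"right": 1.0}}
R = {("left", "move", "right"): 1.0}
V = {"left": 4.0 / 3.0, "right": 2.0 / 3.0}
Q = {("left", "stay"): 2.0 / 3.0, ("left", "move"): 4.0 / 3.0}

a_from_v = greedy_from_v("left", actions, V, P, R, gamma=0.5)
a_from_q = greedy_from_q("left", actions, Q)
```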


Approximators

Storing Q in LUTs (look-up tables) is prohibitive for real-world problems, so we need parameterized approximators of the Q-value, i.e. functions:

Q : (s, a, Θ) → ℝ

Algorithm 7 Q-learning
initialize Θ arbitrarily
const η ∈ (0,∞) - learning rate
s = s0 - whatever that is
repeat
    if s is a terminal state then
        s = s0
    end if
    a = choose action (ε-greedy)
    observe < s, a, s′, r >
    loss = (Q(s,a,Θ) − (r + γ max over a′ ∈ A of Q(s′,a′,Θ)))²
    Θ = Θ − η∇loss
    s = s′
until Θ converges
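A sketch of the same loop with a linear approximator, Q(s, a, Θ) = Θ[a] · φ(s). Two assumptions go beyond the slide: the target r + γ max Q(s′, a′, Θ) is treated as a constant when differentiating (a semi-gradient step), and the one-hot `features` function is a hypothetical stand-in that makes the model equivalent to a table.

```python
import random


def features(s, n_states):
    """One-hot features; a hypothetical stand-in for a real feature extractor."""
    phi = [0.0] * n_states
    phi[s] = 1.0
    return phi


def q_value(theta, phi, a):
    """Linear Q: dot product of the weight row for action a with phi."""
    return sum(w * x for w, x in zip(theta[a], phi))


def train(step, start, is_terminal, n_states, n_actions, gamma, eta,
          epsilon, episodes, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * n_states for _ in range(n_actions)]
    for _ in range(episodes):
        s = start
        while not is_terminal(s):
            phi = features(s, n_states)
            qs = [q_value(theta, phi, b) for b in range(n_actions)]
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                best = max(qs)
                a = rng.choice([b for b in range(n_actions) if qs[b] == best])
            s_next, r = step(s, a, rng)
            if is_terminal(s_next):
                target = r
            else:
                phi_next = features(s_next, n_states)
                target = r + gamma * max(
                    q_value(theta, phi_next, b) for b in range(n_actions))
            # loss = (Q - target)^2 with target held constant (assumption),
            # so d(loss)/d(theta[a][i]) = 2 * (Q - target) * phi[i].
            scale = 2.0 * (qs[a] - target)
            for i, x in enumerate(phi):
                theta[a][i] -= eta * scale * x
            s = s_next
    return theta


# Hypothetical corridor: states 0..4, action 0 = left, action 1 = right.
def corridor_step(s, a, rng):
    s_next = s + (1 if a == 1 else -1)
    return s_next, (1.0 if s_next == 4 else 0.0)


theta = train(corridor_step, start=2, is_terminal=lambda s: s in (0, 4),
              n_states=5, n_actions=2, gamma=0.9, eta=0.25, epsilon=0.2,
              episodes=500)
```

Swapping `features` for something more compact is where the approximator starts to generalize across states; with one-hot inputs it only reproduces the tabular case.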

SARSA - on-policy learning

Algorithm 8 SARSA
initialize Θ arbitrarily
const η ∈ (0,∞) - learning rate
s = s0 - whatever that is
a = choose action (ε-greedy) in s
repeat
    if s is a terminal state then
        s = s0
        a = choose action (ε-greedy) in s
    end if
    observe < s, a, s′, r >
    a′ = choose action (ε-greedy) in s′
    loss = (Q(s,a,Θ) − (r + γQ(s′,a′,Θ)))²
    Θ = Θ − η∇loss
    s = s′, a = a′
until Θ converges
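For comparison, a tabular SARSA sketch, i.e. the special case in which Θ is the table itself. The only change from Q-learning is that the target uses the action a′ that the ε-greedy policy actually picks in s′. The helper `choose`, the environment interface and the corridor are hypothetical.

```python
import random
from collections import defaultdict


def choose(Q, s, actions, epsilon, rng):
    """Epsilon-greedy choice with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(Q[(s, b)] for b in actions)
    return rng.choice([b for b in actions if Q[(s, b)] == best])


def sarsa(step, start, is_terminal, actions, gamma, eta, epsilon,
          episodes, seed=0):
    """Tabular SARSA; step(s, a, rng) -> (s_next, r) is a hypothetical env."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s = start
        a = choose(Q, s, actions, epsilon, rng)
        while not is_terminal(s):
            s_next, r = step(s, a, rng)
            if is_terminal(s_next):
                a_next, q_next = None, 0.0
            else:
                # On-policy: the next action is chosen first, then used
                # both in the target and as the action actually taken.
                a_next = choose(Q, s_next, actions, epsilon, rng)
                q_next = Q[(s_next, a_next)]
            Q[(s, a)] -= eta * (Q[(s, a)] - (r + gamma * q_next))
            s, a = s_next, a_next
    return Q


# Hypothetical corridor: states 0..4, reward 1 for stepping into state 4.
def corridor_step(s, a, rng):
    s_next = s + (1 if a == "right" else -1)
    return s_next, (1.0 if s_next == 4 else 0.0)


Q = sarsa(corridor_step, start=2, is_terminal=lambda s: s in (0, 4),
          actions=("left", "right"), gamma=0.9, eta=0.1, epsilon=0.1,
          episodes=500)
```

Because the target follows the exploring policy, the learned values account for the occasional random step, which is what "on-policy" refers to here.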

Feature extraction

The state space is often huge and complicated, so people did what was done in other areas of ML, namely feature engineering, and our function becomes

Q : (extract_features(s), a, Θ) → ℝ
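As an illustration of hand-crafted features, consider a made-up paddle-and-ball game. The state layout, the three features and the weights below are all hypothetical; the point is only that Q sees `extract_features(s)` rather than the raw state.

```python
def extract_features(state, width=10.0):
    """Hypothetical features: bias, signed offset, absolute offset."""
    dx = (state["ball_x"] - state["paddle_x"]) / width
    return [1.0, dx, abs(dx)]


def q_value(state, action, theta):
    """Linear Q over the extracted features, one weight vector per action."""
    phi = extract_features(state)
    return sum(w * x for w, x in zip(theta[action], phi))


# Made-up weights: moving toward the ball is valued by the signed offset.
theta = {"left": [0.0, -1.0, 0.0], "right": [0.0, 1.0, 0.0]}
state = {"ball_x": 7.0, "paddle_x": 2.0}
best = max(theta, key=lambda a: q_value(state, a, theta))
```

Here the ball is to the right of the paddle, so the offset is 0.5 and `best` comes out as "right".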

Deep Learning

But with the advent of deep neural networks, things changed a bit!

Atari and DeepMind

Human-level control through deep reinforcement learning

Atari Games

DQN Code

DQN: key ideas

- replay memory
- network freezing
- deep networks
- frame stacking
- frame skipping
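Of the ideas listed, replay memory is the easiest to sketch in isolation: transitions are stored in a bounded buffer and training batches are drawn from it at random, which decorrelates consecutive samples. The class below is a minimal illustration with hypothetical names, not the DQN authors' implementation.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop out first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        """Draw batch_size distinct transitions uniformly at random."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for i in range(5):
    memory.push(i, 0, 0.0, i + 1, False)
batch = memory.sample(2)
```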

What next?

- the Berkeley AI course on edX (lectures also on YouTube)
- David Silver's lectures (DeepMind)
- have a look at OpenAI Gym

References

[1] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
