💪🏻

MDP

์ƒ์„ฑ์ผ
2024/07/11 15:16
ํƒœ๊ทธ
๊ฐ•ํ™”ํ•™์Šต
์ž‘์„ฑ์ž

Markov Property

1. Grid world example

Consider a grid world with the following setup.
• State: S = \{(1,1), (1,2), \dots, (4,2), (4,3)\}
• Action: A = \{north, south, east, west\}
• Reward: +1 on reaching (4,3), -1 on reaching (4,2), and a small negative reward c for every other transition
→ Reason for the negative reward: it penalizes meaningless moves
• Agent: movement is noisy; the agent learns on its own through reinforcement learning
• State transition probability
1. If the agent moves into the wall at (2,2), it stays in its current state.
2. The agent moves in the chosen direction with probability 80%, and slips to the left or right of that direction with probability 10% each → this injects randomness into actions.
• Terminal states: (4,2), (4,3)
• Goal: find the policy that maximizes the total sum of rewards
For example, if an episode incurs the step penalty five times before receiving the final +1, the total reward is the sum of the step penalties and the final reward, i.e., 5c + 1.
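The stochastic transition rule above can be sketched as follows (a minimal illustration assuming the 80/10/10 slip model, the wall at (2,2), and a step penalty c; the function and constant names are made up for this example):

```python
import random

# States are (x, y) cells on a 4x3 grid; (2,2) is a wall.
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
ACTIONS = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
# Left/right slip directions relative to the intended heading.
SLIPS = {"north": ("west", "east"), "south": ("east", "west"),
         "east": ("north", "south"), "west": ("south", "north")}

def step(state, action, c=-0.04):
    """One stochastic transition: 80% intended direction, 10% each side."""
    left, right = SLIPS[action]
    actual = random.choices([action, left, right], weights=[0.8, 0.1, 0.1])[0]
    dx, dy = ACTIONS[actual]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt not in STATES:            # wall or off-grid: stay in the current state
        nxt = state
    reward = TERMINALS.get(nxt, c)   # +1 / -1 at the terminals, step penalty c otherwise
    return nxt, reward
```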

2. Actions in grid world

• Deterministic
The agent's next state is completely determined by the current state and action, so once a policy is fixed, only one episode is possible.
• Stochastic
Even if the agent takes the same action in the same state, randomness can produce different outcomes, so a fixed policy can still generate many different episodes. We mostly consider the stochastic grid world.

3. Markov property

• Stochastic process
A stochastic process is a collection of random variables indexed by time. It is broadly divided into discrete-time and continuous-time processes; here t denotes the current state, t+1 the next state, and everything before t the past.
• Markov process
A stochastic process that satisfies the Markov property is called a Markov process. Satisfying the Markov property means that, given the current state S_t = s, the probability of the next state being S_{t+1} = s' is independent of all past states; for this reason it is also called the memoryless property.
P(S_{t+1} = s' \mid S_t = s) = P(S_{t+1} = s' \mid S_0 = s_0, \dots, S_t = s)
When the Markov property holds, previous states no longer need to be recorded, which improves computational efficiency.
• State transition probability
The probability of moving from the current state s to the next state s'. When the Markov property holds, past states do not affect this probability, so it can be written in matrix form.
P(S_{t+1} = s' \mid S_t = s)
• State transition probability matrix
The transition probabilities can be arranged into a matrix P_{ij} = P_{s_i s_j} = p(s_j \mid s_i) = P(S_{t+1} = s_j \mid S_t = s_i). Each row sums to 1, because it collects the probabilities of moving from the current state to every possible next state.
As a result, a Markov process can be written as a tuple (S, P), where S is the set of states and P is the state transition probability matrix.
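A tiny numerical illustration (a hypothetical 3-state chain, not the grid world above) showing that each row of a transition matrix is a probability distribution:

```python
import numpy as np

# Hypothetical 3-state Markov process: row = current state, column = next state.
P = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.0, 0.5, 0.5],
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row sums to 1

# Distribution over states after two steps, starting deterministically in state 0.
mu0 = np.array([1.0, 0.0, 0.0])
print(mu0 @ P @ P)
```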

Markov Decision Process

1. MDP

์œ„์—์„œ ์‚ดํŽด๋ณธ MP๋Š” State์™€ Transition Probability์˜ tuple๋กœ ํ‘œํ˜„ ๊ฐ€๋Šฅํ•˜์˜€์œผ๋‚˜, MDP์—์„œ๋Š” action, reward, discount factor๊ฐ€ ํฌํ•จ๋˜๋ฉฐ, ๋ชจ๋“  state๋Š” Markov property๋ฅผ ๋งŒ์กฑํ•˜๋Š” ํ˜•ํƒœ์ด๋‹ค.
• S: state space
• A: action space
• P: transition probability
P^a_{ss'} = p(s' \mid s, a) = P(S_{t+1} = s' \mid S_t = s, A_t = a)
→ the probability that the next state is s' when action a is taken in the current state s
• R: reward function
It is the basis for choosing actions, and it can be written in several ways depending on when the reward is given.
R^a_{ss'}: the reward obtained when taking an action in the current state and transitioning to the next state
R_s: the reward obtained merely for being in the current state
R^a_s: the reward obtained when taking action a in the current state
• \gamma: the discount factor, a value between 0 and 1
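As a rough sketch, the tuple (S, A, P, R, \gamma) can be held in a small data structure like the one below (the field layout is an assumption for illustration, not a standard API):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = Tuple[int, int]   # e.g. grid coordinates
Action = str              # e.g. "north", "south", "east", "west"

@dataclass
class MDP:
    states: List[State]                                                  # S
    actions: List[Action]                                                # A
    transitions: Dict[Tuple[State, Action], List[Tuple[State, float]]]   # P: (s, a) -> [(s', prob)]
    rewards: Dict[Tuple[State, Action, State], float]                    # R in the R^a_{ss'} form
    gamma: float                                                         # discount factor in [0, 1]
```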

2. Environment model

MDP์—์„œ์˜ ํ™˜๊ฒฝ ๋ชจ๋ธ์€ ์ „์ด ํ™•๋ฅ ์„ ์˜๋ฏธํ•œ๋‹ค. ์ด๋•Œ ์ „์ด ํ™•๋ฅ ์„ ์•„๋Š” ๊ฒฝ์šฐ MDP๋ฅผ ์•ˆ๋‹ค๋ผ๊ณ  ํ‘œํ˜„ํ•˜๋ฉฐ ์ด ๊ฒฝ์šฐ๋ฅผ Model-based๋ผ๊ณ  ํ•œ๋‹ค. Model-based์˜ ๊ฒฝ์šฐ ๋ฐฉ๋Œ€ํ•œ ๊ฒŒ์‚ฐ์„ ํšจ์œจ์ ์œผ๋กœ ์ง„ํ–‰ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ Dynamic programming ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜์—ฌ optimal policy๋ฅผ ์ฐพ๋Š”๋‹ค. MDP๋ฅผ ๋ชจ๋ฅด๋Š” ๊ฒฝ์šฐ ์ฆ‰ ์ „์ด ํ™•๋ฅ ์„ ๋ชจ๋ฅด๋Š” ๊ฒฝ์šฐ๋ฅผ Model-free๋ผ๊ณ  ํ•˜๋ฉฐ ์ด ๊ฒฝ์šฐ Reinforce learning์„ ์‚ฌ์šฉํ•˜์—ฌ optimal policy๋ฅผ์ฐพ๊ฒŒ ๋œ๋‹ค. ๋‹จ, MDP๋ฅผ ๋ชจ๋ธ๋งํ•˜๋Š” ๊ฒฝ์šฐ state, action, reward space๋Š” ์‚ฌ์ „์— ์ •์˜ํ•ด์ค€๋‹ค.

3. Optimal policy in the grid world example

Consider the same grid world as above and look for its optimal policy. The setup is unchanged: states S = \{(1,1), \dots, (4,3)\}, actions \{north, south, east, west\}, reward +1 at (4,3), -1 at (4,2), a small negative reward c per step elsewhere, the noisy 80/10/10 transition rule with the wall at (2,2), terminal states (4,2) and (4,3), and the goal of maximizing the total sum of rewards.
MDP์—์„œ optimal policy๋Š” ฯ€โˆ—\pi_*๋กœ ํ‘œํ˜„ํ•˜๋ฉฐ, ๊ฐ state์—์„œ ์–ด๋–ค action์„ ์„ ํƒํ•  ๊ฒƒ์ธ์ง€์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ๋‹ด๊ณ  ์žˆ๋‹ค. ์ฆ‰, ์œ„์˜ ๊ฒฝ์šฐ 9๊ฐœ์˜ state์—์„œ ์ทจํ•˜๋Š” action์˜ ์ง‘ํ•ฉ์„ 1๊ฐœ์˜ policy๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋•Œ state transition probability์— ์˜ํ•ด์„œ ๊ฐ™์€ action์„ ์„ ํƒํ•˜๋”๋ผ๋„ ์ „์ด๋˜๋Š” state๊ฐ€ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ํ•˜๋‚˜์˜ policy์— ์—ฌ๋Ÿฌ๊ฐœ์˜ episode๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ๊ณ  ๋”ฐ๋ผ์„œ reward์˜ ๋‹จ์ˆœ ํ•ฉ์ด ์•„๋‹Œ ํ•ฉ์˜ ๊ธฐ๋Œ“๊ฐ’์„ ์ตœ๋Œ€ํ™” ํ•˜๋Š” ๊ฒƒ์„ optimal policy ฯ€โˆ—\pi_*๋กœ ๋‚˜ํƒ€๋‚ธ๋‹ค.
The figure above compares the optimal policies of models that differ only in the small negative reward c; as c changes, the optimal policy changes. When c is close to zero, the penalty for each move is small, so in case 1 the optimal action at (3,2) is west. This completely rules out the possibility of landing on the -1 state: the agent chooses west and hopes the transition noise pushes it north. For the same reason the optimal action at (4,1) is south. In case 2, c is smaller than in case 1, so the penalty for taking extra steps grows and the optimal actions at (3,2) and (4,1) change. Finally, in case 4 the per-step penalty is -2, which is so large that the optimal policy tries to terminate as quickly as possible. In particular, east is treated as optimal at (3,2): if the agent went north to aim for the +1, it would earn -2 + 1 = -1 even when it reaches the goal, and if the transition noise first sent it to (3,3) and then slipped it west despite choosing east, the return would be far lower, so the probabilistically safest action, east, is optimal. To summarize, choosing east at (3,2) yields a return of -1 with probability 0.8, and in the remaining cases, even when the agent ends up reaching +1, the return is at most -1. If north were chosen instead, the best achievable return is also -1, but it is obtained only with probability 0.8 × 0.8 = 0.64, so the optimal action at (3,2) is east.

Reward, Policy

1. Reward

A reward is a scalar feedback signal that indicates how appropriate the agent's action at time step t was, so the agent's goal is to maximize the cumulative sum of rewards. Reinforcement learning is built on the reward hypothesis.
• Reward Hypothesis
→ Every goal can be expressed as maximizing the expected value of the cumulative sum of rewards.
The expectation is needed because, under a single policy, the transition probabilities can produce many different episodes. The timing of rewards can also vary: in Go, for example, the reward is given only after the game ends, whereas in table tennis a reward can be given for every point.

2. Known dynamics

If the dynamics p(s', r \mid s, a) are known for every transition, quantities such as the transition probability and the expected reward can be computed from them; the corresponding identities are written out after the list below.
• Transition probability
→ The transition probability can be written as a sum, over the rewards obtainable when moving from state s to state s' under action a, of the dynamics terms.
• Expected reward (state-action pair)
→ Expand the expectation by its definition. p(r \mid s, a) is the probability of receiving reward r when taking action a in state s, so the expected reward becomes a sum over rewards r and over all reachable next states s', again expressed through the dynamics.
• Expected reward (state-action-next state)
→ Expand the expectation by its definition, rewrite p(r \mid s') using the definition of conditional probability, and add the (s, a) conditioning; the dynamics again appear in the resulting expression.
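Concretely, the standard identities that follow from the four-argument dynamics p(s', r \mid s, a) are:
p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a)
r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{r} r \sum_{s'} p(s', r \mid s, a)
r(s, a, s') = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] = \sum_{r} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}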

3. Return

→ The return is the discounted cumulative sum of rewards from time step t until termination (see the formula below). The discount factor \gamma is a real number between 0 and 1; it expresses the uncertainty of future rewards, and its size determines whether the model is short-sighted (small \gamma) or far-sighted (large \gamma). The discount factor is used to keep the return from diverging to infinity, to express uncertainty about the future, and to make the agent care a bit more about immediate rewards in practical settings. However, when every sequence is guaranteed to terminate, it is sometimes omitted.
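Written out, with R_{t+1} denoting the reward received after time step t:
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}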

4. Policy

• Stochastic policy
→ A probability distribution over the actions available in a given state:
\pi(a \mid s) = P(A_t = a \mid S_t = s)
• Deterministic policy
\pi(s) = a
A policy is a guideline for which action to take in each state, i.e., for choosing actions that maximize the total discounted reward. Because every state in an MDP satisfies the Markov property, the policy depends only on the current state.
• Case 1) Known MDP
→ A deterministic optimal policy \pi_*(s) exists.
• Case 2) Unknown MDP
→ A stochastic policy must be considered, and the \epsilon-greedy method is used.
The \epsilon-greedy method selects a random action with probability \epsilon and the greedy (currently optimal) action with probability 1-\epsilon, so that actions the agent has not yet experienced are still considered even if they do not look optimal in the current sample. The probability \epsilon is split evenly across all available actions; a minimal sketch follows.
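A minimal sketch of \epsilon-greedy action selection over tabular action-value estimates (the `Q` dictionary and the argument names are illustrative):

```python
import random
from typing import Dict, Hashable, List

def epsilon_greedy(Q: Dict[Hashable, float], actions: List[Hashable], epsilon: float) -> Hashable:
    """With probability epsilon explore uniformly at random, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)                  # exploration, spread evenly over actions
    return max(actions, key=lambda a: Q.get(a, 0.0))   # exploitation of the current estimate
```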

5. Notation

v_\pi(s): state-value function → the expected value of the return obtained by following policy \pi from the current state
v_*(s): optimal state-value function → the value of the current state when the optimal policy \pi_* is followed
q_\pi(s,a): action-value function → the value of taking action a in the current state and then following policy \pi
q_*(s,a): optimal action-value function → the value of taking action a in the current state and then following the optimal policy \pi_*
In each case, the value means the expected return.
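In symbols, these are the standard definitions:
v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s], \qquad q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]
v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s,a) = \max_\pi q_\pi(s,a)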

Partially Observable Markov Decision Process (POMDP)

1. POMDP tuple: (S, A, O, T, Z, R)

• State, action & reward
→ Same as in the MDP.
• State transition probability
→ Same as in the MDP; the state transition probability is written T(s, a, s') = P(s' \mid s, a).
• Observation
→ After the agent takes an action and transitions to the next state, it receives an observation that gives a clue about the current state, written Z(s', a, o) = P(o \mid s', a).
• Observation probability
→ The probability of observing each element of the observation set.
• Summary: in an MDP the state is fully observable, whereas in a POMDP it is only partially observable. The agent does not have complete state information and must infer the state indirectly from observations.

2. Planning

Because the state is only partially observable, the agent maintains a belief b, a probability distribution over states. The belief b is the probability distribution the agent holds over the states, i.e., how likely each state is to be the true current state. The updated belief b'(s') is computed from the previous belief and the new observation.
1) Initial state
→ The agent starts with an initialized belief state.
2) Action selection
→ The agent selects an action based on its current belief,
→ i.e., based on its probabilistic estimate of the current state.
3) State transition
→ The agent transitions to the next state s_{t+1}.
4) Get observation
→ In the new state s_{t+1}, the agent obtains the observation o_{t+1}.
5) Belief state update
b'(s') = \frac{P(o \mid s', a)\sum_{s\in S} P(s' \mid s, a)\, b(s)}{P(o \mid a, b)}
→ This expression follows from Bayes' rule; a derivation sketch is given after this list.
6) Get reward
→ The agent obtains a reward through the state transition and the observation.
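A sketch of the Bayes-rule derivation behind the update in step 5, writing b'(s') = P(s' \mid o, a, b):
b'(s') = P(s' \mid o, a, b) = \frac{P(o \mid s', a, b)\, P(s' \mid a, b)}{P(o \mid a, b)} = \frac{P(o \mid s', a) \sum_{s \in S} P(s' \mid s, a)\, b(s)}{P(o \mid a, b)}
where the denominator is the normalizer P(o \mid a, b) = \sum_{s'} P(o \mid s', a) \sum_{s} P(s' \mid s, a)\, b(s).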

3. Limitations

MDP์—์„œ ํ˜„์‹ค์ ์œผ๋กœ ๋ชจ๋“  state spcae๋ฅผ ์•Œ ์ˆ˜ ์—†๊ธฐ ๋•Œ๋ฌธ์—, POMDP๋ฅผ ๋„์ž…ํ•œ ๊ฒƒ์ธ๋ฐ ๊ทธ๋Ÿผ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  POMDP ์‹ค์ œ ์‚ฌ์šฉ์— ์–ด๋ ค์›€์ด ์žˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค๋ฉด Large State Space, Long Planning Horizon, Large Observation Space, Large Action Space๊ฐ€ ์žˆ๋‹ค.
• Long planning horizon: when the agent must plan over many steps to reach its goal, the number of potential outcomes at each step grows exponentially.
→ As the update formula above shows, the observation space and the belief probabilities must be recomputed continually, so an enormous amount of computation arises every time the belief over the state space changes.

4. Solutions

Sampling-based approximation
• PBVI (Point-Based Value Iteration): samples a representative set of beliefs and computes the value function only at those sampled beliefs, reducing the complexity of the problem.
• HSVI (Heuristic Search Value Iteration): uses heuristics to prune the search space and find a good policy quickly.

Intrinsic Motivation and Intrinsic Rewards

In reinforcement learning, rewards are often very sparse.
"Sparse reward" means that the agent receives rewards very infrequently during learning.
For example, suppose Mario takes various actions in order to obtain some reward: if no reward arrives during the given 177 seconds, learning takes far too long. With sparse rewards the agent cannot get adequate feedback and may not learn at all. Methods for dealing with the sparse-reward problem include \epsilon-greedy, UCB, and HER.

1. UCB

• Among the available actions, select the one with the highest upper confidence bound.
In the accompanying figure, where actions closer to the dotted line are more likely to yield a high reward, the safe choice is action 2, because it can keep delivering high rewards.
The selection rule takes the action value Q_t(a) and adds the upper-bound term on the right (see the formula after this section).
Q_t(a): the average reward of action a up to time step t
N_t(a): the number of times action a has been selected up to time step t
If an action has been selected only a few times, the bonus term encourages the agent to try it. So even when there are many actions, none of them goes completely untried at the start, and later the exploration becomes more efficient, so the agent discovers useful rewards more often; this indirectly mitigates the sparse-reward problem.
In short, the algorithm is built to select actions whose average reward is high and which have been selected only a few times.
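The standard UCB1 selection rule, which matches the description above (c is an exploration coefficient):
A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]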

2. HER (Hindsight Experience Replay)

1) When storing each transition from an episode, store it not only with the original goal but also with other goals.
2) For each episode, additionally store transitions whose goal is the final state of that episode.
3) The resulting change in rewards increases learning efficiency.

Algorithm steps:

1. Initialization
• Sample a goal g and an initial state s_0.
2. Repeated state transitions
• Use the behavior policy \pi_b(s_t \| g) to sample an action a_t from the current state s_t and goal g.
• Execute a_t and observe the new state s_{t+1}.
3. After the episode ends, store the state transitions
• Store the basic transitions.
• Store additional transitions for the extra goals.
◦ Sample a set of additional goals G := S(\text{current episode}) based on the states of the current episode.
4. Optimization loop
• Sample a mini-batch B from the replay buffer R.
• Perform one optimization step on mini-batch B with algorithm A.
→ To summarize, the agent stores every transition obtained from interacting with the environment, together with the original goal, in the replay buffer, and at the same time stores additional transitions in which the final state reached in the current episode is set as a new goal; afterwards it samples from the replay buffer and runs the optimization on those samples.
Along with sampling transitions from the replay buffer, additional goals g' are sampled according to a strategy S.
Although the original goal g exists, a trajectory that fails to reach g yields an almost always near-zero reward r_g, so a new non-zero reward signal is computed with respect to the additional goal g'.
The agent then learns from the samples in which the reward is replaced by the non-zero signal r_{g'} and the input is replaced by s \| g'. Through this process the policy can improve even from trajectories that failed to reach the original goal g.
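A minimal sketch of the goal-relabeling idea using the final-state strategy (the transition format, the `compute_reward` convention, and the function names are assumptions for illustration):

```python
from typing import List, Tuple

# A transition is (state, action, next_state, goal).
Transition = Tuple[object, object, object, object]

def compute_reward(next_state, goal) -> float:
    """Sparse goal-conditioned reward: non-zero only when the goal is reached (illustrative convention)."""
    return 1.0 if next_state == goal else 0.0

def her_relabel(episode: List[Transition], replay_buffer: list) -> None:
    """Store each transition with the original goal and again with the episode's final state as the goal."""
    final_state = episode[-1][2]                      # last achieved state of the episode
    for state, action, next_state, goal in episode:
        # Original-goal transition: near-zero reward on failed trajectories.
        replay_buffer.append((state, action, next_state, goal,
                              compute_reward(next_state, goal)))
        # Hindsight transition: pretend the final state was the goal all along,
        # which produces a non-zero reward signal for at least the last step.
        replay_buffer.append((state, action, next_state, final_state,
                              compute_reward(next_state, final_state)))
```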