
POMDP

์ƒ์„ฑ์ผ
2025/04/02 13:10
ํƒœ๊ทธ
๊ฐ•ํ™”ํ•™์Šต
์ž‘์„ฑ์ž

Partially Observable Markov Decision Process (POMDP)

1. The POMDP tuple (S, A, O, T, Z, R)

• State & Action & Reward
→ Same as in an MDP.
• State Transition Probability
→ Same as in an MDP; the state transition probability is written as T(s,a,s') = P(s'|s,a).
• Observation
→ After the agent selects an action and transitions to the next state, it obtains a clue about the hidden current state through an observation. This is written as Z(s',a,o) = P(o|s',a).
• Observation Probability
→ The probability of observing each element of the observation set.
• Summary: in an MDP the state is fully observable, whereas in a POMDP it is only partially observable. The agent never has complete state information and must infer the state indirectly from observations.
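
As a concrete illustration of the tuple above, here is a minimal sketch (not from the original note) of a two-state toy POMDP encoded as numpy arrays; the state/observation indices and all probability values are invented for illustration.

```python
import numpy as np

# Hypothetical toy POMDP. Indices: states {0, 1}, actions {0: "stay", 1: "switch"},
# observations {0, 1} produced by a noisy sensor of the hidden state.
S, A, O = 2, 2, 2

# T[s, a, s'] = P(s'|s,a): state transition probabilities.
T = np.zeros((S, A, S))
T[:, 0, :] = np.eye(S)          # "stay" keeps the current state
T[0, 1, :] = [0.1, 0.9]         # "switch" usually flips the state
T[1, 1, :] = [0.9, 0.1]

# Z[s', a, o] = P(o|s',a): observation probabilities.
Z = np.zeros((S, A, O))
Z[0, :, :] = [0.8, 0.2]         # in state 0 the sensor mostly reports observation 0
Z[1, :, :] = [0.2, 0.8]         # in state 1 it mostly reports observation 1

# R[s, a]: immediate reward for taking action a in state s.
R = np.array([[0.0, -1.0],
              [1.0, -1.0]])

# Sanity checks: every conditional distribution sums to 1.
assert np.allclose(T.sum(axis=2), 1.0)
assert np.allclose(Z.sum(axis=2), 1.0)
```

The same arrays are reused in the PBVI sketch further below.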

2. Planning

Because the state is only partially observable, the agent maintains a belief b, a probability distribution over states. The belief b is "the probability distribution the agent holds over the states", i.e., how likely each state is to be the current state. The updated belief b'(s') is computed from the previous belief and the new observation.
1) Initial state
→ The agent starts with an initial belief.
2) Action selection
→ The agent selects an action based on its current belief,
→ i.e., based on its probabilistic estimate of the current state.
3) State transition
→ The agent transitions to the next state s_{t+1}.
4) Get observation
→ In the new state s_{t+1}, the agent receives observation o_{t+1}.
5) Belief state update
b'(s') = \frac{P(o|s',a)\sum_{s\in S} P(s'|s,a)b(s)}{P(o|a,b)}
→ This formula follows from Bayes' rule; the derivation is sketched right after this list.
6) Get reward
→ The agent obtains a reward through the state transition and the observation.
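
The derivation referenced in step 5, using only the quantities defined above: apply Bayes' rule to the conditional belief, then use the fact that the observation depends only on the next state and the action, and expand P(s'|a,b) over the previous belief.

b'(s') = P(s'|o,a,b) = \frac{P(o|s',a,b)P(s'|a,b)}{P(o|a,b)} = \frac{P(o|s',a)\sum_{s\in S} P(s'|s,a)b(s)}{P(o|a,b)}

where the normalizer is the probability of the observation under the previous belief:

P(o|a,b) = \sum_{s'\in S} P(o|s',a)\sum_{s\in S} P(s'|s,a)b(s)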

3. Limitations

POMDPs were introduced because, in practice, an agent cannot know the full state of an MDP; even so, POMDPs are hard to use in practice. Typical obstacles are a large state space, a long planning horizon, a large observation space, and a large action space.
• Long Planning Horizon: when the agent must plan over many steps to reach its goal, the number of potential outcomes at each step grows exponentially.
→ As the belief-update formula above shows, the observation probabilities and the belief must be recomputed continually, so an enormous number of operations is incurred every time the state changes; a rough count is given below.
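
As a rough, illustrative count (the numbers are assumptions, not from the note): each planning step branches over every action and every possible observation, so the number of distinct beliefs reachable after d steps can grow as (|A|\cdot|O|)^d. For example, with |A| = 4, |O| = 10 and a horizon of d = 5, that is already 40^5 \approx 1.0\times10^8 candidate beliefs, each requiring its own belief update.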

4. Solutions

Sampling-based approximation
• PBVI (Point-Based Value Iteration): samples a representative set of beliefs and computes the value function only at those sampled beliefs, reducing the problem's complexity (a sketch of the point-based backup follows this list).
• HSVI (Heuristic Search Value Iteration): uses heuristic search to shrink the search space and quickly find a good policy.
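
Here is a compressed sketch of the point-based backup at the core of PBVI, a simplified reading of the standard algorithm rather than code from the note; the discount factor, the sampled belief set, and the initial alpha-vector set are all illustrative assumptions, and the toy T, Z, R arrays are re-declared so the block runs on its own.

```python
import numpy as np

def pbvi_backup(B, Gamma, T, Z, R, gamma=0.95):
    """One point-based backup: returns one new alpha-vector per sampled belief.

    B     : (n_beliefs, S) array of sampled beliefs
    Gamma : (n_alpha, S) array of current alpha-vectors (the value function)
    T     : (S, A, S) transition probabilities  P(s'|s,a)
    Z     : (S, A, O) observation probabilities P(o|s',a)
    R     : (S, A)    immediate rewards
    """
    S, A, O = T.shape[0], T.shape[1], Z.shape[2]
    new_Gamma = []
    for b in B:
        best_value, best_alpha = -np.inf, None
        for a in range(A):
            # Start from the immediate reward of action a.
            alpha_a = R[:, a].astype(float)
            for o in range(O):
                # cand[i, s] = gamma * sum_{s'} T[s,a,s'] * Z[s',a,o] * Gamma[i, s']
                cand = gamma * Gamma @ (T[:, a, :] * Z[:, a, o]).T
                # Keep only the candidate that is best at this particular belief.
                alpha_a = alpha_a + cand[np.argmax(cand @ b)]
            value = alpha_a @ b
            if value > best_value:
                best_value, best_alpha = value, alpha_a
        new_Gamma.append(best_alpha)
    return np.array(new_Gamma)

# Toy usage with the two-state arrays from the earlier sketch.
T = np.zeros((2, 2, 2)); T[:, 0, :] = np.eye(2); T[0, 1, :] = [0.1, 0.9]; T[1, 1, :] = [0.9, 0.1]
Z = np.zeros((2, 2, 2)); Z[0, :, :] = [0.8, 0.2]; Z[1, :, :] = [0.2, 0.8]
R = np.array([[0.0, -1.0], [1.0, -1.0]])
B = np.array([[0.5, 0.5], [0.9, 0.1]])
Gamma = np.zeros((1, 2))
for _ in range(20):
    Gamma = pbvi_backup(B, Gamma, T, Z, R)
print(Gamma)   # one alpha-vector per sampled belief
```

Because the value function is only ever evaluated at the sampled beliefs in B, the number of alpha-vectors stays bounded by |B| instead of growing with every backup.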

Intrinsic Motivation and Intrinsic Rewards

In reinforcement learning, rewards are often very sparse.
"Sparse reward" means that the agent receives a reward only very rarely during training.
For example, suppose Mario takes all sorts of actions to earn some reward: if he receives no reward at all during the given 177 seconds, training takes far too long. With sparse rewards the agent cannot obtain an adequate reward signal, and learning does not progress at all. Methods such as ε-greedy, UCB, and HER exist to address this sparse-reward problem.

1. UCB

• Among the available actions, select the one with the highest upper bound.
Assuming that options closer to the dotted line are more likely to yield a high reward, the safe choice is option 2, because it can keep delivering a high reward.
The selection rule takes the action value Q_t(a) and adds to it the right-hand term, an upper bound, then picks the action that maximizes the sum.
Q_t(a): the average reward of action a at time step t
N_t(a): the number of times action a has been selected up to time step t
If an action has been selected only a few times, the bonus term pushes the agent to try it. Even when there are many actions, this keeps any of them from going completely unselected at the start, and later the exploration becomes efficient, so the agent discovers useful rewards more often and the sparse-reward problem is alleviated indirectly.
To summarize, the algorithm is built to favor actions whose average reward is high and whose selection count is low; a minimal sketch of the rule is given below.
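
A minimal sketch of this rule in the common UCB1 form (the exploration constant c, the bandit setup, and all numbers are assumptions for illustration, not from the note):

```python
import math
import random

def ucb1_select(q_values, counts, t, c=2.0):
    """Pick the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a))."""
    # Any never-tried action gets priority, so no action stays unselected.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(q_values, counts)]
    return scores.index(max(scores))

# Toy usage: a made-up 3-armed Bernoulli bandit with hidden success rates.
true_means = [0.2, 0.5, 0.8]
q, n = [0.0] * 3, [0] * 3
for t in range(1, 1001):
    a = ucb1_select(q, n, t)
    r = 1.0 if random.random() < true_means[a] else 0.0
    n[a] += 1
    q[a] += (r - q[a]) / n[a]        # incremental average reward
print(q, n)                          # the best arm should dominate the counts
```

The bonus term shrinks as N_t(a) grows, which is exactly the "high average reward or rarely tried" trade-off described above.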

2. HER (Hindsight Experience Replay)

1) When storing each transition of an episode, store it not only with the original goal but also with other goals.
2) For each episode, additionally store transitions whose goal is the final state reached in that episode.
3) The resulting change in rewards increases learning efficiency.

Algorithm steps:

1. Initialization
• Sample a goal g and an initial state s_0.
2. Repeat state transitions
• Using the behavior policy \pi_b(s_t∥g), sample an action a_t from the current state s_t and the goal g.
• Execute action a_t and observe the new state s_{t+1}.
3. After the episode ends, store the state transitions
• Store the standard transitions.
• Store additional transitions for extra goals.
◦ Sample a set of additional goals G := S(current episode) from the states of the current episode.
4. Repeat optimization
• Sample a mini-batch B from the replay buffer R.
• Perform one optimization step on mini-batch B using algorithm A.
→ To summarize: the agent stores every transition obtained by interacting with the environment, together with the original goal, in the replay buffer; at the same time it sets the final state reached in the current episode as a new goal and stores the corresponding additional transitions; afterwards it samples from the replay buffer and runs optimization on those samples.
Alongside sampling transitions from the replay buffer, an additional goal g' is sampled according to a strategy S.
There is still an original goal g, but for a trajectory that never reaches g the reward r_g will always be near zero, so the point is to compute a fresh non-zero reward signal with respect to the additional goal g'.
The agent then learns from the non-zero reward signal r_g' produced for the additional goal g' and from the samples relabeled as s∥g'. Through this process the policy can improve even from trajectories that failed to reach the original goal g; a minimal relabeling sketch is given below.
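
A minimal sketch of the relabeling described above, using the "final state as goal" strategy; the transition format, the sparse reward function, and every name here are assumptions for illustration (any off-policy learner could then train on the resulting buffer):

```python
import numpy as np

def sparse_reward(state, goal, eps=1e-3):
    """Made-up sparse reward: 1 when the goal is reached, 0 (near-zero case) otherwise."""
    return 1.0 if np.linalg.norm(np.asarray(state) - np.asarray(goal)) < eps else 0.0

def her_store_episode(replay_buffer, episode, original_goal):
    """Store an episode twice: with the original goal g, and relabeled with
    the episode's final state as the hindsight goal g'.

    episode: list of (s, a, s_next) tuples collected by the behavior policy.
    Each stored transition is (s, a, r, s_next, goal); the learner later
    conditions on the concatenation s∥goal.
    """
    # 1) Standard transitions with the original goal g
    #    (the reward stays near zero for a failed trajectory).
    for s, a, s_next in episode:
        replay_buffer.append((s, a, sparse_reward(s_next, original_goal), s_next, original_goal))

    # 2) Hindsight transitions: the state actually reached becomes the goal g',
    #    so the final step now produces a non-zero reward signal.
    hindsight_goal = episode[-1][2]
    for s, a, s_next in episode:
        replay_buffer.append((s, a, sparse_reward(s_next, hindsight_goal), s_next, hindsight_goal))

# Toy usage: a 1-D episode that never reaches the original goal at 10.0,
# yet still yields a rewarded transition for the hindsight goal g' = 3.0.
buffer = []
her_store_episode(buffer, [(0.0, +1, 1.0), (1.0, +1, 2.0), (2.0, +1, 3.0)], original_goal=10.0)
print(len(buffer))   # 6 transitions: 3 with g = 10.0, 3 with g' = 3.0
```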