
Bellman Equation

์ƒ์„ฑ์ผ
2024/07/11 15:16
ํƒœ๊ทธ
๊ฐ•ํ™”ํ•™์Šต
์ž‘์„ฑ์ž


1. Value functions

The value function evaluates each state s or state-action pair (s,a) in terms of the expected return, under the assumption that the process proceeds from that point on by following policy \pi. In short, it is a function that indicates how much reward the current state s, or the pair (s,a), can be expected to collect in the future.
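For reference, the return G_t used throughout is the discounted sum of future rewards (the standard definition; it is not written out in this note):

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}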

2. State-value function

• v_\pi(s) = E_\pi[G_t|S_t = s]
→ This function represents the value of the current state s: the expected return when starting from state s and following policy \pi. If the expected return is large, the average reward to be received from that state onward is large, so the state can be judged to have high value.

3. Action-value function

• q_\pi(s,a) = E_\pi[G_t|S_t = s, A_t = a]
→ This function represents the value of taking action a in the current state s: the expected return when action a is chosen in state s and policy \pi is followed thereafter. If the action-value function is known, it is possible to judge which action is best, which makes it useful for choosing a policy; with the state-value function alone it is difficult to determine the policy \pi. On the other hand, whereas the state-value function gives a single value when evaluating a state, the action-value function must be evaluated for every action, so it has the drawback of requiring more computation.
→ The state-value function can be expressed through the action-value function as v_\pi(s) = \sum_a \pi(a|s) q_\pi(s,a). Since the value of the current state s must account for every action available in that state, it can be written in terms of the probability of each action being selected and the corresponding q-values, as illustrated in the numerical sketch below.
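A minimal numerical sketch of this relation, with made-up numbers (not from the original note), for a single state with three available actions:

import numpy as np

# Hypothetical values for one state s with 3 actions.
# q[a]  : action values q_pi(s, a)
# pi[a] : probability pi(a|s) of selecting each action in state s
q = np.array([1.0, 3.0, 0.5])
pi = np.array([0.2, 0.5, 0.3])

# v_pi(s) = sum_a pi(a|s) * q_pi(s, a)
v = np.dot(pi, q)
print(v)  # approximately 1.85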

4. Advantage function

• A_\pi(s,a) = q_\pi(s,a) - v_\pi(s)
→ The advantage function for action a in state s is the difference between the two value functions; it indicates whether the action is worth more or less than the average value of the current state.
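One consequence worth noting (it follows directly from the definitions above, though it is not stated in the original note): averaging the advantage over the policy's own action distribution gives zero, so a positive advantage means the action is better than what \pi does on average in that state.

\sum_a \pi(a|s) A_\pi(s,a) = \sum_a \pi(a|s) q_\pi(s,a) - v_\pi(s) = v_\pi(s) - v_\pi(s) = 0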

5. Dynamic Programming vs RL

• In DP the MDP is known (model-based), so the policy \pi and the transition probability P are both available. The expectations can therefore be computed directly, and the state-value function, which requires less computation, is mainly used.
• In RL the MDP is unknown (model-free). Because the transition probability is unknown, the expectations cannot be computed exactly, so estimates are computed from samples, as in the sketch below.
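A minimal sketch of this difference with hypothetical numbers (not from the original note): when the transition probabilities are known, the expectation can be computed exactly, whereas a model-free agent can only average over sampled transitions.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step problem: from state s, an action leads to one of
# three next states with probabilities P and one-step rewards R.
P = np.array([0.5, 0.3, 0.2])  # known only in the model-based (DP) setting
R = np.array([1.0, 0.0, 5.0])

# Model-based (DP): compute the expected reward exactly.
exact = np.dot(P, R)  # 0.5*1.0 + 0.3*0.0 + 0.2*5.0 = 1.5

# Model-free (RL): estimate the same expectation from sampled transitions only.
samples = rng.choice(R, size=10_000, p=P)
estimate = samples.mean()  # approaches 1.5 as the number of samples grows

print(exact, estimate)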

6. Bellman expectation equation

• Various expressions and expansions of the state-value function
• Various expressions and expansions of the action-value function
• Summary of the Bellman expectation equation (the standard forms are given below for reference)
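The derivations referenced above are not reproduced in this export. For reference, the standard forms of the Bellman expectation equations, writing r(s,a) for the expected one-step reward and P(s'|s,a) for the transition probability, are:

v_\pi(s) = \sum_a \pi(a|s) \left( r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, v_\pi(s') \right)
q_\pi(s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a) \sum_{a'} \pi(a'|s')\, q_\pi(s',a')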

7. Optimal value function and Policy

• Optimal state-value function
v_*(s) = \max_\pi v_\pi(s)
→ The state-value function obtained when applying the policy that, among all policies, maximizes the state-value function.
• Optimal action-value function
q_*(s,a) = \max_\pi q_\pi(s,a)
→ The action-value function obtained when applying the policy that, among all policies, maximizes the action-value function.
: In the end, each denotes the value function obtained by following an optimal policy.
If the policy that maximizes the value of each state differed from state to state, it would be impossible to choose a single policy. In an MDP, however, 1) there exists at least one optimal policy \pi_* with \pi_* \geq \pi for every policy \pi; 2) the state-value function under an optimal policy always equals the optimal state-value function; 3) the action-value function under an optimal policy always equals the optimal action-value function.
Therefore, the optimal value function equals the value function obtained by following an optimal policy, and since this holds against every other policy, the problem of different states or state-action pairs having conflicting optimal value functions does not arise.
• Optimal Policy
An optimal policy can be found in the process of maximizing q_*(s,a), and can therefore be defined as follows:
\pi_*(a|s) = \begin{cases} 1 & \text{if } a = \arg\max\limits_a q_*(s, a) \\ 0 & \text{otherwise} \end{cases}
→ The maximization here must be over the optimal action-value function: simply picking, for each state, the action that maximizes the action-value function of an arbitrary policy may not yield an optimal policy. For each state there are as many optimal action values q_*(s,a) as there are available actions, and each one is the value of taking that action in the state and then following the optimal policy afterwards, which is exactly what must be compared.
→ In an MDP there always exists a deterministic optimal policy, meaning that the action to take in any given state is uniquely determined.
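Substituting this deterministic policy into the relation v_\pi(s) = \sum_a \pi(a|s) q_\pi(s,a) from section 3 gives the identity connecting the two optimal value functions (a standard result, added here for continuity):

v_*(s) = \max_a q_*(s,a)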

8. Bellman optimality equation

• Summary of the Bellman optimality equation (the standard forms are given below for reference)
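The summary itself is not reproduced in this export. For reference, the standard forms of the Bellman optimality equations, in the same notation as the expectation equations above, are:

v_*(s) = \max_a \left( r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, v_*(s') \right)
q_*(s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} q_*(s',a')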