
Value/Policy iteration

์ƒ์„ฑ์ผ
2024/07/08 08:28
ํƒœ๊ทธ
๊ฐ•ํ™”ํ•™์Šต
์ž‘์„ฑ์ž

Dynamic Programming

1. Dynamic Programming

• Substructure: decompose the original problem into smaller subproblems.
• Table Structure: after each subproblem is solved, store the computed result in a table so that it can be reused.
• Bottom-up Computation: use the table to combine the solutions of smaller subproblems into solutions of larger ones, eventually arriving at the solution of the original problem.
→ Complex problems are solved by combining these three ideas.
• Cases where applying DP is useful
1. Optimal substructure → the optimal solution of the original problem can be obtained by combining the optimal solutions of its subproblems.
2. Overlapping sub-problems → the solution of the same subproblem is needed repeatedly, so storing it in a table and reusing it is more efficient than recomputing it.
⇒ By the Bellman equation, an MDP satisfies both requirements, recursive decomposition and reuse of the value function, so when the MDP is known, applying DP solves the MDP.
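As a quick, generic sketch of these three ingredients (not part of the original note), a bottom-up Fibonacci computation stores each subproblem's answer in a table and combines the stored answers into the answer of the original problem:

```python
# Minimal bottom-up DP sketch (illustrative only).
# Substructure:          fib(n) = fib(n-1) + fib(n-2)
# Table structure:       table[i] stores the solved subproblem fib(i)
# Bottom-up computation: fill the table from small subproblems up to n.
def fib(n: int) -> int:
    if n < 2:
        return n
    table = [0] * (n + 1)
    table[1] = 1
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]  # reuse stored subproblem results
    return table[n]

print(fib(10))  # 55
```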

2. Sequential Decision Problem

• Model-based: Planning
Since the MDP is known, the problem is solved by planning, i.e., by computing the solution directly with DP. Because complete information about the model is available, direct computation is possible, but the amount of computation is enormous, so DP is used to split the problem into smaller ones and compute them efficiently. The state value function is mainly used here, and due to the curse of dimensionality the approach struggles when the state space is too large.
• Model-free: Learning
Since the MDP is unknown, RL is used to learn. The exact transition probabilities are not known, but by collecting many transition samples (s, a) → s′ they can be estimated approximately. Learning therefore proceeds by evaluating the policy with sample data obtained by taking actions in the real environment. Because computing values for every state is difficult and everything must be estimated from the given data, the action value function is mainly used.
⇒ Updating the value of the current state from the values of future states is called a backup. In DP the MDP is known, so full backups are used; in RL, where the MDP is unknown, sample backups are used.

Value iteration

1. Value iteration overview

Value iteration finds the optimal policy using the Bellman optimality equation. The equation is applied repeatedly until the updated value function no longer changes from the previous iteration, i.e., until it converges. There are two backup schemes: synchronous and asynchronous. The synchronous scheme updates all state values at once, using only the values computed before the current update, whereas the asynchronous scheme applies each state's update immediately, before the remaining states have been updated.
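A minimal sketch of the two schemes, again with placeholder names (`states`, `actions`, `P`, `R`, `gamma` are assumed, not defined in the note): the synchronous sweep writes into a fresh table, while the asynchronous (in-place) sweep lets later states in the same sweep see values that were just updated.

```python
def bellman_optimality_backup(V, s, actions, P, R, gamma):
    # One Bellman optimality backup for a single state s.
    return max(
        sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)].items())
        for a in actions
    )

def synchronous_sweep(V, states, actions, P, R, gamma):
    # Every state is updated from the old table V; results go into a new table.
    return {s: bellman_optimality_backup(V, s, actions, P, R, gamma) for s in states}

def asynchronous_sweep(V, states, actions, P, R, gamma):
    # Each new value is written back immediately and may be used by later states.
    for s in states:
        V[s] = bellman_optimality_backup(V, s, actions, P, R, gamma)
    return V
```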

2. Drawbacks of value iteration

๋‹จ์ ์œผ๋กœ๋Š” ํฌ๊ฒŒ 2๊ฐ€์ง€๊ฐ€ ์žˆ๋Š”๋ฐ, ๊ฐ state๊ฐ€ optimal value์— ์ˆ˜๋ ดํ•˜๊ธฐ ์ด์ „์— value๋ฅผ max๋กœ ํ•˜๋Š” action์ด ๊ฑฐ์˜ ์ •ํ•ด์ง€๊ฒŒ ๋˜๋ฏ€๋กœ, ์ˆ˜๋ ด์‹œ๊นŒ์ง€ ์ด๋ฅผ ๋ฐ˜๋ณตํ•œ๋‹ค๋ฉด ์“ธ๋ชจ์—†๋Š” ์—ฐ์‚ฐ๋งŒ ๋Š˜์–ด๋‚˜๊ฒŒ ๋œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ฐ ๋ฐ˜๋ณต๋งˆ๋‹ค ๋ชจ๋“  state-action-next state๋ฅผ ๊ณ ๋ คํ•œ ๊ณ„์‚ฐ์ด ํ•„์š”ํ•˜์—ฌ ๊ณ„์‚ฐ๋Ÿ‰์ด O(S2A)O(S^2A)์œผ๋กœ ์ƒ๋‹น์ด ๋งŽ์€ ํŽธ์ด๋‹ค.
*์ˆ˜๋ ด์˜ ์˜๋ฏธ
โˆฃVk+1(s)โˆ’Vk(s)โˆฃ<ฯต|V_{k+1}(s) - V_k(s)| < \epsilon
โ†’ ๋ชจ๋“  state์— ๋Œ€ํ•ด update ์ „/ํ›„์˜ ์ฐจ์ด๊ฐ€ ๋งค์šฐ ์ž‘๋‹ค.

3. Algorithm
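The original figure for the algorithm is not reproduced here; the sketch below is one way to write synchronous value iteration in code, with `states`, `actions`, `P`, `R`, `gamma`, and `eps` as placeholder names for the known MDP and the convergence threshold. Once the values have converged, the optimal policy can be read off by taking the greedy action in each state.

```python
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Synchronous value iteration with the Bellman optimality equation.

    P[(s, a)] = {s': prob}, R[(s, a, s')] = reward (the model is known).
    """
    V = {s: 0.0 for s in states}                      # initialize all values to 0
    while True:
        V_new = {
            s: max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in P[(s, a)].items())
                for a in actions
            )
            for s in states
        }
        # convergence check: |V_{k+1}(s) - V_k(s)| < eps for every state
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new
        V = V_new
```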

4. Example

• Robot Car
Let's apply value iteration to the Robot Car example configured as follows.
1. State: cool, warm, overheated
2. Action: slow, fast
3. Rewards: slow = 1, fast = 2, overheated = -10
→ The transition probabilities and rewards are known, so this is a model-based setting.
• Applying value iteration
1. Initialize every state value to 0.
2. Compute the optimal state values using the Bellman optimality equation.
3. Repeat until convergence.
• Example: computing the value of the Warm state in the 2nd iteration (see the worked sketch below)
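The note does not list the transition probabilities (they were presumably shown in a figure), so the sketch below fills them in with commonly used values for this example and sets $\gamma = 1$; treat every number that is not in the list above as an assumption. With those assumptions, two synchronous sweeps reproduce the second-iteration value of the Warm state.

```python
# Robot Car MDP. States, actions, and rewards are from the note; the transition
# probabilities and gamma = 1 are assumed purely for illustration.
gamma = 1.0
P = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"overheated": 1.0},
}
def reward(a, s2):
    # slow = 1, fast = 2, overheating = -10 (as given in the note)
    return -10.0 if s2 == "overheated" else (1.0 if a == "slow" else 2.0)

V = {"cool": 0.0, "warm": 0.0, "overheated": 0.0}      # step 1: initialize to 0

for k in range(1, 3):                                   # two sweeps are enough here
    V = {
        s: 0.0 if s == "overheated" else max(           # overheated is terminal
            sum(p * (reward(a, s2) + gamma * V[s2]) for s2, p in P[(s, a)].items())
            for a in ("slow", "fast")
        )
        for s in V
    }
    print(f"iteration {k}: {V}")
# Under these assumed probabilities, iteration 2 gives V(warm) = 2.5.
```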

Policy iteration

1. Policy iteration overview

Policy iteration์€ policy evaluation๊ณผ policy improvement๋ฅผ ์ˆ˜๋ ด ์‹œ๊นŒ์ง€ ๋ฐ˜๋ณตํ•˜์—ฌ optimal policy๋ฅผ ์ฐพ์•„๊ฐ€๋Š” ๊ณผ์ •์œผ๋กœ Bellman expectation equation์„ ์‚ฌ์šฉํ•œ๋‹ค. ์ด๋•Œ policy evaluation ๋‹จ๊ณ„์—์„œ๋Š” deterministic policy๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ state์— ๋Œ€ํ•œ value๋ฅผ ์ˆ˜๋ ด ์‹œ๊นŒ์ง€ ๋ฐ˜๋ณตํ•˜์—ฌ ๊ณ„์‚ฐํ•œ๋‹ค. ์ด๋•Œ ๊ณ ์ •๋œ Policy๋กœ ํ•˜๋‚˜์˜ action๋งŒ ๊ณ ๋ คํ•˜๋ฏ€๋กœ ๊ณ„์‚ฐ๋Ÿ‰์ด O(S2)O(S^2)์œผ๋กœ ์ด์ „์˜ value iteration์— ๋น„ํ•ด ๋” ์ ์–ด์ง„๋‹ค. Policy Improvement ๋‹จ๊ณ„์—์„œ๋Š” ๊ณ„์‚ฐ๋œ state value์— ๋Œ€ํ•ด greedy policy๋ฅผ ์„ ํƒํ•˜์—ฌ policy๋ฅผ ๊ฐœ์„ ํ•œ๋‹ค.

2. Policy Improvement Theorem

1. Compare the state value of following policy $\pi$ from state $s$ with the action value of selecting the action given by policy $\pi'$ at $s$ and following $\pi$ afterwards. If the action value is at least as large for every state, i.e. $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s$, then $\pi'$ is at least as good a policy as $\pi$.
2. "At least as good" means that the state value obtained by following $\pi'$ is at least as large as the state value obtained by following $\pi$ at every state: $v_{\pi'}(s) \ge v_\pi(s)$ for all $s$. The theorem states that condition 1 implies this.

3. Algorithm

→ Iterative policy evaluation computes $v_\pi$, and based on the result a greedy policy is selected, improving the policy; this process is repeated. If the greedy policy $\pi^*$ is no better than the old policy $\pi$, the state values computed under the two policies have converged; that value is the optimal value, so the corresponding policy is the optimal policy.
→ Unlike the state values, which converge only as the number of iterations grows, the optimal policy can be settled earlier. Therefore, unlike value iteration, policy iteration can terminate early once the policy stops improving.
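Putting the two steps together, a self-contained sketch of the full loop (same placeholder MDP interface as before, not the note's own notation) with the early-termination check described above:

```python
def policy_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Policy iteration sketch: evaluate the current deterministic policy,
    improve it greedily, and stop as soon as the policy no longer changes."""
    policy = {s: actions[0] for s in states}    # arbitrary initial deterministic policy
    V = {s: 0.0 for s in states}
    while True:
        # policy evaluation: Bellman expectation equation under the fixed policy
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v = sum(p * (R[(s, a, s2)] + gamma * V[s2])
                        for s2, p in P[(s, a)].items())
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < eps:
                break
        # policy improvement: greedy with respect to the evaluated values
        stable = True
        for s in states:
            best = max(actions, key=lambda a: sum(
                p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)].items()))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:                               # no improvement -> optimal policy
            return V, policy
```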

4. Optimal policy summary

• Dynamic programming
→ Model-based: since the MDP is known, every value function can be computed, so values are stored in a table and updated with full backups to find the optimal policy. Because of the computational cost, the state-value function is mainly used in this process. Moreover, the number of states typically grows exponentially with the size of the problem (the curse of dimensionality), and DP must sweep over all of them, so it is not suitable for large problems.
• RL
→ Model-free: since the MDP is unknown, learning proceeds from sample data obtained by interacting with the environment, so the Bellman optimality equation is solved approximately to find an optimal solution. Because sampling is used, RL avoids sweeping over every state and can thus sidestep the curse-of-dimensionality problem arising from the number of states.