
Chapter 3

Dynamic Programming

Dynamic : sequential or temporal component to the problem
Programming : optimising a "program", c.f. linear programming
Refers to a method for solving complex problems.

Requirements for Dynamic Programming

• Optimal substructure
• Overlapping subproblems
DP is a general solution method for problems that have these two properties.

Markov decision processes satisfy both properties

1. Bellman equation gives recursive decomposition
2. Value function stores and reuses solutions

Planning by Dynamic Programming

Planning means solving an MDP when the model is known.
→ that is, finding the optimal policy.
"Solving" here comes in two kinds:
1. prediction
2. control

Prediction → finding the value function

In an MDP, P is the transition matrix, which holds the transition probabilities.
Here the policy π does not have to be optimal.
Solving the prediction problem means: given any policy, return the value function corresponding to that policy.
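For a fixed policy, this prediction problem is the Bellman expectation equation; in matrix form (my own addition for clarity, using the lecture's notation where R^π and P^π are the reward vector and state-transition matrix induced by π):

$$v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi
\quad\Longrightarrow\quad
v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi$$

The direct solve is only practical for small MDPs, which is why the iterative evaluation described below is used instead.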

Control → finding the optimal policy

Policy Evaluation

Iterative Policy Evaluation

Problem : evaluate a given policy π
Solution : iterative application of Bellman expectation backup
A backup means storing the computed values in memory.
Updating v_{k+1}(s) from v_k(s') for every state s at each iteration is called a synchronous backup.
s' denotes a successor state of s.
Repeating this computation converges to v_π.
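Written out, one synchronous backup applied to every state s is (standard notation from the lecture, added here for clarity):

$$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_k(s') \right)$$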

Iterative Policy Evaluation

In the backup diagram, the value function of the state at the top (the root) is updated from the previous value functions of its successor states.
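A minimal NumPy sketch of this synchronous evaluation loop, assuming the MDP is given as dense arrays (the array layout, function name, and default parameters are my own choices, not from the lecture):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-6):
    """Synchronous iterative policy evaluation.

    P  : transition probabilities, shape (n_actions, n_states, n_states)
    R  : expected rewards,         shape (n_actions, n_states)
    pi : policy probabilities,     shape (n_states, n_actions)
    """
    n_states = P.shape[1]
    v = np.zeros(n_states)                      # v_0(s) = 0 for every state
    while True:
        # q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * v_k(s')
        q = R + gamma * P @ v
        # v_{k+1}(s): average the action values under the given policy pi
        v_new = np.einsum('sa,as->s', pi, q)
        if np.max(np.abs(v_new - v)) < theta:   # stop once a backup barely changes anything
            return v_new
        v = v_new
```

Each sweep computes every v_{k+1}(s) from the frozen previous array v_k, which is exactly the synchronous backup described above.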

Policy Iteration

How to Improve a Policy

Given a policy π

• Evaluate the policy π
• Improve the policy by acting greedily with respect to v_π
But this process of policy iteration always converges to π*
Even if the starting policy is arbitrary, the process can be seen to converge to the optimal policy in the end.
• Policy evaluation : estimate v_π (iterative policy evaluation)
• Policy improvement : generate π′ ≥ π (greedy policy improvement)
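A compact sketch of the full policy-iteration loop under the same assumed array layout; here the evaluation step solves the Bellman expectation equation exactly with a linear solve instead of iterating, which is just one possible choice:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration: exact policy evaluation + greedy policy improvement.

    P : transition probabilities, shape (n_actions, n_states, n_states)
    R : expected rewards,         shape (n_actions, n_states)
    """
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)       # arbitrary deterministic starting policy
    while True:
        # Policy evaluation: solve v = R^pi + gamma * P^pi v exactly
        P_pi = P[policy, np.arange(n_states)]    # P^pi, shape (n_states, n_states)
        R_pi = R[policy, np.arange(n_states)]    # R^pi, shape (n_states,)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to v_pi
        q = R + gamma * P @ v                    # action values for every (action, state) pair
        new_policy = np.argmax(q, axis=0)
        if np.array_equal(new_policy, policy):   # greedy policy unchanged -> pi is optimal
            return policy, v
        policy = new_policy
```

The loop stops when greedy improvement no longer changes the policy, i.e. when π is already greedy with respect to its own value function, which is exactly π*.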

Value Iteration

Principle of Optimality

Any optimal policy can be subdivided into two components:
• An optimal first action A*
• Followed by an optimal policy from successor state S′

Deterministic Value Iteration

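The equation referred to just below did not survive the export; it is presumably the one-step value-iteration update from the lecture, reconstructed here:

$$v_*(s) \leftarrow \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_*(s') \right)$$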
This is an Optimality Equation that must be solved iteratively: because of the nonlinear max there is no closed-form solution, so we cannot simply take a matrix inverse as in the expectation case.
Finding the optimal value of the current state from the optimal values of the successor states intuitively means starting from the final reward and working backwards.

Value Iteration

• Unlike policy iteration, there is no explicit policy
• Intermediate value functions may not correspond to any policy
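A minimal sketch of synchronous value iteration under the same assumed array layout; note that no policy is represented during the sweeps, and a greedy policy is only read off at the end:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Synchronous value iteration with the Bellman optimality backup.

    P : transition probabilities, shape (n_actions, n_states, n_states)
    R : expected rewards,         shape (n_actions, n_states)
    """
    n_states = P.shape[1]
    v = np.zeros(n_states)
    while True:
        # v_{k+1}(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) * v_k(s') ]
        v_new = np.max(R + gamma * P @ v, axis=0)
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    # No policy is maintained during the sweeps; a greedy one is read off at the end
    policy = np.argmax(R + gamma * P @ v, axis=0)
    return v, policy
```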

Synchronous Dynamic Programming Algorithms

| Problem | Bellman Equation | Algorithm |
| --- | --- | --- |
| Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
| Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
| Control | Bellman Optimality Equation | Value Iteration |
• Algorithms are based on the state-value function v_π(s) or v_*(s)
• Complexity O(mn²) per iteration, for m actions and n states
The computational cost is very high.

Asynchronous Dynamic Programming

The DP methods so far have used synchronous backups.
→ all states were backed up in parallel.
Asynchronous dynamic programming instead backs up states individually, in any order.
→ only each selected state receives the appropriate backup.
Convergence is still guaranteed as long as every state continues to be selected.
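A minimal sketch of an asynchronous, in-place variant under the same assumed array layout: states are backed up one at a time, in an arbitrary (here random) order, and each backup immediately overwrites the single value array; uniform random selection is just one simple way to make sure every state keeps being visited:

```python
import numpy as np

def in_place_value_iteration(P, R, gamma=0.9, n_backups=100_000, seed=0):
    """Asynchronous (in-place) value iteration: one state backed up at a time.

    P : transition probabilities, shape (n_actions, n_states, n_states)
    R : expected rewards,         shape (n_actions, n_states)
    """
    rng = np.random.default_rng(seed)
    n_states = P.shape[1]
    v = np.zeros(n_states)                  # a single value array, overwritten in place
    for _ in range(n_backups):
        s = rng.integers(n_states)          # pick any state; random selection is one simple scheme
        # Back up only the selected state, using the latest values of all other states
        v[s] = np.max(R[:, s] + gamma * P[:, s, :] @ v)
    return v
```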