
Double DQN

์ƒ์„ฑ์ผ
2024/07/22 02:59
ํƒœ๊ทธ
๊ฐ•ํ™”ํ•™์Šต
์ž‘์„ฑ์ž

Double DQN

1. Double DQN Overview

์ผ๋ฐ˜์ ์ธ Q-learning์—์„œ target์„ ๊ณ„์‚ฐํ•  ๋•Œ max ์—ฐ์‚ฐ์ž๋ฅผ ํ†ตํ•ด ๊ณ„์‚ฐํ•˜๊ธฐ ๋•Œ๋ฌธ์—, Q-value๊ฐ€ overestimate ๋˜๋Š” ๋ฌธ์ œ๊ฐ€ ์ƒ๊ธด๋‹ค. ๋งŒ์•ฝ Q-value์— ๋Œ€ํ•œ ์ถ”์ • ์‹œ ๋ชจ๋“  ๊ฐ’์ด overestimate ๋œ๋‹ค๋ฉด, ์–ด์ฐจํ”ผ Q-value๋ฅผ ์ตœ๋Œ€๋กœ ํ•˜๋Š” action์„ ์„ ํƒํ•˜๋„๋ก ๋งŒ๋“œ๋Š” policy๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด ์ฃผ์š”ํ•œ ๋ชฉ์ ์ด๋ฏ€๋กœ ๋ณ„ ๋ฌธ์ œ๊ฐ€ ๋˜์ง€ ์•Š๋Š๋‚˜ RL์—์„œ๋Š” ๋ชจ๋“  (s,a)(s,a)์— ๋Œ€ํ•ด์„œ ๊ณ ๋ คํ•˜์ง€ ์•Š๊ณ  sample์— ๋Œ€ํ•ด์„œ๋งŒ Q-value๋ฅผ ์ถ”์ •ํ•˜๋ฏ€๋กœ ๋ฌธ์ œ๊ฐ€ ๋œ๋‹ค. ๋”ฐ๋ผ์„œ Double Q-learning์€ ์ด๋Ÿฌํ•œ overestimation bias๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด target ๋‚ด์—์„œ action์„ ์„ ํƒํ•˜๋Š” ๋ถ€๋ถ„๊ณผ action์— ๋Œ€ํ•ด Q-value๋ฅผ ๊ตฌํ•˜๋Š” ๊ณผ์ •์„ ๋‚˜๋ˆ„์–ด ์ ์šฉํ•œ๋‹ค. Double Q-learning์—์„œ๋Š” Q-table์„ 2๊ฐœ๋กœ ๋‚˜๋ˆ„์–ด ์ด ๊ณผ์ •์„ ์ˆ˜ํ–‰ํ•˜์˜€์œผ๋‚˜, DQN์—์„œ๋Š” target network์˜ ์‚ฌ์šฉ์œผ๋กœ ์ธํ•˜์—ฌ ์ด๋ฏธ ์„œ๋กœ ๋‹ค๋ฅธ ๋‘ ๊ฐœ์˜ parameter๊ฐ€ ์กด์žฌํ•˜๋ฏ€๋กœ ์ด๋ฅผ ํ™œ์šฉํ•œ๋‹ค.
•
Double DQN's loss function
L(\theta) = \left[ r_{t+1} + \gamma \, \hat{Q}\big(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta); \hat{\theta}\big) - Q(s_t, a_t; \theta) \right]^2
⇒ The action inside the target is selected by the network with parameters θ (the online network). Because an action chosen this way is not guaranteed to also be the maximizer under θ̂ (the target network) that evaluates it, the overestimation is reduced.
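To make the separation between action selection and action evaluation concrete, here is a minimal PyTorch-style sketch of this loss; the names online_net and target_net and the batch layout are illustrative assumptions, not something fixed by the note.

```python
import torch
import torch.nn.functional as F

def double_dqn_loss(online_net, target_net, batch, gamma=0.99):
    """Double DQN loss: the online net selects the next action, the target net evaluates it."""
    s, a, r, s_next, done = batch  # tensors: states, actions (int64), rewards, next states, done flags (float)

    # Q(s_t, a_t; theta) for the actions actually taken
    q = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # argmax_a Q(s_{t+1}, a; theta): action selection with the online parameters
        next_a = online_net(s_next).argmax(dim=1, keepdim=True)
        # Q-hat(s_{t+1}, argmax_a ...; theta-hat): evaluation with the target parameters
        next_q = target_net(s_next).gather(1, next_a).squeeze(1)
        target = r + gamma * (1.0 - done) * next_q

    return F.mse_loss(q, target)
```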

2. Prioritized Replay Overview

An online RL agent updates its parameters with data obtained in temporal order. This causes a temporal correlation problem, in which consecutive samples are strongly correlated, and it also has the drawback that rarely obtained experiences are discarded quickly. To compensate, DQN introduces a replay buffer: transitions are stored in the buffer and a minibatch-sized set of them is sampled from it for each update. In this scheme, however, important samples and unimportant ones are drawn with the same probability, so important samples should be given extra weight so that they are sampled more often. Importance is judged by the magnitude of the TD error: a large TD error means that the Q-value predicted in that state still needs a lot of improvement, so the transition carries a lot of information about how to improve and is therefore considered important. This, in turn, raises problems of reduced sample diversity and of bias, which are alleviated by introducing stochastic sampling prioritization and importance sampling weights.
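For contrast with the prioritized version discussed below, a plain uniformly-sampled replay buffer can be sketched as follows; the class and method names are illustrative only.

```python
import random
from collections import deque

class ReplayBuffer:
    """Plain DQN buffer: every stored transition is equally likely to be sampled."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are silently dropped

    def push(self, transition):
        self.buffer.append(transition)  # transition = (s, a, r, s_next, done)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)  # uniform, without replacement
```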

3. Prioritizing with TD error

•
Model-based
In value iteration, for example, prioritized updates are often more efficient: rather than updating every state equally, the states whose value changed the most are updated first. When computing Q-values for the current policy, the updates of the most influential states are then reflected first, which improves performance. This works even better in an asynchronous scheme, where the result of an important update is applied immediately.
•
Model-free
In most RL environments, transitions that end in failure occur far more often than transitions that end in success. So when a particular attempt finally leads to a successful outcome, the change in value is very large: taking a successful action in a state means both that the reward received from that action is large and that the state the action leads to is one of higher value. If we simply sampled uniformly from the replay buffer, such informative samples would rarely be drawn, which is why prioritizing is used.
⇒ In short, prioritizing should make the agent look first at the transitions from which there is the most to learn when taken from the current state, and the TD-error is used to judge this. Since the TD-error is already computed in the course of DQN learning, it is a practical criterion.
•
Formula
The priority used to sample transitions in proportion to their TD-error is given by the expression below.
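A standard choice for this priority, in the proportional variant of prioritized replay, is

p_i = |\delta_i| + \epsilon

where δᵢ is the TD-error of transition i and ε is a small positive constant that keeps transitions with zero TD-error from never being revisited.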
•
Problems
Updating the priority of every transition in the replay buffer would require too much computation, so priorities are updated only for the transitions sampled in each minibatch. As a result, transitions whose initial TD-error is large are selected often and their values become accurately updated, while the rest are not. Sample diversity therefore shrinks and overfitting becomes possible. On top of that, the priorities keep changing as sampling proceeds, so the distribution the data is sampled from keeps shifting as well.
•
Addressing the sample-diversity problem
Stochastic sampling prioritization is used. The probability of sampling transition i is P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}. The closer α is to 1, the more strongly transitions are selected according to their TD-error-based priority; if α = 0, prioritization is ignored entirely and sampling is uniform. In short, α is a hyperparameter that controls how much prioritization is applied (see the sketch at the end of this section).
•
Addressing the bias caused by the changing sampling distribution
Importance sampling weights are used to correct for the sampling distribution. The importance sampling weight is w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^\beta, where N is the number of transitions in the buffer; if β = 1, the non-uniform sampling is fully corrected.
For the updates to be unbiased, it is best to let β reach 1 by the end of training, so β is linearly annealed from its initial value β₀ toward 1 (the sketch below combines this with the prioritized sampling above).
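Putting the two remedies together, the following sketch shows one way a proportionally prioritized buffer could implement both P(i) and w_i; the class name, the default α, and the normalization of the weights by their maximum are illustrative choices, not dictated by the text above.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Proportional prioritization: P(i) = p_i^alpha / sum_k p_k^alpha,
    with importance sampling weights w_i = (1/N * 1/P(i))^beta."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha
        self.data = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0

    def __len__(self):
        return len(self.data)

    def push(self, transition):
        # New transitions get the current maximum priority so they are sampled at least once.
        max_p = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        p = self.priorities[:len(self.data)] ** self.alpha
        probs = p / p.sum()                                   # P(i)
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-beta)    # w_i = (1/(N * P(i)))^beta
        weights /= weights.max()                              # normalize by the max weight for stability
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors, eps=1e-6):
        # p_i = |delta_i| + eps, updated only for the sampled transitions
        self.priorities[idx] = np.abs(td_errors) + eps
```

β would then be annealed linearly per update, e.g. beta = beta0 + (1 - beta0) * step / total_steps, so that it reaches 1 at the end of training.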

4. Double DQN with prioritized replay pseudo code

*** △ is the storage space used to accumulate the parameter update
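As a stand-in for the pseudo code itself, the following is a minimal Python sketch of the overall loop, combining the Double DQN target with the PrioritizedReplayBuffer sketched above; it assumes the Gymnasium environment API and PyTorch networks, and it applies each weight change immediately through an optimizer instead of accumulating it in a separate store △.

```python
import numpy as np
import torch

def train(env, online_net, target_net, buffer, optimizer,
          num_steps=100_000, batch_size=32, gamma=0.99,
          beta0=0.4, target_sync=1000, epsilon=0.1):
    """Illustrative Double DQN + proportional prioritized replay loop."""
    target_net.load_state_dict(online_net.state_dict())
    s, _ = env.reset()
    for step in range(num_steps):
        # epsilon-greedy action from the online network
        if torch.rand(1).item() < epsilon:
            a = env.action_space.sample()
        else:
            with torch.no_grad():
                a = online_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)).argmax(1).item()
        s_next, r, terminated, truncated, _ = env.step(a)
        buffer.push((s, a, r, s_next, float(terminated)))
        s = env.reset()[0] if (terminated or truncated) else s_next

        if len(buffer) < batch_size:
            continue

        # linearly anneal beta toward 1 so the update becomes unbiased late in training
        beta = beta0 + (1.0 - beta0) * step / num_steps
        batch, idx, weights = buffer.sample(batch_size, beta)
        S, A, R, S2, D = map(np.array, zip(*batch))
        S  = torch.as_tensor(S,  dtype=torch.float32)
        A  = torch.as_tensor(A,  dtype=torch.int64)
        R  = torch.as_tensor(R,  dtype=torch.float32)
        S2 = torch.as_tensor(S2, dtype=torch.float32)
        D  = torch.as_tensor(D,  dtype=torch.float32)
        W  = torch.as_tensor(weights, dtype=torch.float32)

        q = online_net(S).gather(1, A.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            next_a = online_net(S2).argmax(1, keepdim=True)        # selection: theta
            next_q = target_net(S2).gather(1, next_a).squeeze(1)   # evaluation: theta-hat
            target = R + gamma * (1.0 - D) * next_q
        td_error = target - q
        loss = (W * td_error.pow(2)).mean()  # importance-sampling-weighted squared TD error

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # refresh priorities only for the sampled transitions
        buffer.update_priorities(idx, td_error.detach().abs().numpy())
        if step % target_sync == 0:
            target_net.load_state_dict(online_net.state_dict())
```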