๐Ÿ’ช๐Ÿป

Double DQN

์ƒ์„ฑ์ผ
2024/07/22 02:59
ํƒœ๊ทธ
๊ฐ•ํ™”ํ•™์Šต
์ž‘์„ฑ์ž


1. Double DQN

• Target in Q-learning
  ◦ $r + \gamma \max_{a} Q(s', a)$
    ▪ In $\max_{a} Q(s', a)$, $Q(s', a)$ is an estimate.
    ▪ It can be high simply by chance, not because the action is actually the best one.
• Resolves the overestimation bias of the Q-value
  ◦ Selecting the action in the target and evaluating the Q-value of the selected action are carried out by two separate networks.
• Loss function of Double DQN
  $L(\theta) = \left[\, r_{t+1} + \gamma\, \hat{Q}(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta); \hat{\theta}) - Q(s_t, a_t; \theta) \,\right]^2$
  ◦ Action selection in the target
    ▪ The network with parameters $\theta$
  ◦ Q-value evaluation
    ▪ The network with parameters $\hat{\theta}$
  ◦ Since the same action is unlikely to have the largest Q-value in both networks at the same time, the overestimation problem is mitigated.

2. Overestimation

• Jensen's inequality
  $E\!\left[\max_{a} Q(s', a)\right] \geq \max_{a} E\!\left[Q(s', a)\right]$
• Given infinitely many samples, $E[Q(s', a)]$ approximates the true Q-value.
• So by Jensen's inequality, applying the max operator first to Q-values that have not yet been fully updated can produce the overestimation problem.
• Suppose there are 3 possible actions at $s'$ and 5 samples were obtained for each:
  ◦ $a_1: [1.25, 0.88, 1.38, 1.77, 0.84]$
    $a_2: [2.47, 1.91, 2.06, 2.38, 1.79]$
    $a_3: [3.03, 3.54, 2.65, 3.40, 2.86]$
  ◦ Applying max first (over actions, per sample), then mean:
    ▪ max: $[3.03, 3.54, 2.65, 3.40, 2.86]$
    ▪ mean: $3.096$
  ◦ Applying mean first (per action), then max:
    ▪ $a_1: 1.22,\; a_2: 2.12,\; a_3: 3.10$
    ▪ max: $3.10$
  ◦ Here the two values coincide (both are $3.096$, rounded differently) because $a_3$ happens to be the largest in every sample; when the noise is large enough that different actions look best in different samples, the max-first value strictly exceeds the mean-first value.
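The inequality can also be checked with synthetic noise. In this sketch (my construction, not from the note) all three actions have the same true value 1.0, yet taking the max before averaging overestimates it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three actions whose TRUE Q-values are all 1.0; we only see noisy estimates.
true_q = np.array([1.0, 1.0, 1.0])
samples = true_q[:, None] + rng.normal(scale=0.5, size=(3, 10_000))

# E[max_a Q(s', a)]: max over actions per sample, then average.
e_of_max = samples.max(axis=0).mean()

# max_a E[Q(s', a)]: average per action first, then max.
max_of_e = samples.mean(axis=1).max()

# e_of_max lands noticeably above the true value 1.0,
# while max_of_e stays close to it.
assert e_of_max > max_of_e
```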

3. Prioritized Replay

• Online RL
  ◦ Temporal correlation between consecutive transitions causes problems.
  ◦ Rarely occurring experiences are discarded even when they are valuable.
  ◦ DQN mitigates these problems with a replay buffer.
• Replay buffer
  ◦ Important and unimportant samples are drawn with the same probability.
  ◦ Important samples need to be given a weight so that they are sampled more often.
• What makes a sample important?
  ◦ Measured by the magnitude of the TD error.

4. Prioritizing with TD error

• Model-based
  ◦ Value iteration
    ▪ States with the largest value change are updated first.
    ▪ Important changes are immediately reflected in the value computation of other states.
    ▪ Effective in asynchronous schemes.
• Model-free
  ◦ Transitions from failures appear far more often than transitions from successes.
  ◦ When a particular attempt leads to a successful outcome, the value difference becomes very large.
• Weights are computed from the TD-error.
• Problems
  ◦ Updating the priorities of every transition in the replay buffer is inefficient, so priorities are updated only for the transitions sampled in the mini-batch.
    ▪ Among the initially sampled transitions, those with a large TD-error may be selected over and over while the rest are ignored.
    ▪ The reduced sample diversity can lead to overfitting.
  ◦ Because the priorities keep changing, the distribution from which transitions are sampled keeps changing as well, which introduces bias.
• Fixing the sample-diversity problem
  ◦ Use stochastic sampling prioritization.
  ◦ Prioritization probability
    ▪ $P(i) = \dfrac{p_i^\alpha}{\sum_k p_k^\alpha}$
    ▪ The closer $\alpha$ is to 1, the more strongly selection is driven by the TD-error.
    ▪ $\alpha = 0$ means prioritization is not used at all (uniform sampling).
  ◦ $\alpha$: a hyperparameter for how much prioritization to use.
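A small sketch of this probability; the function name `priority_probs` and the small `eps` added to each $|\text{TD error}|$ (so zero-error transitions remain sampleable) are illustrative choices, not from the note:

```python
import numpy as np

def priority_probs(td_errors, alpha, eps=1e-6):
    """P(i) = p_i^alpha / sum_k p_k^alpha, with p_i = |td_error_i| + eps."""
    p = np.abs(td_errors) + eps
    p_alpha = p ** alpha
    return p_alpha / p_alpha.sum()

td = np.array([0.1, 0.5, 2.0, 0.05])

# alpha = 0: prioritization is ignored, sampling is uniform.
print(priority_probs(td, alpha=0.0))  # [0.25 0.25 0.25 0.25]

# alpha = 1: selection is fully proportional to the TD-error magnitude.
probs = priority_probs(td, alpha=1.0)
print(probs.argmax())  # 2 — the transition with the largest TD-error
```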
• Fixing the sampling-distribution problem
  ◦ Use importance sampling weights.
  ◦ Importance sampling weights
    ▪ $w_i = \left(\dfrac{1}{N} \cdot \dfrac{1}{P(i)}\right)^{\beta}$
    ▪ $\beta = 1$: full correction.
    ▪ For stability, each weight is multiplied by $\dfrac{1}{\max_k w_k}$.
  ◦ During updates, this reduces the influence of frequently sampled transitions.
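A corresponding sketch of the weight computation, including the $1/\max_k w_k$ normalization; the example probabilities are made up:

```python
import numpy as np

def is_weights(probs, beta):
    """w_i = (1/N · 1/P(i))^beta, normalized by 1/max_k w_k for stability."""
    n = len(probs)
    w = (1.0 / (n * probs)) ** beta
    return w / w.max()

# Hypothetical sampling probabilities for 4 transitions.
probs = np.array([0.5, 0.3, 0.15, 0.05])

w = is_weights(probs, beta=1.0)  # beta = 1: full correction
# The most frequently sampled transition receives the smallest weight,
# the rarest one the largest (normalized to 1).
```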

5. Double DQN with prioritized replay pseudo code
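The note ends before the pseudo code itself; below is a hedged, tabular Python sketch of one way the pieces above fit together. The Q-tables, constants, and buffer layout are all assumptions made for illustration, not the original pseudo code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tabular stand-ins for the two networks: Q(·;θ) and Q̂(·;θ̂).
N_STATES, N_ACTIONS = 5, 2
q_online = np.zeros((N_STATES, N_ACTIONS))
q_target = np.zeros((N_STATES, N_ACTIONS))
GAMMA, LR, ALPHA, BETA, EPS = 0.99, 0.1, 0.6, 0.4, 1e-6

buffer, priorities = [], []

def add(transition):
    # New transitions get the current max priority so each is replayed at least once.
    buffer.append(transition)
    priorities.append(max(priorities, default=1.0))

def sample(batch_size):
    # P(i) = p_i^α / Σ_k p_k^α, with weights w_i = (1/(N·P(i)))^β / max_k w_k.
    p = np.asarray(priorities) ** ALPHA
    probs = p / p.sum()
    idx = rng.choice(len(buffer), size=batch_size, p=probs)
    w = (1.0 / (len(buffer) * probs[idx])) ** BETA
    return idx, w / w.max()

def train_step(batch_size=4):
    idx, weights = sample(batch_size)
    for i, w_i in zip(idx, weights):
        s, a, r, s_next, done = buffer[i]
        # Double DQN target: the online table selects, the target table evaluates.
        a_star = int(q_online[s_next].argmax())
        target = r if done else r + GAMMA * q_target[s_next, a_star]
        td = target - q_online[s, a]
        q_online[s, a] += LR * w_i * td   # importance-weighted TD update
        priorities[i] = abs(td) + EPS     # refresh priority of the sampled transition

# Fill the buffer with random (hypothetical) transitions and train briefly.
for _ in range(50):
    s, a = int(rng.integers(N_STATES)), int(rng.integers(N_ACTIONS))
    add((s, a, float(rng.random()), int(rng.integers(N_STATES)), False))
for step in range(100):
    train_step()
    if step % 25 == 0:
        q_target[:] = q_online   # periodic target-table sync
```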