
DQN

์ƒ์„ฑ์ผ
2024/07/22 02:58
ํƒœ๊ทธ
๊ฐ•ํ™”ํ•™์Šต
์ž‘์„ฑ์ž

DQN

1. DQN Overview

DQN is an algorithm that applies a CNN to Q-learning so that high-dimensional image inputs can be used. Where classic Q-learning performs updates through a Q-table, DQN carries out policy evaluation with a Q-network. DQN can handle a large or continuous state space as input because the CNN's convolution layers transform the high-dimensional input into a low-dimensional feature vector. The output, however, must be a Q-value for each action, so the action space must be discrete, finite, and of much smaller dimension than the state space.
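Put in symbols (a restatement for clarity, not a formula from the original note): the network is a parametric approximator of the optimal action-value function, taking a state as input and producing one value per discrete action,
$Q(s, a; \theta) \approx Q^*(s, a) \quad \text{for every } a \in \mathcal{A}.$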

2. DQN Architecture

This is the DQN architecture introduced in Playing Atari with Deep Reinforcement Learning.
• Input
If only the screen of a single moment were given as input, the Markov property could not be satisfied, so four frames are stacked together and fed in as one input. Unlike a tabular method, the network receives and processes an input for a specific state rather than holding values for the entire state space, so it is not constrained by the dimension of the state space.
• Convolution layer
The input image is transformed into a feature vector by convolution operations, and since the dimensionality is reduced in this process, even high-dimensional image inputs can be handled.
• Output
The network finally outputs a Q-value for each action. The goal of the whole procedure is to make these output Q-values approximate the optimal Q-values, which is done by adjusting the Q-network parameters $\theta$. In the target, the action with the highest output Q-value is selected.
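As a concrete reference, below is a minimal PyTorch sketch of such a Q-network, assuming the common Atari setup of 4 stacked 84×84 grayscale frames; the layer sizes follow a common DQN configuration and are illustrative, not taken from the note.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 grayscale frames to one Q-value per discrete action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            # Strided convolutions shrink the 4x84x84 input to a small feature map;
            # no max pooling is used (see the note on CNNs in section 7).
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x / 255.0))  # scale pixel values to [0, 1]
```

A batch of states of shape (batch, 4, 84, 84) goes in, and the greedy action is simply `q_net(states).argmax(dim=1)`.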

3. Naive DQN

• Q-learning
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$
⇒ In the target, the action is chosen by the greedy policy with respect to Q, while in the current state the action is chosen by the behavior policy, an $\epsilon$-greedy policy with respect to Q; learning proceeds in the direction that shrinks the difference between the Q-value of that state and the target.
• Naive DQN
$L(\theta) = \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a; \theta) - Q(s_t, a_t; \theta) \right]^2$
⇒ The loss is defined as this squared (MSE) error, and learning proceeds in the direction that minimizes it (see the sketch after this list).
• Problems
1. Temporal correlations: In online RL, parameters are updated with data obtained along the flow of time, so the data are correlated with one another. This is a problem because the function approximated through those parameter updates can overfit. Moreover, since each data point is used once and never reused, rare experiences are discarded quickly even when they are important.
2. Non-stationary target: The Q-network used to compute the target is the same network used to compute the quantity being updated, $Q(S_t, A_t)$, so the target function keeps changing as $\theta$ is updated, which makes convergence difficult.
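The non-stationary-target issue is easy to see in code. Below is a minimal PyTorch sketch of the naive loss above, reusing the hypothetical QNetwork from the section 2 sketch: the same parameters $\theta$ produce both the prediction and the bootstrapped target, so the target moves every time $\theta$ is updated.

```python
import torch
import torch.nn.functional as F

def naive_dqn_loss(q_net, s, a, r, s_next, gamma=0.99):
    """Squared TD error where the SAME network gives both prediction and target."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s_t, a_t; theta)
    with torch.no_grad():                                      # target held constant for this update,
        target = r + gamma * q_net(s_next).max(dim=1).values   # but it shifts whenever theta changes
    return F.mse_loss(q_sa, target)
```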

4. Replay buffer

To address the temporal-correlation problem, DQN introduces a replay buffer. Transitions of the form $(s, a, r, s')$ are stored in the replay buffer, and a minibatch-sized random sample is drawn from it for training. As a result, the temporal correlation between data points is reduced, training is faster because minibatches are used, and past experience can be reused.
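A minimal replay-buffer sketch (the class name and capacity are illustrative choices, not from the note):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next, done) transitions with uniform sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # the oldest transition is dropped when full

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal ordering of the stored data.
        batch = random.sample(self.buffer, batch_size)
        return map(list, zip(*batch))          # lists of states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)
```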

5. Target Network

A target function that keeps changing makes convergence difficult, so DQN introduces a target network to solve this problem. Training starts with the target Q-network parameters $\hat\theta$ set equal to the behavior Q-network parameters $\theta$. During training, $\hat\theta$ is held fixed while $\theta$ is updated continuously, and after a fixed number of steps $\hat\theta$ is overwritten with the current $\theta$. Applying this scheme removes the problem of the target function changing at every update.
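A sketch of this hard-update scheme, reusing the QNetwork sketch above (the sync interval is an illustrative choice):

```python
import copy

SYNC_EVERY = 10_000                        # illustrative: how long theta_hat stays frozen

q_net = QNetwork(n_actions=4)              # behavior network, parameters theta
target_net = copy.deepcopy(q_net)          # target network, theta_hat initialized equal to theta
target_net.eval()                          # theta_hat is never trained directly

def maybe_sync(step: int) -> None:
    """Hard update: copy theta into theta_hat once every SYNC_EVERY steps."""
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```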

6. DQN Summary

1. In the current state $S_t$, select an action $a_t$ with an $\epsilon$-greedy policy, using the behavior network's parameters.
2. Store the resulting transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in the replay buffer.
3. The replay buffer holds the most recent N transitions, so storing the new transition from step 2 deletes the oldest one.
4. Randomly sample a minibatch of transitions from the replay buffer.
5. For the sampled transitions, compute the target values using the target network's parameters.
6. Compute the loss function $L(\theta)$.
7. Update the behavior network parameters.
8. After a fixed number of steps, update the target network parameters.
**Equations used in steps 6 and 7:
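In the standard form from the DQN paper, with $\hat\theta$ denoting the frozen target-network parameters:
$y_t = r_{t+1} + \gamma \max_a Q(s_{t+1}, a; \hat\theta)$
$L(\theta) = \left[ y_t - Q(s_t, a_t; \theta) \right]^2$
and step 7 is a gradient step $\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$ on this loss.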

7. DQN pseudo code

• CNN in DQN
To estimate a Q-value for every action, DQN feeds in a stack of 4 frames rather than a single frame, a measure taken so that the Markov property can be satisfied. This makes the input dimension very large, so the convolution layers of the CNN are used to reduce it. Max pooling is not used here: the usual reason for max pooling in a CNN is translation invariance, but in a game it is natural for the output to change when positions change, so max pooling is generally not applied.
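Standing in for the pseudocode, below is a compact Python sketch of the loop summarized in section 6. The environment is assumed to expose a gym-style reset()/step() interface returning stacked-frame tensors, the hyperparameters are illustrative, and QNetwork / ReplayBuffer refer to the sketches above.

```python
import random
import torch
import torch.nn.functional as F

def train_dqn(env, n_actions, total_steps=100_000, batch_size=32,
              gamma=0.99, eps=0.1, lr=2.5e-4, sync_every=10_000):
    """Sketch of the DQN loop from section 6 (assumed gym-style env API)."""
    q_net = QNetwork(n_actions)                      # behavior network, theta
    target_net = QNetwork(n_actions)                 # target network, theta_hat
    target_net.load_state_dict(q_net.state_dict())   # start with theta_hat = theta
    optimizer = torch.optim.RMSprop(q_net.parameters(), lr=lr)
    buffer = ReplayBuffer()

    s = env.reset()
    for step in range(1, total_steps + 1):
        # 1. epsilon-greedy action from the behavior network
        if random.random() < eps:
            a = random.randrange(n_actions)
        else:
            with torch.no_grad():
                a = q_net(s.unsqueeze(0)).argmax(dim=1).item()

        # 2-3. store the transition; the buffer drops the oldest one when full
        s_next, r, done, _ = env.step(a)
        buffer.push(s, a, r, s_next, done)
        s = env.reset() if done else s_next

        if len(buffer) >= batch_size:
            # 4. sample a random minibatch
            states, actions, rewards, next_states, dones = buffer.sample(batch_size)
            states, next_states = torch.stack(states), torch.stack(next_states)
            actions = torch.tensor(actions)
            rewards = torch.tensor(rewards, dtype=torch.float32)
            dones = torch.tensor(dones, dtype=torch.float32)

            # 5. targets from the frozen target network (theta_hat)
            with torch.no_grad():
                max_next_q = target_net(next_states).max(dim=1).values
                targets = rewards + gamma * (1.0 - dones) * max_next_q

            # 6-7. MSE loss and a gradient step on theta
            q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            loss = F.mse_loss(q_sa, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # 8. periodically copy theta into theta_hat
        if step % sync_every == 0:
            target_net.load_state_dict(q_net.state_dict())
```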

8. Multi-step Learning

Multi-step learning computes the learning target with the n-step return. With an appropriately chosen number of steps, this can improve the speed of learning. The loss function used in multi-step learning is given below.
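Written out in its usual form (an assumption about the intended formula, since the note does not spell it out), the n-step loss replaces the one-step return with the n-step return, with $\hat\theta$ again the target-network parameters:
$L(\theta) = \left[ \sum_{k=0}^{n-1} \gamma^{k} r_{t+k+1} + \gamma^{n} \max_a Q(s_{t+n}, a; \hat\theta) - Q(s_t, a_t; \theta) \right]^2$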