
DQN

Created: 2024/07/22 02:58
Tags: Reinforcement Learning
Author:

1. DQN Overview

• A CNN architecture is adopted to handle high-dimensional image input
  ◦ This makes it possible to handle high-dimensional / continuous state spaces
• Q-values are represented by a Q-network
  ◦ Since the network must output a Q-value for every action, the action space has to be discrete and finite, and of much smaller dimension than the state space

2. DQN Architecture

• Input
  ◦ Four frames are stacked together and fed in as a single input so that the input satisfies the Markov property
  ◦ A single frame does not carry enough information (e.g., about velocity and direction) to predict the future
  ◦ Because the network takes a specific state as input and computes its Q-values, there are few constraints on the dimensionality of the state space
• Convolution layers
  ◦ Convolutions are applied to the input image to turn it into a lower-dimensional feature vector
  ◦ The feature maps are flattened so that the fully connected (FC) layers can process them
  ◦ Positional information matters in games, so max pooling, which smears positional information, is not used
• Output
  ◦ A Q-value for each action
  ◦ The network parameters are updated with the goal of making these Q-values approximate the optimal Q-values (a code sketch of the architecture follows below)
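
A minimal PyTorch sketch of the architecture described above. It follows the commonly cited Atari DQN layout (84×84 inputs, three conv layers, two FC layers); the exact layer sizes and the `n_actions` argument are illustrative assumptions, not values taken from this note.

```python
import torch
import torch.nn as nn

class DQNNetwork(nn.Module):
    """Maps a stack of 4 grayscale frames to one Q-value per discrete action."""

    def __init__(self, n_actions: int):
        super().__init__()
        # Convolutions only: no max pooling, so positional information is preserved.
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),                        # flatten feature maps for the FC layers
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),           # one Q-value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, 84, 84) stacked frames scaled to [0, 1]
        return self.fc(self.conv(x))
```

The greedy action is then simply `net(state).argmax(dim=1)`, which is why the action space has to be discrete and reasonably small.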

3. Naive DQN

• Q-learning
  $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$
  ◦ Target policy: actions are selected by the greedy policy with respect to Q
  ◦ Behavior policy: actions are selected by the $\epsilon$-greedy policy with respect to Q
  ◦ Learning moves in the direction that shrinks the gap between the Q-value at the current state and the target
• Naive DQN
  $L(\theta) = \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a; \theta) - Q(s_t, a_t; \theta) \right]^2$
  ◦ The loss is defined as an MSE and the parameters are updated in the direction that minimizes it
• Problems
  ◦ Temporal correlations
    ▪ In online RL, parameter updates are made from transitions obtained as time unfolds
    ▪ Consecutive transitions are correlated ⇒ the network can over-adapt to specific situations (overfitting)
    ▪ Rare but important experiences are not learned from sufficiently
  ◦ Non-stationary target
    ▪ The Q-network used to compute the target is the same network whose Q-values are being updated, so the target function keeps changing and convergence becomes difficult (see the loss sketch below)
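
As a rough illustration of why the target is non-stationary, here is a minimal sketch of the naive DQN loss, assuming a network such as the hypothetical `DQNNetwork` above; termination masking is omitted for brevity. The same parameters θ appear in both the prediction and the target, so every gradient step also moves the target.

```python
import torch
import torch.nn.functional as F

def naive_dqn_loss(q_net, s, a, r, s_next, gamma=0.99):
    """MSE between Q(s_t, a_t; theta) and r_{t+1} + gamma * max_a Q(s_{t+1}, a; theta)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s_t, a_t; theta)
    with torch.no_grad():                                      # semi-gradient: target treated as a constant
        target = r + gamma * q_net(s_next).max(dim=1).values   # the same network builds the target
    return F.mse_loss(q_sa, target)
```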

4. Replay buffer

• Resolves the temporal correlation problem
• Transitions (s, a, r, s') are stored in the replay buffer, and minibatch-sized random samples are drawn from it for training
• Temporal correlation between transitions is reduced (a batch can be composed of transitions from different experiences)
• Training is faster because minibatches are used
• Past experience can be reused (transitions in the buffer are not discarded immediately); a minimal buffer sketch follows below
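
A minimal sketch of such a buffer (class and method names are mine, not from the original note):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s') transitions and returns randomly sampled minibatches."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)   # the oldest transition is dropped automatically

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the temporal ordering of consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Using `deque(maxlen=capacity)` also matches step 3 of the algorithm below: pushing a new transition into a full buffer silently evicts the oldest one.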

5. Target Network

• Training starts with the target Q-network parameters $\hat\theta$ set equal to the behavior Q-network parameters $\theta$
• During training, $\hat\theta$ is held fixed while $\theta$ is updated continuously
• After a fixed number of steps, $\hat\theta$ is overwritten with $\theta$
• This scheme resolves the problem of the target function changing continuously (see the sketch below)
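
A minimal sketch of this scheme, reusing the hypothetical `DQNNetwork` from above; the action count and sync interval are placeholder values:

```python
import copy

behavior_net = DQNNetwork(n_actions=4)    # behavior parameters (theta), updated every training step
target_net = copy.deepcopy(behavior_net)  # target parameters (theta-hat), start identical to theta
target_net.eval()

def maybe_sync_target(step: int, sync_interval: int = 10_000) -> None:
    # Every `sync_interval` steps, copy theta into theta-hat; between syncs the target stays frozen.
    if step % sync_interval == 0:
        target_net.load_state_dict(behavior_net.state_dict())
```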

6. DQN

1. In the current state $S_t$, select an action $a_t$ by the $\epsilon$-greedy policy, using the behavior network's parameters
2. Store the resulting transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in the replay buffer
3. The replay buffer holds the most recent N transitions; as the new transition from step 2 is stored, the oldest transition is deleted
4. Randomly sample a minibatch of transitions from the replay buffer
5. For the sampled transitions, compute the target values using the target network's parameters
6. Compute the loss function $L(\theta)$
7. Update the behavior network's parameters
8. After a fixed number of steps, update the target network's parameters (a sketch of the full loop follows below)
** Objective function setup and differentiation
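
Putting steps 1-8 together, a rough training-loop sketch might look like the following. It assumes the hypothetical `DQNNetwork`, `ReplayBuffer`, and target-network code from the earlier sections, an old-style Gym environment whose `step` returns `(s', r, done, info)`, and illustrative hyperparameters; termination masking and frame preprocessing are omitted for brevity.

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

def train_dqn(env, behavior_net, target_net, buffer,
              n_steps=100_000, batch_size=32, gamma=0.99,
              epsilon=0.1, sync_interval=10_000, lr=1e-4):
    optimizer = torch.optim.Adam(behavior_net.parameters(), lr=lr)
    s = env.reset()
    for step in range(1, n_steps + 1):
        # 1. epsilon-greedy action selection with the behavior network (theta)
        if random.random() < epsilon:
            a = env.action_space.sample()
        else:
            with torch.no_grad():
                q = behavior_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))
                a = int(q.argmax(dim=1).item())
        # 2./3. store the transition; a full buffer evicts its oldest entry
        s_next, r, done, _ = env.step(a)
        buffer.push(s, a, r, s_next)
        s = env.reset() if done else s_next

        if len(buffer) >= batch_size:
            # 4. sample a random minibatch of transitions
            ss, aa, rr, ss_next = zip(*buffer.sample(batch_size))
            ss = torch.as_tensor(np.array(ss), dtype=torch.float32)
            aa = torch.as_tensor(aa, dtype=torch.int64)
            rr = torch.as_tensor(rr, dtype=torch.float32)
            ss_next = torch.as_tensor(np.array(ss_next), dtype=torch.float32)
            # 5. targets come from the frozen target network (theta-hat)
            with torch.no_grad():
                y = rr + gamma * target_net(ss_next).max(dim=1).values
            # 6. MSE loss between Q(s_t, a_t; theta) and the target
            q_sa = behavior_net(ss).gather(1, aa.unsqueeze(1)).squeeze(1)
            loss = F.mse_loss(q_sa, y)
            # 7. update the behavior network parameters (theta)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # 8. periodically copy theta into the target network (theta-hat)
        if step % sync_interval == 0:
            target_net.load_state_dict(behavior_net.state_dict())
```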

8. Multi-step Learning

In multi-step learning, the target is computed using the n-step return. With an appropriately chosen number of steps, this can improve learning speed. The loss function used in multi-step learning is given below.
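
One standard way to write this n-step loss, assuming the target network parameters $\hat\theta$ from section 5, would be:

$$L(\theta) = \left[\sum_{k=0}^{n-1} \gamma^{k} r_{t+k+1} + \gamma^{n} \max_{a} Q(s_{t+n}, a; \hat\theta) - Q(s_t, a_t; \theta)\right]^{2}$$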