Dueling DQN

Created
2024/07/22 03:00
Tags
Reinforcement learning
Author

1. Dueling DQN overview

Dueling DQN computes the advantage function and the state-value function separately, then combines them to obtain the action-value function. The CNN encoder is shared; only the parameters of the FC layers differ between the two functions.
As a result, Dueling DQN learns both the state-value function and the advantage function. This makes it useful for evaluating the value of a state even when the choice of action matters little in the environment, and it also gives a value estimate for states in general.
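The two-stream layout described above can be sketched in plain Python. This is only an illustration under stated assumptions: a single dense layer stands in for the shared CNN encoder, and the names `DuelingHeads`, `init_layer`, `linear`, and the layer sizes are hypothetical, not taken from the original post.

```python
import random


def linear(x, weights, bias):
    """Dense layer; `weights` holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class DuelingHeads:
    """Shared encoder followed by separate value and advantage heads."""

    def __init__(self, n_obs, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        # Assumption: one dense layer stands in for the shared CNN encoder.
        self.encoder = init_layer(n_obs, n_hidden, rng)
        # Separate FC parameters per stream, as in the text.
        self.value_head = init_layer(n_hidden, 1, rng)
        self.advantage_head = init_layer(n_hidden, n_actions, rng)

    def forward(self, obs):
        features = relu(linear(obs, *self.encoder))          # shared features
        value = linear(features, *self.value_head)[0]        # scalar V(s)
        advantages = linear(features, *self.advantage_head)  # one A(s,a) per action
        return value, advantages


net = DuelingHeads(n_obs=4, n_hidden=8, n_actions=3)
value, advantages = net.forward([0.1, -0.2, 0.3, 0.4])
print(value, advantages)
```

Both heads read the same `features`, so the encoder is shared while the value head emits a single scalar and the advantage head emits one number per action.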

2. Value function

Usually the focus is on learning a policy, that is, which action to choose in a given state, so the goal is to learn $Q(s,a)$. However, when the action has almost no effect on the environment in a particular state, learning the value of that state, $V(s)$, can matter more for judging the state.
In the example above, the cars in the first case are far away, so the choice of action matters less. Accordingly, only the state-value stream takes the cars ahead into account. Conversely, when a car is close, the choice of action matters more, so the advantage stream also tends to focus on the individual cars.

3. Identifiability issue

The relationship between the action-value function, the advantage function, and the state-value function is as follows.

$$Q^{\pi}(s,a) = V^{\pi}(s) + A^{\pi}(s,a), \qquad V^{*}(s) = \max_{a} Q^{*}(s,a)$$
The point to note here is that $V^{*}(s) = \max_{a} Q^{*}(s,a)$: applying the action that maximizes the Q-value yields the optimal state value, so the advantage function is 0 for the optimal action. However, Dueling DQN computes the state-value function and the advantage function with separate network streams, so this property is hard to guarantee for the optimal action. In addition, the state value comes out as a single scalar while the advantage function has one entry per action, so the two outputs also differ in shape.

Furthermore, for any given Q-value, the decomposition $(V(s)+\text{const}) + (A(s,a)-\text{const})$ produces exactly the same Q, so $V(s;\theta,\beta)$ and $A(s,a;\theta,\alpha)$ cannot be regarded as good estimators. This is the identifiability issue. To resolve it, the former property (zero advantage for the optimal action) has to be enforced. The idea is that if one of $V(s)$ and $A(s,a)$ becomes an accurate estimator, the other naturally becomes a correct estimator as well. Therefore an advantage term for the optimal action is added to the aggregation, as follows.

$$Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \Big(A(s,a;\theta,\alpha) - \max_{a'} A(s,a';\theta,\alpha)\Big)$$
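The ambiguity described above can be checked numerically. The following is a minimal sketch with made-up numbers; `naive_q` and `max_q` are hypothetical helper names, and `value` / `advantages` simply play the role of the two stream outputs.

```python
def naive_q(value, advantages):
    """Q(s,a) = V(s) + A(s,a) with no constraint on A."""
    return [value + a for a in advantages]


def max_q(value, advantages):
    """Q(s,a) = V(s) + (A(s,a) - max_a' A(s,a'))."""
    best = max(advantages)
    return [value + (a - best) for a in advantages]


value = 2.0
advantages = [0.5, -1.0, 1.5]

# Move a constant from A into V: a different (V, A) pair.
shift = 10.0
shifted_value = value + shift
shifted_advantages = [a - shift for a in advantages]

# Naive sum: both pairs give the same Q, so V and A are not identifiable.
q_original = naive_q(value, advantages)
q_shifted = naive_q(shifted_value, shifted_advantages)
print(q_original, q_shifted)

# Max-subtracted form: the best action has zero advantage, so max_a Q == V.
q_max = max_q(value, advantages)
q_max_shifted = max_q(shifted_value, shifted_advantages)
print(q_max, max(q_max) == value)
print(q_max_shifted)
```

With the naive sum the two pairs are indistinguishable from Q alone. With the max-subtracted form the shift inside the advantages cancels, so changing V changes Q, and the value stream is pinned to the Q-value of the best action.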
However, this form has a problem of its own: the max term fluctuates heavily, which reduces stability. So instead of subtracting the max, the mean is subtracted.

$$Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \Big(A(s,a;\theta,\alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a';\theta,\alpha)\Big)$$
Subtracting the maximum was originally meant to keep the meanings of the action-value function, the state-value function, and the advantage function consistent with one another. Replacing it with the mean introduces an error of (max - mean), but in exchange training becomes much more stable. In the end the goal is to find the action that maximizes Q, which is the same as maximizing the term built from the advantage function. Whether the max or the mean is subtracted, the same constant is removed from every action in a given state, so the ranking of actions is unchanged and the resulting policy is not affected.
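The claim that the choice between max and mean does not change the greedy action can be illustrated with a short sketch. The numbers are made up, and `aggregate_max`, `aggregate_mean`, and `argmax` are hypothetical helper names.

```python
def aggregate_max(value, advantages):
    """Q(s,a) = V(s) + (A(s,a) - max_a' A(s,a'))."""
    best = max(advantages)
    return [value + (a - best) for a in advantages]


def aggregate_mean(value, advantages):
    """Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a'))."""
    mean = sum(advantages) / len(advantages)
    return [value + (a - mean) for a in advantages]


def argmax(xs):
    """Index of the largest element."""
    return max(range(len(xs)), key=xs.__getitem__)


value = 2.0
advantages = [0.5, -1.0, 1.5]

q_max = aggregate_max(value, advantages)
q_mean = aggregate_mean(value, advantages)

# The two Q vectors differ by the same constant (max - mean) for every action.
gap = max(advantages) - sum(advantages) / len(advantages)
offsets = [qm - qx for qm, qx in zip(q_mean, q_max)]
print(gap, offsets)

# A constant offset does not change which action is greedy.
print(argmax(q_max), argmax(q_mean))
```

Because the offset is identical for every action in the state, the Q-values shift together and the greedy policy is the same under either aggregation.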