
Dueling DQN

Created
2024/07/22 03:00
Tags
Reinforcement learning
Author

Dueling DQN

1. Dueling DQN Overview

• Compute the advantage function and the state-value function separately, then combine them to obtain the action-value function.
• The CNN encoder is shared; only the parameters of the FC layers differ between the two functions.
• Useful for evaluating the value of a state even when the choice of action matters little in the environment.
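
The structure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a trainable implementation: a single fully connected layer stands in for the shared CNN encoder, and the names `DuelingNet`, `init_layer`, `linear`, and all layer sizes are assumptions made for this example. The two streams are combined with a plain sum here; the identifiability section of these notes covers why the max or the mean of the advantages is subtracted in practice.

```python
import random


def linear(weights, bias, x):
    """Fully connected layer: W @ x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def init_layer(rng, n_out, n_in):
    """Hypothetical helper: small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class DuelingNet:
    """Shared encoder followed by separate value and advantage streams."""

    def __init__(self, n_obs, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        # Shared parameters (theta): one FC layer standing in for the CNN encoder.
        self.encoder = init_layer(rng, n_hidden, n_obs)
        # Value stream (beta): a single scalar output V(s).
        self.value = init_layer(rng, 1, n_hidden)
        # Advantage stream (alpha): one output per action, A(s, a).
        self.advantage = init_layer(rng, n_actions, n_hidden)

    def forward(self, obs):
        features = relu(linear(*self.encoder, obs))
        v = linear(*self.value, features)[0]
        adv = linear(*self.advantage, features)
        # Plain sum: the scalar V(s) is broadcast over every action.
        q = [v + a for a in adv]
        return v, adv, q


net = DuelingNet(n_obs=4, n_hidden=8, n_actions=3)
v, adv, q = net.forward([0.5, -1.0, 0.25, 2.0])
print(len(adv), len(q))  # 3 3
```

Both streams read the same `features`, so only the final FC parameters differ between the value and advantage estimates.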

2. Value function

• The usual focus is on which action to take in a given state (policy)
  ◦ Learn $Q(s,a)$
• When the action has almost no effect on the environment in a given state, learning $V(s)$ matters more for judging the value of the state
• First case (the cars are far away)
  ◦ The choice of action matters less
  ◦ The cars ahead are taken into account only when computing the state value
• Second case (a car is close)
  ◦ Which action is taken becomes important
  ◦ The advantage computation also attends to each of the cars
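
The two cases can be made concrete with a small numeric sketch. The Q-values below are made up purely for illustration (three hypothetical actions such as left, stay, right), and the decomposition assumes a greedy policy so that $V(s) = \max_a Q(s,a)$ and $A(s,a) = Q(s,a) - V(s)$. The helper names `decompose` and `advantage_spread` are also assumptions.

```python
def decompose(q_values):
    """Split Q(s, .) into V(s) = max_a Q(s, a) and A(s, a) = Q(s, a) - V(s)."""
    v = max(q_values)
    adv = [q - v for q in q_values]
    return v, adv


def advantage_spread(adv):
    """Gap between the best and the worst action in a state."""
    return max(adv) - min(adv)


# Made-up Q-values, not measured from any environment.
q_far = [10.0, 10.25, 10.125]   # cars far away: every action is about equally good
q_near = [2.0, 8.0, -5.0]       # car close: one action is clearly better

v_far, adv_far = decompose(q_far)
v_near, adv_near = decompose(q_near)

print(v_far, adv_far, advantage_spread(adv_far))     # 10.25 [-0.25, 0.0, -0.125] 0.25
print(v_near, adv_near, advantage_spread(adv_near))  # 8.0 [-6.0, 0.0, -13.0] 13.0
```

In the first state almost all of the information sits in the state value, so an accurate $V(s)$ matters more than ranking the actions; in the second state the advantages carry the decision.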

3. Identifiability issue

• Relationship between the action-value function, the advantage function, and the state-value function: $Q(s,a) = V(s) + A(s,a)$
• Taking the action that maximizes the Q-value yields the optimal state value
  ◦ For the optimal action, the advantage function is 0
• Dueling DQN computes the state-value function and the advantage function with separate networks (FC streams)
  ◦ It is hard to guarantee this zero-advantage property for the optimal action
  ◦ The state value is a single scalar, while the advantage function is evaluated for every action, so the two outputs have different shapes
  ◦ The decomposition of a given Q-value is not unique
    ▪ $(V(s) + \text{const}) + (A(s,a) - \text{const})$ gives the same $Q(s,a)$
    ▪ So $V(s; \theta, \beta)$ and $A(s,a; \theta, \alpha)$ cannot be regarded as good estimators of the state value and the advantage
• Option 1) Add a term for the advantage of the optimal action (subtract $\max_{a'} A(s,a'; \theta, \alpha)$ from the advantage stream)
  ◦ The max term fluctuates heavily (it changes often), which reduces stability
• Option 2) Subtract the mean of the advantages instead
  ◦ For the optimal action, this introduces an error of (max - mean)
• The goal is to find the action that maximizes Q
  ◦ This is equivalent to maximizing the term made up of the advantage function
  ◦ Whether the max or the mean is subtracted, finding the policy is unaffected
    ▪ Because the state-value function does not vary with the action
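
The points in this section can be checked numerically. The sketch below uses made-up values for $V(s)$ and $A(s,a)$; the function names (`combine_naive`, `combine_max`, `combine_mean`, `argmax`) are assumptions for this example, and the two aggregation rules follow the standard dueling formulation, $Q = V + (A - \max_{a'} A)$ and $Q = V + (A - \text{mean}_{a'} A)$.

```python
def combine_naive(v, adv):
    """Q(s, a) = V(s) + A(s, a): the decomposition is not unique."""
    return [v + a for a in adv]


def combine_max(v, adv):
    """Option 1: subtract the max advantage, so the best action has zero advantage."""
    top = max(adv)
    return [v + (a - top) for a in adv]


def combine_mean(v, adv):
    """Option 2: subtract the mean advantage."""
    mean = sum(adv) / len(adv)
    return [v + (a - mean) for a in adv]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Made-up stream outputs for one state with three actions.
v, adv = 5.0, [1.0, 3.0, -1.0]

# Identifiability issue: shifting V up and A down by the same constant
# leaves the naive Q-values unchanged.
const = 2.0
shifted_v, shifted_adv = v + const, [a - const for a in adv]
print(combine_naive(v, adv))                  # [6.0, 8.0, 4.0]
print(combine_naive(shifted_v, shifted_adv))  # [6.0, 8.0, 4.0]

# Option 1: the Q-value of the best action equals V(s).
q_max = combine_max(v, adv)
print(q_max, max(q_max) == v)                 # [3.0, 5.0, 1.0] True

# Option 2: the best action's Q-value is off from V(s) by (max - mean).
q_mean = combine_mean(v, adv)
print(q_mean, max(q_mean) - v)                # [5.0, 7.0, 3.0] 2.0

# The greedy action is the same under every rule, because the subtracted
# term does not depend on the action.
print(argmax(adv), argmax(q_max), argmax(q_mean))  # 1 1 1
```

With either constraint, a shifted pair such as `(shifted_v, shifted_adv)` no longer produces the same Q-values as the original pair, which is what removes the ambiguity in the decomposition.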