๐Ÿ’ช๐Ÿผ

Deep Deterministic Policy Gradient

์ƒ์„ฑ์ผ
2024/09/03 12:14
ํƒœ๊ทธ
๊ฐ•ํ™”ํ•™์Šต
์ž‘์„ฑ์ž

Introduction

DQN์€ Neural network๋ฅผ ์Œ“์Œ์œผ๋กœ์จ Q-learning ์˜ high dimension observation space์„ input์œผ๋กœ ๋ฐ›์•„ ์“ธ ์ˆ˜ ์žˆ๊ฒŒ ๋˜์—ˆ์ง€๋งŒ, high dimension action space์—์„œ๋Š” ์ž˜ ์ž‘๋™ํ•˜์ง€ ์•Š์•˜์Œ.
๊ตณ์ด๊ตณ์ด continuous action space์—์„œ๋„ DQN์„ ์ ์šฉ ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค. continous ํ•œ action ์ด์ง€๋งŒ ์ด๋ฅผ ๋ช‡ ๊ฐ€์ง€์˜ discrete ํ•œ action์œผ๋กœ discretize ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.
๊ทธ๋Ÿฌ๋‚˜ action์˜ ๊ฐœ์ˆ˜๋Š” freedom์˜ ์ •๋„์— ๋”ฐ๋ผ ์ง€์ˆ˜์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋‹จ์ˆœํžˆ ํ–‰๋™์ด 3๊ฐœ๋กœ discretize ํ•œ๋‹ค๊ณ  ํ•œ๋“ค freedom์ด 7์ด๋ผ๋ฉด ์ „์ฒด action space์˜ dimension์€ 373^7 (= 2187) ์ด ๋˜๊ธฐ ๋–„๋ฌธ์— Curse of dimensionality ์— ๋น ์ ธ๋ฒ„๋ฆด ์ˆ˜ ๋ฐ–์— ์—†๋‹ค. (๋‹น์žฅ ๋กœ๋ด‡ ํŒ” ์›€์ง์ด๊ฒŒ ํ•˜๋Š” ๊ฒƒ๋งŒ ํ•ด๋„ discretize ํ•ด์„œ ์“ฐ๋Š”๊ฒŒ ๋ถˆ๊ฐ€๋Šฅํ•ด๋ณด์ž„)
In addition, vanilla actor-critic methods become unstable on hard tasks once neural function approximators are used.
DDPG therefore combines actor-critic with the mechanisms of DQN (replay buffer, target Q network) so that function approximation can be learned stably.

Background

J = \mathbb{E}_{r_i, s_i \sim E,\, a_i \sim \pi}\left[ R_1 \right]
The goal of reinforcement learning is to learn a policy that maximizes the expected return. This return is computed over the distribution of states and actions induced by following the policy π.
• Bellman Equation
Q^\pi(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\left[ r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\left[ Q^\pi(s_{t+1}, a_{t+1}) \right] \right]
a_{t+1} is one of the actions that may be chosen stochastically at state s_{t+1} under the policy π. Hence, if the policy is stochastic, an expectation must be taken over a_{t+1}.
• Bellman Equation (deterministic policy)
Q^\mu(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\left[ r(s_t, a_t) + \gamma\, Q^\mu(s_{t+1}, \mu(s_{t+1})) \right]
If the policy μ is deterministic, the uncertainty about which a_{t+1} will be taken at s_{t+1} disappears: μ(s_{t+1}) is uniquely determined. In other words, the action is fixed to a single choice.
Looking at the equation under a deterministic policy, the expectation now depends only on the environment. That is, how the environment responds to a particular action in a particular state does not depend on the policy at all.
Therefore, even if a different policy (e.g., β) is used to collect data, that data can still be used to learn the target policy μ (i.e., off-policy learning).
• Q-learning
DQN selects actions ε-greedily, but here the policy is deterministic, so the greedy policy μ(s) = argmax_a Q(s, a) is used.
• Loss
L(\theta^Q) = \mathbb{E}_{s_t \sim \rho^\beta,\, a_t \sim \beta,\, r_t \sim E}\left[ \left( Q(s_t, a_t \mid \theta^Q) - y_t \right)^2 \right]
์œ„์—์„œ ๋งํ–ˆ๋“ฏ data๋ฅผ ์ˆ˜์ง‘ํ•  ๋•Œ์—๋Š” policy ฮฒ\beta ๋ฅผ ๊ฐ€์ง€๊ณ  ํ•˜๋ฉฐ, Loss๋Š” ฮธQ\theta^Q ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Q-function๋ฅผ approximationํ•˜๊ณ  ์žˆ๋‹ค. L(ฮธQ)L(\theta^Q) ๋ฅผ minimizeํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ์ง„ํ–‰ํ•˜๋ฉฐ, DQN ๊ฐ™์ด Mean Squared Error ๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.
• target Q
y_t = r(s_t, a_t) + \gamma\, Q(s_{t+1}, \mu(s_{t+1}) \mid \theta^Q)
y_t is the sum of the immediate reward r(s_t, a_t) and the discounted Q-value of the action chosen at s_{t+1}. y_t also depends on θ^Q, but this dependency is generally ignored.
→ y_t serves as the target when computing the loss. Since training updates θ^Q so that Q(s_t, a_t | θ^Q) matches y_t as closely as possible, y_t is treated as a fixed value that is not directly affected by θ^Q at that time step.
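A minimal PyTorch sketch of this loss, assuming hypothetical `critic` and `actor` modules (the critic takes a state-action pair) and a batch of transition tensors (s, a, r, s_next, done) with matching shapes, collected by the behavior policy β:

```python
import torch
import torch.nn.functional as F

# Sketch of the Q-learning loss above. `critic` approximates Q(s, a | theta^Q) and
# `actor` is the deterministic policy mu; both are assumed to be torch.nn.Module instances.
def q_loss(critic, actor, batch, gamma=0.99):
    s, a, r, s_next, done = batch               # tensors sampled off-policy
    with torch.no_grad():                       # y_t is treated as a fixed target
        y = r + gamma * (1.0 - done) * critic(s_next, actor(s_next))
    q = critic(s, a)                            # Q(s_t, a_t | theta^Q)
    return F.mse_loss(q, y)                     # mean squared error, as in DQN
```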

Algorithms

Because of the problem with Q-learning noted above (in a continuous domain, finding the greedy policy requires an optimization over a_t at every time step, which is far too expensive when combined with a neural network), it cannot be used directly in continuous domains, so the Deterministic Policy Gradient is used instead.

DPG (Deterministic Policy Gradient)

• Parameterized actor function \mu(s \mid \theta^\mu)
A policy that deterministically maps each state to a specific action. It is updated by applying the chain rule to the expected return J, from the start distribution, with respect to the actor parameters.
• Critic function Q(s, a)
As in Q-learning, the critic is learned via the Bellman equation.
• Policy Gradient
Both of the equations below apply the chain rule to compute this gradient.
\nabla_{\theta^\mu} J \approx \mathbb{E}_{s_t \sim \rho^\beta}\left[ \nabla_{\theta^\mu} Q(s, a \mid \theta^Q) \big|_{s=s_t,\, a=\mu(s_t \mid \theta^\mu)} \right]
๋จผ์ €, ์œ„์˜ ์ˆ˜์‹์€ Q-function์˜ action aa ์— ๋Œ€ํ•œ gradient๋ฅผ ์˜๋ฏธํ•˜๋ฉฐ,
\nabla_{\theta^\mu} J \approx \mathbb{E}_{s_t \sim \rho^\beta}\left[ \nabla_a Q(s, a \mid \theta^Q) \big|_{s=s_t,\, a=\mu(s_t)}\, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu) \big|_{s=s_t} \right]
์œ„ ์ˆ˜์‹์€ policy ฮผ\mu ์˜ parameter ฮธฮผ \theta^\mu ์— ๋Œ€ํ•œ gradient๋ฅผ ์˜๋ฏธํ•œ๋‹ค.

NFQCA (Neural Fitted Q Iteration with Continuous Actions)

As with Q-learning, adding a non-linear function approximator can break convergence; conversely, in continuous domains linear approximators alone make learning difficult.
NFQCA uses the same update rules as DPG but, in exchange for using non-linear function approximators, adds batch learning to ensure stability. In other words, NFQCA without batch learning is identical to DPG.
DDPG therefore builds on the NFQCA algorithm and, through some modifications of DPG, makes learning possible in continuous domains as well.

DDPG (Deep Deterministic Policy Gradient)

• mini-batch learning
As discussed earlier, when neural networks are used in reinforcement learning, most optimization algorithms rely on the assumption that samples are independent and identically distributed.
However, when samples are generated sequentially by exploring the environment, as in on-policy methods, correlations between states and actions arise and learning can become unstable.
→ A replay buffer is therefore used and training is done on mini-batches sampled from it, as in the sketch below.
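A minimal sketch of such a replay buffer (class and method names are just for illustration):

```python
import random
from collections import deque

# Transitions are stored in a bounded deque and mini-batches are drawn uniformly at
# random, which breaks the temporal correlation of sequentially generated samples.
class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))   # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```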
• Soft target update
Since the network Q(s, a | θ^Q) being updated is also used to compute the target value, the network can change too quickly during training and fail to converge.
DDPG likewise uses target networks, but adapts them to the actor-critic structure: instead of copying the network weights directly, the target weights are changed slowly.
Copies Q'(s, a | θ^{Q'}) and μ'(s | θ^{μ'}) are created and used to compute the targets.
\theta' \leftarrow \tau \theta + (1 - \tau)\, \theta'
๋‹ค์Œ๊ณผ ๊ฐ™์ด ฯ„\tau ๋ฅผ 1๋ณด๋‹ค ์ž‘์€ ์ˆ˜๋กœ ๊ฒฐ์ •ํ•˜์—ฌ ํ•™์Šต๋œ network์˜ weight๋ฅผ ์ฒœ์ฒœํžˆ ๋”ฐ๋ผ๊ฐ€๋„๋ก ๋งŒ๋“ค์—ˆ๋‹ค.
์œ„์˜ 2๊ฐ€์ง€ ๋ฐฉ๋ฒ•์„ ์ถ”๊ฐ€ํ•˜์—ฌ Q-learning์ด ๊ฐ€์ง€๊ณ  ์žˆ๋˜ unstability ๋ฌธ์ œ๋ฅผ supervised learning ๋ฌธ์ œ๋กœ ๋ฐ”๊พธ์–ด ์•ˆ์ •์ ์œผ๋กœ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋„๋ก ๋งŒ๋“ค์–ด์ฃผ์—ˆ๋‹ค. ์ถ”๊ฐ€์ ์œผ๋กœ target policy ฮผโ€ฒ \mu' ์™€ target Q-function Qโ€ฒQ' ๋ชจ๋‘ ์“ฐ๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค.
์ด๋ฅผ ํ†ตํ•ด Critic์ด ๋ฐœ์‚ฐํ•˜์ง€ ์•Š๊ณ  ์•ˆ์ •์ ์œผ๋กœ ์ˆ˜๋ ดํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค๊ณ  ํ•œ๋‹ค.
๋ฌผ๋ก  soft target update์˜ ๊ฒฝ์šฐ ์ฒœ์ฒœํžˆ ๋”ฐ๋ผ๊ฐ€๋Š” ๋ฐฉ์‹์ด๋‹ค๋ณด๋‹ˆ ํ•™์Šต ์†๋„๋Š” ๋Š๋ฆฌ์ง€๋งŒ, soft target update๋ฅผ ์ผ์„ ๋•Œ์™€ ์•ˆ์ผ์„ ๋•Œ์˜ ์•ˆ์ •์„ฑ ์ฐจ์ด๊ฐ€ ํฌ๊ฒŒ ๋‚ฌ๋‹ค๊ณ  ํ•œ๋‹ค.
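A minimal PyTorch sketch of the soft update, applied to both the target actor and the target critic after each training step; a small value such as τ = 0.001 is the kind of setting the paper has in mind:

```python
import torch

# theta' <- tau * theta + (1 - tau) * theta', done in place on the target network.
@torch.no_grad()
def soft_update(target_net, online_net, tau=0.001):
    for target_param, param in zip(target_net.parameters(), online_net.parameters()):
        target_param.mul_(1.0 - tau).add_(tau * param)
```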
• Problems with low-dimensional feature vectors
When observations are low-dimensional feature vectors, their components can have different physical units (e.g., position, velocity), and their ranges can differ across environments.
These differences can prevent the network from learning effectively and make it hard to find hyperparameters that generalize across state values with different ranges.
To solve this, batch normalization is used: each dimension is normalized within the mini-batch to have zero mean and unit variance.
In the low-dimensional case, batch normalization was applied to the state input and all of its layers, as well as all layers of the Q-network.
As a result, learning worked effectively even on tasks with different physical units and ranges.
• Exploration in continuous action space
As seen above, DDPG is an off-policy algorithm: the policy β used to collect data differs from the policy μ being trained.
\mu'(s_t) = \mu(s_t \mid \theta^\mu) + \mathcal{N}
Since a deterministic policy always takes the same action in a given state, noise is added to the chosen action to encourage more exploration.
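The paper implements the noise process 𝒩 with an Ornstein-Uhlenbeck process to obtain temporally correlated exploration noise; below is a minimal sketch (the parameter values are common defaults, not prescriptions):

```python
import numpy as np

# Ornstein-Uhlenbeck process: noise that reverts toward mu but stays correlated in time.
class OUNoise:
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.full(action_dim, mu, dtype=np.float64)

    def reset(self):
        self.state[:] = self.mu

    def sample(self):
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(len(self.state))
        self.state = self.state + dx
        return self.state

# Behavior policy:  mu'(s_t) = mu(s_t | theta^mu) + N_t
# action = actor(state) + noise.sample()
```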

DDPG pseudo code
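Below is a compact Python sketch of the overall training loop that ties the pieces above together. All component names (actor, critic, their target copies, the ReplayBuffer, the OUNoise process, and the classic Gym-style env API) are assumptions carried over from the earlier sketches, not the paper's exact pseudocode.

```python
import copy
import numpy as np
import torch
import torch.nn.functional as F

# DDPG training-loop sketch. `env` is assumed to follow the classic Gym API
# (reset() -> state, step(action) -> (next_state, reward, done, info)).
def train_ddpg(env, actor, critic, buffer, noise,
               episodes=1000, batch_size=64, gamma=0.99, tau=0.001):
    target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

    def to_tensor(x):
        return torch.as_tensor(np.asarray(x), dtype=torch.float32)

    for _ in range(episodes):
        state, done = env.reset(), False
        noise.reset()
        while not done:
            # a_t = mu(s_t | theta^mu) + N_t  (exploration noise added to the action)
            actor.eval()                      # avoid BatchNorm issues on a batch of one
            with torch.no_grad():
                action = actor(to_tensor(state).unsqueeze(0)).squeeze(0).numpy()
            actor.train()
            action = action + noise.sample()

            next_state, reward, done, _ = env.step(action)
            buffer.push(state, action, reward, next_state, float(done))
            state = next_state
            if len(buffer) < batch_size:
                continue

            s, a, r, s2, d = map(to_tensor, buffer.sample(batch_size))
            r, d = r.unsqueeze(1), d.unsqueeze(1)

            # Critic update: regress Q(s, a | theta^Q) onto the target-network estimate of y_t.
            with torch.no_grad():
                y = r + gamma * (1.0 - d) * target_critic(s2, target_actor(s2))
            critic_loss = F.mse_loss(critic(s, a), y)
            critic_opt.zero_grad()
            critic_loss.backward()
            critic_opt.step()

            # Actor update: deterministic policy gradient (maximize Q by minimizing -Q).
            actor_loss = -critic(s, actor(s)).mean()
            actor_opt.zero_grad()
            actor_loss.backward()
            actor_opt.step()

            # Soft target updates for both the actor and the critic.
            with torch.no_grad():
                for net, target in ((actor, target_actor), (critic, target_critic)):
                    for p, tp in zip(net.parameters(), target.parameters()):
                        tp.mul_(1.0 - tau).add_(tau * p)
```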