
Soft Actor-Critic

์ƒ์„ฑ์ผ
2024/09/03 08:01
ํƒœ๊ทธ
๊ฐ•ํ™”ํ•™์Šต
์ž‘์„ฑ์ž

Introduction

•
Two main challenges
1.
Model-free RL algorithms are expensive in terms of sample complexity: even simple tasks can require millions of steps of data collection, and large, continuous spaces demand far more data still.
2.
Good results require careful hyperparameter tuning.
•
Problems with on-policy learning
One of the causes of poor sample efficiency in deep RL is on-policy learning.
"Off-policy" algorithms can reuse past experience (via a replay memory), whereas "on-policy" algorithms such as TRPO, PPO, and A3C collect new samples for every gradient step, which is why they suffer from high sample complexity.
Off-policy methods such as DQN, however, are hard to apply in continuous action spaces.
DDPG, an off-policy algorithm that does work in continuous spaces, therefore learns a single deterministic policy that selects the action for a given state. Being off-policy it is sample-efficient, but its very sensitive hyperparameters and instability make it hard to use in practice.
•
Maximum entropy
SAC is an RL algorithm that builds on the maximum entropy framework of soft Q-learning.
In information theory, high entropy means high uncertainty: as a distribution, it looks like a near-uniform distribution in which every outcome has roughly the same probability. As exploration proceeds, the policy typically sharpens into something closer to a Gaussian distribution.
The objective of maximum entropy RL is to maximize both the expected return and the expected entropy of the policy. The maximum entropy formulation has the considerable advantage of driving the policy toward high-reward regions.
Soft Q-learning : Policy Iteration + Maximum Entropy

Preliminaries

•
Standard RL
\sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) \right]
In standard RL, the objective is to maximize the expected sum of rewards.
•
Maximum entropy objective
J(\pi) = \sum_{t=0}^T \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t)) \right]
The entropy term measures how random the policy is; it regulates exploration (see the sketch after this list).
The larger $\alpha$ is, the more the policy explores; conversely, as $\alpha \to 0$ the objective reduces to the standard RL objective of maximizing expected reward.
The maximum entropy formulation is attractive because it pushes the policy toward high-reward regions.
•
Advantages of maximum entropy
1.
The policy is encouraged to explore more widely while giving up on clearly unpromising avenues.
2.
The policy can capture multiple modes of near-optimal behavior.
3.
Entropy-based training improves exploration substantially over prior methods, which speeds up learning.
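As a small illustration of this objective, the sketch below (not from the paper; alpha, rewards, and log_stds are made-up values) adds the entropy of a one-dimensional Gaussian policy, weighted by the temperature α, to each reward of a short rollout and compares the result with the standard return.

```python
# Toy illustration of the maximum entropy objective: entropy bonus added to
# each reward, weighted by the temperature alpha. All numbers are made up.
import torch
from torch.distributions import Normal

alpha = 0.2                                   # temperature: weight of the entropy bonus
rewards = torch.tensor([1.0, 0.5, 2.0])       # r(s_t, a_t) along a short rollout
log_stds = torch.tensor([0.0, -0.5, -1.0])    # log std of the Gaussian policy at each step

# Entropy of a 1-D Gaussian: H = 0.5 * log(2 * pi * e * sigma^2)
entropies = Normal(torch.zeros(3), log_stds.exp()).entropy()

standard_return = rewards.sum()                       # sum_t r
maxent_return = (rewards + alpha * entropies).sum()   # sum_t (r + alpha * H)
print(standard_return.item(), maxent_return.item())
# As alpha -> 0 the two objectives coincide; larger alpha rewards randomness.
```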

SAC์˜ ํŠน์ง•

1.
To handle the infinite-horizon setting, a discount factor $\gamma$ is used so that the sum of expected rewards and entropies remains finite.
2.
As mentioned above, SAC performs (soft) policy iteration: it evaluates the Q-function of the current policy and updates the policy with an off-policy gradient update. (Starting from policy iteration and applying the maximum entropy variant, one can ultimately derive SAC.)

From Soft policy iteration to Soft Actor-Critic

Soft policy evaluation

To compute the value of a policy $\pi$, we apply the maximum entropy objective.
\mathcal{T}^\pi Q(s_t, a_t) \triangleq r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p} \left[ V(s_{t+1}) \right],
For a fixed policy, the soft Q-value is computed by repeatedly applying this Bellman backup operator,
\text{where} \quad V(s_t) = \mathbb{E}_{a_t \sim \pi} \left[ Q(s_t, a_t) - \log \pi(a_t \mid s_t) \right]
This is just the familiar state value function with the entropy of the distribution $\pi(\cdot \mid s_t)$ added. Because the entropy enters as $-\log \pi$, actions that already have high probability are penalized more heavily.
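To make the soft value concrete, here is a toy calculation (three discrete actions, made-up Q-values) of V(s) = E_{a~π}[Q(s, a) − log π(a|s)], compared against the ordinary expected Q.

```python
# Toy soft state value for a discrete action space; numbers are illustrative.
import torch

q_values = torch.tensor([1.0, 2.0, 0.5])      # Q(s, a) for three actions
pi = torch.softmax(q_values, dim=0)           # some stochastic policy pi(a|s)

soft_value = (pi * (q_values - torch.log(pi))).sum()   # E[Q - log pi] = E[Q] + H(pi)
hard_value = (pi * q_values).sum()                     # ordinary expected Q, for comparison
print(soft_value.item(), hard_value.item())
```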

Soft policy improvement

To keep the policy tractable, we restrict it to some set of policies $\Pi$: we want to project the improved policy back into $\Pi$, under the constraint that $\pi \in \Pi$. For example, restricting $\pi$ to $\Pi$ might mean parameterizing it as a Gaussian distribution.
For convenience, SAC defines this projection using the KL divergence.
\pi_{\text{new}} = \arg\min_{\pi' \in \Pi} D_{\text{KL}} \left( \pi'(\cdot \mid s_t) \, \Bigg\| \, \frac{\exp\left(Q^{\pi_{\text{old}}}(s_t, \cdot)\right)}{Z^{\pi_{\text{old}}}(s_t)} \right)
Looking at the expression, the Q-values are exponentiated, so the larger the action value, the more strongly that action is emphasized,
→ which can be read as: "actions that had a high Q under the previous policy will be picked with much higher probability under $\pi_{\text{new}}$."
$Z^{\pi_{\text{old}}}(s_t)$ is the normalization term that turns the exponentiated Q-values into a proper probability distribution.
It can be proved that the value function of the $\pi_{\text{new}}$ obtained from this projection is greater than or equal to that of $\pi_{\text{old}}$.
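For a discrete action space the projection target exp(Q)/Z is just a softmax over the Q-values, as in this toy sketch (all numbers illustrative); the improvement step picks the policy in the restricted family with the smallest KL divergence to that target.

```python
# Toy view of the soft policy improvement target and the KL being minimized.
import torch
from torch.distributions import Categorical, kl_divergence

q_old = torch.tensor([1.0, 3.0, 0.0])                     # Q^{pi_old}(s, a)
target = Categorical(probs=torch.softmax(q_old, dim=0))   # exp(Q)/Z: Boltzmann distribution over actions

candidate = Categorical(probs=torch.tensor([0.2, 0.6, 0.2]))  # some pi' in the restricted set
print(kl_divergence(candidate, target).item())            # pi_new = argmin over pi' of this KL
```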

Problems with soft policy iteration

This concludes soft policy iteration. To summarize: it can be shown formally that, within the restricted policy set $\Pi$, the iteration converges to the optimal maximum entropy policy.
However, the exact optimum is attainable only in the tabular case; in continuous domains, running the iteration to convergence is far too expensive.
SAC is the practical algorithm that approximates this procedure in continuous domains.

Soft Actor-Critic

Train soft value function

Instead of running soft policy iteration exactly, we use function approximators for both the Q-function and the policy.
One notable point: in principle no separate function approximator for the state value is needed, but adding a separate network for the soft value stabilizes training.
J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}} \left[ \frac{1}{2} \left( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi} \left[ Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t) \right] \right)^2 \right]
This objective trains the soft value network by minimizing the squared residual error between the estimate $V_\psi$ and the target built from $Q_\theta$ and the policy's log-probability.
That target, $\mathbb{E}_{a_t \sim \pi_\phi}[Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t)]$, is exactly the $V^\pi(s_t)$ from soft policy iteration above.
\hat{\nabla}_\psi J_V(\psi) = \nabla_\psi V_\psi(s_t) \left( V_\psi(s_t) - Q_\theta(s_t, a_t) + \log \pi_\phi(a_t \mid s_t) \right)
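A minimal sketch of this value update in code, under assumed interfaces that are not from the paper: value_net(s) and q_net(s, a) return (batch, 1) tensors, and policy_net(s) returns the mean and log-std of a diagonal Gaussian.

```python
# Sketch of the soft value loss J_V(psi); interfaces are assumptions.
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def value_loss(value_net, q_net, policy_net, states):
    with torch.no_grad():                                   # the regression target carries no gradient
        mean, log_std = policy_net(states)                  # diagonal Gaussian policy head
        dist = Normal(mean, log_std.exp())
        actions = dist.sample()                             # a_t ~ pi_phi(.|s_t)
        log_probs = dist.log_prob(actions).sum(-1, keepdim=True)
        target = q_net(states, actions) - log_probs         # Q_theta(s_t, a_t) - log pi_phi(a_t|s_t)
    return 0.5 * F.mse_loss(value_net(states), target)      # squared residual error
```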

Train soft Q-function

J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}} \left[ \frac{1}{2} \left( Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t) \right)^2 \right]
The soft Q-function is updated via TD learning,
\text{with} \quad \hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p} \left[ V_{\bar{\psi}}(s_{t+1}) \right]
where the target Q uses the trick from above: the bootstrap term is the (target) value function $V_{\bar{\psi}}$ rather than another Q. Differentiating gives
\hat{\nabla}_\theta J_Q(\theta) = \nabla_\theta Q_\theta(s_t, a_t) \left( Q_\theta(s_t, a_t) - r(s_t, a_t) - \gamma V_{\bar{\psi}}(s_{t+1}) \right)
which can be optimized with stochastic gradient descent.
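A matching sketch of the soft Q loss under the same assumed interfaces: the batch is a (state, action, reward, next_state, done) tuple from the replay buffer, and the bootstrap term uses the target value network V with parameters ψ̄.

```python
# Sketch of the soft Q loss J_Q(theta) with a 1-step TD target; names assumed.
import torch
import torch.nn.functional as F

def q_loss(q_net, target_value_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch                           # done is a 0/1 float tensor
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * target_value_net(s_next)  # r + gamma * V_{bar psi}(s')
    return 0.5 * F.mse_loss(q_net(s, a), td_target)         # squared soft Bellman residual
```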

Train soft policy improvement

J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}} \left[ D_{\text{KL}} \left( \pi_\phi(\cdot \mid s_t) \, \Bigg\| \, \frac{\exp(Q_\theta(s_t, \cdot))}{Z_\theta(s_t)} \right) \right].
This is the soft policy improvement objective we saw above.
The policy is trained by minimizing this KL divergence. An ordinary policy gradient method (TRPO, PPO) would backpropagate through a likelihood-ratio estimator, but here the target density is defined through $\theta$ while the objective is optimized over $\phi$, so differentiating naively with respect to $\phi$ leaves no gradient signal from the Q-term.
Since $Q_\theta$ is itself a differentiable neural network whose signal we do not want to lose, SAC applies the reparameterization trick and expresses the action as a deterministic function of $\phi$ and a noise variable:
a_t = f_{\phi}(\epsilon_t; s_t)
In other words, the objective, originally written against a target defined by $\theta$, is re-expressed as a function of $\phi$ through the sampled action.
D_{\text{KL}}(p \parallel q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}
Before differentiating, recall the definition of the KL divergence above (you probably know it already). Expanding the soft policy improvement objective with it, and dropping the normalizer $Z_\theta$ since it does not depend on $\phi$, gives the expression below.
J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}, \epsilon_t \sim \mathcal{N}} \left[ \log \pi_\phi \left( f_\phi(\epsilon_t; s_t) \mid s_t \right) - Q_\theta \left( s_t, f_\phi(\epsilon_t; s_t) \right) \right]
(Differentiating the expression above gives:)
\hat{\nabla}_{\phi} J_{\pi}(\phi) = \nabla_{\phi} \log \pi_{\phi}(a_t \mid s_t) + \left( \nabla_{a_t} \log \pi_{\phi}(a_t \mid s_t) - \nabla_{a_t} Q_\theta(s_t, a_t) \right) \nabla_{\phi} f_{\phi}(\epsilon_t; s_t)
(If, like me, you are not strong with equations,) the term to the right of the "+" looks unproblematic, but you may wonder why the log term on the left appears at all.
The reason is that in the chain rule the intermediate quantity $a_t = f_\phi(\epsilon_t; s_t)$ itself depends on $\phi$, so we have to take the total derivative rather than only the partial derivative; that is where the extra term comes from.

Additional Features of SAC

1.
To counteract positive bias in the policy improvement step, SAC uses two Q-functions.
The two Q-functions are trained independently: each has its own parameters $\theta_i$ and its own objective $J_Q(\theta_i)$, optimized separately.
2.
For the value gradient (the soft value training above) and the policy gradient, the minimum of the two Q-functions is used (see the sketch after this list).
This mitigates Q-function overestimation and, through its clipping effect, stabilizes training.
3.
The algorithm alternates between collecting experience from the environment and sampling minibatches from the replay buffer to update the function approximators.
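A small sketch of the clipped double-Q trick from point 2; q_net1 and q_net2 stand for the two independently trained critics (assumed names).

```python
# Pointwise minimum of the two critics, used in the value and policy targets.
import torch

def min_q(q_net1, q_net2, states, actions):
    q1 = q_net1(states, actions)        # Q_{theta_1}(s, a)
    q2 = q_net2(states, actions)        # Q_{theta_2}(s, a)
    return torch.min(q1, q2)            # the smaller estimate curbs overestimation
```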

SAC pseudo code

The last line means "nudge the target value network a little toward the current value network at each step," i.e., a soft (Polyak) target update of the kind used in DDPG and TD3.
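Since the pseudocode figure itself is not reproduced here, the rough sketch below ties the loss sketches above into a single update step (all names are assumptions, not the paper's notation); the closing loop is the soft target update that the note above refers to.

```python
# One SAC gradient step, reusing value_loss, q_loss, policy_loss, and min_q
# from the sketches above; nets and optims are placeholders.
import torch

def sac_update(batch, nets, optims, gamma=0.99, tau=0.005):
    value_net, target_value_net, q_net1, q_net2, policy_net = nets
    states = batch[0]
    min_q_fn = lambda s, a: min_q(q_net1, q_net2, s, a)     # clipped double-Q critic

    # One gradient step per network, each on a freshly built graph.
    steps = [
        ("value",  lambda: value_loss(value_net, min_q_fn, policy_net, states)),
        ("q1",     lambda: q_loss(q_net1, target_value_net, batch, gamma)),
        ("q2",     lambda: q_loss(q_net2, target_value_net, batch, gamma)),
        ("policy", lambda: policy_loss(policy_net, min_q_fn, states)),
    ]
    for name, make_loss in steps:
        optims[name].zero_grad()
        make_loss().backward()
        optims[name].step()

    # The "last line" of the pseudocode: soft (Polyak) target update,
    # psi_bar <- tau * psi + (1 - tau) * psi_bar.
    with torch.no_grad():
        for p, p_targ in zip(value_net.parameters(), target_value_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
```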