๐Ÿ’ช๐Ÿป

HIRO

์ƒ์„ฑ์ผ
2025/07/08 05:58
ํƒœ๊ทธ
๊ฐ•ํ™”ํ•™์Šต
์ž‘์„ฑ์ž

HIerarchical Reinforcement learning with Off-policy correction (HIRO)

Key idea: FuN + TD3 + off-policy correction
• Limitations of FuN
1. FuN is on-policy: both the worker and the manager collect data only with the current policy. Data collection is therefore inefficient and sample efficiency is very low.
2. The subgoal set by the manager is a vector in a latent space, so it is hard to interpret how meaningfully the subgoal relates to actual actions. The reward is also derived from cosine similarity with the subgoal → potentially unstable.
3. Because the manager and the worker are updated on different time-step scales, non-stationarity arises. Reward assignment is difficult, and if the worker fails to follow the subgoal, the manager's training can fail as well.
Off-policy correction
Off-policy learning: training a target policy with data collected by a behavior policy.
The problem with off-policy learning: data collected from the behavior policy may not be data the target policy would produce → the gradient flows in the wrong direction and training becomes unstable. Because the two policies differ, the expectation is computed incorrectly.
\mathbb{E}_{a \sim \mu}[\text{loss}(a)] \neq \mathbb{E}_{a \sim \pi}[\text{loss}(a)]
Therefore, a correction is needed so that the policy can be updated in the right direction.
• How is the correction done?
1. Importance Sampling (IS)
When we want an expectation under some distribution but cannot sample from it directly, we reweight samples drawn from another distribution to recover the desired expectation (a minimal sketch combining IS with clipping follows this list).
w = \frac{\pi(a|s)}{\mu(a|s)}
\mathbb{E}_{a \sim \mu}\left[ w \cdot f(a) \right] = \mathbb{E}_{a \sim \pi}\left[ f(a) \right]
2. Clipping or truncation
Clip the importance weights or use truncated importance sampling.
Examples: V-trace, Q(λ), GAE, etc. → these improve stability and adjust the bias/variance trade-off.
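A minimal NumPy sketch of the idea (the discrete distributions, the per-action quantity f, and the clipping threshold are made up for illustration, not from the paper): estimate the expectation under the target policy π from samples drawn under the behavior policy μ, first with plain importance weights and then with clipped (truncated) weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical behavior policy mu and target policy pi over 3 discrete actions.
mu = np.array([0.5, 0.3, 0.2])   # behavior policy (data was collected with this)
pi = np.array([0.2, 0.3, 0.5])   # target policy (what we want to evaluate)
f = np.array([1.0, 2.0, 4.0])    # arbitrary per-action quantity, e.g. a loss or return

actions = rng.choice(3, size=10_000, p=mu)      # off-policy data: a ~ mu

w = pi[actions] / mu[actions]                   # importance weights pi(a)/mu(a)
is_estimate = np.mean(w * f[actions])           # corrected estimate of E_{a~pi}[f(a)]
clipped = np.mean(np.minimum(w, 1.5) * f[actions])  # truncated IS (V-trace-style clipping)

print("naive (uncorrected):", np.mean(f[actions]))  # estimates E_{a~mu}[f(a)] instead
print("importance sampling:", is_estimate)          # close to the true value
print("clipped IS (biased, lower variance):", clipped)
print("true E_pi[f]:", pi @ f)                      # 2.8
```

The clipped estimate is slightly biased toward the behavior-policy value but has bounded weights, which is exactly the stability/variance trade-off mentioned above.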
• Off-policy correction in HER
HER is also off-policy, so the hindsight goal can differ from what the target policy would pursue. HER does not apply an explicit correction technique (e.g., importance sampling); instead, the correction is performed implicitly by relabeling goals and assigning the corresponding reward.
Like FuN, HIRO is a hierarchical policy architecture with two levels, but it improves sample efficiency and training stability by using the machinery of TD3, an off-policy reinforcement learning algorithm.
Why the policies are written as μ rather than π
Because TD3 is based on deterministic policies; π is conventionally used for stochastic policies.
• Low-level policy (μ^{lo}) - worker
Input: s_t, g_t
Output: a_t
r_t^{low} = -\| s_{t+1} - s_t - g_t \|_2
The low-level policy takes the current state s_t and the goal g_t proposed by the high-level policy as input and selects an action a_t, which interacts directly with the environment. The reward r_t^{low} measures how well the resulting transition followed the goal g_t and is computed on that basis.
It uses an actor-critic structure trained with TD3, where the critic is Q(s_t, g_t, a_t) and the actor is μ^{lo}(s_t, g_t), as sketched below.
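A minimal PyTorch sketch of how the goal conditioning can be wired up (an illustrative assumption, not the paper's exact architecture; the layer sizes are made up): the subgoal is simply concatenated to the state before it enters the actor and the critic.

```python
import torch
import torch.nn as nn

class GoalConditionedActor(nn.Module):
    """mu_lo(s, g) -> a: deterministic low-level policy conditioned on the subgoal."""
    def __init__(self, state_dim, goal_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state, goal):
        return self.max_action * self.net(torch.cat([state, goal], dim=-1))

class GoalConditionedCritic(nn.Module):
    """Q(s, g, a): the subgoal is treated as part of the observation."""
    def __init__(self, state_dim, goal_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, goal, action):
        return self.net(torch.cat([state, goal, action], dim=-1))
```

TD3 itself would additionally keep two such critics plus target copies of every network; those details are omitted here.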
• High-level policy (μ^{hi}) - manager
At a fixed interval (e.g., every c steps) it produces a goal g_t. Here the goal is a vector in the state space, i.e., a target state.
R_t = \sum_{k=0}^{c-1} \gamma^k r_{t+k}^{env}
The reward is the accumulated reward actually received from the environment (a small sketch follows this bullet). However, the worker μ^{lo} that produced the data may have acted under a previous goal (or an older version of itself) rather than under g_t, so this goal mismatch creates an off-policy problem.
Goal relabeling is therefore used so that the high-level policy can also be trained off-policy.
• Subgoal transition function h
The intrinsic reward for the low-level policy is computed as a distance to the goal state. Because the goal state is treated as a relative position from the current state s_t, the goal must be rewritten whenever the state changes; the goal transition function h does exactly this:
h(s_t, g_t, s_{t+1}) = s_t + g_t - s_{t+1}
Intuitively: "if the agent had followed the goal exactly, s_{t+1} would have been equal to s_t + g_t" (see the small sketch below).

Off-policy correction in HIRO

• (A) Low-level policy correction
The low-level policy is conditioned on the goal g_t. Therefore, when it is trained on transitions already in the replay buffer, a mismatch with the current policy arises. To compensate, a "goal-conditioned reward" is used: the closer the actual movement is to the goal direction, the larger the value (small error → large reward).
r(s_t, g_t, a_t, s_{t+1}) = -\| s_t + g_t - s_{t+1} \|_2
Goal-conditioned reward
The worker learns to achieve the goal g_t set by the manager, but in sparse-reward environments the worker may never receive any reward at all. To address this, the worker gets an intrinsic reward that grows as it reduces the distance between the current state s_t and the manager's goal g_t, as in the sketch below.
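A one-function sketch of this dense, goal-conditioned reward (the example states are illustrative):

```python
import numpy as np

def intrinsic_reward(s_t, g_t, s_next):
    """r(s_t, g_t, a_t, s_{t+1}) = -|| s_t + g_t - s_{t+1} ||_2
    Dense reward: about 0 when the agent lands exactly on the target s_t + g_t,
    and increasingly negative the farther it misses."""
    return -np.linalg.norm(s_t + g_t - s_next)

print(intrinsic_reward(np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # goal reached: reward ≈ 0
print(intrinsic_reward(np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 0.0])))  # no progress: -1.0
```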
• (B) High-level policy correction: goal relabeling
The high-level policy has to look at the outcomes of past low-level actions and infer which goal would have produced those actions. This is because the low-level policy updates every step while the high-level policy updates only every c steps. For a transition (s_t, g_t, a_t, s_{t+1}) in the replay buffer, the goal is relabeled so that the current low-level policy looks as if it would actually have generated those actions.
So instead of the goal that was actually used, training uses a relabeled goal g̃ chosen to maximize the likelihood of the past low-level actions. This is called goal relabeling.
\tilde{g}_t = \text{argmin}_{g_t} \sum_{i=t}^{t+c-1} \| a_i - \mu^{lo}(s_i, g_t) \|_2^2
goal์„ ์ฃผ์—ˆ์„ ๋•Œ ํ˜„์žฌ low-level policy๊ฐ€ replay buffer์— ์ €์žฅ๋˜์–ด ์žˆ๋˜ ๊ฒƒ๊ณผ ๊ฐ™์€ ํ–‰๋™์„ ํ•˜๋„๋ก goal gtg_t ๋ฅผ relabelingํ•˜์—ฌ ์ฐพ์•„์ค€๋‹ค.
• Step 1: Collect experience (s_t, g_t, a_t, R_t, …)
Interact with the environment and collect the following data: the current state s_t, the subgoal g_t generated by the high-level policy, the action a_t taken by the low-level policy, and the reward R_t received from the environment. These are stored in the replay buffer.
• Step 2: Train the low-level policy μ^{lo}
r_t = -\| s_{t+1} - s_t - g_t \|_2
The low-level policy is trained to take actions that reach the subgoal. Here g_t is fed as input together with the state s_t, and the reward is defined as the (L2) distance between the subgoal and the actual movement. Training uses an off-policy actor-critic algorithm (TD3).
• Step 3: Train the high-level policy μ^{hi}
The high-level policy is updated once every c steps, based on the reward accumulated over the interval from t to t + c. The goal g̃_t used here is obtained through goal relabeling, i.e., by inversely estimating "which subgoal would make these actions come out of the current low-level policy?"
• Step 4: Repeat
Repeat the above process (a compact end-to-end sketch of the loop follows below).