๐Ÿ’ช๐Ÿป

REINFORCE

์ƒ์„ฑ์ผ
2024/08/05 02:59
ํƒœ๊ทธ
๊ฐ•ํ™”ํ•™์Šต
์ž‘์„ฑ์ž


์ด์ „ ๊ธ€์˜ ๋‹ค์–‘ํ•œ Policy gradient ์ค‘ Return์„ ์‚ฌ์šฉํ•˜๋Š” REINFORCE Algorithm์ด๋‹ค.
1.
Generate M trajectories using the current parameters $\theta$.
2.
Compute the return $G_t$ from each episode.
3.
Approximate
$$\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{t=0}^{T-1} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$
with the sample mean over the M trajectories.
4.
Since the objective function is defined as the expected return over entire trajectories, apply gradient ascent for the parameter update (see the sketch after this list).
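Below is a minimal sketch of this loop, assuming a gymnasium-style environment with a discrete action space and a small PyTorch policy network; `Policy`, `reinforce_update`, and all hyperparameters are illustrative names, not from the original post.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Hypothetical policy network for a discrete action space."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions)
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_update(policy, optimizer, env, gamma=0.99, n_traj=8):
    """One REINFORCE update, following steps 1-4 above."""
    optimizer.zero_grad()
    loss = 0.0
    for _ in range(n_traj):                      # 1. generate M trajectories
        obs, _ = env.reset()
        log_probs, rewards, done = [], [], False
        while not done:
            dist = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, r, terminated, truncated, _ = env.step(action.item())
            rewards.append(r)
            done = terminated or truncated
        returns, G = [], 0.0                     # 2. return G_t at each step
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for logp, G in zip(log_probs, returns):  # 3. sample-mean estimate of
            loss = loss - logp * G               #    the policy gradient
    (loss / n_traj).backward()                   # 4. ascent on J = descent on -J
    optimizer.step()
```

Note the sign flip: optimizers minimize, so maximizing $J(\theta)$ by gradient ascent is written as minimizing $-J(\theta)$.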
โ€ข
Drawbacks of REINFORCE
The policy can only be updated after an episode has terminated; because of this property, the method is called Monte Carlo policy gradient. In addition, the gradient is proportional to the return, so its variance is large. Finally, it is an on-policy method: the policy being updated and the policy used to collect samples must be identical.
โ€ข
REINFORCE with baseline
REINFORCE์˜ ๋‹จ์ ์ธ variance๋ฅผ ์ค„์ด๊ธฐ ์œ„ํ•ด์„œ baseline์„ ๋„์ž…ํ•œ ํ˜•ํƒœ๋ฅผ ๊ณ ๋ คํ•œ๋‹ค. ์ด๋•Œ baseline์— action๊ณผ ๋ฌด๊ด€ํ•œ ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค๋ฉด, ์ด์ „์˜ ์ฆ๋ช…์— ์˜ํ•˜์—ฌ ๊ธฐ๋Œ“๊ฐ’์ด ๋ณ€ํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ์—ฌ์ „ํžˆ unbias๋ฅผ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์žฅ์ ์ด ์žˆ๋‹ค.
โˆ‡ฮธJ(ฮธ)=Eฯ€ฮธ[โˆ‘t=0Tโˆ’1(Gtโˆ’b(st))โˆ‡ฮธlogฯ€ฮธ(atโˆฃst)]\nabla_\theta J(\theta) = E_{\pi_\theta}[\sum^{T-1}_{t=0} (G_t - b(s_t)) \nabla_\theta log \pi_\theta (a_t|s_t)]
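For reference, that argument reduces to the baseline term having zero expectation; a one-line restatement of the standard identity (discrete actions assumed, not quoted from the earlier post):

$$E_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[ b(s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] = b(s_t) \sum_{a} \pi_\theta(a \mid s_t)\, \frac{\nabla_\theta \pi_\theta(a \mid s_t)}{\pi_\theta(a \mid s_t)} = b(s_t)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t) = b(s_t)\, \nabla_\theta 1 = 0$$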
1.
Generate an episode with the current policy parameters $\theta$.
2.
Compute the return at each time step.
3.
Apply gradient descent with the computed return as the target to update the baseline (state-value function) parameters.
4.
Apply gradient ascent to update the policy parameters.
5.
After the updates for this episode are finished, repeat the procedure on the next generated episode.
**In step 5,
$$\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{t=0}^{T-1} \left(G_t - b(s_t)\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$
is approximated with the sample mean, and the update is applied immediately with the data obtained from each single step (see the sketch after this list).
**Policy Gradient derivation (attached PDF)