
Chapter 4

Monte-Carlo Reinforcement Learning

๊ณ„์†๋œ ์‹คํ–‰์œผ๋กœ ๋„์ถœํ•œ ์‹ค์ œ ๊ฐ’๋“ค์„ ํ†ตํ•ด์„œ ์ถ”์ •ํ•˜๋Š” ๋ฐฉ์‹
1.
MC methods learn directly from episodes of experience
2.
MC is model-free: no knowledge of MDP transitions / rewards
3.
MC learns from complete episodes: no bootstrapping, ์ด ๊ฐ’๋“ค์„ ํ‰๊ท ๋‚ธ ๊ฒƒ์ด value
4.
๊ณ„์† loopํ•œ episode์˜ ๊ฒฝ์šฐ return ์ด ๋‚˜์˜ค์ง€ ์•Š์œผ๋ฏ€๋กœ, episode๋Š” ์ข…๋ฃŒ๋˜์–ด์•ผ ํ•œ๋‹ค.

First-Visit Monte-Carlo Policy Evaluation

The first time-step t that state s is visited in an episode:
Increment counter N(s) ← N(s) + 1
Increment total return S(s) ← S(s) + Gt
Value is estimated by mean return V(s) = S(s) / N(s)
→ By the law of large numbers, V(s) converges to vπ(s) as N(s) → ∞

Every-Visit Monte-Carlo Policy Evaluation

Every time-step t that state s is visited in an episode:
Increment counter N(s) ← N(s) + 1
Increment total return S(s) ← S(s) + Gt
Value is estimated by mean return V(s) = S(s) / N(s)
→ Assuming every state is visited, the law of large numbers makes both variants converge to the same values (a sketch of both follows below)
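
A minimal sketch of the two evaluation variants above, assuming episodes are supplied as lists of (state, reward) pairs and gamma is the discount factor; the function name and the first_visit flag are illustrative, and the flag is the only difference between the two methods.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) as the mean return observed after visiting s."""
    returns_sum = defaultdict(float)   # S(s)
    visit_count = defaultdict(int)     # N(s)

    for episode in episodes:
        # Compute the return G_t at every time step by scanning backwards.
        G = 0.0
        step_returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            step_returns.append((state, G))
        step_returns.reverse()  # back to chronological order

        seen = set()
        for state, G in step_returns:
            if first_visit and state in seen:
                continue               # first-visit: only the first occurrence counts
            seen.add(state)
            visit_count[state] += 1
            returns_sum[state] += G

    # V(s) = S(s) / N(s)
    return {s: returns_sum[s] / visit_count[s] for s in visit_count}
```

With first_visit=False every occurrence of a state contributes its return, which is exactly the every-visit variant.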

Incremental Monte-Carlo Update

Instead of storing all returns, V(s) is updated with an incremental mean each time the state is visited.
In non-stationary problems it can be better to replace 1/N with a small fixed step size α.
→ This gradually forgets old experience: recent returns get more weight, so the estimate tracks the latest behaviour of the environment (update rules written out below).

Temporal-Difference Learning

1. TD methods learn directly from episodes of experience
2. TD is model-free: no knowledge of MDP transitions / rewards
3. TD learns from incomplete episodes, by bootstrapping
TD target: Rt+1 + γV(St+1)
→ the prediction made after actually taking one more step
TD error: δt = Rt+1 + γV(St+1) − V(St)
→ V(St) is the current prediction; Rt+1 + γV(St+1) is the prediction made one real step later
⇒ The one-step-ahead estimate is more accurate, so V(St) is updated towards it: V(St) ← V(St) + αδt (see the sketch below)
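
A minimal tabular TD(0) prediction sketch in the same spirit; the (state, reward, next_state, done) transition format, the function name, and the parameters are illustrative assumptions, not from the lecture.

```python
from collections import defaultdict

def td0_policy_evaluation(transitions, alpha=0.1, gamma=1.0, num_sweeps=1):
    """Tabular TD(0): after every step, move V(s) towards the TD target."""
    V = defaultdict(float)
    for _ in range(num_sweeps):
        for state, reward, next_state, done in transitions:
            # TD target: observed reward plus the discounted estimate of the
            # next state (zero if the episode terminated at this step).
            target = reward + (0.0 if done else gamma * V[next_state])
            td_error = target - V[state]          # delta_t
            V[state] += alpha * td_error          # V(S_t) <- V(S_t) + alpha * delta_t
        # transitions can be replayed for several sweeps (cf. the batch setting below)
    return dict(V)
```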

MC vs TD

Advantages and Disadvantages of MC and TD

• MC: learns only from complete sequences, after the final-outcome return is available
Usable only in terminating (episodic) environments
1. MC has high variance, zero bias
2. Good convergence properties (even with function approximation)
3. Not very sensitive to initial value
4. Very simple to understand and use
• TD: can learn from incomplete sequences, before the final-outcome return is available
Usable in non-terminating (continuing) environments
1. TD has low variance, some bias
2. Usually more efficient than MC
3. TD(0) converges to vπ(s)
4. More sensitive to initial value
As more episodes are processed, the prediction error of both methods decreases.
MC decreases smoothly, while TD oscillates depending on the step size α.

Batch MC and TD

If the collected episodes can be replayed over and over, both MC and TD converge, though not necessarily to the same solution.
MC: does not exploit the Markov property; usually more effective in non-Markov environments.
TD: exploits the Markov property when estimating values; more effective in Markov environments.

Monte-Carlo Backup

St์—์„œ ์‹œ์ž‘ํ•ด์„œ ๋‹ค์–‘ํ•œ branches๊ฐ€ ์กด์žฌํ•˜๋Š” tree๊ตฌ์กฐ
MC๋Š” ์–ด๋Š ํ•œ๊ฐ€์ง€๋ฅผ ๋๊นŒ์ง€ ํ•ด์„œ ์ด ๊ฐ’์œผ๋กœ St๋ฅผ update

Temporal-Difference Backup

TD goes only one step, then uses the value estimate at the resulting state to update V(St) = bootstrapping.

Dynamic Programming Backup

DP does not sample a trajectory to the end: from the current state it looks one step ahead over all possible actions and successor states, and performs a full-width update using the values already stored for those successors.
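
For contrast with the two sampled backups above, the DP backup written as an equation (standard full-width expected update; p denotes the known MDP dynamics):

```latex
% Full-width DP backup: expectation over all actions and successor states
V(S_t) \leftarrow \sum_{a} \pi(a \mid S_t) \sum_{s', r} p(s', r \mid S_t, a)\,\bigl[\, r + \gamma V(s') \,\bigr]
```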

Bootstrapping and Sampling

• Bootstrapping: updating an estimate from another estimate
MC does not bootstrap because it runs each episode to the end; DP and TD bootstrap because they look only one step ahead
• Sampling: updating from a single sampled experience
MC and TD update from samples; DP does not sample, since it sweeps over every action available in the state

n-Step Prediction

n์ด ๋ฌดํ•œ๋Œ€๋กœ ๊ฐ€๋Š” TD = MC

λ-return

TD(λ): every target from TD(0) up to MC can be averaged into a single target.
•
The n-step returns are weighted by multiplying (1 − λ) by successive powers of λ, so the weight keeps shrinking as the target moves towards MC (formula below).
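
The λ-return this describes, as a geometrically weighted average of the n-step returns above (standard definition):

```latex
% lambda-return: the weights (1 - lambda) * lambda^(n-1) over all n-step returns sum to 1
G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_t^{(n)}
% lambda = 0 recovers the TD(0) target; lambda -> 1 approaches the MC return
```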

TD(λ) Weighting Function

โ†’ ์‹œ๊ฐ„์ด ์ง€๋‚ ์ˆ˜๋ก ๊ฐ€์ค‘์น˜๊ฐ€ ์ ์  ์ค„์–ด๋“ ๋‹ค.
Geometry means๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด computation์„ efficientํ•˜๊ณ , memorylessํ•˜๊ฒŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.
TD(0)์™€ ๊ฐ™์€ ๋น„์šฉ์œผ๋กœ TD(lamda)๋ฅผ ๊ณ„์‚ฐ๊ฐ€๋Šฅํ•˜๋‹ค.

Forward-view TD(λ)

•
Update value function towards the λ-return
•
Like MC, can only be computed from complete episodes (the online, incomplete-sequence advantage of TD(0) is lost)

Backward View TD(λ)

•
Update online, every step, from incomplete sequences (this preserves the advantage of TD(0))
•
Keep an eligibility trace for every state s
•
Because responsibility (credit) for the current TD error is continuously carried along from the past, an update can be made at every step
Backward-view TD(λ) is the form mainly used in practice.

Eligibility Traces

โ†’ ์ฑ…์ž„์ด ํฐ ๊ฒƒํ•œํ…Œ update๋ฅผ ๋งŽ์ด ํ•ด์ฃผ๋Š” ๋ฐฉ์‹
โ€ข
Frequency heuristic: assign credit to most frequent states
โ€ข
Recency heuristic: assign credit to most recent states
ํ•œ๋ฒˆ๋„ ๋ฐฉ๋ฌธํ•˜์ง€ ์•Š์œผ๋ฉด 0, state๋ฅผ ๋ฐฉ๋ฌธํ•˜๋ฉด 1์„ ๋”ํ•ด์ฃผ๊ณ  ๋ฐฉ๋ฌธํ•˜์ง€ ์•Š์œผ๋ฉด r๊ณฑํ•ด์„œ ์ค„์ธ๋‹ค.