Chapter 2

Introduction to MDPs

• Markov decision processes formally describe an environment for reinforcement learning
• Where the environment is fully observable
• Almost all RL problems can be formalised as MDPs

Markov Property

The state captures all relevant information from the history
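In symbols, the standard statement of the Markov property is

$$ P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \dots, S_t] $$

i.e. the future is independent of the past given the present.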

Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, ... with the Markov property.
→ memoryless: no matter where we came from or how we got here, the future is determined by the present alone (the Markov property).
→ "random process" means it can be sampled from, i.e. a sequence of random states that satisfy the Markov property (see the sampling sketch below).
• S: the set of states.
• P: the probability of transitioning to the next state.
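As a small illustration of "a random process we can sample from", the sketch below draws state sequences using only the current state. The state names and probabilities are made-up assumptions, not taken from the lecture.

```python
import random

# Illustrative (assumed) Markov process: S is the set of states,
# P gives the probability of transitioning to the next state.
S = ["Class1", "Class2", "Sleep"]
P = {
    "Class1": {"Class2": 0.5, "Sleep": 0.5},
    "Class2": {"Class1": 0.2, "Sleep": 0.8},
    "Sleep":  {"Sleep": 1.0},  # absorbing (terminal) state
}

def sample_episode(start="Class1", max_steps=20):
    """Sample S1, S2, ... using only the current state (memoryless / Markov property)."""
    s, episode = start, [start]
    for _ in range(max_steps):
        if s == "Sleep":
            break
        next_states, probs = zip(*P[s].items())
        s = random.choices(next_states, weights=probs)[0]
        episode.append(s)
    return episode

print(sample_episode())  # e.g. ['Class1', 'Class2', 'Sleep']
```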

Markov Reward Process

• S: the set of states
• P: the state transition probability matrix
• R: the reward function (defines, per state, what reward is given when that state is reached)
• γ: the discount factor

Return

The discount γ ∈ [0, 1] is the present value of future rewards.
γ close to 0 leads to "myopic" evaluation; γ close to 1 leads to "far-sighted" evaluation.
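The quantity being discounted is the return G_t, the total discounted reward from time-step t:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$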

Why discount?

• Mathematically convenient to discount rewards
• Avoids infinite returns in cyclic Markov processes
• It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate.

Value Function

return ์˜ ๊ธฐ๋Œ“๊ฐ’. ์–ด๋–ค state์— ์™”์„ ๋•Œ, ๊ณ„์† sampling ํ•˜๋ฉด์„œ ์—ํ”ผ์†Œ๋“œ๋ฅผ ๋งŒ๋“ค์–ด๊ฐ„๋‹ค.
๊ทธ๋Ÿด ๋•Œ ์—ํ”ผ์†Œ๋“œ๋งˆ๋‹ค ๋ฆฌํ„ด์ด ์ƒ๊ธฐ๊ฒŒ ๋œ๋‹ค.
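Written out, the state value function of an MRP is the expected return starting from state s:

$$ v(s) = \mathbb{E}[\, G_t \mid S_t = s \,] $$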

Bellman Equation for MRPs

The value function decomposes into the immediate reward R(t+1) plus the discounted value of the successor state.
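In equation form, the Bellman equation for MRPs is

$$ v(s) = \mathbb{E}[\, R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s \,] $$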

Markov Decision Processes (MDP)

• A Markov decision process (MDP) is a Markov reward process with decisions.
• It is an environment in which all states are Markov.

Policies

• A policy fully defines the behaviour of an agent
• MDP policies depend on the current state (not the history)
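Formally, a policy is a distribution over actions given the current state:

$$ \pi(a \mid s) = P[\, A_t = a \mid S_t = s \,] $$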

Value Function

state-value function

• The input is a single state.
• Unlike the MRP, which had no actions, actions are now involved, and with them the policy π.
• Its meaning is still the expected return obtainable when starting from the current state s.
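In symbols:

$$ v_\pi(s) = \mathbb{E}_\pi[\, G_t \mid S_t = s \,] $$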

action-value function

• Also called the Q-function.
• It takes two inputs: a state and an action.
• It means the expected return when taking action a in state s and then following policy π until the end of the episode.
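In symbols:

$$ q_\pi(s, a) = \mathbb{E}_\pi[\, G_t \mid S_t = s, A_t = a \,] $$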

Bellman Expectation Equation

state-value function decompose

It is the same as taking one step and then following π from the next state onward.
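That is,

$$ v_\pi(s) = \mathbb{E}_\pi[\, R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \,] $$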

action-value function decompose

Receive R(t+1), then take the expectation over choosing the next action in the next state.
The next action is determined by π.

Bellman Expectation Equation (Matrix Form)

→ A direct solution can be obtained via matrix inversion.
The drawback is that the computation becomes far too expensive as the MDP grows.
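Using the policy-averaged reward R^π and transition matrix P^π, the matrix form is v_π = R^π + γ P^π v_π, so the direct solution is v_π = (I − γ P^π)^(−1) R^π. A minimal numerical sketch (the 3-state numbers are assumptions, purely for illustration):

```python
import numpy as np

# Assumed toy example: MRP induced by a fixed policy, with 3 states.
gamma = 0.9
P_pi = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.0, 1.0],   # absorbing state
])
R_pi = np.array([1.0, -0.5, 0.0])

# Direct solution of v = R + gamma * P v  <=>  (I - gamma * P) v = R
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
print(v_pi)
```

Solving this linear system costs roughly O(n^3) in the number of states, which is exactly why the direct approach stops being practical for large MDPs.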

Optimal Policy

We can say "π is at least as good as π′" when, following π, the state-value of every state is at least as good as under π′.
Note that this must hold for every state: π cannot be judged better on the strength of a single state-value alone.
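This is the partial ordering over policies:

$$ \pi \geq \pi' \iff v_\pi(s) \geq v_{\pi'}(s), \;\; \forall s $$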

Finding an Optimal Policy

An optimal policy can be found by maximising over q*(s, a).
There is always a deterministic optimal policy for any MDP.
If we know q*(s, a), we immediately have the optimal policy.
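Concretely, acting greedily with respect to q*(s, a) gives such a deterministic optimal policy:

$$ \pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in \mathcal{A}} q_*(s, a) \\ 0 & \text{otherwise} \end{cases} $$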

Solving the Bellman Optimality Equation

• Bellman Optimality Equation is non-linear
• No closed form solution (in general)
• Many iterative solution methods (see the value-iteration sketch below)
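Value iteration is one of those iterative methods: repeatedly apply the Bellman optimality backup until the values stop changing. The sketch below uses a made-up 2-state, 2-action MDP; all numbers and names are assumptions for illustration only.

```python
import numpy as np

# Assumed toy MDP: P[a] is the |S|x|S| transition matrix for action a,
# R[a] the expected immediate reward for taking action a in each state.
gamma = 0.9
P = {
    0: np.array([[0.8, 0.2], [0.0, 1.0]]),
    1: np.array([[0.2, 0.8], [0.0, 1.0]]),
}
R = {0: np.array([1.0, 0.0]), 1: np.array([2.0, 0.0])}

v = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: v(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) v(s') ]
    v_new = np.max([R[a] + gamma * P[a] @ v for a in sorted(P)], axis=0)
    if np.max(np.abs(v_new - v)) < 1e-8:  # stop once the backup has converged
        v = v_new
        break
    v = v_new

# Greedy deterministic policy extracted from the converged optimal values
q_star = np.array([R[a] + gamma * P[a] @ v for a in sorted(P)])
pi_star = np.argmax(q_star, axis=0)
print(v, pi_star)
```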