
MDP

์ƒ์„ฑ์ผ
2024/07/11 15:16
ํƒœ๊ทธ
๊ฐ•ํ™”ํ•™์Šต
์ž‘์„ฑ์ž

Markov Property

1. Grid world example

State: $S = \{(1,1), (1,2), \dots, (4,2), (4,3)\}$
Action: $A = \{north, south, east, west\}$
Reward: $+1$ on reaching $(4,3)$, $-1$ on reaching $(4,2)$, and a small negative reward $c$ (a penalty for each otherwise meaningless move)
Agent: Noisy movement
State transition probability
1. On reaching $(2,2)$, the agent stays in its current state
2. The agent moves in the chosen direction with probability $80\%$, and to the left/right of that direction with probability $10\%$ each
Terminal states: $(4,2), (4,3)$
Goal: find a policy that maximizes the total sum of rewards
Episode (example): total reward $= 5c + 1$ (five moves at $c$ each, then $+1$ at the goal)
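A minimal sketch of the dynamics described above, assuming the usual 4×3 layout with $(2,2)$ acting as a blocked cell; the helper names (`move`, `step_distribution`, `sample_step`) and the default value of $c$ are illustrative choices, not taken from the note.

```python
# A minimal sketch of the noisy grid-world dynamics described above.
import random

WALL = (2, 2)
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
# Perpendicular neighbours of each action (each taken with probability 0.1).
SIDES = {"north": ("west", "east"), "south": ("east", "west"),
         "east": ("north", "south"), "west": ("south", "north")}

def move(state, action):
    """Apply one deterministic move; bumping into the wall or the border keeps the state."""
    dx, dy = MOVES[action]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt == WALL or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return state
    return nxt

def step_distribution(state, action):
    """p(s' | s, a): 0.8 for the chosen direction, 0.1 for each perpendicular one."""
    dist = {}
    for a, p in [(action, 0.8), (SIDES[action][0], 0.1), (SIDES[action][1], 0.1)]:
        nxt = move(state, a)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

def sample_step(state, action, c=-0.04):
    """Sample one noisy transition and its reward (c per move, +/-1 on entering a terminal)."""
    dist = step_distribution(state, action)
    nxt = random.choices(list(dist), weights=list(dist.values()))[0]
    reward = TERMINALS.get(nxt, c)
    return nxt, reward

# Example: one noisy step from the start state (1,1).
print(sample_step((1, 1), "north"))
```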

2. Actions in grid world

• Deterministic
◦ The agent's next state is completely determined by the current state and action
▪ Once the policy is fixed, only a single episode is possible
• Stochastic
◦ Even if the agent takes the same action in the same state, randomness allows a variety of outcomes
▪ Even when the policy is fixed, multiple episodes are possible

3. Markov property

• Stochastic process
◦ A collection of random variables indexed by time
◦ Divided into discrete-time and continuous-time processes
• Markov process
◦ A stochastic process that satisfies the Markov property is called a Markov process
◦ What does satisfying the Markov property mean?
▪ Given the current state $S_t = s$, the probability that the next state is $S_{t+1} = s'$ is independent of all past states:
$P(S_{t+1} = s' \mid S_t = s) = P(S_{t+1} = s' \mid S_0 = s_0, \dots, S_t = s)$
• State transition probability
◦ The probability of moving from the current state $s$ to the next state $s'$
• State transition probability matrix
◦ $P_{ij} = P_{s_i s_j} = p(s_j \mid s_i) = P(S_{t+1} = s_j \mid S_t = s_i)$
◦ Each row sums to 1, since it is the sum of the probabilities of moving from the current state to every possible next state
As a result, a Markov process can be expressed as a tuple $(S, P)$, where $S$ is the set of states and $P$ is the state transition probability matrix.
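As a quick illustration of the $(S, P)$ form, here is a toy three-state Markov process; the states and probabilities are made up for the example, and the only point is that $P_{ij} = P(S_{t+1} = s_j \mid S_t = s_i)$ and that every row sums to 1.

```python
# Toy 3-state Markov process (S, P); the numbers are illustrative only.
import numpy as np

states = ["s1", "s2", "s3"]
P = np.array([
    [0.7, 0.2, 0.1],   # transition probabilities out of s1
    [0.0, 0.5, 0.5],   # transition probabilities out of s2
    [0.3, 0.3, 0.4],   # transition probabilities out of s3
])

# Each row is a distribution over next states, so it must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# P[i, j] = P(S_{t+1} = s_j | S_t = s_i), e.g. p(s2 | s1):
print(P[states.index("s1"), states.index("s2")])   # 0.2
```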

Markov Decision Process

1. MDP

• MP
◦ (State, Transition probability)
• MDP
◦ Adds (action, reward, discount factor)
◦ Every state satisfies the Markov property
• Transition probability
◦ $P^a_{ss'} = p(s' \mid s, a) = P(S_{t+1} = s' \mid S_t = s, A_t = a)$
• Reward function
◦ The criterion for choosing actions
◦ $R^a_{ss'}$: the reward obtained when taking an action in the current state and transitioning to the next state
◦ $R_s$: the reward obtained simply for being in the current state
◦ $R^a_s$: the reward obtained when taking an action in the current state
• $\gamma$: the discount factor, a value between 0 and 1
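Putting the pieces together, an MDP is the tuple $(S, A, P, R, \gamma)$; a minimal container sketch with illustrative field names (not from the note):

```python
# A minimal container for an MDP tuple (S, A, P, R, gamma); field names are illustrative.
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = Tuple[int, int]
Action = str

@dataclass
class MDP:
    states: List[State]                                   # S
    actions: List[Action]                                 # A
    P: Dict[Tuple[State, Action], Dict[State, float]]     # P^a_{ss'} = p(s' | s, a)
    R: Dict[Tuple[State, Action, State], float]           # R^a_{ss'}
    gamma: float                                          # discount factor in [0, 1]
```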

2. Environment model

• The environment model in an MDP
◦ The transition probability
◦ Transition probability known ⇒ the MDP is known ⇒ model-based ⇒ dynamic programming ⇒ optimal policy
◦ Transition probability unknown ⇒ the MDP is unknown ⇒ model-free ⇒ reinforcement learning ⇒ optimal policy

3. Optimal policy in Grid world example

The setting is the same grid world as in the first section: the same states and actions, reward $+1$ at $(4,3)$ and $-1$ at $(4,2)$ with a small negative per-move reward $c$, noisy 80%/10%/10% movement, $(2,2)$ blocking movement, and terminal states $(4,2)$ and $(4,3)$. The goal is again to find a policy that maximizes the total sum of rewards.
• Optimal policy ($\pi_*$)
◦ The information about which action to choose in each state
◦ In the example above, the policy is the function that decides the action taken in each of the 9 states
▪ Because of the state transition probability, the resulting state can differ even when the same action is chosen
▪ Therefore a single policy can generate multiple episodes
▪ Accordingly, the policy that maximizes the expected value of the sum of rewards, rather than the simple sum, is the optimal policy, written $\pi_*$
⇒ The figure above shows a different optimal policy for each model
• Models that differ in their negative reward are different models, so their optimal policies differ
• If the value of the small negative reward $c$ is large (close to zero), the penalty given for each move is small (see the sketch after this list)
◦ In case 1, the optimal action at (3,2) becomes west
▪ This completely rules out any possibility of reaching the $-1$ terminal
▪ The agent takes the west action and hopes that the transition probability pushes it north
▪ For the same reason, the optimal action at (4,1) is south
◦ In case 2, $c$ is smaller than in case 1, so the penalty accumulated as the number of steps grows becomes larger
▪ Therefore the optimal actions at (3,2) and (4,1) change
◦ Finally, in case 4, the per-step penalty is $-2$, which is very large
▪ The optimal policy therefore tries to terminate the episode as quickly as possible
▪ In particular, east is treated as the optimal action at (3,2)
▪ If north were treated as optimal instead, the best achievable reward would be $-1$, and the probability of obtaining that $-1$ would be $0.8 \times 0.8 = 0.64$; hence the optimal action at (3,2) is east
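A rough way to check how the per-move reward changes the greedy behaviour is to run value iteration on the same grid. The snippet below restates the dynamics so it stands alone; the choices of $\gamma = 1$, the sweep count, and the sample values of $c$ are illustrative assumptions, not taken from the note.

```python
# Sketch: value iteration on the 4x3 grid world for different per-move rewards c.
# Assumptions: gamma = 1, 100 sweeps, and the 80/10/10 noise model described above.
WALL, TERMINALS = (2, 2), {(4, 3): 1.0, (4, 2): -1.0}
MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
SIDES = {"north": ("west", "east"), "south": ("east", "west"),
         "east": ("north", "south"), "west": ("south", "north")}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4)
          if (x, y) != WALL and (x, y) not in TERMINALS]

def move(s, a):
    """One deterministic move; bumping into the wall or the border keeps the state."""
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    inside = 1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3
    return nxt if inside and nxt != WALL else s

def q(s, a, V, c):
    """Expected one-step return: per-move reward c plus the (terminal or estimated) next value."""
    total = 0.0
    for act, p in [(a, 0.8), (SIDES[a][0], 0.1), (SIDES[a][1], 0.1)]:
        nxt = move(s, act)
        total += p * (c + TERMINALS.get(nxt, V.get(nxt, 0.0)))
    return total

def greedy_policy(c, sweeps=100):
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):                       # value iteration sweeps
        V = {s: max(q(s, a, V, c) for a in MOVES) for s in STATES}
    return {s: max(MOVES, key=lambda a: q(s, a, V, c)) for s in STATES}

for c in (-0.01, -0.04, -2.0):                    # illustrative per-move rewards
    print(c, greedy_policy(c)[(3, 2)])            # how the greedy action at (3,2) shifts with c
```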

Reward, Policy

1. Reward

• Scalar feedback
• Indicates how appropriate the agent's action at time step t was
⇒ The agent's goal is to maximize the cumulative sum of rewards
• Reward hypothesis
◦ Every goal in reinforcement learning can be expressed as maximizing the expected value of the cumulative sum of rewards
◦ The expectation is used because, even when a single policy is followed, the transition probability allows many different episodes to occur

2. Known dynamics

• If we assume the dynamics $p(s', r \mid s, a)$ are known for every transition, the transition probability, the reward, and so on can all be computed from them
• Transition probability
◦ By introducing the reward obtainable while transitioning from the current state $s$ to the next state $s'$ under action $a$, the transition probability can be expressed in terms of the dynamics (marginalization)
• Expected reward (state-action pair)
◦ Write the expectation out using the definition of expectation
◦ $p(r \mid s, a)$ is the probability of each reward obtainable when choosing action $a$ in the current state $s$
▪ Therefore, via marginalization, it can be expressed as a sum over the reward $r$ received in state $s$ under action $a$ and over every reachable next state $s'$
• Expected reward (state-action-next_state)
◦ Write the expectation out using the definition of expectation
◦ Express $p(r \mid s')$ using the definition of conditional probability
◦ Add the $(s, a)$ conditions at the same time
⇒ Since the dynamics term appears in the derivation of each expression, knowing the dynamics makes the transition probability and the reward computable.
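For reference, the identities that the steps above describe can be written out as:

$$
\begin{aligned}
p(s' \mid s, a) &= \sum_{r} p(s', r \mid s, a) \\
r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] &= \sum_{r} r \sum_{s'} p(s', r \mid s, a) \\
r(s, a, s') = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] &= \sum_{r} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}
\end{aligned}
$$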

3. Return

• The return is the cumulative sum of the discounted rewards from time step t until the end of the episode (written out after this list)
• The discount factor $\gamma$ is a real number between 0 and 1 that expresses the uncertainty of future rewards
◦ Depending on its magnitude, it can express either a myopic or a far-sighted model
◦ Reasons for using a discount factor
▪ It prevents the return from diverging to infinity
▪ It expresses the uncertainty of the future
▪ It is sometimes omitted when every sequence is guaranteed to terminate
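In symbols (the standard definition, consistent with the description above):

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = R_{t+1} + \gamma G_{t+1}
$$

The recursive form $G_t = R_{t+1} + \gamma G_{t+1}$ is what the value functions in the following sections build on.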

4. Policy

• Stochastic policy
◦ A probability distribution over the actions taken in a given state
◦ $\pi(a \mid s) = P(A_t = a \mid S_t = s)$
• Deterministic policy
◦ $\pi(s) = a$
A policy is a guideline for action selection: which action to take in each state in order to maximize the total discounted reward. Because every state of an MDP satisfies the Markov property, a policy in an MDP depends only on the current state.
• Case 1) Known MDP
◦ A deterministic optimal policy $\pi_*(s)$ exists
• Case 2) Unknown MDP
◦ Stochastic policies must be considered, and the $\epsilon$-greedy method is used
The $\epsilon$-greedy method selects a random action with probability $\epsilon$ and the greedy (currently optimal) action with probability $1 - \epsilon$, so that actions that have not yet been experienced can also be considered even though they are not optimal under the current samples. The probability $\epsilon$ is divided equally among all selectable actions.
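A minimal sketch of $\epsilon$-greedy selection; `q_values` is a hypothetical placeholder for whatever action-value estimates the agent currently has.

```python
# Minimal epsilon-greedy action selection over a dict of current action-value estimates.
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick uniformly at random, otherwise pick the greedy action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))      # exploration: epsilon split equally over actions
    return max(q_values, key=q_values.get)        # exploitation: greedy action

# Example: east currently looks best, but the other actions are still tried occasionally.
print(epsilon_greedy({"north": 0.1, "south": -0.2, "east": 0.5, "west": 0.0}))
```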

5. Notation

$v_\pi(s)$ : State-value function → the expected value of all possible returns when following policy $\pi$ from the current state
$v_*(s)$ : Optimal state-value function → the value when following the optimal policy $\pi_*$ from the current state
$q_\pi(s,a)$ : Action-value function → the value when choosing action $a$ in the current state and following $\pi$ thereafter
$q_*(s,a)$ : Optimal action-value function → the value when choosing action $a$ in the current state and following $\pi_*$ thereafter
In each case, the value means the expected return.
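Written as equations (the standard definitions matching the notation above):

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi[\, G_t \mid S_t = s \,], & v_*(s) &= \max_\pi v_\pi(s), \\
q_\pi(s, a) &= \mathbb{E}_\pi[\, G_t \mid S_t = s, A_t = a \,], & q_*(s, a) &= \max_\pi q_\pi(s, a)
\end{aligned}
$$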