
Bellman Equation

Created
2024/07/11 15:16
Last edited
2025/11/11 08:14
Tags
Reinforcement Learning
Author

1. State-Value Function

β€’
v_{\pi}(s) = E_{\pi}[G_t \mid S_t = s]
β€’
A function representing the value of the current state s, defined as the expected return when starting from state s and following policy Ο€.

2. Transformation using conditional expectation and marginalization

β€’
Expectation
β—¦
E(x) = \sum_x x \cdot p(x)
β€’
Marginalization
β—¦
Marginalization is a method for computing the probability density function of a specific variable when the joint probability density function is known.
β—¦
P(x) = \sum_y p(x, y) = \sum_y p(x \mid y) \cdot p(y)
β€’
Law of Total Probability
β—¦
\bigcup_i B_i = S, \quad B_i \cap B_j = \emptyset \ (i \neq j)
β—¦
P(A) = \sum_n P(A \cap B_n) = \sum_n P(A \mid B_n) \cdot P(B_n)
β€’
Transformation of Expectation
β—¦
E(x) = \sum_x x \sum_y p(x \mid y) \cdot p(y) = \sum_y E(X \mid Y = y) \cdot p(y)
β€’
Extension to Conditional Expectation
β—¦
E(x \mid z) = \sum_x x \cdot p(x \mid z)
β—¦
E(x \mid z) = \sum_y E(X \mid Y = y, Z = z) \cdot p(y \mid z)
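The conditional tower identity above can be checked numerically. The sketch below uses a small, entirely made-up joint distribution p(x, y, z) and confirms that E(x|z) computed directly agrees with the form marginalized over y:

```python
# Numeric check of the conditional tower rule:
# E[X | Z=z] = sum_y E[X | Y=y, Z=z] * p(y | z)
from itertools import product

xs, ys, zs = [0, 1], [0, 1], [0, 1]
# Hypothetical joint probabilities p(x, y, z); they sum to 1.
p = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05,
    (0, 1, 0): 0.15, (0, 1, 1): 0.10,
    (1, 0, 0): 0.20, (1, 0, 1): 0.05,
    (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}

def p_z(z):
    return sum(p[(x, y, z)] for x, y in product(xs, ys))

def e_x_given_z(z):
    # Direct definition: E[X|Z=z] = sum_x x * p(x|z)
    return sum(x * p[(x, y, z)] for x, y in product(xs, ys)) / p_z(z)

def e_x_given_z_via_y(z):
    # Tower rule: marginalize over Y inside the conditional expectation.
    total = 0.0
    for y in ys:
        p_yz = sum(p[(x, y, z)] for x in xs)               # p(y, z)
        e_xyz = sum(x * p[(x, y, z)] for x in xs) / p_yz   # E[X|Y=y,Z=z]
        total += e_xyz * (p_yz / p_z(z))                   # weight by p(y|z)
    return total

for z in zs:
    assert abs(e_x_given_z(z) - e_x_given_z_via_y(z)) < 1e-12
```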
β€’
Transformation of State Value Function
β—¦
v_\pi(s) = \sum_a E_{\pi}[G_t \mid S_t = s, A_t = a] \cdot p(a \mid s)
β€’
Replace each term with the action-value function and the policy
β—¦
E_{\pi}[G_t \mid S_t = s, A_t = a] = q_{\pi}(s, a), \quad p(a \mid s) = \pi(a \mid s)
β—¦
v_\pi(s) = \sum_a \pi(a \mid s) \cdot q_\pi(s, a)
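The resulting identity, v_Ο€(s) as the policy-weighted average of q_Ο€(s, a), can be illustrated with toy numbers (the policy and q-values below are made up for illustration):

```python
# A minimal sketch of v_pi(s) = sum_a pi(a|s) * q_pi(s, a)
# for a single state with two actions; all numbers are hypothetical.
pi = {"left": 0.4, "right": 0.6}   # policy pi(a|s)
q = {"left": 1.0, "right": 3.0}    # action values q_pi(s, a)

v = sum(pi[a] * q[a] for a in pi)  # state value as policy-weighted q
print(v)  # 0.4*1.0 + 0.6*3.0 = 2.2
```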

3. Transformation using Reward, Return, Transition Probability

β€’
Expected reward for state-action pair
β—¦
R^a_s = E[R_{t+1} \mid S_t = s, A_t = a] = \sum_r r \sum_{s'} p(s', r \mid s, a)
β€’
Transition Probability
β—¦
P^a_{ss'} = \sum_r p(s', r \mid s, a)
β€’
Return
β—¦
G_t = R_{t+1} + \gamma R_{t+2} + \cdots
β€’
Express Action Value Function using Return
β—¦
E_{\pi}[G_t \mid S_t = s, A_t = a] = E[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]
β€’
Transformation using the action-value function expressed through marginalization and return
β—¦
\sum_{a} \pi(a \mid s) \cdot \sum_{s', r} \mathbb{E}_\pi \left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s', R_{t+1} = r \right] \cdot P\left(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a \right)
By 1. the linearity of expectation and 2. the Markov property, the state-value function can be expressed as follows.
β—¦
\sum_{a} \pi(a \mid s) \cdot \sum_{s', r} p(s', r \mid s, a) \cdot \left[ r + \gamma \mathbb{E}_\pi \left[ G_{t+1} \mid S_{t+1} = s' \right] \right]
3. Distribute p(s', r \mid s, a)
β—¦
\sum_r r \sum_{s'} p(s', r \mid s, a) + \gamma \sum_{s', r} p(s', r \mid s, a) E_\pi[G_{t+1} \mid S_{t+1} = s']
Replace the terms using 1. the expected reward for the state-action pair and 2. the transition probability defined above
β—¦
\sum_{a} \pi(a \mid s) \left[ R_s^a + \gamma \sum_{s'} P_{ss'}^a V_\pi(s') \right]
β€’
Expressed in terms of conditional expectation.
β—¦
\mathbb{E}_\pi \left[ R_{t+1} + \gamma V_\pi(S_{t+1}) \mid S_t = s \right]
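The Bellman expectation equation above suggests a fixed-point iteration, commonly called iterative policy evaluation. The sketch below applies the backup to a hypothetical two-state MDP; all states, actions, transitions, and rewards are invented for illustration:

```python
# Iterative policy evaluation using the Bellman expectation equation
# v_pi(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v_pi(s')].
gamma = 0.9
states, actions = ["s0", "s1"], ["a0", "a1"]
# p[(s, a)] -> list of (prob, next_state, reward); probabilities sum to 1.
p = {
    ("s0", "a0"): [(1.0, "s1", 1.0)],
    ("s0", "a1"): [(1.0, "s0", 0.0)],
    ("s1", "a0"): [(1.0, "s0", 2.0)],
    ("s1", "a1"): [(1.0, "s1", 0.0)],
}
pi = {s: {"a0": 0.5, "a1": 0.5} for s in states}  # uniform random policy

v = {s: 0.0 for s in states}
for _ in range(1000):  # repeated backups converge to the fixed point v_pi
    v = {
        s: sum(
            pi[s][a] * sum(pr * (r + gamma * v[s2]) for pr, s2, r in p[(s, a)])
            for a in actions
        )
        for s in states
    }
```

With these toy dynamics the iteration converges to v_Ο€(s0) = 7.25 and v_Ο€(s1) = 7.75, which satisfy the equation exactly.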

4. Action Value Function

β€’
q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a]
β€’
The expected return when taking action a in the current state s and thereafter following policy Ο€.
β€’
Transformation using marginalization and expectation
β—¦
\sum_{s', r} \mathbb{E}_\pi \left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s', R_{t+1} = r \right] p(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)
Here, by the Markov property, the conditions S_t = s, A_t = a no longer affect the inner expectation once S_{t+1} and R_{t+1} are given, so they can be dropped. Expanding the expression further by the linearity of expectation gives the following.
β—¦
\sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \mathbb{E}_\pi \left[ G_{t+1} \mid S_{t+1} = s' \right] \right]
μœ„ μ‹μ˜ κΈ°λŒ“κ°’ 파트λ₯Ό λ‹€μ‹œ λ§ˆμ§„ν™”λ₯Ό μ‚¬μš©ν•˜μ—¬ ν‘œν˜„ν•˜λ©΄ μ•„λž˜μ™€ κ°™λ‹€.
β—¦
\sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \mathbb{E}_\pi \left[ G_{t+1} \mid S_{t+1} = s', A_{t+1} = a' \right] p(A_{t+1} = a' \mid S_{t+1} = s') \right]
Rewriting the expression using the definitions of the policy and the action-value function gives the following.
β—¦
\sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s') q_\pi(s', a') \right]
By the definition of expectation, this can be expressed concisely as follows.
β—¦
\mathbb{E}_\pi \left[ R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]
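The action-value form also defines a fixed-point iteration: repeatedly applying q(s, a) ← Ξ£ p(s', r|s, a)[r + Ξ³ Ξ£ Ο€(a'|s') q(s', a')] converges to q_Ο€. A minimal sketch on a hypothetical deterministic two-state MDP (all dynamics invented), which also checks consistency with v_Ο€(s) = Ξ£ Ο€(a|s) q_Ο€(s, a):

```python
# Iterative evaluation of q_pi via its Bellman equation.
gamma = 0.9
# model[(s, a)] -> (next_state, reward); deterministic for brevity
model = {
    ("s0", "a0"): ("s1", 1.0),
    ("s0", "a1"): ("s0", 0.0),
    ("s1", "a0"): ("s0", 2.0),
    ("s1", "a1"): ("s1", 0.0),
}
pi = {"s0": {"a0": 0.5, "a1": 0.5}, "s1": {"a0": 0.5, "a1": 0.5}}

q = {sa: 0.0 for sa in model}
for _ in range(1000):  # backup: r + gamma * E_pi[q at the next state]
    q = {
        (s, a): r + gamma * sum(pi[s2][a2] * q[(s2, a2)] for a2 in pi[s2])
        for (s, a), (s2, r) in model.items()
    }

# Consistency with the state-value form: v_pi(s) = sum_a pi(a|s) q_pi(s, a)
v = {s: sum(pi[s][a] * q[(s, a)] for a in pi[s]) for s in pi}
```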

Bellman Optimality Equation

Optimal Value Function/Optimal Policy

μ •μ˜

β€’
v_*(s) = \max_\pi \, v_\pi(s)
The state-value function obtained when applying the policy that, over all policies, maximizes the state-value function.
β€’
q_*(s, a) = \max_\pi \, q_\pi(s, a)
The action-value function obtained when applying the policy that, over all policies, maximizes the action-value function.
β€’
\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max\limits_a q_*(s, a) \\ 0 & \text{otherwise} \end{cases}
β†’ Simply picking, for each state, the action that maximizes an arbitrary action-value function is not guaranteed to be the optimal policy. The optimal action-value function q_*(s, a), which exists for every available action in each state, represents the value of that state-action pair when taking action a and following the optimal policy afterwards. Therefore the policy must select the action that maximizes the optimal action-value function.
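The case definition above, assigning probability 1 to the argmax action and 0 elsewhere, can be sketched as greedy policy extraction (the q_* values below are hypothetical):

```python
# Extracting the deterministic greedy policy pi_*(a|s) from q_*(s, a):
# probability 1 on argmax_a q_*(s, a), 0 on every other action.
q_star = {  # hypothetical optimal action values per state
    "s0": {"a0": 7.975, "a1": 6.525},
    "s1": {"a0": 8.525, "a1": 6.975},
}

def greedy_policy(q):
    policy = {}
    for s, qa in q.items():
        best = max(qa, key=qa.get)  # argmax_a q_*(s, a)
        policy[s] = {a: 1.0 if a == best else 0.0 for a in qa}
    return policy

pi_star = greedy_policy(q_star)
```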

Transformation of the Optimal Value Function

β€’
Expressed via the definition of the Optimal State Value Function
β—¦
v_*(s) = \max_a \mathbb{E}_{\pi_*} \left[ G_t \mid S_t = s, A_t = a \right]
β€’
Transformation using the Return / Expected Reward / Transition Probability
β—¦
\max_a \mathbb{E}_{\pi_*} \left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a \right]
Under the optimal policy, the expectation of the return received from the next state onward is exactly the optimal state-value function, so it can be replaced with v_*.
β—¦
\max_a \mathbb{E} \left[ R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a \right]
μœ„ μˆ˜μ‹μ„ κΈ°λŒ“κ°’μ˜ μ •μ˜μ— μ˜ν•΄ ν’€λ©΄ μ•„λž˜μ™€ κ°™λ‹€.
β—¦
\max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]
β€’
Expected reward for state-action pair
β—¦
R^a_s = E[R_{t+1} \mid S_t = s, A_t = a] = \sum_r r \sum_{s'} p(s', r \mid s, a)
β€’
Transition Probability
β—¦
P^a_{ss'} = \sum_r p(s', r \mid s, a)
Using these, the expression simplifies as follows.
β—¦
\max_a \left[ R_s^a + \gamma \sum_{s'} P_{ss'}^a V_*(s') \right]
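This max-form backup is the update applied in value iteration. A minimal sketch on a hypothetical deterministic two-state MDP (all dynamics invented for illustration):

```python
# Value iteration with the Bellman optimality backup
# v_*(s) = max_a [ R_s^a + gamma * sum_{s'} P_{ss'}^a v_*(s') ].
gamma = 0.9
model = {  # model[(s, a)] -> (next_state, reward), deterministic
    ("s0", "a0"): ("s1", 1.0),
    ("s0", "a1"): ("s0", 0.0),
    ("s1", "a0"): ("s0", 2.0),
    ("s1", "a1"): ("s1", 0.0),
}
states, actions = ["s0", "s1"], ["a0", "a1"]

v = {s: 0.0 for s in states}
for _ in range(1000):  # repeated max-backups converge to v_*
    v = {
        s: max(
            model[(s, a)][1] + gamma * v[model[(s, a)][0]]  # r + gamma v(s')
            for a in actions
        )
        for s in states
    }
```

For these toy dynamics the optimal behavior is to always take a0, alternating between the two states, and the iteration converges to v_*(s0) = 2.8/0.19 β‰ˆ 14.74 and v_*(s1) β‰ˆ 15.26.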
β€’
Expressed via the definition of the Optimal Action Value Function
β—¦
q_*(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a \right]
β€’
κΈ°λŒ“κ°’μ˜ μ •μ˜λ₯Ό ν™œμš©ν•΄ λ³€ν˜•
β—¦
\sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]
β€’
Transformation using the Expected Reward / Transition Probability
β—¦
R_s^a + \gamma \sum_{s'} P_{ss'}^a \max_{a'} q_*(s', a') = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_*(s')
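The optimality equation for q_* likewise defines a fixed-point iteration (Q-value iteration), and the converged q_* recovers v_*(s) = max_a q_*(s, a), matching the substitution on the right-hand side above. A sketch with hypothetical deterministic two-state dynamics:

```python
# Q-value iteration with the optimality backup
# q_*(s, a) = R_s^a + gamma * sum_{s'} P_{ss'}^a max_{a'} q_*(s', a').
gamma = 0.9
model = {  # model[(s, a)] -> (next_state, reward), deterministic
    ("s0", "a0"): ("s1", 1.0),
    ("s0", "a1"): ("s0", 0.0),
    ("s1", "a0"): ("s0", 2.0),
    ("s1", "a1"): ("s1", 0.0),
}
actions = ["a0", "a1"]

q = {sa: 0.0 for sa in model}
for _ in range(1000):  # backup: r + gamma * max over next-state actions
    q = {
        (s, a): r + gamma * max(q[(s2, a2)] for a2 in actions)
        for (s, a), (s2, r) in model.items()
    }

# v_*(s) = max_a q_*(s, a), as used in the right-hand side above
v_star = {s: max(q[(s, a)] for a in actions) for s in ["s0", "s1"]}
```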