1. State-Value Function
• $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
• A function representing the value of the current state $s$, defined as the expected return $G_t$ when starting from state $s$ and following policy $\pi$.
2. Transformation using conditional expectation and marginalization
• Expectation
◦ $\mathbb{E}[X] = \sum_x x\, p(x)$
• Marginalization
◦ Marginalization is a method for computing the probability density function of a specific variable when the joint probability density function is known.
◦ $p(x) = \sum_y p(x, y)$
• Law of Total Probability
◦ $p(x) = \sum_y p(x \mid y)\, p(y)$
◦ This follows from marginalization together with the definition of conditional probability, $p(x, y) = p(x \mid y)\, p(y)$.
• Transformation of Expectation
◦ $\mathbb{E}[X] = \sum_x x\, p(x) = \sum_x x \sum_y p(x \mid y)\, p(y) = \sum_y p(y) \sum_x x\, p(x \mid y) = \sum_y p(y)\, \mathbb{E}[X \mid Y = y] = \mathbb{E}_Y\big[\mathbb{E}[X \mid Y]\big]$
• Extension to Conditional Expectation
◦ The same argument applies when every term is additionally conditioned on $Z = z$:
◦ $\mathbb{E}[X \mid Z = z] = \sum_y p(y \mid z)\, \mathbb{E}[X \mid Y = y, Z = z]$
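The identity $\mathbb{E}[X] = \mathbb{E}_Y[\mathbb{E}[X \mid Y]]$ above can be checked numerically. A minimal sketch, assuming a made-up joint distribution over two binary variables (all probabilities are illustrative):

```python
# Numeric check of the law of total expectation, E[X] = E_Y[ E[X | Y] ],
# on a small made-up joint distribution p(x, y).
p = {  # p[(x, y)] = joint probability
    (0, 0): 0.1, (0, 1): 0.2,
    (1, 0): 0.3, (1, 1): 0.4,
}

# Direct expectation: E[X] = sum_x x * p(x), with p(x) by marginalization.
p_x = {}
for (x, y), pr in p.items():
    p_x[x] = p_x.get(x, 0.0) + pr
e_x = sum(x * pr for x, pr in p_x.items())

# Iterated expectation: E_Y[E[X | Y]] = sum_y p(y) * sum_x x * p(x | y).
p_y = {}
for (x, y), pr in p.items():
    p_y[y] = p_y.get(y, 0.0) + pr
e_x_iterated = sum(
    p_y[y] * sum(p[(x, y)] / p_y[y] * x for x in (0, 1))
    for y in (0, 1)
)

print(abs(e_x - e_x_iterated) < 1e-9)  # True: both equal E[X] = 0.7
```

The same computation carries over unchanged when every term is conditioned on an extra variable $Z = z$, which is the form used below.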
• Transformation of State Value Function
◦ $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \sum_a p(a \mid s)\, \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
• Replacing each term with the action-value function and the policy
◦ Since $\pi(a \mid s) = p(a \mid s)$ and $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$:
◦ $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$
3. Transformation using Reward, Return, Transition Probability
• Expected reward for a state-action pair
◦ $r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
• Transition Probability
◦ $p(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$
• Return
◦ $G_t = R_{t+1} + \gamma G_{t+1}$
• Expressing the action-value function using the return
◦ $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$
• Transformation using the action-value function expressed through marginalization and the return
◦ $v_\pi(s) = \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$
1. By the additivity of expectation, and
2. By the Markov property (given $S_{t+1}$, the return $G_{t+1}$ no longer depends on $S_t$ and $A_t$),
the state-value function can be expressed as follows.
◦ $v_\pi(s) = \sum_a \pi(a \mid s) \Big( \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a] \Big)$
3. Distribute.
◦ $v_\pi(s) = \sum_a \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] + \gamma \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a]$
Replace each term with 1. the expected reward for the state-action pair and 2. the transition probability, using $\mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a] = \sum_{s'} p(s' \mid s, a)\, v_\pi(s')$.
◦ $v_\pi(s) = \sum_a \pi(a \mid s) \Big( r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \Big)$
• Expressed in terms of conditional expectation:
◦ $v_\pi(s) = \mathbb{E}_\pi\big[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\big]$
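The Bellman expectation equation derived above can be sanity-checked numerically. A minimal sketch, assuming a made-up two-state, two-action MDP (all rewards, transition probabilities, and policy values are illustrative): repeatedly applying the right-hand side of the equation converges to a fixed point, which is $v_\pi$.

```python
# Iterative policy evaluation: apply the Bellman expectation equation
#   v(s) <- sum_a pi(a|s) * ( r(s,a) + gamma * sum_s' p(s'|s,a) * v(s') )
# until convergence, then check the result is a fixed point of the equation.
gamma = 0.9
states, actions = [0, 1], [0, 1]
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}   # r(s, a), made up
p = {  # p[(s, a)] = [p(s'=0 | s, a), p(s'=1 | s, a)], made up
    (0, 0): [0.8, 0.2], (0, 1): [0.1, 0.9],
    (1, 0): [0.5, 0.5], (1, 1): [0.3, 0.7],
}
pi = {0: [0.5, 0.5], 1: [0.2, 0.8]}  # pi[s] = [pi(a=0|s), pi(a=1|s)], made up

def backup(v):
    """One application of the Bellman expectation operator."""
    return [
        sum(pi[s][a] * (r[(s, a)] + gamma * sum(p[(s, a)][s2] * v[s2]
                                                for s2 in states))
            for a in actions)
        for s in states
    ]

v = [0.0, 0.0]
for _ in range(1000):
    v = backup(v)

residual = max(abs(x - y) for x, y in zip(v, backup(v)))
print(residual < 1e-9)  # True: v satisfies the Bellman expectation equation
```

Because $\gamma < 1$ the operator is a contraction, so the fixed point is unique and the iteration converges from any starting point.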
4. Action Value Function
• $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
• It is the expected return when taking action $a$ in the current state $s$ and following policy $\pi$ thereafter.
• Transformation using marginalization and expectation
◦ $q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$
Here, by the Markov property, the history before time $t$ does not affect the condition, so the expression above holds. In addition, expanding the expression by the additivity of expectation gives the following.
◦ $q_\pi(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a]$
Expressing the expectation part of the above expression once more using marginalization gives the following.
◦ $q_\pi(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a'} p(a' \mid s')\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s', A_{t+1} = a']$
Rewriting the expression using the definitions of the policy and the action-value function gives the following.
◦ $q_\pi(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')$
And, by the definition of expectation, this can be expressed concisely.
◦ $q_\pi(s, a) = \mathbb{E}\big[R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\big]$
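The action-value recursion above can be checked the same way. A minimal sketch on the same kind of made-up two-state, two-action MDP (all numbers illustrative): iterate the right-hand side and verify that the result is a fixed point.

```python
# Iterative evaluation of q_pi: apply
#   q(s,a) <- r(s,a) + gamma * sum_s' p(s'|s,a) * sum_a' pi(a'|s') * q(s',a')
# until convergence, then check the fixed-point property.
gamma = 0.9
states, actions = [0, 1], [0, 1]
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}   # made up
p = {(0, 0): [0.8, 0.2], (0, 1): [0.1, 0.9],               # made up
     (1, 0): [0.5, 0.5], (1, 1): [0.3, 0.7]}
pi = {0: [0.5, 0.5], 1: [0.2, 0.8]}                        # made up

def backup(q):
    """One application of the Bellman expectation operator for q."""
    return {
        (s, a): r[(s, a)] + gamma * sum(
            p[(s, a)][s2] * sum(pi[s2][a2] * q[(s2, a2)] for a2 in actions)
            for s2 in states)
        for s in states for a in actions
    }

q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(1000):
    q = backup(q)

q_next = backup(q)
residual = max(abs(q[k] - q_next[k]) for k in q)
# Consistency with the earlier identity v(s) = sum_a pi(a|s) q(s,a):
v = [sum(pi[s][a] * q[(s, a)] for a in actions) for s in states]
print(residual < 1e-9)  # True: q satisfies its Bellman expectation equation
```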
Bellman Optimality Equation
Optimal Value Function/Optimal Policy
Definition
• $v_*(s) = \max_\pi v_\pi(s)$: the state-value function obtained by applying the policy that, among all policies, maximizes the value of the state-value function.
• $q_*(s, a) = \max_\pi q_\pi(s, a)$: the action-value function obtained by applying the policy that, among all policies, maximizes the value of the action-value function.
• Simply selecting, for each state, the action at which $q_*(s, a)$ is maximized can itself be an optimal policy. This is because the optimal action-value function exists for every available action in each state, and its value is the value of that state-action pair when the action is taken in that state and the optimal policy is followed afterward. Therefore the policy should choose the action that maximizes the optimal action-value function.
Transformation of the Optimal Value Function
• Expressing the Optimal State Value Function by definition
◦ $v_*(s) = \max_a q_*(s, a)$
• Transformation using the return, expected reward, and transition probability
◦ $v_*(s) = \max_a \mathbb{E}[G_t \mid S_t = s, A_t = a] = \max_a \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$
Here, once the optimal policy is applied, the expectation of $G_{t+1}$ becomes the expected return received from the next state onward under the optimal policy, so it can be rewritten as the optimal state-value function.
◦ $v_*(s) = \max_a \mathbb{E}\big[R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s, A_t = a\big]$
Expanding the expectation in the above expression by its definition gives the following.
◦ $v_*(s) = \max_a \Big( \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] + \gamma \sum_{s'} p(s' \mid s, a)\, v_*(s') \Big)$
• Expected reward for a state-action pair
◦ $r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
• Transition Probability
◦ $p(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$
Using these to arrange the expression gives the following.
◦ $v_*(s) = \max_a \Big( r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_*(s') \Big)$
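The Bellman optimality equation for $v_*$ is the basis of value iteration. A minimal sketch, assuming a made-up two-state, two-action MDP (all numbers illustrative): iterating the right-hand side converges to $v_*$, from which a greedy policy can be read off.

```python
# Value iteration: apply the Bellman optimality equation
#   v(s) <- max_a ( r(s,a) + gamma * sum_s' p(s'|s,a) * v(s') )
# until convergence, then check the fixed point and extract a greedy policy.
gamma = 0.9
states, actions = [0, 1], [0, 1]
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}   # made up
p = {(0, 0): [0.8, 0.2], (0, 1): [0.1, 0.9],               # made up
     (1, 0): [0.5, 0.5], (1, 1): [0.3, 0.7]}

def backup(v):
    """One application of the Bellman optimality operator."""
    return [
        max(r[(s, a)] + gamma * sum(p[(s, a)][s2] * v[s2] for s2 in states)
            for a in actions)
        for s in states
    ]

v = [0.0, 0.0]
for _ in range(1000):
    v = backup(v)

residual = max(abs(x - y) for x, y in zip(v, backup(v)))
# Greedy policy: for each state, the action achieving the max above.
greedy = [
    max(actions,
        key=lambda a: r[(s, a)] + gamma * sum(p[(s, a)][s2] * v[s2]
                                              for s2 in states))
    for s in states
]
print(residual < 1e-9)  # True: v satisfies the Bellman optimality equation
```

Unlike policy evaluation, the `max` over actions makes the equation nonlinear, which is why it is solved by iteration rather than as a linear system.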
• Expressing the Optimal Action Value Function by definition
◦ $q_*(s, a) = \mathbb{E}\big[R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s, A_t = a\big]$
• Transformation using the definition of expectation
◦ Since $v_*(s') = \max_{a'} q_*(s', a')$: $q_*(s, a) = \mathbb{E}\big[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a\big]$
• Transformation using the Expected Reward and Transition Probability
◦ $q_*(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} q_*(s', a')$
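The $q_*$ form of the optimality equation can be iterated the same way (Q-value iteration). A minimal sketch on the same kind of made-up MDP (all numbers illustrative):

```python
# Q-value iteration: apply
#   q(s,a) <- r(s,a) + gamma * sum_s' p(s'|s,a) * max_a' q(s',a')
# until convergence, then check the fixed-point property and the identity
# v*(s) = max_a q*(s,a).
gamma = 0.9
states, actions = [0, 1], [0, 1]
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}   # made up
p = {(0, 0): [0.8, 0.2], (0, 1): [0.1, 0.9],               # made up
     (1, 0): [0.5, 0.5], (1, 1): [0.3, 0.7]}

def backup(q):
    """One application of the Bellman optimality operator for q."""
    return {
        (s, a): r[(s, a)] + gamma * sum(
            p[(s, a)][s2] * max(q[(s2, a2)] for a2 in actions)
            for s2 in states)
        for s in states for a in actions
    }

q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(1000):
    q = backup(q)

q_next = backup(q)
residual = max(abs(q[k] - q_next[k]) for k in q)
v_star = [max(q[(s, a)] for a in actions) for s in states]  # v*(s) = max_a q*(s,a)
print(residual < 1e-9)  # True: q satisfies the Bellman optimality equation
```

Working with $q_*$ rather than $v_*$ means the greedy action can be chosen without knowing $r$ and $p$, which is what makes model-free methods such as Q-learning possible.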