
Reinforcement Learning

Created
2024/07/16 06:17
Last modified
2025/11/11 07:15
Tags
Reinforcement Learning
Author

1. DP vs RL

• Dynamic Programming (DP)
  ◦ Model-based.
  ◦ Designed to perform the heavy computation of value updates efficiently by considering all possible next states.
• Reinforcement Learning (RL)
  ◦ Model-free.
  ◦ Updates values approximately based on sampled experiences.
  ◦ Uses the updated values to find an approximate solution to the Bellman optimality equation.
• Tabular update method
  ◦ Used in both DP and RL.
  ◦ Stores value-function estimates in a table and updates them iteratively to find the optimal policy.
  ◦ In DP, every entry in the table is updated at each iteration.
    ▪ Typically uses the state-value function for computational efficiency.
    ▪ The value function is updated greedily.
  ◦ In RL, only the entries for sampled transitions are updated at each iteration.
    ▪ Transition probabilities are unknown.
    ▪ The policy is updated using the action-value function.
    ▪ Because updates are sample-based, the cost per update does not depend on the size of the full state space.
    ▪ RL can therefore avoid the curse of dimensionality inherent in DP.
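The contrast above can be sketched on a toy two-state MDP. The MDP, its transition table `P`, and all hyperparameters below are illustrative assumptions, not from the notes: the DP sweep updates every table entry using the known model, while tabular Q-learning updates only the one sampled `(s, a)` entry per step.

```python
import random

random.seed(0)  # reproducibility of the sampled run

# Assumed toy 2-state MDP: P[s][a] = list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9

def dp_sweep(V):
    """DP (model-based): update EVERY state's value using the full model P."""
    return {
        s: max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
        for s in P
    }

def sample_step(s, a):
    """Environment simulator: the RL agent sees only (s', r), never P itself."""
    probs, nexts, rewards = zip(*P[s][a])
    i = random.choices(range(len(probs)), weights=probs)[0]
    return nexts[i], rewards[i]

def q_learning(steps=5000, alpha=0.1, eps=0.1):
    """RL (model-free): update ONLY the sampled (s, a) table entry each step."""
    Q = {(s, a): 0.0 for s in P for a in P[s]}
    s = 0
    for _ in range(steps):
        # epsilon-greedy behaviour policy over the action-value table
        a = random.choice(list(P[s])) if random.random() < eps else max(P[s], key=lambda a: Q[(s, a)])
        s2, r = sample_step(s, a)
        Q[(s, a)] += alpha * (r + GAMMA * max(Q[(s2, a2)] for a2 in P[s2]) - Q[(s, a)])
        s = s2
    return Q

V = {s: 0.0 for s in P}
for _ in range(100):   # DP: 100 synchronous full sweeps over the table
    V = dp_sweep(V)
Q = q_learning()
print("DP values:", V)
print("RL values:", {s: max(Q[(s, a)] for a in P[s]) for s in P})
```

Both approaches move toward the same optimal values; DP gets there by sweeping the whole table with known transition probabilities, while Q-learning touches one entry per sampled transition and never consults `P` directly.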

2. GPI

• Generalized Policy Iteration (GPI) performs Policy Evaluation (PE) and Policy Improvement (PI) based on sampled experiences.
• The value function is approximated from samples, and the policy is improved greedily with respect to this approximation.
• In the policy evaluation phase, values are updated only from sampled data, so evaluation cannot be iterated all the way to convergence.
  ◦ Transition probabilities and expected rewards cannot be computed exactly.
• Once the policy stabilizes, the values change very little in subsequent iterations.
  ◦ At that point the Q-function satisfies the Bellman optimality equation, indicating that the policy is optimal.
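The GPI loop can be sketched with partial, sample-based Monte Carlo evaluation alternating with greedy improvement. The tiny episodic MDP inside `step`, the step size 0.2, and the exploration rate are illustrative assumptions, not from the notes:

```python
import random

random.seed(1)  # reproducibility of the sampled run

ACTIONS = [0, 1]
GAMMA = 1.0

def step(s, a):
    """Hidden environment dynamics: returns (next_state, reward, done).
    From state 0, action 1 moves to state 1; from state 1, action 1 pays +2
    and action 0 pays +1; everything else ends the episode with reward 0."""
    if s == 0:
        return (1, 0.0, False) if a == 1 else (None, 0.0, True)
    return (None, 2.0, True) if a == 1 else (None, 1.0, True)

Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
policy = {0: 0, 1: 0}            # arbitrary initial policy

for _ in range(500):             # GPI: interleave partial PE with greedy PI
    # --- Policy Evaluation: ONE sampled episode, not iteration to convergence ---
    s, done, trajectory = 0, False, []
    while not done:
        a = random.choice(ACTIONS) if random.random() < 0.2 else policy[s]
        s2, r, done = step(s, a)
        trajectory.append((s, a, r))
        s = s2
    G = 0.0
    for (s, a, r) in reversed(trajectory):      # Monte Carlo return
        G = r + GAMMA * G
        Q[(s, a)] += 0.2 * (G - Q[(s, a)])      # move estimate toward the sample
    # --- Policy Improvement: greedy with respect to the current Q estimate ---
    for s in (0, 1):
        policy[s] = max(ACTIONS, key=lambda a: Q[(s, a)])

print("greedy policy:", policy)
```

Note that evaluation here is deliberately partial, just one sampled episode per iteration, yet the alternation still drives the policy toward greed with respect to an increasingly accurate Q, which is exactly the GPI pattern described above.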