1. DP vs RL
• Dynamic Programming
  ◦ Model-based.
  ◦ Updates values exactly by considering all possible next states, so each backup is a full expectation under the known model.
• Reinforcement Learning (RL)
  ◦ Model-free.
  ◦ Updates values approximately from sampled experience.
  ◦ Uses the updated values to find an approximate solution to the Bellman optimality equation.
• Tabular Update Method
  ◦ Common to both Dynamic Programming (DP) and Reinforcement Learning (RL).
  ◦ Stores value-function estimates in a table and updates them iteratively to find the optimal policy.
  ◦ In DP: all entries in the table are updated at each iteration.
    ▪ Typically uses the state-value function for computational efficiency.
    ▪ The value function is updated greedily.
  ◦ In RL: only the entries corresponding to sampled transitions are updated at each iteration.
    ▪ Transition probabilities are unknown.
    ▪ The policy is updated using the action-value function.
    ▪ Because updates are sample-based, the cost of each update does not grow with the size of the state space.
    ▪ Thus RL can mitigate the curse of dimensionality inherent in DP's full sweeps.
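The contrast above can be sketched on a toy two-state MDP (the states, actions, transition probabilities, and rewards below are illustrative assumptions, not from the notes): the DP loop sweeps every state using the known model, while the tabular RL loop (Q-learning here) touches only the one entry matching each sampled transition.

```python
import random

# Toy MDP (illustrative): P[s][a] = list of (prob, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

# --- DP (model-based): value iteration updates ALL states every sweep ---
V = {0: 0.0, 1: 0.0}
for _ in range(200):
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s])
         for s in P}

# --- RL (model-free): Q-learning updates ONLY the sampled entry ---
def sample(s, a):
    """Environment step, drawn from P, which the agent never inspects."""
    x, acc = random.random(), 0.0
    for p, s2, r in P[s][a]:
        acc += p
        if x <= acc:
            return s2, r
    return P[s][a][-1][1], P[s][a][-1][2]

random.seed(0)
Q = {(s, a): 0.0 for s in P for a in P[s]}
alpha, s = 0.1, 0
for _ in range(5000):
    a = random.choice(list(P[s]))                  # exploratory behavior
    s2, r = sample(s, a)
    target = r + gamma * max(Q[(s2, b)] for b in P[s2])
    Q[(s, a)] += alpha * (target - Q[(s, a)])      # one table entry only
    s = s2

V_rl = {s: max(Q[(s, a)] for a in P[s]) for s in P}
print(V)      # DP state values (exact, from the model)
print(V_rl)   # RL estimates, approximately matching V
```

The DP sweep costs work proportional to the whole state space per iteration; the RL update costs the same regardless of how many states exist, which is the point of the last bullet above.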
2. GPI
• Generalized Policy Iteration (GPI) interleaves Policy Evaluation (PE) and Policy Improvement (PI) based on sampled experience.
• The value function is approximated from samples, and the policy is improved greedily with respect to this approximation.
• In the policy evaluation phase, values are updated only from sampled data, so evaluation cannot be iterated all the way to convergence.
  ◦ Transition probabilities and rewards cannot be computed exactly.
• Once the policy stabilizes, the values change very little in subsequent iterations.
  ◦ At this point, the Q-function satisfies the Bellman optimality equation, so the greedy policy is optimal.
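The GPI loop described above can be sketched as follows. The environment (a three-state chain with a reward for stepping off the right end) and all constants are illustrative assumptions: each sweep does a short, truncated run of sample-based evaluation, then a greedy improvement step, and the loop stops once the policy stabilizes.

```python
import random

random.seed(0)

# Toy deterministic chain (illustrative): states 0,1,2; action 1 moves right,
# action 0 moves left; stepping right off state 2 gives reward 1 and resets to 0.
def step(s, a):
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return (0, 1.0) if s2 == 3 else (s2, 0.0)

gamma, alpha = 0.9, 0.2
Q = {(s, a): 0.0 for s in range(3) for a in (0, 1)}
policy = {s: 0 for s in range(3)}
s = 0

for sweep in range(20):
    # PE phase: a short run of sampled Q-updates -- no iteration to
    # convergence, and the transition model is never consulted.
    for _ in range(300):
        a = random.choice((0, 1))                      # exploratory behavior
        s2, r = step(s, a)
        target = r + gamma * max(Q[(s2, b)] for b in (0, 1))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
    # PI phase: greedy improvement from the current Q-table.
    new_policy = {s: max((0, 1), key=lambda a: Q[(s, a)]) for s in range(3)}
    if new_policy == policy:   # policy stabilized: Q is near Bellman-optimal
        break
    policy = new_policy

print(policy)   # greedy policy moves right in every state
```

When `new_policy` stops changing between sweeps, further evaluation barely moves the Q-values, which is the stabilization condition the last bullet describes.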