1. Double DQN
- Target in Q-learning
  - y = r + γ max_{a'} Q(s', a'; θ)
  - Since max_{a'} Q(s', a'; θ) is an estimated value, a high value does not necessarily indicate the best action; it may simply be high by chance.
- Addressing Q-value overestimation bias
  - In the target calculation, the action selection and the Q-value evaluation of the selected action are performed by separate networks.
- Loss function
  - L(θ) = E[(y - Q(s, a; θ))²], with the Double DQN target y = r + γ Q(s', argmax_{a'} Q(s', a'; θ); θ⁻)
  - Action selection
    - argmax_{a'} Q(s', a'; θ): the online network, parameterized by θ
  - Q-value evaluation
    - Q(s', a'; θ⁻): the target network, parameterized by θ⁻
  - Since it is unlikely that the same action simultaneously has the highest Q-value in both networks, the overestimation problem is alleviated.
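The target computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full implementation: the function name and the assumption that the two networks' Q-values at s' are given as (batch, n_actions) arrays are my own.

```python
import numpy as np

# Sketch of the Double DQN target:
#   y = r + gamma * Q(s', argmax_a' Q(s', a'; theta); theta^-)
# q_online_next / q_target_next: (batch, n_actions) Q-values at s'.
def double_dqn_target(rewards, q_online_next, q_target_next, dones, gamma=0.99):
    best_actions = q_online_next.argmax(axis=1)   # select action with online net (theta)
    batch = np.arange(len(rewards))
    q_eval = q_target_next[batch, best_actions]   # evaluate it with target net (theta^-)
    return rewards + gamma * (1.0 - dones) * q_eval

rewards = np.array([1.0, 0.0])
q_online_next = np.array([[0.2, 0.9], [0.5, 0.1]])
q_target_next = np.array([[0.3, 0.4], [0.6, 0.2]])
dones = np.array([0.0, 1.0])
y = double_dqn_target(rewards, q_online_next, q_target_next, dones)
print(y)  # first entry: 1 + 0.99 * 0.4; second entry: 0 (terminal)
```

Note that the online network picks action 1 for the first sample (0.9 > 0.2), but the value actually used is the target network's 0.4, not the online network's own 0.9.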
2. Overestimation
- Jensen's inequality
  - E[max_a Q(s, a)] ≥ max_a E[Q(s, a)]
  - E[Q(s, a)] approximates the true Q-value in the limit of infinitely many samples.
  - By Jensen's inequality, applying the max operator to Q-values that have not been fully updated can lead to overestimation.
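The inequality can be checked numerically. In this sketch (the array shapes and names are my own), every Q estimate is unbiased noise around a true value of 0, yet the max over actions is systematically positive:

```python
import numpy as np

# Even if each Q estimate is unbiased (true value 0 plus zero-mean noise),
# taking the max over actions yields a positively biased value estimate.
rng = np.random.default_rng(0)
n_states, n_actions = 10_000, 5
true_q = np.zeros((n_states, n_actions))           # true Q-values are all 0
noisy_q = true_q + rng.normal(size=true_q.shape)   # unbiased, noisy estimates

max_true = true_q.max(axis=1).mean()    # max over true values: exactly 0
max_noisy = noisy_q.max(axis=1).mean()  # max over noisy estimates: clearly > 0

print(f"mean max_a Q_true  = {max_true:.3f}")
print(f"mean max_a Q_noisy = {max_noisy:.3f}  (overestimation bias)")
```

The noise itself is fair; it is the max operator that converts it into a one-sided bias.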
3. Prioritized Replay
- Online RL
  - Consecutive transitions are temporally correlated.
  - Even if a rare experience has high value, it may be discarded after a single update.
  - DQN addresses these issues by using a replay buffer.
- Replay Buffer
  - All samples, whether important or not, have an equal probability of being selected.
  - Therefore, more important samples should be assigned higher sampling probabilities by applying weights.
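The uniform-sampling behavior described above can be sketched with a bounded deque; the class and method names here are illustrative, not from any particular library:

```python
import random
from collections import deque

# Minimal uniform replay buffer sketch: every stored transition has the
# same chance of being sampled, regardless of how informative it is.
class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement: no notion of importance.
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer()
for t in range(100):
    buf.push((t, "state", "action", "reward"))
batch = buf.sample(4)
```

Prioritized replay replaces only the `sample` step: instead of uniform probabilities, each transition gets a probability proportional to its priority.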
- Importance of Samples
  - The importance of each sample is evaluated based on the magnitude of its TD error.
4. Prioritizing with TD error
- Model-based
  - Value iteration
    - Prioritize updates for states with large value changes.
    - Important updates are then immediately reflected in the value estimates of other states.
    - Particularly effective in asynchronous methods.
- Model-free
  - Transitions corresponding to failures occur far more frequently than those corresponding to successes.
  - When a particular attempt leads to a successful outcome, the value difference (TD error) becomes significantly large.
- Calculating the priority from the TD error
  - p_i = |δ_i| + ε, where δ_i is the TD error of transition i and ε > 0 ensures transitions with zero error can still be sampled.
- Problems
  - Updating priorities for all transitions in the replay buffer is inefficient, so priorities are updated only for the transitions sampled in a minibatch.
    - Among the initially sampled transitions, those with large TD errors are likely to be selected repeatedly, while others may be ignored.
    - This can lead to overfitting due to reduced sample diversity.
  - Since the priorities keep changing, the sampling distribution over transitions also changes over time, introducing bias.
- Addressing Sample Diversity Issues
  - Use stochastic sampling for prioritization rather than greedy selection.
  - Prioritization probability
    - P(i) = p_i^α / Σ_k p_k^α
    - As α → 1, the probability of being selected increasingly depends on the TD error.
    - When α = 0, prioritization is completely ignored (uniform sampling).
  - α: a hyperparameter that controls how much prioritization is applied.
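The two limiting cases of α can be verified directly. A minimal sketch of the prioritization probability (the function name, ε value, and example TD errors are my own):

```python
import numpy as np

# Stochastic prioritized sampling:
#   P(i) = p_i^alpha / sum_k p_k^alpha, with priority p_i = |TD error_i| + eps.
def sampling_probs(td_errors, alpha=0.6, eps=1e-5):
    p = np.abs(td_errors) + eps   # eps keeps zero-error transitions samplable
    scaled = p ** alpha           # alpha=0 -> uniform, alpha=1 -> full prioritization
    return scaled / scaled.sum()

td_errors = np.array([0.1, 2.0, 0.0, 0.5])
probs = sampling_probs(td_errors, alpha=0.6)
print(probs)  # the transition with |TD error| = 2.0 gets the highest probability

# Transitions are then drawn according to these probabilities:
idx = np.random.default_rng(0).choice(len(td_errors), size=2, p=probs)
```

Because sampling is stochastic, even the zero-error transition retains a small but nonzero chance of being selected, which preserves sample diversity.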
- Addressing Sampling Distribution Issues
  - Use importance sampling weights.
  - Importance sampling weights
    - w_i = (1/N · 1/P(i))^β
    - β = 1: full correction of the prioritization bias.
    - For stability, weights are normalized by multiplying by 1/max_i w_i.
  - This reduces the influence of frequently selected samples during updates.
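The weight formula and its max-normalization can be sketched as follows (the function name and example probabilities are my own):

```python
import numpy as np

# Importance-sampling correction for prioritized replay:
#   w_i = (1/N * 1/P(i))**beta, then normalized by max_i w_i for stability.
def is_weights(probs, beta=0.4):
    n = len(probs)
    w = (1.0 / (n * probs)) ** beta   # beta=1 fully corrects the sampling bias
    return w / w.max()                # keeps every weight in (0, 1]

probs = np.array([0.1, 0.6, 0.05, 0.25])
w = is_weights(probs, beta=1.0)
print(w)  # frequently sampled transitions (high P(i)) get the smallest weights
```

With β = 1, the rarest transition (P(i) = 0.05) gets the maximum weight 1.0, while the most frequently sampled one is scaled down the most, exactly the down-weighting described above.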
