
Double DQN

Created
2024/07/22 02:59
Last edited
2025/11/11 05:42
Tags
Reinforcement Learning
Author

1. Double DQN

• Target in Q-learning
    ◦ $r + \gamma \max_{a} Q(s', a)$
        ▪ Since $Q(s', a)$ is an estimated value, a high value does not necessarily indicate the best action; it may simply be high by chance.
• Addressing Q-value overestimation bias
    ◦ In the target calculation, action selection and Q-value evaluation of the selected action are performed by separate networks.
• Loss function
    ◦ $L(\theta) = \left[\, r_{t+1} + \gamma \hat{Q}\left(s_{t+1}, \operatorname{argmax}_a Q(s_{t+1}, a; \theta);\, \hat{\theta}\right) - Q(s_t, a_t; \theta) \,\right]^2$
    ◦ Action selection
        ▪ The network parameterized by $\theta$
    ◦ Q-value evaluation
        ▪ The network parameterized by $\hat{\theta}$
    ◦ Since it is unlikely that the same action simultaneously has the highest Q-value in both networks, the overestimation problem is alleviated.
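The target computation above can be sketched in plain NumPy, with two hypothetical Q-tables standing in for the online network ($\theta$) and the target network ($\hat{\theta}$):

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 3
# Hypothetical Q-tables standing in for the online network (theta)
# and the target network (theta_hat).
q_online = rng.normal(size=(n_states, n_actions))
q_target = rng.normal(size=(n_states, n_actions))

def double_dqn_target(r, s_next, gamma=0.99):
    """Double DQN target: select the action with the online network,
    evaluate it with the target network."""
    a_star = np.argmax(q_online[s_next])          # argmax_a Q(s', a; theta)
    return r + gamma * q_target[s_next, a_star]   # Q_hat(s', a*; theta_hat)

def dqn_target(r, s_next, gamma=0.99):
    """Vanilla DQN target: max over the target network's own estimates."""
    return r + gamma * np.max(q_target[s_next])

# The Double DQN target never exceeds the vanilla DQN target, since
# Q_hat(s', argmax_a Q(s', a)) <= max_a Q_hat(s', a).
for s in range(n_states):
    assert double_dqn_target(1.0, s) <= dqn_target(1.0, s) + 1e-12
```

The decoupling is visible in the code: the `argmax` runs over `q_online`, while the value plugged into the target comes from `q_target`.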

2. Overestimation

• Jensen's inequality
    ◦ $E\left[\max_{a} Q(s', a)\right] \geq \max_{a} E\left[Q(s', a)\right]$
• $E[Q(s', a)]$ approaches the true Q-value in the limit of infinitely many samples.
• By this inequality, applying the max operator to Q-values that have not yet been fully updated (i.e., that are still noisy) leads to overestimation.
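The effect is easy to demonstrate numerically. In the toy setup below (an assumption for illustration: every true Q-value is exactly 0, and each estimate is the truth plus zero-mean noise, like an under-trained network), $E[\max_a Q]$ comes out clearly positive while $\max_a E[Q]$ stays near the true value 0:

```python
import numpy as np

rng = np.random.default_rng(42)

# True Q-value of every action is 0, so max_a E[Q(s', a)] = 0.
# Each "learned" estimate is the true value plus zero-mean noise.
n_actions, n_trials = 10, 10_000
noisy_q = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))

e_max = noisy_q.max(axis=1).mean()   # E[max_a Q(s', a)]  -> clearly positive
max_e = noisy_q.mean(axis=0).max()   # max_a E[Q(s', a)]  -> close to 0

assert e_max > max_e                 # Jensen's inequality, empirically
```

The gap (`e_max` is on the order of 1.5 here) is exactly the overestimation bias that the max operator introduces when it is applied to noisy estimates.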

3. Prioritized Replay

• Online RL
    ◦ Causes temporal correlation between consecutive transitions.
    ◦ Even if a rare experience has high value, it is used once and then discarded.
    ◦ DQN addresses these issues by using a replay buffer.
• Replay Buffer
    ◦ All samples, whether important or not, have an equal probability of being selected.
    ◦ Therefore, more important samples should be assigned higher sampling probabilities via weights.
• Importance of Samples
    ◦ The importance of each sample is evaluated by the magnitude of its TD error.
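A minimal uniform replay buffer of the kind plain DQN uses might look like the sketch below; note that `sample` draws every stored transition with equal probability, which is exactly the limitation the following section addresses:

```python
import random
from collections import deque

class ReplayBuffer:
    """Plain DQN replay buffer: uniform sampling breaks temporal
    correlation, but every transition -- important or not -- has the
    same probability of being drawn."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

# Tiny usage example with dummy transitions
buf = ReplayBuffer()
for t in range(100):
    buf.add(t, t % 4, 0.0, t + 1, False)
batch = buf.sample(8)
assert len(batch) == 8
```

The `deque(maxlen=...)` gives the FIFO eviction behavior a bounded buffer needs; a rare but valuable transition stored early will eventually be pushed out regardless of its importance.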

4. Prioritizing with TD error

• Model-based
    ◦ Value iteration
        ▪ Prioritize updates for states with large value changes.
        ▪ Important updates are immediately propagated to the value estimates of other states.
        ▪ Particularly effective in asynchronous methods.
• Model-free
    ◦ Transitions corresponding to failures occur far more frequently than those corresponding to successes.
    ◦ When a particular attempt leads to a successful outcome, the value difference (TD error) becomes significantly large.
• Calculate weight based on TD error
    ◦ $p_i = \left| r_{i+1} + \gamma \max_a \hat{Q}(s_{i+1}, a; \hat{\theta}) - Q(s_i, a_i; \theta) \right| + \epsilon$
    ◦ $P(i) = \dfrac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$ (probabilization)
• Problems
    ◦ Updating priorities for all transitions in the replay buffer is inefficient, so priorities are updated only for the transitions sampled in each minibatch.
        ▪ Among the initially sampled transitions, those with large TD errors are likely to be selected repeatedly, while others may never be revisited.
        ▪ This reduces sample diversity and can lead to overfitting.
    ◦ Since the priorities keep changing, the sampling distribution of transitions also changes over time, introducing bias.
• Addressing Sample Diversity Issues
    ◦ Use stochastic prioritized sampling.
    ◦ Prioritization probability
        ▪ $P(i) = \dfrac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$
        ▪ As $\alpha \to 1$, the probability of being selected depends more strongly on the TD error.
        ▪ When $\alpha = 0$, prioritization is completely ignored (uniform sampling).
    ◦ $\alpha$: a hyperparameter that controls how much prioritization is applied.
• Addressing Sampling Distribution Issues
    ◦ Use importance sampling weights.
    ◦ Importance sampling weights
        ▪ $w_i = \left( \dfrac{1}{N} \cdot \dfrac{1}{P(i)} \right)^{\beta}$
        ▪ $\beta = 1$: full correction of the sampling bias.
        ▪ For stability, the weights are normalized by multiplying by $\dfrac{1}{\max_k w_k}$.
    ◦ This reduces the influence of samples that are frequently selected during updates.
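Putting the formulas together, the priority $p_i$, sampling probability $P(i)$, and normalized importance weights $w_i$ can be sketched in NumPy; the values of $\alpha$, $\beta$, and $\epsilon$ below are illustrative, not prescribed by the note:

```python
import numpy as np

rng = np.random.default_rng(0)

def priorities(td_errors, eps=1e-2):
    # p_i = |TD error| + eps  (eps keeps zero-error transitions samplable)
    return np.abs(td_errors) + eps

def sampling_probs(p, alpha=0.6):
    # P(i) = p_i^alpha / sum_k p_k^alpha
    scaled = p ** alpha
    return scaled / scaled.sum()

def is_weights(probs, idx, beta=0.4):
    # w_i = (1/N * 1/P(i))^beta, normalized by max_k w_k for stability
    n = len(probs)
    w = (1.0 / (n * probs[idx])) ** beta
    return w / w.max()

# Dummy TD errors standing in for a buffer of 1000 transitions
td_errors = rng.normal(size=1000)
p = priorities(td_errors)
probs = sampling_probs(p)
idx = rng.choice(len(probs), size=32, p=probs)   # stochastic prioritized draw
w = is_weights(probs, idx)

assert np.isclose(probs.sum(), 1.0)
assert w.max() == 1.0 and (w > 0).all()
# alpha = 0 ignores priorities entirely: sampling becomes uniform
assert np.allclose(sampling_probs(p, alpha=0.0), 1.0 / len(p))
```

In a training loop, `w` would multiply the squared TD errors of the sampled minibatch, down-weighting transitions that the prioritized distribution over-samples.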

5. Double DQN with prioritized replay pseudo code
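A sketch assembled from the pieces above (Double DQN target, prioritized sampling, importance weights); the update period C and the epsilon-greedy schedule are assumed hyperparameters:

```
initialize online net Q(.; theta) and target net Q_hat(.; theta_hat), theta_hat <- theta
initialize prioritized replay buffer D (capacity N, exponents alpha, beta)
for each environment step t:
    select a_t epsilon-greedily from Q(s_t, .; theta)
    execute a_t; observe r_{t+1}, s_{t+1}
    store (s_t, a_t, r_{t+1}, s_{t+1}) in D with the current maximal priority
    sample a minibatch {i} with probability P(i) = p_i^alpha / sum_k p_k^alpha
    w_i <- (1/N * 1/P(i))^beta;  w_i <- w_i / max_k w_k     # IS weights
    a* <- argmax_a Q(s_{i+1}, a; theta)                     # select: online net
    y_i <- r_{i+1} + gamma * Q_hat(s_{i+1}, a*; theta_hat)  # evaluate: target net
    delta_i <- y_i - Q(s_i, a_i; theta)                     # TD error
    p_i <- |delta_i| + eps                                  # refresh sampled priorities
    take a gradient step on sum_i w_i * delta_i^2 with respect to theta
    every C steps: theta_hat <- theta
```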