🦾

DQN

Created: 2024/07/22 02:58
Last edited: 2025/11/11 02:18
Tags: Reinforcement Learning
Author:

1. Introduction

β€’
Introduced a CNN architecture to handle high-dimensional image inputs
β—¦
Enables processing of high-dimensional and continuous state spaces
β€’
Represents Q-values through a Q-network
β—¦
Since Q-values must be output for all actions, the action space must be discrete, finite, and of much lower dimensionality compared to the state space

2. DQN Architecture

β€’
Input
β—¦
To satisfy the Markov property, four consecutive frames are stacked as the input.
β—¦
A single frame does not contain sufficient information to predict the future in terms of velocity and direction.
β—¦
Since the Q-value is computed for a given state input, there are few constraints on the dimensionality of the state space.
β€’
Convolution Layer
β—¦
Applies convolution operations to the input images to extract feature vectors and reduce dimensionality.
β—¦
The last feature maps are flattened so that they can be processed by fully connected layers.
β—¦
Because spatial information is crucial in games, max poolingβ€”which tends to blur positional detailsβ€”is not applied.
β€’
Output
β—¦
Produces Q-values for each possible action.
β—¦
The network parameters are updated to ensure that the predicted Q-values approximate the optimal Q-values.
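The frame-stacking input described above can be sketched in plain Python. This is a minimal illustration, not the paper's preprocessing pipeline: the `FrameStack` class name is hypothetical, and a "frame" here is any object (in practice it would be a preprocessed image array).

```python
from collections import deque

class FrameStack:
    """Keeps the most recent k frames; the stack of k frames is the state."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)  # oldest frame is evicted automatically

    def reset(self, frame):
        # At episode start there is no history, so repeat the first frame k times.
        for _ in range(self.k):
            self.frames.append(frame)
        return list(self.frames)

    def step(self, frame):
        # Append the newest frame; deque(maxlen=k) drops the oldest one.
        self.frames.append(frame)
        return list(self.frames)
```

Stacking k = 4 frames lets the network infer velocity and direction from differences between consecutive frames, which a single frame cannot provide.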

3. Naive DQN

β€’
Q-learning
Q(St,At)←Q(St,At)+Ξ±[Rt+1+Ξ³max⁑aQ(St+1,a)βˆ’Q(St,At)]{\small Q(S_t, A_t) ← Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \displaystyle\max_a Q(S_{t+1} ,a ) - Q(S_t, A_t)]}
β—¦
Target policy: A greedy policy w.r.t QQ
β—¦
Behavior policy: An Ο΅βˆ’greedy\epsilon - greedy w.r.t QQ
β—¦
The learning process updates the network in a way that minimizes the difference between the Q-value of the current state and its target value.
β€’
Naive DQN
L(ΞΈ)=[rt+1+Ξ³max⁑aQ(st+1,a;ΞΈ)βˆ’Q(st,at;ΞΈ)]2L(\theta) = [r_{t+1} + \gamma \displaystyle \max_a Q(s_{t+1}, a ; \theta) - Q(s_t, a_t ; \theta)]^2
β—¦
Define the loss function as the Mean Squared Error
β—¦
The network is updated by minimizing it
β€’
Problems
β—¦
Temporal Correlations
β–ͺ
In online RL, parameter updates are executed based on transitions collected over time.
β–ͺ
Strong correlations exist between consecutive transitions, which can lead the network to overfit to specific situations.
β–ͺ
As a result, the model may fail to sufficiently learn from rare but important experiences.
β—¦
Non-stationary target
β–ͺ
Since the same Q-network is used both to compute the target and to update the Q-values, the target function changes frequently, making convergence difficult.
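The Q-learning update above can be written out for the tabular case. This is an illustrative sketch, not the DQN network update: `Q` is a plain dictionary standing in for the Q-function, and the function names are hypothetical.

```python
import random
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Behavior policy: random action with probability epsilon, else greedy."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

The target policy is implicit in the `max` inside `q_learning_update` (greedy), while `epsilon_greedy` generates behavior, which is what makes Q-learning off-policy.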

4. Replay buffer

β€’
Addressing the Temporal Correlation problem
β€’
Store transitions (s,a,r,sβ€²){(s, a, r, s')} in replay buffer and randomly sample minibatches from it for training
β€’
This reduces temporal correlations between transitions, as each batch can be composed of experiences from different experience and time steps.
β€’
Improve training efficiency
β€’
Past experiences can be reused
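A minimal replay buffer sketch, assuming a fixed capacity with FIFO eviction as described in the algorithm below (the `ReplayBuffer` class name is illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of transitions; oldest entries are evicted FIFO."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlations between
        # consecutive transitions and reuses past experience.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```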

5. Target Network

β€’
Initialize the target Q-network parameters ΞΈ^\hat{\theta} to be identical to the behavior Q-network parameters ΞΈ\theta at the start of training.
β€’
During training, ΞΈ^\hat{\theta} fixed while continuously updating ΞΈ\theta.
β€’
After a certain number of steps, update ΞΈ^\hat{\theta} by copying ΞΈ\theta.
β€’
This approach stabilizes training by addressing the problem of a continuously changing target function.
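The hard-update scheme above can be sketched with parameters represented as a plain dict (a stand-in for network weights; `sync_target` is an illustrative name):

```python
def sync_target(theta, theta_hat):
    """Hard update: copy the behavior parameters into the target parameters."""
    theta_hat.clear()
    theta_hat.update(theta)

theta = {"w": 0.5}
theta_hat = dict(theta)   # initialized identical to the behavior network

theta["w"] = 0.9          # behavior network keeps training...
# theta_hat still holds 0.5, so the target values stay fixed between syncs

sync_target(theta, theta_hat)  # after a fixed number of steps, copy theta
```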

6. DQN

1. From the current state $s_t$, select an action $a_t$ using the $\epsilon$-greedy policy based on the behavior network parameters $\theta$.
2. Store the transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in the replay buffer.
3. The replay buffer holds the most recent $N$ transitions; when a new transition is added, the oldest one is removed.
4. Randomly sample a minibatch of transitions from the replay buffer.
5. For the sampled transitions, compute the target values using the target network parameters $\hat{\theta}$.
6. Compute the loss function $L(\theta)$ and its gradient.

  • $L(\theta) = \frac{1}{B} \sum_{i=1}^{B} \left[ r_{i+1} + \gamma \max_{a} \hat{Q}(s_{i+1}, a; \hat{\theta}) - Q(s_i, a_i; \theta) \right]^2$
  • $\nabla_{\theta} L(\theta) = -\frac{1}{B} \sum_{i=1}^{B} \left[ r_{i+1} + \gamma \max_{a} \hat{Q}(s_{i+1}, a; \hat{\theta}) - Q(s_i, a_i; \theta) \right] \nabla_{\theta} Q(s_i, a_i; \theta)$

7. Update the behavior network parameters $\theta$ by gradient descent on $L(\theta)$.
8. After a fixed number of steps, update the target network parameters $\hat{\theta} \leftarrow \theta$.
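Step 6 can be illustrated for a toy case where both networks are plain dicts mapping (state, action) to a value (a stand-in for $Q$ and $\hat{Q}$; the `dqn_loss` name is hypothetical):

```python
def dqn_loss(batch, q, q_hat, actions, gamma=0.99):
    """Mean squared TD error over a minibatch, with targets from q_hat."""
    total = 0.0
    for s, a, r, s_next in batch:
        # Target uses the *target* network q_hat, not the behavior network q.
        target = r + gamma * max(q_hat[(s_next, a2)] for a2 in actions)
        total += (target - q[(s, a)]) ** 2
    return total / len(batch)
```

Note that the target is treated as a constant during differentiation: gradients flow only through $Q(s_i, a_i; \theta)$, which is why $\hat{\theta}$ can stay fixed between syncs.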

7. Multi-step Learning

Multi-step learning refers to training that uses the n-step return when computing the target values. By choosing an appropriate number of steps, the learning speed can be improved. The loss function used in multi-step learning is defined as follows.
β€’
β€…β€Šrt+1(n)=βˆ‘k=0nβˆ’1Ξ³krt+k+1 \; r^{(n)}_{t+1} = \sum_{k=0}^{n-1} \gamma^k r_{t+k+1}
β€’
L(ΞΈ)=[rt+1(n)+Ξ³nmax⁑aQ^(st+n,a;ΞΈ^)βˆ’Q(st,at;ΞΈ)]2{\small L(\theta) = \Big[ r^{(n)}_{t+1} + \gamma^n \max_a \hat{Q}(s_{t+n}, a; \hat{\theta})-Q(s_t, a_t; \theta) \Big]^2}
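The n-step return above is straightforward to compute; a small sketch (the function name and `bootstrap_value` parameter, which carries the $\gamma^n \max_a \hat{Q}$ term, are illustrative):

```python
def n_step_return(rewards, gamma, n, bootstrap_value=0.0):
    """Discounted sum of the next n rewards, plus gamma^n * bootstrap value.

    rewards[k] corresponds to r_{t+k+1} in the formula above.
    """
    g = sum(gamma**k * rewards[k] for k in range(n))
    return g + gamma**n * bootstrap_value
```

With n = 1 and `bootstrap_value` set to $\max_a \hat{Q}(s_{t+1}, a; \hat{\theta})$, this reduces to the standard one-step DQN target.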