
Dueling DQN

Created
2024/07/22 03:00
Last edited
2025/11/11 06:46
Tags
Reinforcement Learning
Author

1. Introduction

β€’
Compute the advantage function and the state-value function separately, then combine them to obtain the action-value function.
β€’
The CNN encoder is shared, while only the parameters of the fully connected layers differ for each function.
β€’
Useful for estimating the state value even in states where the choice of action has little effect on the environment.
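The head structure described above can be sketched numerically. This is a minimal sketch with assumed shapes: a plain linear map stands in for the shared CNN encoder, the layer sizes are arbitrary, and the mean-subtraction aggregator (introduced later in this note) is used to combine the two heads:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared encoder (stand-in for the CNN): state -> feature vector
W_enc = rng.normal(size=(16, 4))

# Separate fully connected heads for V(s) and A(s, a)
w_v = rng.normal(size=16)        # value head: features -> scalar
W_a = rng.normal(size=(3, 16))   # advantage head: features -> one value per action

def dueling_q(state):
    phi = np.tanh(W_enc @ state)   # shared features
    v = w_v @ phi                  # state value V(s)
    a = W_a @ phi                  # advantages A(s, a)
    return v + (a - a.mean())      # Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a'))

q = dueling_q(rng.normal(size=4))
print(q.shape)  # one Q-value per action: (3,)
```

Only the two small heads differ per function; everything before `phi` is shared, which is the parameter-sharing point made above.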

2. Value function

β€’
When actions have little influence on the environment in a given state, learning V(s) becomes more important for accurately evaluating the state value.
β€’
First case (the car is at a distant position)
β—¦
The importance of action selection decreases.
β—¦
The car in front is taken into account only when computing the state value.
β€’
Second case (the car is at a close position)
β—¦
The choice of action becomes more important.
β—¦
Each nearby car is also taken into account when computing the advantage.

3. Identifiability issue

β€’
The relationship among the action-value function, the advantage function, and the state-value function:
{\small \quad Q^{\pi}(s, a) = V^{\pi}(s) + A^{\pi}(s, a)}
{\small \quad \mathbb{E}_{a \sim \pi(s)} [ Q^{\pi}(s, a) ] = V^{\pi}(s) \quad \Rightarrow \quad \mathbb{E}_{a \sim \pi(s)} [ A^{\pi}(s, a) ] = 0}
{\small \quad a^* = \arg\max_{a'} Q(s, a')}
{\small \therefore \quad Q^*(s, a^*) = V^*(s) \quad \Rightarrow \quad A^*(s, a^*) = 0}
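The expectation identity above can be checked with made-up numbers (the Q-values and policy below are arbitrary):

```python
import numpy as np

q = np.array([1.0, 3.0, -2.0])   # Q^pi(s, a) for three actions
pi = np.array([0.5, 0.3, 0.2])   # pi(a | s), sums to 1

v = pi @ q                       # V^pi(s) = E_{a~pi}[Q^pi(s, a)]
adv = q - v                      # A^pi(s, a) = Q^pi(s, a) - V^pi(s)

print(pi @ adv)                  # E_{a~pi}[A^pi(s, a)] is ~0 (up to float rounding)
```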
β€’
In Dueling DQN, separate network streams are used to compute the state-value function and the advantage function.
β—¦
The decomposition of a single Q-value is not uniquely defined.
β–ͺ
{\small (V(s)+\text{const}) + (A(s,a)-\text{const})} produces the same Q(s,a).
β–ͺ
Without an additional constraint, V(s ; \theta, \beta) and A(s, a ; \theta, \alpha) cannot be regarded as good estimators of the true state value and advantage.
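The non-uniqueness can be seen with a small numeric example (values are arbitrary): shifting any constant from V into A leaves Q unchanged, so Q alone cannot identify the two terms.

```python
import numpy as np

v = 2.0
a = np.array([0.5, -0.5, 0.0])

q1 = v + a                   # one decomposition of Q
c = 10.0                     # arbitrary constant shift
q2 = (v + c) + (a - c)       # shifted decomposition

print(np.allclose(q1, q2))   # True: both decompositions give the same Q
```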
β€’
Approach 1) Subtract the maximum advantage (the advantage of the optimal action).
β—¦
For the optimal action, the following property holds.
{\small \quad Q^*(s, a^*) = V^*(s) \quad \Rightarrow \quad A^*(s, a^*) = 0}
β—¦
{\small Q(s, a \mid \theta, \alpha, \beta) = V(s \mid \theta, \beta) + \left[ A(s, a \mid \theta, \alpha) - \max_{a'} A(s, a' \mid \theta, \alpha) \right]}
{\small \Rightarrow \quad a^* = \arg\max_{a'} Q(s, a' \mid \theta, \alpha, \beta) \quad \text{then} \quad \max_{a'} A(s, a' \mid \theta, \alpha) = A(s, a^* \mid \theta, \alpha)}
β—¦
Since the max operator is highly sensitive to changes in the advantage estimates, it makes training less stable.
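A quick numeric sketch of Approach 1 (the head outputs below are made up): after subtracting the maximum advantage, the greedy action's Q-value equals V(s) exactly, matching the identity above.

```python
import numpy as np

v = 1.5                         # value head output V(s)
a = np.array([0.2, 1.0, -0.3])  # advantage head outputs A(s, a)

q = v + (a - a.max())           # max-subtraction aggregator
a_star = q.argmax()             # greedy action

print(q[a_star])                # equals V(s), since the shifted A(s, a*) is 0
```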
β€’
Approach 2) Subtract the mean advantage.
β—¦
{\small Q(s, a \mid \theta, \alpha, \beta) = V(s \mid \theta, \beta) + \left[ A(s, a \mid \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a' \mid \theta, \alpha) \right]}
β—¦
For the optimal action, Q now deviates from V(s) by the difference between the max and the mean of the advantages, but this offset is the same for every action.
β–ͺ
The objective is to find the action that maximizes Q.
β–ͺ
This is equivalent to maximizing the term composed of the advantage function.
β–ͺ
Whether the max or the mean is subtracted does not affect policy determination.
β–ͺ
This is because the state-value function does not vary with the action.
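The policy-invariance argument above can be verified numerically (head outputs are made up): both aggregators subtract a per-state constant, so all Q-values for a state shift equally and the greedy action is identical; under mean-subtraction, Q(s, a*) exceeds V(s) by exactly max A minus mean A.

```python
import numpy as np

v = 1.5                          # value head output V(s)
a = np.array([0.2, 1.0, -0.3])   # advantage head outputs A(s, a)

q_max = v + (a - a.max())        # Approach 1: subtract the max advantage
q_mean = v + (a - a.mean())      # Approach 2: subtract the mean advantage

# A per-state constant shift cannot change the argmax over actions.
print(q_max.argmax() == q_mean.argmax())  # True

# Under mean-subtraction, Q(s, a*) is offset from V(s) by max - mean.
print(q_mean.max() - v)          # equals a.max() - a.mean()
```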