
Actor-Critic

์ƒ์„ฑ์ผ
2024/08/05 02:59
์ตœ์ข… ์ˆ˜์ •์ผ
2025/11/10 09:34
ํƒœ๊ทธ
๊ฐ•ํ™”ํ•™์Šต
์ž‘์„ฑ์ž

1. Introduction

• DQN: Approximates the value function with a neural network.
• REINFORCE: Approximates the policy with a neural network.
• Actor-Critic: Uses two networks: one for the policy (actor) and one for the value function (critic).

2. Actor-Critic

• REINFORCE
  ◦ The policy can only be updated after a full episode has finished.
  ◦ Since the gradient is proportional to the return, it exhibits high variance.
• Actor-Critic
  ◦ By using an estimator instead of $G_t$, the update can be performed without waiting for the episode to finish.
    ▪ This helps alleviate the high-variance problem.
  ◦ Therefore, both the critic network parameterized by $\phi$ and the actor network parameterized by $\theta$ are updated.
• Gradient (AC)
$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} \mathbb{E}_{s_t \sim p_\theta(s_t),\, a_t \sim \pi_\theta(a_t \mid s_t)} \left[ \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( r(s_t, a_t) + \gamma V_{\pi_\theta}(s_{t+1}) - V_{\pi_\theta}(s_t) \right) \right]$$
• Critic
  ◦ Evaluates the value of the action selected by the actor.
  ◦ Updated in the direction that improves the accuracy of the value estimate,
    ▪ i.e., it minimizes the MSE between the target and the estimated value.
$$\text{MC target:} \quad \phi \leftarrow \phi + \beta \left( G_t - V_\phi(s_t) \right) \nabla_\phi V_\phi(s_t)$$
$$\text{TD target:} \quad \phi \leftarrow \phi + \beta \left( r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) \right) \nabla_\phi V_\phi(s_t)$$
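The TD-target critic update can be sketched with a linear value approximator, $V_\phi(s) = \phi^\top s$. This is a minimal illustration; the feature vectors, step size $\beta$, and function names are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def critic_td_update(phi, s_feat, r, s_next_feat, gamma=0.99, beta=0.1, done=False):
    """One semi-gradient TD(0) update for a linear critic V_phi(s) = phi @ s_feat.

    The TD error r + gamma * V(s') - V(s) is the target-minus-estimate term,
    and grad_phi V_phi(s) = s_feat for a linear model.
    """
    v_s = phi @ s_feat
    v_next = 0.0 if done else phi @ s_next_feat
    td_error = r + gamma * v_next - v_s
    return phi + beta * td_error * s_feat, td_error

# toy check: with a zero-initialized critic, the first TD error equals the reward
phi = np.zeros(3)
phi, delta = critic_td_update(phi, np.array([1.0, 0.0, 0.0]), r=1.0,
                              s_next_feat=np.array([0.0, 1.0, 0.0]))
```

The same function covers both targets: passing the full return $G_t$ in place of the bootstrapped target recovers the MC update.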
• Actor
  ◦ Selects an action.
  ◦ Reflects the critic’s evaluation during its update.
  ◦ The actor’s objective function is defined with respect to the return, and it is maximized during the update.
$$\text{MC-PG:} \quad \theta \leftarrow \theta + \alpha \left( G_t - V_\phi(s_t) \right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
$$\text{TD-PG:} \quad \theta \leftarrow \theta + \alpha \left( r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) \right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
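A matching sketch of the TD-PG actor update, using a linear softmax policy so that $\nabla_\theta \log \pi_\theta(a \mid s)$ has a closed form. The parameter shapes and step size are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def actor_td_pg_update(theta, s_feat, a, td_error, alpha=0.01):
    """One TD-PG step for a linear softmax policy pi_theta(a|s).

    theta has shape (n_actions, n_features); logits = theta @ s_feat.
    For this model, grad_theta log pi(a|s) = (onehot(a) - pi) outer s_feat.
    The critic's TD error serves as the advantage estimate.
    """
    probs = softmax(theta @ s_feat)
    grad_log_pi = -np.outer(probs, s_feat)
    grad_log_pi[a] += s_feat
    return theta + alpha * td_error * grad_log_pi

# a positive TD error should raise the probability of the taken action
s = np.array([1.0, 0.0, 0.0])
theta1 = actor_td_pg_update(np.zeros((2, 3)), s, a=0, td_error=1.0)
```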
• Pseudo Code
  1. Select an action in the given state based on the current policy parameters.
  2. Apply the selected action and observe the reward and next state.
  3. Using the reward and next state obtained in step 2, compute the TD error and update the critic network by gradient descent, taking the MSE of the TD error as the critic’s objective function.
  4. Using the TD error as the advantage estimate, update the actor network by gradient ascent.
$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} \mathbb{E}_{s_t \sim p_\theta(s_t),\, a_t \sim \pi_\theta(a_t \mid s_t)} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( r(s_t, a_t) + \gamma V_{\pi_\theta}(s_{t+1}) - V_{\pi_\theta}(s_t) \right) \right]$$
• This pseudocode proceeds in a 1-step manner: each update is approximated by the sample mean of the data obtained at each step.
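The 1-step actor-critic loop can be sketched end to end on a toy chain MDP. Everything here is an illustrative assumption: tabular parameters stand in for the networks, and the environment, step sizes, and episode budget are made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state chain: action 1 advances (reward 1 at the goal), action 0 stays (reward 0).
N_S, N_A, GAMMA, ALPHA, BETA = 2, 2, 0.9, 0.1, 0.1
theta = np.zeros((N_S, N_A))   # tabular softmax policy logits (actor)
v = np.zeros(N_S)              # tabular value estimates (critic)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(s, a):
    """Returns (next_state, reward, done)."""
    if a == 1:
        return (s + 1, 1.0, True) if s == N_S - 1 else (s + 1, 0.0, False)
    return s, 0.0, False

for _ in range(500):
    s, done, t = 0, False, 0
    while not done and t < 20:
        a = rng.choice(N_A, p=softmax(theta[s]))      # 1. select action
        s_next, r, done = step(s, a)                  # 2. observe reward, next state
        # 3. 1-step TD error, which doubles as the advantage estimate
        td = r + (0.0 if done else GAMMA * v[s_next]) - v[s]
        v[s] += BETA * td                             # critic: gradient descent on MSE
        grad_log_pi = -softmax(theta[s])
        grad_log_pi[a] += 1.0
        theta[s] += ALPHA * td * grad_log_pi          # 4. actor: gradient ascent
        s, t = s_next, t + 1
```

After training, the policy should prefer the advancing action in both states, and the critic's value for the pre-goal state should approach the goal reward.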

3. A3C

• A3C is an actor-critic algorithm that utilizes multiple networks.
• It consists of a global network and multiple worker agents.
  ◦ Each worker agent learns independently from its own copy of the environment and asynchronously updates the global network.
  ◦ Since each agent follows a slightly different policy, exploration is naturally promoted.
• This independence allows the agents to generate diverse experiences at different time steps,
  ◦ thereby reducing temporal correlation in the training data.
• In addition, when constructing the advantage function for the critic, A3C uses the n-step return instead of the Q-function,
  ◦ which increases practicality by allowing the critic to operate with only a single parameterized network.
• Asynchronous
  ◦ Initialize each worker agent’s parameters by copying them from the global network.
  ◦ Run the worker for $t_{\text{max}}$ steps while accumulating gradients.
  ◦ Asynchronously update the global network parameters using the accumulated gradients.
  ◦ Copy the updated global network parameters back to the worker.
  ◦ As a result, the worker’s parameters are refreshed to include updates made by other workers.
• Advantage
  ◦ The various forms of the policy gradient consist of two components: one that indicates the direction in which the parameters should be updated, and one that determines how far they should move in that direction.
  ◦ The advantage function can be used in place of the term that determines the magnitude of the update.
  ◦ Using the advantage function reduces the variance compared to using the return directly.
  ◦ However, since the advantage requires two parameterized functions ($Q$ and $V$), the n-step return is used in place of the Q-function.
  ◦ A larger $n$ yields a higher-variance but lower-bias advantage estimate, whereas a smaller $n$ yields lower variance at the cost of higher bias.
  ◦ n-step Return
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(s_{t+n})$$
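The n-step return can be computed with a short backward recursion; the function name and arguments are illustrative.

```python
def n_step_return(rewards, v_bootstrap, gamma=0.99):
    """G_t^(n) = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n} + gamma^n * V(s_{t+n}).

    rewards: the n rewards [r_{t+1}, ..., r_{t+n}] along the trajectory.
    v_bootstrap: the critic's estimate V(s_{t+n}) used to bootstrap the tail.
    """
    g = v_bootstrap
    for r in reversed(rewards):  # fold in each reward from the back
        g = r + gamma * g
    return g
```

For example, with `gamma=0.5`, rewards `[1.0, 1.0]`, and a bootstrap value of `10.0`, the recursion gives $1 + 0.5\,(1 + 0.5 \cdot 10) = 4.0$.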
• Pseudo Code
  1. Initialize the worker’s network with the parameters of the global network.
  2. Run the worker for $t_{\text{max}}$ steps to obtain a trajectory.
  3. Traverse the time steps in reverse order to compute the n-step returns.
  4. Accumulate the gradients for both the actor and the critic using the computed n-step returns.
  5. Apply the accumulated gradients to update the global network asynchronously, regardless of whether other workers have finished.
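The reverse traversal in step 3 can be sketched as follows; it produces the n-step return target and advantage for every state visited in the $t_{\text{max}}$-step trajectory. The function name and argument layout are illustrative assumptions.

```python
def compute_targets(rewards, values, bootstrap_value, gamma=0.99):
    """Traverse a trajectory in reverse to build n-step targets and advantages.

    rewards: [r_1, ..., r_T] collected over the rollout.
    values: critic estimates [V(s_0), ..., V(s_{T-1})] for the visited states.
    bootstrap_value: V(s_T) from the critic, or 0.0 if s_T is terminal.
    """
    R = bootstrap_value
    returns, advantages = [], []
    for r, v in zip(reversed(rewards), reversed(values)):
        R = r + gamma * R          # each earlier state gets one more reward term
        returns.append(R)
        advantages.append(R - v)   # advantage estimate for the actor's gradient
    returns.reverse()
    advantages.reverse()
    return returns, advantages

# earlier states accumulate more discounted reward terms than later ones
rets, advs = compute_targets([1.0, 0.0], [0.0, 0.0], 0.0, gamma=0.5)
```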