1. Introduction
• DQN: Approximates the value function using a neural network.
• REINFORCE: Approximates the policy using a neural network.
• Actor-Critic: Utilizes both networks: one for the policy (actor) and one for the value function (critic).
2. Actor-Critic
• REINFORCE
  ◦ The policy can only be updated after a full episode has finished.
  ◦ Since the gradient is proportional to the return, it exhibits high variance.
• Actor-Critic
  ◦ By using a learned estimate of the value function in place of the sampled return, the update can be performed without waiting for the episode to finish.
    ▪ This helps alleviate the high-variance problem.
  ◦ Therefore, both the critic network (parameterized by w) and the actor network (parameterized by θ) are updated.
• Gradient (AC)
• Critic
  ◦ Evaluates the value of the action selected by the actor.
  ◦ Is updated in the direction that improves the accuracy of the value estimate,
    ▪ i.e., so as to minimize the MSE between the target and the estimated value.
• Actor
  ◦ Selects an action.
  ◦ Incorporates the critic's evaluation into its update.
  ◦ The actor's objective function is defined with respect to the return, and it is maximized during the update.
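In symbols, the critic and actor updates described above take the standard TD actor-critic form (the notation δ, w, θ, and the learning rates are assumed here, not taken from the notes):

```latex
\begin{align*}
\delta_t &= r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t) && \text{(TD error)}\\
w &\leftarrow w + \alpha_c \,\delta_t \,\nabla_w V_w(s_t) && \text{(critic: descend the MSE of the TD error)}\\
\theta &\leftarrow \theta + \alpha_a \,\delta_t \,\nabla_\theta \log \pi_\theta(a_t \mid s_t) && \text{(actor: ascend the weighted log-probability)}
\end{align*}
```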
• Pseudo Code
  1. Select an action in the given state based on the current parameters.
  2. Apply the selected action and observe the reward and the next state.
  3. Using the reward and next state obtained in step 2, compute the TD error and update the critic network through gradient descent.
  4. In this step, the MSE of the TD error is used as the objective function for the critic update.
  5. Using the TD error, update the actor network through gradient ascent.
• This pseudocode proceeds in a 1-step manner, where the update is approximated by the sample mean of the data obtained at each step.
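As a concrete illustration, the five steps above can be sketched as a tabular one-step actor-critic on a tiny toy MDP. The environment, tables, and learning rates below are illustrative assumptions, not from the notes:

```python
import math
import random

random.seed(0)

# Toy episodic MDP (illustrative assumption): in state 0, action 1 yields
# reward +1 and terminates; action 0 yields reward 0 and stays in state 0.
N_STATES, N_ACTIONS = 2, 2
GAMMA, ALPHA_ACTOR, ALPHA_CRITIC = 0.9, 0.2, 0.1

theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # actor: action preferences
values = [0.0] * N_STATES                             # critic: state values

def policy(s):
    """Softmax over the preferences theta[s]."""
    z = [math.exp(p) for p in theta[s]]
    total = sum(z)
    return [p / total for p in z]

def sample_action(s):
    probs, r, acc = policy(s), random.random(), 0.0
    for a, p in enumerate(probs):
        acc += p
        if r < acc:
            return a
    return len(probs) - 1

def step(s, a):
    """Environment transition: returns (next_state, reward, done)."""
    if a == 1:
        return 1, 1.0, True
    return 0, 0.0, False

for _ in range(2000):
    s, done, t = 0, False, 0
    while not done and t < 10:
        a = sample_action(s)                       # 1. select an action
        s2, r, done = step(s, a)                   # 2. observe reward, next state
        target = r + (0.0 if done else GAMMA * values[s2])
        delta = target - values[s]                 # 3. TD error
        values[s] += ALPHA_CRITIC * delta          # 4. critic: descend MSE of TD error
        probs = policy(s)                          # 5. actor: ascend log-prob * TD error
        for b in range(N_ACTIONS):
            grad = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += ALPHA_ACTOR * delta * grad
        s, t = s2, t + 1

print(policy(0))  # the actor should strongly prefer action 1 in state 0
```

Note that, exactly as in the pseudocode, both networks are updated from a single transition at every step, without waiting for the episode to end.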
3. A3C
• A3C is an actor-critic algorithm that utilizes multiple networks.
• It consists of a global network and multiple worker agents.
  ◦ Each worker agent learns independently in its own environment and asynchronously updates the global network.
  ◦ Since each agent uses a different policy, exploration is naturally promoted.
• This independence allows each agent to generate diverse experiences at different time steps,
  ◦ thereby reducing temporal correlation.
• In addition, when constructing the advantage function in the critic network, A3C uses the n-step return instead of the Q-function,
  ◦ which increases practicality by allowing the critic to operate with only a single parameterized network.
• Asynchronous
  ◦ Initialize each worker agent's parameters by copying them from the global network.
  ◦ Interact with the environment for a fixed number of steps while computing gradients.
  ◦ Asynchronously update the global network parameters using the accumulated gradients.
  ◦ Copy the updated global network parameters back for further use.
  ◦ As a result, the parameters are reset to include updates made by other workers.
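The copy/accumulate/apply cycle above can be sketched with plain Python threads. Here the "gradients" come from a toy quadratic loss rather than an RL objective, so everything except the asynchronous pattern itself is an illustrative assumption:

```python
import threading
import random

TARGET = [3.0, -1.0]          # illustrative optimum the workers descend toward
global_params = [0.0, 0.0]    # shared "global network" parameters
lock = threading.Lock()       # guards the asynchronous apply step
LR, N_STEPS, N_CYCLES = 0.02, 5, 40

def worker(seed):
    rng = random.Random(seed)
    for _ in range(N_CYCLES):
        local = list(global_params)          # 1. copy global -> worker
        grads = [0.0, 0.0]
        for _ in range(N_STEPS):             # 2. accumulate gradients over a few steps
            for i in range(len(local)):
                # gradient of (x - target)^2, plus noise standing in for sampling
                g = 2.0 * (local[i] - TARGET[i]) + rng.gauss(0.0, 0.01)
                grads[i] += g
                local[i] -= LR * g           # worker's local descent
        with lock:                           # 3. asynchronously apply to the global net
            for i in range(len(global_params)):
                global_params[i] -= LR * grads[i]
        # 4. the next iteration re-copies global_params, so the worker also
        #    picks up updates made by the other workers in the meantime

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(global_params)  # should end up close to TARGET
```

The gradients each worker applies may be slightly stale (computed against an older copy of the global parameters), which is exactly the trade-off A3C accepts in exchange for lock-free-style parallelism.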
• Advantage
  ◦ Various forms of the policy gradient consist of two components: one that contains information about the direction in which the parameters should be updated, and another that determines how far the parameters should move in that direction.
  ◦ The advantage function can be used in place of the term that determines the magnitude of the update.
  ◦ Using the advantage function reduces the variance compared to using the return directly.
  ◦ However, since the advantage requires two parameterized functions (Q and V), the n-step return is used instead of the Q-function.
  ◦ As the value of n increases, the variance of the advantage estimate increases, whereas a smaller n results in lower variance.
  ◦ n-step Return
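For reference, the n-step return used in place of the Q-function is commonly defined as follows (standard definition; the symbols are assumed, not reproduced from the notes):

```latex
G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V_v(s_{t+n}),
\qquad
A(s_t, a_t) \approx G_t^{(n)} - V_v(s_t)
```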
• Pseudo Code
  1. Initialize the worker's network with the parameters of the global network.
  2. Interact with the environment for n steps to obtain a trajectory (per worker).
  3. Traverse the time steps in reverse order to compute the n-step returns.
  4. Accumulate the gradients for both the actor and the critic using the computed n-step returns.
  5. Apply the accumulated gradients to update the global network asynchronously, regardless of whether other agents have finished.
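Step 3, the reverse-order traversal, can be sketched as follows. The function name and inputs are hypothetical, but the recursion R_t = r_t + γ·R_{t+1}, seeded with the critic's bootstrap value for the last state, is the standard A3C convention:

```python
def n_step_returns(rewards, bootstrap, gamma=0.99):
    """Traverse the trajectory's rewards in reverse to compute n-step returns.

    `bootstrap` is the critic's value estimate V(s_T) for the state reached
    after the last step (use 0.0 if the episode terminated there).
    """
    returns = []
    R = bootstrap
    for r in reversed(rewards):
        R = r + gamma * R        # R_t = r_t + gamma * R_{t+1}
        returns.append(R)
    returns.reverse()            # restore chronological order
    return returns

# Example: a 3-step trajectory with bootstrap value 10 and gamma = 0.5
print(n_step_returns([1.0, 2.0, 3.0], bootstrap=10.0, gamma=0.5))  # -> [4.0, 6.0, 8.0]
```

Computing the returns backwards makes each return a single add-and-multiply, instead of re-summing the discounted tail for every time step.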


