
REINFORCE

Created
2024/08/05 02:59
Last edited
2025/11/10 08:43
Tags
Reinforcement Learning
Author

1. REINFORCE

β€’ Among the various approximate forms of the policy gradient discussed earlier, this corresponds to the REINFORCE algorithm, which uses the return $G_t$.
1. Generate $M$ trajectories based on the current parameter $\theta$.
2. Compute the return for each episode.
3. Approximate $\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{t=0}^{T-1} G_t \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$ using the sample mean.
4. Since the objective function is defined as the expected return over the entire trajectory, apply gradient ascent to update the parameters.
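The four steps above can be sketched on a deliberately tiny, hypothetical task (a single state with two actions, where action 1 pays reward 1 and action 0 pays nothing) so the sample-mean gradient estimate is easy to follow. The task, function names, and hyperparameters here are illustrative assumptions, not from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

def compute_returns(rewards, gamma=0.9):
    """Discounted return G_t for every time step of one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

def train_reinforce(updates=100, M=10, T=5, lr=0.1, gamma=0.9):
    """Plain REINFORCE on a toy one-state task: action 1 pays reward 1,
    action 0 pays 0, so the policy should learn to prefer action 1."""
    theta = np.zeros(2)  # softmax policy logits, one per action
    for _ in range(updates):
        grad = np.zeros_like(theta)
        for _ in range(M):                          # 1. generate M trajectories
            probs = softmax(theta)
            actions = rng.choice(2, size=T, p=probs)
            rewards = (actions == 1).astype(float)
            returns = compute_returns(rewards, gamma)  # 2. return per step
            for a, G_t in zip(actions, returns):
                glog = -probs                       # grad of log softmax:
                glog[a] += 1.0                      #   one-hot(a) - probs
                grad += G_t * glog                  # 3. sum G_t * grad log pi
        theta += lr * grad / M                      # 4. ascent on sample mean
    return theta
```

Note that the parameters are only touched once per batch of $M$ finished episodes, which is exactly the first disadvantage listed below.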
β€’ Disadvantages of REINFORCE
β—¦ The policy can only be updated after a full episode has finished.
β—¦ Since the gradient is proportional to the return, it exhibits high variance.
β—¦ It is an on-policy method, so trajectories must be regenerated under each updated policy and past samples cannot be reused.

2. REINFORCE with baseline

β€’ To reduce the variance, a baseline is introduced.
β€’ If this baseline is a function independent of the action, then, as proven earlier, it does not affect the expectation of the gradient.
$\therefore \nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{t=0}^{T-1} (G_t - b(s_t)) \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$
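The fact that an action-independent baseline leaves the expectation unchanged can be checked in one line: because the policy's probabilities sum to one in every state, the baseline term contributes nothing to the expected gradient.

$$E_{a_t \sim \pi_\theta}\!\left[b(s_t)\,\nabla_\theta \log \pi_\theta(a_t|s_t)\right] = b(s_t)\sum_a \pi_\theta(a|s_t)\,\frac{\nabla_\theta \pi_\theta(a|s_t)}{\pi_\theta(a|s_t)} = b(s_t)\,\nabla_\theta \underbrace{\sum_a \pi_\theta(a|s_t)}_{=\,1} = 0$$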
1. Generate an episode based on the current policy parameter $\theta$.
2. Compute the return for each time step.
3. Use the computed return as the target and apply gradient descent to update the parameters of the baseline (state-value function).
4. Apply the gradient ascent method to update the policy parameters.
5. After updating with one episode, repeat the process for newly generated episodes.
β€’ $\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{t=0}^{T-1} (G_t - b(s_t)) \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$ is approximated using the sample mean, and the update is performed in an online manner, that is, the policy is updated immediately using data obtained from each 1-step interaction.
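A minimal sketch of the same loop with a learned baseline, again on a hypothetical one-state, two-action toy task (names and hyperparameters are assumptions). Since the single state has one value, the baseline here is a scalar $b$ fitted to the return target by gradient descent; the policy is then updated step by step with the advantage $(G_t - b)$ once the episode's returns are known.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(logits):
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

def train_with_baseline(episodes=500, T=5, lr=0.05, lr_b=0.1, gamma=0.9):
    """REINFORCE with baseline on a toy one-state task: action 1 pays
    reward 1, action 0 pays 0.  b is the scalar value of the one state."""
    theta = np.zeros(2)   # softmax policy logits
    b = 0.0               # baseline: value estimate of the single state
    for _ in range(episodes):
        # 1. generate an episode with the current policy
        actions = [int(rng.choice(2, p=softmax(theta))) for _ in range(T)]
        rewards = [float(a == 1) for a in actions]
        # 2. compute the return for each time step
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for a, G_t in zip(actions, returns):
            # 3. gradient descent on the baseline toward the return target
            b += lr_b * (G_t - b)        # step that shrinks (G_t - b)^2
            # 4. gradient ascent on the policy, scaled by (G_t - b)
            probs = softmax(theta)
            glog = -probs                # grad of log softmax:
            glog[a] += 1.0               #   one-hot(a) - probs
            theta += lr * (G_t - b) * glog
    return theta, b
```

Subtracting $b$ leaves the expected gradient unchanged but shrinks the magnitude of each per-step term, which is the variance reduction the baseline is introduced for.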