
REINFORCE

Created
2024/08/05 02:59
Last edited
2025/11/10 08:43
Tags
Reinforcement Learning
Author

1. REINFORCE

β€’ Among the various approximate forms of the policy gradient discussed earlier, this corresponds to the REINFORCE algorithm, which uses the return $G_t$.
1. Generate $M$ trajectories based on the current parameter $\theta$.
2. Compute the return for each episode.
3. Approximate $\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{t=0}^{T-1} G_t \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$ using the sample mean.
4. Since the objective function is defined as the expected return over the entire trajectory, apply gradient ascent to update the parameters.
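The four steps above can be sketched on a deliberately tiny, hypothetical task (a single state with two actions, where action 1 pays reward 1 and action 0 pays nothing) so the sample-mean gradient estimate is easy to follow. The task, function names, and hyperparameters here are illustrative assumptions, not from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

def compute_returns(rewards, gamma=0.9):
    """Discounted return G_t for every time step of one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

def train_reinforce(updates=100, M=10, T=5, lr=0.1, gamma=0.9):
    """Plain REINFORCE on a toy one-state task: action 1 pays reward 1,
    action 0 pays 0, so the policy should learn to prefer action 1."""
    theta = np.zeros(2)  # softmax policy logits, one per action
    for _ in range(updates):
        grad = np.zeros_like(theta)
        for _ in range(M):                          # 1. generate M trajectories
            probs = softmax(theta)
            actions = rng.choice(2, size=T, p=probs)
            rewards = (actions == 1).astype(float)
            returns = compute_returns(rewards, gamma)  # 2. return per step
            for a, G_t in zip(actions, returns):
                glog = -probs                       # grad of log softmax:
                glog[a] += 1.0                      #   one-hot(a) - probs
                grad += G_t * glog                  # 3. sum G_t * grad log pi
        theta += lr * grad / M                      # 4. ascent on sample mean
    return theta
```

Note that the parameters are only touched once per batch of $M$ finished episodes, which is exactly the first disadvantage listed below.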
β€’ Disadvantages of REINFORCE
β—¦ The policy can only be updated after a full episode has finished.
β—¦ Since the gradient is proportional to the return, it exhibits high variance.
β—¦ It is an on-policy method, so trajectories must be regenerated under each updated policy and past samples cannot be reused.

2. REINFORCE with baseline

β€’ To reduce the variance, a baseline is introduced.
β€’ If this baseline is a function independent of the action, then, as proven earlier, it does not affect the expectation of the gradient.
$\therefore \nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{t=0}^{T-1} (G_t - b(s_t)) \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$
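The fact that an action-independent baseline leaves the expectation unchanged can be checked in one line: because the policy's probabilities sum to one in every state, the baseline term contributes nothing to the expected gradient.

$$E_{a_t \sim \pi_\theta}\!\left[b(s_t)\,\nabla_\theta \log \pi_\theta(a_t|s_t)\right] = b(s_t)\sum_a \pi_\theta(a|s_t)\,\frac{\nabla_\theta \pi_\theta(a|s_t)}{\pi_\theta(a|s_t)} = b(s_t)\,\nabla_\theta \underbrace{\sum_a \pi_\theta(a|s_t)}_{=\,1} = 0$$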
1. Generate an episode based on the current policy parameter $\theta$.
2. Compute the return for each time step.
3. Use the computed return as the target and apply gradient descent to update the parameters of the baseline (state-value function).
4. Apply the gradient ascent method to update the policy parameters.
5. After updating with one episode, repeat the process for newly generated episodes.
β€’ $\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{t=0}^{T-1} (G_t - b(s_t)) \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$ is approximated using the sample mean, and the update is performed in an online manner, that is, the policy is updated immediately using data obtained from each 1-step interaction.
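A minimal sketch of the same loop with a learned baseline, again on a hypothetical one-state, two-action toy task (names and hyperparameters are assumptions). Since the single state has one value, the baseline here is a scalar $b$ fitted to the return target by gradient descent; the policy is then updated step by step with the advantage $(G_t - b)$ once the episode's returns are known.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(logits):
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

def train_with_baseline(episodes=500, T=5, lr=0.05, lr_b=0.1, gamma=0.9):
    """REINFORCE with baseline on a toy one-state task: action 1 pays
    reward 1, action 0 pays 0.  b is the scalar value of the one state."""
    theta = np.zeros(2)   # softmax policy logits
    b = 0.0               # baseline: value estimate of the single state
    for _ in range(episodes):
        # 1. generate an episode with the current policy
        actions = [int(rng.choice(2, p=softmax(theta))) for _ in range(T)]
        rewards = [float(a == 1) for a in actions]
        # 2. compute the return for each time step
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for a, G_t in zip(actions, returns):
            # 3. gradient descent on the baseline toward the return target
            b += lr_b * (G_t - b)        # step that shrinks (G_t - b)^2
            # 4. gradient ascent on the policy, scaled by (G_t - b)
            probs = softmax(theta)
            glog = -probs                # grad of log softmax:
            glog[a] += 1.0               #   one-hot(a) - probs
            theta += lr * (G_t - b) * glog
    return theta, b
```

Subtracting $b$ leaves the expected gradient unchanged but shrinks the magnitude of each per-step term, which is the variance reduction the baseline is introduced for.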