A few notes on the details of GAE (Generalized Advantage Estimation).
References
- High-Dimensional Continuous Control Using Generalized Advantage Estimation
- Notes on the Generalized Advantage Estimation Paper
Implementation
def compute_gae(next_value, rewards, masks, values, gamma=0.99, tau=0.95):
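However, the final form derived in the paper is $A_{t}^{GAE(\gamma,\lambda)}=\sum_{l=0}^{\infty}(\gamma\lambda)^{l}\delta_{t+l}$, where $\delta_{t}=R_{t}+\gamma V(s_{t+1})-V(s_{t})$ is the TD error. In the code, `tau` plays the role of $\lambda$.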
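At first glance it is not easy to see how this implementation relates to the theoretical form of GAE. The trick is backward accumulation: the advantages are computed from the last time step to the first, reusing the running sum at each step.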
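To make the link concrete: with $\delta_{t}=R_{t}+\gamma V(s_{t+1})-V(s_{t})$, the GAE sum $\sum_{l\ge 0}(\gamma\lambda)^{l}\delta_{t+l}$ satisfies the recursion $A_{t}=\delta_{t}+\gamma\lambda A_{t+1}$, which is exactly what one backward pass computes. Below is a minimal sketch under my own assumptions, not necessarily the body of the original function: `compute_gae_sketch` and `gae_direct_sum` are hypothetical names, `masks[t]` is assumed to be 0 when the episode ends at step `t` and 1 otherwise, and inputs are plain Python lists.

```python
def compute_gae_sketch(next_value, rewards, masks, values, gamma=0.99, tau=0.95):
    """Backward accumulation of GAE; tau plays the role of lambda.

    Assumption: masks[t] == 0 if the episode terminates at step t, else 1.
    """
    values = list(values) + [next_value]  # append bootstrap value V(s_T)
    gae = 0.0
    advantages = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        # TD error; the mask removes the bootstrap term at episode ends
        delta = rewards[t] + gamma * values[t + 1] * masks[t] - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}, reset at episode ends
        gae = delta + gamma * tau * masks[t] * gae
        advantages[t] = gae
    return advantages


def gae_direct_sum(next_value, rewards, values, gamma=0.99, tau=0.95):
    """Forward sum of (gamma * tau)^l * delta_{t+l}, truncated at the end
    of the trajectory; assumes no episode boundary inside the trajectory."""
    values = list(values) + [next_value]
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    return [
        sum((gamma * tau) ** l * deltas[t + l] for l in range(T - t))
        for t in range(T)
    ]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0]
    values = [0.5, 0.4, 0.3]
    backward = compute_gae_sketch(0.2, rewards, [1.0, 1.0, 1.0], values)
    forward = gae_direct_sum(0.2, rewards, values)
    print(backward)
    print(forward)
```

On a trajectory without terminal steps the two functions agree up to floating-point error, but the backward version costs O(T) instead of O(T^2). Adding `values[t]` to each advantage gives a $\lambda$-return that can serve as a regression target for the value function.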
Properties
The hyperparameter $\lambda$ controls the bias-variance tradeoff:
- $\lambda$ close to 1 leads to high variance and low bias
- $\lambda$ close to 0 leads to low variance and high bias
More specifically,
- when $\lambda=1$, the advantage estimate is the discounted return minus the value baseline: $A_{t}^{GAE(\gamma,1)}=\sum_{l=t}^{\infty}[\gamma^{l-t}R_{l}]-V(s_{t})$
- when $\lambda=0$, the advantage estimate is the one-step TD error: $A_{t}^{GAE(\gamma,0)}=R_{t}+\gamma V(s_{t+1})-V(s_{t})$
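The two limiting cases can be checked numerically on a toy trajectory. This is only a sketch: the helper `gae` and the numbers are made up for illustration, and the value after the last step is set to 0 (a terminal state) so that the finite sum matches the infinite-sum formula for $\lambda=1$.

```python
def gae(rewards, values, next_value, gamma, lam):
    """Backward-accumulated GAE for one trajectory without resets."""
    values = list(values) + [next_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy data (hypothetical numbers)
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]
next_value = 0.0  # terminal state after the last step
gamma = 0.9
T = len(rewards)

# Reference quantities computed directly from the two formulas
padded = values + [next_value]
td_errors = [rewards[t] + gamma * padded[t + 1] - values[t] for t in range(T)]
mc_adv = [
    sum(gamma ** (l - t) * rewards[l] for l in range(t, T)) - values[t]
    for t in range(T)
]

adv_lam0 = gae(rewards, values, next_value, gamma, lam=0.0)
adv_lam1 = gae(rewards, values, next_value, gamma, lam=1.0)

print("lambda=0:", adv_lam0, "TD errors:", td_errors)
print("lambda=1:", adv_lam1, "return - V:", mc_adv)
```

If the trajectory is cut off before termination, the $\lambda=1$ case picks up an extra bootstrap term $\gamma^{T-t}V(s_{T})$, so the match with the pure discounted return is only exact when that value is zero.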