
Details of Generalized Advantage Estimator

A few notes on the details of GAE.

References

  • John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, Pieter Abbeel. High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR 2016. arXiv:1506.02438

Implementation

def compute_gae(next_value, rewards, masks, values, gamma=0.99, tau=0.95):
    """compute_gae
    next_value: float, the value estimate of the state following the last step of the rollout (used for bootstrapping)
    rewards: list, the rewards obtained by directly interacting with the environment
    masks: list, 0 at terminal states and 1 otherwise
    values: list, state values estimated by the value function approximator
    tau: corresponds to the hyper-parameter lambda in the paper
    """
    values = values + [next_value]
    gae = 0
    returns = []
    # Iterate backwards over the trajectory, accumulating discounted TD errors
    for step in reversed(range(len(rewards))):
        # One-step TD error: delta_t = R_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[step] + gamma * values[step + 1] * masks[step] - values[step]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}, reset at episode boundaries
        gae = delta + gamma * tau * masks[step] * gae
        # The returned targets are advantage + value, i.e. the lambda-returns
        returns.insert(0, gae + values[step])
    return returns
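
As a quick sanity check, the function can be called on a short rollout like the one below; the numbers are made-up illustration values, not taken from any experiment.

# Hypothetical toy rollout, purely for illustration
rewards = [1.0, 0.0, 1.0, 0.5]       # rewards collected at each step
masks = [1, 1, 0, 1]                 # the third step terminates an episode
values = [0.8, 0.7, 0.9, 0.4]        # V(s_t) predicted by the value network
next_value = 0.3                     # bootstrap value V(s_T) for the state after the last step

returns = compute_gae(next_value, rewards, masks, values, gamma=0.99, tau=0.95)
advantages = [ret - v for ret, v in zip(returns, values)]  # recover A_t = return_t - V(s_t)
print(returns)
print(advantages)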

However, the final form derived in the paper is

$$A_{t}^{GAE(\gamma,\lambda)}=\sum_{l=0}^{\infty}(\gamma\lambda)^{l}\delta_{t+l}^{V},\qquad \delta_{t}^{V}=R_{t}+\gamma V(s_{t+1})-V(s_{t})$$

At first glance it is not obvious how this implementation relates to the theoretical form of GAE. The trick used here is to accumulate the TD errors backwards: walking from the last step of the rollout to the first and applying $A_{t}=\delta_{t}+\gamma\lambda A_{t+1}$ reproduces exactly the sum above (truncated at the end of the rollout, where the code bootstraps with next_value).
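
Unrolling the recursion makes this explicit (a short derivation sketch; the masks are dropped for readability, since they only reset the accumulation at episode boundaries):

$$\begin{aligned}
A_{t}&=\delta_{t}+\gamma\lambda A_{t+1}\\
&=\delta_{t}+\gamma\lambda\left(\delta_{t+1}+\gamma\lambda A_{t+2}\right)\\
&=\delta_{t}+(\gamma\lambda)\delta_{t+1}+(\gamma\lambda)^{2}\delta_{t+2}+\cdots\\
&=\sum_{l=0}^{\infty}(\gamma\lambda)^{l}\delta_{t+l}^{V}
\end{aligned}$$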

Properties

Bias-variance tradeoff using hyperparameter $\lambda$

  • $\lambda$ close to 1 leads to high variance and low bias
  • $\lambda$ close to 0 leads to low variance and high bias (see the identity after this list)
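
One way to see why, following the GAE paper, is to rewrite the estimator as an exponentially weighted average of $k$-step advantage estimators:

$$A_{t}^{GAE(\gamma,\lambda)}=(1-\lambda)\left(A_{t}^{(1)}+\lambda A_{t}^{(2)}+\lambda^{2}A_{t}^{(3)}+\cdots\right),\qquad A_{t}^{(k)}=\sum_{l=0}^{k-1}\gamma^{l}R_{t+l}+\gamma^{k}V(s_{t+k})-V(s_{t})$$

Small $\lambda$ puts most of the weight on the short estimators, which have low variance but inherit the bias of the value function; $\lambda$ near 1 weights the long estimators, which are nearly unbiased but have high variance.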

More specifically,

  • when $\lambda=1$, the advantage estimate equals the discounted return minus the value baseline: $A_{t}^{GAE(\gamma,1)}=\sum_{l=t}^{\infty}\gamma^{l-t}R_{l}-V(s_{t})$
  • when $\lambda=0$, the advantage estimate reduces to the one-step TD error: $A_{t}^{GAE(\gamma,0)}=R_{t}+\gamma V(s_{t+1})-V(s_{t})$ (both cases are sketched below)
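
Both special cases follow directly from the general sum (a brief sketch; the $\lambda=1$ case uses the fact that the value terms telescope):

$$\begin{aligned}
A_{t}^{GAE(\gamma,1)}&=\sum_{l=0}^{\infty}\gamma^{l}\delta_{t+l}^{V}=\sum_{l=0}^{\infty}\gamma^{l}\left(R_{t+l}+\gamma V(s_{t+l+1})-V(s_{t+l})\right)=\sum_{l=0}^{\infty}\gamma^{l}R_{t+l}-V(s_{t})\\
A_{t}^{GAE(\gamma,0)}&=\sum_{l=0}^{\infty}(\gamma\cdot 0)^{l}\delta_{t+l}^{V}=\delta_{t}^{V}=R_{t}+\gamma V(s_{t+1})-V(s_{t})
\end{aligned}$$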