On the Numerical Instability of Conjugate Gradient

Problem description

A conjugate gradient algorithm implemented inside the TensorFlow computational graph produces NaN values from the very first iterations.

The problem was discovered while implementing natural gradient optimization. The implementation follows the Wikipedia article on conjugate gradient strictly, but because the code actually runs inside a static computational graph, it is hard to debug at runtime which step goes wrong.

In the end the problem was solved in a rather brute-force way: print all the variables to the screen, check them one by one, and localize the problem layer by layer. After inspection, the two most likely causes were:

  • The conjugate gradient implementation may encounter a zero denominator.
  • Floating-point underflow when computing the KL divergence, which makes the log output NaN (see the NumPy sketch right after this list).
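
To make the two failure modes concrete, here is a minimal NumPy illustration; the numbers are made up for demonstration and are not taken from the original code:

import numpy as np

# A Gaussian density evaluated far out in the tail underflows to exactly 0.0 in float32.
p_old = np.float32(1e-30) * np.float32(1e-30)   # underflows to 0.0
p_new = np.float32(1e-10)

# The KL integrand p_old * log(p_old / p_new) then evaluates to 0 * (-inf) = nan.
kl_term = p_old * np.log(p_old / p_new)
print(kl_term)                                  # nan

# Similarly, a vanishing denominator in the conjugate gradient update blows up:
p_dot_Ap = np.float32(0.0)
alpha = np.float32(1.0) / p_dot_Ap              # inf, which typically turns into nan
print(alpha)                                    # a few updates later (e.g. inf - inf)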

In theory a singular FIM should not arise: the Fisher information matrix, the core metric of information geometry optimization, is generally non-singular. Given $N$ distinct samples, each sample contributes a term of rank up to $\dim(\mathcal{Z})$ to the estimated FIM, where $\dim(\mathcal{Z})$ is the dimension of the probability distribution over the model's output space, so for generic samples the estimate reaches full rank once $N\times{\dim(\mathcal{Z})}$ is large enough. In other words, as long as the batch size is big enough, the FIM should not be singular.

In practice, however, the FIM can still end up singular, or numerically indistinguishable from singular, because of floating-point issues.
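
A toy NumPy check makes this concrete. The model, the dimensions, and the nearly collinear feature below are illustrative assumptions rather than the setup from the original experiments:

import numpy as np

np.random.seed(0)
N, d = 16, 8    # batch size and parameter dimension, chosen for illustration

# Toy model: a 1-D Gaussian policy with mean w^T x and fixed unit variance.
# Its Fisher information with respect to w is E[x x^T], estimated from a batch.
X = np.random.randn(N, d).astype(np.float32)
X[:, -1] = X[:, 0] + 1e-4 * np.random.randn(N).astype(np.float32)  # nearly collinear feature
F_hat = (X.T @ X) / N

print('rank     :', np.linalg.matrix_rank(F_hat), 'out of', d)
print('condition:', np.linalg.cond(F_hat))
# Even though N > d, the nearly collinear feature makes F_hat effectively singular at
# float32 precision: its numerical rank drops below d and its condition number explodes,
# which is exactly the regime where the conjugate gradient divides by (almost) zero.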

Code implementation

Conjugate gradient in TensorFlow static computational graph

Conjugate gradient is genuinely non-trivial to implement, and writing it inside TensorFlow's static computational graph is even more troublesome: there are considerably more pitfalls than in a NumPy implementation. The code comes first:

import tensorflow as tf

EPSILON = 1e-8  # small constant guarding divisions; value assumed, not shown in the original snippet


def hessian_vector_product(x, grad, variable):
    """Hessian/Fisher-vector product: differentiate sum(grad * x) w.r.t. variable, treating x as constant."""
    kl_grad_prod = tf.reduce_sum(grad * x)
    return tf.stop_gradient(tf.gradients(kl_grad_prod, variable)[0])


def build_conjugate_gradient(x, kl_grad, variable, n_iter=10, func_Ax=hessian_vector_product):
    """build_conjugate_gradient
    :param x: type tf.Tensor, the initial value of x
    :param kl_grad: type tf.Tensor, the gradient of the KL divergence
    :param variable: type tf.Variable
    :return: the converged conjugate gradient vector \tilde{x} = H^{-1}x

    Fixed number of iterations in the inner loop
    """
    x = tf.stop_gradient(x)
    r = x - func_Ax(x, kl_grad, variable)
    p = tf.stop_gradient(r)
    r_dot_r = tf.reduce_sum(tf.square(r))
    for k in range(n_iter):
        Ap = func_Ax(p, kl_grad, variable)
        p_dot_Ap = tf.reduce_sum(p * Ap)
        alpha = r_dot_r / (p_dot_Ap + EPSILON)
        x = x + alpha * p
        r = r - alpha * Ap
        r_dot_r_new = tf.reduce_sum(tf.square(r))
        beta = r_dot_r_new / (r_dot_r + EPSILON)
        r_dot_r = r_dot_r_new
        p = r + beta * p
    return x

Ideas worth noting:

  • The conjugate gradient does not require direct access to the explicit form of the matrix; only matrix-vector products are needed. Let $\hat{F}\in{\mathbb{R}^{N\times{N}}}$ be the FIM estimated with a batch of data, and let $v\in{\mathbb{R}^{N}}$ be an arbitrary vector, then $\hat{F}v=\nabla_{\theta}\left(v^{T}\nabla_{\theta}D_{KL}(\pi_{\text{old}}||\pi)\right)$, so CG can solve $\hat{F}x=v$ without ever forming $\hat{F}$ (a standalone NumPy sanity check of this loop follows the list).
  • When calculating the matrix-vector product $\hat{F}v$, remember to block the gradient from the vector $v$ using tf.stop_gradient in every iteration. Your gradient should only come from the KL divergence term.
  • There are division operations for the variables alpha and beta. Remember to add a small number (EPSILON) to the denominator to guard against NaN output.
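
Before trusting the graph version, it helps to sanity-check the same fixed-iteration loop in NumPy against a direct solver. The snippet below is a standalone check written for this purpose, not part of the original code; the damped random SPD matrix merely stands in for a (damped) FIM:

import numpy as np

EPSILON = 1e-8

def conjugate_gradient(matvec, b, n_iter=10):
    """Solve A x = b using only matrix-vector products, mirroring the TF loop above (initial guess x0 = b)."""
    x = b.copy()
    r = b - matvec(x)
    p = r.copy()
    r_dot_r = r @ r
    for _ in range(n_iter):
        Ap = matvec(p)
        alpha = r_dot_r / (p @ Ap + EPSILON)
        x = x + alpha * p
        r = r - alpha * Ap
        r_dot_r_new = r @ r
        beta = r_dot_r_new / (r_dot_r + EPSILON)
        r_dot_r = r_dot_r_new
        p = r + beta * p
    return x

np.random.seed(0)
M = np.random.randn(10, 10)
A = M @ M.T + 0.1 * np.eye(10)   # symmetric positive definite, like a damped FIM
b = np.random.randn(10)

x_cg = conjugate_gradient(lambda v: A @ v, b)
x_ref = np.linalg.solve(A, b)
print(np.max(np.abs(x_cg - x_ref)))   # should be tiny; if it is not, the loop is wrong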

Other conjugate gradient implementations based on TensorFlow's static computational graph can be found on GitHub. One that left a deep impression on me also moves the while-loop condition into the graph using tf.while_loop; the code is as follows:

def _conjugate_gradient_ops(self, pg_grads, kl_grads, max_iterations=20, residual_tol=1e-10):
    i0 = tf.constant(0, dtype=tf.int32)
    loop_condition = lambda i, r, p, x, rdotr: tf.logical_and(
        tf.greater(rdotr, residual_tol), tf.less(i, max_iterations))

    def body(i, r, p, x, rdotr):
        fvp = utils.ops.flatten_vars(tf.gradients(
            tf.reduce_sum(tf.stop_gradient(p) * kl_grads),
            self.policy_network.params))
        z = fvp + self.cg_damping * p
        alpha = rdotr / (tf.reduce_sum(p * z) + 1e-8)
        x += alpha * p
        r -= alpha * z
        new_rdotr = tf.reduce_sum(r * r)
        beta = new_rdotr / (rdotr + 1e-8)
        p = r + beta * p
        new_rdotr = tf.Print(new_rdotr, [i, new_rdotr], 'Iteration / Residual: ')
        return i + 1, r, p, x, new_rdotr

    _, r, p, stepdir, rdotr = tf.while_loop(
        loop_condition,
        body,
        loop_vars=[i0,
                   pg_grads,
                   pg_grads,
                   tf.zeros_like(pg_grads),
                   tf.reduce_sum(pg_grads * pg_grads)])
    fvp = utils.ops.flatten_vars(tf.gradients(
        tf.reduce_sum(tf.stop_gradient(stepdir) * kl_grads),
        self.policy_network.params))
    shs = 0.5 * tf.reduce_sum(stepdir * fvp)
    lm = tf.sqrt((shs + 1e-8) / self.max_kl)
    fullstep = stepdir / lm
    neggdotstepdir = tf.reduce_sum(pg_grads * stepdir) / lm
    return fullstep, neggdotstepdir

Model definition and environment interaction

After a whole evening of debugging, the root cause turned out to lie in this part of the code. Printing the actual values of all variables in a Jupyter notebook revealed some strange phenomena:

  • The problem occurred in the conversion from ordinary gradients to natural gradients: the ordinary gradients were numerically fine, while the natural gradients often exploded or became NaN outright.
  • A small network built to test the correctness of the conjugate gradient implementation showed that the FIM is occasionally singular, presumably due to numerical instability.
  • Further testing showed that the FIM does not become singular at every iteration. Relatively large parameter tensors close to the bottom layers rarely produced a singular FIM, whereas small parameter tensors close to the output layer produced NaN easily; at the time, the parameter that hit NaN most frequently was the bias of the network's last layer.

My guess was that this numerical instability was caused by an unstable KL divergence, so I simply made the old and the new policy networks one and the same, sharing parameters via reuse=True. The KL divergence is then always 0, which is acceptable in principle, since all we actually need are the gradient and the Hessian of the KL divergence.

Later experiments showed that this choice makes the updates progress slowly and makes the network prone to converging to sub-optimal points; if I find out more about this issue I will update this part of the post.
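
For reference, the identities below are standard and not specific to this post; they explain why a zero KL value is harmless as long as the old policy is treated as a constant while differentiating (e.g. via tf.stop_gradient): at the current parameters both the value and the gradient of the KL divergence vanish, and the quantity the conjugate gradient actually consumes is the curvature, i.e. the FIM.

$$
D_{KL}\big(\pi_{\theta_{\text{old}}}\,\|\,\pi_{\theta}\big)\Big|_{\theta=\theta_{\text{old}}}=0,\qquad
\nabla_{\theta}D_{KL}\big(\pi_{\theta_{\text{old}}}\,\|\,\pi_{\theta}\big)\Big|_{\theta=\theta_{\text{old}}}=0,\qquad
\nabla^{2}_{\theta}D_{KL}\big(\pi_{\theta_{\text{old}}}\,\|\,\pi_{\theta}\big)\Big|_{\theta=\theta_{\text{old}}}=F(\theta_{\text{old}})
$$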

def collect_multi_batch(env, agent, maxlen, batch_size=64, qsize=5):
    """collect_multi_batch
    See collect_one_trajectory docstring
    :return: three lists of batch data (s, a, r)
    """
    que = []
    s_init = env.reset()
    que.append(s_init[None, :])
    for it in range(qsize - 1):
        st, r, done, _ = env.step([-0.99])
        que.append(st[None, :])
    # Interact with environment
    buffer_s, buffer_a, buffer_r = [], [], []
    for it in range(maxlen):
        s = np.concatenate(que, axis=-1)
        a = agent.choose_action(s)
        buffer_s.append(s)
        s, r, done, _ = env.step(a)
        que.pop(0)
        que.append(s[None, :])
        buffer_a.append(a[None, :])
        r = (r + 0.3) * 2.0
        buffer_r.append(r)
        if done:
            break
    # Accumulate rewards
    discounted = 1.0
    last_value = model.value_estimate(s)
    discounted_r = []
    for it in range(len(buffer_r) - 2, -1, -1):
        last_value = r + args.gamma * last_value
        discounted_r.append(last_value)
    state_data, action_data, reward_data = [], [], []
    for it in range(0, maxlen, batch_size):
        if it >= len(buffer_s):
            break
        states_array = np.concatenate(buffer_s[it: it + batch_size], axis=0)
        actions_array = np.concatenate(buffer_a[it: it + batch_size], axis=0)
        rewards_array = np.array(discounted_r[it: it + batch_size], dtype=np.float32)[:, None]
        # rewards_array = np.clip(rewards_array, -1.0, 5.0)
        state_data.append(states_array)
        action_data.append(actions_array)
        reward_data.append(rewards_array)
    return state_data, action_data, reward_data


class TNPGModel(object):

    def __init__(self, v_lr, pi_lr, model_dir, delta=1e-3):
        self.state = tf.placeholder(tf.float32, [None, 10], name='state')
        self.action = tf.placeholder(tf.float32, [None, 1], name='action')
        self.reward = tf.placeholder(tf.float32, [None, 1], name='reward')

        # Advantage function definition
        print(' [*] Building advantage function...')
        kwargs = {'kernel_initializer': tf.orthogonal_initializer()}
        with tf.variable_scope('value'):
            h1 = tf.layers.dense(self.state, 128, activation=tf.nn.relu, name='h1', **kwargs)
            self.value = tf.layers.dense(h1, 1, activation=None, name='value', **kwargs)
            self.advantage = self.reward - self.value

        self.v_loss = tf.reduce_mean(tf.square(self.advantage))
        v_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='value')
        self.v_train = tf.train.AdamOptimizer(v_lr).minimize(self.v_loss, var_list=v_vars)

        # Policy function definition
        print(' [*] Building policy function...')
        self.policy, pi_vars = build_gaussian_network(self.state, 1, scope='policy')
        old_policy, old_vars = build_gaussian_network(self.state, 1, scope='policy', trainable=False, reuse=True)
        with tf.name_scope('policy_ops'):
            # self.assign_op = [old.assign(new) for old, new in zip(old_vars, pi_vars)]
            self.sample_op = self.policy.sample(1)
        with tf.name_scope('surrogate_loss'):
            ratio = self.policy.prob(self.action) / old_policy.prob(self.action)
            surrogate = ratio * self.advantage
            self.pi_loss = -tf.reduce_mean(surrogate)

        # Convert Adam gradient to natural gradient
        print(' [*] Building natural gradient...')
        with tf.variable_scope('policy_optim'):
            kl = tf.distributions.kl_divergence(old_policy, self.policy)
            optim = tf.train.AdamOptimizer(pi_lr)
            pi_grads_and_vars = optim.compute_gradients(surrogate, var_list=pi_vars)
            pi_grads = [pair[0] for pair in pi_grads_and_vars]
            kl_grads = tf.gradients(kl, pi_vars)

            conj_grads = []
            for grad, kl_grad, var in zip(pi_grads, kl_grads, pi_vars):
                conj = build_conjugate_gradient(grad, kl_grad, var)
                nat_grad = tf.sqrt((2.0 * delta) / (tf.reduce_sum(grad * conj) + EPSILON)) * conj
                conj_grads.append((nat_grad, var))
            self.pi_train = optim.apply_gradients(conj_grads)

        # Summaries definition
        print(' [*] Building summaries...')
        model_variance = tf.reduce_mean(self.policy._scale)
        self.sums = tf.summary.merge([
            tf.summary.scalar('max_rewards', tf.reduce_max(self.reward)),
            tf.summary.scalar('mean_advantage', tf.reduce_mean(self.advantage)),
            tf.summary.scalar('pi_loss', self.pi_loss),
            tf.summary.scalar('v_loss', self.v_loss),
            tf.summary.scalar('model_variance', model_variance)
        ], name='summaries')

        config = tf.ConfigProto()
        # config.gpu_options.allow_growth = True
        self.sess = tf.Session(config=config)
        self.sess.run(tf.global_variables_initializer())
        print(' [*] Model built finished')
        _, self.counter = load(self.sess, model_dir)

    def choose_action(self, s):
        a = self.sess.run(self.sample_op, feed_dict={self.state: s})
        return a[0, :, 0]

    def update(self, s, a, r, v_iter, pi_iter, writer=None, counter=0):
        feed_dict = {self.state: s, self.action: a, self.reward: r}
        # self.sess.run(self.assign_op)
        # update policy
        for _ in range(pi_iter):
            self.sess.run(self.pi_train, feed_dict=feed_dict)
        # update value function
        for _ in range(v_iter):
            self.sess.run(self.v_train, feed_dict=feed_dict)
        if writer is not None:
            sumstr = self.sess.run(self.sums, feed_dict=feed_dict)
            writer.add_summary(sumstr, counter)
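
One detail in the code above worth spelling out is the scaling applied to nat_grad. It is the standard trust-region step size used in TNPG/TRPO (standard material, not specific to this post): with $g$ the policy gradient and $x=\hat{F}^{-1}g$ the vector returned by the conjugate gradient, the largest step $s=\beta x$ that satisfies the quadratic KL constraint $\frac{1}{2}s^{T}\hat{F}s\le\delta$ is

$$
s=\sqrt{\frac{2\delta}{x^{T}\hat{F}x}}\,x=\sqrt{\frac{2\delta}{g^{T}\hat{F}^{-1}g}}\,\hat{F}^{-1}g,
$$

which is exactly tf.sqrt((2.0 * delta) / (tf.reduce_sum(grad * conj) + EPSILON)) * conj, with EPSILON once more guarding against a vanishing denominator.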