While running an existing TensorFlow implementation, I noticed that the learning rate stays the same across epochs. The original implementation uses tf.train.MomentumOptimizer and has a decay rate configured.
My understanding of the momentum optimizer is that the learning rate should decrease as the epochs go on. Why does the learning rate stay constant during my training? Could the learning rate also depend on performance, e.g. if performance does not change much, the learning rate stays the same? I think I am not clear about the underlying mechanism of the momentum optimizer, and I am confused that the learning rate stays the same across epochs even though I expected it to keep decreasing according to the given decay rate.
The optimizer is defined as follows:
learning_rate = 0.2
decay_rate = 0.95
self.learning_rate_node = tf.train.exponential_decay(learning_rate=learning_rate,
                                                     global_step=global_step,
                                                     decay_steps=training_iters,
                                                     decay_rate=decay_rate,
                                                     staircase=True)
optimizer = tf.train.MomentumOptimizer(learning_rate=self.learning_rate_node).minimize(self.net.cost,
                                                                                       global_step=global_step)
Best answer
Without seeing the code, it is hard to tell whether my answer will help you.
However, here is how the momentum optimizer works and how the learning rate should decay.
First, the most basic update of the vanilla GradientDescentMinimizer:
W^(n+1) = W^(n) - alpha * (gradient of the cost w.r.t. W)(W^(n))
You simply follow the direction opposite to the gradient.
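As a tiny toy illustration of that update (my own example with a simple quadratic cost, not your code):
# Minimize cost(W) = (W - 3)^2 with plain gradient descent.
grad = lambda W: 2.0 * (W - 3.0)   # gradient of the cost w.r.t. W

W, alpha = 0.0, 0.1                # starting point and fixed learning rate
for n in range(100):
    W = W - alpha * grad(W)        # W^(n+1) = W^(n) - alpha * grad(W^(n))
print(W)                           # converges to 3.0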
GradientDescentMinimizer with learning rate decay:
W^(n+1) = W^(n) - alpha(n) * (gradient of the cost w.r.t. W)(W^(n))
The only thing that changed is the learning rate alpha, which now depends on the step. The form most commonly used in TensorFlow is exponential decay, where after N steps the learning rate is divided by some constant, e.g. 10.
This change usually happens later in training, so you may need to let a few epochs pass before you see the decay.
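As a minimal sketch (not your code) of what tf.train.exponential_decay computes with staircase=True, using the same parameters as your snippet, this shows why the rate looks flat until global_step passes a full multiple of decay_steps:
import math

def decayed_lr(base_lr, global_step, decay_steps, decay_rate):
    # With staircase=True the exponent is floored, so the rate is constant
    # within each block of decay_steps iterations.
    return base_lr * decay_rate ** math.floor(global_step / float(decay_steps))

# learning_rate=0.2, decay_rate=0.95, decay_steps=training_iters (assume 1000 here):
for step in [0, 500, 999, 1000, 2000, 5000]:
    print(step, decayed_lr(0.2, step, 1000, 0.95))
# The rate only drops once global_step reaches a full multiple of decay_steps.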
MomentumOptimizer: you have to keep an additional variable, the update you performed previously, i.e. at every time step you store:
update^(n) = W^(n) - W^(n-1)
Then the momentum-corrected update is:
update^(n+1) = m * update^(n) - alpha * (gradient of the cost w.r.t. W)(W^(n))
So what you are doing is plain gradient descent, corrected by remembering the immediate past (there are smarter and more complicated ways of doing this, such as Nesterov's momentum).
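Here is a toy sketch of that rule on the same quadratic cost as above (my own illustrative numbers):
# Minimize cost(W) = (W - 3)^2 with the momentum update described above.
grad = lambda W: 2.0 * (W - 3.0)            # gradient of the cost w.r.t. W

W, update = 0.0, 0.0
m, alpha = 0.9, 0.05                        # momentum coefficient and learning rate
for n in range(200):
    update = m * update - alpha * grad(W)   # update^(n+1) = m*update^(n) - alpha*grad(W^(n))
    W = W + update                          # W^(n+1) = W^(n) + update^(n+1)
print(W)                                    # ends up close to 3.0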
MomentumOptimizer with learning rate decay:
update^(n) = W^(n) - W^(n-1)
update^(n+1) = m * update^(n) - alpha(n) * (gradient of the cost w.r.t. W)(W^(n))
alpha now also depends on n. So at some point it will start to slow down, just as gradient descent with learning rate decay does, but the descent will be affected by the momentum.
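Combining the two sketches above, alpha simply becomes a function of the step count (made-up values, just for illustration):
import math

grad = lambda W: 2.0 * (W - 3.0)     # same toy cost as above
W, update, m = 0.0, 0.0, 0.9
for n in range(200):
    # staircase-decayed alpha(n): divided by 2 every 50 steps (illustrative numbers)
    alpha_n = 0.05 * 0.5 ** math.floor(n / 50.0)
    update = m * update - alpha_n * grad(W)
    W = W + update
print(W)                             # still converges near 3.0, with ever smaller steps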
For a complete review of these methods and much more, you can check the excellent website, which explains it better than I do, and Alec Radford's famous visualization, which is worth more than a thousand words.
The learning rate should not depend on performance unless that is specified in the decay!
Seeing the code in question would help!
EDIT1: Here is a working example that I think answers both of the questions you asked:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Pure SGD
BATCH_SIZE = 1
# Batch Gradient Descent
# BATCH_SIZE = 1000
starter_learning_rate = 0.001

xdata = np.linspace(0., 2 * np.pi, 1000)[:, np.newaxis]
ydata = np.sin(xdata) + np.random.normal(0.0, 0.05, size=1000)[:, np.newaxis]
plt.scatter(xdata, ydata)

x = tf.placeholder(tf.float32, [None, 1])
y = tf.placeholder(tf.float32, [None, 1])

# We define global_step as a variable initialized at 0
global_step = tf.Variable(0, trainable=False)

w1 = tf.Variable(0.05 * tf.random_normal((1, 100)), tf.float32)
w2 = tf.Variable(0.05 * tf.random_normal((100, 1)), tf.float32)
b1 = tf.Variable(np.zeros([100]).astype("float32"), tf.float32)
b2 = tf.Variable(np.zeros([1]).astype("float32"), tf.float32)

h1 = tf.nn.relu(tf.matmul(x, w1) + b1)
y_model = tf.matmul(h1, w2) + b2

L = tf.reduce_mean(tf.square(y_model - y))

# We want to decrease the learning rate after having seen all the data 5 times
NUM_EPOCHS_PER_DECAY = 5
LEARNING_RATE_DECAY_FACTOR = 0.1
# Since the mechanism of the decay depends on the number of iterations and not epochs,
# we have to connect the number of epochs to the number of iterations.
# With batch_size=1 we have to iterate exactly 1000 times to do one epoch,
# so 5*1000=5000 iterations pass before decaying. If batch_size were 1000,
# 1 iteration = 1 epoch and we would decrease the rate after 5 iterations.
num_batches_per_epoch = int(xdata.shape[0] / float(BATCH_SIZE))
decay_steps = int(num_batches_per_epoch * NUM_EPOCHS_PER_DECAY)
decayed_learning_rate = tf.train.exponential_decay(starter_learning_rate,
                                                   global_step,
                                                   decay_steps,
                                                   LEARNING_RATE_DECAY_FACTOR,
                                                   staircase=True)

# So now we have an object that depends on global_step and that will be divided by 10
# every decay_steps iterations, i.e. when global_step = N*decay_steps with N a non-zero integer.
# We now create a train_step to which we pass the decayed learning rate. Each time it is run,
# global_step will be incremented by 1; we are going to check that this is the case.
# BE CAREFUL: WE HAVE TO GIVE IT GLOBAL_STEP AS AN ARGUMENT.
train_step = tf.train.GradientDescentOptimizer(decayed_learning_rate).minimize(L, global_step=global_step)

sess = tf.Session()
sess.run(tf.initialize_all_variables())
GLOBAL_s = []
lr_val = []
COSTS = []
for i in range(16000):
    # We will do 16000 iterations, so as there is a decay every 5000 iterations
    # we will see 3 decays (at 5000, 10000 and 15000).
    start_data = (i * BATCH_SIZE) % 1000
    COSTS.append([sess.run(L, feed_dict={x: xdata, y: ydata})])
    GLOBAL_s.append([sess.run(global_step)])
    lr_val.append([sess.run(decayed_learning_rate)])
    # I see the train_step as implicitly executing sess.run(tf.add(global_step, 1))
    sess.run(train_step, feed_dict={x: xdata[start_data:start_data + BATCH_SIZE],
                                    y: ydata[start_data:start_data + BATCH_SIZE]})

plt.figure()
plt.subplot(211)
plt.plot(GLOBAL_s, lr_val, "-b")
plt.title("Evolution of learning rate")
plt.subplot(212)
plt.plot(GLOBAL_s, COSTS, ".g")
plt.title("Evolution of cost")
# Notice two things: first, global_step is actually being incremented,
# and the learning rate is actually being decayed.
(Obviously you can write MomentumOptimizer() instead of GradientDescentOptimizer()...)
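For instance, reusing decayed_learning_rate and global_step from the example above (the momentum value 0.9 here is just an illustrative choice, not something from your code):
train_step = tf.train.MomentumOptimizer(decayed_learning_rate,
                                        momentum=0.9).minimize(L, global_step=global_step)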
Here are the two plots I get:
To sum up: every time you call train_step, TensorFlow runs tf.add(global_step, 1).
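You can verify that behaviour yourself with a quick check in the session from the example above; global_step should increase by exactly one per call:
before = sess.run(global_step)
sess.run(train_step, feed_dict={x: xdata[:1], y: ydata[:1]})
after = sess.run(global_step)
print(before, after)   # expected: after == before + 1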
Regarding tensorflow - change of learning rate with the momentum optimizer, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40443402/