While running an existing TensorFlow implementation, I noticed that the learning rate stays the same across epochs. The original implementation uses tf.train.MomentumOptimizer and has a decay rate configured.
My understanding of the momentum optimizer is that the learning rate should decrease as the epochs go on. Why does the learning rate stay constant during my training? Could the learning rate also depend on performance, e.g. if performance does not change much, the learning rate stays the same? I think I am not clear about the underlying mechanism of the momentum optimizer, and I am confused that the learning rate stays the same across epochs even though I expected it to keep decreasing according to the given decay rate.
The optimizer is defined as follows:
learning_rate = 0.2
decay_rate = 0.95
self.learning_rate_node = tf.train.exponential_decay(learning_rate=learning_rate,
                                                     global_step=global_step,
                                                     decay_steps=training_iters,
                                                     decay_rate=decay_rate,
                                                     staircase=True)
optimizer = tf.train.MomentumOptimizer(learning_rate=self.learning_rate_node).minimize(self.net.cost,
                                                                                       global_step=global_step)
Best answer
Without seeing the code, it is hard to tell whether my answer will help you.
However, here is how the momentum optimizer works and how the learning rate should decay.
First, the most basic update of the vanilla GradientDescentMinimizer:
W^(n+1) = W^(n) - alpha * (gradient of the cost w.r.t. W)(W^(n))
You simply follow the direction opposite to the gradient.
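As a tiny toy illustration of that update (my own example with a simple quadratic cost, not your code):
# Minimize cost(W) = (W - 3)^2 with plain gradient descent.
grad = lambda W: 2.0 * (W - 3.0)   # gradient of the cost w.r.t. W

W, alpha = 0.0, 0.1                # starting point and fixed learning rate
for n in range(100):
    W = W - alpha * grad(W)        # W^(n+1) = W^(n) - alpha * grad(W^(n))
print(W)                           # converges to 3.0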
GradientDescentMinimizer with learning rate decay:
W^(n+1) = W^(n) - alpha(n) * (gradient of the cost w.r.t. W)(W^(n))
The only thing that changed is the learning rate alpha, which now depends on the step. The form most commonly used in TensorFlow is exponential decay, where after N steps the learning rate is divided by some constant, e.g. 10.
This change usually happens later in training, so you may need to let a few epochs pass before you see the decay.
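As a minimal sketch (not your code) of what tf.train.exponential_decay computes with staircase=True, using the same parameters as your snippet, this shows why the rate looks flat until global_step passes a full multiple of decay_steps:
import math

def decayed_lr(base_lr, global_step, decay_steps, decay_rate):
    # With staircase=True the exponent is floored, so the rate is constant
    # within each block of decay_steps iterations.
    return base_lr * decay_rate ** math.floor(global_step / float(decay_steps))

# learning_rate=0.2, decay_rate=0.95, decay_steps=training_iters (assume 1000 here):
for step in [0, 500, 999, 1000, 2000, 5000]:
    print(step, decayed_lr(0.2, step, 1000, 0.95))
# The rate only drops once global_step reaches a full multiple of decay_steps.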
MomentumOptimizer: you have to keep an additional variable, the update you performed previously, i.e. at every time step you store:
update^(n) = W^(n) - W^(n-1)
Then the momentum-corrected update is:
update^(n+1) = m * update^(n) - alpha * (gradient of the cost w.r.t. W)(W^(n))
So what you are doing is plain gradient descent, corrected by remembering the immediate past (there are smarter and more complicated ways of doing this, such as Nesterov's momentum).
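Here is a toy sketch of that rule on the same quadratic cost as above (my own illustrative numbers):
# Minimize cost(W) = (W - 3)^2 with the momentum update described above.
grad = lambda W: 2.0 * (W - 3.0)            # gradient of the cost w.r.t. W

W, update = 0.0, 0.0
m, alpha = 0.9, 0.05                        # momentum coefficient and learning rate
for n in range(200):
    update = m * update - alpha * grad(W)   # update^(n+1) = m*update^(n) - alpha*grad(W^(n))
    W = W + update                          # W^(n+1) = W^(n) + update^(n+1)
print(W)                                    # ends up close to 3.0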
MomentumOptimizer with learning rate decay:
update^(n) = W^(n) - W^(n-1)
update^(n+1) = m * update^(n) - alpha(n) * (gradient of the cost w.r.t. W)(W^(n))
alpha now also depends on n. So at some point it will start to slow down, just as gradient descent with learning rate decay does, but the descent will be affected by the momentum.
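Combining the two sketches above, alpha simply becomes a function of the step count (made-up values, just for illustration):
import math

grad = lambda W: 2.0 * (W - 3.0)     # same toy cost as above
W, update, m = 0.0, 0.0, 0.9
for n in range(200):
    # staircase-decayed alpha(n): divided by 2 every 50 steps (illustrative numbers)
    alpha_n = 0.05 * 0.5 ** math.floor(n / 50.0)
    update = m * update - alpha_n * grad(W)
    W = W + update
print(W)                             # still converges near 3.0, with ever smaller steps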
For a complete review of these methods and much more, you can check the excellent website, which explains it better than I do, and Alec Radford's famous visualization, which is worth more than a thousand words.
The learning rate should not depend on performance unless that is specified in the decay!
Seeing the code in question would help!
EDIT1: Here is a working example that I think answers both of the questions you asked:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Pure SGD
BATCH_SIZE = 1
# Batch Gradient Descent
# BATCH_SIZE = 1000
starter_learning_rate = 0.001

xdata = np.linspace(0., 2 * np.pi, 1000)[:, np.newaxis]
ydata = np.sin(xdata) + np.random.normal(0.0, 0.05, size=1000)[:, np.newaxis]
plt.scatter(xdata, ydata)

x = tf.placeholder(tf.float32, [None, 1])
y = tf.placeholder(tf.float32, [None, 1])

# We define global_step as a variable initialized at 0
global_step = tf.Variable(0, trainable=False)

w1 = tf.Variable(0.05 * tf.random_normal((1, 100)), tf.float32)
w2 = tf.Variable(0.05 * tf.random_normal((100, 1)), tf.float32)
b1 = tf.Variable(np.zeros([100]).astype("float32"), tf.float32)
b2 = tf.Variable(np.zeros([1]).astype("float32"), tf.float32)

h1 = tf.nn.relu(tf.matmul(x, w1) + b1)
y_model = tf.matmul(h1, w2) + b2

L = tf.reduce_mean(tf.square(y_model - y))

# We want to decrease the learning rate after having seen all the data 5 times
NUM_EPOCHS_PER_DECAY = 5
LEARNING_RATE_DECAY_FACTOR = 0.1
# Since the mechanism of the decay depends on the number of iterations and not epochs,
# we have to connect the number of epochs to the number of iterations.
# With batch_size=1 we have to iterate exactly 1000 times to do one epoch,
# so 5*1000=5000 iterations pass before decaying. If batch_size were 1000,
# 1 iteration = 1 epoch and we would decrease the rate after 5 iterations.
num_batches_per_epoch = int(xdata.shape[0] / float(BATCH_SIZE))
decay_steps = int(num_batches_per_epoch * NUM_EPOCHS_PER_DECAY)
decayed_learning_rate = tf.train.exponential_decay(starter_learning_rate,
                                                   global_step,
                                                   decay_steps,
                                                   LEARNING_RATE_DECAY_FACTOR,
                                                   staircase=True)

# So now we have an object that depends on global_step and that will be divided by 10
# every decay_steps iterations, i.e. when global_step = N*decay_steps with N a non-zero integer.
# We now create a train_step to which we pass the decayed learning rate. Each time it is run,
# global_step will be incremented by 1; we are going to check that this is the case.
# BE CAREFUL: WE HAVE TO GIVE IT GLOBAL_STEP AS AN ARGUMENT.
train_step = tf.train.GradientDescentOptimizer(decayed_learning_rate).minimize(L, global_step=global_step)

sess = tf.Session()
sess.run(tf.initialize_all_variables())
GLOBAL_s = []
lr_val = []
COSTS = []
for i in range(16000):
    # We will do 16000 iterations, so as there is a decay every 5000 iterations
    # we will see 3 decays (at 5000, 10000 and 15000).
    start_data = (i * BATCH_SIZE) % 1000
    COSTS.append([sess.run(L, feed_dict={x: xdata, y: ydata})])
    GLOBAL_s.append([sess.run(global_step)])
    lr_val.append([sess.run(decayed_learning_rate)])
    # I see the train_step as implicitly executing sess.run(tf.add(global_step, 1))
    sess.run(train_step, feed_dict={x: xdata[start_data:start_data + BATCH_SIZE],
                                    y: ydata[start_data:start_data + BATCH_SIZE]})

plt.figure()
plt.subplot(211)
plt.plot(GLOBAL_s, lr_val, "-b")
plt.title("Evolution of learning rate")
plt.subplot(212)
plt.plot(GLOBAL_s, COSTS, ".g")
plt.title("Evolution of cost")
# Notice two things: first, global_step is actually being incremented,
# and the learning rate is actually being decayed.
(Obviously you can write MomentumOptimizer() instead of GradientDescentOptimizer()...)
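For instance, reusing decayed_learning_rate and global_step from the example above (the momentum value 0.9 here is just an illustrative choice, not something from your code):
train_step = tf.train.MomentumOptimizer(decayed_learning_rate,
                                        momentum=0.9).minimize(L, global_step=global_step)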
Here are the two plots I get:
To sum up: every time you call train_step, TensorFlow runs tf.add(global_step, 1).
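You can verify that behaviour yourself with a quick check in the session from the example above; global_step should increase by exactly one per call:
before = sess.run(global_step)
sess.run(train_step, feed_dict={x: xdata[:1], y: ydata[:1]})
after = sess.run(global_step)
print(before, after)   # expected: after == before + 1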
Regarding tensorflow - change of learning rate with the momentum optimizer, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40443402/