python - What is the difference between an MLP implemented from scratch and an MLP implemented in PyTorch?

Tags: python numpy neural-network deep-learning pytorch

A follow-up question to How to update the learning rate in a two layered multi-layered perceptron?

Given the XOR problem:

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

and a simple
  • two-layer multi-layer perceptron (MLP) with
  • sigmoid activations between the two layers and
  • mean squared error (MSE) as the loss function / optimization criterion,

if we train the model from scratch like this:
    from itertools import chain
    import matplotlib.pyplot as plt
    import numpy as np
    np.random.seed(0)
    
    def sigmoid(x): # Squashes each value into the range (0, 1).
        return 1 / (1 + np.exp(-x))
    
    def sigmoid_derivative(sx):
        # See https://math.stackexchange.com/a/1225116
        return sx * (1 - sx)
    
    # Cost functions.
    def mse(predicted, truth):
        return 0.5 * np.mean(np.square(predicted - truth))
    
    def mse_derivative(predicted, truth):
        return predicted - truth
    
    X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
    Y = xor_output = np.array([[0,1,1,0]]).T
    
    # Define the shape of the weight vector.
    num_data, input_dim = X.shape
    # Let's set the dimension of the intermediate (hidden) layer.
    hidden_dim = 5
    # Initialize weights between the input layers and the hidden layer.
    W1 = np.random.random((input_dim, hidden_dim))
    
    # Define the shape of the output vector. 
    output_dim = len(Y.T)
    # Initialize weights between the hidden layers and the output layer.
    W2 = np.random.random((hidden_dim, output_dim))
    
    # Training hyperparameters.
    num_epochs = 5000
    learning_rate = 0.3
    
    losses = []
    
    for epoch_n in range(num_epochs):
        layer0 = X
        # Forward propagation.
    
        # Inside the perceptron, Step 2. 
        layer1 = sigmoid(np.dot(layer0, W1))
        layer2 = sigmoid(np.dot(layer1, W2))
    
        # Back propagation (Y -> layer2)
    
        # How much did we miss in the predictions?
        cost_error = mse(layer2, Y)
        cost_delta = mse_derivative(layer2, Y)
    
        #print(layer2_error)
        # In what direction is the target value?
        # Were we really close? If so, don't change too much.
        layer2_error = np.dot(cost_delta, cost_error)
        layer2_delta = cost_delta *  sigmoid_derivative(layer2)
    
        # Back propagation (layer2 -> layer1)
        # How much did each layer1 value contribute to the layer2 error (according to the weights)?
        layer1_error = np.dot(layer2_delta, W2.T)
        layer1_delta = layer1_error * sigmoid_derivative(layer1)
    
        # update weights
        W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
        W1 += - learning_rate * np.dot(layer0.T, layer1_delta)
        #print(np.dot(layer0.T, layer1_delta))
        #print(epoch_n, list((layer2)))
    
        # Log the loss value as we proceed through the epochs.
        losses.append(layer2_error.mean())
        #print(cost_delta)
    
    
    # Visualize the losses
    plt.plot(losses)
    plt.show()
    

    we see the loss drop sharply from epoch 0 and then quickly saturate:

    [image: training loss curve of the from-scratch model]

    But if we train a similar model with PyTorch, the training curve decreases the loss gradually before it saturates:

    [image: training loss curve of the PyTorch model]

    What is the difference between the from-scratch MLP and the PyTorch code?

    Why do they reach convergence at different points?

    Other than the weight initialization, np.random.rand() in the from-scratch code vs. the default torch initialization, I can't seem to see any difference in the models.

    The PyTorch code:
    from tqdm import tqdm
    import numpy as np
    
    import torch
    from torch import nn
    from torch import tensor
    from torch import optim
    
    import matplotlib.pyplot as plt
    
    torch.manual_seed(0)
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # XOR gate inputs and outputs.
    X = xor_input = tensor([[0,0], [0,1], [1,0], [1,1]]).float().to(device)
    Y = xor_output = tensor([[0],[1],[1],[0]]).float().to(device)
    
    
    # Use tensor.shape to get the shape of the matrix/tensor.
    num_data, input_dim = X.shape
    print('Inputs Dim:', input_dim) # i.e. n=2 
    
    num_data, output_dim = Y.shape
    print('Output Dim:', output_dim) 
    print('No. of Data:', num_data) # i.e. n=4
    
    # Step 1: Initialization. 
    
    # Initialize the model.
    # Set the hidden dimension size.
    hidden_dim = 5
    # Use Sequential to define a simple feed-forward network.
    model = nn.Sequential(
                # Use nn.Linear to get our simple perceptron.
                nn.Linear(input_dim, hidden_dim),
                # Use nn.Sigmoid to get our sigmoid non-linearity.
                nn.Sigmoid(),
                # Second layer neurons.
                nn.Linear(hidden_dim, output_dim),
                nn.Sigmoid()
            )
    model
    
    # Initialize the optimizer
    learning_rate = 0.3
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    
    # Initialize the loss function.
    criterion = nn.MSELoss()
    
    # Initialize the stopping criteria
    # For simplicity, just stop training after certain no. of epochs.
    num_epochs = 5000 
    
    losses = [] # Keeps track of the losses.
    
    # Step 2-4 of training routine.
    
    for _e in tqdm(range(num_epochs)):
        # Reset the gradient after every epoch. 
        optimizer.zero_grad() 
    # Step 2: Forward Propagation
        predictions = model(X)
    
        # Step 3: Back Propagation 
        # Calculate the cost between the predictions and the truth.
        loss = criterion(predictions, Y)
        # Remember to back propagate the loss you've computed above.
        loss.backward()
    
        # Step 4: Optimizer take a step and update the weights.
        optimizer.step()
    
        # Log the loss value as we proceed through the epochs.
        losses.append(loss.data.item())
    
    
    plt.plot(losses)
    

    Best Answer

    Differences between the hand-rolled code and the PyTorch code

    It turns out there are quite a few differences between your hand-rolled code and the PyTorch code. Here's what I found, listed roughly in order of how much impact each one has on the output, from most to least:

  • Your code and the PyTorch code use two different functions to report the loss.
  • Your code and the PyTorch code set up the initial weights very differently. You mention this in your question, but it turns out to have a considerable effect on the results.
  • By default, the torch.nn.Linear layers add an extra bunch of "bias" weights to the model. So the first layer of the PyTorch model effectively has 3x5 weights and the second layer has 6x1 weights, while the layers in the hand-rolled code have 2x5 and 5x1 weights, respectively (see the small parameter-shape check right after this list). The bias seems to help the model learn and adapt faster; if you turn it off, it takes the PyTorch model roughly twice as many training epochs to reach near-0 loss.
  • Weirdly, it seems like the PyTorch model uses a learning rate that is effectively half of what you specify. Alternatively, it may be that there's a stray factor of 2 that found its way into your hand-rolled math/code somewhere.
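
    As a quick sanity check of the bias point above (a small sketch of my own, not from the original answer), you can inspect the parameter shapes of an nn.Linear layer with and without the default bias:

    from torch import nn

    layer_with_bias = nn.Linear(2, 5)            # default: bias=True
    print([tuple(p.shape) for p in layer_with_bias.parameters()])   # [(5, 2), (5,)]

    layer_no_bias = nn.Linear(2, 5, bias=False)  # bias switched off
    print([tuple(p.shape) for p in layer_no_bias.parameters()])     # [(5, 2)]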

    How to get identical results from the hand-rolled code and the PyTorch code

    Carefully account for the 4 factors listed above and it's possible to achieve complete parity between the hand-rolled and the PyTorch code. With the right tweaks and settings, the two snippets produce identical results:

    [image: hand_rolled_losses and torch_losses curves lying on top of each other]

    The most important tweak: make the loss reporting functions match

    The critical difference is that you end up measuring the loss with two completely different functions in the two snippets:
  • In the hand-rolled code, the loss is measured as layer2_error.mean(). If you unpack that variable, you can see that layer2_error.mean() is a somewhat screwy and meaningless value (a small numeric check follows after the output plot below):
    layer2_error.mean()
    == np.dot(cost_delta, cost_error).mean()
    == np.dot(mse_derivative(layer2, Y), mse(layer2, Y)).mean()
    == np.sum(.5 * (layer2 - Y) * ((layer2 - Y)**2).mean()).mean()
    
  • In the PyTorch code, on the other hand, the loss is measured in terms of the traditional definition of the MSE, i.e. the equivalent of np.mean((layer2 - Y)**2). You can prove this to yourself by modifying your PyTorch loop like so:
    def mse(x, y):
        return np.mean((x - y)**2)
    
    torch_losses = [] # Keeps track of the losses.
    torch_losses_manual = [] # for comparison
    
    # Step 2-4 of training routine.
    
    for _e in tqdm(range(num_epochs)):
        # Reset the gradient after every epoch. 
        optimizer.zero_grad() 
    # Step 2: Forward Propagation
        predictions = model(X)
    
        # Step 3: Back Propagation 
        # Calculate the cost between the predictions and the truth.
        loss = criterion(predictions, Y)
        # Remember to back propagate the loss you've computed above.
        loss.backward()
    
        # Step 4: Optimizer take a step and update the weights.
        optimizer.step()
    
        # Log the loss value as we proceed through the epochs.
        torch_losses.append(loss.data.item())
        torch_losses_manual.append(mse(predictions.detach().numpy(), Y.detach().numpy()))
    
    plt.plot(torch_losses, lw=5, label='torch_losses')
    plt.plot(torch_losses_manual, lw=2, label='torch_losses_manual')
    plt.legend()
    

    Output:

    [image: torch_losses and torch_losses_manual curves coinciding]
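
    To see concretely how different the two quantities are, here is a small standalone check (my own sketch, with made-up prediction values) that plugs numbers into the question's mse and mse_derivative definitions:

    import numpy as np

    def mse(predicted, truth):
        return 0.5 * np.mean(np.square(predicted - truth))

    def mse_derivative(predicted, truth):
        return predicted - truth

    p = np.array([[0.2], [0.9], [0.8], [0.1]])   # hypothetical predictions
    y = np.array([[0], [1], [1], [0]])           # XOR targets

    logged = np.dot(mse_derivative(p, y), mse(p, y)).mean()   # what the hand-rolled loop logs
    true_mse = np.mean((p - y) ** 2)                          # what nn.MSELoss reports
    print(logged, true_mse)   # ~0.0 vs. 0.025: the signed errors cancel here, so the logged
                              # "loss" reads almost zero even though the predictions are off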

    Also important: use the same initial weights

    PyTorch uses its own special routine for setting the initial weights, and it produces results very different from np.random.rand. I haven't been able to replicate it exactly yet, but as the next best thing we can just hijack PyTorch. Here's a function that grabs the same initial weights that a PyTorch model would use:
    import torch
    from torch import nn
    torch.manual_seed(0)
    
    def torch_weights(nodes_in, nodes_hidden, nodes_out, bias=None):
        model = nn.Sequential(
            nn.Linear(nodes_in, nodes_hidden, bias=bias),
            nn.Sigmoid(),
            nn.Linear(nodes_hidden, nodes_out, bias=bias),
            nn.Sigmoid()
        )
    
        return [t.detach().numpy() for t in model.parameters()]
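
    A hypothetical usage sketch (my own, not part of the original answer): nn.Linear stores each weight matrix as (out_features, in_features), so the values get transposed to match the (inputs, outputs) orientation the hand-rolled code expects:

    # Assuming the torch_weights() helper above and the 2-5-1 XOR architecture.
    W1, W2 = [w.T for w in torch_weights(2, 5, 1)]
    print(W1.shape, W2.shape)   # (2, 5) (5, 1)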
    

    Finally: turn off all the bias weights in PyTorch and double the learning rate

    Eventually you'll probably want to implement bias weights in your own code. For now, we'll just turn the bias off in the PyTorch model and compare the results of the hand-rolled model to those of the bias-free PyTorch model.

    Also, in order to get the results to match you need to double the learning rate of the PyTorch model. This effectively scales the results along the x-axis (i.e. doubling the rate means it takes half as many epochs to reach any specific feature on the loss curve).
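
    One plausible source of that factor of 2 (my own reasoning, not spelled out in the original answer): nn.MSELoss averages over all N elements and differentiates the square, so its gradient with respect to the predictions is 2 * (pred - Y) / N, while the hand-rolled mse_derivative returns the unscaled (pred - Y). With N = 4 XOR samples that is exactly half the gradient magnitude, which doubling the learning rate compensates for. A quick autograd check:

    import torch
    from torch import nn

    pred = torch.tensor([[0.2], [0.9], [0.8], [0.1]], requires_grad=True)  # made-up predictions
    Y = torch.tensor([[0.], [1.], [1.], [0.]])

    nn.MSELoss()(pred, Y).backward()
    print(pred.grad)                    # equals 2 * (pred - Y) / 4 = 0.5 * (pred - Y)
    print(0.5 * (pred - Y).detach())    # matches the autograd gradient above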

    Putting it all together

    In order to reproduce the hand_rolled_losses data from the plot at the start of my post, all you have to do is take your hand-rolled code and replace the mse function with:
    def mse(predicted, truth):
        return np.mean(np.square(predicted - truth))
    

    the lines that initialize the weights with:
    W1,W2 = [w.T for w in torch_weights(input_dim, hidden_dim, output_dim)]
    

    and the line that tracks the losses with:
    losses.append(cost_error)
    

    and you should be good to go.

    In order to reproduce the torch_losses data from the plot, we also need to turn the bias weights off in the PyTorch model. To do that, just change the lines defining the PyTorch model like so:
    model = nn.Sequential(
        # Use nn.Linear to get our simple perceptron.
        nn.Linear(input_dim, hidden_dim, bias=None),
        # Use nn.Sigmoid to get our sigmoid non-linearity.
        nn.Sigmoid(),
        # Second layer neurons.
        nn.Linear(hidden_dim, output_dim, bias=None),
        nn.Sigmoid()
    )
    

    You also need to change the line defining the learning_rate:
    learning_rate = 0.3 * 2
    

    Complete code listings

    Hand-rolled code

    Here is the complete listing of my version of the hand-rolled neural network code, to help reproduce my results:
    from itertools import chain
    import matplotlib.pyplot as plt
    import numpy as np
    import scipy as sp
    import scipy.stats
    import torch
    from torch import nn
    
    np.random.seed(0)
    torch.manual_seed(0)
    
    def torch_weights(nodes_in, nodes_hidden, nodes_out, bias=None):
        model = nn.Sequential(
            nn.Linear(nodes_in, nodes_hidden, bias=bias),
            nn.Sigmoid(),
            nn.Linear(nodes_hidden, nodes_out, bias=bias),
            nn.Sigmoid()
        )
    
        return [t.detach().numpy() for t in model.parameters()]
    
    def sigmoid(x): # Squashes each value into the range (0, 1).
        return 1 / (1 + np.exp(-x))
    
    def sigmoid_derivative(sx):
        # See https://math.stackexchange.com/a/1225116
        return sx * (1 - sx)
    
    # Cost functions.
    def mse(predicted, truth):
        return np.mean(np.square(predicted - truth))
    
    def mse_derivative(predicted, truth):
        return predicted - truth
    
    X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
    Y = xor_output = np.array([[0,1,1,0]]).T
    
    # Define the shape of the weight vector.
    num_data, input_dim = X.shape
    # Let's set the dimension of the intermediate (hidden) layer.
    hidden_dim = 5
    # Define the shape of the output vector. 
    output_dim = len(Y.T)
    
    W1,W2 = [w.T for w in torch_weights(input_dim, hidden_dim, output_dim)]
    
    num_epochs = 5000
    learning_rate = 0.3
    losses = []
    
    for epoch_n in range(num_epochs):
        layer0 = X
        # Forward propagation.
    
        # Inside the perceptron, Step 2. 
        layer1 = sigmoid(np.dot(layer0, W1))
        layer2 = sigmoid(np.dot(layer1, W2))
    
        # Back propagation (Y -> layer2)
    
        # In what direction is the target value?
        # Were we really close? If so, don't change too much.
        cost_delta = mse_derivative(layer2, Y)
        layer2_delta = cost_delta *  sigmoid_derivative(layer2)
    
        # Back propagation (layer2 -> layer1)
        # How much did each layer1 value contribute to the layer2 error (according to the weights)?
        layer1_error = np.dot(layer2_delta, W2.T)
        layer1_delta = layer1_error * sigmoid_derivative(layer1)
    
        # update weights
        W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
        W1 += - learning_rate * np.dot(layer0.T, layer1_delta)
    
        # Log the loss value as we proceed through the epochs.
        losses.append(mse(layer2, Y))
    
    # Visualize the losses
    plt.plot(losses)
    plt.show()
    

    PyTorch code
    import matplotlib.pyplot as plt
    from tqdm import tqdm
    import numpy as np
    
    import torch
    from torch import nn
    from torch import tensor
    from torch import optim
    
    torch.manual_seed(0)
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    num_epochs = 5000
    learning_rate = 0.3 * 2
    
    # XOR gate inputs and outputs.
    X = tensor([[0,0], [0,1], [1,0], [1,1]]).float().to(device)
    Y = tensor([[0],[1],[1],[0]]).float().to(device)
    
    # Use tensor.shape to get the shape of the matrix/tensor.
    num_data, input_dim = X.shape
    num_data, output_dim = Y.shape
    
    # Step 1: Initialization. 
    
    # Initialize the model.
    # Set the hidden dimension size.
    hidden_dim = 5
    # Use Sequential to define a simple feed-forward network.
    model = nn.Sequential(
        # Use nn.Linear to get our simple perceptron.
        nn.Linear(input_dim, hidden_dim, bias=None),
        # Use nn.Sigmoid to get our sigmoid non-linearity.
        nn.Sigmoid(),
        # Second layer neurons.
        nn.Linear(hidden_dim, output_dim, bias=None),
        nn.Sigmoid()
    )
    
    # Initialize the optimizer
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    
    # Initialize the loss function.
    criterion = nn.MSELoss()
    
    def mse(x, y):
        return np.mean((x - y)**2)
    
    torch_losses = [] # Keeps track of the losses.
    torch_losses_manual = [] # for comparison
    
    # Step 2-4 of training routine.
    
    for _e in tqdm(range(num_epochs)):
        # Reset the gradient after every epoch. 
        optimizer.zero_grad() 
    # Step 2: Forward Propagation
        predictions = model(X)
    
        # Step 3: Back Propagation 
        # Calculate the cost between the predictions and the truth.
        loss = criterion(predictions, Y)
        # Remember to back propagate the loss you've computed above.
        loss.backward()
    
        # Step 4: Optimizer take a step and update the weights.
        optimizer.step()
    
        # Log the loss value as we proceed through the epochs.
        torch_losses.append(loss.data.item())
        torch_losses_manual.append(mse(predictions.detach().numpy(), Y.detach().numpy()))
    
    plt.plot(torch_losses, lw=5, c='C1', label='torch_losses')
    plt.plot(torch_losses_manual, lw=2, c='C2', label='torch_losses_manual')
    plt.legend()
    

    Notes

    Bias weights

    You can find some very instructive examples showing what bias weights are and how to implement them in this tutorial. They list a bunch of pure-Python implementations of neural networks very similar to your hand-rolled one, so it's likely you could adapt some of their code to make a bias implementation of your own.

    A function for producing initial guesses for the weights

    Here's a function I adapted from that same tutorial that can produce reasonable initial values for the weights. I think the algorithm PyTorch uses internally is somewhat different, but this produces similar results:
    import scipy as sp
    import scipy.stats
    
    def tnorm_weights(nodes_in, nodes_out, bias_node=0):
        # see https://www.python-course.eu/neural_network_mnist.php
        wshape = (nodes_out, nodes_in + bias_node)
        bound = 1 / np.sqrt(nodes_in)
        X = sp.stats.truncnorm(-bound, bound)
        return X.rvs(np.prod(wshape)).reshape(wshape) 
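
    A hypothetical usage sketch (mine; it assumes you want the weights in the same (inputs, outputs) orientation as the hand-rolled code): tnorm_weights returns a (nodes_out, nodes_in) matrix, so it needs the same transpose as the torch_weights hijack above:

    W1 = tnorm_weights(2, 5).T   # input -> hidden weights, shape (2, 5)
    W2 = tnorm_weights(5, 1).T   # hidden -> output weights, shape (5, 1)
    print(W1.shape, W2.shape)    # (2, 5) (5, 1)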
    

    The original question on Stack Overflow: https://stackoverflow.com/questions/54247143/
