Python CNTK: speed comparison of 1-bit SGD vs. normal SGD on 4 GPUs

Tags: python neural-network gpu deep-learning cntk

I installed CNTK version 2.0.beta7 on an Azure NC24 GPU VM running Ubuntu (Python 3.4). The machine has 4 NVIDIA K80 GPUs. Build info:

            Build type: release
            Build target: GPU
            With 1bit-SGD: yes
            With ASGD: yes
            Math lib: mkl
            CUDA_PATH: /usr/local/cuda-8.0
            CUB_PATH: /usr/local/cub-1.4.1
            CUDNN_PATH: /usr/local
            Build Branch: HEAD
            Build SHA1: 8e8b5ff92eff4647be5d41a5a515956907567126
            Built by Source/CNTK/buildinfo.h$$0 on bbdadbf3455d
            Build Path: /home/philly/jenkins/workspace/CNTK-Build-Linux

I ran the CIFAR example in distributed mode:

mpiexec -n 4 python TrainResNet_CIFAR10_Distributed.py -n resnet20 -q 32

Finished Epoch [1]: [Training] loss = 1.675002 * 50176, metric = 62.5% * 50176 112.019s (447.9 samples per second)
Finished Epoch [1]: [Training] loss = 1.675002 * 50176, metric = 62.5% * 50176 112.019s (447.9 samples per second)
Finished Epoch [1]: [Training] loss = 1.675002 * 50176, metric = 62.5% * 50176 112.018s (447.9 samples per second)
Finished Epoch [1]: [Training] loss = 1.675002 * 50176, metric = 62.5% * 50176 112.019s (447.9 samples per second)
Finished Epoch [2]: [Training] loss = 1.247423 * 50176, metric = 45.4% * 50176 8.210s (6111.3 samples per second)
Finished Epoch [2]: [Training] loss = 1.247423 * 50176, metric = 45.4% * 50176 8.210s (6111.4 samples per second)
Finished Epoch [2]: [Training] loss = 1.247423 * 50176, metric = 45.4% * 50176 8.210s (6111.8 samples per second)
Finished Epoch [2]: [Training] loss = 1.247423 * 50176, metric = 45.4% * 50176 8.210s (6111.6 samples per second)
...
...
Finished Epoch [160]: [Training] loss = 0.037745 * 49664, metric = 1.2% * 49664 7.883s (6300.4 samples per second)
Finished Epoch [160]: [Training] loss = 0.037745 * 49664, metric = 1.2% * 49664 7.883s (6299.7 samples per second)
Finished Epoch [160]: [Training] loss = 0.037745 * 49664, metric = 1.2% * 49664 7.884s (6299.7 samples per second)
Finished Epoch [160]: [Training] loss = 0.037745 * 49664, metric = 1.2% * 49664 7.884s (6299.2 samples per second)

However, when I run it with 1-bit SGD, I get:

mpiexec -n 4 python TrainResNet_CIFAR10_Distributed.py -n resnet20 -q 1 -a 50000

...
Finished Epoch [160]: [Training] loss = 0.059290 * 49664, metric = 2.1% * 49664 10.055s (4939.1 samples per second)
Finished Epoch [160]: [Training] loss = 0.059290 * 49664, metric = 2.1% * 49664 10.056s (4938.9 samples per second)
Finished Epoch [160]: [Training] loss = 0.059290 * 49664, metric = 2.1% * 49664 10.056s (4938.9 samples per second)
Finished Epoch [160]: [Training] loss = 0.059290 * 49664, metric = 2.1% * 49664 10.056s (4938.9 samples per second)

As explained here, 1-bit SGD is supposed to be faster than its full-precision counterpart. Thanks for any help.

Best Answer

1-bit SGD is an effective strategy when the communication time between GPUs is large compared to the computation time for a minibatch.

Your experiment above has two "problems": the model you are training has few parameters (so there is not much computation per minibatch), and the 4 GPUs are on the same machine (so communication is not nearly as expensive as it would be if it had to go over the network). Furthermore, within a single machine CNTK uses NVIDIA NCCL, which is much better optimized than the generic MPI implementation that 1-bit SGD uses. Update: at the time this comment was made, NCCL was not used by default.
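The trade-off above can be sketched with a toy timing model. Everything numeric here is an illustrative assumption (link bandwidths, exchange count, and the quantization overhead are made up, not measured on this setup); the point is only that quantizing gradients to 1 bit shrinks the communication payload ~32x while adding some fixed overhead, so it pays off only when communication dominates:

```python
# Hypothetical timing model for when 1-bit SGD pays off.
# All numbers are illustrative assumptions, not measurements.

def epoch_time(compute_s, comm_bytes, bandwidth_bps, bits=32, overhead_s=0.0):
    """Per-epoch time: computation plus gradient exchange.

    Quantizing 32-bit gradients down to `bits` bits shrinks the
    payload proportionally but adds a fixed quantization overhead.
    """
    comm_s = comm_bytes * (bits / 32) / bandwidth_bps
    return compute_s + comm_s + overhead_s

# ResNet-20 has roughly 0.27M parameters -> ~1.1 MB of fp32
# gradients per exchange; assume ~400 exchanges per epoch.
grad_bytes = 0.27e6 * 4 * 400

# Case 1: fast intra-node links (assumed 5 GB/s). Communication is
# tiny, so 1-bit's quantization overhead makes the epoch slower.
full_fast = epoch_time(7.9, grad_bytes, 5e9)
onebit_fast = epoch_time(7.9, grad_bytes, 5e9, bits=1, overhead_s=2.0)
assert onebit_fast > full_fast

# Case 2: slow cross-machine links (assumed 100 MB/s). Communication
# dominates, and the 32x payload reduction wins despite the overhead.
full_slow = epoch_time(7.9, grad_bytes, 100e6)
onebit_slow = epoch_time(7.9, grad_bytes, 100e6, bits=1, overhead_s=2.0)
assert onebit_slow < full_slow
```

This matches what you observed: a small model on 4 GPUs inside one machine sits firmly in case 1, so quantization only adds overhead.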

A similar question on Stack Overflow: https://stackoverflow.com/questions/41441841/
