linux - PBS communication error: Nodes can not communicate

Tags: linux queue batch-processing pbs torque

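I successfully installed the PBS server, started the services, and can see the nodes with the pbsnodes command. The queues show up correctly in the output of qstat -q. After I submit a test job, the following appears in my sched_log, server_log, and the MOM node's mom_log files: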

Scheduler log:

08/16/2017 14:18:48.476;64; pbs_sched.19885;Job;2.headnode;Job Run
08/16/2017 14:19:28.215;02; pbs_sched.19885;Req;headnode3;Can not open connection to mom
08/16/2017 14:19:28.215;02; pbs_sched.19885;Req;headnode4;Can not open connection to mom
08/16/2017 14:19:28.238;02; pbs_sched.19885;Req;headnode5;Can not open connection to mom
08/16/2017 14:19:28.239;02; pbs_sched.19885;Req;headnode6;Can not open connection to mom

Server log:

08/16/2017 14:40:37.829;01;PBS_Server.27737;Svr;PBS_Server;LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = -2] [addr = 192.168.89.233:15003]
08/16/2017 14:40:37.829;01;PBS_Server.27739;Svr;PBS_Server;LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = -2] [addr = 192.168.89.232:15003]
08/16/2017 14:40:37.829;01;PBS_Server.27793;Svr;PBS_Server;LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = -2] [addr = 192.168.89.235:15003]
08/16/2017 14:40:38.828;01;PBS_Server.27736;Svr;PBS_Server;LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = -2] [addr = 192.168.89.234:15003]

MOM log:

08/16/2017 18:50:36.215;01;   pbs_mom.10833;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 11123 MOM status update intervals
08/16/2017 18:51:22.308;01;   pbs_mom.10838;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
08/16/2017 18:51:22.308;01;   pbs_mom.10838;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 11124 MOM status update intervals
08/16/2017 18:52:06.402;01;   pbs_mom.10859;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status update successfully sent after 11124 MOM status update intervals
08/16/2017 18:53:21.555;02;   pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 18:58:26.182;02;   pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 19:03:31.815;02;   pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 19:08:31.407;02;   pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 19:13:37.039;02;   pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 19:18:41.670;02;   pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 19:23:46.455;02;   pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0

How can I solve this? Is it caused by some kind of authentication failure? If so, should I set up SSH key authentication for login?

Interestingly, I have another server running Torque, named headnode2, with IP .89.231, and it shows no errors. I did not follow any additional steps to configure that one.

Best answer

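You probably just need to configure your firewall. I would run

# iptables-save > iptables.bak && iptables -F

on the server and on one test node, then submit a job to that node and see whether it runs. The saved rules can be restored afterwards with iptables-restore < iptables.bak.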
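If you would rather check connectivity before touching the firewall rules, a small probe run from the server can show whether the MOM port is reachable at all. The following is only a sketch: the addresses and port 15003 are copied from the server_log excerpt in the question, while the names `can_connect`, `MOM_ADDRESSES`, and `MOM_PORT` are made up for this example.

```python
import socket

# Addresses and port copied from the server_log excerpt in the question.
# Adjust them to match your own nodes.
MOM_ADDRESSES = [
    "192.168.89.232",
    "192.168.89.233",
    "192.168.89.234",
    "192.168.89.235",
]
MOM_PORT = 15003


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (refused, timed out,
        # unreachable, ...) when the connection cannot be made.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for address in MOM_ADDRESSES:
        status = "open" if can_connect(address, MOM_PORT) else "unreachable"
        print(f"{address}:{MOM_PORT} {status}")
```

If every node is reported as unreachable while pbs_mom is running on it, a firewall between the server and the nodes is a plausible cause, and flushing the rules on one test node as described above is a reasonable next step.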

Regarding linux - PBS communication error: Nodes can not communicate, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/45709837/
