java - 连接到远程任务管理器失败。这可能表明远程任务管理器已丢失

标签 java apache-flink

我创建了一个包含 1 个作业管理器和 2 个任务管理器的 Flink 独立集群。
提交批处理任务/作业时,任务管理器之一会抛出以下错误。 flink 仪表板显示两个任务管理器都处于 Activity 状态。示例字数统计程序有效。

java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'hostname/127.0.0.1:46537' has failed. This might indicate that the remote task manager has been lost.
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:132)
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:84)
    at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:59)
    at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:156)
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:480)
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:502)
    at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:86)
    at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:42)
    at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73)
    at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:94)
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:490)
    at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:355)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + 'hostname/127.0.0.1:46537' has failed. This might indicate that the remote task manager has been lost.
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:132)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:268)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:284)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    ... 1 more
Caused by: java.net.ConnectException: Connection refused: ekablr-ca-s010/127.0.0.1:46537
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)
    ... 6 more

最佳答案

要检查的事项:

  • 您的工作中是否有某些因素导致 TaskManager 速度过慢,以至于它无法在超时之前响应请求?观察运行该任务管理器的计算机上的 CPU 使用情况。
  • TaskManager 是否真的仍然处于 Activity 状态(即仍在发送心跳,您可以在仪表板的 TaskManager 部分中看到实时更新),或者 JobManager 是否只是没有放弃它足以将其标记为死亡?您的工作有可能杀死正在运行的 TaskManager,但我通常只在通过 JNI 或其他直接内存操作类型的东西运行的 native 代码中看到它。

关于java - 连接到远程任务管理器失败。这可能表明远程任务管理器已丢失,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/50178233/

相关文章:

java - 如何在 Eclipse 中解析带有 HTML 标签的注释?

docker - 在确定分区位置之前 60000 毫秒的 Kafka 客户端超时已过期

streaming - Flink 1.2 支持 python 流式编程吗?

configuration - Yarn可以动态分配资源给Flink吗?

java - jvisualvm 去哪儿了?

java - 这个java阻塞队列变体可能吗?

java - 如何在 Flink 中增加 SinkFunction 的 numRecordsOutPut 指标?

apache-flink - 将 DataSet<Row> 保存为 CSV

java - 从 javascript/AJAX 调用 MySQL 查询

java - Jackson 无法识别 JSON 负载中的 @class 类型标识符