hadoop - Streaming web page data into HDFS with Flume

Tags: hadoop flume cloudera-cdh

I have a 3-node cluster running the latest Cloudera parcels, version 5.9. The OS on all three machines is CentOS 6.7. This is my first time using Flume.

My goal is to stream web page data into HDFS. The page belongs to a third-party site (a news site in my case), so I don't know which port to use for the connection.

Curl and telnet both work against port 80, so that is the port I used. But it fails with an error.

My flume.conf is:

tier1.sources  = http-source
tier1.channels = mem-channel-1
tier1.sinks    = hdfs-sink
tier1.sources.http-source.type     = http
tier1.sources.http-source.handler = org.apache.flume.source.http.JSONHandler
tier1.sources.http-source.bind     = 132.247.1.32
tier1.sources.http-source.port     = 80
tier1.sources.http-source.channels = mem-channel-1
tier1.channels.mem-channel-1.type   = memory
tier1.sinks.hdfs-sink.type         = hdfs
tier1.sinks.hdfs-sink.channel      = mem-channel-1
tier1.sinks.hdfs-sink.hdfs.path    = /flume/events/%y-%m-%d/%H%M/%S
# Other properties are specific to each type of
# source, channel, or sink. In this case, we
# specify the capacity of the memory channel.
tier1.channels.mem-channel-1.capacity = 100
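A side note on the `hdfs.path` value above: the escape sequences `%y-%m-%d/%H%M/%S` follow strftime-style directives, and the HDFS sink resolves them from the event's `timestamp` header (or from local time when `hdfs.useLocalTimeStamp = true`). A minimal sketch of how that pattern expands, assuming a hypothetical helper built on Python's `strftime`:

```python
from datetime import datetime

def resolve_hdfs_path(pattern: str, ts: datetime) -> str:
    """Expand strftime-style time escapes (%y, %m, %d, %H, %M, %S),
    illustrating how the HDFS sink turns the path pattern into a
    concrete directory for a given event timestamp."""
    return ts.strftime(pattern)

# The timestamp from the error log in this question:
ts = datetime(2016, 12, 19, 16, 45, 0)
print(resolve_hdfs_path("/flume/events/%y-%m-%d/%H%M/%S", ts))
# /flume/events/16-12-19/1645/00
```

If events arrive without a `timestamp` header and `hdfs.useLocalTimeStamp` is not set, the sink cannot resolve these escapes and will raise an error.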

The error:
2016-12-19 16:45:00,353 WARN org.mortbay.log: failed SelectChannelConnector@132.247.1.32:80: java.net.BindException: Cannot assign requested address
2016-12-19 16:45:00,353 WARN org.mortbay.log: failed Server@36772002: java.net.BindException: Cannot assign requested address
2016-12-19 16:45:00,353 ERROR org.apache.flume.source.http.HTTPSource: Error while starting HTTPSource. Exception follows.
java.net.BindException: Cannot assign requested address
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
at org.mortbay.jetty.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:315)
2016-12-19 16:45:00,364 ERROR org.apache.flume.lifecycle.LifecycleSupervisor: Unable to start EventDrivenSourceRunner: { source:org.apache.flume.source.http.HTTPSource{name:http-source,state:IDLE} } - Exception follows.
java.lang.RuntimeException: java.net.BindException: Cannot assign requested address
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.Server.doStart(Server.java:235)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:207)
at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSourceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
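The `BindException` means the agent tried to open a listening socket on 132.247.1.32, but that IP is not assigned to any network interface on the machine running Flume: a server can only bind to its own local addresses, not to a remote site's address. A short sketch reproducing the same failure with a plain socket (203.0.113.1 is a documentation-range address standing in for the remote IP, and port 0 is used to sidestep the root privileges port 80 would need):

```python
import socket

def can_bind(addr: str, port: int = 0) -> bool:
    """Try to bind a TCP server socket to addr, mirroring what Jetty
    does when Flume's HTTP source starts up."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((addr, port))
        return True
    except OSError:  # "Cannot assign requested address" for non-local IPs
        return False
    finally:
        s.close()

print(can_bind("127.0.0.1"))    # True: loopback is always a local address
print(can_bind("203.0.113.1"))  # False: not an address of this machine
```

This is why binding to a local hostname or `0.0.0.0` works while binding to the news site's IP cannot.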

Best answer

Try changing the source configuration as follows:
httpagent.sources.http-source.port = 80
httpagent.sources.http-source.bind = localhost
httpagent.sources.http-source.url = 132.247.1.32

Note: if 132.247.1.32 does not work, try providing the hostname instead.
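It is worth noting that Flume's HTTP source is a server, not a client: it accepts events pushed to it over HTTP rather than fetching a remote page itself. With the default `JSONHandler`, the request body must be a JSON array of events, each with `"headers"` and `"body"` fields. A hedged sketch of a client posting one such event to the agent (the endpoint `http://localhost:80/` is assumed from the config above; the actual request line is commented out since it needs a running agent):

```python
import json
import urllib.request

# JSONHandler expects a JSON array of events with "headers" and "body".
events = [
    {
        # A timestamp header lets the HDFS sink resolve %y-%m-%d etc.
        "headers": {"timestamp": "1482165900000"},
        "body": "<html>...page content fetched separately...</html>",
    }
]
payload = json.dumps(events).encode("utf-8")

# Hypothetical endpoint: the bind host and port from flume.conf.
req = urllib.request.Request(
    "http://localhost:80/",
    data=payload,
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment against a running agent
```

So to stream a third-party news page, something else (a script, cron job, etc.) has to fetch the page and POST it to the source in this format.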

Regarding "hadoop - Streaming web page data into HDFS with Flume", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41233373/
