hadoop - Cannot start Nutch crawl

Tags: hadoop elasticsearch web-crawler hbase nutch

I am trying to deploy Nutch 2.3 + ElasticSearch 1.4 + HBase 0.94 on Ubuntu 14.04, following a tutorial. When I try to start the crawl by injecting the URLs:

$NUTCH_ROOT/runtime/local/bin/nutch inject urls

I get:
InjectorJob: starting at 2017-10-12 19:27:48
InjectorJob: Injecting urlDir: urls

The process then hangs at this point for hours.

How can I find out what is going on?

Configuration files:

nutch-site.xml
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>mycrawlername</value> <!-- this can be changed to something more sane if you like -->
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>mycrawlername</value> <!-- this is the robot name we're looking for in robots.txt files -->
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <!-- do **NOT** enable the parse-html plugin, if you want proper HTML parsing. Use something like parse-tika! -->
    <value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value> <!-- do not leave the seeded domains (optional) -->
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value> <!-- where is ElasticSearch listening -->
  </property>
</configuration>

hbase-site.xml
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>/home/kike/RIWS/hbase-0.94.14/</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>false</value>
    </property>
</configuration>

Log files:

HBase master log
2017-10-12 19:27:49,593 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:47778
2017-10-12 19:27:49,596 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /127.0.0.1:47778
2017-10-12 19:27:49,609 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x15f11684f3f0017 with negotiated timeout 40000 for client /127.0.0.1:47778
2017-10-12 19:31:11,092 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=1.99 MB, free=239.7 MB, max=241.69 MB, blocks=2, accesses=18, hits=16, hitRatio=88,88%, , cachingAccesses=18, cachingHits=16, cachingHitsRatio=88,88%, , evictions=0, evicted=0, evictedPerRun=NaN
2017-10-12 19:31:24,623 DEBUG org.apache.hadoop.hbase.client.MetaScanner: Scanning .META. starting at row= for max=2147483647 rows using org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@1646b7c
2017-10-12 19:31:24,630 DEBUG org.apache.hadoop.hbase.master.CatalogJanitor: Scanned 0 catalog row(s) and gc'd 0 unreferenced parent region(s)
2017-10-12 19:32:13,832 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x15f11684f3f0017
2017-10-12 19:32:13,849 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:47778 which had sessionid 0x15f11684f3f0017
2017-10-12 19:32:14,852 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:47817
2017-10-12 19:32:14,853 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /127.0.0.1:47817
2017-10-12 19:32:14,880 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x15f11684f3f0018 with negotiated timeout 40000 for client /127.0.0.1:47817

Hadoop log
2017-10-12 19:27:48,871 INFO  crawl.InjectorJob - InjectorJob: starting at 2017-10-12 19:27:48
2017-10-12 19:27:48,871 INFO  crawl.InjectorJob - InjectorJob: Injecting urlDir: urls

Edit:

After a while, the Hadoop log shows:
2017-10-12 20:34:59,333 ERROR crawl.InjectorJob - InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
    at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:133)
    at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
    ... 7 more
Caused by: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
    at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:139)
    at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:115)
    ... 9 more

But if I run jps, I can see that HMaster is running:
31672 Jps
20553 HMaster
19739 Elasticsearch

Best answer

Your error log shows a MasterNotRunningException from HBase:

org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)

We need to set up HBase.
Open ~/Desktop/Nutch/hbase/conf/hbase-site.xml and add the following two properties. They tell HBase its rootdir and specify the data directory for ZooKeeper.
open ~/Desktop/Nutch/hbase/conf/hbase-site.xml

<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>file:///Users/sntiwari/Desktop/Nutch/hbase</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/Users/sntiwari/Desktop/Nutch/zookeeper</value>
    </property>
</configuration>
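
A missing or mistyped property in hbase-site.xml is easy to overlook, so a quick check along the following lines may help. This is only a minimal sketch in Python, not part of HBase or Nutch: the function names read_hadoop_properties and check_hbase_site are hypothetical, and it looks only at the two properties mentioned above. A rootdir without an explicit scheme is not necessarily an error; the check simply points it out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def read_hadoop_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def check_hbase_site(xml_text):
    """Return a list of notes about the two properties (empty if both look fine)."""
    props = read_hadoop_properties(xml_text)
    notes = []
    rootdir = props.get("hbase.rootdir")
    if not rootdir:
        notes.append("hbase.rootdir is not set")
    elif not urlparse(rootdir).scheme:
        # Only a hint: a bare path may still work, but file:// is explicit.
        notes.append("hbase.rootdir has no explicit scheme such as file://")
    if not props.get("hbase.zookeeper.property.dataDir"):
        notes.append("hbase.zookeeper.property.dataDir is not set")
    return notes
```

Passing the text of hbase-site.xml to check_hbase_site returns an empty list when both properties are present and the rootdir carries a scheme.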

Next, we need to tell Gora to use HBase as its default data store.
open ~/Desktop/Nutch/nutch/conf/gora.properties
# open ~/Desktop/Nutch/nutch/runtime/local/conf/gora.properties

# Add this line under `HBaseStore properties` (to keep things organised)
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
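
Because the setting has to be present in whichever gora.properties copy Nutch actually reads (conf/ or runtime/local/conf/), it can be worth checking each file. The following is a rough sketch under stated assumptions: read_properties and uses_hbase_store are hypothetical helper names, and the parser is deliberately simplified (key=value lines only, no line continuations or ":" separators), so it is not a full Java properties parser.

```python
EXPECTED_STORE = "org.apache.gora.hbase.store.HBaseStore"


def read_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def uses_hbase_store(text):
    """True if gora.datastore.default is set (uncommented) to the HBase store."""
    return read_properties(text).get("gora.datastore.default") == EXPECTED_STORE
```

A commented-out line such as "#gora.datastore.default=..." is treated as absent, which is the case this check is meant to catch.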

We need to add or uncomment the gora-hbase dependency in our ivy.xml (probably around line 118).
open ~/Desktop/Nutch/nutch/ivy/ivy.xml

# Find and uncomment this line (approx. line 118)
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
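
To confirm that the dependency is really active rather than still commented out, something like the sketch below could be used. It relies on the fact that Python's xml.etree.ElementTree drops XML comments by default, so a commented-out dependency element is not counted. The function name has_active_dependency is hypothetical, and the sketch assumes ivy.xml uses plain, un-namespaced dependency elements.

```python
import xml.etree.ElementTree as ET


def has_active_dependency(ivy_xml_text, name):
    """True if an uncommented <dependency name="..."> element exists."""
    # ElementTree's default parser discards comments, so a dependency
    # wrapped in <!-- ... --> never shows up in the parsed tree.
    root = ET.fromstring(ivy_xml_text)
    return any(dep.get("name") == name for dep in root.iter("dependency"))
```

Calling has_active_dependency with the text of ivy.xml and "gora-hbase" should return True once the line has been uncommented.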

Test your HBase
# Start it up!
~/Desktop/Nutch/hbase/bin/start-hbase.sh

# Stop it (Can take a while, be patient)
~/Desktop/Nutch/hbase/bin/stop-hbase.sh

# Access the shell
~/Desktop/Nutch/hbase/bin/hbase shell

# list               = list all tables
# disable 'webpage'  = disable the table (before dropping)
# drop 'webpage'     = drop the table (webpage is created & used by nutch)
# exit               = exit from hbase

# For the next part, we need to start hbase
~/Desktop/Nutch/hbase/bin/start-hbase.sh
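
Note that jps only shows that an HMaster process exists, not that it accepts connections. Once HBase has been started, a small reachability check like the sketch below can narrow things down. The port numbers are assumptions, namely the usual defaults (2181 for the ZooKeeper client port, 60000 for the HBase 0.94 master); adjust them if your configuration differs. The names port_is_open and report are hypothetical helpers, not part of any HBase tooling.

```python
import socket

# Assumed default ports; change these if hbase-site.xml overrides them.
ASSUMED_PORTS = {"ZooKeeper": 2181, "HBase master": 60000}


def port_is_open(host, port, timeout=2.0):
    """True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def report(host="localhost", ports=ASSUMED_PORTS):
    """Return {label: reachable?} for each assumed service port."""
    return {label: port_is_open(host, port) for label, port in ports.items()}
```

If report() shows the ZooKeeper port as reachable but the master port as closed, the client can reach ZooKeeper but not the master, which would be consistent with a MasterNotRunningException even though jps lists HMaster.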

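Also go through a few checks:
  • First, check version compatibility.
  • Make sure the JAVA_HOME and NUTCH_JAVA_HOME environment variables are set.
  • Compile Nutch (Apache Nutch has to be built with ant: ant runtime).

Regarding "hadoop - Cannot start Nutch crawl", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46715792/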
