java.lang.NullPointerException (Nutch 2.2.1 with MySQL as the data store)

Tags: java mysql nutch

I am new to this area. I started with this tutorial: http://nlp.solutions.asia/?p=362#more-362. My first crawl, of nutch.apache.org, succeeded, but when I tried a different URL, this exception appeared in my hadoop.log.

java.lang.NullPointerException
    at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
    at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)

Here is my nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>http.agent.name</name>
<value>Maria</value>
</property>

<property> 
<name>http.robots.agents</name> 
<value>Maria</value>
<description>....
</description>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the “Accept-Language” request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>

</configuration>

Here is regex-urlfilter.txt:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
#+.

+^http://([a-z0-9]*\.)* nutch.apache.org/

#
-.

I would appreciate any suggestions for solving this problem.

Best answer

I have never used Nutch, but this seems to be a common error. An NPE thrown from <init> means the Utf8 instance failed while it was being constructed.
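To make that concrete, here is a minimal sketch of how a null argument produces an NPE inside a constructor, which a stack trace reports as `<init>`. It does not use Avro or Nutch: `Utf8Like` is a hypothetical stand-in for a UTF-8 string wrapper, and the map-based configuration and the key `example.batch.id` are assumptions made up for the illustration. The trace above does not show which value was null, so treat this as one plausible reading rather than a diagnosis.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class NullInConstructorDemo {

    /** Hypothetical stand-in for a UTF-8 string wrapper; NOT the real Avro Utf8 class. */
    static final class Utf8Like {
        private final byte[] bytes;

        Utf8Like(String s) {
            // If s is null, this dereference throws the NPE inside the
            // constructor, which a stack trace shows as Utf8Like.<init>.
            this.bytes = s.getBytes(StandardCharsets.UTF_8);
        }

        int length() {
            return bytes.length;
        }
    }

    /** Returns true if wrapping the value looked up under key throws an NPE. */
    static boolean failsWithNpe(Map<String, String> conf, String key) {
        try {
            new Utf8Like(conf.get(key)); // Map.get returns null for a missing key
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // Key was never set, so the constructor receives null.
        System.out.println(failsWithNpe(conf, "example.batch.id")); // prints true

        // Once the key is present, construction succeeds.
        conf.put("example.batch.id", "batch-1");
        System.out.println(failsWithNpe(conf, "example.batch.id")); // prints false
    }
}
```

If the same pattern applies here, some setting that the reducer expects was never populated by the way the job was launched. That is an inference from the trace and has not been checked against the Nutch source.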

The cause is that the old "crawl" command has been deprecated in Nutch 2 and replaced by the script located at "bin/crawl".

Just copy the file $NUTCH_HOME/src/bin/crawl into the deploy directory, $NUTCH_HOME/runtime/deploy/bin, and then run the crawl command. Take a look here:

http://wiki.apache.org/nutch/NutchTutorial#A3.1_Using_the_Crawl_Command

Hope this helps.

This question and answer, java.lang.NullPointerException (Nutch 2.2.1 with MySQL as the data store), comes from Stack Overflow: https://stackoverflow.com/questions/21198202/
