我是这个领域的新手。 我从本教程开始:http://nlp.solutions.asia/?p=362#more-362 。当我第一次爬取这个网址:nutch.apache.org时,我成功了,但是当我尝试不同的网址时,我的hadoop.log中出现了这个异常。
**java.lang.NullPointerException
at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)**
<小时/>
这是我的 nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>Maria</value>
</property>
<property>
<name>http.robots.agents</name>
<value>Maria</value> ....
</description>
</property>
<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the “Accept-Language” request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>
</configuration>
<小时/>
这是 regex-ulrfilter.txt:
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# The default url filter.
# Better for whole-internet crawling.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.
(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip
|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov
|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
#+.
+^http://([a-z0-9]*\.)* nutch.apache.org/
#
-.
<小时/>
如果有任何解决此问题的建议,我将不胜感激
最佳答案
我从未使用过nutch,但这似乎是一个常见错误,在init 启动的NPE 意味着UTF8 实例在创建时失败。
原因是“crawl”函数在 Nutch2 中已被弃用,取而代之的是位于“bin/crawl”中的 java 文件
只需将文件 $NUTCH_HOME/src/bin/crawl 复制到部署目录:$NUTCH_HOME/runtime/deploy/bin 然后运行爬网命令,看看这里:
http://wiki.apache.org/nutch/NutchTutorial#A3.1_Using_the_Crawl_Command
希望这有帮助。
关于java.lang.NullPointerException(nutch 2.2.1 和 MySql 作为数据存储),我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/21198202/