regex - Nutch MalformedURLException causes the crawl process to terminate

Tags: regex hadoop nutch

I added a set of seeds to crawl with this command:

./bin/crawl /largeSeeds 1 http://localhost:8983/solr/ddcd 4

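In the first iteration, all steps (inject, generate, fetch, parse, update-table, indexer, and delete duplicates) completed successfully.
In the second iteration, the "update-table" step failed (see the error log below), and because it failed, the whole crawl terminated.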
CrawlDB update for 1
/usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1452969522-27478 -crawlId 1
16/01/17 02:10:17 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting at 2016-01-17 02:10:17
16/01/17 02:10:17 INFO crawl.DbUpdaterJob: DbUpdaterJob: batchId: 1452969522-27478
16/01/17 02:10:17 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-root/hadoop-unjar3649584948711945520/classes/plugins
16/01/17 02:10:18 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
16/01/17 02:10:18 INFO plugin.PluginRepository: Registered Plugins:
16/01/17 02:10:18 INFO plugin.PluginRepository:     Rel-Tag microformat Parser/Indexer/Querier (microformats-reltag)
16/01/17 02:10:18 INFO plugin.PluginRepository:     HTTP Framework (lib-http)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Html Parse Plug-in (parse-html)
16/01/17 02:10:18 INFO plugin.PluginRepository:     MetaTags (parse-metatags)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Http / Https Protocol Plug-in (protocol-httpclient)
16/01/17 02:10:18 INFO plugin.PluginRepository:     the nutch core extension points (nutch-extensionpoints)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Basic Indexing Filter (index-basic)
16/01/17 02:10:18 INFO plugin.PluginRepository:     XML Libraries (lib-xml)
16/01/17 02:10:18 INFO plugin.PluginRepository:     JavaScript Parser (parse-js)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Anchor Indexing Filter (index-anchor)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Tika Parser Plug-in (parse-tika)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Top Level Domain Plugin (tld)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Language Identification Parser/Filter (language-identifier)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Regex URL Filter Framework (lib-regex-filter)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Metadata Indexing Filter (index-metadata)
16/01/17 02:10:18 INFO plugin.PluginRepository:     CyberNeko HTML Parser (lib-nekohtml)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Subcollection indexing and query filter (subcollection)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Link Analysis Scoring Plug-in (scoring-link)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Pass-through URL Normalizer (urlnormalizer-pass)
16/01/17 02:10:18 INFO plugin.PluginRepository:     OPIC Scoring Plug-in (scoring-opic)
16/01/17 02:10:18 INFO plugin.PluginRepository:     More Indexing Filter (index-more)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Http Protocol Plug-in (protocol-http)
16/01/17 02:10:18 INFO plugin.PluginRepository:     SOLRIndexWriter (indexer-solr)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Creative Commons Plugins (creativecommons)
16/01/17 02:10:18 INFO plugin.PluginRepository: Registered Extension-Points:
16/01/17 02:10:18 INFO plugin.PluginRepository:     Parse Filter (org.apache.nutch.parse.ParseFilter)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Nutch Index Cleaning Filter (org.apache.nutch.indexer.IndexCleaningFilter)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Nutch Content Parser (org.apache.nutch.parse.Parser)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Nutch URL Filter (org.apache.nutch.net.URLFilter)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Nutch Protocol (org.apache.nutch.protocol.Protocol)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
16/01/17 02:10:18 INFO plugin.PluginRepository:     Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
16/01/17 02:10:19 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
16/01/17 02:10:19 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
16/01/17 02:10:19 INFO Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
16/01/17 02:10:19 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/01/17 02:10:19 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x60a2630a connecting to ZooKeeper ensemble=localhost:2181
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:host.name=cism479
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:java.version=1.8.0_65
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
16/01/17 02:10:19 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/lib/jvm/jdk1.8.0_65/jre
16/01/17 02:10:27 INFO mapreduce.JobSubmitter: number of splits:2
16/01/17 02:10:27 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1452929501009_0024
16/01/17 02:10:28 INFO impl.YarnClientImpl: Submitted application application_1452929501009_0024
16/01/17 02:10:28 INFO mapreduce.Job: The url to track the job: http://cism479:8088/proxy/application_1452929501009_0024/
16/01/17 02:10:28 INFO mapreduce.Job: Running job: job_1452929501009_0024
16/01/17 02:10:39 INFO mapreduce.Job: Job job_1452929501009_0024 running in uber mode : false
16/01/17 02:10:39 INFO mapreduce.Job:  map 0% reduce 0%
16/01/17 02:11:37 INFO mapreduce.Job: Task Id : attempt_1452929501009_0024_m_000000_0, Status : FAILED
Error: java.net.MalformedURLException: For input string: "#10;from <a href="https:"
    at java.net.URL.<init>(URL.java:620)
    at java.net.URL.<init>(URL.java:483)
    at java.net.URL.<init>(URL.java:432)
    at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
    at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
    at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.NumberFormatException: For input string: "#10;from <a href="https:"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:569)
    at java.lang.Integer.parseInt(Integer.java:615)
    at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:216)
    at java.net.URL.<init>(URL.java:615)
    ... 13 more

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

16/01/17 02:12:13 INFO mapreduce.Job:  map 33% reduce 0%
16/01/17 02:12:24 INFO mapreduce.Job:  map 50% reduce 0%
16/01/17 02:12:44 INFO mapreduce.Job: Task Id : attempt_1452929501009_0024_m_000000_1, Status : FAILED
Error: java.net.MalformedURLException: For input string: "#10;from <a href="https:"
    at java.net.URL.<init>(URL.java:620)
    at java.net.URL.<init>(URL.java:483)
    at java.net.URL.<init>(URL.java:432)
    at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
    at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
    at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.NumberFormatException: For input string: "#10;from <a href="https:"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:569)
    at java.lang.Integer.parseInt(Integer.java:615)
    at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:216)
    at java.net.URL.<init>(URL.java:615)
    ... 13 more

16/01/17 02:13:19 INFO mapreduce.Job: Task Id : attempt_1452929501009_0024_m_000000_2, Status : FAILED
Error: java.net.MalformedURLException: For input string: "#10;from <a href="https:"
    at java.net.URL.<init>(URL.java:620)
    at java.net.URL.<init>(URL.java:483)
    at java.net.URL.<init>(URL.java:432)
    at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
    at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
    at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.NumberFormatException: For input string: "#10;from <a href="https:"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:569)
    at java.lang.Integer.parseInt(Integer.java:615)
    at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:216)
    at java.net.URL.<init>(URL.java:615)
    ... 13 more

16/01/17 02:13:42 INFO mapreduce.Job:  map 100% reduce 100%
16/01/17 02:13:43 INFO mapreduce.Job: Job job_1452929501009_0024 failed with state FAILED due to: Task failed task_1452929501009_0024_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0

16/01/17 02:13:44 INFO mapreduce.Job: Counters: 34
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=49949067
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1193
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=1
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Failed map tasks=4
        Launched map tasks=5
        Other local map tasks=3
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=829677
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=276559
        Total vcore-seconds taken by all map tasks=276559
        Total megabyte-seconds taken by all map tasks=849589248
    Map-Reduce Framework
        Map input records=30201
        Map output records=1164348
        Map output bytes=250659088
        Map output materialized bytes=49832245
        Input split bytes=1193
        Combine input records=0
        Spilled Records=1164348
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=3541
        CPU time spent (ms)=42980
        Physical memory (bytes) snapshot=2062766080
        Virtual memory (bytes) snapshot=5086490624
        Total committed heap usage (bytes)=2127036416
    File Input Format Counters
        Bytes Read=0
Exception in thread "main" java.lang.RuntimeException: job failed: name=[1]update-table, jobid=job_1452929501009_0024
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
    at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:111)
    at org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:140)
    at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:174)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:178)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Error running:
  /usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1452969522-27478 -crawlId 1
Failed with exit value 1.

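The error makes it clear that the failure is caused by a malformed URL. Is there a way to get rid of such malformed URLs? Or is there a way to skip or bypass them so that the subsequent steps still run?
Please advise.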

Best answer

To skip this kind of (malformed) URL, add a Nutch filter rule to the conf/regex-urlfilter.txt file.
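
As a rough illustration of such a rule: the failing input in the log contains a space, a `<`, and a `"`, so one candidate line for conf/regex-urlfilter.txt could be `-[\s<>"]`, placed above the final catch-all `+.` line, since the first matching rule decides. The rule itself, the class name, and the sample URLs below are assumptions for illustration and are not taken from the original answer. The sketch also assumes that the regex filter uses `java.util.regex` and accepts a match anywhere in the URL, and it simply lets you try the pattern against sample strings before a crawl.

```java
import java.util.regex.Pattern;

public class MalformedUrlRuleCheck {
    // Body of the hypothetical rule "-[\s<>"]": whitespace, angle brackets, or a double quote.
    static final Pattern REJECT = Pattern.compile("[\\s<>\"]");

    // True if the rule would match somewhere in the URL, i.e. the URL would be dropped.
    static boolean isRejected(String url) {
        return REJECT.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            // An ordinary URL that should pass through.
            "http://example.com/page.html",
            // Made-up URL modeled on the fragment shown in the error log.
            "http://example.com:&#10;from <a href=\"https://example.org/"
        };
        for (String url : samples) {
            System.out.println((isRejected(url) ? "reject " : "keep   ") + url);
        }
    }
}
```

The Registered Plugins list in the log above shows lib-regex-filter but no urlfilter-regex entry, so it may also be worth checking that urlfilter-regex is enabled in `plugin.includes`. Otherwise rules in this file may have no effect.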

A similar question about "regex - Nutch MalformedURLException causes the crawl process to terminate" can be found on Stack Overflow: https://stackoverflow.com/questions/34849963/
