indexing - Passing data from Nutch to Solr

Tags: indexing solr web-crawler nutch

I am trying to pass the data crawled by the Nutch web crawler to the Solr search and indexing platform with the following command:

bin/nutch index -Dsolr.server.url=http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/ -dir crawl/segments/20161124145935/ crawl/segments/20161124150145/ -filter -normalize

But I get the following error:

The input path at segments is not a segment... skipping
The input path at content is not a segment... skipping
The input path at crawl_fetch is not a segment... skipping
Skipping segment: file:/Users/cell/Desktop/usi/information-retrieval/project/apache-nutch-1.12/crawl/segments/20161124145935/crawl_generate. Missing sub directories: parse_data, parse_text, crawl_parse, crawl_fetch
The input path at crawl_parse is not a segment... skipping
The input path at parse_data is not a segment... skipping
The input path at parse_text is not a segment... skipping
Segment dir is complete: crawl/segments/20161124150145.
Indexer: starting at 2016-11-25 05:02:17
Indexer: deleting gone documents: false
Indexer: URL filtering: true
Indexer: URL normalizing: true
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance
    solr.zookeeper.hosts : URL of the Zookeeper quorum
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


Indexing 250/250 documents
Deleting 0 documents
Indexing 250/250 documents
Deleting 0 documents
Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)

Here is the log from Nutch:

2016-11-25 06:05:03,378 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-11-25 06:05:03,500 WARN  segment.SegmentChecker - The input path at segments is not a segment... skipping
2016-11-25 06:05:03,506 WARN  segment.SegmentChecker - The input path at content is not a segment... skipping
2016-11-25 06:05:03,506 WARN  segment.SegmentChecker - The input path at crawl_fetch is not a segment... skipping
2016-11-25 06:05:03,507 WARN  segment.SegmentChecker - Skipping segment: file:/Users/cell/Desktop/usi/information-retrieval/project/apache-nutch-1.12/crawl/segments/20161124145935/crawl_generate. Missing sub directories: parse_data, parse_text, crawl_parse, crawl_fetch
2016-11-25 06:05:03,507 WARN  segment.SegmentChecker - The input path at crawl_parse is not a segment... skipping
2016-11-25 06:05:03,507 WARN  segment.SegmentChecker - The input path at parse_data is not a segment... skipping
2016-11-25 06:05:03,507 WARN  segment.SegmentChecker - The input path at parse_text is not a segment... skipping
2016-11-25 06:05:03,509 INFO  segment.SegmentChecker - Segment dir is complete: crawl/segments/20161124150145.
2016-11-25 06:05:03,510 INFO  indexer.IndexingJob - Indexer: starting at 2016-11-25 06:05:03
2016-11-25 06:05:03,512 INFO  indexer.IndexingJob - Indexer: deleting gone documents: false
2016-11-25 06:05:03,512 INFO  indexer.IndexingJob - Indexer: URL filtering: true
2016-11-25 06:05:03,512 INFO  indexer.IndexingJob - Indexer: URL normalizing: true
2016-11-25 06:05:03,614 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-11-25 06:05:03,615 INFO  indexer.IndexingJob - Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance
    solr.zookeeper.hosts : URL of the Zookeeper quorum
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


2016-11-25 06:05:03,616 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
2016-11-25 06:05:03,616 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
2016-11-25 06:05:03,617 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161124150145
2016-11-25 06:05:04,006 WARN  conf.Configuration - file:/tmp/hadoop-cell/mapred/staging/cell1463380038/.staging/job_local1463380038_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-11-25 06:05:04,010 WARN  conf.Configuration - file:/tmp/hadoop-cell/mapred/staging/cell1463380038/.staging/job_local1463380038_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-11-25 06:05:04,088 WARN  conf.Configuration - file:/tmp/hadoop-cell/mapred/local/localRunner/cell/job_local1463380038_0001/job_local1463380038_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-11-25 06:05:04,090 WARN  conf.Configuration - file:/tmp/hadoop-cell/mapred/local/localRunner/cell/job_local1463380038_0001/job_local1463380038_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-11-25 06:05:04,258 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-11-25 06:05:04,272 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
2016-11-25 06:05:08,950 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
2016-11-25 06:05:09,344 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
2016-11-25 06:05:09,734 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
2016-11-25 06:05:10,908 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
2016-11-25 06:05:11,376 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
2016-11-25 06:05:11,686 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-11-25 06:05:11,775 INFO  solr.SolrMappingReader - source: content dest: content
2016-11-25 06:05:11,775 INFO  solr.SolrMappingReader - source: title dest: title
2016-11-25 06:05:11,775 INFO  solr.SolrMappingReader - source: host dest: host
2016-11-25 06:05:11,775 INFO  solr.SolrMappingReader - source: segment dest: segment
2016-11-25 06:05:11,775 INFO  solr.SolrMappingReader - source: boost dest: boost
2016-11-25 06:05:11,775 INFO  solr.SolrMappingReader - source: digest dest: digest
2016-11-25 06:05:11,775 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2016-11-25 06:05:11,940 INFO  solr.SolrIndexWriter - Indexing 250/250 documents
2016-11-25 06:05:11,940 INFO  solr.SolrIndexWriter - Deleting 0 documents
2016-11-25 06:05:12,139 INFO  solr.SolrIndexWriter - Indexing 250/250 documents
2016-11-25 06:05:12,139 INFO  solr.SolrIndexWriter - Deleting 0 documents
2016-11-25 06:05:12,207 WARN  mapred.LocalJobRunner - job_local1463380038_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p>
</body>
</html>

    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p>
</body>
</html>

    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:543)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
    at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1220)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:209)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:173)
    at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:85)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:422)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:367)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2016-11-25 06:05:12,293 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)

I haven't created any core or collection from the UI, and honestly I'm not sure what this command that passes data to Solr actually does...

Since I'm new to both Nutch and Solr, this is hard for me to debug...

Best Answer

The log shows the error: since you haven't created any core/collection, the SolrJ library complains that the /solr/update handler can't be found (the 404 on /solr/update in your stack trace), which causes the indexing step to fail. Just create a core/collection and update the Solr URL you pass to the bin/crawl script. Follow the steps in https://wiki.apache.org/nutch/NutchTutorial to get your first crawl working.
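As a rough sketch of the fix (assuming Solr 5+ running in standalone mode; the core name `nutch` here is an arbitrary choice, not something from your setup):

```shell
# Create a core named "nutch" (run from the Solr install directory).
# This is what makes a /solr/nutch/update handler exist at all.
bin/solr create -c nutch

# Sanity check: the core's ping endpoint should now answer instead of returning 404.
curl "http://localhost:8983/solr/nutch/admin/ping"

# Re-run the indexing step, pointing solr.server.url at the core,
# not at the bare /solr root. Per your log, 20161124150145 was the
# only complete segment, so index just that one:
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch \
  crawl/crawldb/ -linkdb crawl/linkdb/ \
  crawl/segments/20161124150145/ -filter -normalize
```

Once the core exists (and has the Nutch schema installed, as the tutorial describes), documents get posted to /solr/nutch/update instead of the non-existent /solr/update that produced the 404.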

This question about passing data from Nutch to Solr is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/40798074/
