solr - Indexing Wikipedia with Solr does not work

Tags: solr indexing heap-memory

I am trying to index the English Wikipedia, about 40 GB, but it is not working. I followed the tutorial at http://wiki.apache.org/solr/DataImportHandler#Configuring_DataSources as well as other related Stack Overflow questions, such as "Indexing wikipedia with solr" and "Indexing wikipedia dump with solr".

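Using the configuration explained in the tutorial, I was able to import the Simple English Wikipedia (about 150,000 documents) and the Portuguese Wikipedia (more than 1 million documents). The problem appears when I try to index the English Wikipedia (more than 8 million documents). It fails with the following error: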

Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:270)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:476)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:457)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:410)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:323)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:231)
    ... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:539)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:408)
    ... 5 more
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.index.ParallelPostingsArray.<init>(ParallelPostingsArray.java:34)
    at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>(FreqProxTermsWriterPerField.java:254)
    at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.newInstance(FreqProxTermsWriterPerField.java:279)
    at org.apache.lucene.index.ParallelPostingsArray.grow(ParallelPostingsArray.java:48)
    at org.apache.lucene.index.TermsHashPerField$PostingsBytesStartArray.grow(TermsHashPerField.java:307)
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:324)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:165)
    at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:453)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1520)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:217)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:569)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:705)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:435)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:70)
    at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:235)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:504)
    ... 6 more

I am using a MacBook Pro with 4 GB of RAM and more than 120 GB of free disk space. I have already tried changing the 256 in solrconfig.xml, but with no success so far.

Can anyone help me, please?

Edit

In case anyone runs into the same problem: I solved it with the command java -Xmx1g -jar start.jar, as suggested by Cheffe.

Best answer

Your Java VM is running out of memory. Give it more memory, as explained in the question "Increase heap size in Java":

java -Xmx1024m myprogram
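
To check that the flag actually took effect, a small program can print the maximum heap size the JVM reports. This is a minimal sketch, not part of the original answer; the class name HeapCheck and the helper toMegabytes are hypothetical names chosen for illustration.

```java
public class HeapCheck {
    // Hypothetical helper: converts a byte count to whole megabytes,
    // using 1 MB = 1024 * 1024 bytes as the -Xmx suffixes do.
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Runtime.maxMemory() returns the maximum amount of memory
        // the JVM will attempt to use, in bytes.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toMegabytes(maxBytes) + " MB");
    }
}
```

Running it as java -Xmx1024m HeapCheck should report a value close to 1024 MB; the exact number can be somewhat lower depending on the garbage collector in use.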

More details on the -Xmx parameter can be found in the docs; just search for -Xmxsize:

Specifies the maximum size (in bytes) of the memory allocation pool in bytes. This value must be a multiple of 1024 and greater than 2 MB. Append the letter k or K to indicate kilobytes, m or M to indicate megabytes, g or G to indicate gigabytes. The default value is chosen at runtime based on system configuration. For server deployments, -Xms and -Xmx are often set to the same value. For more information, see Garbage Collector Ergonomics at http://docs.oracle.com/javase/8/docs/technotes/guides/vm/gc-ergonomics.html

The following examples show how to set the maximum allowed size of allocated memory to 80 MB using various units:

  • -Xmx83886080
  • -Xmx81920k
  • -Xmx80m

The -Xmx option is equivalent to -XX:MaxHeapSize.
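
The unit arithmetic behind those three equivalent values can be checked with a short program. This is only an illustrative sketch of the suffix rule quoted above, not the JVM's own option parser; the class XmxSize and the method parseBytes are hypothetical names.

```java
public class XmxSize {
    // Hypothetical helper: parses a size such as "80m", "81920k" or
    // "83886080" into bytes, following the k/m/g suffix rule quoted above.
    static long parseBytes(String size) {
        String s = size.trim().toLowerCase();
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty size");
        }
        char last = s.charAt(s.length() - 1);
        long multiplier;
        switch (last) {
            case 'k': multiplier = 1024L; break;
            case 'm': multiplier = 1024L * 1024L; break;
            case 'g': multiplier = 1024L * 1024L * 1024L; break;
            default:  multiplier = 1L; // no suffix: the value is already in bytes
        }
        String digits = (multiplier == 1L) ? s : s.substring(0, s.length() - 1);
        return Long.parseLong(digits) * multiplier;
    }

    public static void main(String[] args) {
        // All three spellings from the documentation describe the same 80 MB limit.
        for (String size : new String[] {"83886080", "81920k", "80m"}) {
            System.out.println(size + " = " + parseBytes(size) + " bytes");
        }
    }
}
```

Each of the three inputs resolves to 83886080 bytes, since 80 × 1024 × 1024 = 81920 × 1024 = 83886080.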

This question, "solr - Indexing Wikipedia with Solr does not work", comes from Stack Overflow: https://stackoverflow.com/questions/22596726/
