Solr does not auto-detect the language

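I have set up a single-core Solr (4.6.0) and am trying to index documents in several languages. I configured Solr to detect each document's language automatically, but it always sets the default language (the one configured in the langid.fallback parameter).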

This is what I added to solrconfig.xml to enable language detection:

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>

and

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">text,title,description,content</str>
    <str name="langid.langField">language_s</str>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

After uploading a document, the log shows the following:

248638 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – LangId configured
248639 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Language fallback to value en
248639 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Appending field text
248639 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Appending field title
248639 [qtp723484867-14] WARN  org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Field title not a String value, not including in detection
248640 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Appending field description
248640 [qtp723484867-14] WARN  org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Field description not a String value, not including in detection
248640 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Appending field content
248640 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – No input text to detect language from, returning empty list
248641 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – No language detected, using fallback en
248641 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Detected main document language from fields [Ljava.lang.String;@6efbb783: en

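As far as I understand, LanguageIdentifierUpdateProcessor cannot run language detection on solr.TextField fields, but I have not seen this limitation mentioned in any documentation. Moreover, several examples I have seen in books use text fields (not string fields) for language detection. Also, for a reason I do not understand, the text and content fields are not taken into account at all.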
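One plausible reading of the log above is that the "not a String value" warning concerns the Java type of the value submitted in the document, not the schema field type, and that fields absent from the submitted document are skipped without a warning. The sketch below is a simplified model of that behaviour written for illustration only. It is not Solr source code; the class name LangIdInputSketch and the method concatFields are hypothetical, and the real processor may differ in detail.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of how detection input might be assembled.
// NOT the actual Solr implementation; names are made up for illustration.
class LangIdInputSketch {

    // Concatenates only those field values whose runtime type is String.
    static String concatFields(Map<String, Object> doc, List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (String name : fields) {
            if (!doc.containsKey(name)) {
                continue; // field not sent with the document: skipped silently
            }
            Object value = doc.get(name);
            if (value instanceof String) {
                sb.append((String) value).append(' ');
            } else {
                // Mirrors the WARN line from the log in the question
                System.out.println("Field " + name
                        + " not a String value, not including in detection");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("text", "title", "description", "content");

        // Values sent as collections (assumption): nothing reaches the detector.
        Map<String, Object> asLists = new LinkedHashMap<>();
        asLists.put("title", Arrays.asList("Bonjour le monde"));
        asLists.put("description", Arrays.asList("Un exemple"));
        System.out.println("[" + concatFields(asLists, fields) + "]");

        // Same values sent as plain strings: both are included.
        Map<String, Object> asStrings = new LinkedHashMap<>();
        asStrings.put("title", "Bonjour le monde");
        asStrings.put("description", "Un exemple");
        System.out.println("[" + concatFields(asStrings, fields) + "]");
    }
}
```

Under this model, an empty concatenation corresponds to the "No input text to detect language from" line, after which the fallback language is used. It would also explain why text and content produce no warning: they would simply not be present in the submitted document.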

Can anyone point me in the right direction?

Here are the definitions of these fields:

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>

Thanks!

Best answer

I got it working by calling /update/extract.

In solrconfig.xml:

<!-- Solr Cell Update Request Handler
     http://wiki.apache.org/solr/ExtractingRequestHandler 
-->
<requestHandler name="/update/extract" 
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">ignored_</str>
    <str name="fmap.div">ignored_</str>

    <str name="update.chain">langid</str>
  </lst>
</requestHandler>

In the Java code:

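  // Requires SolrJ: org.apache.solr.client.solrj.request.ContentStreamUpdateRequest
  // and org.apache.solr.client.solrj.request.AbstractUpdateRequest

  // Upload the PDF content through the extracting request handler
  ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
  // literal.* parameters set field values alongside the extracted content
  up.setParam("literal.id", doc.getId().toString());
  up.setParam("literal.title", doc.getTitle());
  up.setParam("literal.description", doc.getDescription());
  // Attach the file together with its MIME type
  up.addFile(new java.io.File(doc.getFile().getFilePath()),
      doc.getProcessedFile().getFile().getMimeType());
  // Commit right after the upload (waitFlush = true, waitSearcher = true)
  up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
  solrServer.getServer().request(up);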

With this approach the document language is detected perfectly.
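For readers who want to see what such a request carries at the HTTP level, here is a minimal sketch that builds the query string for a POST to /update/extract with literal.* parameters. The helper buildExtractQuery and the class ExtractUrlSketch are hypothetical names introduced for illustration, the sample values are made up, and the exact parameters SolrJ serializes for setAction may differ from the plain commit=true used here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of SolrJ.
class ExtractUrlSketch {

    // Builds "/update/extract?literal.<field>=<value>&...&commit=<flag>".
    static String buildExtractQuery(Map<String, String> literals, boolean commit) {
        StringBuilder sb = new StringBuilder("/update/extract?");
        for (Map.Entry<String, String> e : literals.entrySet()) {
            sb.append("literal.").append(encode(e.getKey()))
              .append('=').append(encode(e.getValue()))
              .append('&');
        }
        sb.append("commit=").append(commit);
        return sb.toString();
    }

    // URL-encodes a value as UTF-8 form data (spaces become '+').
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        Map<String, String> literals = new LinkedHashMap<>();
        literals.put("id", "42");                   // made-up sample values
        literals.put("title", "Rapport annuel");
        System.out.println(buildExtractQuery(literals, true));
        // prints /update/extract?literal.id=42&literal.title=Rapport+annuel&commit=true
    }
}
```

The update.chain parameter does not appear in the query string because the handler configuration above already sets it to langid in its defaults.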

Hope this helps someone!

A similar question about Solr not auto-detecting the language can be found on Stack Overflow: https://stackoverflow.com/questions/20401548/
