java - How do I create my own fields and upload documents in Apache Solr?

Tags: java xml search solr lucene

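I am using Solr for the first time. I installed and ran it on Ubuntu Server, posted the sample XML documents from the exampledocs directory, and was able to search for keywords such as "monitor", "apple", and "Dell", since these all appear in the sample files.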

Now I want to add my own documents with custom fields. This is what schema.xml contains by default:

 <fields>
   <!-- Valid attributes for fields:
     name: mandatory - the name for the field
     type: mandatory - the name of a previously defined type from the 
       <types> section
     indexed: true if this field should be indexed (searchable or sortable)
     stored: true if this field should be retrievable
     multiValued: true if this field may contain multiple values per document
     omitNorms: (expert) set to true to omit the norms associated with
       this field (this disables length normalization and index-time
       boosting for the field, and saves some memory).  Only full-text
       fields or fields that need an index-time boost need norms.
       Norms are omitted for primitive (non-analyzed) types by default.
     termVectors: [false] set to true to store the term vector for a
       given field.
       When using MoreLikeThis, fields used for similarity should be
       stored for best performance.
     termPositions: Store position information with the term vector.  
       This will increase storage costs.
     termOffsets: Store offset information with the term vector. This 
       will increase storage costs.
     default: a value that should be used if no value is specified
       when adding a document.
   -->

   <field name="id" type="string" indexed="true" stored="true" required="true" /> 
   <field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
   <field name="name" type="text_general" indexed="true" stored="true"/>
   <field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
   <field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
   <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />

   <field name="weight" type="float" indexed="true" stored="true"/>
   <field name="price"  type="float" indexed="true" stored="true"/>
   <field name="popularity" type="int" indexed="true" stored="true" />
   <field name="inStock" type="boolean" indexed="true" stored="true" />

   <!--
   The following store examples are used to demonstrate the various ways one might _CHOOSE_ to
    implement spatial.  It is highly unlikely that you would ever have ALL of these fields defined.
    -->
   <field name="store" type="location" indexed="true" stored="true"/>

   <!-- Common metadata fields, named specifically to match up with
     SolrCell metadata when parsing rich documents such as Word, PDF.
     Some fields are multiValued only because Tika currently may return
     multiple values for them.
   -->
   <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="subject" type="text_general" indexed="true" stored="true"/>
   <field name="description" type="text_general" indexed="true" stored="true"/>
   <field name="comments" type="text_general" indexed="true" stored="true"/>
   <field name="author" type="text_general" indexed="true" stored="true"/>
   <field name="keywords" type="text_general" indexed="true" stored="true"/>
   <field name="category" type="text_general" indexed="true" stored="true"/>
   <field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="last_modified" type="date" indexed="true" stored="true"/>
   <field name="links" type="string" indexed="true" stored="true" multiValued="true"/>

   <!-- catchall field, containing all other searchable text fields (implemented
        via copyField further on in this schema  -->
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

   <!-- catchall text field that indexes tokens both normally and in reverse for efficient
        leading wildcard queries. -->
   <field name="text_rev" type="text_general_rev" indexed="true" stored="false" multiValued="true"/>

   <!-- non-tokenized version of manufacturer to make it easier to sort or group
        results by manufacturer.  copied from "manu" via copyField -->
   <field name="manu_exact" type="string" indexed="true" stored="false"/>

   <field name="payloads" type="payloads" indexed="true" stored="true"/>

   <!-- Uncommenting the following will create a "timestamp" field using
        a default value of "NOW" to indicate when each document was indexed.
     -->
   <!--
   <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
     -->

   <!-- Dynamic field definitions.  If a field name is not found, dynamicFields
        will be used if the name matches any of the patterns.
        RESTRICTION: the glob-like pattern in the name attribute must have
        a "*" only at the start or the end.
        EXAMPLE:  name="*_i" will match any field ending in _i (like myid_i, z_i)
        Longer patterns will be matched first.  if equal size patterns
        both match, the first appearing in the schema will be used.  -->
   <dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
   <dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
   <dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
   <dynamicField name="*_t"  type="text_general"    indexed="true"  stored="true"/>
   <dynamicField name="*_txt" type="text_general"    indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="*_en"  type="text_en"    indexed="true"  stored="true" multiValued="true" />
   <dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
   <dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
   <dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>

   <!-- Type used to index the lat and lon components for the "location" FieldType -->
   <dynamicField name="*_coordinate"  type="tdouble" indexed="true"  stored="false"/>

   <dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>
   <dynamicField name="*_p"  type="location" indexed="true" stored="true"/>

   <!-- some trie-coded dynamic fields for faster range queries -->
   <dynamicField name="*_ti" type="tint"    indexed="true"  stored="true"/>
   <dynamicField name="*_tl" type="tlong"   indexed="true"  stored="true"/>
   <dynamicField name="*_tf" type="tfloat"  indexed="true"  stored="true"/>
   <dynamicField name="*_td" type="tdouble" indexed="true"  stored="true"/>
   <dynamicField name="*_tdt" type="tdate"  indexed="true"  stored="true"/>

   <dynamicField name="*_pi"  type="pint"    indexed="true"  stored="true"/>
   <dynamicField name="*_c"   type="currency" indexed="true"  stored="true"/>

   <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
   <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>

   <dynamicField name="random_*" type="random" />

   <!-- uncomment the following to ignore any fields that don't already match an existing 
        field name or dynamic field, rather than reporting them as an error. 
        alternately, change the type="ignored" to some other type e.g. "text" if you want 
        unknown fields indexed and/or stored by default --> 
   <!--dynamicField name="*" type="ignored" multiValued="true" /-->

 </fields>

The default sample file looks like this:

<add><doc>
  <field name="id">3007WFP</field>
  <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
  <field name="manu">Dell, Inc.</field>
  <field name="cat">electronics</field>
  <field name="cat">monitor</field>
  <field name="features">30" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast</field>
  <field name="includes">USB cable</field>
  <field name="weight">401.6</field>
  <field name="price">2199</field>
  <field name="popularity">6</field>
  <field name="inStock">true</field>
  <!-- Buffalo store -->
  <field name="store">43.17614,-90.57341</field>
</doc></add>

I replaced the fields in schema.xml with my own custom fields:

<fields>
  <field name="user_id" type="string" indexed="true" stored="true" />
  <field name="about" type="string" indexed="true" stored="true" />
  <field name="music" type="string" indexed="true" stored="true" />
  <field name="movies" type="string" indexed="true" stored="true" />
  <field name="occupation" type="string" indexed="true" stored="true" />
</fields>

and tried to post this document, named mydoc.xml:

<add>
    <doc>
        <field name="user_id">foobar</field>
        <field name="about">I am a somebody</field>
        <field name="music">pop, rock</field>
        <field name="movies">titanic</field>
        <field name="occupation">web developer</field>
    </doc>
</add>

When I try to post it with the same command as before:

java -jar post.jar mydoc.xml

this is the error I get:

SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file mydoc.xml
SimplePostTool: FATAL: Solr returned an error #400 ERROR: [doc=null] unknown field 'user_id'

I also noticed that if I restart the Solr service, Solr Admin fails to load and shows this message:

HTTP ERROR 500

Problem accessing /solr/admin/. Reason:

    Severe errors in solr configuration.

Check your log files for more detailed information on what may be wrong.

If you want solr to continue after configuration errors, change: 

 <abortOnConfigurationError>false</abortOnConfigurationError>

in solr.xml

followed by a number of other Java errors...

If I remove my custom fields from schema.xml and restart Solr, Solr Admin loads normally again.

So I am at a loss here: how do I add my own custom fields and post my documents to Solr?

Best answer

The problem was that I forgot to update:

<uniqueKey>id</uniqueKey>

to:

<uniqueKey>user_id</uniqueKey>

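near the bottom of schema.xml. A second problem was that searching with *:* in Solr Admin worked fine, but searching for a string (keyword) returned an "undefined field text" error. To fix this, I had to add the following as one of my fields: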

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
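Putting the two fixes together, the relevant parts of schema.xml might look roughly like the sketch below. It assumes the `<types>` section from the example schema is left unchanged (so `string` and `text_general` are still defined). The `required="true"` attribute and the `copyField` lines are not part of the original fix; they are assumptions added for illustration, on the reasoning that the `text` field only helps keyword search if something is copied into it.

```xml
<!-- Sketch only: relevant parts of schema.xml, with <types> assumed unchanged. -->
<fields>
  <!-- The field named in <uniqueKey> must be defined here.
       required="true" is an assumption, mirroring the original "id" field. -->
  <field name="user_id" type="string" indexed="true" stored="true" required="true"/>
  <field name="about" type="string" indexed="true" stored="true"/>
  <field name="music" type="string" indexed="true" stored="true"/>
  <field name="movies" type="string" indexed="true" stored="true"/>
  <field name="occupation" type="string" indexed="true" stored="true"/>

  <!-- Catch-all field; keyword queries without a field prefix are run against it. -->
  <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
</fields>

<!-- Must name a field that exists in <fields>. -->
<uniqueKey>user_id</uniqueKey>

<!-- Assumption: copy the custom fields into "text" so keyword searches can match them.
     copyField entries left over from the example schema that name removed fields
     (for example source="cat") would likely need to be deleted as well. -->
<copyField source="about" dest="text"/>
<copyField source="music" dest="text"/>
<copyField source="movies" dest="text"/>
<copyField source="occupation" dest="text"/>
```

After editing schema.xml, Solr has to be restarted and mydoc.xml posted again before the new fields become searchable.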

Regarding "java - How do I create my own fields and upload documents in Apache Solr?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/10262850/
