java - Lucene 5.5.0 StopFilter error

Tags: java lucene stop-words

I am trying to use StopFilter in Lucene 5.5.0. This is what I tried:

package lucenedemo;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.Iterator;

import org.apache.lucene.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.AttributeFactory;
import org.apache.lucene.util.Version;

public class lucenedemo {

    public static void main(String[] args) throws Exception {
        System.out.println(removeStopWords("hello how are you? I am fine. This is a great day!"));

    }

    public static String removeStopWords(String strInput) throws Exception {
        AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
        StandardTokenizer tokenizer = new StandardTokenizer(factory);
        tokenizer.setReader(new StringReader(strInput));
        tokenizer.reset();              
        CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();

        TokenStream streamStop = new StopFilter(tokenizer, stopWords);
        StringBuilder sb = new StringBuilder();
        CharTermAttribute charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
        streamStop.reset();
        while (streamStop.incrementToken()) {
            String term = charTermAttribute.toString();
            sb.append(term + " ");
        }

        streamStop.end();
        streamStop.close();

        tokenizer.close();  


        return sb.toString();

    }

}

But it gives me the following error:

Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:109)
at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:527)
at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:738)
at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:159)
at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)
at lucenedemo.lucenedemo.removeStopWords(lucenedemo.java:42)
at lucenedemo.lucenedemo.main(lucenedemo.java:27)

What exactly am I doing wrong? I have closed both the Tokenizer and the TokenStream. Am I missing something else?

Best answer

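Calling reset() on a filter in turn resets the underlying stream. Because you reset the tokenizer manually, then create a StopFilter with that tokenizer as its underlying stream and reset the filter as well, the tokenizer ends up being reset twice.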
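The effect of the double reset can be illustrated with a small toy model. This is only a sketch that needs no Lucene on the classpath: ToyTokenizer, ToyFilter and ResetDemo are hypothetical classes, written on the assumption that the tokenizer swaps a pending reader into place on each reset() and leaves a failing placeholder behind, which is consistent with the `Tokenizer$1.read` frame in the stack trace above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Toy model only: these classes are hypothetical and are not Lucene's API.
class ToyTokenizer {
    // Placeholder reader that fails on use, standing in for an "illegal state" reader.
    private static final Reader ILLEGAL_STATE = new Reader() {
        @Override public int read(char[] buf, int off, int len) {
            throw new IllegalStateException(
                "contract violation: reset() missing or called multiple times");
        }
        @Override public void close() {}
    };

    private Reader input = ILLEGAL_STATE;
    private Reader pending = ILLEGAL_STATE;

    void setReader(Reader reader) { pending = reader; }

    // Each reset() consumes the pending reader; a second reset() installs the failing placeholder.
    void reset() {
        input = pending;
        pending = ILLEGAL_STATE;
    }

    // Reads one character, or returns -1 at end of input.
    int read() {
        try {
            char[] one = new char[1];
            return input.read(one, 0, 1) == -1 ? -1 : one[0];
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class ToyFilter {
    private final ToyTokenizer source;
    ToyFilter(ToyTokenizer source) { this.source = source; }
    void reset() { source.reset(); }   // resetting the filter resets the stream it wraps
    int read() { return source.read(); }
}

class ResetDemo {
    // Returns true if reading succeeds, false if the contract violation is raised.
    static boolean canRead(boolean resetTokenizerFirst) {
        ToyTokenizer tokenizer = new ToyTokenizer();
        tokenizer.setReader(new StringReader("hello"));
        if (resetTokenizerFirst) {
            tokenizer.reset();         // the extra manual reset from the question
        }
        ToyFilter filter = new ToyFilter(tokenizer);
        filter.reset();                // this also resets the tokenizer
        try {
            filter.read();
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("single reset works: " + canRead(false));
        System.out.println("double reset works: " + canRead(true));
    }
}
```

In this model the read succeeds when only the filter is reset, and fails with an IllegalStateException when the tokenizer has already been reset by hand.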

So simply remove this line:

tokenizer.reset();
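For reference, here is a sketch of how the method from the question might look with that line removed. It assumes Lucene 5.5.0 (core and analyzers-common) on the classpath. The class name StopWordDemo is made up for illustration, and the try-with-resources block and the final trim() are optional tidy-ups rather than part of the fix.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.AttributeFactory;

// StopWordDemo is a hypothetical class name used only for this sketch.
public class StopWordDemo {

    public static String removeStopWords(String input) throws IOException {
        StandardTokenizer tokenizer =
            new StandardTokenizer(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY);
        tokenizer.setReader(new StringReader(input));
        // No tokenizer.reset() here: resetting the filter below resets the tokenizer too.

        CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
        StringBuilder sb = new StringBuilder();

        // Closing the filter also closes the tokenizer it wraps.
        try (TokenStream stream = new StopFilter(tokenizer, stopWords)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                       // the only reset() in the chain
            while (stream.incrementToken()) {
                sb.append(term.toString()).append(' ');
            }
            stream.end();
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(removeStopWords("hello how are you? I am fine. This is a great day!"));
    }
}
```

The order here is: set the reader, wrap the tokenizer in the filter, call reset() once on the outermost stream, consume tokens with incrementToken(), then call end() and close().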

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/35638641/
