java - 计算 Lucene 术语文档数量的最快方法

标签 java lucene

我想统计 Lucene 中某个字段上某个术语的文档数。 我知道 3 种方法;我很好奇最好和最快的做法是什么:

我将在长类型单值字段(“字段”)中搜索术语(所以不是文本,而是编号数据!)

将首先使用以下任何示例的一些预编码:

Directory dirIndex = FSDirectory.open('/path/to/index/');
IndexReader indexReader = DirectoryReader.open(dirIndex);
final BytesRefBuilder bytes = new BytesRefBuilder(); 
NumericUtils.longToPrefixCoded(Long.valueOf(longTerm).longValue(),0,bytes);

1) 使用索引中的 docFreq()

TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);
termEnum.seekExact(bytes.toBytesRef());
int count = termEnum.docFreq(); 

2) 搜索

IndexSearcher searcher = new IndexSearcher(indexReader);
TermQuery query = new TermQuery(new Term("field", bytes.toBytesRef()));
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(query,collector);
int count = collector.getTotalHits(); 

3) 从索引中读取完全匹配的文档并逐一统计

TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);
termEnum.seekExact(bytes.toBytesRef());
Bits liveDocs = MultiFields.getLiveDocs(indexReader);
DocsEnum docsEnum = termEnum.docs(liveDocs, null);
int count = 0;
if (docsEnum != null) {
    int docx;
    while ((docx = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        count++;
    }
 }

最佳方法

选项 1) 赢得最短代码,但如果您更新和删除索引中的文档,则基本上无用。它计算已删除的文档,就好像它们仍然存在一样。在很多地方都没有记录(官方文档除外,但在 s.o. 的答案中没有记录)这是需要注意的事情。也许有办法解决这个问题,否则对这种方法的热情有点放错了地方。 选项 2) 和 3) 确实产生了正确的结果,但应该首选哪个?或者更好 - 有没有更快的方法来做到这一点?

最佳答案

通过测试测量,使用索引获取文档而不是搜索文档(即选项 3 而不是选项 2)似乎更快(平均:选项 3)在 100 人中快 8 倍文档示例 我可以运行)。 我还反转了测试以确保先运行一个测试不会影响结果:它不会。

因此,搜索者似乎正在创建相当多的开销来执行简单的文档计数,如果要查看单个术语条目的文档计数,则索引中的查找是最快的。

用于测试的代码(使用 SOLR 索引中的前 100 条记录):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefBuilder;
import org.apache.lucene.util.NumericUtils;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;
import org.apache.lucene.util.Bits;
import org.apache.lucene.index.MultiFields;

public class ReadLongTermReferenceCount {

    public static void main(String[] args) throws IOException {

        Directory dirIndex = FSDirectory.open('/path/to/index/');
        IndexReader indexReader = DirectoryReader.open(dirIndex);
        final BytesRefBuilder bytes = new BytesRefBuilder(); 


        TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);

        IndexSearcher searcher = new IndexSearcher(indexReader);
        TotalHitCountCollector collector = new TotalHitCountCollector();

        Bits liveDocs = MultiFields.getLiveDocs(indexReader);
        final BytesRefBuilder bytes = new BytesRefBuilder(); // for reuse!
        int maxDoc = indexReader.maxDoc();
        int docsPassed = 0;
        for (int i=0; i<maxDoc; i++) {
            if (docsPassed==100) {
                break;
            }
            if (liveDocs != null && !liveDocs.get(i))
                continue;
            Document doc = indexReader.document(i);

            //get longTerm from this doc and convert to BytesRefBuilder
            String longTerm = doc.get("longTerm");
            NumericUtils.longToPrefixCoded(Long.valueOf(longTerm).longValue(),0,bytes);

            //time before the first test
            long time_start = System.nanoTime();

            //look in the "field" index for longTerm and count the number of documents
            int count = 0;
            termEnum.seekExact(bytes.toBytesRef());
            DocsEnum docsEnum = termEnum.docs(liveDocs, null);
            if (docsEnum != null) {
                int docx;
                while ((docx = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                    count++;
                }
            }

            //mid point: test 1 done, start of test 2
            long time_mid = System.nanoTime();

            //do a search for longTerm in "field"
            TermQuery query = new TermQuery(new Term("field", bytes.toBytesRef()));
            searcher.search(query,collector);
            int count = collector.getTotalHits();

            //end point: test 2 done.
            long time_end = System.nanoTime();

            //write to stdout
            System.out.println(longTerm+"\t"+(time_mid-time_start)+"\t"+(time_end-time_mid));

            docsPassed++;
        }
        indexReader.close();
        dirIndex.close();
    }
}   

对上面的内容稍作修改以使用 Lucene 5:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefBuilder;
import org.apache.lucene.util.NumericUtils;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;
import org.apache.lucene.util.Bits;
import org.apache.lucene.index.MultiFields;

public class ReadLongTermReferenceCount {

    public static void main(String[] args) throws IOException {

        Directory dirIndex = FSDirectory.open(Paths.get('/path/to/index/'));
        IndexReader indexReader = DirectoryReader.open(dirIndex);
        final BytesRefBuilder bytes = new BytesRefBuilder(); 


        TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);

        IndexSearcher searcher = new IndexSearcher(indexReader);
        TotalHitCountCollector collector = new TotalHitCountCollector();

        Bits liveDocs = MultiFields.getLiveDocs(indexReader);
        final BytesRefBuilder bytes = new BytesRefBuilder(); // for reuse!
        int maxDoc = indexReader.maxDoc();
        int docsPassed = 0;
        for (int i=0; i<maxDoc; i++) {
            if (docsPassed==100) {
                break;
            }
            if (liveDocs != null && !liveDocs.get(i))
                continue;
            Document doc = indexReader.document(i);

            //get longTerm from this doc and convert to BytesRefBuilder
            String longTerm = doc.get("longTerm");
            NumericUtils.longToPrefixCoded(Long.valueOf(longTerm).longValue(),0,bytes);

            //time before the first test
            long time_start = System.nanoTime();

            //look in the "field" index for longTerm and count the number of documents
            int count = 0;
            termEnum.seekExact(bytes.toBytesRef());
            PostingsEnum docsEnum = termEnum.postings(liveDocs, null);
            if (docsEnum != null) {
                int docx;
                while ((docx = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                    count++;
                }
            }

            //mid point: test 1 done, start of test 2
            long time_mid = System.nanoTime();

            //do a search for longTerm in "field"
            TermQuery query = new TermQuery(new Term("field", bytes.toBytesRef()));
            searcher.search(query,collector);
            int count = collector.getTotalHits();

            //end point: test 2 done.
            long time_end = System.nanoTime();

            //write to stdout
            System.out.println(longTerm+"\t"+(time_mid-time_start)+"\t"+(time_end-time_mid));

            docsPassed++;
        }
        indexReader.close();
        dirIndex.close();
    }
}   

关于java - 计算 Lucene 术语文档数量的最快方法,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/26423045/

相关文章:

java - 有谁知道如何使用 Wordnet 和 Lucene 3.6 来扩展查询?

java.lang.NoClassDefFoundError : org/apache/lucene/codecs/Codec 错误

java - 如何更改滚动面板中按钮的大小

java - 无法在线程内更新 JDialog GUI

javascript - 通过 servlet 将数据库中的数据显示到 fullcalendar - 未列出事件

java - 以 HH :MM (Not time) 的形式存储数值

java - 如何在 Lucene 7.x 中使用 CustomScoreQuery

java - 需要帮助重新构建和优化大型 solr 索引

java :Setter Getter and constructor

solr - Solr中的字符串长度函数查询