java - What is docID in IndexReader.getTermVector(int docID, String field) in Lucene 8.5.1, and how does it work?

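I am trying to get all the terms and their postings (a Terms object) from a document field in Lucene (in other words, how do I calculate term frequencies in Lucene?). According to the documentation, there is a method for this:

public final Terms getTermVector(int docID, String field) throws IOException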

Retrieve term vector for this document and field, or null if term vectors were not indexed. The returned Fields instance acts like a single-document inverted index (the docID will be 0).

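There is a parameter called int docID. What is it? For a given document, what is its id field, and how does Lucene identify it? Following the Lucene documentation, I used a StringField as the id, which is not an int.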

import org.apache.lucene.document.*;
Document doc = new Document();
Field idField = new StringField("id",post.Id,Field.Store.YES);
Field bodyField = new TextField("body", post.Body, Field.Store.YES);
doc.add(idField);
doc.add(bodyField);

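I have five questions:

How does Lucene know that the id field should be used as the docID of this document? Does Lucene even do that?
I use a String as the id, but this method takes an int. Will that cause a problem?
Is there a proper method for getting the postings?
I used a TextField. Is there a way to retrieve the term vector (Terms) of that field? I don't want to re-index my documents as explained here, because the index is too large (35 GB).
Is there a way to get the term counts from a TextField, along with the frequency of each term?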

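Best answer

To calculate term frequencies, we can use IndexReader.getTermVector(int docID, String field). The int docID parameter refers to the document ID that Lucene itself assigns to each document; it is not the id field you add to the document. You can retrieve the docID with the following code: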

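import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

String index = "index/AIndex/";
String query = "the query text";

IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer();

QueryParser parser = new QueryParser("docField", analyzer);
Query lQuery = parser.parse(query);

// requiredHits (maximum hits to fetch) and start (first hit to process) are defined by the caller.
TopDocs results = searcher.search(lQuery, requiredHits);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = (int) results.totalHits.value; // may be larger than hits.length

// Loop over the hits actually returned, not over the total hit count.
for (int i = start; i < hits.length; i++)
 {
   int docID = hits[i].doc; // Lucene's internal document number
   Terms termVector = reader.getTermVector(docID, "docField"); // null if term vectors were not stored
 }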
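The question also asks how a String id relates to the int docID. The two are unrelated: the docID is an internal number, so a document's own id has to be looked up with a query. The following is a minimal sketch of that lookup, not part of the original answer. It assumes the "id" and "body" fields from the question; the class name DocIdLookup, the helper names, the sample ids, and the in-memory ByteBuffersDirectory are made up for illustration.

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class DocIdLookup {

    /** Returns Lucene's internal docID of the first document whose "id" field equals externalId, or -1. */
    static int findDocId(IndexSearcher searcher, String externalId) throws IOException {
        // A StringField is indexed as one un-analyzed token, so an exact TermQuery matches it.
        TopDocs top = searcher.search(new TermQuery(new Term("id", externalId)), 1);
        return top.scoreDocs.length == 0 ? -1 : top.scoreDocs[0].doc;
    }

    /** Builds a tiny in-memory index (hypothetical data) and returns the stored "id" of the hit, or null. */
    static String lookupStoredId(String externalId) throws IOException {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (String id : new String[] {"post-7", "post-42"}) {
                Document doc = new Document();
                doc.add(new StringField("id", id, Field.Store.YES));
                doc.add(new TextField("body", "body of " + id, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
        try (IndexReader reader = DirectoryReader.open(dir)) {
            int docID = findDocId(new IndexSearcher(reader), externalId);
            // The int docID is only used to address the document inside this reader.
            return docID < 0 ? null : reader.document(docID).get("id");
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(lookupStoredId("post-42"));
    }
}
```

Because docIDs can change when segments are merged, it is probably safer to resolve them from your own id each time you open a reader than to persist them.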

Each termVector object holds the terms of that document field together with their frequencies, and you can read them with the following code:

HashMap<String, Long> termsFrequency = new HashMap<>();
TermsEnum itr = termVector.iterator();
long allTermFrequency = 0; // total number of term occurrences in the field
BytesRef term;

while ((term = itr.next()) != null) {
  String termText = term.utf8ToString();
  // For a term vector, totalTermFreq() is the number of occurrences within this one document.
  long tf = itr.totalTermFreq();
  termsFrequency.put(termText, tf);
  allTermFrequency += tf;
}
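The loop above collects raw counts plus their total. If a relative frequency per term is wanted, the counts can be divided by that total. The sketch below is one way to do it with plain Java collections; the class name RelativeTermFrequency and the method normalize are illustrative and not part of Lucene.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RelativeTermFrequency {

    /** Divides each raw term count by the sum of all counts; returns 0.0 for every term if the sum is 0. */
    static Map<String, Double> normalize(Map<String, Long> termsFrequency) {
        long total = 0;
        for (long tf : termsFrequency.values()) {
            total += tf;
        }
        Map<String, Double> relative = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : termsFrequency.entrySet()) {
            relative.put(e.getKey(), total == 0 ? 0.0 : (double) e.getValue() / total);
        }
        return relative;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("lucene", 2L);
        counts.put("index", 1L);
        counts.put("search", 1L);
        System.out.println(normalize(counts));
    }
}
```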

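Note: don't forget to enable term vector storage when you index your documents, as I explained here (or this one). If the documents were indexed without stored term vectors, getTermVector returns null. This option is disabled by default for all of Lucene's predefined Field types, so you need to enable it yourself.

Regarding "java - What is docID in IndexReader.getTermVector(int docID, String field) in Lucene 8.5.1, and how does it work?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62439577/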
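As a rough illustration of that index-time setting, the sketch below copies TextField.TYPE_STORED into a custom FieldType, turns on setStoreTermVectors, indexes a single document into an in-memory directory, and reads the term vector back. It assumes Lucene 8.x on the classpath; the class name TermVectorIndexing, the method indexAndCount, and the sample text are hypothetical, and only the "body" field name comes from the question.

```java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

public class TermVectorIndexing {

    /** Indexes one document with term vectors enabled on "body" and returns its per-term counts. */
    static Map<String, Long> indexAndCount(String text) throws IOException {
        // Start from TextField's stored type and switch on term vectors (off by default).
        FieldType bodyType = new FieldType(TextField.TYPE_STORED);
        bodyType.setStoreTermVectors(true);
        bodyType.freeze();

        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new Field("body", text, bodyType));
            writer.addDocument(doc);
        }

        Map<String, Long> counts = new TreeMap<>();
        try (IndexReader reader = DirectoryReader.open(dir)) {
            // The index holds a single document, so its docID is 0.
            Terms vector = reader.getTermVector(0, "body");
            if (vector == null) {
                return counts; // term vectors were not stored for this field
            }
            TermsEnum it = vector.iterator();
            BytesRef term;
            while ((term = it.next()) != null) {
                counts.put(term.utf8ToString(), it.totalTermFreq());
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(indexAndCount("lucene stores term vectors lucene"));
    }
}
```

Changing the field type only affects documents indexed afterwards, so documents that are already in the index keep returning null from getTermVector until they are indexed again.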
