java - Lucene: similarity of fields (BM25)

Tags: java, lucene, similarity

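I want to use Lucene's IndexSearcher to compute the similarity between documents. More precisely, I have one input document and want to compute its similarity to every other document in the index. I have the basic functionality working, but I now have a few questions that I could not find answers to online.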

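  • Why does the IndexSearcher return only two results when I call is.search(query, Integer.MAX_VALUE)? I would expect three.
  • Are there any mistakes in my approach that I have not spotted yet?
  • How do I handle multiple languages? As far as I know, IndexWriter and QueryParser should both use the same analyzer (StandardAnalyzer in my example). If I use three different languages, do I have to create three indexes?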

SSCCE (I am using Lucene 5.3.0):

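import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

public class Main {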

    public static void main(String[] args) throws Exception {
        Path path = Paths.get("temp_directory");

        // create index
        createIndexAndAddDocuments(path);

        // open index reader and create index searcher
        IndexReader ir = DirectoryReader.open(FSDirectory.open(path));
        IndexSearcher is = new IndexSearcher(ir);
        is.setSimilarity(new BM25Similarity());

        // document which is used to create the query
        Document doc = ir.document(1);

        // create query parser
        QueryParser queryParser = new QueryParser("Abstract", new StandardAnalyzer());

        // create query
        Query query = queryParser.parse(doc.get("Abstract"));

        // search
        for (ScoreDoc result : is.search(query, Integer.MAX_VALUE).scoreDocs) {
            System.out.println(result.doc + "\t" + result.score);
        }
    }

    private static void createIndexAndAddDocuments(Path indexPath) throws IOException {
        // create documents
        Document doc1 = new Document();
        doc1.add(new TextField("Title", "Apparatus for manufacturing green bricks for the brick manufacturing industry",
                Store.YES));
        doc1.add(new TextField("Abstract",
                "The invention relates to an apparatus (1) for manufacturing green bricks from clay for the brick manufacturing industry, comprising a circulating conveyor (3) carrying mould containers combined to mould container parts (4), a reservoir (5) for clay arranged above the mould containers, means for carrying clay out of the reservoir (5) into the mould containers, means (9) for pressing and trimming clay in the mould containers, means (11) for supplying and placing take-off plates for the green bricks (13) and means for discharging green bricks released from the mould containers, characterized in that the apparatus further comprises means (22) for moving the mould container parts (4) filled with green bricks such that a protruding edge is formed on at least one side of the green bricks",
                Store.YES));

        Document doc2 = new Document();
        doc2.add(new TextField("Title",
                "Some other title, for example: Apparatus for manufacturing green bricks for the brick manufacturing industry",
                Store.YES));
        doc2.add(new TextField("Abstract",
                "Some other abstract, for example: The invention relates to an apparatus (1) for manufacturing green bricks from clay for the brick manufacturing industry, comprising a circulating conveyor (3) carrying mould containers combined to mould container parts (4), a reservoir (5) for clay arranged above the mould containers, means for carrying clay out of the reservoir (5) into the mould containers, means (9) for pressing and trimming clay in the mould containers, means (11) for supplying and placing take-off plates for the green bricks (13) and means for discharging green bricks released from the mould containers, characterized in that the apparatus further comprises means (22) for moving the mould container parts (4) filled with green bricks such that a protruding edge is formed on at least one side of the green bricks",
                Store.YES));

        Document doc3 = new Document();
        doc3.add(new TextField("Title", "A document with a completely different title", Store.YES));
        doc3.add(new TextField("Abstract",
                "This document also has a completely different abstract which is in no way similar to the abstract of the previous documents.",
                Store.YES));

        IndexWriter iw = new IndexWriter(FSDirectory.open(indexPath), new IndexWriterConfig(new StandardAnalyzer()));
        iw.deleteAll();
        iw.addDocument(doc1);
        iw.addDocument(doc2);
        iw.addDocument(doc3);
        iw.close();
    }
}
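
For reference, here is a minimal sketch of the BM25 formula that this kind of scoring is based on, written in plain Java rather than against the Lucene API. The class name Bm25Sketch, the naive tokenizer, and the tiny corpus are assumptions made for illustration. K1 = 1.2 and B = 0.75 are the defaults of Lucene's BM25Similarity. The numbers it prints will not match Lucene's scores, because Lucene applies its own analysis and stores document lengths in an encoded form.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hand-rolled BM25 over an in-memory corpus; illustrative only, not Lucene's implementation.
class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation
    static final double B = 0.75;  // strength of the document-length normalization

    // Naive tokenizer (assumption): lowercase, split on anything that is not a-z or 0-9.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns one BM25 score per corpus document for the given query text.
    // Each distinct query term is counted once.
    static double[] scores(String queryText, List<String> corpus) {
        List<List<String>> docs = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        double totalLength = 0;
        for (String text : corpus) {
            List<String> tokens = tokenize(text);
            docs.add(tokens);
            totalLength += tokens.size();
            for (String term : new HashSet<>(tokens)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        double avgLength = totalLength / corpus.size();

        double[] result = new double[corpus.size()];
        for (String term : new HashSet<>(tokenize(queryText))) {
            int n = docFreq.getOrDefault(term, 0);
            if (n == 0) {
                continue; // term occurs in no document
            }
            double idf = Math.log(1 + (corpus.size() - n + 0.5) / (n + 0.5));
            for (int i = 0; i < docs.size(); i++) {
                long tf = docs.get(i).stream().filter(term::equals).count();
                if (tf == 0) {
                    continue; // documents without the term get no contribution
                }
                double norm = K1 * (1 - B + B * docs.get(i).size() / avgLength);
                result[i] += idf * tf * (K1 + 1) / (tf + norm);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "apparatus for manufacturing green bricks from clay",
                "some other apparatus for manufacturing green bricks",
                "a completely different text");
        // Use the first document as the query, as the question does with a stored field.
        double[] s = scores(corpus.get(0), corpus);
        for (int i = 0; i < s.length; i++) {
            System.out.println(i + "\t" + s[i]);
        }
    }
}
```

In this sketch a document that shares no term with the query keeps a score of 0. A Lucene search goes further and does not return such documents at all, so the number of hits depends on which terms survive analysis and query parsing.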

Best answer

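I found the reason you get only 2 results: in createIndexAndAddDocuments you only create doc1 and doc2; after that you overwrite doc2 and never initialize doc3. As for the question about languages: it depends on whether you want to search each language separately or all languages together. If you want to keep the languages separate, you need different indexes.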
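
The "one index per language" option can be sketched roughly as follows, in plain Java instead of Lucene so that it stays self-contained. Everything here is hypothetical: the class PerLanguageIndexSketch, the language codes, and the tiny stop-word lists that stand in for real language-specific analyzers. The point it tries to show is that each language gets its own index and that the same analysis is applied when indexing and when querying.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// One tiny inverted index per language; illustrative only, not the Lucene API.
class PerLanguageIndexSketch {
    // Hypothetical stop-word lists standing in for language-specific analyzers.
    static final Map<String, Set<String>> STOP_WORDS = new HashMap<>();
    static {
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList("the", "for", "of", "a")));
        STOP_WORDS.put("de", new HashSet<>(Arrays.asList("der", "die", "das", "und")));
    }

    // language -> (term -> ids of the documents containing that term)
    private final Map<String, Map<String, Set<Integer>>> indexes = new HashMap<>();

    // The same analysis step is used for indexing and for querying.
    static List<String> analyze(String language, String text) {
        Set<String> stop = STOP_WORDS.getOrDefault(language, Collections.emptySet());
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty() && !stop.contains(t)) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Adds a document to the index that belongs to its language.
    void add(String language, int docId, String text) {
        Map<String, Set<Integer>> index = indexes.computeIfAbsent(language, k -> new HashMap<>());
        for (String term : analyze(language, text)) {
            index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Ids of documents in the given language's index that share at least one term with the query.
    Set<Integer> search(String language, String queryText) {
        Set<Integer> hits = new TreeSet<>();
        Map<String, Set<Integer>> index = indexes.getOrDefault(language, Collections.emptyMap());
        for (String term : analyze(language, queryText)) {
            hits.addAll(index.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        PerLanguageIndexSketch sketch = new PerLanguageIndexSketch();
        sketch.add("en", 1, "Apparatus for manufacturing green bricks");
        sketch.add("de", 2, "Vorrichtung zur Herstellung der Ziegel");
        System.out.println(sketch.search("en", "green bricks"));
        System.out.println(sketch.search("de", "green bricks"));
    }
}
```

With this layout a query only sees the documents of the language it is routed to. Searching all languages together would instead mean one shared index, or querying every index and merging the hits.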

Hope this helps.

The original question, "java - Lucene: similarity of fields (BM25)", is on Stack Overflow: https://stackoverflow.com/questions/32698895/
