python - Python implementation of Okapi BM25

Tags: python svm feature-selection

I am trying to implement Okapi BM25 in Python. I have read some tutorials on how to do this, but I seem to be stuck.

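I have a collection of documents (with "id" and "text" columns) and a collection of queries (also with "id" and "text" columns). I have finished the preprocessing step and have my documents and queries as lists: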

documents = list(train_docs['text'])        #put the documents text to list
queries = list(train_queries_all['text'])   #put the queries text to list

Then for BM25 I do this:

pip install rank_bm25

# Build the BM25 index

from rank_bm25 import BM25Okapi

bm25 = BM25Okapi(documents)

# Compute the scores

bm_score = BM25Okapi.get_scores(documents, query=queries)

But this does not work.


Then I tried this:

import math
import numpy as np
from multiprocessing import Pool, cpu_count

nd = len(documents) # corpus_size = 3612 (I am not sure whether this is necessary)

class BM25:
    def __init__(self, documents, tokenizer=None):
        self.corpus_size = len(documents)
        self.avgdl = 0
        self.doc_freqs = []
        self.idf = {}
        self.doc_len = []
        self.tokenizer = tokenizer

        if tokenizer:
            documents = self._tokenize_corpus(documents)

        nd = self._initialize(documents)
        self._calc_idf(nd)

    def _initialize(self, documents):
        nd = {}  # word -> number of documents with word
        num_doc = 0
        for document in documents:
            self.doc_len.append(len(document))
            num_doc += len(document)

            frequencies = {}
            for word in document:
                if word not in frequencies:
                    frequencies[word] = 0
                frequencies[word] += 1
            self.doc_freqs.append(frequencies)

            for word, freq in frequencies.items():
                if word not in nd:
                    nd[word] = 0
                nd[word] += 1

        self.avgdl = num_doc / self.corpus_size
        return nd

    def _tokenize_corpus(self, documents):
        # Tokenize in parallel; the context manager releases the worker pool
        with Pool(cpu_count()) as pool:
            tokenized_corpus = pool.map(self.tokenizer, documents)
        return tokenized_corpus

    def _calc_idf(self, nd):
        raise NotImplementedError()

    def get_scores(self, queries):
        raise NotImplementedError()

    def get_top_n(self, queries, documents, n=5):

        assert self.corpus_size == len(documents), "The documents given don't match the index corpus!"

        scores = self.get_scores(queries)
        top_n = np.argsort(scores)[::-1][:n]
        return [documents[i] for i in top_n]

class BM25T(BM25):
    def __init__(self, documents, k1=1.5, b=0.75, delta=1):
        # Algorithm specific parameters
        self.k1 = k1
        self.b = b
        self.delta = delta
        super().__init__(documents)

    def _calc_idf(self, nd):
        for word, freq in nd.items():
            idf = math.log((self.corpus_size + 1) / freq)
            self.idf[word] = idf

    def get_scores(self, queries):
        score = np.zeros(self.corpus_size)
        doc_len = np.array(self.doc_len)
        for q in queries:
            q_freq = np.array([(doc.get(q) or 0) for doc in self.doc_freqs])
            score += (self.idf.get(q) or 0) * (self.delta + (q_freq * (self.k1 + 1)) /
                                               (self.k1 * (1 - self.b + self.b * doc_len / self.avgdl) + q_freq))
        return score

Then I try to get the scores:

score = BM25.get_scores(self=documents, queries)

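But I get this message:

score = BM25.get_scores(self=documents, queries)

SyntaxError: positional argument follows keyword argument


Does anyone know why this error occurs? Thanks in advance.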

Best answer

1) Tokenize the corpus, or pass a tokenizer function to the class.

2) Pass only the query to the get_scores function.
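
On the SyntaxError itself: in a Python call, positional arguments must come before keyword arguments, so `f(self=documents, queries)` is rejected before anything runs. In addition, get_scores is an instance method, so it is normally called on an object built from the corpus rather than on the class with self passed by hand. A minimal sketch of the call pattern, where TinyScorer is a hypothetical stand-in for the class in the question (it only counts matching tokens and is not BM25):

```python
# TinyScorer is a hypothetical stand-in for the BM25T class in the question.
# It only counts matching tokens, which is enough to show the call pattern.
class TinyScorer:
    def __init__(self, documents):
        self.documents = documents  # list of token lists

    def get_scores(self, query_tokens):
        # One score per document: how many query tokens occur in it.
        return [sum(tok in doc for tok in query_tokens) for doc in self.documents]


docs = [["windy", "in", "london"], ["hello", "there"]]

# The call shape from the question does not even compile:
syntax_error_raised = False
try:
    compile("TinyScorer.get_scores(self=docs, queries)", "<demo>", "eval")
except SyntaxError:
    syntax_error_raised = True  # positional argument after a keyword argument

# Build an instance from the corpus first, then pass only the query.
scorer = TinyScorer(docs)
scores = scorer.get_scores(["windy", "london"])
print(scores)  # [2, 0]
```

With the class from the question, the same pattern would be to construct BM25T(tokenized_documents) once and then call get_scores on that object with one tokenized query at a time.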

See the official example:

from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)
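
To make it clearer what get_scores expects and returns, here is a rough pure-Python sketch of Okapi BM25 scoring over a tokenized corpus, with several queries scored one at a time as in the question. The helper bm25_scores is hypothetical and is not rank_bm25's code. The idf variant (a non-negative form), the lowercasing and the whitespace splitting are all assumptions, so the numbers will not match rank_bm25 exactly.

```python
import math
from collections import Counter


def bm25_scores(tokenized_corpus, tokenized_query, k1=1.5, b=0.75):
    """Return one BM25 score per document (illustrative sketch only)."""
    n_docs = len(tokenized_corpus)
    avgdl = sum(len(doc) for doc in tokenized_corpus) / n_docs
    term_freqs = [Counter(doc) for doc in tokenized_corpus]

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tf in term_freqs:
        doc_freq.update(tf.keys())

    scores = [0.0] * n_docs
    for term in tokenized_query:
        n_t = doc_freq.get(term, 0)
        if n_t == 0:
            continue  # term never occurs in the corpus
        # Assumed idf variant that is always non-negative.
        idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
        for i, tf in enumerate(term_freqs):
            f = tf.get(term, 0)
            # Length normalisation: longer documents are penalised via b.
            norm = k1 * (1 - b + b * len(tokenized_corpus[i]) / avgdl)
            scores[i] += idf * f * (k1 + 1) / (f + norm)
    return scores


corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]
queries = ["windy London", "weather today"]

# Both the corpus and each query must be lists of tokens, not raw strings.
tokenized_corpus = [doc.lower().split() for doc in corpus]

for query in queries:
    scores = bm25_scores(tokenized_corpus, query.lower().split())
    best = max(range(len(scores)), key=scores.__getitem__)
    print(query, "->", corpus[best])
```

The important points are the same as in the official example: the corpus is tokenized before indexing, and each call scores a single tokenized query against every document.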

Regarding python - Python implementation of Okapi BM25, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61877065/
