python - Removing the loop in NLP sentence comparison

Tags: python numpy bert-language-model

I am using BERT to compare text similarity, with the following code:

from bert_embedding import BertEmbedding
import numpy as np
from scipy.spatial.distance import cosine as cosine_similarity  # note: scipy's cosine is a distance (1 - similarity)

bert_embedding = BertEmbedding()
TEXT1 = "As expected from MIT-level of course: it's interesting, challenging, engaging, and for me personally quite enlightening. This course is second part of 5 courses in  micromasters program. I was interested in learning about supply chain (purely personal interest, my work touch this topic but not directly) and stumbled upon this course, took it, and man-oh-man...I just couldn't stop learning. Now I'm planning to take the rest of the courses. Average time/effort per week should be around 8-10 hours, but I tried to squeeze everything into just 5 hours since I have very limited free time. You will need 2-3 hours per week for the lecture videos, 2 hours for practice problems, and another 2 hours for the weekly homework. This course offers several topics around demand forecasting and inventory. Basic knowledge of probability and statistics is needed. It will help if you take the prerequisite course: supply chain analytics. But if you've already familiar with basic concept of statistics, you can pick yourself along the way. The lectures are very interesting and engaging, it gives you a lot of knowledge but also throw in some business perspective, so it's very relatable and applicable! The practice problems can help strengthen the understanding of the given knowledge and the homework are very challenging compared to other online-courses I have taken. This course is the best quality I have taken so far, and I have taken several (3-4 MOOCs) from other provider."
TEXT1 = TEXT1.split('.')

sentence2 = ["CHALLENGING COURSE "]

From there, I want to find the best match for sentence2 among the sentences of TEXT1 using cosine distance:

best_match = {'sentence': '', 'score': ''}
best = 0
for sentence in TEXT1:
  #sentence = sentence.replace('SUPPLY CHAIN','')
  if len(sentence) < 5:
    continue
  avg_vec1 = calculate_avg_vec([sentence])
  avg_vec2 = calculate_avg_vec(sentence2)

  score = cosine_similarity(avg_vec1, avg_vec2)
  if score > best:
    best_match['sentence'] = sentence
    best_match['score'] = score
    best = score

best_match

The code works, but since I need to compare not just sentence2 against TEXT1 but N texts, I need it to be faster. Is it possible to vectorize this loop? Or is there any other way to speed it up?

Best Answer

cosine_similarity is defined as the dot product of two normalized vectors.

This is essentially a matrix multiplication followed by an argmax to get the index of the best match.
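
As a quick sanity check of that identity, here is a minimal sketch with made-up vectors; note that scipy's cosine is a distance, i.e. 1 - similarity:

import numpy as np
from scipy.spatial.distance import cosine

# two arbitrary example vectors, just to illustrate the identity
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.5, 1.0])

a_n = a / np.linalg.norm(a)  # normalize to unit length
b_n = b / np.linalg.norm(b)

sim_dot = np.dot(a_n, b_n)      # dot product of the normalized vectors
sim_scipy = 1.0 - cosine(a, b)  # scipy returns the distance, so flip it

assert np.isclose(sim_dot, sim_scipy)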

I will use numpy here, although, as mentioned in the comments, you could also plug this into the BERT model with pytorch or tensorflow.

First, we define a normalized average vector:

def calculate_avg_norm_vec(sentence):
    vs = sentence2vectors(sentence) # TODO: use Bert embedding
    vm = vs.mean(axis=0)            # average the token vectors
    return vm/np.linalg.norm(vm)    # normalize to unit length
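
The sentence2vectors placeholder is left as a TODO above; one possible way to fill it in with the bert_embedding package from the question is sketched below, assuming the BertEmbedding instance returns one (tokens, token_vectors) pair per input sentence:

def sentence2vectors(sentence):
    # bert_embedding([sentence]) yields one (tokens, token_vectors) pair;
    # stack the per-token vectors into a single (n_tokens, dim) array
    tokens, token_vectors = bert_embedding([sentence])[0]
    return np.array(token_vectors)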

Then, we build the matrix of all sentences and their vectors:

X = np.apply_along_axis(calculate_avg_norm_vec, 1, all_sentences)
target = calculate_avg_norm_vec(target_sentence)

Finally, we multiply the target vector with the X matrix and take the argmax:

index_of_sentence = np.dot(X,target.T).argmax(axis=1)

You may want to make sure the axis and the indexing fit your data, but this is the overall scheme.
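
For reference, here is an end-to-end sketch wired up to the question's TEXT1/sentence2 data; since np.apply_along_axis expects a numeric array rather than a list of strings, a list comprehension plus np.stack is used here to build X:

sentences = [s for s in TEXT1 if len(s) >= 5]  # same length filter as the original loop

X = np.stack([calculate_avg_norm_vec(s) for s in sentences])  # shape (n_sentences, dim)
target = calculate_avg_norm_vec(sentence2[0])                 # shape (dim,)

scores = X.dot(target)        # one cosine similarity per sentence
best_index = scores.argmax()  # scores is 1-D here, so no axis argument is needed

best_match = {'sentence': sentences[best_index], 'score': scores[best_index]}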

Regarding python - Removing the loop in NLP sentence comparison, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56656153/
