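Suppose I have a fixed list of multi-word names, for example: Water, Tocopherol (Vitamin E), Vitamin D, PEG-60 Hydrogenated Castor Oil

I want the following input/output results: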

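1. Water, PEG-60 Hydrogenated Castor Oil -> Water, PEG-60 Hydrogenated Castor Oil
2. PEG-60 Hydrogenated Castor Oil -> PEG-60 Hydrogenated Castor Oil
3. Water PEG-60 Hydrogenated Castor Oil -> Water, PEG-60 Hydrogenated Castor Oil
4. Vitamin E -> Tocopherol (Vitamin E)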

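I need this to be performant, and to recognize when there are either too many close matches or no close matches. Case 1 is relatively easy because I can split on commas. Most of the time the input list is comma-separated, so this works about 80% of the time, but even then there is a small problem. Take case 4. Once the input is split, most spell-checking libraries (I have tried a number of them) do not return the ideal match for 4, because the edit distance to Vitamin D is much smaller. Some websites do this well, but I don't know how they do it.

The second part of the question is how to do word segmentation on top of this. If the given list has no commas, I need to be able to recognize that. The simplest example is that Water Vitamin D should become Water, Vitamin D. I could give many more examples, but I think this illustrates the problem well.

Here's a list of names that can be used.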

Best answer

Context

This is a case of approximate string matching, also called fuzzy matching. There is good material and there are good libraries on this topic.
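To make the question's remark about edit distance concrete, here is a minimal sketch of the classic Levenshtein distance. It is only an illustration of why a plain edit-distance spell checker prefers Vitamin D for the input Vitamin E; the function name `levenshtein` is chosen for this example and is not part of any library used in this answer.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(levenshtein("Vitamin E", "Vitamin D"))               # 1
print(levenshtein("Vitamin E", "Tocopherol (Vitamin E)"))  # 13
```

The intended match is 13 edits away while the wrong one is a single edit away, so any matcher based on raw edit distance will pick Vitamin D. Scorers that also reward substring matches avoid this trap.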

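There are different libraries and approaches for solving this. I will limit myself to relatively simple libraries.

Some cool libraries: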

from fuzzywuzzy import process
import pandas as pd
import string

Part 1

Let's set up some data to play with. I tried to reproduce the examples above; I hope that is fine.

# Set up dataframe
d = {'originals': [["Water","PEG-60 Hydrogenated Castor Oil"],
                   ["PEG-60 Hydrnated Castor Oil"],
                   ["wter"," PEG-60 Hydrnated Castor Oil"],
                   ['Vitamin E']],
     'correct': [["Water","PEG-60 Hydrogenated Castor Oil"],
                 ["PEG-60 Hydrogenated Castor Oil"],
                 ['Water', 'PEG-60 Hydrogenated Castor Oil'],
                 ['Tocopherol (Vitamin E)']]}
df = pd.DataFrame(data=d)
print(df)
                                 originals                                  correct
0  [Water, PEG-60 Hydrogenated Castor Oil]  [Water, PEG-60 Hydrogenated Castor Oil]
1            [PEG-60 Hydrnated Castor Oil]         [PEG-60 Hydrogenated Castor Oil]
2     [wter,  PEG-60 Hydrnated Castor Oil]  [Water, PEG-60 Hydrogenated Castor Oil]
3                              [Vitamin E]                 [Tocopherol (Vitamin E)]

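From the above we get the problem statement: we have some original wording and want to map it to the correct wording.

These are the options that count as correct for us: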

strOptions = ['Water', "Tocopherol (Vitamin E)",
             "Vitamin D", "PEG-60 Hydrogenated Castor Oil"]

These functions will help us. I tried to document them well.

def function_proximity(str2Match, strOptions):
    """
    Return the option most similar to the given string.

    Parameters
    ----------
    str2Match: string. The string to match.
    strOptions: list of strings. The possible matches.
    """
    # extractOne returns a (choice, score) tuple; keep only the choice
    highest = process.extractOne(str2Match, strOptions)
    return highest[0]


def check_strings(x, strOptions):
    """
    Take a list of strings and return the best match for each one.
    :param x: list of strings to match
    :param strOptions: list of strings. The possible matches.
    :return: list of matched strings
    """
    return [function_proximity(str(i), strOptions) for i in x]

Let's apply it to the dataframe:

df['solutions_1'] = df['originals'].apply(lambda x: check_strings(x, strOptions))

Let's check the results by comparing the columns.

print(df['solutions_1'] == df['correct'])
0    True
1    True
2    True
3    True
dtype: bool

As you can see, the solution works for all four cases.
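The question also asks to detect when there are too many close matches or none at all, which taking only the single best match hides. The following is a rough sketch of one way to do that. It uses the standard library's `difflib` rather than fuzzywuzzy so that it runs without extra dependencies; the function name `classify_match`, the thresholds `min_score` and `margin`, and the extra option "Vitamin C" are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify_match(query, options, min_score=0.6, margin=0.1):
    """Return (status, candidates), where status is 'none', 'ambiguous' or 'ok'.

    min_score and margin are hypothetical thresholds; tune them on real data.
    """
    ranked = sorted(((similarity(query, o), o) for o in options), reverse=True)
    best_score, best = ranked[0]
    if best_score < min_score:
        return "none", []  # nothing is close enough
    # every option whose score is within `margin` of the best one
    close = [o for s, o in ranked if best_score - s <= margin]
    if len(close) > 1:
        return "ambiguous", close  # too many close matches
    return "ok", [best]


options = ["Water", "Tocopherol (Vitamin E)", "Vitamin D",
           "PEG-60 Hydrogenated Castor Oil"]
print(classify_match("wter", options))
print(classify_match("xyz", options))
# "Vitamin C" is a made-up extra option, added only to force a tie
print(classify_match("Vitamin", options + ["Vitamin C"]))
```

The same idea should carry over to fuzzywuzzy by ranking with `process.extract`, which returns several `(choice, score)` pairs instead of one, and comparing the top scores.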

Part 2

Example problem to solve: your Water Vitamin D should become Water, Vitamin D.

Let's create a list of valid words.

list_words = []
for i in strOptions:
    list_words = list_words + i.split(' ')
# Lower-case and remove punctuation
list_valid_words = []
for i in list_words:
    i = i.lower()
    list_valid_words.append(i.translate(str.maketrans('', '', string.punctuation)))
print(list_valid_words)
['water', 'tocopherol', 'vitamin', 'e', 'vitamin', 'd', 'peg60', 'hydrogenated', 'castor', 'oil']

Now check the words of the input against that list of valid words.

def remove_punctuation_split(x):
    """
    Remove punctuation and split the string into tokens.
    :param x: string
    :return: list of lower-case tokens
    """
    x = x.lower()
    # Remove all punctuation
    x = x.translate(str.maketrans('', '', string.punctuation))
    return x.split()

# The example input from the question
query = 'Water Vitamin D'
tokens = remove_punctuation_split(query)
# Clean tokens: snap each token to the closest valid word
clean_tokens = [function_proximity(t, list_valid_words) for t in tokens]
# Match each cleaned token to the closest full name
tokens_clasified = [function_proximity(t, strOptions) for t in clean_tokens]
# Remove duplicates; sorting keeps the printed order stable
tokens_clasified = sorted(set(tokens_clasified))
print(tokens_clasified)
['Vitamin D', 'Water']

This is what was originally asked for. However, the approach can fail in some cases, especially when Vitamin E and Vitamin D appear together.
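One possible way around that limitation is to segment the input into groups of consecutive words instead of single tokens, and to pick the grouping whose groups best match known names. The sketch below does this with dynamic programming and the standard library's `difflib` (not fuzzywuzzy). Everything in it is an assumption for illustration: the helper names, the `max_words` limit, the word-count weighting of scores, and the idea of treating the text inside and outside parentheses as aliases of a name.

```python
import re
from difflib import SequenceMatcher

NAMES = ["Water", "Tocopherol (Vitamin E)", "Vitamin D",
         "PEG-60 Hydrogenated Castor Oil"]


def build_aliases(names):
    """Map lower-cased alias -> canonical name (the alias rule is an assumption)."""
    aliases = {}
    for name in names:
        aliases[name.lower()] = name
        m = re.fullmatch(r"(.+?)\s*\((.+)\)", name)
        if m:  # "Tocopherol (Vitamin E)" -> "tocopherol" and "vitamin e"
            aliases[m.group(1).lower()] = name
            aliases[m.group(2).lower()] = name
    return aliases


def best_alias(chunk, aliases):
    """Return (similarity, canonical name) for the alias closest to chunk."""
    return max((SequenceMatcher(None, chunk.lower(), alias).ratio(), name)
               for alias, name in aliases.items())


def segment(query, names, max_words=4):
    """Split query into consecutive word groups that best match known names."""
    aliases = build_aliases(names)
    words = query.replace(",", " ").split()
    # best[i] = (total score, matched names) for the first i words
    best = [(0.0, [])] + [(-1.0, [])] * len(words)
    for i in range(1, len(words) + 1):
        for j in range(max(0, i - max_words), i):
            sim, name = best_alias(" ".join(words[j:i]), aliases)
            # weight by word count so long exact groups beat many weak ones
            total = best[j][0] + sim * (i - j)
            if total > best[i][0]:
                best[i] = (total, best[j][1] + [name])
    return best[len(words)][1]


print(segment("Water Vitamin D", NAMES))
print(segment("Vitamin E Vitamin D", NAMES))
```

This scores every word group against every alias, so it is a sketch of the idea rather than a high-performance solution; for a large name list the alias lookup would need an index or a faster scorer.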

References

Regarding python - query segmentation with spell checking, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/65844582/
