python - Extract words from a paragraph that are similar to words in a list

Tags: python python-3.x difflib

I have the following string:

"The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"

And a list of words to extract:

["town","teddy","chicken","boy went"]

Note: town and teddy are misspelled in the given sentence.

I tried the following, but I get extra words that are not part of the answer:

import difflib

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"

list1 = ["town","teddy","chicken","boy went"]

[difflib.get_close_matches(x.lower().strip(), sent.split()) for x in list1 ]

I get the following result:

[['twn', 'to'], ['tddy'], ['chicken.', 'picked'], ['went']]

Instead of:

'twn', 'tddy', 'chicken','boy went'

Best answer

Note what the documentation says about difflib.get_close_matches():

difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)

Return a list of the best "good enough" matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).

Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.

Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don’t score at least that similar to word are ignored.


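At the moment, you are using the default n and cutoff arguments.

You can specify either one (or both) to narrow down the returned matches.

For example, you could use a cutoff score of 0.75: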

result = [difflib.get_close_matches(x.lower().strip(), sent.split(), cutoff=0.75) for x in list1]
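
To see why a higher cutoff removes the unwanted words, it can help to look at the similarity scores directly. The sketch below uses difflib.SequenceMatcher.ratio(), the measure that get_close_matches() is built on; the helper name score and the chosen word pairs are only illustrative assumptions.

```python
import difflib


def score(word, candidate):
    # Hypothetical helper: similarity ratio in [0, 1] between two strings.
    return difflib.SequenceMatcher(None, candidate, word).ratio()


# Word/candidate pairs taken from the question's output.
pairs_to_check = [
    ("town", "twn"),
    ("town", "to"),
    ("chicken", "chicken."),
    ("chicken", "picked"),
    ("boy went", "went"),
]

scores = {(w, c): round(score(w, c), 3) for w, c in pairs_to_check}

for (w, c), s in scores.items():
    print(f"{w!r} vs {c!r}: {s}")
```

The unwanted candidates 'to' and 'picked' score above the default cutoff of 0.6 but below 0.75, which is why raising the cutoff drops them. Note that 'went' also falls below 0.75 for 'boy went', so that word gets no match at all at this cutoff.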

Or, you could specify that at most 1 match is returned:

result = [difflib.get_close_matches(x.lower().strip(), sent.split(), n=1) for x in list1]

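In either case, you could use a list comprehension to take the first match from each inner list (difflib.get_close_matches() always returns a list, which is empty when nothing clears the cutoff, so empty results are skipped):

matches = [r[0] for r in result if r]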
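
Because get_close_matches() can return an empty list, it may be useful to keep track of which words found nothing. The following is a minimal sketch, assuming you want a mapping from each search word to its best match or None; the name best_match is hypothetical.

```python
import difflib

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town", "teddy", "chicken", "boy went"]

best_match = {}
for word in list1:
    # n=1 keeps only the highest-scoring candidate; cutoff=0.75 is stricter than the default 0.6.
    close = difflib.get_close_matches(word.lower().strip(), sent.split(), n=1, cutoff=0.75)
    # An empty list means no candidate reached the cutoff.
    best_match[word] = close[0] if close else None

print(best_match)
```

With single-word candidates only, "boy went" has nothing to match against at this cutoff, so it maps to None instead of raising an IndexError.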

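Since you also want to check for close matches of bigrams, you could do so by extracting pairings of adjacent "words" and passing them to difflib.get_close_matches() as part of the possibilities argument.

Here is a complete working example of this in action: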

import difflib
import re

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"

list1 = ["town", "teddy", "chicken", "boy went"]

# this extracts overlapping pairings of "words"
# i.e. ['The boy', 'boy went', 'went to', 'to twn', ...
pairs = re.findall(r'(?=(\b[^ ]+ [^ ]+\b))', sent)

# we pass the sent.split() list as before
# and concatenate the new pairs list to the end of it also
result = [difflib.get_close_matches(x.lower().strip(), sent.split() + pairs, n=1) for x in list1]

matches = [r[0] for r in result if r]  # skip any word that has no close match

print(matches)
# ['twn', 'tddy', 'chicken.', 'boy went']
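
The example above returns 'chicken.' with its trailing period, because sent.split() splits on whitespace only. If the punctuation is unwanted, one possible variation is to strip it from each token before matching. This is a sketch under the assumption that leading and trailing punctuation never belongs to a word; it also builds the pairs with zip() instead of a regular expression, which is a swapped-in technique and not part of the original answer.

```python
import difflib
import string

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town", "teddy", "chicken", "boy went"]

# Strip leading/trailing punctuation from every token and drop empty tokens.
words = [w.strip(string.punctuation) for w in sent.split()]
words = [w for w in words if w]

# Adjacent word pairs, e.g. 'The boy', 'boy went', 'went to', ...
# Pairs can span a sentence boundary ('chicken He'); this sketch does not filter those out.
pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]

candidates = words + pairs

matches = []
for target in list1:
    close = difflib.get_close_matches(target.lower().strip(), candidates, n=1)
    if close:  # skip targets with no close match
        matches.append(close[0])

print(matches)
# ['twn', 'tddy', 'chicken', 'boy went']
```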

Regarding python - Extract words from a paragraph that are similar to words in a list, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66181700/
