I'm fairly new to Python and I'm trying to do fuzzy matching with fuzzywuzzy. I believe the partial_ratio function is giving me an incorrect match score. Here is my exploratory code:
>>> from fuzzywuzzy import fuzz
>>> fuzz.partial_ratio('Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance','Barbil')
50
I believe this should return a score of 100, because the second string, 'Barbil', is contained in the first. When I strip a few characters from the end or the beginning of the first string, I do get a match score of 100.
>>> fuzz.partial_ratio('Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clear','Barbil')
100
>>> fuzz.partial_ratio('ect: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance','Orissa')
100
The score seems to switch from 50 to 100 once the length of the first string drops to 199. Does anyone know what is going on?
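Best answer
This happens because an automatic junk heuristic gets turned on in Python's SequenceMatcher when one of the strings is 200 characters or longer. With the heuristic on, characters that occur very frequently in the longer string are ignored when SequenceMatcher looks for matching blocks, so the block containing 'Barbil' can be missed. This code should work for you: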
from difflib import SequenceMatcher


def partial_ratio(s1, s2):
    """Return the ratio of the most similar substring
    as a number between 0 and 100."""
    if len(s1) <= len(s2):
        shorter = s1
        longer = s2
    else:
        shorter = s2
        longer = s1

    # autojunk=False turns off the heuristic that SequenceMatcher
    # applies to sequences of 200 or more items
    m = SequenceMatcher(None, shorter, longer, autojunk=False)
    blocks = m.get_matching_blocks()

    # each block represents a sequence of matching characters in a string
    # of the form (idx_1, idx_2, len)
    # the best partial match will block align with at least one of those blocks
    #   e.g. shorter = "abcd", longer = "XXXbcdeEEE"
    #   block = (1, 3, 3)
    #   best score === ratio("abcd", "Xbcd")
    scores = []
    for (short_start, long_start, _) in blocks:
        new_long_start = max(0, long_start - short_start)
        new_long_end = new_long_start + len(shorter)
        long_substr = longer[new_long_start:new_long_end]

        m2 = SequenceMatcher(None, shorter, long_substr, autojunk=False)
        r = m2.ratio()
        if r > .995:
            return 100
        else:
            scores.append(r)

    # round to an int so both return paths yield the same type
    return int(round(max(scores) * 100.0))
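To see the heuristic in isolation, here is a minimal sketch that uses only difflib. The inputs are hypothetical and chosen purely for illustration (they are not the strings from the question): the haystack is 211 characters long, and every character of the needle occurs 70 times in it, which is enough for SequenceMatcher to treat those characters as "popular" and ignore them.

```python
from difflib import SequenceMatcher

# Hypothetical inputs, not taken from the question:
# 1 + 3 * 70 = 211 characters, with 'a', 'b' and 'c' each occurring 70 times.
needle = "abc"
haystack = "x" + "abc" * 70


def longest_match_size(autojunk):
    """Length of the longest block of `needle` that SequenceMatcher finds in `haystack`."""
    m = SequenceMatcher(None, needle, haystack, autojunk=autojunk)
    return m.find_longest_match(0, len(needle), 0, len(haystack)).size


with_heuristic = longest_match_size(autojunk=True)      # popular characters are ignored
without_heuristic = longest_match_size(autojunk=False)  # every character is considered

print(len(haystack), with_heuristic, without_heuristic)  # prints: 211 0 3
```

With the default autojunk=True no block is found at all, even though "abc" appears verbatim in the haystack; with autojunk=False the full 3-character match is found. The same effect explains why a matching block can go missing in partial_ratio once the longer string reaches 200 characters.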
For more on "python - Getting incorrect score from fuzzy wuzzy partial_ratio", see the similar question on Stack Overflow: https://stackoverflow.com/questions/39729225/