python - Getting an incorrect score from fuzzywuzzy partial_ratio

I am fairly new to Python and am trying to use fuzzywuzzy for fuzzy matching. I believe the partial_ratio function is giving me an incorrect match score. Here is my exploratory code:

>>> from fuzzywuzzy import fuzz
>>> fuzz.partial_ratio('Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance','Barbil')
50

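I believe this should return a score of 100, because the second string, 'Barbil', is contained in the first string. When I remove a few characters from the end or the beginning of the first string, I do get a match score of 100.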

>>> fuzz.partial_ratio('Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clear','Barbil')
100
>>> fuzz.partial_ratio('ect: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance','Orissa')
100

The score seems to switch from 50 to 100 once the length of the first string drops to 199. Does anyone know what is going on?

Best answer

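This happens because an automatic junk heuristic gets turned on in Python's SequenceMatcher when one of the strings is 200 characters or longer. With the heuristic active, characters that occur very often in the long string are treated as junk and are no longer used to find matching blocks, so the alignment that contains 'Barbil' is never tried. Passing autojunk=False disables the heuristic. This code should work for you: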

from difflib import SequenceMatcher

def partial_ratio(s1, s2):
    """Return the ratio of the most similar substring
    as a number between 0 and 100."""

    if len(s1) <= len(s2):
        shorter = s1
        longer = s2
    else:
        shorter = s2
        longer = s1

    m = SequenceMatcher(None, shorter, longer, autojunk=False)
    blocks = m.get_matching_blocks()

    # Each block is a run of matching characters, given as a tuple
    # (idx_1, idx_2, len): start in shorter, start in longer, length.
    # The best partial match is aligned with at least one of those blocks.
    #   e.g. shorter = "abcd", longer = "XXXbcdeEEE"
    #   block = (1, 3, 3)
    #   best score == ratio("abcd", "Xbcd")
    scores = []
    for (short_start, long_start, _) in blocks:
        new_long_start = max(0, long_start - short_start)
        new_long_end = new_long_start + len(shorter)
        long_substr = longer[new_long_start:new_long_end]

        m2 = SequenceMatcher(None, shorter, long_substr, autojunk=False)
        r = m2.ratio()
        if r > .995:
            return 100
        else:
            scores.append(r)

    # Return an int here as well, so both return paths have the same type.
    return int(round(100 * max(scores)))
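
To see the heuristic on its own, without fuzzywuzzy, here is a minimal sketch that uses only difflib. The helper name matched_chars and the synthetic strings are made up for illustration and are not part of the question or of fuzzywuzzy. It counts how many characters SequenceMatcher manages to match when the second sequence is 199 versus 200 characters long.

```python
from difflib import SequenceMatcher


def matched_chars(a, b, autojunk=True):
    """Total size of the matching blocks SequenceMatcher finds between a and b.

    Hypothetical helper, for illustration only.
    """
    m = SequenceMatcher(None, a, b, autojunk=autojunk)
    # The final block is always a zero-length sentinel, so it adds nothing.
    return sum(block.size for block in m.get_matching_blocks())


needle = "aaa"
just_under = "b" + "a" * 198   # 199 characters: heuristic stays off
at_limit = "b" + "a" * 199     # 200 characters: 'a' is now treated as junk

print(matched_chars(needle, just_under))                 # 3
print(matched_chars(needle, at_limit))                   # 0
print(matched_chars(needle, at_limit, autojunk=False))   # 3
```

The needle is plainly a substring in all three cases, yet with the default autojunk=True the 200-character string yields no matching block at all, which mirrors the jump in score described in the question.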

For more on getting an incorrect score from fuzzywuzzy partial_ratio, see the similar question on Stack Overflow: https://stackoverflow.com/questions/39729225/
