python - Getting an incorrect score from fuzzywuzzy partial_ratio

Tags: python fuzzy-comparison fuzzywuzzy

I am fairly new to Python and I am trying to use fuzzywuzzy for fuzzy matching. I believe the partial_ratio function is giving me incorrect match scores. Here is my exploratory code:

>>> from fuzzywuzzy import fuzz
>>> fuzz.partial_ratio('Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance', 'Barbil')
50

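I believe this should return a score of 100, since the second string, 'Barbil', is contained in the first string. When I remove a few characters from the end or the beginning of the first string, I do get a match score of 100.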

>>> fuzz.partial_ratio('Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clear', 'Barbil')
100
>>> fuzz.partial_ratio('ect: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance','Orissa')
100

The score seems to switch from 50 to 100 once the first string is shortened to 199 characters. Does anyone know what is going on?

Best answer

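This happens because when one of the strings is 200 characters or longer, an automatic junk heuristic gets turned on in Python's SequenceMatcher. The heuristic treats characters that occur very often in the long string as junk and ignores them when looking for matches, so the substring may no longer be found. The following version of partial_ratio turns the heuristic off by passing autojunk=False and should work for you: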

from difflib import SequenceMatcher

def partial_ratio(s1, s2):
    """Return the ratio of the most similar substring
    as a number between 0 and 100."""

    if len(s1) <= len(s2):
        shorter = s1
        longer = s2
    else:
        shorter = s2
        longer = s1

    m = SequenceMatcher(None, shorter, longer, autojunk=False)
    blocks = m.get_matching_blocks()

    # each block represents a sequence of matching characters in a string
    # of the form (idx_1, idx_2, len)
    # the best partial match will block align with at least one of those blocks
    #   e.g. shorter = "abcd", longer = "XXXbcdeEEE"
    #   block = (1,3,3)
    #   best score === ratio("abcd", "Xbcd")
    scores = []
    for (short_start, long_start, _) in blocks:
        new_long_start = max(0, long_start - short_start)
        new_long_end = new_long_start + len(shorter)
        long_substr = longer[new_long_start:new_long_end]

        m2 = SequenceMatcher(None, shorter, long_substr, autojunk=False)
        r = m2.ratio()
        if r > .995:
            return 100
        else:
            scores.append(r)

    return max(scores) * 100.0
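
The 200-character threshold can be checked with difflib alone. The snippet below is a minimal sketch, not part of fuzzywuzzy: longest_match_size is a hypothetical helper, and the test strings are made up so that 'a' and 'b' each account for far more than 1% of the long string, which is the condition under which SequenceMatcher marks a character as "popular" and stops matching on it.

```python
from difflib import SequenceMatcher


def longest_match_size(needle, haystack, autojunk=True):
    """Length of the longest block of needle that SequenceMatcher finds in haystack."""
    matcher = SequenceMatcher(None, needle, haystack, autojunk=autojunk)
    match = matcher.find_longest_match(0, len(needle), 0, len(haystack))
    return match.size


needle = "ab"
# Hypothetical test strings: "a" and "b" are very frequent, "c" and "d" are not.
haystack_199 = "cd" + "ab" * 98 + "a"   # 199 characters, below the threshold
haystack_200 = "cd" + "ab" * 99         # 200 characters, heuristic is active

print(longest_match_size(needle, haystack_199))                  # 2
print(longest_match_size(needle, haystack_200))                  # 0
print(longest_match_size(needle, haystack_200, autojunk=False))  # 2
```

With the default autojunk=True, the match disappears as soon as the long string reaches 200 characters, which mirrors the jump between 199 and 200 characters described in the question. Passing autojunk=False restores the match.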

For "python - Getting an incorrect score from fuzzywuzzy partial_ratio", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39729225/
