python - Python 的 SequenceMatcher 是如何工作的？

我对 SequenceMatcher 返回的两个不同答案感到有点困惑取决于参数的顺序。为什么会这样？

例子

SequenceMatcher 不可交换:

>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None, "Ebojfm Mzpm", "Ebfo ef Mfpo").ratio()
0.6086956521739131

>>> SequenceMatcher(None, "Ebfo ef Mfpo", "Ebojfm Mzpm").ratio()
0.5217391304347826

最佳答案

SequenceMatcher.ratio 内部使用 SequenceMatcher.get_matching_blocks 来计算比率，我将引导您完成这些步骤以了解这是如何发生的:

SequenceMatcher.get_matching_blocks

Return list of triples describing matching subsequences. Each triple is of the form (i, j, n), and means that a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in i and j.

The last triple is a dummy, and has the value (len(a), len(b), 0). It is the only triple with n == 0. If (i, j, n) and (i', j', n') are adjacent triples in the list, and the second is not the last triple in the list, then i+n != i' or j+n != j'; in other words, adjacent triples always describe non-adjacent equal blocks.

ratio 在内部使用 SequenceMatcher.get_matching_blocks 的结果，并对 SequenceMatcher.get_matching_blocks 返回的所有匹配序列的大小求和。这是来自 difflib.py 的确切源代码:

matches = sum(triple[-1] for triple in self.get_matching_blocks())

上面一行很关键，因为上面表达式的结果是用来计算比率的。我们很快就会看到这一点，以及它如何影响比率的计算。

>>> m1 = SequenceMatcher(None, "Ebojfm Mzpm", "Ebfo ef Mfpo")
>>> m2 = SequenceMatcher(None, "Ebfo ef Mfpo", "Ebojfm Mzpm")

>>> matches1 = sum(triple[-1] for triple in m1.get_matching_blocks())
>>> matches1
7
>>> matches2 = sum(triple[-1] for triple in m2.get_matching_blocks())
>>> matches2
6

如您所见，我们有 7 和 6。这些只是 get_matching_blocks 返回的匹配子序列的总和。为什么这很重要？这就是为什么，比率是按以下方式计算的，(这来自 difflib 源代码):

def _calculate_ratio(matches, length):
    if length:
        return 2.0 * matches / length
    return 1.0

length 是 len(a) + len(b) 其中 a 是第一个序列，b作为第二个序列。

好了，废话少说，我们需要行动:

>>> length = len("Ebojfm Mzpm") + len("Ebfo ef Mfpo") 
>>> m1.ratio()
0.6086956521739131
>>> (2.0 * matches1 / length)  == m1.ratio()
True

m2 类似:

>>> 2.0 * matches2 / length
0.5217391304347826 
>>> (2.0 * matches2 / length) == m2.ratio()
True

注意:并非所有 SequenceMatcher(None a,b).ratio() == SequenceMatcher(None b,a).ratio() 都是 False ，有时它们可以是 True:

>>> s1 = SequenceMatcher(None, "abcd", "bcde").ratio()
>>> s2 = SequenceMatcher(None, "bcde", "abcd").ratio()
>>> s1 == s2
True

如果你想知道为什么，这是因为

sum(triple[-1] for triple in self.get_matching_blocks())

对于 SequenceMatcher(None, "abcd", "bcde") 和 SequenceMatcher(None, "bcde", "abcd") 是相同的 < strong>3.

关于python - Python 的 SequenceMatcher 是如何工作的？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/35517353/

python - Python 的 SequenceMatcher 是如何工作的？

例子

上一篇：python - 不用activate使用anaconda环境？ (例如在 Crontab 中)

下一篇：python - pyinstaller numpy "Intel MKL FATAL ERROR: Cannot load mkl_intel_thread.dll"