python - Comparing two blocks of text in Python

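I have a system where information can come from various sources. I want to make sure I don't add identical (or extremely similar) information. Here is an example: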

Text A: One day a man walked over the hill and saw the sun

Text B: One day a man walked over a hill and saw the sun

Text C: One week a woman looked over a hill and saw the sun

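In this case, I would like to get some kind of numerical value for the difference between the blocks of information. From there I can apply the following logic:

When adding text to the database, check it against the existing values in the database
If a very similar value is found, do not add it
If the values are different enough, add it

That way we end up with distinct information in the database rather than duplicates, while still allowing a small amount of leeway.

Can anyone tell me how I might attempt this in Python?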

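Best answer

Looking at your problem, difflib.SequenceMatcher.ratio() may come in handy.

This handy routine takes two strings and computes a similarity index in the range [0, 1], where 1.0 means the strings are identical and 0.0 means they have nothing in common.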
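To make the [0, 1] index a little less opaque, here is a minimal sketch of where the number comes from. The difflib documentation defines the ratio as 2.0*M / T, where M is the number of matched elements and T is the total number of elements in both sequences. The variable names below (`matched`, `total`, `manual_ratio`) are illustrative only and are not part of the original answer.

```python
import difflib

a = "One day a man walked over the hill and saw the sun"
b = "One day a man walked over a hill and saw the sun"

matcher = difflib.SequenceMatcher(None, a, b)

# M: number of characters covered by the matching blocks
# (the final block is a zero-size sentinel, so it adds nothing)
matched = sum(block.size for block in matcher.get_matching_blocks())

# T: combined length of both strings
total = len(a) + len(b)

# Similarity index as documented for ratio(): 2.0 * M / T
manual_ratio = 2.0 * matched / total

print(manual_ratio)
print(matcher.ratio())
```

Both printed values should agree, since `ratio()` is derived from the same matching blocks.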

Quick demo

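>>> import difflib, itertools
>>> # st is not defined in the original answer; assumed to hold the three sample texts
>>> st = ["One day a man walked over the hill and saw the sun",
...       "One week a woman looked over a hill and saw the sun",
...       "One day a man walked over a hill and saw the sun"]
>>> for a, b in itertools.product(st, st):
...     print("Text 1 {}".format(a))
...     print("Text 2 {}".format(b))
...     print("Similarity Index {:.12}".format(difflib.SequenceMatcher(None, a, b).ratio()))  # 12 significant digits
...     print('-' * 80)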


Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
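The add-only-if-different logic from the question could be built on top of this ratio roughly as follows. This is only a sketch under stated assumptions: the `database` list stands in for the real database, the helper names `is_near_duplicate` and `add_if_new` are hypothetical, and the 0.9 threshold is an arbitrary starting point that would need tuning on real data.

```python
import difflib


def is_near_duplicate(candidate, existing_texts, threshold=0.9):
    """Return True if candidate is at least `threshold` similar to any stored text."""
    return any(
        difflib.SequenceMatcher(None, candidate, stored).ratio() >= threshold
        for stored in existing_texts
    )


def add_if_new(candidate, existing_texts, threshold=0.9):
    """Append candidate unless a near-duplicate is already stored.

    Returns True if the text was added, False if it was skipped.
    """
    if is_near_duplicate(candidate, existing_texts, threshold):
        return False
    existing_texts.append(candidate)
    return True


text_a = "One day a man walked over the hill and saw the sun"
text_b = "One day a man walked over a hill and saw the sun"
text_c = "One week a woman looked over a hill and saw the sun"

database = []  # assumption: a plain list stands in for the real database
for text in (text_a, text_b, text_c):
    added = add_if_new(text, database)
    print("added" if added else "skipped", "->", text)
```

Given the similarity values in the demo output above (about 0.96 for Text A against Text B, about 0.83 for Text A against Text C), a 0.9 threshold keeps Text A and Text C and skips Text B. Note that this compares each new text against every stored text, so the cost grows linearly with the size of the database.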

This question and answer come from a similar question found on Stack Overflow: https://stackoverflow.com/questions/18378949/
