python - Comparing two blocks of text in Python

Text A: One day a man walked over the hill and saw the sun

Text B: One day a man walked over a hill and saw the sun

Text C: One week a woman looked over a hill and saw the sun

Given texts like these, I would like to get some numeric measure of how much the blocks differ. From there I can apply the following logic:

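  1. When adding a text to the database, check it against the values already stored
  2. If it is very similar to an existing value, don't add it
  3. If it is different enough from every existing value, add it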


Can anyone tell me how to approach this in Python?



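This neat routine uses difflib.SequenceMatcher, whose ratio() method takes two strings and computes a similarity index in the range [0, 1]: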


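>>> import difflib, itertools
>>> st = ["One day a man walked over the hill and saw the sun",    # Text A
...       "One week a woman looked over a hill and saw the sun",   # Text C
...       "One day a man walked over a hill and saw the sun"]      # Text B
>>> for a, b in itertools.product(st, st):   # every ordered pair, including self-pairs
...     print("Text 1 {}".format(a))
...     print("Text 2 {}".format(b))
...     # rounded to 12 digits so the display matches the values shown below
...     print("Similarity Index {}".format(round(difflib.SequenceMatcher(None, a, b).ratio(), 12)))
...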

Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 1.0
Text 1 One day a man walked over the hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.831683168317
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.959183673469
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.831683168317
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 1.0
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.868686868687
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.959183673469
Text 1 One day a man walked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.868686868687
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 1.0
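
To connect this back to the three-step logic in the question, here is a minimal sketch of the "add only if different enough" check. The helper names (`is_near_duplicate`, `add_if_distinct`), the 0.9 threshold, and the plain list standing in for the database are all assumptions made for illustration; the right threshold depends on your data.

```python
import difflib


def is_near_duplicate(new_text, existing_texts, threshold=0.9):
    """Return True if new_text is at least `threshold` similar to any stored text."""
    for old in existing_texts:
        if difflib.SequenceMatcher(None, old, new_text).ratio() >= threshold:
            return True
    return False


def add_if_distinct(new_text, existing_texts, threshold=0.9):
    """Append new_text unless it is a near duplicate; return whether it was added."""
    if is_near_duplicate(new_text, existing_texts, threshold):
        return False
    existing_texts.append(new_text)
    return True


# A plain list stands in for the database table (assumption for this sketch).
stored = []
text_a = "One day a man walked over the hill and saw the sun"
text_b = "One day a man walked over a hill and saw the sun"
text_c = "One week a woman looked over a hill and saw the sun"

results = [add_if_distinct(t, stored) for t in (text_a, text_b, text_c)]
print(results)
```

With the similarity values shown above, Text A is stored, Text B is rejected (about 0.96 similar to A), and Text C is stored (about 0.83 similar to A). Note that this compares the new text against every stored row, so it scales linearly with the size of the table.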

