Ruby: How can I test the similarity between two blocks of text?

Text 1:

absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients.

Text 2:

zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate

Text 3:

When the zerg first arrived in the Koprulu sector, they were unified by their absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate the advanced protoss race, it found useful but undeveloped material in humanity.

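Now, the end of Text 1 overlaps the beginning of Text 2, so we can say these text blocks are not unique. Likewise, Text 1 (and Text 2) can be found inside Text 3, so because of the overlap it is not unique either.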


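How would I go about writing something that looks at consecutive letters or words and determines uniqueness? Ideally, such a method would return a value representing the degree of similarity, perhaps the number of matched words divided by the average size of the two text blocks. When it returns 0, the two texts tested should be completely unique.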

I ran into some problems when using Ruby's string methods.


>> a = "nt version, there are no ch"  
>> b = "he current versi"  
>> (a.chars.to_a & b.chars.to_a).join  
=> "nt versihc"  


I guess I really just don't know where to start, at least not in a way that is efficient rather than O(n^too_high).


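I believe what you are looking for is the Longest Common Substring problem: given two strings, find the longest substring they have in common. The link goes to the Wikipedia page, which will help you understand the domain and includes pseudocode for an algorithm that runs in O(nm) time.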
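As a rough illustration of that O(nm) approach, here is a minimal dynamic-programming sketch. It is not the Wikipedia pseudocode or the Wikibooks code verbatim; the name `lcs_size` and the rolling-row layout are assumptions made for this example. It returns only the length of the longest common substring, compared character by character.

```ruby
# Length of the longest common substring of s1 and s2 (character level).
# Illustrative sketch: O(n*m) time, O(m) extra memory using two rows.
def lcs_size(s1, s2)
  a = s1.chars
  b = s2.chars
  longest = 0
  # prev[j + 1] holds the length of the common run ending at the
  # previous character of a and at b[j].
  prev = Array.new(b.size + 1, 0)

  a.each do |ch_a|
    curr = Array.new(b.size + 1, 0)
    b.each_with_index do |ch_b, j|
      next unless ch_a == ch_b
      # Extend the run that ended one character earlier in both strings.
      curr[j + 1] = prev[j] + 1
      longest = curr[j + 1] if curr[j + 1] > longest
    end
    prev = curr
  end

  longest
end

a = "nt version, there are no ch"
b = "he current versi"
puts lcs_size(a, b) # prints 8 (the shared run is "nt versi")
```

Unlike `a.chars & b.chars`, which is a set intersection of individual characters, this only counts characters that appear consecutively in both strings.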

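In addition, the Wikibooks Algorithm Implementation book has an implementation in Ruby. It includes an lcs_size method, which may be all you need. In short, if lcs_size(text1, text2) returns 4, then text1 and text2 have very little consecutive text in common, probably just a single word; if it returns, say, 40, they probably share an entire sentence.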
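To get closer to the score described in the question (0 for completely unique texts, based on matched words over the average size), one option is to run the same idea over words instead of characters and normalize the result. The sketch below is only one possible interpretation: the method name `word_similarity`, the `/\w+/` tokenization, and the lowercasing are all assumptions made for illustration.

```ruby
# Longest run of consecutive words shared by both texts, divided by the
# average word count of the two texts. Hypothetical helper for illustration.
def word_similarity(text1, text2)
  # Assumed tokenization: lowercase, then split on non-word characters.
  w1 = text1.downcase.scan(/\w+/)
  w2 = text2.downcase.scan(/\w+/)
  return 0.0 if w1.empty? || w2.empty?

  longest = 0
  prev = Array.new(w2.size + 1, 0)

  w1.each do |x|
    curr = Array.new(w2.size + 1, 0)
    w2.each_with_index do |y, j|
      next unless x == y
      # Extend the run of matching words that ended just before this pair.
      curr[j + 1] = prev[j] + 1
      longest = curr[j + 1] if curr[j + 1] > longest
    end
    prev = curr
  end

  longest / ((w1.size + w2.size) / 2.0)
end

puts word_similarity("alpha beta", "gamma delta") # prints 0.0
puts word_similarity("a b c", "a b c")            # prints 1.0
```

With this normalization the score is 0.0 when no word is shared and 1.0 when the word sequences are identical. A short text fully contained in a much longer one scores below 1.0, so dividing by the smaller word count instead may suit the Text 1 / Text 3 case better.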


