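I am using this Python library, which implements the Aho-Corasick string searching algorithm; the algorithm finds a set of patterns in a given string in a single pass. The output is not what I expected: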
In [4]: import ahocorasick
In [5]: import collections
In [6]: tree = ahocorasick.KeywordTree()
In [7]: ss = "this is the first sentence in this book the first sentence is really the most interesting the first sentence is always first"
In [8]: words = ["first sentence is", "first sentence", "the first sentence", "the first sentence is"]
In [9]: for w in words:
   ...:     tree.add(w)
   ...:
In [10]: tree.make()
In [13]: final = collections.defaultdict(int)
In [15]: for match in tree.findall(ss, allow_overlaps=True):
   ....:     final[ss[match[0]:match[1]]] += 1
   ....:
In [16]: final
{ 'the first sentence': 3, 'the first sentence is': 2}
The output I expected is something like this:
{
'the first sentence': 3,
'the first sentence is': 2,
'first sentence': 3,
'first sentence is': 2
}
Am I missing something? I am running this on large strings, so post-processing is not my first choice. Is there a way to get the desired output?
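For reference, the expected counts can be reproduced without any matching library. The following is only a brute-force sketch built on `str.find` (the helper name `count_overlapping` is made up for this illustration); it rescans the text once per pattern, so it is a sanity check for small inputs rather than a substitute for Aho-Corasick on large strings.

```python
import collections


def count_overlapping(text, patterns):
    """Count every occurrence of each pattern, including overlapping ones."""
    counts = collections.defaultdict(int)
    for pattern in patterns:
        start = text.find(pattern)
        while start != -1:
            counts[pattern] += 1
            # Advance by one character so overlapping hits are not skipped.
            start = text.find(pattern, start + 1)
    return dict(counts)


ss = ("this is the first sentence in this book "
      "the first sentence is really the most interesting "
      "the first sentence is always first")
words = ["first sentence is", "first sentence",
         "the first sentence", "the first sentence is"]

expected = count_overlapping(ss, words)
print(expected)
```

Running it on the sentence above yields 3 hits each for the two shorter phrases and 2 hits each for the phrases ending in "is", which matches the output I was hoping to get from the library.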
Best answer
I don't know the ahocorasick module, but those results look suspicious. The acora module shows this:
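import acora
import collections

# Parentheses are needed so the adjacent string literals join into one string.
ss = ("this is the first sentence in this book "
      "the first sentence is really the most interesting "
      "the first sentence is always first")
words = ["first sentence is",
         "first sentence",
         "the first sentence",
         "the first sentence is"]
tree = acora.AcoraBuilder(*words).build()
result = collections.defaultdict(int)
# findall() yields (keyword, position) pairs; count by keyword only.
for keyword, _position in tree.findall(ss):
    result[keyword] += 1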
Result:
>>> result
defaultdict(<type 'int'>,
{'the first sentence' : 3,
'first sentence' : 3,
'first sentence is' : 2,
'the first sentence is': 2})
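To see why a correct Aho-Corasick search should report all four phrases, here is a minimal pure-Python sketch of the algorithm. It is not the code used by either ahocorasick or acora; the names `build_automaton` and `find_all` are invented for this illustration. The key step is that each state inherits the outputs of the state its failure link points to, so a pattern that is a suffix of a longer match (such as "first sentence" inside "the first sentence") is still emitted.

```python
import collections


def build_automaton(patterns):
    """Build goto/fail/output tables for a set of non-empty patterns."""
    goto, fail, out = [{}], [0], [[]]
    # Phase 1: insert every pattern into a trie.
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Phase 2: breadth-first pass to set failure links.
    queue = collections.deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit outputs so suffix patterns are reported as well.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Yield (start_index, pattern) for every match, overlaps included."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            yield i - len(pattern) + 1, pattern


ss = ("this is the first sentence in this book "
      "the first sentence is really the most interesting "
      "the first sentence is always first")
words = ["first sentence is", "first sentence",
         "the first sentence", "the first sentence is"]

counts = collections.Counter(p for _, p in find_all(ss, build_automaton(words)))
print(dict(counts))
```

This sketch produces the same four counts as acora on this input. That is consistent with the suspicion above: a result that drops the two shorter phrases looks as if matches that are suffixes of a longer match were discarded, although that is only a guess about what the ahocorasick module does.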
Source: "python - Results of a string search library - bug, feature, or my coding error?" on Stack Overflow: https://stackoverflow.com/questions/8100233/