python - String search library results: bug, feature, or my coding error?

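I am using a Python library that implements the Aho-Corasick string search algorithm, which finds a set of patterns in a given string in a single pass. The output is not what I expected: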

In [4]: import ahocorasick
In [5]: import collections

In [6]: tree = ahocorasick.KeywordTree()

In [7]: ss = "this is the first sentence in this book the first sentence is really the most interesting the first sentence is always first"

In [8]: words = ["first sentence is", "first sentence", "the first sentence", "the first sentence is"]

In [9]: for w in words:
   ...:     tree.add(w)
   ...:

In [10]: tree.make()

In [13]: final = collections.defaultdict(int)

In [15]: for match in tree.findall(ss, allow_overlaps=True):
   ....:     final[ss[match[0]:match[1]]] += 1
   ....:

In [16]: final
{   'the first sentence': 3, 'the first sentence is': 2}

The output I expected is this:

{ 
  'the first sentence': 3,
  'the first sentence is': 2,
  'first sentence': 3,
  'first sentence is': 2
}

Am I missing something? I am running this on large strings, so post-processing is not my first choice. Is there a way to get the desired output?

Best answer

I don't know the ahocorasick module, but those results look suspicious. The acora module gives:

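import collections

import acora

# Adjacent string literals need parentheses to continue across lines.
ss = ("this is the first sentence in this book "
      "the first sentence is really the most interesting "
      "the first sentence is always first")

words = ["first sentence is",
         "first sentence",
         "the first sentence",
         "the first sentence is"]

tree = acora.AcoraBuilder(*words).build()

result = collections.defaultdict(int)

# findall() returns (keyword, position) pairs; count by keyword only.
for keyword, pos in tree.findall(ss):
    result[keyword] += 1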

Result:

>>> result
defaultdict(<type 'int'>, 
            {'the first sentence'   : 3,
             'first sentence'       : 3,
             'first sentence is'    : 2,
             'the first sentence is': 2})
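
If installing another module is not an option, the counts can be cross-checked with a small pure-Python Aho-Corasick automaton that reports every overlapping match in one pass over the text. This is only a minimal sketch, not the implementation used by either ahocorasick or acora; the names build_automaton and count_matches are made up for this illustration, and the keywords are assumed to be distinct.

```python
import collections


def build_automaton(keywords):
    """Build the trie, failure links and output lists (hypothetical helper)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> keywords that end at this state
    for kw in keywords:
        state = 0
        for ch in kw:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(kw)

    # Breadth-first traversal: depth-1 states keep failure link 0.
    queue = collections.deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the suffix state (overlapping matches).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def count_matches(text, keywords):
    """Count all (overlapping) keyword occurrences in a single pass over text."""
    goto, fail, out = build_automaton(keywords)
    counts = collections.Counter()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        counts.update(out[state])
    return dict(counts)


ss = ("this is the first sentence in this book "
      "the first sentence is really the most interesting "
      "the first sentence is always first")
words = ["first sentence is", "first sentence",
         "the first sentence", "the first sentence is"]
counts = count_matches(ss, words)
print(counts)
```

For the string and keywords from the question, this should give the same four counts as the acora result above (3, 3, 2, 2), which supports the view that the shorter overlapping keywords ought to be reported as well.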

A similar question to "python - String search library results: bug, feature, or my coding error?" can be found on Stack Overflow: https://stackoverflow.com/questions/8100233/
