python - Dictionary-based n-grams

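I am trying to extract unigrams, bigrams, and trigrams from a string, where some of the n-grams are built from smaller ones. When a smaller n-gram occurs as part of a larger one, is there a way to count only the larger one, without also counting the smaller n-grams it contains?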

text = "the log user should able to identify log entries  and domain  log entries"
ngramList = ['log', 'log entries','domain log entries']


import re

counts = {}
for ngram in ngramList:
    words = ngram.split()
    # Allow any amount of whitespace between the words of an n-gram
    pattern = re.compile(r"\s+".join(words), re.IGNORECASE)
    counts[ngram] = len(pattern.findall(text))

print(counts)

Current program output: 'log': 3, 'log entries': 2, 'domain log entries': 1

Expected output: 'log': 1, 'log entries': 1, 'domain log entries': 1

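Best answer

You can first sort the n-gram list by length, then use re.subn to replace each n-gram (from longest to shortest) with an empty string while counting the number of replacements.

Because you process the n-grams from longest to shortest, a smaller n-gram is never counted as part of a larger one: each larger n-gram has already been removed from the string by the time the smaller ones are searched for.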

import re

s = "the log user should able to identify log entries  and domain  log entries"
ngramList = ['log', 'log entries','domain log entries']
ngramList.sort(key=len, reverse=True)

counts = {}

for ngram in ngramList:
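    words = ngram.split()
    # Allow any amount of whitespace between the words of an n-gram
    pattern = re.compile(r"\s+".join(words), re.IGNORECASE)
    # subn returns the new string and the number of replacements made
    s, n = pattern.subn('', s)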
    counts[ngram] = n

print(counts)
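
One side effect of replacing matches with an empty string is that the words on either side of a removed n-gram become neighbours, which could in principle produce a match that was not in the original text. The following is a minimal sketch of an alternative that is not part of the original answer: it builds a single alternation with the longest n-grams first and counts matches in one pass, leaving the text unmodified. The function name `count_ngrams_single_pass` is hypothetical, and `re.escape` is added on the assumption that the n-grams are literal words rather than regex fragments.

```python
import re


def count_ngrams_single_pass(text, ngrams):
    # Longest n-grams first, so the alternation prefers them at each position.
    ordered = sorted(ngrams, key=len, reverse=True)
    # One capturing group per n-gram; words may be separated by any whitespace.
    alternatives = [
        "(%s)" % r"\s+".join(re.escape(word) for word in ngram.split())
        for ngram in ordered
    ]
    pattern = re.compile("|".join(alternatives), re.IGNORECASE)

    counts = {ngram: 0 for ngram in ngrams}
    for match in pattern.finditer(text):
        # lastindex is the 1-based number of the group that matched.
        counts[ordered[match.lastindex - 1]] += 1
    return counts


text = "the log user should able to identify log entries  and domain  log entries"
print(count_ngrams_single_pass(text, ['log', 'log entries', 'domain log entries']))
# {'log': 1, 'log entries': 1, 'domain log entries': 1}
```

Because `finditer` returns non-overlapping matches, text consumed by a longer n-gram is never rescanned for the shorter ones.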

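As Wiktor pointed out in the comments, you may want to improve your regex pattern. As written, it will also match the "log" inside the word "keylogging". To be safe, wrap the n-gram in word boundaries: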

pattern = re.compile(r"\b(?:{})\b".format(r"\s+".join(ngram.split())), re.IGNORECASE)
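
Putting the two ideas together, the longest-first removal and the word-boundary pattern might be combined as in the sketch below. The wrapper function `count_ngrams` is a hypothetical name, and the use of `re.escape` is an assumption that the n-grams contain plain words rather than regex syntax.

```python
import re


def count_ngrams(text, ngrams):
    counts = {}
    # Longest n-grams first, so shorter ones are only counted where they
    # are not part of a longer n-gram that was already removed.
    for ngram in sorted(ngrams, key=len, reverse=True):
        body = r"\s+".join(re.escape(word) for word in ngram.split())
        # \b on both sides keeps "log" from matching inside "keylogging".
        pattern = re.compile(r"\b(?:%s)\b" % body, re.IGNORECASE)
        # subn returns the new string and the number of replacements made.
        text, n = pattern.subn('', text)
        counts[ngram] = n
    return counts


sample = "keylogging the log user and domain  log entries"
print(count_ngrams(sample, ['log', 'log entries', 'domain log entries']))
# 'log' is counted once here; the "log" inside "keylogging" is not matched.
```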

Source: a similar question, "python - Dictionary-based n-grams", on Stack Overflow: https://stackoverflow.com/questions/53388853/
