Python - Best code for finding the five words before and after a given point in a line

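I am trying to write code that finds the 5 words on either side of a particular phrase. That is simple enough, but I have to do it over a very large amount of data, so the code needs to be optimized!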

for file in listing:
    file2 = open('//home/user/Documents/Corpus/Files/'+file,'r')
    for line in file2:
        linetrigrams = trigram_split(line)
        for trigram in linetrigrams:
            if trigram in trigrams:
                line2 = line.replace(trigram,'###').split('###')
                window = (line2[0].split()[-5:] + line2[1].split()[:5])
                for item in window:
                    if item in mostfreq:
                        matrix[trigram][mostfreq[item]] += 1

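Any suggestions for doing this faster? It may be that I am using entirely the wrong data structures here.

trigram_split() just returns all the trigrams in the line (these are the units I need to build vectors for). trigrams is essentially a list of about a million trigrams that I care about building vectors for. window gets the 5 words before and after the trigram (if that trigram is in the list), and each of those words is then checked against mostfreq (a dictionary with 1000 words as keys, each mapped to an integer [0-100] as its stored value). That integer is then used to update matrix (a dictionary whose stored values are lists, [0] * 1000). The corresponding value in the pseudo-matrix is incremented this way.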

Best answer

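A few important factors to consider when weighing the different approaches:

Multi-line versus single-line input
Length of the lines
Length of the search pattern
Search match rate
What to do when there are fewer than 5 words before or after the match
How to handle non-word, non-space characters (newlines and punctuation)
Case insensitivity?
How to handle overlapping matches? For example, if the text is We are the knights who say NI! NI NI NI NI NI NI NI NI, what should a search for NI return? Will this happen to you?
What if ### occurs in your data?
Would you rather miss some results or return extra, incorrect ones? There may be trade-offs, especially with messy real-world data.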
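To make a few of these factors concrete, here is a minimal sketch of a token-based window that reports every occurrence, including ones whose windows overlap, and that simply returns shorter windows near the start or end of the text. The function name `context_windows` and its `width` parameter are hypothetical, and the tokenization rule (plain whitespace splitting, case-sensitive, punctuation left attached) is an assumption rather than anything the question specifies.

```python
def context_windows(text, phrase, width=5):
    """Yield (before, after) word lists for every occurrence of phrase.

    Assumptions (not from the original post): words are whitespace-separated,
    matching is case-sensitive, and punctuation stays attached to its word.
    """
    words = text.split()
    target = phrase.split()
    n = len(target)
    if n == 0:
        return  # an empty phrase has no meaningful occurrences
    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            # Slices shrink on their own near the start or end of the text.
            before = words[max(0, i - width):i]
            after = words[i + n:i + n + width]
            yield before, after


text = "We are the knights who say NI! NI NI NI"
for before, after in context_windows(text, "NI", width=2):
    print(before, after)
```

Note that "NI!" is not reported as a match here because the punctuation stays attached to the word. Whether that is the desired behavior is exactly the kind of decision the list above asks about.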

You could try a regular expression...

import re
zen = """Beautiful is better than ugly. \
Explicit is better than implicit. \
Simple is better than complex. \
Complex is better than complicated. \
Flat is better than nested. \
Sparse is better than dense. \
Readability counts. \
Special cases aren't special enough to break the rules. \
Although practicality beats purity. \
Errors should never pass silently. \
Unless explicitly silenced. \
In the face of ambiguity, refuse the temptation to guess. \
There should be one-- and preferably only one --obvious way to do it. \
Although that way may not be obvious at first unless you're Dutch. \
Now is better than never. \
Although never is often better than *right* now. \
If the implementation is hard to explain, it's a bad idea. \
If the implementation is easy to explain, it may be a good idea. \
Namespaces are one honking great idea -- let's do more of those!"""

searchvar = 'Dutch'
# re.escape keeps any regex metacharacters in the search term literal
dutchre = re.compile(r"""((?:\S+\s*){,5})(%s)((?:\S+\s*){,5})""" % re.escape(searchvar), re.IGNORECASE | re.MULTILINE)
print(dutchre.findall(zen))
#[("obvious at first unless you're ", 'Dutch', '. Now is better than ')]

An alternative approach, which gives worse results IMO...

def splitAndFind(text, phrase):
    text2 = text.replace(phrase, "###").split("###")
    if len(text2) > 1:
        return ((text2[0].split()[-5:], text2[1].split()[:5]))
print(splitAndFind(zen, 'Dutch'))
#(['obvious', 'at', 'first', 'unless', "you're"],
# ['.', 'Now', 'is', 'better', 'than'])
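The replace/split approach above depends on ### never appearing in the data. One way to sidestep that, sketched here under the assumption that only the first occurrence matters, is `str.partition`, which splits on the phrase itself and needs no placeholder. The name `partition_and_find` and the `width` parameter are hypothetical.

```python
def partition_and_find(text, phrase, width=5):
    """Return (words_before, words_after) around the first occurrence of
    phrase, or None when phrase does not occur in text."""
    # partition returns (head, separator, tail); separator is '' if not found.
    before, found, after = text.partition(phrase)
    if not found:
        return None
    return before.split()[-width:], after.split()[:width]


sample = ("Although that way may not be obvious at first unless you're Dutch. "
          "Now is better than never.")
print(partition_and_find(sample, "Dutch"))
```

Unlike the replace/split version, the words after the match are taken from everything that follows the first occurrence, so a second occurrence close by does not cut the window short. An empty phrase makes `str.partition` raise ValueError, so callers may want to guard against that.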

In IPython you can time these easily:

timeit dutchre.findall(zen)
1000 loops, best of 3: 814 us per loop

timeit 'Dutch' in zen
1000000 loops, best of 3: 650 ns per loop

timeit zen.find('Dutch')
1000000 loops, best of 3: 812 ns per loop

timeit splitAndFind(zen, 'Dutch')
10000 loops, best of 3: 18.8 us per loop
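The same kind of measurement can be made outside IPython with the standard library's `timeit` module. The sketch below applies it to one detail from the question: trigrams is described as a list of about a million entries, and `x in some_list` scans the list linearly, while a set lookup takes roughly constant time on average. The data here is a hypothetical, much smaller stand-in, and the absolute numbers will vary by machine.

```python
import timeit

# Hypothetical stand-in data; the question describes roughly 1,000,000 trigrams.
trigram_list = ["w%d x%d y%d" % (i, i, i) for i in range(100_000)]
trigram_set = set(trigram_list)

# Worst case for the list: the probe is the last element.
probe = trigram_list[-1]

list_seconds = timeit.timeit(lambda: probe in trigram_list, number=20)
set_seconds = timeit.timeit(lambda: probe in trigram_set, number=20)

print("list: %.6f s for 20 lookups" % list_seconds)
print("set:  %.6f s for 20 lookups" % set_seconds)
```

If the membership test does turn out to dominate, converting trigrams to a set once before the loop is a small change to try before restructuring anything else.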

This discussion of the best code for finding the five words before and after a given point in a line comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/5421517/
