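I'm trying to write code that finds the 5 words on either side of a particular phrase. Simple enough, but I have to do this on a large volume of data, so the code needs to be optimized!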
for file in listing:
    file2 = open('//home/user/Documents/Corpus/Files/'+file,'r')
    for line in file2:
        linetrigrams = trigram_split(line)
        for trigram in linetrigrams:
            if trigram in trigrams:
                line2 = line.replace(trigram,'###').split('###')
                window = (line2[0].split()[-5:] + line2[1].split()[:5])
                for item in window:
                    if item in mostfreq:
                        matrix[trigram][mostfreq[item]] += 1
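Any suggestions for doing this faster? It may be that I'm working with entirely the wrong data structures here. trigram_split() just returns all the trigrams in the line (these are the units I need to create vectors for). "trigrams" is basically a list of about a million trigrams that I care about creating vectors for. window gets the 5 words before and after the trigram (if that trigram is in the list), and then checks whether they are in mostfreq (a dictionary with 1000 words as keys, each mapped to an integer [0-100] as the stored value). This is then used to update matrix (a dictionary with lists ([0] * 1000) as stored values). The corresponding value in the pseudo-matrix is incremented this way.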
Best answer
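A few important factors to consider when weighing the approaches:
- Multi-line vs. single-line
- Length of the lines
- Length of the search pattern
- Search match rate
- What to do when there are fewer than 5 words before or after
- How to handle non-word, non-space characters (newlines and punctuation)
- Case-insensitive?
- How to handle overlapping matches? For example, if the text is
We are the knights who say NI! NI NI NI NI NI NI NI NI
and you search for NI, what do you return? Will this happen to you?
- What if ### is in your data?
- Would you rather miss some results, or return extra false results? There may be some trade-offs, especially with messy real-world data.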
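To make a few of these questions concrete, here is a minimal sketch of one possible set of choices. It is not the asker's code or a recommended implementation. The function name `context_windows` and the `width` parameter are hypothetical. It assumes whitespace tokenization with punctuation left attached to words, case-insensitive matching, short windows at the edges of the text, and one window per occurrence even when occurrences are adjacent.

```python
def context_windows(text, phrase, width=5):
    """Return a (before, after) pair of word lists for every occurrence of phrase.

    Assumptions (hypothetical choices, not from the original post):
    whitespace tokenization, case-insensitive comparison, and windows
    that simply come back shorter near the start or end of the text.
    """
    words = text.split()
    lowered = [w.lower() for w in words]
    target = [w.lower() for w in phrase.split()]
    n = len(target)
    windows = []
    for i in range(len(words) - n + 1):
        if lowered[i:i + n] == target:
            before = words[max(0, i - width):i]   # fewer than `width` near the start
            after = words[i + n:i + n + width]    # fewer than `width` near the end
            windows.append((before, after))
    return windows


sample = "We are the knights who say NI! NI NI NI NI NI NI NI NI"
for before, after in context_windows(sample, "NI", width=2):
    print(before, after)
```

Because punctuation stays attached under this tokenization, the token `NI!` is not counted as a match for `NI`. Whether that is the right behaviour is exactly the kind of trade-off listed above.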
You could try a regular expression...
import re
zen = """Beautiful is better than ugly. \
Explicit is better than implicit. \
Simple is better than complex. \
Complex is better than complicated. \
Flat is better than nested. \
Sparse is better than dense. \
Readability counts. \
Special cases aren't special enough to break the rules. \
Although practicality beats purity. \
Errors should never pass silently. \
Unless explicitly silenced. \
In the face of ambiguity, refuse the temptation to guess. \
There should be one-- and preferably only one --obvious way to do it. \
Although that way may not be obvious at first unless you're Dutch. \
Now is better than never. \
Although never is often better than *right* now. \
If the implementation is hard to explain, it's a bad idea. \
If the implementation is easy to explain, it may be a good idea. \
Namespaces are one honking great idea -- let's do more of those!"""
searchvar = 'Dutch'
# Capture up to five whitespace-delimited tokens on each side of the search term.
dutchre = re.compile(r"""((?:\S+\s*){,5})(%s)((?:\S+\s*){,5})""" % searchvar, re.IGNORECASE | re.MULTILINE)
print(dutchre.findall(zen))
#[("obvious at first unless you're ", 'Dutch', '. Now is better than ')]
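One caveat with interpolating the search term directly into the pattern: a phrase containing regex metacharacters (for example `C++`) would be interpreted as regex syntax. The sketch below is a variant of the pattern, not the answer's exact code. The helper name `find_contexts` and the `width` parameter are hypothetical. It escapes the phrase with `re.escape`, and it moves the optional whitespace to the front of each following token so that a phrase followed by a space still picks up the words after it.

```python
import re


def find_contexts(text, phrase, width=5):
    """Return (words_before, words_after) for each non-overlapping match.

    Hypothetical helper: the phrase is escaped so it is matched literally,
    and matching is case-insensitive as in the pattern above.
    """
    pattern = re.compile(
        r"((?:\S+\s+){0,%d})(%s)((?:\s*\S+){0,%d})"
        % (width, re.escape(phrase), width),
        re.IGNORECASE,
    )
    return [(m.group(1).split(), m.group(3).split())
            for m in pattern.finditer(text)]


print(find_contexts("How much is C++ (the language) worth to you?", "C++"))
```

Note that `finditer` yields non-overlapping matches only, so occurrences that fall inside a previous match's trailing window are skipped.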
An alternative approach, which gives worse results IMO...
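def splitAndFind(text, phrase):
    # Split the text around the phrase, using "###" as a temporary marker.
    text2 = text.replace(phrase, "###").split("###")
    if len(text2) > 1:
        # Last five words before the first occurrence, first five words after it.
        return (text2[0].split()[-5:], text2[1].split()[:5])
print(splitAndFind(zen, 'Dutch'))
#(['obvious', 'at', 'first', 'unless', "you're"],
# ['.', 'Now', 'is', 'better', 'than'])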
In IPython you can time these easily:
timeit dutchre.findall(zen)
1000 loops, best of 3: 814 us per loop
timeit 'Dutch' in zen
1000000 loops, best of 3: 650 ns per loop
timeit zen.find('Dutch')
1000000 loops, best of 3: 812 ns per loop
timeit splitAndFind(zen, 'Dutch')
10000 loops, best of 3: 18.8 us per loop
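Outside IPython, the standard-library `timeit` module gives comparable measurements. As a rough sketch, the same idea can be applied to the `trigram in trigrams` test from the question. The question describes `trigrams` as a list, where a membership test is a linear scan, while a set uses a hash lookup. The data below is made up for illustration (100,000 placeholder strings rather than real trigrams), and absolute numbers will vary by machine.

```python
import timeit

# Hypothetical stand-in data: 100,000 made-up "trigram" strings.
trigram_list = ["w%d x%d y%d" % (i, i, i) for i in range(100000)]
trigram_set = set(trigram_list)

# Worst case for the list: the probe is the last element.
probe = trigram_list[-1]

list_time = timeit.timeit(lambda: probe in trigram_list, number=20)
set_time = timeit.timeit(lambda: probe in trigram_set, number=20)

print("list: %.6f s, set: %.6f s for 20 lookups" % (list_time, set_time))
```

If the list really holds about a million entries, converting it to a set once before the loop is likely to be one of the cheaper changes to try.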
On the topic of "Python - best code for finding the five words before and after a given point in a line", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/5421517/