Python: MemoryError when counting strings

Tags: python string memory

Following advice I received on this forum, I am using the following (example) code to count strings.

phrase_words = ['red car', 'no lake', 'newjersey turnpike']
lines = ['i have a red car which i drove on newjersey', 'turnpike. when i took exit 39 there was no', 'lake. i drove my car on muddy roads which turned my red', 'car into brown. driving on newjersey turnpike can be confusing.']
text = " ".join(lines)
counts = {phrase: text.count(phrase) for phrase in phrase_words}  # named counts so the built-in dict is not shadowed

The expected output, which is also what the example code produces, is:

{'newjersey turnpike': 2, 'red car': 2, 'no lake': 1}

This code works fine on text files smaller than 300MB. When I used a text file of 500MB+, I got the following MemoryError:

    y=' '.join(lines)
MemoryError

How can I overcome this? Thanks for your help!

Best answer

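This algorithm needs only two lines in memory at a time. It assumes that no phrase spans three lines. For each pair of adjacent lines, it counts the matches in the joined pair and subtracts the matches that lie entirely in the second line, so every match is counted exactly once: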

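from itertools import tee
try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy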
from collections import defaultdict

def pairwise(iterable): # recipe from itertools docs
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)
d = defaultdict(int)
phrase_words = ['red car', 'no lake', 'newjersey turnpike']
lines = ['i have a red car which i drove on newjersey',
         'turnpike. when i took exit 39 there was no',
         'lake. i drove my car on muddy roads which turned my red',
         'car into brown. driving on newjersey turnpike can be confusing.']

# note: needs at least two lines, otherwise line2 is never assigned
for line1, line2 in pairwise(lines):
    both_lines = ' '.join((line1, line2))
    for phrase in phrase_words:
        # counts phrases in first line and those that span to the next
        d[phrase] += both_lines.count(phrase) - line2.count(phrase)
for phrase in phrase_words:
    d[phrase] += line2.count(phrase)  # otherwise last line is not searched
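
The same idea can be wrapped in a function that accepts any iterable of lines, including an open file object, so the whole file never has to be joined into one string. This is only a minimal sketch for Python 3 under the same assumption that no phrase spans three lines. The name count_phrases and the file name big.txt are hypothetical and not part of the original answer. Lines read from a file end with a newline, so the sketch strips it before joining; otherwise a phrase split across a line break would not match.

```python
def count_phrases(lines, phrases):
    """Count phrases in an iterable of lines, holding two lines at a time.

    Assumes that no phrase spans three lines.
    """
    counts = {phrase: 0 for phrase in phrases}
    previous = None
    for line in lines:
        line = line.rstrip("\r\n")  # file lines keep their trailing newline
        if previous is not None:
            joined = previous + " " + line
            for phrase in phrases:
                # matches in `previous` plus matches that straddle the break
                counts[phrase] += joined.count(phrase) - line.count(phrase)
        previous = line
    if previous is not None:
        # the final line has not been searched on its own yet
        for phrase in phrases:
            counts[phrase] += previous.count(phrase)
    return counts


phrase_words = ['red car', 'no lake', 'newjersey turnpike']
lines = ['i have a red car which i drove on newjersey',
         'turnpike. when i took exit 39 there was no',
         'lake. i drove my car on muddy roads which turned my red',
         'car into brown. driving on newjersey turnpike can be confusing.']
print(count_phrases(lines, phrase_words))

# With a large file (hypothetical name), iterate over the file object directly:
# with open('big.txt') as f:
#     print(count_phrases(f, phrase_words))
```

Unlike the loop above, this version also handles input with zero or one line, because the last line is tracked explicitly instead of relying on the loop variable.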

Regarding "Python: MemoryError when counting strings", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/7475387/
