python - Counting the words in a file using a Counter object

I have two Python files that count words and their frequencies:

import io
from collections import Counter

# Read the whole file into memory and split it on whitespace.
with io.open('JNb.txt', 'r', encoding='utf8') as infh:
    words = infh.read().split()
    # Open in binary append mode, since each line is encoded to UTF-8 bytes.
    with open('e1.txt', 'ab') as f:
        for word, count in Counter(words).most_common(10):
            f.write(u'{} {}\n'.format(word, count).encode('utf8'))


import io
from collections import Counter

with io.open('JNb.txt', 'r', encoding='utf8') as infh:
    for line in infh:
        # A new Counter is built for every line, so counts never span lines.
        words = line.split()
        with open('e1.txt', 'ab') as f:
            for word, count in Counter(words).most_common(10):
                f.write(u'{} {}\n'.format(word, count).encode('utf8'))

Neither gives the expected output.

The code has no syntax errors.

Output:

താത്കാലിക 1
- 1
ഒഴിവ് 1
അധ്യാപക 1
വാര്‍ത്തകള്‍ 1
ആലപ്പുഴ 1
ഇന്നത്തെപരിപാടി 1
വിവാഹം 1
അമ്പലപ്പുഴ 1

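The actual file contains 100 occurrences of these words.

I am not printing anything; I am writing everything to a file (e1).

Update: I tried another version and got results: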

import io
from collections import Counter

with io.open('JNb.txt', 'r', encoding='utf8') as infh:
    words = infh.read().split()
    # 'wb' truncates the file on each run and accepts the encoded bytes.
    with open('file.txt', 'wb') as f:
        for word, count in Counter(words).most_common(10000000):
            f.write(u'{} {}\n'.format(word, count).encode('utf8'))

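It can count files of up to 2 GB with 4 GB of RAM.

Is anything wrong here?

Best answer

I coded up the task; here is my solution.

I tested the program with a 5.1 GB text file. It finished in about 20 minutes on an MBP6.2.

Let me know if anything is confusing or if you have suggestions. Good luck.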

from collections import Counter
import io
import sys

if len(sys.argv) < 2:
    print("Provide an input file as argument")
    sys.exit(1)

cnt = Counter()

try:
    # Stream the file line by line so it never has to fit in memory at once.
    with io.open(sys.argv[1], 'r', encoding='utf-8') as f:
        for line in f:
            # A single Counter accumulates counts across all lines.
            for word in line.split():
                cnt[word] += 1
except FileNotFoundError:
    print("File not found")
    sys.exit(1)

# Print the 30 most common words with their count and share of all words.
total_word_count = sum(cnt.values())
for word, count in cnt.most_common(30):
    sys.stdout.write('{: < 6} {:<7.2%} {}\n'.format(
        count, count / total_word_count, word))

Output:

~ python countword.py CSW07.txt 
 79619 4.58%   [n]
 63717 3.67%   a
 56783 3.27%   of
 42341 2.44%   to
 40156 2.31%   the
 39295 2.26%   [v]
 38231 2.20%   [n
 36592 2.11%   -S]
 35250 2.03%   or
 17113 0.98%   in
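
One possible reading of the question's output, offered as a sketch and not a confirmed diagnosis: a script that builds a fresh Counter for every line can only ever report how often a word occurs within that one line, so most counts come out as 1, and append mode keeps stacking those per-line results in e1.txt. The helper names below (count_per_line, count_whole_input, write_counts) and the sample list are hypothetical and exist only for illustration; the sample words are taken from the output shown in the question.

```python
from collections import Counter
import io


def count_per_line(lines):
    """Build a new Counter for each line and collect every line's top 10.

    Counts can never exceed the number of repeats within a single line.
    """
    rows = []
    for line in lines:
        rows.extend(Counter(line.split()).most_common(10))
    return rows


def count_whole_input(lines):
    """Accumulate a single Counter across all lines."""
    total = Counter()
    for line in lines:
        total.update(line.split())
    return total


def write_counts(counter, path, limit=10):
    """Write the top `limit` words to a UTF-8 text file, replacing old content."""
    with io.open(path, 'w', encoding='utf8') as out:
        for word, count in counter.most_common(limit):
            out.write(u'{} {}\n'.format(word, count))


# Hypothetical three-line input in which one word appears on every line.
sample = [u'ആലപ്പുഴ വിവാഹം', u'ആലപ്പുഴ ഒഴിവ്', u'ആലപ്പുഴ']

print(count_per_line(sample))                    # every count is 1
print(count_whole_input(sample).most_common(1))  # the repeated word has count 3
```

To produce a file instead of console output, something like write_counts(count_whole_input(infh), 'e1.txt') could be called with infh opened as in the question. Opening the output with 'w' instead of 'a' means repeated runs do not pile up old results.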

For more on python - Counting the words in a file using a Counter object, see the similar question on Stack Overflow: https://stackoverflow.com/questions/21722955/
