python - Efficient way to build a word-occurrence list in Python


I want to build a data structure that records how many times each word occurs and lists the words in order of frequency.

For example:

word_1 => 10 occurrences

word_2 => 5 occurrences

word_3 => 12 occurrences

word_4 => 2 occurrences

Each word also has an id that represents it:

kw2id = {'word_1': 0, 'word_2': 1, 'word_3': 2, 'word_4': 3}

So the ordered list (ids sorted by descending number of occurrences) would be:

ordered_vocab = [2, 0, 1, 3]

My code currently looks like this:

# build a vocabulary with the number of occurrences
vocab = {}
count = 0
for line in open(DATASET_FILE):
    for word in line.split():
        if word in vocab:
            vocab[word] += 1
        else:
            vocab[word] = 1
    count += 1
    if not count % 100000:
        print(count, "documents processed")

How can I do this efficiently?

Best answer

This is exactly what Counter was designed for:

from collections import Counter
cnt = Counter()

with open(DATASET_FILE) as fp:
    for line in fp:  # iterate lazily instead of loading every line with readlines()
        for word in line.split():
            cnt[word] += 1

Or, shorter and "prettier", using a generator:

from collections import Counter

with open(DATASET_FILE) as fp:
    words = (word for line in fp for word in line.split())
    cnt = Counter(words)
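
The snippets above only produce the counts; the question also asks for kw2id and ordered_vocab. A minimal sketch of one way to derive both from the Counter follows. It assumes ids are assigned in order of first appearance (the original answer does not say how ids are chosen), and build_vocab and sample are hypothetical names introduced here for illustration. An in-memory list of lines stands in for DATASET_FILE.

```python
from collections import Counter


def build_vocab(lines):
    """Count words, assign ids in first-seen order, and rank ids by frequency.

    `lines` is any iterable of strings, e.g. an open file object.
    """
    cnt = Counter()
    for line in lines:
        # update() adds one count per word in the line
        cnt.update(line.split())

    # Counter is a dict subclass, so on Python 3.7+ it iterates in insertion
    # order. Assumption: ids follow the order in which words first appear.
    kw2id = {word: idx for idx, word in enumerate(cnt)}

    # most_common() returns (word, count) pairs sorted by descending count
    ordered_vocab = [kw2id[word] for word, _ in cnt.most_common()]
    return cnt, kw2id, ordered_vocab


# Hypothetical sample data reproducing the counts from the question
sample = ["word_1 " * 10, "word_2 " * 5, "word_3 " * 12, "word_4 " * 2]
cnt, kw2id, ordered_vocab = build_vocab(sample)
print(kw2id)
print(ordered_vocab)
```

With a real dataset, the same function could be called as build_vocab(fp) inside the with open(DATASET_FILE) as fp: block, since a file object yields its lines one at a time.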

This question, "python - Efficient way to build a word-occurrence list in Python", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/46917498/
