<分区>
I want to build a data structure that records how many times each word occurs and maps the words in the correct order.
For example:
word_1 => 10 occurrences
word_2 => 5 occurrences
word_3 => 12 occurrences
word_4 => 2 occurrences
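Each word also has an id that represents it:
kw2id = {'word_1': 0, 'word_2': 1, 'word_3': 2, 'word_4': 3}
So the ordered list (word ids sorted by descending number of occurrences) would be: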
ordered_vocab = [2, 0, 1, 3]
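To make the expected result concrete, here is a minimal sketch on the toy data above. The name `counts` is hypothetical (it stands in for the vocabulary built from the real dataset), and this only illustrates the desired output; it is not meant as the efficient solution I am asking about.

```python
# Toy data copied from the example above; `counts` is a hypothetical name.
counts = {'word_1': 10, 'word_2': 5, 'word_3': 12, 'word_4': 2}
kw2id = {'word_1': 0, 'word_2': 1, 'word_3': 2, 'word_4': 3}

# Sort the words by descending count, then replace each word with its id.
words_by_count = sorted(counts, key=counts.get, reverse=True)
ordered_vocab = [kw2id[word] for word in words_by_count]

print(ordered_vocab)  # [2, 0, 1, 3]
```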
My code so far looks like this:
# build a vocabulary with the number of occurrences of each word
vocab = {}
count = 0
with open(DATASET_FILE) as dataset:
    for line in dataset:  # one document per line
        for word in line.split():
            if word in vocab:
                vocab[word] += 1
            else:
                vocab[word] = 1
        count += 1
        # report progress every 100000 documents
        if not count % 100000:
            print(count, "documents processed")
How can I do this efficiently?