python - Creating ngrams from scikit-learn and CountVectorizer throws MemoryError

Tags: python memory numpy scikit-learn n-gram

I am using scikit-learn to build n-grams from multiple text documents. I need to build the document frequency using CountVectorizer.

Example:

document1 = "john is a nice guy"

document2 = "person can be a guy"

So the document frequency would be

{'be': 1,
 'can': 1,
 'guy': 2,
 'is': 1,
 'john': 1,
 'nice': 1,
 'person': 1}

Here the documents are just short strings, but when I try the same thing with a large amount of data, it throws a MemoryError.

Code:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
document = ['john is a guy', 'person guy']  # stand-in; the real input is around 7 MB of text
vectorizer = CountVectorizer(ngram_range=(1, 5))
X = vectorizer.fit_transform(document).todense()
tranformer = vectorizer.transform(document).todense()
matrix_terms = np.array(vectorizer.get_feature_names())
lst_freq =  map(sum,zip(*tranformer.A))          
matrix_freq = np.array(lst_freq)
final_matrix = np.array([matrix_terms,matrix_freq])

Error:

Traceback (most recent call last):
  File "demo1.py", line 13, in build_ngrams_matrix
    X = vectorizer.fit_transform(document).todense()
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.py", line 605, in todense
    return np.asmatrix(self.toarray(order=order, out=out))
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 901, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 269, in toarray
    B = self._process_toarray_args(order, out)
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.py", line 789, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError

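Accepted answer

As mentioned in the comments, you run into memory problems when you convert a large sparse matrix to dense format. Try something like this: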

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
document = ['john is a guy', 'person guy']  # stand-in; the real input is around 7 MB of text
vectorizer = CountVectorizer(ngram_range=(1, 5))

# Don't need both X and transformer; they should be identical
X = vectorizer.fit_transform(document)
# On scikit-learn 1.2 and later, call get_feature_names_out() instead
matrix_terms = np.array(vectorizer.get_feature_names())

# Use the axis keyword to sum over rows
matrix_freq = np.asarray(X.sum(axis=0)).ravel()
final_matrix = np.array([matrix_terms,matrix_freq])
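
To see why the dense conversion is the expensive step, it can help to compare the size a dense array would need with what the sparse matrix actually stores. The following is only a rough sketch on a toy corpus (the `documents` list and the variable names are assumptions for illustration, not part of the original code); on real data the gap depends on how many distinct n-grams the corpus produces.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the real data (assumption for illustration)
documents = ["john is a nice guy", "person can be a guy"]

vectorizer = CountVectorizer(ngram_range=(1, 5))
X = vectorizer.fit_transform(documents)  # scipy.sparse matrix in CSR format

n_docs, n_terms = X.shape

# Bytes a dense array of the same shape and dtype would need:
# one cell per (document, n-gram) pair, zeros included
dense_bytes = n_docs * n_terms * X.dtype.itemsize

# Bytes held by the CSR representation: only the non-zero entries
# plus their column indices and the row pointer array
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

print("shape:", X.shape, "stored non-zeros:", X.nnz)
print("dense bytes:", dense_bytes, "sparse bytes:", sparse_bytes)
```

The dense size grows with the number of documents times the vocabulary size, while the sparse size grows only with the number of n-grams that actually occur, which is why summing on the sparse matrix avoids the error.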

Edit: If you want a dictionary mapping terms to frequencies, try this after calling fit_transform:

terms = vectorizer.get_feature_names()
freqs = X.sum(axis=0).A1
result = dict(zip(terms, freqs))
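
Note that summing the raw counts gives the total number of occurrences of each term, which equals the document frequency only when no term repeats inside a document. If a true document frequency is what you need, one possible approach is to pass `binary=True` so that each document contributes at most 1 per term. The sketch below is an assumption-laden illustration: it uses the two example documents from the question, unigrams only so that the output is comparable to the question's example, and a `hasattr` check because the name of the feature-name method differs between scikit-learn versions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# The two example documents from the question
documents = ["john is a nice guy", "person can be a guy"]

# binary=True records presence (1) rather than the number of occurrences
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(documents)

# Newer scikit-learn releases provide get_feature_names_out();
# older ones only have get_feature_names()
if hasattr(vectorizer, "get_feature_names_out"):
    terms = vectorizer.get_feature_names_out()
else:
    terms = vectorizer.get_feature_names()

# Column sums of the presence matrix = number of documents containing each term
doc_freq = np.asarray(X.sum(axis=0)).ravel()

# Convert NumPy scalars to plain Python types for a clean dictionary
result = {str(term): int(count) for term, count in zip(terms, doc_freq)}
print(result)
```

The single-character token "a" does not appear in the result because the default `token_pattern` only keeps tokens of two or more characters. Pass `ngram_range=(1, 5)` as in the question to count longer n-grams the same way.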

The original question, "creating ngrams from scikit-learn and CountVectorizer throws MemoryError", is on Stack Overflow: https://stackoverflow.com/questions/26887745/
