python-2.7 - MemoryError when doing machine learning with Python Pandas

Tags: python-2.7, memory, pandas, machine-learning, out-of-memory

I am trying to train/test a machine-learning model by sampling 100,000 rows from a larger DataFrame. Random samples of 30,000-60,000 rows run with the expected output, but when I increase the sample to 100,000+ rows I get a MemoryError.

# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
import re
import random
from random import randint
import csv
import dask.dataframe as dd
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import Imputer

lr = LogisticRegression()
dv = DictVectorizer()
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)

# Get csv file into data frame
data = pd.read_csv("file.csv", header=0, encoding="utf-8")
df = DataFrame(data)

# Random sampling a smaller dataframe for debugging
rows = random.sample(df.index, 100000)
df = df.ix[rows] # Warning!!!! overwriting original df

# Assign X and y variables
X = df.raw_name.values
y = df.ethnicity2.values

# Feature extraction functions
def feature_full_last_name(nameString):
    try:
        last_name = nameString.rsplit(None, 1)[-1]
        if len(last_name) > 1: # not accept name with only 1 character
            return last_name
        else: return '?'
    except: return '?'

# Transform format of X variables, and spit out a numpy array for all features
my_dict = [{'last-name': feature_full_last_name(i)} for i in X]

all_dict = my_dict

newX = dv.fit_transform(all_dict).toarray()

# Separate the training and testing data sets
half_cut = int(len(df)/2.0)*-1
X_train = newX[:half_cut]
X_test = newX[half_cut:]
y_train = y[:half_cut]
y_test = y[half_cut:]

# Fitting X and y into model, using training data
lr.fit(X_train, y_train)

# Making predictions using trained data
y_train_predictions = lr.predict(X_train)
y_test_predictions = lr.predict(X_test)

print (y_train_predictions == y_train).sum().astype(float)/(y_train.shape[0])
print (y_test_predictions == y_test).sum().astype(float)/(y_test.shape[0])

The error is:

Traceback (most recent call last):
  File "C:\Users\Dropbox\Python_Exercises\_Scraping\BeautifulSoup\FamilySearch.org\FamSearch_Analysis\MachineLearning\FamSearch_LogReg_GOOD8.py", line 93, in <module>
    newX = dv.fit_transform(all_dict).toarray()
  File "E:\Program Files Extra\Python27\lib\site-packages\scipy\sparse\compressed.py", line 942, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "E:\Program Files Extra\Python27\lib\site-packages\scipy\sparse\coo.py", line 274, in toarray
    B = self._process_toarray_args(order, out)
  File "E:\Program Files Extra\Python27\lib\site-packages\scipy\sparse\base.py", line 793, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError

Best Answer

This looks wrong:

newX = dv.fit_transform(all_dict).toarray()

Almost all estimators in scikit-learn support sparse input, yet here you are converting a sparse matrix into a dense one. A dense array allocates memory for every cell (n_samples × n_features), which for a one-hot encoding of 100,000 names is almost entirely zeros, so of course it exhausts memory. You should avoid the todense() and toarray() calls in your code and pass the sparse matrix to the estimator directly.
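A minimal sketch of that fix, using a few toy feature dicts in place of the asker's `my_dict` (the names `feature_dicts` and `labels` are illustrative, not from the original code): keep the `DictVectorizer` output sparse and fit `LogisticRegression` on it directly.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy feature dicts standing in for the asker's last-name features
feature_dicts = [
    {'last-name': 'smith'},
    {'last-name': 'jones'},
    {'last-name': 'smith'},
    {'last-name': 'lee'},
]
labels = ['a', 'b', 'a', 'b']

dv = DictVectorizer()
# fit_transform returns a scipy.sparse matrix; do NOT call .toarray() on it
X_sparse = dv.fit_transform(feature_dicts)

lr = LogisticRegression()
lr.fit(X_sparse, labels)  # LogisticRegression accepts sparse input directly
predictions = lr.predict(X_sparse)
```

`DictVectorizer` produces a sparse matrix by default (`sparse=True`), and `LogisticRegression` consumes it as-is, so memory scales with the number of nonzero entries rather than with n_samples × n_features.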

Regarding "python-2.7 - MemoryError when doing machine learning with Python Pandas", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/33161898/
