python - Problem vectorizing specific columns with scikit-learn DictVectorizer?

Tags: python python-2.7 pandas machine-learning scikit-learn

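I want to understand how to do a simple prediction task, and I am playing with this dataset, which is also available here in a different format. It is about students' performance in some classes. I would like to vectorize only some columns of the dataset rather than use all of the data (just to understand how it works). So I tried the following, using DictVectorizer: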

import pandas as pd
from sklearn.feature_extraction import DictVectorizer

training_data = pd.read_csv('/Users/user/Downloads/student/student-mat.csv')

dict_vect = DictVectorizer(sparse=False)

training_matrix = dict_vect.fit_transform(training_data['G1','G2','sex','school','age'])
training_matrix.toarray()

Then I want to pass another set of feature rows, as follows:

testing_data = pd.read_csv('/Users/user/Downloads/student/student-mat_test.csv')
test_matrix = dict_vect.transform(testing_data['G1','G2','sex','school','age'])

The problem is that I get the following traceback:

/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 school_2.py
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/PAN-pruebas/escuela_2.py", line 14, in <module>
    X = dict_vect.fit_transform(df['sex','age','address','G1','G2'].values)
  File "school_2.py", line 1787, in __getitem__
    return self._getitem_column(key)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 1794, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 1079, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 2843, in get
    loc = self.items.get_loc(item)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/index.py", line 1437, in get_loc
    return self._engine.get_loc(_values_from_object(key))
  File "pandas/index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)
  File "pandas/hashtable.pyx", line 697, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12349)
  File "pandas/hashtable.pyx", line 705, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12300)
KeyError: ('sex', 'age', 'address', 'G1', 'G2')

Process finished with exit code 1

Any idea how to correctly vectorize both datasets (i.e. training and testing), and how to display both matrices with .toarray()?

Update

>>>print training_data.info()
/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/PAN-pruebas/escuela_3.py
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 396 entries, (school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, absences) to (MS, M, 19, U, LE3, T, 1, 1, other, at_home, course, father, 1, 1, 0, no, no, no, no, yes, yes, yes, no, 3, 2, 3, 3, 3, 5, 5)
Data columns (total 3 columns):
id         396 non-null object
content    396 non-null object
label      396 non-null object
dtypes: object(3)
memory usage: 22.7+ KB
None

Process finished with exit code 0

Best answer

You need to pass a list:

test_matrix = dict_vect.transform(testing_data[['G1','G2','sex','school','age']])

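What you did was try to index the df with the following key (Python packs the comma-separated names inside the brackets into a single tuple, as the KeyError message shows):

('G1','G2','sex','school','age')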

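That is why you get a KeyError: there is no single column with that name. To select multiple columns, pass a list of column names, which gives the double-bracket form [[col_list]].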
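
To see what key the indexing operator actually receives, here is a minimal sketch that does not depend on pandas. The class name ShowKey and the variable probe are hypothetical, made up purely for illustration; the only point is that comma-separated names arrive as one tuple, while a bracketed list arrives as a list.

```python
class ShowKey:
    """Hypothetical helper: returns whatever key is passed to obj[...]."""

    def __getitem__(self, key):
        return key


probe = ShowKey()

# Comma-separated names are packed into a single tuple key.
print(type(probe['a', 'b']).__name__, probe['a', 'b'])

# A list inside the brackets is passed through as a list.
print(type(probe[['a', 'b']]).__name__, probe[['a', 'b']])
```

pandas treats the tuple as the label of one column (which does not exist here), and the list as a request for several columns.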

Example:

In [43]:

df = pd.DataFrame(columns=['a','b'])
df
Out[43]:
Empty DataFrame
Columns: [a, b]
Index: []
In [44]:

df['a','b']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-44-33332c7e7227> in <module>()
----> 1 df['a','b']

......    
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12349)()

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12300)()

KeyError: ('a', 'b')

But this works:

In [45]:

df[['a','b']]
Out[45]:
Empty DataFrame
Columns: [a, b]
Index: []
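
Putting this together for the question's use case, the following is a rough sketch rather than a definitive fix. It rests on a few assumptions: the small in-memory DataFrames are made-up stand-ins for the CSV files, the cols variable is introduced only for readability, and the code is written for Python 3. As far as I know, DictVectorizer expects an iterable of dict-like rows rather than a DataFrame, so the selected columns are converted with to_dict(orient='records') first.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Made-up stand-in rows; the real data would come from pd.read_csv(...).
training_data = pd.DataFrame({
    'G1': [5, 12, 15],
    'G2': [6, 11, 14],
    'sex': ['F', 'M', 'F'],
    'school': ['GP', 'GP', 'MS'],
    'age': [18, 17, 19],
    'address': ['U', 'U', 'R'],  # present but deliberately not selected
})
testing_data = pd.DataFrame({
    'G1': [10, 8],
    'G2': [9, 8],
    'sex': ['M', 'F'],
    'school': ['MS', 'GP'],
    'age': [16, 18],
    'address': ['R', 'U'],
})

# A list of column names selects several columns at once.
cols = ['G1', 'G2', 'sex', 'school', 'age']

# One dict per row, e.g. {'G1': 5, 'G2': 6, 'sex': 'F', ...}.
train_records = training_data[cols].to_dict(orient='records')
test_records = testing_data[cols].to_dict(orient='records')

# sparse=False makes the vectorizer return dense NumPy arrays.
dict_vect = DictVectorizer(sparse=False)
training_matrix = dict_vect.fit_transform(train_records)  # learn the features
test_matrix = dict_vect.transform(test_records)           # reuse them

print(dict_vect.feature_names_)
print(training_matrix)
print(test_matrix)
```

Numeric columns are passed through unchanged and string columns are one-hot encoded as name=value features. With sparse=False the results are already dense arrays, so .toarray() is not needed; it would only apply to the sparse matrices returned when sparse=True.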

Regarding python - Problem vectorizing specific columns with scikit-learn DictVectorizer?, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/29975033/
