我尝试用 python 和 sklearn 开始决策树。 工作方式是这样的:
import pandas as pd
from sklearn import tree
for col in set(train.columns):
if train[col].dtype == np.dtype('object'):
s = np.unique(train[col].values)
mapping = pd.Series([x[0] for x in enumerate(s)], index = s)
train_fea = train_fea.join(train[col].map(mapping))
else:
train_fea = train_fea.join(train[col])
dt = tree.DecisionTreeClassifier(min_samples_split=3,
compute_importances=True,max_depth=5)
dt.fit(train_fea, labels)
现在我尝试用 DictVectorizer 做同样的事情,但我的代码不起作用:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
train_fea = vec.fit_transform([dict(enumerate(sample)) for sample in train])
dt = tree.DecisionTreeClassifier(min_samples_split=3,
compute_importances=True,max_depth=5)
dt.fit(train_fea, labels)
最后一行出现错误:“ValueError:标签数=332448 与样本数=55 不匹配”。正如我从文档中了解到的那样,DictVectorize 旨在将标称特征转换为数字特征。我做错了什么?
更正(感谢 ogrisel 促使我制作完整示例):
import pandas as pd
import numpy as np
from sklearn import tree
##################################
# working example
train = pd.DataFrame({'a' : ['a', 'b', 'a'], 'd' : ['e', 'e', 'f'],
'b' : [0, 1, 1], 'c' : ['b', 'c', 'b']})
columns = set(train.columns)
columns.remove('b')
train_fea = train[['b']]
for col in columns:
if train[col].dtype == np.dtype('object'):
s = np.unique(train[col].values)
mapping = pd.Series([x[0] for x in enumerate(s)], index = s)
train_fea = train_fea.join(train[col].map(mapping))
else:
train_fea = train_fea.join(train[col])
dt = tree.DecisionTreeClassifier(min_samples_split=3,
compute_importances=True,max_depth=5)
dt.fit(train_fea, train['c'])
##########################################
# example with DictVectorizer and error
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
train_fea = vec.fit_transform([dict(enumerate(sample)) for sample in train])
dt = tree.DecisionTreeClassifier(min_samples_split=3,
compute_importances=True,max_depth=5)
dt.fit(train_fea, train['c'])
在 ogrisel 的帮助下修复了最后的代码:
import pandas as pd
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer
from sklearn import preprocessing
train = pd.DataFrame({'a' : ['a', 'b', 'a'], 'd' : ['e', 'x', 'f'],
'b' : [0, 1, 1], 'c' : ['b', 'c', 'b']})
# encode labels
labels = train[['c']]
le = preprocessing.LabelEncoder()
labels_fea = le.fit_transform(labels)
# vectorize training data
del train['c']
train_as_dicts = [dict(r.iteritems()) for _, r in train.iterrows()]
train_fea = DictVectorizer(sparse=False).fit_transform(train_as_dicts)
# use decision tree
dt = tree.DecisionTreeClassifier()
dt.fit(train_fea, labels_fea)
# transform result
predictions = le.inverse_transform(dt.predict(train_fea).astype('I'))
predictions_as_dataframe = train.join(pd.DataFrame({"Prediction": predictions}))
print predictions_as_dataframe
一切正常
最佳答案
您枚举样本的方式没有意义。只需打印它们以使其显而易见:
>>> import pandas as pd
>>> train = pd.DataFrame({'a' : ['a', 'b', 'a'], 'd' : ['e', 'e', 'f'],
... 'b' : [0, 1, 1], 'c' : ['b', 'c', 'b']})
>>> samples = [dict(enumerate(sample)) for sample in train]
>>> samples
[{0: 'a'}, {0: 'b'}, {0: 'c'}, {0: 'd'}]
现在这在语法上是一个字典列表,但与您所期望的完全不同。尝试这样做:
>>> train_as_dicts = [dict(r.iteritems()) for _, r in train.iterrows()]
>>> train_as_dicts
[{'a': 'a', 'c': 'b', 'b': 0, 'd': 'e'},
{'a': 'b', 'c': 'c', 'b': 1, 'd': 'e'},
{'a': 'a', 'c': 'b', 'b': 1, 'd': 'f'}]
这看起来好多了,现在让我们尝试向量化这些字典:
>>> from sklearn.feature_extraction import DictVectorizer
>>> vectorizer = DictVectorizer()
>>> vectorized_sparse = vectorizer.fit_transform(train_as_dicts)
>>> vectorized_sparse
<3x7 sparse matrix of type '<type 'numpy.float64'>'
with 12 stored elements in Compressed Sparse Row format>
>>> vectorized_array = vectorized_sparse.toarray()
>>> vectorized_array
array([[ 1., 0., 0., 1., 0., 1., 0.],
[ 0., 1., 1., 0., 1., 1., 0.],
[ 1., 0., 1., 1., 0., 0., 1.]])
要了解每一列的含义,请询问向量化器:
>>> vectorizer.get_feature_names()
['a=a', 'a=b', 'b', 'c=b', 'c=c', 'd=e', 'd=f']
关于python - 将 DictVectorizer 与 sklearn DecisionTreeClassifier 结合使用,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/15181311/