There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says this:
Some advantages of decision trees are:
(...)
Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See the algorithms for more information.
But running the following script
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']
tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])
outputs the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
X = check_array(X, dtype=DTYPE, accept_sparse="csc")
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b
I know that in R it is possible to pass categorical data directly. Is that possible with Sklearn?
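Best answer
(This is just a reformat of my comment above from 2016... it still holds true.)
The accepted answer for this question is misleading.
As it stands, sklearn decision trees do not handle categorical data - see issue #5442.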
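The recommended approach of using label encoding converts the categories to integers, which DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is not good - you will end up with splits that do not make sense.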
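To make the ordering problem concrete, here is a minimal sketch. It assumes scikit-learn 0.20 or later (for OrdinalEncoder) and uses a made-up `color` column that is not part of the question's data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical nominal feature: there is no natural order among the colors.
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

encoder = OrdinalEncoder()
codes = encoder.fit_transform(colors).ravel()

# Categories are sorted alphabetically, so blue=0, green=1, red=2.
print(list(encoder.categories_[0]))
print(list(codes))

# A tree only tests "code <= threshold", so one split can separate
# {blue} from {green, red} or {blue, green} from {red}, but it can never
# isolate green alone. That restriction comes from the arbitrary
# alphabetical order, not from anything in the data.
```

A deeper tree can still isolate green with two splits, so this costs extra depth and can produce splits that are hard to interpret rather than making the model impossible to fit.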
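Using a OneHotEncoder is currently the only valid way: it allows arbitrary splits that do not depend on the label ordering, but it is computationally expensive.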
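As a rough sketch of the one-hot approach applied to the data from the question: this assumes scikit-learn 0.20 or later (for ColumnTransformer and string support in OneHotEncoder), which is newer than the Python 2.7 installation shown in the traceback. The names `preprocess` and `model` are just illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the question.
data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})
features = data[["A", "B", "C"]]

# One-hot encode the string columns; pass the numeric column through as-is.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["A", "B"])],
    remainder="passthrough",
)

model = make_pipeline(preprocess, DecisionTreeClassifier(random_state=0))
model.fit(features, data["Class"])

predictions = model.predict(features)
print(list(predictions))
```

Keeping the encoder inside the pipeline means the same encoding is applied at prediction time, and `handle_unknown="ignore"` encodes categories unseen during fitting as all zeros instead of raising an error.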
Source: the Stack Overflow question "python - Passing categorical data to Sklearn Decision Tree": https://stackoverflow.com/questions/38108832/