I have a dataset containing integers, floats, and strings. I (think I) converted all the strings to categories with the following statements:
for col in list(X):
    if X[col].dtype == np.object_:  # dtype('object')
        X[col] = X[col].str.lower().astype('category', copy=False)
However, when I try to feed the data to a random forest model, I get this error:
ValueError: could not convert string to float: 'non-compliant by no payment'
The string 'non-compliant by no payment' appears in the column named X['compliance_detail'], and when I ask for its dtype, I get category. When I ask for its values:
In[111]: X['compliance_detail'].dtype
Out[111]: category
In[112]: X['compliance_detail'].value_counts()
Out[112]:
non-compliant by no payment 5274
non-compliant by late payment more than 1 month 939
compliant by late payment within 1 month 554
compliant by on-time payment 374
compliant by early payment 10
compliant by payment with no scheduled hearing 7
compliant by payment on unknown date 3
Name: compliance_detail, dtype: int64
Does anyone know what is going on here? Why is there a string in categorical data? Why does this column list a dtype of int64?
Thanks for your time.
Best answer
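When you convert to a categorical dtype, the column still retains its original representation (the values are still strings), but pandas keeps track of the categories.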
s
0 foo
1 bar
2 foo
3 bar
4 foo
5 bar
6 foo
7 foo
Name: A, dtype: object
s = s.astype('category')
s
0 foo
1 bar
2 foo
3 bar
4 foo
5 bar
6 foo
7 foo
Name: A, dtype: category
Categories (2, object): [bar, foo]
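To make this concrete, here is a minimal sketch using the same toy values as above (the variable names are illustrative, not from the original post). It suggests two things: the elements of a categorical Series are still strings, so a float conversion of the kind an estimator attempts fails, and the integer dtype printed by value_counts() describes the counts, not the column itself.

```python
import numpy as np
import pandas as pd

s = pd.Series(["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"], name="A")
cat = s.astype("category")

# The dtype is now category, but each element is still a Python string.
first_value = cat.iloc[0]
print(cat.dtype, type(first_value))

# Asking for a float array fails because the labels are strings.
try:
    np.asarray(cat, dtype=float)
    conversion_failed = False
except (ValueError, TypeError) as exc:
    conversion_failed = True
    print("conversion failed:", exc)

# value_counts() returns a new Series of counts, so its dtype is an
# integer dtype regardless of the dtype of the column being counted.
counts = cat.value_counts()
print(counts.dtype)
```

This is why the dtype: int64 line at the bottom of the value_counts() output in the question is not evidence that the column itself is numeric.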
If you want integer categories, you have a couple of options:
Option 1: cat.codes
s.cat.codes
0 1
1 0
2 1
3 0
4 1
5 0
6 1
7 1
dtype: int8
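As a rough sketch of how this could be applied to a whole DataFrame before fitting a model, the example below replaces every categorical column with its integer codes. The DataFrame here is a hypothetical stand-in for the question's X, and the names X_encoded and label_maps are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the question's X: one numeric, one string column.
X = pd.DataFrame({
    "amount": [10.0, 25.5, 7.0, 12.0],
    "compliance_detail": [
        "non-compliant by no payment",
        "compliant by on-time payment",
        "non-compliant by no payment",
        "compliant by early payment",
    ],
})
X["compliance_detail"] = X["compliance_detail"].str.lower().astype("category")

label_maps = {}
X_encoded = X.copy()
for col in X_encoded.select_dtypes(include="category").columns:
    # Remember which integer code stands for which label.
    label_maps[col] = dict(enumerate(X_encoded[col].cat.categories))
    # Replace the string labels with their integer codes.
    X_encoded[col] = X_encoded[col].cat.codes

print(X_encoded.dtypes)
print(label_maps)
```

Note that cat.codes assigns -1 to missing values, so it may be worth checking for NaN before encoding.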
Option 2: pd.factorize (no astype needed)
pd.factorize(s)[0]
array([0, 1, 0, 1, 0, 1, 0, 0])
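The codes above differ from the cat.codes result because pd.factorize numbers the labels in order of first appearance, whereas the categories are sorted. The following sketch, again on the toy Series with illustrative variable names, shows the second return value (the unique labels) and the sort=True parameter, which should make the two encodings agree for this data.

```python
import pandas as pd

s = pd.Series(["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"], name="A")

# factorize returns (codes, uniques); codes follow order of first appearance.
codes, uniques = pd.factorize(s)
print(codes, list(uniques))

# With sort=True the unique labels are sorted before codes are assigned,
# which matches the codes produced by the categorical dtype for this data.
sorted_codes, sorted_uniques = pd.factorize(s, sort=True)
cat_codes = s.astype("category").cat.codes
print(sorted_codes, list(sorted_uniques))

# uniques can be used to map the integer codes back to the original labels.
recovered = uniques.take(codes)
print(list(recovered))
```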
Source: the Stack Overflow question "python - Why are category columns treated as string columns in pandas?": https://stackoverflow.com/questions/46516244/