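I'm running into a simple error in the code below.
My goal is to use SimpleImputer to impute missing values across columns of different data types in one pass.
When I try this, fit_transform does not seem to work as expected. Without the dtype argument the code runs fine, but the resulting DataFrame loses its dtype information. When I pass the list of dtypes via the dtype argument, I get the error shown below. You should be able to reproduce it by copying and pasting the code locally.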
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
import sklearn
print(sklearn.__version__)
0.21.dev0
data = [['Alex','NJ',21,5.10],['Mary','NY',20,np.nan],['Sam',np.nan,np.nan,6.3]]
df = pd.DataFrame(data,columns=['Name','State','Age','Height'])
df.dtypes
Name object
State object
Age float64
Height float64
dtype: object
imp = SimpleImputer(strategy="most_frequent")
#df = pd.DataFrame(imp.fit_transform(df),columns=df.columns) <<<<----- This works just fine
#df
#Name State Age Height
#0 Alex NJ 21 5.1
#1 Mary NY 20 5.1
#2 Sam NJ 20 6.3
#df.dtypes
#Name object
#State object
#Age object
#Height object
#dtype: object
The following statement fails with the error listed below (I am trying to preserve the dtypes through imputation):
df = pd.DataFrame(imp.fit_transform(df),columns=df.columns,dtype=df.dtypes)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-23-e9780979921f> in <module>()
7
8 imp = SimpleImputer(strategy="most_frequent")
----> 9 df = pd.DataFrame(imp.fit_transform(df),columns=df.columns,dtype=df.dtypes)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
337 data = {}
338 if dtype is not None:
--> 339 dtype = self._validate_dtype(dtype)
340
341 if isinstance(data, DataFrame):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _validate_dtype(self, dtype)
166
167 if dtype is not None:
--> 168 dtype = pandas_dtype(dtype)
169
170 # a compound dtype
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\dtypes\common.py in pandas_dtype(dtype)
2020 # which we safeguard against by catching them earlier and returning
2021 # np.dtype(valid_dtype) before this condition is evaluated.
-> 2022 if dtype in [object, np.object_, 'object', 'O']:
2023 return npdtype
2024 elif npdtype.kind == 'O':
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1574 raise ValueError("The truth value of a {0} is ambiguous. "
1575 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1576 .format(self.__class__.__name__))
1577
1578 __bool__ = __nonzero__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Best answer
If you want to preserve the dtypes, I suggest finding the mode with pandas and then calling fillna:
df = df.fillna(df.agg(lambda x: pd.Series.mode(x)[0], axis=0))
print(df)
Name State Age Height
0 Alex NJ 21.0 5.1
1 Mary NY 20.0 5.1
2 Sam NJ 20.0 6.3
print(df.dtypes)
Name object
State object
Age float64
Height float64
dtype: object
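The same idea can be sketched with DataFrame.mode(), which may read more directly than the agg/lambda form. This is only an illustration on the question's sample data; the names first_modes and filled are introduced here and are not part of the original answer.

```python
import numpy as np
import pandas as pd

data = [["Alex", "NJ", 21, 5.10], ["Mary", "NY", 20, np.nan], ["Sam", np.nan, np.nan, 6.3]]
df = pd.DataFrame(data, columns=["Name", "State", "Age", "Height"])

# DataFrame.mode() returns one row per tied mode, sorted within each column.
# .iloc[0] keeps the first mode of every column as a Series indexed by column name.
first_modes = df.mode().iloc[0]

# fillna with a Series fills each column using the value stored under its own label,
# so the numeric columns never pass through an object array.
filled = df.fillna(first_modes)

print(filled)
print(filled.dtypes)
```

Because the frame is never converted to a NumPy array, there is no dtype to restore afterwards.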
Alternatively, use astype and pass a dictionary:
df = pd.DataFrame(
    imp.fit_transform(df), columns=df.columns
).astype(df.dtypes.to_dict())
print(df)
Name State Age Height
0 Alex NJ 21.0 5.1
1 Mary NY 20.0 5.1
2 Sam NJ 20.0 6.3
print(df.dtypes)
Name object
State object
Age float64
Height float64
dtype: object
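If this pattern is needed in more than one place, it could be wrapped in a small function. The sketch below assumes a helper named impute_keep_dtypes, which is hypothetical and not part of pandas or scikit-learn. It also passes the original index along, which the snippet above does not do.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_keep_dtypes(frame, imputer):
    """Hypothetical helper: impute, then cast each column back to its original dtype."""
    # For mixed columns the imputer typically returns a single object array,
    # which is why the dtypes have to be restored afterwards.
    imputed = imputer.fit_transform(frame)
    restored = pd.DataFrame(imputed, columns=frame.columns, index=frame.index)
    # astype accepts a {column: dtype} mapping, unlike the DataFrame constructor.
    return restored.astype(frame.dtypes.to_dict())


data = [["Alex", "NJ", 21, 5.10], ["Mary", "NY", 20, np.nan], ["Sam", np.nan, np.nan, 6.3]]
df = pd.DataFrame(data, columns=["Name", "State", "Age", "Height"])

result = impute_keep_dtypes(df, SimpleImputer(strategy="most_frequent"))
print(result)
print(result.dtypes)
```

The cast can only succeed if the imputed values fit the original dtype, so this is a sketch for cases like the one in the question rather than a general-purpose utility.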
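The explicit astype call is needed because, according to the documentation, the pd.DataFrame constructor accepts only a single dtype. Passing the df.dtypes Series instead makes pandas evaluate that Series in a boolean context, which is what raises the ValueError in the traceback above.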
?pd.DataFrame
...
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed.
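A small sketch of that constraint, using made-up two-column data rather than the question's frame. The variable names are illustrative, and the exact exception type is an assumption that depends on the pandas version: the traceback in the question shows ValueError, while newer releases appear to raise TypeError, so the example catches both.

```python
import numpy as np
import pandas as pd

# Stand-in for an imputer's output: one object array holding mixed values.
values = np.array([["Alex", 21.0], ["Mary", 20.0]], dtype=object)
target_dtypes = pd.Series({"Name": np.dtype(object), "Age": np.dtype("float64")})

# A single dtype is accepted and applied to every column.
as_object = pd.DataFrame(values, columns=target_dtypes.index, dtype=object)

# A Series of per-column dtypes is rejected by the constructor.
try:
    pd.DataFrame(values, columns=target_dtypes.index, dtype=target_dtypes)
    constructor_accepted_series = True
except (TypeError, ValueError) as exc:
    constructor_accepted_series = False
    print(type(exc).__name__, exc)

# Per-column dtypes go through astype with a {column: dtype} mapping instead.
typed = pd.DataFrame(values, columns=target_dtypes.index).astype(target_dtypes.to_dict())
print(typed.dtypes)
```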
Regarding preserving data types with pandas and SimpleImputer, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53254292/