python - ValueError: invalid fill value with a <class 'pandas.core.frame.DataFrame'>

Tags: python python-2.7 pandas machine-learning

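I am working through the Loan Prediction practice problem and trying to fill the missing values in the data. I got the data from here. To work through the problem, I am following this tutorial.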

You can find the complete code I am using (file name model.py) and the data here on GitHub.

The DataFrame looks like this:

df[['Loan_ID', 'Self_Employed', 'Education', 'LoanAmount']].head(10)
Out: 
    Loan_ID Self_Employed     Education  LoanAmount
0  LP001002            No      Graduate         NaN
1  LP001003            No      Graduate       128.0
2  LP001005           Yes      Graduate        66.0
3  LP001006            No  Not Graduate       120.0
4  LP001008            No      Graduate       141.0
5  LP001011           Yes      Graduate       267.0
6  LP001013            No  Not Graduate        95.0
7  LP001014            No      Graduate       158.0
8  LP001018            No      Graduate       168.0
9  LP001020            No      Graduate       349.0

After the last line below is executed (it corresponds to line 60 of model.py):

import numpy as np
import pandas as pd

url = 'https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'
df = pd.read_csv(url) 
df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
df['Self_Employed'].fillna('No',inplace=True)

table = df.pivot_table(values='LoanAmount', index='Self_Employed' ,columns='Education', aggfunc=np.median)
# Define function to return value of this pivot_table
def fage(x):
 return table.loc[x['Self_Employed'],x['Education']]
# Replace missing values
df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)

I get this error:

ValueError                                Traceback (most recent call last)
<ipython-input-40-5146e49c2460> in <module>()
----> 1 df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)

/usr/local/lib/python2.7/dist-packages/pandas/core/series.pyc in fillna(self, value, method, axis, inplace, limit, downcast, **kwargs)
   2368                                           axis=axis, inplace=inplace,
   2369                                           limit=limit, downcast=downcast,
-> 2370                                           **kwargs)
   2371 
   2372     @Appender(generic._shared_docs['shift'] % _shared_doc_kwargs)

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in fillna(self, value, method, axis, inplace, limit, downcast)
   3264                 else:
   3265                     raise ValueError("invalid fill value with a %s" %
-> 3266                                      type(value))
   3267 
   3268                 new_data = self._data.fillna(value=value, limit=limit,

ValueError: invalid fill value with a <class 'pandas.core.frame.DataFrame'>

How can I fill the missing values without getting this error?

Best answer

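It seems the author of the tutorial wanted to replace each NaN with the matching value from table.

For that, table first has to be turned into a Series with unstack, and set_index is needed on df so that the data can be aligned.

First remove the line that replaces NaN with the mean (otherwise no NaN is left to fill):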

url='https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'

df = pd.read_csv(url) #Reading the dataset in a dataframe using Pandas

#df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)

df['Self_Employed'].fillna('No',inplace=True)
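Why the mean fill has to go can be sketched on a toy frame. The column names follow the question, but the data below is made up for illustration. Once the mean fill has run, no NaN is left in LoanAmount, so the filtered frame is empty, and apply with axis=1 on an empty DataFrame appears to hand back an empty DataFrame rather than a Series. That would explain the type named in the traceback.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the loan dataset.
toy = pd.DataFrame({
    "Self_Employed": ["No", "Yes", "No", "Yes"],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Not Graduate"],
    "LoanAmount": [100.0, np.nan, 80.0, 120.0],
})

# Filling with the overall mean first leaves no NaN behind.
toy["LoanAmount"] = toy["LoanAmount"].fillna(toy["LoanAmount"].mean())

table = toy.pivot_table(values="LoanAmount", index="Self_Employed",
                        columns="Education", aggfunc="median")

def fage(row):
    # Look up the group median for this row's (Self_Employed, Education) pair.
    return table.loc[row["Self_Employed"], row["Education"]]

missing_rows = toy[toy["LoanAmount"].isnull()]  # empty: nothing left to fill
fill_values = missing_rows.apply(fage, axis=1)

print(len(missing_rows))
print(type(fill_values))
```

If the printed type is a DataFrame, passing it to Series.fillna reproduces the situation in the question.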
table = df.pivot_table(values='LoanAmount', 
                       index='Self_Employed', 
                       columns='Education', 
                       aggfunc=np.median)

print (table.unstack())
Education     Self_Employed
Graduate      No               130.0
              Yes              157.5
Not Graduate  No               113.0
              Yes              130.0
dtype: float64
#check all values with NaN in LoanAmount column
print (df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']])
    Self_Employed     Education  LoanAmount
0              No      Graduate         NaN
35             No      Graduate         NaN
63             No      Graduate         NaN
81            Yes      Graduate         NaN
95             No      Graduate         NaN
102            No      Graduate         NaN
103            No      Graduate         NaN
113           Yes      Graduate         NaN
127            No      Graduate         NaN
202            No  Not Graduate         NaN
284            No      Graduate         NaN
305            No  Not Graduate         NaN
322            No  Not Graduate         NaN
338            No  Not Graduate         NaN
387            No  Not Graduate         NaN
435            No      Graduate         NaN
437            No      Graduate         NaN
479            No      Graduate         NaN
524            No      Graduate         NaN
550           Yes      Graduate         NaN
551            No  Not Graduate         NaN
605            No  Not Graduate         NaN
# For checking later: collect the index labels of the rows where LoanAmount is NaN
idx = df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']].index
print (idx)
Int64Index([  0,  35,  63,  81,  95, 102, 103, 113, 127, 202, 284, 305, 322,
            338, 387, 435, 437, 479, 524, 550, 551, 605],
           dtype='int64')

# Replace missing values
df = df.set_index(['Education','Self_Employed'])
df['LoanAmount'].fillna(table.unstack(), inplace=True)
df = df.reset_index()
# Check the output: show only the rows that contained NaN before
print (df.loc[df.index.isin(idx), ['Self_Employed','Education', 'LoanAmount']])
    Self_Employed     Education  LoanAmount
0              No      Graduate       130.0
35             No      Graduate       130.0
63             No      Graduate       130.0
81            Yes      Graduate       157.5
95             No      Graduate       130.0
102            No      Graduate       130.0
103            No      Graduate       130.0
113           Yes      Graduate       157.5
127            No      Graduate       130.0
202            No  Not Graduate       113.0
284            No      Graduate       130.0
305            No  Not Graduate       113.0
322            No  Not Graduate       113.0
338            No  Not Graduate       113.0
387            No  Not Graduate       113.0
435            No      Graduate       130.0
437            No      Graduate       130.0
479            No      Graduate       130.0
524            No      Graduate       130.0
550           Yes      Graduate       157.5
551            No  Not Graduate       113.0
605            No  Not Graduate       113.0
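The alignment step can be sketched on a toy frame. This is a minimal illustration under assumptions: the data is made up, the per-group medians come from groupby instead of pivot_table plus unstack, and the result is assigned back rather than filled in place. When fillna receives a Series, it matches values by index label, which is why the two key columns are moved into the index first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names follow the question.
toy = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Graduate", "Not Graduate", "Not Graduate"],
    "Self_Employed": ["No", "No", "No", "Yes", "Yes"],
    "LoanAmount": [100.0, 140.0, np.nan, 80.0, np.nan],
})

# One median per (Education, Self_Employed) pair; its index is the lookup key.
medians = toy.groupby(["Education", "Self_Employed"])["LoanAmount"].median()

# Move the key columns into the index so fillna can align on them.
indexed = toy.set_index(["Education", "Self_Employed"])
indexed["LoanAmount"] = indexed["LoanAmount"].fillna(medians)
filled = indexed.reset_index()

print(filled)
```

Assigning the result back avoids relying on inplace=True for a column selected from the frame, which may not update the original in every pandas version.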

Edit:

A better solution is groupby with apply, where each NaN is replaced by the median of its group:

url='https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'

df = pd.read_csv(url) #Reading the dataset in a dataframe using Pandas

#df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)

df['Self_Employed'].fillna('No',inplace=True)


print (df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']])
    Self_Employed     Education  LoanAmount
0              No      Graduate         NaN
35             No      Graduate         NaN
63             No      Graduate         NaN
81            Yes      Graduate         NaN
95             No      Graduate         NaN
102            No      Graduate         NaN
103            No      Graduate         NaN
113           Yes      Graduate         NaN
127            No      Graduate         NaN
202            No  Not Graduate         NaN
284            No      Graduate         NaN
305            No  Not Graduate         NaN
322            No  Not Graduate         NaN
338            No  Not Graduate         NaN
387            No  Not Graduate         NaN
435            No      Graduate         NaN
437            No      Graduate         NaN
479            No      Graduate         NaN
524            No      Graduate         NaN
550           Yes      Graduate         NaN
551            No  Not Graduate         NaN
605            No  Not Graduate         NaN
idx = df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']].index
print (idx)
Int64Index([  0,  35,  63,  81,  95, 102, 103, 113, 127, 202, 284, 305, 322,
            338, 387, 435, 437, 479, 524, 550, 551, 605],
           dtype='int64')

# Replace missing values
df['LoanAmount'] = (df.groupby(['Education','Self_Employed'])['LoanAmount']
                      .apply(lambda x: x.fillna(x.median())))
print (df.loc[df.index.isin(idx), ['Self_Employed','Education', 'LoanAmount']])
    Self_Employed     Education  LoanAmount
0              No      Graduate       130.0
35             No      Graduate       130.0
63             No      Graduate       130.0
81            Yes      Graduate       157.5
95             No      Graduate       130.0
102            No      Graduate       130.0
103            No      Graduate       130.0
113           Yes      Graduate       157.5
127            No      Graduate       130.0
202            No  Not Graduate       113.0
284            No      Graduate       130.0
305            No  Not Graduate       113.0
322            No  Not Graduate       113.0
338            No  Not Graduate       113.0
387            No  Not Graduate       113.0
435            No      Graduate       130.0
437            No      Graduate       130.0
479            No      Graduate       130.0
524            No      Graduate       130.0
550           Yes      Graduate       157.5
551            No  Not Graduate       113.0
605            No  Not Graduate       113.0
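A variant of the same idea uses transform instead of apply. This is a sketch on made-up toy data, not the code from the tutorial. transform returns a result indexed like the original column, so the assignment back lines up row by row. As far as I know, in some newer pandas versions the result of groupby(...).apply can carry the group keys in its index, in which case transform is the safer choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names follow the question.
toy = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Graduate", "Not Graduate", "Not Graduate"],
    "Self_Employed": ["No", "No", "No", "Yes", "Yes"],
    "LoanAmount": [100.0, 140.0, np.nan, 80.0, np.nan],
})

# Fill each NaN with the median of its (Education, Self_Employed) group.
toy["LoanAmount"] = (
    toy.groupby(["Education", "Self_Employed"])["LoanAmount"]
       .transform(lambda s: s.fillna(s.median()))
)

print(toy)
```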

Edit:

There is one more problem:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The solution is to replace the remaining NaNs:

df['Loan_Status'].fillna('No',inplace=True)
df['Credit_History'].fillna(0,inplace=True) 

outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']

classification_model(model, df, predictor_var,outcome_var)
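Before fitting a model, it can help to check which columns still contain NaN. The following is a minimal sketch on made-up toy data. The column names follow the question, and filling Credit_History with 0 mirrors the choice above rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing predictor value.
toy = pd.DataFrame({
    "Credit_History": [1.0, np.nan, 0.0, 1.0],
    "Loan_Status": ["Y", "N", "Y", "N"],
})

# Count missing values per column and show only the columns that have any.
nan_counts = toy.isnull().sum()
print(nan_counts[nan_counts > 0])

# Fill the predictor, then confirm that nothing is missing any more.
toy["Credit_History"] = toy["Credit_History"].fillna(0)
remaining = int(toy.isnull().sum().sum())
print(remaining)
```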

Regarding python - ValueError: invalid fill value with a <class 'pandas.core.frame.DataFrame'>, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44450725/
