I want to change values in a pandas DataFrame using the condition data['Bare Nuclei'] != '?'
import pandas as pd
import numpy as np
column_names = ['Sample code number', 'Clump Thickness',
'Uniformity of Cell Size', 'Uniformity of Cell Shape',
'Marginal Adhesion', 'Single Epithelial Cell Size',
'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli',
'Mitoses', 'Class']
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', names = column_names )
mean = 0
n = 0
for index, row in data.iterrows():
    if row['Bare Nuclei'] != '?':
        n += 1
        mean += int(row['Bare Nuclei'])
mean = mean / n
temp = data
index = temp['Bare Nuclei'] == '?'
temp[index,'Bare Nuclei'] = mean
I would like to know how to change the values in the DataFrame, and why my approach is wrong. Can you help me? Thanks in advance!
Best answer
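Add DataFrame.loc to the last line, because the assignment has to select rows by the boolean mask and the column by its label. Without .loc, temp[index, 'Bare Nuclei'] treats the tuple (index, 'Bare Nuclei') as a single column key, and a Series cannot be hashed as part of a key: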
temp.loc[index,'Bare Nuclei'] = mean
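To see the difference on something small, here is a minimal sketch using a made-up four-row frame in place of the real dataset (the values and the fill value 3.0 are assumptions chosen only for illustration). It shows that a Series is unhashable, which is why it cannot be used inside a column key, and that .loc with the same mask assigns only to the matching rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'Bare Nuclei' column.
# dtype=object mirrors the column dtype shown for the real data.
df = pd.DataFrame(
    {'Bare Nuclei': pd.Series(['1', '?', '5', '?'], dtype=object)}
)
mask = df['Bare Nuclei'] == '?'

# A Series is mutable and therefore not hashable, so it cannot be
# part of a column key such as df[mask, 'Bare Nuclei'].
try:
    hash(mask)
    hashable = True
except TypeError:
    hashable = False
print(hashable)

# .loc takes a row selector and a column selector separately,
# so the boolean mask picks the rows to overwrite.
df.loc[mask, 'Bare Nuclei'] = 3.0
print(df['Bare Nuclei'].tolist())
```

Only the rows where the mask is True are changed; the other rows keep their original string values, so the column is still of object dtype after this step.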
However, it is best to avoid loops in pandas because they are slow. A better solution is to replace ? with NaN and then fillna with the mean:
data['Bare Nuclei'] = data['Bare Nuclei'].replace('?', np.nan).astype(float)
#more general
#data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'], errors='coerce')
data['Bare Nuclei'] = data['Bare Nuclei'].fillna(data['Bare Nuclei'].mean())
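The commented-out pd.to_numeric variant can be tried on a small in-memory Series without downloading anything. This is only a sketch with made-up values (the Series s and its contents are assumptions, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy column; '?' marks a missing value as in the question's data
s = pd.Series(['1', '?', '5', '9'], name='Bare Nuclei')

# errors='coerce' turns anything that cannot be parsed as a number into NaN
numeric = pd.to_numeric(s, errors='coerce')

# mean() skips NaN by default, so this is the mean of 1, 5 and 9
filled = numeric.fillna(numeric.mean())
print(filled.tolist())
```

pd.to_numeric with errors='coerce' is the more general choice because any unparseable value becomes NaN, not only the literal '?'.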
Alternative solution:
mask = data['Bare Nuclei'] == '?'
data['Bare Nuclei'] = data['Bare Nuclei'].mask(mask).astype(float)
data['Bare Nuclei'] = data['Bare Nuclei'].fillna(data['Bare Nuclei'].mean())
Verifying the solution:
column_names = ['Sample code number', 'Clump Thickness',
'Uniformity of Cell Size', 'Uniformity of Cell Shape',
'Marginal Adhesion', 'Single Epithelial Cell Size',
'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli',
'Mitoses', 'Class']
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', names = column_names )
#print (data.head())
#get index values by condition
L = data.index[data['Bare Nuclei'] == '?'].tolist()
print (L)
[23, 40, 139, 145, 158, 164, 235, 249, 275, 292, 294, 297, 315, 321, 411, 617]
#get mean of values converted to numeric
print (data['Bare Nuclei'].replace('?', np.nan).astype(float).mean())
3.5446559297218156
print (data.loc[L, 'Bare Nuclei'])
23 ?
40 ?
139 ?
145 ?
158 ?
164 ?
235 ?
249 ?
275 ?
292 ?
294 ?
297 ?
315 ?
321 ?
411 ?
617 ?
Name: Bare Nuclei, dtype: object
#convert to numeric - replace `?` to NaN and cast to float
data['Bare Nuclei'] = data['Bare Nuclei'].replace('?', np.nan).astype(float)
#more general
#data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'], errors='coerce')
#replace NaNs by means
data['Bare Nuclei'] = data['Bare Nuclei'].fillna(data['Bare Nuclei'].mean())
#verify replacing
print (data.loc[L, 'Bare Nuclei'])
23 3.544656
40 3.544656
139 3.544656
145 3.544656
158 3.544656
164 3.544656
235 3.544656
249 3.544656
275 3.544656
292 3.544656
294 3.544656
297 3.544656
315 3.544656
321 3.544656
411 3.544656
617 3.544656
Name: Bare Nuclei, dtype: float64
About python - Error when setting values in a pandas DataFrame: 'Series' objects are mutable, thus they cannot be hashed. A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49356798/