I am using Pandas to parse a DataFrame that I created:
# Initial DF
A B C
0 -1 qqq XXX
1 20 www CCC
2 30 eee VVV
3 -1 rrr BBB
4 50 ttt NNN
5 60 yyy MMM
6 70 uuu LLL
7 -1 iii KKK
8 -1 ooo JJJ
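My goal is to analyze column A and apply the following conditions to the DataFrame:
- Go through every row
- Check whether df['A'].iloc[index] == -1
- If true and index == 0, mark the first row for deletion
- If true and index == N (the last row), mark the last row for deletion
- If 0 < index < N and df['A'].iloc[index] == -1, and the previous or the next row also contains -1 (df['A'].iloc[index+1] == -1 or df['A'].iloc[index-1] == -1), mark the row for deletion; otherwise replace -1 with the average of the previous and next values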
The final DataFrame should look like this:
# Final DF
A B C
0 20 www CCC
1 30 eee VVV
2 40 rrr BBB
3 50 ttt NNN
4 60 yyy MMM
5 70 uuu LLL
I was able to reach my goal by writing simple code that applies the conditions above:
import pandas as pd
# create dataframe
data = {'A':[-1,20,30,-1,50,60,70,-1,-1],
'B':['qqq','www','eee','rrr','ttt','yyy','uuu','iii','ooo'],
'C':['XXX','CCC','VVV','BBB','NNN','MMM','LLL','KKK','JJJ']}
df = pd.DataFrame(data)
# If df['A'].iloc[index]==-1:
# - option 1: remove the row if it is the first or the last row
# - option 2: remove the row if the previous or following row contains -1 (df['A'].iloc[index-1]==-1 or df['A'].iloc[index+1]==-1)
# - option 3: otherwise replace df['A'].iloc[index] with the average of the previous and following values
N = len(df.index) # number of rows
index_vect = [] # store indexes of rows to be deleted
for index in range(0,N):
    # option 1 (first row)
    if index==0 and df['A'].iloc[index]==-1:
        index_vect.append(index)
    elif index>0 and index<N-1 and df['A'].iloc[index]==-1:
        # option 2
        if df['A'].iloc[index-1]==-1 or df['A'].iloc[index+1]==-1:
            index_vect.append(index)
        # option 3
        else:
            df['A'].iloc[index] = int((df['A'].iloc[index+1]+df['A'].iloc[index-1])/2)
    # option 1 (last row)
    elif index==N-1 and df['A'].iloc[index]==-1:
        index_vect.append(index)
# remove rows to be deleted
df = df.drop(index_vect).reset_index(drop = True)
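As you can see, the code is fairly long, and I would like to know whether you can suggest a smarter and more efficient way to get the same result.
In addition, I noticed that my code returns a warning message, caused by the line df['A'].iloc[index] = int((df['A'].iloc[index+1]+df['A'].iloc[index-1])/2).
Do you know how I could improve this line of code?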
Best answer
Here is a solution:
import numpy as np
# Let's replace -1 by Not a Number (NaN)
# (.ix was removed in pandas 1.0; replace() also upcasts the column to float)
df['A'] = df['A'].replace(-1, np.nan)
# If df.A is NaN and either the previous or next is also NaN, we don't select it
# This takes care of the condition on the first and last row too
df = df[~(df.A.isnull() & (df.A.shift(1).isnull() | df.A.shift(-1).isnull()))].copy()
# Use interpolate to fill with the average of previous and next
df.A = df.A.interpolate(method='linear', limit=1)
This is the resulting df:
A B C
1 20.0 www CCC
2 30.0 eee VVV
3 40.0 rrr BBB
4 50.0 ttt NNN
5 60.0 yyy MMM
6 70.0 uuu LLL
You can reset the index if needed.
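As a minimal sketch of the whole approach, the block below wraps the steps above in a helper, resets the index, and casts A back to integers so the output matches the "Final DF" in the question. The function name `clean_frame` and its `col` and `missing` parameters are hypothetical and only serve the illustration. The last lines also show one possible way to write the loop assignment from the question: the warning is most likely a chained-assignment warning from `df['A'].iloc[index] = ...`, which a single `.iloc[row, column]` call avoids.

```python
import numpy as np
import pandas as pd

data = {'A': [-1, 20, 30, -1, 50, 60, 70, -1, -1],
        'B': ['qqq', 'www', 'eee', 'rrr', 'ttt', 'yyy', 'uuu', 'iii', 'ooo'],
        'C': ['XXX', 'CCC', 'VVV', 'BBB', 'NNN', 'MMM', 'LLL', 'KKK', 'JJJ']}


def clean_frame(df, col='A', missing=-1):
    """Hypothetical helper: drop unrecoverable rows, average isolated gaps."""
    out = df.copy()
    # Float copy of the column with the missing marker turned into NaN.
    values = out[col].astype(float).mask(out[col] == missing)
    out[col] = values
    is_gap = values.isna()
    # shift() pads the ends with NaN, so a gap in the first or last row
    # always counts as having a missing neighbour.
    neighbour_gap = values.shift(1).isna() | values.shift(-1).isna()
    out = out[~(is_gap & neighbour_gap)].copy()
    # Each remaining gap has two valid neighbours; linear interpolation
    # fills it with their average.
    out[col] = out[col].interpolate(method='linear', limit=1)
    # Truncate to int, as int(...) does in the question's loop.
    out[col] = out[col].astype(int)
    return out.reset_index(drop=True)


result = clean_frame(pd.DataFrame(data))
print(result)

# Loop-style alternative to the line that triggered the warning:
# one .iloc call with (row, column) positions, no chained indexing.
df = pd.DataFrame(data)
col_pos = df.columns.get_loc('A')
index = 3
df.iloc[index, col_pos] = int((df.iloc[index + 1, col_pos] + df.iloc[index - 1, col_pos]) / 2)
```

Working on a copy inside the helper leaves the caller's DataFrame untouched, which is a design choice of this sketch rather than something the original answer requires.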
Regarding python - How to refactor simple DataFrame parsing code with Pandas, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40720660/