Consider this DataFrame:
id name date_time strings
1 'AAA' 2018-08-03 18:00:00 1125,1517,656,657
1 'AAA' 2018-08-03 18:45:00 128,131,646,535,157,159
1 'AAA' 2018-08-03 18:49:00 131
1 'BBB' 2018-08-03 19:41:00 0
1 'BBB' 2018-08-05 19:30:00 0
1 'AAA' 2018-08-04 11:00:00 131
1 'AAA' 2018-08-04 11:30:00 1000
1 'AAA' 2018-08-04 11:33:00 1000,5555
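To reproduce the example, the table above can be built roughly as follows. This is only a sketch: it assumes the quotes around the names are display artifacts, and it parses date_time with pd.to_datetime because time differences need a real datetime column.

```python
import pandas as pd

# Sample data copied from the table above. "0" in `strings` stands for a missing value.
df = pd.DataFrame({
    "id": [1] * 8,
    "name": ["AAA", "AAA", "AAA", "BBB", "BBB", "AAA", "AAA", "AAA"],
    "date_time": pd.to_datetime([
        "2018-08-03 18:00:00", "2018-08-03 18:45:00", "2018-08-03 18:49:00",
        "2018-08-03 19:41:00", "2018-08-05 19:30:00", "2018-08-04 11:00:00",
        "2018-08-04 11:30:00", "2018-08-04 11:33:00",
    ]),
    "strings": [
        "1125,1517,656,657", "128,131,646,535,157,159", "131",
        "0", "0", "131", "1000", "1000,5555",
    ],
})

print(df.dtypes)
```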
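First, I want to look at the groups of rows that share the same id and name, and set match to True when consecutive rows have at least one string in common. (Some rows had no value in the strings column, so they were filled with 0.) Desired output: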
id name date_time strings match
1 'AAA' 2018-08-03 18:00:00 1125,128,1517,656,657 False
1 'AAA' 2018-08-03 18:45:00 128,131,646,535,157,159 True
1 'AAA' 2018-08-03 18:49:00 131 True
1 'BBB' 2018-08-03 19:41:00 0 False
1 'BBB' 2018-08-05 19:30:00 0 False
1 'AAA' 2018-08-04 11:00:00 131 True
1 'AAA' 2018-08-04 11:30:00 1000 False
1 'AAA' 2018-08-04 11:33:00 1000,5555 True
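Then I want to group the rows by id and name and find the time difference between consecutive rows whose match is True; if the difference is less than 00:05:00, flag should be 1. Final output: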
id name date_time strings diff flag
1 'AAA' 2018-08-03 18:00:00 1125,128,1517,656,657 00:00:00 0
1 'AAA' 2018-08-03 18:45:00 128,131,646,535,157,159 00:00:00 0
1 'AAA' 2018-08-03 18:49:00 131 00:04:00 1
1 'BBB' 2018-08-03 19:41:00 0 00:00:00 0
1 'BBB' 2018-08-05 19:30:00 0 00:00:00 0
1 'AAA' 2018-08-04 11:00:00 131 16:15:00 0
1 'AAA' 2018-08-04 11:30:00 1000 00:00:00 0
1 'AAA' 2018-08-04 11:33:00 1000,5555 00:33:00 0
For the first part I tried this code, but it does not work correctly:
grouped = df.groupby(['id','name'])
z = []
for index,row in grouped:
    z.append(list(zip(row['strings'], row['strings'].shift())))
df['match'] = [bool(set(str(s1).split(','))& set(str(s2).split(','))) for i in range(len(z)) for s1,s2 in z[i]]
For the second part I tried several different solutions, but none of them worked.
Any hints are appreciated.
Best answer
If you want to compare each row with the previous row, use:
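import pandas as pd

# assumes df['date_time'] is already datetime64, e.g. via pd.to_datetime
# one 0/1 indicator column per distinct token found in 'strings'
dummies = df['strings'].str.get_dummies(',')
# rows that hold only the '0' placeholder can never match
c1 = df['strings'].ne('0')
# a token present both in this row and in the previous row of the same (id, name) group
c2 = (dummies.groupby([df['id'], df['name']]).shift().eq(dummies) & dummies.ge(1)).any(axis=1)
df['match'] = c1 & c2
# gap to the previous matched row of the same group; zero where there is none
df['diff'] = (df.groupby(['id', 'name', 'match'])['date_time']
                .diff()
                .where(df['match'])
                .fillna(pd.Timedelta(hours=0)))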
print(df)
id name date_time strings match diff
0 1 'AAA' 2018-08-03 18:00:00 1125,128,1517,656,657 False 00:00:00
1 1 'AAA' 2018-08-03 18:45:00 128,131,646,535,157,159 True 00:00:00
2 1 'AAA' 2018-08-03 18:49:00 131 True 00:04:00
3 1 'BBB' 2018-08-03 19:41:00 0 False 00:00:00
4 1 'BBB' 2018-08-05 19:30:00 0 False 00:00:00
5 1 'AAA' 2018-08-04 11:00:00 131 True 16:11:00
6 1 'AAA' 2018-08-04 11:30:00 1000 False 00:00:00
7 1 'AAA' 2018-08-04 11:33:00 1000,5555 True 00:33:00
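The snippet above stops at diff, while the question also asks for a flag column. A minimal sketch of that last step follows. It assumes match and diff were computed as shown, and that a zero diff means "no earlier matched row in the group". The small frame is a hypothetical stand-in, not the full data.

```python
import pandas as pd

# Hypothetical miniature of the frame after `match` and `diff` were computed.
df = pd.DataFrame({
    "match": [False, True, True, True],
    "diff": pd.to_timedelta(["00:00:00", "00:00:00", "00:04:00", "16:11:00"]),
})

limit = pd.Timedelta(minutes=5)
zero = pd.Timedelta(0)

# flag is 1 only for matched rows whose gap to the previous matched row is
# positive and shorter than five minutes.
df["flag"] = (df["match"] & df["diff"].gt(zero) & df["diff"].lt(limit)).astype(int)
print(df)
```

One consequence of this choice: two matched rows with an identical timestamp would have a zero diff and therefore get flag 0. If those should count, the positive-gap test would need to be replaced by an explicit "has a previous matched row" check.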
If you want to compare each row with its adjacent rows (the previous and the next one):
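dummies = df['strings'].str.get_dummies(',')
c1 = df['strings'].ne('0')  # or df['strings'].ne(0) if the placeholder is numeric
# sum each token's indicator over a centered 3-row window within each (id, name) group;
# a sum above 1 means the token occurs in at least two rows of that window
c2 = ((dummies.groupby([df['id'], df['name']], as_index=False)
              .rolling(3, center=True, min_periods=1)
              .sum()
              .gt(1)).any(axis=1)
      # drop the group level so the index lines up with df again; the levels added by
      # groupby(...).rolling(...) may differ between pandas versions, so adjust if needed
      .reset_index(level=0, drop=True))
df['match'] = c1 & c2
df['diff'] = (df.groupby(['id', 'name', 'match'])['date_time']
                .diff()
                .where(df['match'])
                .fillna(pd.Timedelta(hours=0)))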
print(df)
Output
id name date_time strings match diff
0 1 'AAA' 2018-08-03 18:00:00 1125,1517,656,657 False 00:00:00
1 1 'AAA' 2018-08-03 18:45:00 128,131,646,535,157,159 True 00:00:00
2 1 'AAA' 2018-08-03 18:49:00 131 True 00:04:00
3 1 'BBB' 2018-08-03 19:41:00 0 False 00:00:00
4 1 'BBB' 2018-08-05 19:30:00 0 False 00:00:00
5 1 'AAA' 2018-08-04 11:00:00 131 True 16:11:00
6 1 'AAA' 2018-08-04 11:30:00 1000 True 00:30:00
7 1 'AAA' 2018-08-04 11:33:00 1000,5555 True 00:03:00
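The same neighbour comparison can also be written with plain Python sets, which avoids the indicator matrix and the rolling window. This is a sketch of an alternative technique, not the answer's method. The helper name neighbour_match and its look_ahead parameter are made up for illustration, and "0" is assumed to be the only placeholder for missing values.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1] * 8,
    "name": ["AAA", "AAA", "AAA", "BBB", "BBB", "AAA", "AAA", "AAA"],
    "strings": ["1125,1517,656,657", "128,131,646,535,157,159", "131",
                "0", "0", "131", "1000", "1000,5555"],
})

# One set of tokens per row; the "0" placeholder becomes an empty set.
tokens = df["strings"].map(lambda s: set(str(s).split(",")) - {"0"})

def neighbour_match(sets, look_ahead=True):
    """Return one bool per set: True if it shares a token with the previous
    set or, when look_ahead is True, with the following set."""
    flags = []
    for i, current in enumerate(sets):
        hit = i > 0 and bool(current & sets[i - 1])
        if look_ahead and i + 1 < len(sets):
            hit = hit or bool(current & sets[i + 1])
        flags.append(hit)
    return flags

match = pd.Series(False, index=df.index)
# .groups maps each (id, name) key to the row labels of that group, in row order.
for labels in df.groupby(["id", "name"]).groups.values():
    match.loc[labels] = neighbour_match(tokens.loc[labels].tolist())

df["match"] = match
print(df)
```

Passing look_ahead=False gives the "compare with the previous row only" variant. Because the row's own set is always one side of the intersection, a row is never marked just because its two neighbours share a value with each other.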
Regarding "python-3.x - comparing DataFrame rows containing strings", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59162659/