python-3.x - Comparing dataframe rows that contain strings

Consider this dataframe:

id     name           date_time                 strings   
1      'AAA'    2018-08-03 18:00:00             1125,1517,656,657
1      'AAA'    2018-08-03 18:45:00             128,131,646,535,157,159
1      'AAA'    2018-08-03 18:49:00             131
1      'BBB'    2018-08-03 19:41:00             0
1      'BBB'    2018-08-05 19:30:00             0
1      'AAA'    2018-08-04 11:00:00             131
1      'AAA'    2018-08-04 11:30:00             1000
1      'AAA'    2018-08-04 11:33:00             1000,5555

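First, I want to look at groups of rows that share the same id and name, and set match to True when two consecutive rows have at least one string in common. (Some rows had no value in the strings column, so they have been filled with 0.) Desired output: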

id     name           date_time                 strings                    match       
1      'AAA'    2018-08-03 18:00:00             1125,128,1517,656,657       False       
1      'AAA'    2018-08-03 18:45:00             128,131,646,535,157,159     True       
1      'AAA'    2018-08-03 18:49:00             131                         True
1      'BBB'    2018-08-03 19:41:00             0                           False
1      'BBB'    2018-08-05 19:30:00             0                           False
1      'AAA'    2018-08-04 11:00:00             131                         True
1      'AAA'    2018-08-04 11:30:00             1000                        False
1      'AAA'    2018-08-04 11:33:00             1000,5555                   True

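Then I want to group the rows by id and name and find the time difference between consecutive rows whose match value is True. If the difference is less than 00:05:00, flag should be 1. Final output: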

id     name           date_time                 strings                    diff        flag      
1      'AAA'    2018-08-03 18:00:00             1125,128,1517,656,657       00:00:00    0  
1      'AAA'    2018-08-03 18:45:00             128,131,646,535,157,159     00:00:00    0      
1      'AAA'    2018-08-03 18:49:00             131                         00:04:00    1
1      'BBB'    2018-08-03 19:41:00             0                           00:00:00    0
1      'BBB'    2018-08-05 19:30:00             0                           00:00:00    0
1      'AAA'    2018-08-04 11:00:00             131                         16:11:00    0
1      'AAA'    2018-08-04 11:30:00             1000                        00:00:00    0
1      'AAA'    2018-08-04 11:33:00             1000,5555                   00:33:00    0

For the first part, I tried this code, but it does not work correctly:

grouped = df.groupby(['id','name'])
z = []
for index,row in grouped:
    z.append(list(zip(row['strings'], row['strings'].shift())))
df['match'] = [bool(set(str(s1).split(','))& set(str(s2).split(','))) for i in range(len(z)) for s1,s2 in z[i]]

For the second part, I tried several approaches, but none of them worked.

Any hints are appreciated.

Best answer

If you want to compare each row's strings with those of the previous row in the same group, use:

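import pandas as pd

# assumes df['date_time'] already has a datetime dtype (e.g. via pd.to_datetime)
# one indicator column per distinct token in the comma-separated strings
dummies = df['strings'].str.get_dummies(',')
# rows filled with the placeholder '0' never count as a match
c1 = df['strings'].ne('0')
# a token present both in this row and in the previous row of the same (id, name) group
c2 = (dummies.groupby([df['id'], df['name']]).shift().eq(dummies) & dummies.ge(1)).any(axis=1)
df['match'] = c1 & c2
# time since the previous row with the same id, name and match value,
# kept only for matching rows; everything else becomes a zero timedelta
df['diff'] = (df.groupby(['id', 'name', 'match'])['date_time']
                .diff()
                .where(df['match'])
                .fillna(pd.Timedelta(hours=0)))
print(df)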

   id   name           date_time                  strings  match     diff
0   1  'AAA' 2018-08-03 18:00:00    1125,128,1517,656,657  False 00:00:00
1   1  'AAA' 2018-08-03 18:45:00  128,131,646,535,157,159   True 00:00:00
2   1  'AAA' 2018-08-03 18:49:00                      131   True 00:04:00
3   1  'BBB' 2018-08-03 19:41:00                        0  False 00:00:00
4   1  'BBB' 2018-08-05 19:30:00                        0  False 00:00:00
5   1  'AAA' 2018-08-04 11:00:00                      131   True 16:11:00
6   1  'AAA' 2018-08-04 11:30:00                     1000  False 00:00:00
7   1  'AAA' 2018-08-04 11:33:00                1000,5555   True 00:33:00
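
The same previous-row comparison can also be written with plain Python sets, which stays closer to the attempt in the question. This is only a sketch under a few assumptions: the sample rows are retyped from the output above (without the quote marks around name), and `shares_token` is a hypothetical helper name, not part of pandas. Unlike the loop in the question, it takes the previous row from `groupby(...).shift()`, so the result stays aligned with the original row order, and it ignores the filler value 0.

```python
import pandas as pd

# Sample rows retyped from the printed output above (name quotes dropped).
df = pd.DataFrame({
    "id": [1] * 8,
    "name": ["AAA", "AAA", "AAA", "BBB", "BBB", "AAA", "AAA", "AAA"],
    "strings": [
        "1125,128,1517,656,657", "128,131,646,535,157,159", "131",
        "0", "0", "131", "1000", "1000,5555",
    ],
})


def shares_token(current, previous):
    """Hypothetical helper: True if two comma-separated strings share a token."""
    if pd.isna(previous):  # first row of a group has no predecessor
        return False
    common = set(str(current).split(",")) & set(str(previous).split(","))
    common.discard("0")  # "0" is only the filler for missing values
    return bool(common)


# Previous row's strings within each (id, name) group, aligned to df's index.
previous = df.groupby(["id", "name"])["strings"].shift()
df["match"] = [shares_token(c, p) for c, p in zip(df["strings"], previous)]
print(df["match"].tolist())
```

On this sample the resulting match column agrees with the one printed above. The get_dummies version is likely to scale better on large frames, since this sketch loops over the rows in Python.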

If you want to compare each row with its adjacent rows (both the previous and the next one):

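import pandas as pd

# one indicator column per distinct token in the comma-separated strings
dummies = df['strings'].str.get_dummies(',')
c1 = df['strings'].ne('0')  # or df['strings'].ne(0) if the filler is numeric
# count each token over a centered 3-row window (previous, current, next) within
# each (id, name) group; a count above 1 means the token repeats inside the window
# (note: a token shared only by the previous and next rows also passes this test)
c2 = ((dummies.groupby([df['id'], df['name']], as_index=False)
              .rolling(3, center=True, min_periods=1)
              .sum()
              .gt(1)).any(axis=1)
                     # realign with df's index; depending on the pandas version,
                     # more than one group level may need dropping here
                     .reset_index(level=0, drop=True))
df['match'] = c1 & c2
df['diff'] = (df.groupby(['id', 'name', 'match'])['date_time']
                .diff()
                .where(df['match'])
                .fillna(pd.Timedelta(hours=0)))
print(df)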

Output

   id   name           date_time                  strings  match     diff
0   1  'AAA' 2018-08-03 18:00:00        1125,1517,656,657  False 00:00:00
1   1  'AAA' 2018-08-03 18:45:00  128,131,646,535,157,159   True 00:00:00
2   1  'AAA' 2018-08-03 18:49:00                      131   True 00:04:00
3   1  'BBB' 2018-08-03 19:41:00                        0  False 00:00:00
4   1  'BBB' 2018-08-05 19:30:00                        0  False 00:00:00
5   1  'AAA' 2018-08-04 11:00:00                      131   True 16:11:00
6   1  'AAA' 2018-08-04 11:30:00                     1000   True 00:30:00
7   1  'AAA' 2018-08-04 11:33:00                1000,5555   True 00:03:00
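
The question also asks for a flag column, which the code above stops short of. One way it might be derived from match and diff is sketched below. The small frame is a hypothetical stand-in holding only the match and diff values from the last output, and treating a zero diff as "no earlier matching row" is an assumption based on the desired output in the question.

```python
import pandas as pd

# Hypothetical stand-in for the frame printed above: only the columns the flag needs.
df = pd.DataFrame({
    "match": [False, True, True, False, False, True, True, True],
    "diff": pd.to_timedelta([
        "00:00:00", "00:00:00", "00:04:00", "00:00:00",
        "00:00:00", "16:11:00", "00:30:00", "00:03:00",
    ]),
})

limit = pd.Timedelta(minutes=5)
zero = pd.Timedelta(0)

# Assumption: a zero diff means there was no earlier matching row, so it is not flagged.
df["flag"] = (df["match"] & df["diff"].gt(zero) & df["diff"].lt(limit)).astype(int)
print(df["flag"].tolist())
```

With the adjacent-row variant, the last row (3 minutes after the previous match) is flagged as well, which differs from the desired output in the question because that variant marks the 11:30 row as a match.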

This question and answer on comparing dataframe rows that contain strings come from a similar question found on Stack Overflow: https://stackoverflow.com/questions/59162659/
