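I'm trying to analyze the dataset below using a modified version of code provided by user @Garret, but I've run into some problems.
The dataset has a column showing whether a customer was handled by a live agent or by an automated system. I'm trying to find the score difference between consecutive calls where the member was first connected to an agent and then was not. The calls must have the same call reason, and the second call must come after the initial call by timestamp. Calls for other reasons in between are fine.
Here is the dataset: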
import pandas as pd

data = [['bob13', 1, 'returns','automated',' 2019-08-18 10:12:00'],['bob13', 0, 'returns','automated',' 2019-03-18 10:12:00'],\
['bob13', 8, 'returns','agent',' 2019-04-18 10:15:00'],['rach2', 2, 'shipping','automated',' 2019-04-19 10:15:00'],\
['bob13', 0, 'returns','agent',' 2019-05-18 11:12:00'],['rach2', 0, 'shipping','agent',' 2019-04-18 11:15:00'],\
['bob13', 3, 'returns','agent',' 2019-02-18 10:12:00'],['rach2', 8, 'shipping','agent',' 2019-05-19 10:15:00'],\
['rach2', 7, 'shipping','automated',' 2019-06-19 10:15:00'],['roy', 4, 'exchange','agent','2019-03-26 17:36:00'],\
['roy', 5, 'exchange','automated','2019-01-28 09:48:00']]
df = pd.DataFrame(data, columns = ['member_id', 'survey_score','call_reason','connection','time_stamp'])
df.sort_values(by=['time_stamp']).head(20)
    member_id  survey_score  call_reason  connection           time_stamp
6       bob13             3      returns       agent  2019-02-18 10:12:00
1       bob13             0      returns   automated  2019-03-18 10:12:00
2       bob13             8      returns       agent  2019-04-18 10:15:00
5       rach2             0     shipping       agent  2019-04-18 11:15:00
3       rach2             2     shipping   automated  2019-04-19 10:15:00
4       bob13             0      returns       agent  2019-05-18 11:12:00
7       rach2             8     shipping       agent  2019-05-19 10:15:00
8       rach2             7     shipping   automated  2019-06-19 10:15:00
0       bob13             1      returns   automated  2019-08-18 10:12:00
10        roy             5     exchange   automated  2019-01-28 09:48:00
9         roy             4     exchange       agent  2019-03-26 17:36:00
The output I expect is:
member_id  call_reason  automated  agent  score differential
    bob13      returns          0      3                  -3
    bob13      returns          1      0                   1
    rach2     shipping          2      0                   2
    rach2     shipping          7      8                  -1
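So basically, I'm just looking for the difference between two calls in terms of call_reason and connection. The first call is one where the member connected to an agent; the second call must come after the first by timestamp, be for the same reason, and be connected to the automated system. It doesn't matter if calls for other reasons were made in between. The code I tried is below: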
grp = df.query('connection=="automated"').\
groupby(['member_id', 'call_reason'])
df['OutId'] = grp.time_stamp.transform(lambda x: x.rank())
df.head(10)
grp = df.groupby(['member_id', 'call_reason'])
df['Id'] = grp.OutId.transform(lambda x: x.bfill())
df.head(10)
agent = df.query('connection=="agent"').\
groupby(['member_id', 'call_reason', 'Id']).survey_score.last()
automated = df.query('connection=="automated"').\
groupby(['member_id', 'call_reason', 'Id']).survey_score.last()
ddf = pd.concat([automated, agent], axis=1,
keys=['automated', 'agent'])
ddf['score_differential'] = ddf.automated - ddf.agent
The output I get is:
ddf.dropna().head(10)
                           automated  agent  score_differential
member_id call_reason Id
rach2     shipping    2.0          7    8.0                -1.0
roy       exchange    1.0          5    4.0                 1.0
Again, the expected output would be:
member_id  call_reason  automated  agent  score differential
    bob13      returns          0      3                  -3
    bob13      returns          1      0                   1
    rach2     shipping          2      0                   2
    rach2     shipping          7      8                  -1
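Note: I'd be glad if the solution could be a bit flexible so that I can analyze a few different scenarios, for example:
- the difference between calls that were all automated
- the difference between calls that were all connected to an agent
- the difference between calls where the first call connected to an agent and the connection type of the second call doesn't matter
Any extra help with this would be greatly appreciated!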
Best answer
You can do this by writing a function and then applying that function to the groups in a groupby.
Set up the initial dataframe:
import pandas as pd
data = [['bob13', 1, 'returns','automated',' 2019-08-18 10:12:00'],['bob13', 0, 'returns','automated',' 2019-03-18 10:12:00'],\
['bob13', 8, 'returns','agent',' 2019-04-18 10:15:00'],['rach2', 2, 'shipping','automated',' 2019-04-19 10:15:00'],\
['bob13', 0, 'returns','agent',' 2019-05-18 11:12:00'],['rach2', 0, 'shipping','agent',' 2019-04-18 11:15:00'],\
['bob13', 3, 'returns','agent',' 2019-02-18 10:12:00'],['rach2', 8, 'shipping','agent',' 2019-05-19 10:15:00'],\
['rach2', 7, 'shipping','automated',' 2019-06-19 10:15:00'],['roy', 4, 'exchange','agent','2019-03-26 17:36:00'],\
['roy', 5, 'exchange','automated','2019-01-28 09:48:00']]
df = pd.DataFrame(data, columns = ['member_id', 'survey_score','call_reason','connection','time_stamp'])
df.sort_values(by=['time_stamp']).head(20)
df['time_stamp'] = pd.to_datetime(df['time_stamp'])
df
    member_id  survey_score  call_reason  connection           time_stamp
0       bob13             1      returns   automated  2019-08-18 10:12:00
1       bob13             0      returns   automated  2019-03-18 10:12:00
2       bob13             8      returns       agent  2019-04-18 10:15:00
3       rach2             2     shipping   automated  2019-04-19 10:15:00
4       bob13             0      returns       agent  2019-05-18 11:12:00
5       rach2             0     shipping       agent  2019-04-18 11:15:00
6       bob13             3      returns       agent  2019-02-18 10:12:00
7       rach2             8     shipping       agent  2019-05-19 10:15:00
8       rach2             7     shipping   automated  2019-06-19 10:15:00
9         roy             4     exchange       agent  2019-03-26 17:36:00
10        roy             5     exchange   automated  2019-01-28 09:48:00
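One thing worth noting about the sample data: most of the time_stamp strings start with a space, while roy's do not. Sorted as plain text, that pushes roy's rows to the end regardless of date, which is what the sorted output in the question shows. Below is a minimal sketch of a defensive parse, using three of the strings from the data above. Stripping the whitespace first is my own precaution rather than part of the original answer, and the variable names are only illustrative.

```python
import pandas as pd

# Three time_stamp strings taken from the sample data:
# two with a leading space (bob13) and one without (roy).
raw = pd.Series([' 2019-08-18 10:12:00', ' 2019-02-18 10:12:00', '2019-01-28 09:48:00'])

# Sorted as plain strings, the space-prefixed values come first,
# so the earliest date ends up last.
sorted_as_text = raw.sort_values().tolist()

# Assumption: stripping whitespace before parsing is a precaution I added.
parsed = pd.to_datetime(raw.str.strip())

# Once parsed, the values sort chronologically.
sorted_as_time = parsed.sort_values().tolist()
```

Doing the conversion before any sorting or grouping keeps the order of calls within each member reliable.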
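Whenever I tackle a problem like this, I break it down to one specific group. So I isolated just bob13 and tried to reproduce the output we want for him. That led me through a specific series of steps, which I then put into a function:
We sort the dataframe by time, then create new columns named next_connection and next_score. These shift the values up from the following row, so that each row also carries the connection type and score of the call after it. We drop all rows with missing values (the last row in each group, since it has no next row), and keep only the rows where connection is agent and next_connection is automated. We rename the columns to match your output and compute the score differential.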
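def function_(df):
    # Order the group's calls chronologically
    df = df.sort_values('time_stamp')
    # Pull the following call's connection type and score onto each row
    df['next_connection'] = df.connection.shift(-1)
    df['next_score'] = df.survey_score.shift(-1)
    # The last call in the group has no successor, so drop it
    df = df.dropna()
    # Keep agent calls that are immediately followed by an automated call
    df = df[(df.connection == 'agent') & (df.next_connection == 'automated')]
    df = df.rename(columns={'survey_score':'agent', 'next_score':'automated'})
    df['score differential'] = df['automated'] - df['agent']
    return df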
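To see what the shift does before the filter is applied, here is a small sketch that rebuilds just bob13's returns calls from the data above and runs the same steps by hand. The names bob, pairs and differential are illustrative and not part of the original answer.

```python
import pandas as pd

# bob13's five 'returns' calls from the sample data, sorted by time.
bob = pd.DataFrame(
    {
        'survey_score': [1, 0, 8, 0, 3],
        'connection': ['automated', 'automated', 'agent', 'agent', 'agent'],
        'time_stamp': pd.to_datetime([
            '2019-08-18 10:12:00', '2019-03-18 10:12:00', '2019-04-18 10:15:00',
            '2019-05-18 11:12:00', '2019-02-18 10:12:00',
        ]),
    }
).sort_values('time_stamp')

# shift(-1) moves each value up one row, so every row sees its successor.
bob['next_connection'] = bob['connection'].shift(-1)
bob['next_score'] = bob['survey_score'].shift(-1)

# Keep agent calls whose immediate successor is automated.
pairs = bob[(bob['connection'] == 'agent') & (bob['next_connection'] == 'automated')]

# Later (automated) score minus earlier (agent) score.
differential = (pairs['next_score'] - pairs['survey_score']).tolist()
```

Only directly adjacent calls within the group are paired: the agent call scored 8 is followed by another agent call, so it produces no row.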
Now we apply it to the dataframe grouped by member_id and call_reason.
g = df.groupby(['member_id', 'call_reason']).apply(function_)
g[['member_id','call_reason','automated','agent','score differential']].reset_index(drop=True)
   member_id  call_reason  automated  agent  score differential
0      bob13      returns        0.0      3                -3.0
1      bob13      returns        1.0      0                 1.0
2      rach2     shipping        2.0      0                 2.0
3      rach2     shipping        7.0      8                -1.0
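The question also asks for something flexible enough to cover other scenarios. One possible way to do that, offered as a sketch rather than a tested solution for real data, is to make the two connection types parameters. The function name consecutive_call_pairs, its first and second parameters, and the output column names are all hypothetical. It uses a grouped shift instead of groupby().apply(), partly because the way apply treats the grouping columns has changed in recent pandas releases.

```python
import pandas as pd

data = [
    ['bob13', 1, 'returns', 'automated', ' 2019-08-18 10:12:00'],
    ['bob13', 0, 'returns', 'automated', ' 2019-03-18 10:12:00'],
    ['bob13', 8, 'returns', 'agent', ' 2019-04-18 10:15:00'],
    ['rach2', 2, 'shipping', 'automated', ' 2019-04-19 10:15:00'],
    ['bob13', 0, 'returns', 'agent', ' 2019-05-18 11:12:00'],
    ['rach2', 0, 'shipping', 'agent', ' 2019-04-18 11:15:00'],
    ['bob13', 3, 'returns', 'agent', ' 2019-02-18 10:12:00'],
    ['rach2', 8, 'shipping', 'agent', ' 2019-05-19 10:15:00'],
    ['rach2', 7, 'shipping', 'automated', ' 2019-06-19 10:15:00'],
    ['roy', 4, 'exchange', 'agent', '2019-03-26 17:36:00'],
    ['roy', 5, 'exchange', 'automated', '2019-01-28 09:48:00'],
]
df = pd.DataFrame(data, columns=['member_id', 'survey_score', 'call_reason',
                                 'connection', 'time_stamp'])
# Assumption: strip stray whitespace before parsing the timestamps.
df['time_stamp'] = pd.to_datetime(df['time_stamp'].str.strip())


def consecutive_call_pairs(calls, first=None, second=None):
    """Pair each call with the next call by the same member for the same reason.

    first / second: required connection type of the earlier / later call,
    or None to accept any type. Hypothetical helper, not from the original answer.
    """
    keys = ['member_id', 'call_reason']
    calls = calls.sort_values(keys + ['time_stamp'])
    grouped = calls.groupby(keys)

    pairs = calls[keys].copy()
    pairs['first_connection'] = calls['connection']
    pairs['second_connection'] = grouped['connection'].shift(-1)
    pairs['first_score'] = calls['survey_score']
    pairs['second_score'] = grouped['survey_score'].shift(-1)

    # The last call in each group has no successor.
    pairs = pairs.dropna(subset=['second_connection'])
    if first is not None:
        pairs = pairs[pairs['first_connection'] == first]
    if second is not None:
        pairs = pairs[pairs['second_connection'] == second]

    # Later score minus earlier score, as in the answer above.
    pairs = pairs.assign(
        score_differential=pairs['second_score'] - pairs['first_score'])
    return pairs.reset_index(drop=True)


# The scenario from the question: agent first, automated next.
agent_then_automated = consecutive_call_pairs(df, first='agent', second='automated')
# Agent first, any connection type next.
agent_then_any = consecutive_call_pairs(df, first='agent')
# Only automated calls compared with each other: filter before pairing.
automated_only = consecutive_call_pairs(df[df['connection'] == 'automated'])
```

For the "only automated" or "only agent" comparisons, whether to filter the frame first (as in automated_only) or to require adjacency in the full call history (first='automated', second='automated') depends on what you want to measure. On this sample the second form returns no rows, because no automated call is directly followed by another automated call.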
Regarding python - Difference between scores of concurrent calls in a Pandas dataframe, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56857049/