python - How to split a Pandas data frame based on time differences?

Tags: python python-3.x pandas

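I extracted my group's text messages, and the data looks like the table below (excluding the third column).

How can I separate the first conversation from the second based on time, and assign each message a conversation_id?

If it helps, I'm happy to assume that if any message goes unanswered for an hour, the next message starts a new conversation.

Given the first two columns, I would ideally be able to work out the third column across nearly 45,000 messages with varying conversation lengths.

Once I've split our texts into conversations, I figure I can train a chatbot to take part in our chats! I don't know what I'm doing here, so I'd appreciate any help :)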

| created_at          | message                              | conversation_id |
| 2018-07-03 02:12:33 | knock knock                          | 1               |
| 2018-07-03 02:12:35 | who's there                          | 1               |
| 2018-07-03 02:12:40 | Europe                               | 1               |
| 2018-07-03 02:12:45 | Europe who?                          | 1               |
| 2018-07-03 02:12:48 | No - you're a poo                    | 1               |
| 2018-07-03 03:15:17 | knock knock                          | 2               |
| 2018-07-03 03:15:20 | who's there                          | 2               |
| 2018-07-03 03:15:23 | the KGB                              | 2               |
| 2018-07-03 03:15:28 | the KGB who?                         | 2               |
| 2018-07-03 03:15:33 | SLAP the KGB will ask the questions! | 2               |

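Best answer

This should do the trick. Each ID is chained to the previous value in the column, so the code below walks the data frame row by row to assign them: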

import pandas as pd
import numpy as np

d = {"created_at": pd.to_datetime(["2018-07-03 02:12:33", "2018-07-03 02:12:35","2018-07-03 02:12:40","2018-07-03 02:12:45","2018-07-03 02:12:48","2018-07-03 03:15:17","2018-07-03 03:15:20","2018-07-03 03:15:23","2018-07-03 03:15:28","2018-07-03 03:15:33","2018-08-03 09:00:00","2018-09-03 10:15:00"]),
     "message": ["knock knock","who's there","Europe","Europe who?","No - you're a poo","knock knock","who's there","the KGB","the KGB who?","SLAP the KGB will ask q's!","Hello?","Hello, again?"]}

# Threshold: 60 minutes, expressed in seconds
thresh = 60*60

df = pd.DataFrame(data=d)

# Seconds elapsed since the previous message (0 for the first row)
df["delta"] = df["created_at"].diff().fillna(pd.Timedelta(0)).dt.total_seconds()

# Flag the start of a new conversation: 1 if the gap reaches the threshold, else 0
df["id"] = np.where(df["delta"] < thresh, 0, 1)
df = df.drop(["delta"], axis=1)

# Accumulate the flags row by row so each conversation gets its own ID
for i in range(1, len(df)):
    df.loc[i, 'id'] += df.loc[i-1, 'id']

print(df)

            created_at                     message  id
0  2018-07-03 02:12:33                 knock knock   0
1  2018-07-03 02:12:35                 who's there   0
2  2018-07-03 02:12:40                      Europe   0
3  2018-07-03 02:12:45                 Europe who?   0
4  2018-07-03 02:12:48           No - you're a poo   0
5  2018-07-03 03:15:17                 knock knock   1
6  2018-07-03 03:15:20                 who's there   1
7  2018-07-03 03:15:23                     the KGB   1
8  2018-07-03 03:15:28                the KGB who?   1
9  2018-07-03 03:15:33  SLAP the KGB will ask q's!   1
10 2018-08-03 09:00:00                      Hello?   2
11 2018-09-03 10:15:00               Hello, again?   3
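
As a possible alternative to the row-by-row loop, the running total of the "new conversation" flags can also be computed with `cumsum`. The following is only a minimal sketch under stated assumptions: the sample rows are a shortened, hypothetical subset of the question's table, a gap of one hour or more is assumed to start a new conversation, and the `conversation_id`, `threshold`, `new_conversation` and `sizes` names are chosen here for illustration. IDs start at 1 to match the question's third column.

```python
import pandas as pd

# Hypothetical sample data: a shortened subset of the question's table.
df = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2018-07-03 02:12:33", "2018-07-03 02:12:35", "2018-07-03 02:12:48",
        "2018-07-03 03:15:17", "2018-07-03 03:15:20",
    ]),
    "message": ["knock knock", "who's there", "No - you're a poo",
                "knock knock", "who's there"],
})

# Assumption: a gap of one hour or more starts a new conversation.
threshold = pd.Timedelta(hours=1)

# Sort by time so each gap is measured against the previous message.
df = df.sort_values("created_at").reset_index(drop=True)

# True wherever the gap to the previous message reaches the threshold.
# The first row's gap is NaT, which compares as False.
new_conversation = df["created_at"].diff() >= threshold

# Running count of conversation starts, shifted so IDs begin at 1.
df["conversation_id"] = new_conversation.cumsum() + 1

# Number of messages in each conversation.
sizes = df.groupby("conversation_id").size()

print(df)
print(sizes)
```

Because no Python-level loop is involved, this form may scale more comfortably to the roughly 45,000 messages mentioned in the question, and the `groupby` at the end gives a quick check of how long each conversation turned out to be.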

Regarding "python - How to split a Pandas data frame based on time differences?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/51201380/
