我有一个包含 3 列的数据框:
“日期”
每行都是唯一的'Group'
每行所属的组'ID'
每行的标识符
在'ID'
列中存在一些重复值。我想计算每个组和每个日期的重复次数。
数据框示例:
Date Group ID
0 2021-10-29 09:15:52 B 9352
1 2021-10-29 10:58:57 A 9352
2 2021-10-29 11:20:46 C 9352
3 2021-10-29 12:47:34 C 6274
4 2021-10-29 15:41:35 C 1677
5 2021-10-29 16:12:39 B 1677
6 2021-10-29 18:57:56 B 9225
7 2021-10-29 19:46:46 C 9225
8 2021-10-30 01:23:07 C 9225
9 2021-10-30 02:13:57 A 9225
10 2021-10-30 05:03:52 B 9329
11 2021-10-30 07:48:39 B 9329
12 2021-10-30 08:45:00 A 9329
13 2021-10-30 11:17:47 C 9329
14 2021-10-30 21:46:07 C 9496
15 2021-10-30 22:13:29 A 1218
16 2021-10-31 05:39:38 C 2422
17 2021-10-31 05:39:41 C 9654
18 2021-10-31 10:10:21 A 1951
19 2021-10-31 10:19:45 A 1951
20 2021-10-31 16:40:10 A 1951
21 2021-10-31 16:41:23 C 1951
22 2021-10-31 22:07:16 A 1951
23 2021-11-01 00:26:30 C 2867
24 2021-11-01 01:25:46 B 2867
25 2021-11-01 01:53:16 B 4262
26 2021-11-01 01:58:30 A 4581
27 2021-11-01 05:23:26 C 1734
28 2021-11-01 05:38:22 C 1734
29 2021-11-01 05:47:43 C 1734
30 2021-11-01 06:49:27 A 4813
31 2021-11-01 07:54:02 C 4813
32 2021-11-01 12:10:48 C 8661
33 2021-11-01 14:32:50 C 4138
34 2021-11-01 18:38:16 B 4138
35 2021-11-01 21:33:37 C 4138
36 2021-11-02 02:34:53 B 4138
37 2021-11-02 04:45:56 C 4138
38 2021-11-02 07:33:38 C 4138
39 2021-11-02 07:40:33 C 4138
40 2021-11-02 08:06:21 B 4138
41 2021-11-02 08:32:20 C 4138
42 2021-11-02 09:47:26 A 4138
43 2021-11-02 15:51:33 C 4138
44 2021-11-02 16:04:33 B 2433
45 2021-11-02 20:47:01 B 2433
46 2021-11-03 04:11:57 A 4594
47 2021-11-03 12:36:16 A 6829
48 2021-11-03 14:42:14 A 6829
49 2021-11-03 18:03:27 B 7138
50 2021-11-03 18:46:46 C 7138
51 2021-11-03 19:07:01 C 7138
52 2021-11-03 19:23:02 A 9752
53 2021-11-03 21:10:51 A 2699
54 2021-11-04 00:58:12 C 2699
55 2021-11-04 03:44:12 A 7463
56 2021-11-04 05:40:07 C 4558
57 2021-11-04 05:56:51 C 7855
58 2021-11-04 06:27:28 C 7855
59 2021-11-04 07:50:46 C 7855
期望的结果:
Date Group Repetitions
0 2021-10-29 A 1
1 2021-10-29 B 3
2 2021-10-29 C 3
3 2021-10-30 A 2
4 2021-10-30 B 2
5 2021-10-30 C 2
6 2021-10-31 A 4
7 2021-10-31 B 0
8 2021-10-31 C 1
9 2021-11-01 A 1
10 2021-11-01 B 2
11 2021-11-01 C 7
12 2021-11-02 A 1
13 2021-11-02 B 4
14 2021-11-02 C 5
15 2021-11-03 A 3
16 2021-11-03 B 1
17 2021-11-03 C 2
18 2021-11-04 A 0
19 2021-11-04 B 0
20 2021-11-04 C 4
请注意,重复条件跨越日期和组:上例中的 'ID'
2699
也被视为重复,即使这些重复属于不同的日期和组。
最佳答案
IIUC,您可以将重复的值替换为 1,其他值替换为 0,然后 groupby
+agg
+sum
:
(df.assign(Date=pd.to_datetime(df['Date']).dt.normalize(),
ID=df['ID'].duplicated(keep=False).astype(int)
)
.groupby(['Date', 'Group'], as_index=False).agg(repetitions=('ID', 'sum'))
)
输出:
Date Group repetitions
0 2021-10-29 A 1
1 2021-10-29 B 3
2 2021-10-29 C 3
3 2021-10-30 A 2
4 2021-10-30 B 2
5 2021-10-30 C 2
6 2021-10-31 A 4
7 2021-10-31 C 1
8 2021-11-01 A 1
9 2021-11-01 B 2
10 2021-11-01 C 7
11 2021-11-02 A 1
12 2021-11-02 B 4
13 2021-11-02 C 5
14 2021-11-03 A 3
15 2021-11-03 B 1
16 2021-11-03 C 2
17 2021-11-04 A 0
18 2021-11-04 C 4
添加缺少的组合
dates = pd.to_datetime(df['Date']).dt.normalize()
idx = pd.MultiIndex.from_product([dates.unique(), df['Group'].unique()], names=['Date', 'Group'])
(df.assign(Date=dates,
ID=df['ID'].duplicated(keep=False).astype(int)
)
.groupby(['Date', 'Group']).agg(repetitions=('ID', 'sum'))
.reindex(idx, fill_value=0)
.reset_index()
)
输出:
Date Group repetitions
0 2021-10-29 B 3
1 2021-10-29 A 1
2 2021-10-29 C 3
3 2021-10-30 B 2
4 2021-10-30 A 2
5 2021-10-30 C 2
6 2021-10-31 B 0
7 2021-10-31 A 4
8 2021-10-31 C 1
9 2021-11-01 B 2
10 2021-11-01 A 1
11 2021-11-01 C 7
12 2021-11-02 B 4
13 2021-11-02 A 1
14 2021-11-02 C 5
15 2021-11-03 B 1
16 2021-11-03 A 3
17 2021-11-03 C 2
18 2021-11-04 B 0
19 2021-11-04 A 0
20 2021-11-04 C 4
关于python - pandas 数据框按组和日期计算重复项,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/69973461/