I am trying to group rows based on the sequential relationship between two columns.
d = {'df1': [10, 20, 30, 60, 80, 40, 30, 70], 'df2': [20, 30, 40, 80, 70, 50, 90, 100]}
df = pd.DataFrame(data=d)
df
df1 df2
0 10 20
1 20 30
2 30 40
3 60 80
4 80 70
5 40 50
6 30 90
7 70 100
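To make this clearer: df1 and df2 are related through their sequence. For example, 10 has a direct relationship with 20, and an indirect relationship with 30 through 20. 10 also has an indirect relationship with 40 through 20 and 30. As another example, 80 has a direct relationship with 70, and an indirect relationship with 100 through 70. The same applies to the remaining column values.
I expect the following result: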
df1 | df2
-----|-------------------
0 10 | 20, 30, 40, 50, 90
1 20 | 30, 40, 50, 90
2 30 | 40, 50, 90
3 60 | 80, 70, 100
4 80 | 70, 100
5 40 | 50
6 70 | 100
I tried the script below, but could not get it to work.
(df.groupby('df1')
   .agg({'df2': ','.join})
   .reset_index()
   .reindex(columns=df.columns))
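Can anyone help with this challenge? If a similar solution already exists on Stack Overflow, please let me know.
Edit:
The first answer works perfectly with the example above, but it does not work correctly on the data I actually need to process. My real data looks like this.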
df1 df2
0 10 20
1 10 30
2 10 80
3 10 90
4 10 120
5 10 140
6 10 170
7 20 180
8 30 40
9 30 165
10 30 175
11 40 20
12 40 50
13 50 60
14 60 70
15 70 180
16 80 180
17 90 100
18 100 110
19 110 180
20 120 130
21 130 180
22 140 150
23 150 160
24 160 165
25 165 180
26 165 200
27 170 175
28 175 180
29 175 200
30 180 190
31 190 200
32 200 210
33 210 220
34 220 230
35 230 240
36 240 -
Best answer
One possible solution:
import pandas as pd
from itertools import chain

l1 = [10, 20, 30, 60, 80, 40, 30, 70]
l2 = [20, 30, 40, 80, 70, 50, 90, 100]

# map each df1 value to the list of its direct df2 successors
d = dict()
for i, j in zip(l1, l2):
    if i == j:
        continue
    d.setdefault(i, []).append(j)

# append indirect successors; d[k] grows while the generator iterates over it
for k in d:
    d[k].extend(chain.from_iterable(d.get(v, []) for v in d[k]))

df = pd.DataFrame({'df1': list(d.keys()), 'df2': [', '.join(str(v) for v in d[k]) for k in d]})
print(df)
Prints:
df1 df2
0 10 20, 30, 40, 90, 50
1 20 30, 40, 90, 50
2 30 40, 90, 50
3 60 80, 70, 100
4 80 70, 100
5 40 50
6 70 100
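The `extend` call above is compact but easy to misread, because the list is extended while the generator is still walking over it. The sketch below spells out the same idea with an explicit index loop. The function name `expand` is hypothetical and not part of the original answer, and the sketch assumes the relationships contain no cycles (with a cycle, this loop would never terminate).

```python
def expand(pairs):
    """Map each left value to its direct and indirect successors (assumes no cycles)."""
    d = {}
    for src, dst in pairs:
        if src != dst:
            d.setdefault(src, []).append(dst)

    for key in d:
        i = 0
        # d[key] grows inside the loop, so len() is re-evaluated on every pass
        while i < len(d[key]):
            d[key].extend(d.get(d[key][i], []))
            i += 1
    return d


l1 = [10, 20, 30, 60, 80, 40, 30, 70]
l2 = [20, 30, 40, 80, 70, 50, 90, 100]
closure = expand(zip(l1, l2))
print(closure[10])  # [20, 30, 40, 90, 50]
```

Successors are appended in the order they are discovered, which is why 90 comes before 50 in the first row.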
Edit: another solution, based on the new input data. It now checks for possible cycles in the path:
import pandas as pd
data = '''
0 10 20
1 10 30
2 10 80
3 10 90
4 10 120
5 10 140
6 10 170
7 20 180
8 30 40
9 30 165
10 30 175
11 40 20
12 40 50
13 50 60
14 60 70
15 70 180
16 80 180
17 90 100
18 100 110
19 110 180
20 120 130
21 130 180
22 140 150
23 150 160
24 160 165
25 165 180
26 165 200
27 170 175
28 175 180
29 175 200
30 180 190
31 190 200
32 200 210
33 210 220
34 220 230
35 230 240
36 240 -
'''
df1, df2 = [], []
for line in data.splitlines()[:-1]:  # <--- drop the last row, which holds the `-` character
    line = line.strip().split()
    if not line:
        continue
    df1.append(int(line[1]))
    df2.append(int(line[2]))

from pprint import pprint

# map each df1 value to the list of its direct df2 successors
d = dict()
for i, j in zip(df1, df2):
    if i == j:
        continue
    d.setdefault(i, []).append(j)

for k in d:
    # indirect successors already appended for this key; direct successors
    # are not added to `seen`, so a value can still appear twice
    seen = set()
    for v in d[k]:  # d[k] grows while it is being iterated
        for val in d.get(v, []):
            if val not in seen:
                seen.add(val)
                d[k].append(val)

df = pd.DataFrame({'df1': list(d.keys()), 'df2': [', '.join(str(v) for v in d[k]) for k in d]})
print(df)
Prints:
df1 df2
0 10 20, 30, 80, 90, 120, 140, 170, 180, 40, 165, 1...
1 20 180, 190, 200, 210, 220, 230, 240
2 30 40, 165, 175, 20, 50, 180, 200, 190, 210, 220,...
3 40 20, 50, 180, 190, 200, 210, 220, 230, 240, 60, 70
4 50 60, 70, 180, 190, 200, 210, 220, 230, 240
5 60 70, 180, 190, 200, 210, 220, 230, 240
6 70 180, 190, 200, 210, 220, 230, 240
7 80 180, 190, 200, 210, 220, 230, 240
8 90 100, 110, 180, 190, 200, 210, 220, 230, 240
9 100 110, 180, 190, 200, 210, 220, 230, 240
10 110 180, 190, 200, 210, 220, 230, 240
11 120 130, 180, 190, 200, 210, 220, 230, 240
12 130 180, 190, 200, 210, 220, 230, 240
13 140 150, 160, 165, 180, 200, 190, 210, 220, 230, 240
14 150 160, 165, 180, 200, 190, 210, 220, 230, 240
15 160 165, 180, 200, 190, 210, 220, 230, 240
16 165 180, 200, 190, 210, 200, 220, 230, 240
17 170 175, 180, 200, 190, 210, 220, 230, 240
18 175 180, 200, 190, 210, 200, 220, 230, 240
19 180 190, 200, 210, 220, 230, 240
20 190 200, 210, 220, 230, 240
21 200 210, 220, 230, 240
22 210 220, 230, 240
23 220 230, 240
24 230 240
Or, with
pprint(d, width=250)
:
{10: [20, 30, 80, 90, 120, 140, 170, 180, 40, 165, 175, 100, 130, 150, 190, 20, 50, 200, 110, 160, 60, 210, 70, 220, 230, 240],
20: [180, 190, 200, 210, 220, 230, 240],
30: [40, 165, 175, 20, 50, 180, 200, 190, 210, 220, 230, 240, 60, 70],
40: [20, 50, 180, 190, 200, 210, 220, 230, 240, 60, 70],
50: [60, 70, 180, 190, 200, 210, 220, 230, 240],
60: [70, 180, 190, 200, 210, 220, 230, 240],
70: [180, 190, 200, 210, 220, 230, 240],
80: [180, 190, 200, 210, 220, 230, 240],
90: [100, 110, 180, 190, 200, 210, 220, 230, 240],
100: [110, 180, 190, 200, 210, 220, 230, 240],
110: [180, 190, 200, 210, 220, 230, 240],
120: [130, 180, 190, 200, 210, 220, 230, 240],
130: [180, 190, 200, 210, 220, 230, 240],
140: [150, 160, 165, 180, 200, 190, 210, 220, 230, 240],
150: [160, 165, 180, 200, 190, 210, 220, 230, 240],
160: [165, 180, 200, 190, 210, 220, 230, 240],
165: [180, 200, 190, 210, 200, 220, 230, 240],
170: [175, 180, 200, 190, 210, 220, 230, 240],
175: [180, 200, 190, 210, 200, 220, 230, 240],
180: [190, 200, 210, 220, 230, 240],
190: [200, 210, 220, 230, 240],
200: [210, 220, 230, 240],
210: [220, 230, 240],
220: [230, 240],
230: [240]}
Edit 2: If df is the input dataframe with columns "df1" and "df2":
from pprint import pprint
d = dict()
for i, j in zip(df.df1, df.df2):
    if i == j:
        continue
    if j == '-':  # <-- this will remove the `-` character in df2
        continue
    d.setdefault(i, []).append(j)

for k in d:
    seen = set()
    for v in d[k]:
        for val in d.get(v, []):
            if val not in seen:
                seen.add(val)
                d[k].append(val)

df = pd.DataFrame({'df1': list(d.keys()), 'df2': [', '.join(str(v) for v in d[k]) for k in d]})
print(df)
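In the printed output above, a few values show up twice (for example, 200 is listed twice for key 165), because direct successors are never recorded in `seen`. If duplicates are not wanted, one alternative is a plain breadth-first traversal per key. This is only a sketch, not the answer's method: the function name `reachable` is hypothetical, it takes plain (df1, df2) pairs rather than a dataframe, and it assumes a key should not be listed among its own successors even when a cycle leads back to it.

```python
from collections import deque


def reachable(pairs):
    """For each left value, list every value reachable from it, without duplicates."""
    graph = {}
    for src, dst in pairs:
        if src == dst or dst == '-':  # skip self-links and the `-` placeholder
            continue
        graph.setdefault(src, []).append(dst)

    result = {}
    for start in graph:
        seen = {start}  # assumption: a key is never its own successor
        order = []
        queue = deque(graph[start])
        while queue:
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            order.append(node)
            queue.extend(graph.get(node, []))
        result[start] = order
    return result


# a small slice of the real data, plus the trailing `-` row
pairs = [(165, 180), (165, 200), (180, 190), (190, 200), (200, 210), (210, '-')]
closure = reachable(pairs)
print(closure[165])  # [180, 200, 190, 210]
```

With the real dataframe, the pairs could be supplied as `zip(df.df1, df.df2)`, and the result turned back into a dataframe the same way as in the code above. Because every visited node is recorded in `seen`, the traversal also terminates when the data contains a cycle.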
Regarding python - how to group dataframe columns based on their sequence relationship, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59456922/