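Note: the post below is the "multi-key" counterpart of an earlier question of mine. The solutions to that earlier question work only when the join is on a single key, and it is not clear to me how to generalize them to the multi-key case presented below. Since, IME, modifying an already-answered question in a way that invalidates the answers it has received is frowned upon on SO, I am posting this variant separately. I have also posted a question on Meta SO asking whether I should delete this post and instead modify the original question, at the cost of invalidating its current answers.
Below are small/toy versions of the much larger and more complex dataframes I am working with: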
>>> A
  key1 key2         u         v         w         x
0    a    G  0.757954  0.258917  0.404934  0.303313
1    b    H  0.583382  0.504687       NaN  0.618369
2    c    I       NaN  0.982785  0.902166       NaN
3    d    J  0.898838  0.472143       NaN  0.610887
4    e    K  0.966606  0.865310       NaN  0.548699
5    f    L       NaN  0.398824  0.668153       NaN
>>> B
  key1 key2         y         z
0    a    G  0.867603       NaN
1    b    H       NaN  0.191067
2    c    I  0.238616  0.803179
3    d    G  0.080446       NaN
4    e    H  0.932834       NaN
5    f    I  0.706561  0.814467
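(FWIW, at the end of this post I provide the code that generates these dataframes.)
I want to produce the outer join of these dataframes on the key1 and key2 columns, such that the new positions introduced by the outer join receive the default value 0.0. IOW, the desired result looks like this: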
  key1 key2         u         v         w         x         y         z
0    a    G  0.757954  0.258917  0.404934  0.303313  0.867603       NaN
1    b    H  0.583382  0.504687       NaN  0.618369       NaN  0.191067
2    c    I       NaN  0.982785  0.902166       NaN  0.238616  0.803179
3    d    J  0.898838  0.472143       NaN  0.610887  0.000000  0.000000
4    e    K  0.966606  0.865310       NaN  0.548699  0.000000  0.000000
5    f    L       NaN  0.398824  0.668153       NaN  0.000000  0.000000
6    d    G  0.000000  0.000000  0.000000  0.000000  0.080446       NaN
7    e    H  0.000000  0.000000  0.000000  0.000000  0.932834       NaN
8    f    I  0.000000  0.000000  0.000000  0.000000  0.706561  0.814467
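(Note that this desired output contains some NaNs, namely those that were already present in A or B.)
The merge method gets me halfway there, but the default value it fills in is NaN, not 0.0: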
>>> C = pandas.DataFrame.merge(A, B, how='outer', on=('key1', 'key2'))
>>> C
  key1 key2         u         v         w         x         y         z
0    a    G  0.757954  0.258917  0.404934  0.303313  0.867603       NaN
1    b    H  0.583382  0.504687       NaN  0.618369       NaN  0.191067
2    c    I       NaN  0.982785  0.902166       NaN  0.238616  0.803179
3    d    J  0.898838  0.472143       NaN  0.610887       NaN       NaN
4    e    K  0.966606  0.865310       NaN  0.548699       NaN       NaN
5    f    L       NaN  0.398824  0.668153       NaN       NaN       NaN
6    d    G       NaN       NaN       NaN       NaN  0.080446       NaN
7    e    H       NaN       NaN       NaN       NaN  0.932834       NaN
8    f    I       NaN       NaN       NaN       NaN  0.706561  0.814467
The fillna method does not produce the desired output, because it also modifies some positions that should be left unchanged:
>>> C.fillna(0.0)
  key1 key2         u         v         w         x         y         z
0    a    G  0.757954  0.258917  0.404934  0.303313  0.867603  0.000000
1    b    H  0.583382  0.504687  0.000000  0.618369  0.000000  0.191067
2    c    I  0.000000  0.982785  0.902166  0.000000  0.238616  0.803179
3    d    J  0.898838  0.472143  0.000000  0.610887  0.000000  0.000000
4    e    K  0.966606  0.865310  0.000000  0.548699  0.000000  0.000000
5    f    L  0.000000  0.398824  0.668153  0.000000  0.000000  0.000000
6    d    G  0.000000  0.000000  0.000000  0.000000  0.080446  0.000000
7    e    H  0.000000  0.000000  0.000000  0.000000  0.932834  0.000000
8    f    I  0.000000  0.000000  0.000000  0.000000  0.706561  0.814467
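The distinction at stake here (NaNs that were already in A or B versus NaNs created by the outer join) can be made visible with the `indicator` parameter of `merge`. The following is only an illustrative sketch: the frames `left` and `right` are tiny hypothetical stand-ins, not the A and B of this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; each contains one NaN of its own.
left = pd.DataFrame({"key1": ["a", "b"], "key2": ["G", "H"], "u": [1.0, np.nan]})
right = pd.DataFrame({"key1": ["a", "c"], "key2": ["G", "I"], "y": [np.nan, 2.0]})

# indicator=True adds a `_merge` column whose value is
# 'both', 'left_only' or 'right_only' for every output row.
merged = left.merge(right, how="outer", on=["key1", "key2"], indicator=True)
print(merged)
```

In this sketch the row with key (b, H) is marked `left_only`: its NaN in `u` was already present in `left`, while its NaN in `y` exists only because `right` has no matching row. A blanket `fillna(0.0)` cannot tell the two apart.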
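How can I achieve the desired output efficiently? (Performance matters here, because I intend to perform this operation on dataframes much larger than the ones shown here.)
IMPORTANT: to keep the example minimal, I made the multi-key consist of only two columns; in practice the number of keys in a multi-key can be much larger. Proposed answers should be suitable for multi-keys consisting of at least half a dozen columns.
FWIW, below is the code that generates the example dataframes A and B.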
from pandas import DataFrame
from collections import OrderedDict
from random import random, seed
def make_dataframe(rows, colnames):
    return DataFrame(OrderedDict([(n, [row[i] for row in rows])
                                  for i, n in enumerate(colnames)]))
maybe_nan = lambda: float('nan') if random() < 0.4 else random()
seed(0)
A = make_dataframe([['A', 'g', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
['B', 'h', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
['C', 'i', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
['D', 'j', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
['E', 'k', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
['F', 'l', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()]],
('key1', 'key2', 'u', 'v', 'w', 'x'))
B = make_dataframe([['A', 'g', maybe_nan(), maybe_nan()],
['B', 'h', maybe_nan(), maybe_nan()],
['C', 'i', maybe_nan(), maybe_nan()],
['D', 'g', maybe_nan(), maybe_nan()],
['E', 'h', maybe_nan(), maybe_nan()],
['F', 'i', maybe_nan(), maybe_nan()]],
('key1', 'key2', 'y', 'z'))
Best answer
Set the keys as the index of both DFs:
def index_set(frame, keys=['key1', 'key2']):
    # Note: set_index(..., inplace=True) modifies `frame` itself.
    frame.set_index(keys, inplace=True)
    return frame
Take the subset of rows of a DF that contain NaN values:
def nulls(frame):
    nulls_in_frame = frame[frame.isnull().any(axis=1)].reset_index()
    return nulls_in_frame
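As a small illustration of what this row selection does, here is a sketch on a hypothetical key-indexed frame `demo` (the name and values are made up for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical frame, already indexed by its two key columns.
demo = pd.DataFrame(
    {"key1": ["a", "b", "c"], "key2": ["G", "H", "I"],
     "u": [1.0, np.nan, 3.0], "v": [4.0, 5.0, np.nan]}
).set_index(["key1", "key2"])

# True for every row that has at least one NaN among its value columns.
has_nan = demo.isnull().any(axis=1)

# Keep only those rows and turn the keys back into ordinary columns.
rows_with_nan = demo[has_nan].reset_index()
print(rows_with_nan)
```

Only the rows keyed (b, H) and (c, I) survive, since each of them has a NaN in at least one value column.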
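Join the two DFs. For each input DF, concatenate the joined DF with the subset of that DF's rows that contain NaNs and drop every duplicated key (keep=False), so that only the rows without pre-existing NaNs remain; then fill the NaNs left in those rows with 0.
Finally, use combine_first in a chained operation on the joined DF to patch in those values.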
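import pandas as pd

def perform_join(fr_1, fr_2, keys=['key1', 'key2']):
    # Index copies so that the caller's frames are left untouched.
    fr_1 = index_set(fr_1.copy(), keys); fr_2 = index_set(fr_2.copy(), keys)
    frame = fr_1.join(fr_2, how='outer').reset_index()
    # Joined rows whose keys have no NaN-containing row in fr_1 (resp. fr_2), NaNs set to 0.
    cat_fr_1 = pd.concat([frame, nulls(fr_1)]).drop_duplicates(keys, keep=False).fillna(0)
    cat_fr_2 = pd.concat([frame, nulls(fr_2)]).drop_duplicates(keys, keep=False).fillna(0)
    # Patch the zeros into the joined frame, one source frame at a time.
    fr_1_join = frame.combine_first(frame.fillna(cat_fr_1[fr_1.columns]))
    joined_frame = fr_1_join.combine_first(frame.fillna(cat_fr_2[fr_2.columns]))
    return joined_frame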
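The concat followed by drop_duplicates(keys, keep=False) acts like an anti-join: every key that appears in both frames is removed entirely. A minimal sketch of just that step, using hypothetical frames `joined` and `flagged`:

```python
import numpy as np
import pandas as pd

# Hypothetical joined frame with three keys.
joined = pd.DataFrame({"key1": ["a", "b", "c"], "key2": ["G", "H", "I"],
                       "u": [1.0, np.nan, 3.0], "v": [4.0, 5.0, 6.0]})

# Hypothetical subset of rows that already contained a NaN before the join.
flagged = pd.DataFrame({"key1": ["b"], "key2": ["H"],
                        "u": [np.nan], "v": [5.0]})

# keep=False drops *all* rows whose key occurs more than once, so the
# key (b, H), present in both frames, disappears from the result.
kept = pd.concat([joined, flagged]).drop_duplicates(["key1", "key2"], keep=False)
print(kept)
```

What remains are the rows of `joined` whose keys are not in `flagged`, which are exactly the rows where filling NaNs with 0 is safe.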
Finally,
perform_join(A, B)
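For comparison, here is a sketch of a different technique (not the one used in the answer above): let `merge` record which side each row came from via `indicator=True`, and fill only the cells the join itself created. The function name `outer_join_fill` and the demo frames are hypothetical. The sketch assumes that the two frames share no column names other than the keys and that the value columns are numeric. No performance claim is made.

```python
import numpy as np
import pandas as pd

def outer_join_fill(left, right, keys, fill_value=0.0):
    """Outer-join on `keys`, filling only the cells created by the join."""
    keys = list(keys)
    merged = left.merge(right, how="outer", on=keys, indicator=True)
    # Assumption: apart from the keys, the frames have disjoint column names.
    left_cols = [c for c in left.columns if c not in keys]
    right_cols = [c for c in right.columns if c not in keys]
    # Rows missing from `left` have join-created NaNs in all left columns.
    merged.loc[merged["_merge"] == "right_only", left_cols] = fill_value
    # Rows missing from `right` have join-created NaNs in all right columns.
    merged.loc[merged["_merge"] == "left_only", right_cols] = fill_value
    return merged.drop(columns="_merge")

# Hypothetical demo frames, each with one NaN of its own.
A_demo = pd.DataFrame({"key1": ["a", "d"], "key2": ["G", "J"], "u": [0.5, np.nan]})
B_demo = pd.DataFrame({"key1": ["a", "d"], "key2": ["G", "G"], "y": [np.nan, 0.25]})

result = outer_join_fill(A_demo, B_demo, ["key1", "key2"])
print(result)
```

Because `keys` is just a list of column names, the same call should work unchanged for a multi-key of six or more columns. NaNs that were present in the inputs are left alone, since only rows marked `left_only` or `right_only` are touched, and only in the columns of the frame that lacked them.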
Regarding python - default/fill values for an outer join on a *multi-key*, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39751636/