python - Pandas "join" weirdness

If I try this (with two pandas versions of different ages, one on Python 2 and the other on Python 3)

import pandas as pd
x = pd.DataFrame({"id": [1, 2, 3], "value1": [5, 5, 5]})
y = pd.DataFrame({"id": [1], "value2": [10]})

z1 = x.join(y, on="id")
z2 = x.join(y, on="id", lsuffix="_left", rsuffix="_right")
z3 = x.join(y, lsuffix="_left", rsuffix="_right")

The first join fails with a ValueError. The second does not raise, but y is not matched. Only the third produces the expected result, i.e. y's row is matched to x.

The documentation for join says

on : name, tuple/list of names, or array-like

Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation.

Is this a bug (i.e. what happens with z2), or does it make some kind of sense?

Best answer

df.join(...) is normally used to join the index of df with the index of another DataFrame.

df.join(..., on='id') joins the id column of df with the index of the other DataFrame. Per the docs (emphasis mine):

on : name, tuple/list of names, or array-like

Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation

Since x and y look like this:

In [14]: x
Out[14]: 
   id  value1
0   1       5
1   2       5
2   3       5

In [15]: y
Out[15]: 
   id  value2
0   1      10

x.join(y, on='id') tries to join x['id'] (values 1, 2, 3) with y.index (value 0). Since x['id'] and y.index have no values in common, the left join (the default) fills the new columns that come from y with NaN.
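To see why nothing matches, it can help to look at the two sets of keys that join actually compares. The following is a minimal sketch using the frames from the question; the names left_keys, right_keys and matches are only illustrative.

```python
import pandas as pd

x = pd.DataFrame({"id": [1, 2, 3], "value1": [5, 5, 5]})
y = pd.DataFrame({"id": [1], "value2": [10]})

# Keys that x.join(y, on="id") compares:
left_keys = x["id"].tolist()   # values of the caller's "id" column
right_keys = y.index.tolist()  # labels of the other frame's index

# Which rows of x would find a partner in y?
matches = x["id"].isin(y.index)

print(left_keys)
print(right_keys)
print(matches.any())
```

Note that y['id'] plays no part in the comparison; only y.index does.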

z1 = x.join(y, on = "id") raises

ValueError: columns overlap but no suffix specified: Index(['id'], dtype='object')

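because the columns that the join takes from y include id, which is already a column name in x. When column names overlap, you must specify lsuffix, rsuffix, or both to disambiguate them.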
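As a small sketch of that rule (same frames as in the question; the variable names message and z are only illustrative), the overlap error can be caught, and supplying just one suffix is enough to get past it:

```python
import pandas as pd

x = pd.DataFrame({"id": [1, 2, 3], "value1": [5, 5, 5]})
y = pd.DataFrame({"id": [1], "value2": [10]})

# Without any suffix, the shared column name "id" is ambiguous.
try:
    x.join(y, on="id")
except ValueError as err:
    message = str(err)
else:
    message = None
print(message)

# A single suffix already disambiguates: only y's "id" is renamed.
z = x.join(y, on="id", rsuffix="_right")
print(list(z.columns))
```

The suffix only resolves the naming conflict; it does not change which keys are compared.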

z2 = x.join(y, on = "id", lsuffix = "_left", rsuffix = "_right") returns

In [12]: z2
Out[12]: 
   id_left  value1  id_right  value2
0        1       5       NaN     NaN
1        2       5       NaN     NaN
2        3       5       NaN     NaN

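because the column name shared by x and y (namely id) has been disambiguated. The NaN values are due to x['id'] and y.index having no values in common (as explained above).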
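If the intent is for on='id' to match against y's id values, one option is to move id into y's index first. This is a sketch of that idea, not something the question itself tried:

```python
import pandas as pd

x = pd.DataFrame({"id": [1, 2, 3], "value1": [5, 5, 5]})
y = pd.DataFrame({"id": [1], "value2": [10]})

# Make "id" the index of y, so x["id"] is compared with y's id values.
# "id" is no longer a column of y, so no suffix is needed.
z = x.join(y.set_index("id"), on="id")

print(z)
```

Here only the row of x with id 1 picks up a value2; the other rows get NaN because of the left join.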

z3 = x.join(y, lsuffix = "_left", rsuffix = "_right") produces

In [20]: z3
Out[20]: 
   id_left  value1  id_right  value2
0        1       5       1.0    10.0
1        2       5       NaN     NaN
2        3       5       NaN     NaN

because the join is now performed on x.index and y.index. Both contain the label 0, so row 0 of x picks up the only row of y.
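The index-on-index match in z3 lines up with the ids only because the row with id 1 happens to carry label 0 in both frames. The sketch below uses y2, a hypothetical variant of y whose single row has id 2, to show the difference. It also shows DataFrame.merge, a different method from the join discussed above, which pairs rows by column values.

```python
import pandas as pd

x = pd.DataFrame({"id": [1, 2, 3], "value1": [5, 5, 5]})
# y2 is a hypothetical variant of y: its only row has id 2 instead of 1.
y2 = pd.DataFrame({"id": [2], "value2": [10]})

# Index-on-index join: row label 0 of x is paired with row label 0 of y2,
# even though the ids (1 and 2) differ.
by_index = x.join(y2, lsuffix="_left", rsuffix="_right")

# Column-on-column left join via merge: rows are paired by equal id values.
by_id = x.merge(y2, on="id", how="left")

print(by_index)
print(by_id)
```

With merge, value2 lands on the row of x whose id is 2, and the shared id column appears only once in the result.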

Regarding python - Pandas "join" weirdness, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50899179/
