python - Why don't pandas logical operators align on the index like they should?

Consider this simple setup:

import pandas as pd
x = pd.Series([1, 2, 3], index=list('abc'))
y = pd.Series([2, 3, 3], index=list('bca'))

x

a    1
b    2
c    3
dtype: int64

y

b    2
c    3
a    3
dtype: int64

As you can see, the indexes contain the same labels, just in a different order.

Now, consider a simple logical comparison using the equality (==) operator:

x == y
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

This raises a ValueError, most likely because the indexes do not match. On the other hand, calling the equivalent eq method works:

x.eq(y)

a    False
b     True
c     True
dtype: bool

On the other hand, if y is reordered first, the operator works too:

x == y.reindex_like(x)

a    False
b     True
c     True
dtype: bool

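My understanding was that the method and the operator should do the same thing, all else being equal. What does eq do that the operator comparison does not?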

Best answer

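Look at the full traceback of a comparison between Series with mismatched indexes, paying particular attention to the exception message: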

In [1]: import pandas as pd
In [2]: x = pd.Series([1, 2, 3], index=list('abc'))
In [3]: y = pd.Series([2, 3, 3], index=list('bca'))
In [4]: x == y
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-73b2790c1e5e> in <module>()
----> 1 x == y
/usr/lib/python3.7/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
   1188 
   1189         elif isinstance(other, ABCSeries) and not self._indexed_same(other):
-> 1190             raise ValueError("Can only compare identically-labeled "
   1191                              "Series objects")
   1192 
ValueError: Can only compare identically-labeled Series objects

We can see that this is a deliberate implementation decision. It is also not unique to Series objects: DataFrames raise a similar error.
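As a quick check of the DataFrame case, here is a minimal sketch, assuming pandas 0.19.0 or later. The frames df1 and df2 and the column name v are illustrative and not part of the original question; they simply reuse the question's values.

```python
import pandas as pd

# Same labels in a different order, mirroring x and y from the question.
df1 = pd.DataFrame({'v': [1, 2, 3]}, index=list('abc'))
df2 = pd.DataFrame({'v': [2, 3, 3]}, index=list('bca'))

# The == operator refuses to compare frames whose labels are not identical.
try:
    df1 == df2
    operator_raised = False
except ValueError as exc:
    operator_raised = True
    print(exc)

# The flex method eq aligns on labels before comparing.
flex_result = df1.eq(df2)
print(flex_result)
```

The exact wording of the exception message varies between pandas versions, so the sketch only relies on the exception type.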

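Digging through the Git blame for the relevant lines eventually turns up some related commits and issue tracker threads. For example, Series.__eq__ used to ignore the index of the right-hand side entirely, and in a comment on a bug report about that behavior, pandas author Wes McKinney said: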

This is actually a feature / deliberate choice and not a bug-- it's related to #652. Back in January I changed the comparison methods to do auto-alignment, but found that it led to a large amount of bugs / breakage for users and, in particular, many NumPy functions (which regularly do things like arr[1:] == arr[:-1]; example: np.unique) stopped working.

This gets back to the issue that Series isn't quite ndarray-like enough and should probably not be a subclass of ndarray.

So, I haven't got a good answer for you except for that; auto-alignment would be ideal but I don't think I can do it unless I make Series not a subclass of ndarray. I think this is probably a good idea but not likely to happen until 0.9 or 0.10 (several months down the road).
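The idiom mentioned in the quote, arr[1:] == arr[:-1], depends on purely positional comparison. The following sketch uses made-up data to show how label alignment would change its meaning; the variable names are illustrative, and the explicit align call stands in for what an auto-aligning == would have done.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3, 3])

# Positional comparison on the underlying NumPy array:
# element i + 1 is compared with element i.
arr = s.to_numpy()
adjacent_equal = arr[1:] == arr[:-1]
print(adjacent_equal)  # [ True False False  True]

# Label-aligned comparison: the two slices share labels 1, 2 and 3,
# so every element ends up compared with itself.
shifted, unshifted = s.iloc[1:].align(s.iloc[:-1], join='inner')
aligned_equal = shifted == unshifted
print(aligned_equal.tolist())  # [True, True, True]
```

Code that expects the first result silently gets the second one under auto-alignment, which is the kind of breakage the quote describes.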

This was then changed to the current behavior in pandas 0.19.0. Quoting the "what's new" page:

Following Series operators have been changed to make all operators consistent, including DataFrame (GH1134, GH4581, GH13538)

Series comparison operators now raise ValueError when index are different.

Series logical operators align both index of left and right hand side.

This made Series behavior match that of DataFrame, which already rejected mismatched indexes in comparisons.
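The two rules quoted from the release notes can be checked side by side. This is a minimal sketch assuming pandas 0.19.0 or later; the boolean Series p and q are illustrative and not taken from the question.

```python
import pandas as pd

# Boolean Series with the same labels in a different order.
p = pd.Series([True, False, True], index=list('abc'))
q = pd.Series([True, True, False], index=list('bca'))

# Logical operators (&, |, ^) align both sides on the index first.
both = p & q
print(both)

# Comparison operators (==, !=, <, ...) raise on differently labeled Series.
try:
    p == q
    comparison_raised = False
except ValueError:
    comparison_raised = True
print(comparison_raised)
```

So the "logical operators" in the question title do align; it is the comparison operators, such as ==, that refuse to.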

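In short, making the comparison operators auto-align indexes broke too many things, so raising an error was judged the best alternative.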
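In practice this means alignment has to be requested explicitly when comparing. The sketch below collects three ways to do that with the question's x and y: eq and reindex_like come from the question itself, while the align variant is an additional option shown here for comparison.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=list('abc'))
y = pd.Series([2, 3, 3], index=list('bca'))

# 1. Flex method: aligns on labels, then compares.
via_eq = x.eq(y)

# 2. Reorder y to match x, then use the operator.
via_reindex = x == y.reindex_like(x)

# 3. Align both sides explicitly, then use the operator.
#    Labels missing from one side would become NaN, which compares unequal.
x_aligned, y_aligned = x.align(y, join='outer')
via_align = x_aligned == y_aligned

print(via_eq)
```

All three give a False for label a and True for labels b and c, matching the output shown in the question.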

Source: the Stack Overflow question "Why don't pandas logical operators align on the index like they should?": https://stackoverflow.com/questions/56496554/
