python - Pandas pd.Series.isin的性能与数组和数组的关系

通常，在Python中，最好通过set测试可散列集合的成员资格。我们之所以知道这一点，是因为哈希的使用使O(1)的查找复杂度高于list或np.ndarray的O(n)。

在 Pandas 中，我经常不得不检查非常大的收藏中的成员(member)资格。我认为同样适用，即检查一系列项目中的每个项目是否属于set，这比使用list或np.ndarray更有效。但是，情况似乎并非如此:

import numpy as np
import pandas as pd

np.random.seed(0)

x_set = {i for i in range(100000)}
x_arr = np.array(list(x_set))
x_list = list(x_set)

arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)
lst = arr.tolist()

%timeit ser.isin(x_set)                   # 8.9 ms
%timeit ser.isin(x_arr)                   # 2.17 ms
%timeit ser.isin(x_list)                  # 7.79 ms
%timeit np.in1d(arr, x_arr)               # 5.02 ms
%timeit [i in x_set for i in lst]         # 1.1 ms
%timeit [i in x_set for i in ser.values]  # 4.61 ms

用于测试的版本:

np.__version__  # '1.14.3'
pd.__version__  # '0.23.0'
sys.version     # '3.6.5'

我相信 pd.Series.isin 的源代码利用了 numpy.in1d ，这大概意味着set到np.ndarray转换的开销很大。

消除投入的成本，对 Pandas 的影响:

如果您知道x_list或x_arr的元素是唯一的，请不要麻烦转换为x_set。与Pandas一起使用，这将是昂贵的(转换和成员资格测试)。

使用列表推导是从O(1)集查找中受益的唯一方法。

我的问题是:

我的上述分析正确吗？这似乎是pd.Series.isin的实现方式的一个显而易见但尚未记录的结果。

是否有一种解决方法，而不使用列表推导或pd.Series.apply，但确实使用O(1)集查找？还是以NumPy作为 Pandas 的 Backbone 是不可避免的设计选择和/或必然结果？

更新:在较旧的设置(Pandas/NumPy版本)上，我看到x_set优于x_arr和pd.Series.isin。因此，还有一个问题:从根本上改变了新旧内容，导致set的性能变差了吗？

%timeit ser.isin(x_set)                   # 10.5 ms
%timeit ser.isin(x_arr)                   # 15.2 ms
%timeit ser.isin(x_list)                  # 9.61 ms
%timeit np.in1d(arr, x_arr)               # 4.15 ms
%timeit [i in x_set for i in lst]         # 1.15 ms
%timeit [i in x_set for i in ser.values]  # 2.8 ms

pd.__version__  # '0.19.2'
np.__version__  # '1.11.3'
sys.version     # '3.6.0'

最佳答案

这可能并不明显，但是pd.Series.isin使用O(1)-查找每个元素。

经过分析，证明了上面的陈述，我们将利用其洞察力创建一个Cython原型(prototype)，该原型(prototype)可以轻松击败最快的即用型解决方案。

假设“集合”具有n元素，而“系列”具有m元素。那么运行时间为:

 T(n,m)=T_preprocess(n)+m*T_lookup(n)

对于纯python版本，这意味着:

T_preprocess(n)=0-无需预处理

T_lookup(n)=O(1)-python设置

的众所周知的行为

生成T(n,m)=O(m)

pd.Series.isin(x_arr)会怎样？显然，如果跳过预处理并在线性时间内搜索，则会得到O(n*m)，这是 Not Acceptable 。

在调试器或探查器(我使用valgrind-callgrind + kcachegrind)的帮助下很容易看出来，发生了什么:工作马是__pyx_pw_6pandas_5_libs_9hashtable_23ismember_int64函数。其定义可以在here中找到:

在预处理步骤中，从n中的x_arr元素中(即在运行时O(n)中)创建一个哈希映射( Pandas 使用khash from klib)。

m查找在所构造的哈希图中每个O(1)或总共O(m)中进行。

生成T(n,m)=O(m)+O(n)

我们必须记住-numpy-array的元素是原始C-整数，而不是原始集合中的Python对象-因此我们不能按原样使用集合。

将Python对象集转换为C-int的一种替代方法是将单个C-ints转换为Python-object，从而能够使用原始集。那就是[i in x_set for i in ser.values] -variant中发生的事情:

没有预处理。

每次在O(1)或总共O(m)中进行

m查找，但是由于必须创建Python对象，因此查找速度较慢。

生成T(n,m)=O(m)

显然，您可以使用Cython加快此版本的速度。

但是有足够的理论，让我们看一下带有固定n的不同m的运行时间:

我们可以看到:对于大n，预处理的线性时间主导着numpy版本。从numpy转换为纯python的版本(numpy->python)与纯python版本具有相同的恒定行为，但由于必须进行转换而速度较慢-所有这些都根据我们的分析。

在图中不能很好地看出这一点:如果n < m的numpy版本变得更快-在这种情况下，khash -lib的快速查找将扮演最重要的角色，而不是预处理部分。

我从分析中得出的结论:

n < m:应该采用pd.Series.isin，因为O(n) -preprocessing的代价并不高。

n > m:应该采用[i in x_set for i in ser.values](可能是cythonized版本)，因此应避免使用O(n)。

显然存在一个灰色区域，其中n和m大致相等，如果不进行测试，很难确定哪种解决方案是最佳的。

如果您可以控制它:最好的方法是直接将set构建为C整数集(khash(already wrapped in pandas)，甚至可能是某些c++实现)，从而消除了预处理的需要。我不知道， Pandas 中是否有可以重复使用的东西，但是用Cython编写函数可能没什么大不了的。

问题在于，最后的建议无法立即使用，因为 Pandas 和numpy的界面中都没有集合的概念(至少据我所知)。但是拥有原始C设置接口(interface)将是两全其美的方法:

不需要预处理，因为值已作为一组

传递

不需要转换，因为传递的集合包含原始C值

我编写了一个快速且肮脏的Cython-wrapper for khash(受 Pandas 包装程序的启发)，可以通过pip install https://github.com/realead/cykhash/zipball/master进行安装，然后与Cython配合使用以获得更快的isin版本:

%%cython
import numpy as np
cimport numpy as np

from cykhash.khashsets cimport Int64Set

def isin_khash(np.ndarray[np.int64_t, ndim=1] a, Int64Set b):
    cdef np.ndarray[np.uint8_t,ndim=1, cast=True] res=np.empty(a.shape[0],dtype=np.bool)
    cdef int i
    for i in range(a.size):
        res[i]=b.contains(a[i])
    return res

另一种可能是，可以包装c++的unordered_map(请参见 list C)，这具有需要c++库的缺点，并且(我们将看到)速度稍慢。

比较方法(请参见 list D中的计时创建):

khash比numpy->python快20倍，比纯python快6倍(但是，无论如何，纯python都不是我们想要的)，甚至比cpp版本快3倍。

list

1)用valgrind分析:

#isin.py
import numpy as np
import pandas as pd

np.random.seed(0)

x_set = {i for i in range(2*10**6)}
x_arr = np.array(list(x_set))


arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)


for _ in range(10):
   ser.isin(x_arr)

现在:

>>> valgrind --tool=callgrind python isin.py
>>> kcachegrind

导致以下调用图:

B:用于生成运行时间的ipython代码:

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

np.random.seed(0)

x_set = {i for i in range(10**2)}
x_arr = np.array(list(x_set))
x_list = list(x_set)

arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)
lst = arr.tolist()

n=10**3
result=[]
while n<3*10**6:
    x_set = {i for i in range(n)}
    x_arr = np.array(list(x_set))
    x_list = list(x_set)

    t1=%timeit -o  ser.isin(x_arr) 
    t2=%timeit -o  [i in x_set for i in lst]
    t3=%timeit -o  [i in x_set for i in ser.values]

    result.append([n, t1.average, t2.average, t3.average])
    n*=2

#plotting result:
for_plot=np.array(result)
plt.plot(for_plot[:,0], for_plot[:,1], label='numpy')
plt.plot(for_plot[:,0], for_plot[:,2], label='python')
plt.plot(for_plot[:,0], for_plot[:,3], label='numpy->python')
plt.xlabel('n')
plt.ylabel('running time')
plt.legend()
plt.show()

C:cpp包装器:

%%cython --cplus -c=-std=c++11 -a

from libcpp.unordered_set cimport unordered_set

cdef class HashSet:
    cdef unordered_set[long long int] s
    cpdef add(self, long long int z):
        self.s.insert(z)
    cpdef bint contains(self, long long int z):
        return self.s.count(z)>0

import numpy as np
cimport numpy as np

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)

def isin_cpp(np.ndarray[np.int64_t, ndim=1] a, HashSet b):
    cdef np.ndarray[np.uint8_t,ndim=1, cast=True] res=np.empty(a.shape[0],dtype=np.bool)
    cdef int i
    for i in range(a.size):
        res[i]=b.contains(a[i])
    return res

D:用不同的套包器绘制结果:

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from cykhash import Int64Set

np.random.seed(0)

x_set = {i for i in range(10**2)}
x_arr = np.array(list(x_set))
x_list = list(x_set)


arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)
lst = arr.tolist()

n=10**3
result=[]
while n<3*10**6:
    x_set = {i for i in range(n)}
    x_arr = np.array(list(x_set))
    cpp_set=HashSet()
    khash_set=Int64Set()

    for i in x_set:
        cpp_set.add(i)
        khash_set.add(i)


    assert((ser.isin(x_arr).values==isin_cpp(ser.values, cpp_set)).all())
    assert((ser.isin(x_arr).values==isin_khash(ser.values, khash_set)).all())


    t1=%timeit -o  isin_khash(ser.values, khash_set)
    t2=%timeit -o  isin_cpp(ser.values, cpp_set) 
    t3=%timeit -o  [i in x_set for i in lst]
    t4=%timeit -o  [i in x_set for i in ser.values]

    result.append([n, t1.average, t2.average, t3.average, t4.average])
    n*=2

#ploting result:
for_plot=np.array(result)
plt.plot(for_plot[:,0], for_plot[:,1], label='khash')
plt.plot(for_plot[:,0], for_plot[:,2], label='cpp')
plt.plot(for_plot[:,0], for_plot[:,3], label='pure python')
plt.plot(for_plot[:,0], for_plot[:,4], label='numpy->python')
plt.xlabel('n')
plt.ylabel('running time')
ymin, ymax = plt.ylim()
plt.ylim(0,ymax)
plt.legend()
plt.show()

关于python - Pandas pd.Series.isin的性能与数组和数组的关系，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/50779617/

python - Pandas pd.Series.isin的性能与数组和数组的关系

上一篇：python - 使用 Python 3.8 中的赋值表达式，为什么我们需要在 `as` 中使用 `with` ？

下一篇：python - matplotlib 图例的标题