python-3.x - Removing duplicates from a structured NumPy array in Python 3.x

Take the following array:

import numpy as np

arr_dupes = np.array(
    [
      ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 246),
      ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 246),
      ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 222),
      ('2017-09-13T11:04:00.000000',  1.32683,  1.32686,  1.32682,  1.32685,  1.32682,  1.32684,  1.3268 ,  1.32684,  97),
      ('2017-09-13T11:03:00.000000',  1.32664,  1.32684,  1.32663,  1.32683,  1.32664,  1.32683,  1.32661,  1.32682, 268),
      ('2017-09-13T11:02:00.000000',  1.3268 ,  1.32692,  1.3266 ,  1.32664,  1.32678,  1.32689,  1.32658,  1.32664, 299),
      ('2017-09-13T11:02:00.000000',  1.3268 ,  1.32692,  1.3266 ,  1.32664,  1.32678,  1.32689,  1.32658,  1.32664, 299),
      ('2017-09-13T11:01:00.000000',  1.32648,  1.32682,  1.32648,  1.3268 ,  1.32647,  1.32682,  1.32647,  1.32678, 322),
      ('2017-09-13T11:00:00.000000',  1.32647,  1.32649,  1.32628,  1.32648,  1.32644,  1.32651,  1.32626,  1.32647, 285)],
      dtype=[('date', '<M8[us]'), ('askopen', '<f8'), ('askhigh', '<f8'), ('asklow', '<f8'), ('askclose', '<f8'),
             ('bidopen', '<f8'), ('bidhigh', '<f8'), ('bidlow', '<f8'), ('bidclose', '<f8'), ('volume', '<i8')]
)

What is the fastest way to remove duplicates, using the date as the index and keeping the last value?

The pandas DataFrame equivalent is

In [5]: df = pd.DataFrame(arr_dupes, index=arr_dupes['date'])
In [6]: df
Out[6]:
                                   date  askopen  askhigh   asklow  askclose  bidopen  bidhigh   bidlow  bidclose  volume
2017-09-13 11:05:00 2017-09-13 11:05:00  1.32685  1.32704  1.32682   1.32686  1.32684  1.32702  1.32679   1.32683     246
2017-09-13 11:05:00 2017-09-13 11:05:00  1.32685  1.32704  1.32682   1.32686  1.32684  1.32702  1.32679   1.32683     246
2017-09-13 11:05:00 2017-09-13 11:05:00  1.32685  1.32704  1.32682   1.32686  1.32684  1.32702  1.32679   1.32683     222
2017-09-13 11:04:00 2017-09-13 11:04:00  1.32683  1.32686  1.32682   1.32685  1.32682  1.32684  1.32680   1.32684      97
2017-09-13 11:03:00 2017-09-13 11:03:00  1.32664  1.32684  1.32663   1.32683  1.32664  1.32683  1.32661   1.32682     268
2017-09-13 11:02:00 2017-09-13 11:02:00  1.32680  1.32692  1.32660   1.32664  1.32678  1.32689  1.32658   1.32664     299
2017-09-13 11:02:00 2017-09-13 11:02:00  1.32680  1.32692  1.32660   1.32664  1.32678  1.32689  1.32658   1.32664     299
2017-09-13 11:01:00 2017-09-13 11:01:00  1.32648  1.32682  1.32648   1.32680  1.32647  1.32682  1.32647   1.32678     322
2017-09-13 11:00:00 2017-09-13 11:00:00  1.32647  1.32649  1.32628   1.32648  1.32644  1.32651  1.32626   1.32647     285

In [7]: df.reset_index().drop_duplicates(subset='date', keep='last').set_index('date')
Out[7]:
                                  index  askopen  askhigh   asklow  askclose  bidopen  bidhigh   bidlow  bidclose  volume
date
2017-09-13 11:05:00 2017-09-13 11:05:00  1.32685  1.32704  1.32682   1.32686  1.32684  1.32702  1.32679   1.32683     222
2017-09-13 11:04:00 2017-09-13 11:04:00  1.32683  1.32686  1.32682   1.32685  1.32682  1.32684  1.32680   1.32684      97
2017-09-13 11:03:00 2017-09-13 11:03:00  1.32664  1.32684  1.32663   1.32683  1.32664  1.32683  1.32661   1.32682     268
2017-09-13 11:02:00 2017-09-13 11:02:00  1.32680  1.32692  1.32660   1.32664  1.32678  1.32689  1.32658   1.32664     299
2017-09-13 11:01:00 2017-09-13 11:01:00  1.32648  1.32682  1.32648   1.32680  1.32647  1.32682  1.32647   1.32678     322
2017-09-13 11:00:00 2017-09-13 11:00:00  1.32647  1.32649  1.32628   1.32648  1.32644  1.32651  1.32626   1.32647     285

numpy.unique appears to compare the entire tuple, so it still returns rows that share a date.

The final output should look like this.

array([
      ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 222),
      ('2017-09-13T11:04:00.000000',  1.32683,  1.32686,  1.32682,  1.32685,  1.32682,  1.32684,  1.3268 ,  1.32684,  97),
      ('2017-09-13T11:03:00.000000',  1.32664,  1.32684,  1.32663,  1.32683,  1.32664,  1.32683,  1.32661,  1.32682, 268),
      ('2017-09-13T11:02:00.000000',  1.3268 ,  1.32692,  1.3266 ,  1.32664,  1.32678,  1.32689,  1.32658,  1.32664, 299),
      ('2017-09-13T11:01:00.000000',  1.32648,  1.32682,  1.32648,  1.3268 ,  1.32647,  1.32682,  1.32647,  1.32678, 322),
      ('2017-09-13T11:00:00.000000',  1.32647,  1.32649,  1.32628,  1.32648,  1.32644,  1.32651,  1.32626,  1.32647, 285)],
      dtype=[('date', '<M8[us]'), ('askopen', '<f8'), ('askhigh', '<f8'), ('asklow', '<f8'), ('askclose', '<f8'),
             ('bidopen', '<f8'), ('bidhigh', '<f8'), ('bidlow', '<f8'), ('bidclose', '<f8'), ('volume', '<i8')]
)

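Thanks

Best answer

It seems the solution to your problem does not have to mimic the pandas drop_duplicates() function, but I will provide one approach that mimics it and one that does not.

If you need the same keep-last behavior as pandas drop_duplicates(), you can use the following code: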

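# arr_dupes is initialized here, as in the question

# Unique dates of the reversed array, plus the index of each date's first
# occurrence in the reversed array (its last occurrence in arr_dupes)
helper1, helper2 = np.unique(arr_dupes['date'][::-1], return_index=True)

# Pick those rows from the reversed array, then reverse again
result = arr_dupes[::-1][helper2][::-1]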

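Once arr_dupes is initialized, you only need to pass the 'date' column to numpy.unique(). Because you want the last of each group of repeated elements, you also have to reverse the array passed to unique() with [::-1]. That way unique() discards every repeated element except the last one. unique() returns the unique elements (helper1) as its first return value and, as its second, the indices of those elements in the reversed array (helper2). Finally, a new array is built by selecting the rows listed in helper2 from the reversed array and reversing the result again. Note that unique() sorts its output, so the result comes back ordered by date, newest first; that matches the original row order here only because arr_dupes is already sorted that way.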

This solution is about 9.898 times faster than the pandas version.
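If the input is not already sorted by date, the indices returned by unique() can be mapped back to the original array and sorted, so that the surviving rows keep their original relative order. The following is only a minimal sketch of that idea, not part of the original answer: the function name drop_duplicates_keep_last and the small sample array are hypothetical, and it has not been timed against either solution here.

```python
import numpy as np

def drop_duplicates_keep_last(arr, key):
    """Keep the last row for each distinct arr[key], preserving row order."""
    reversed_keys = arr[key][::-1]
    # For each distinct key, np.unique reports the index of its first
    # occurrence in the reversed view, i.e. its last occurrence in arr.
    _, first_in_reversed = np.unique(reversed_keys, return_index=True)
    # Map positions back to the original array and sort them so the
    # surviving rows keep their original relative order.
    last_in_original = np.sort(len(arr) - 1 - first_in_reversed)
    return arr[last_in_original]

# Hypothetical sample, deliberately NOT sorted by date.
sample = np.array(
    [('2017-09-13T11:02', 1), ('2017-09-13T11:05', 2),
     ('2017-09-13T11:02', 3), ('2017-09-13T11:04', 4),
     ('2017-09-13T11:05', 5)],
    dtype=[('date', '<M8[m]'), ('volume', '<i8')],
)

deduped = drop_duplicates_keep_last(sample, 'date')
print(deduped['volume'])  # volumes of the rows that survive: 3, 4, 5
```

The extra np.sort costs a little time, so for input that is already sorted by date the shorter version above is sufficient.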

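Now let me explain what I meant at the beginning of this answer. It looks to me as if your array is sorted by the 'date' column. If that is true, we can assume the duplicates are grouped together. If they are grouped together, we only need to keep the rows whose 'date' value differs from the 'date' value of the next row. For example, look at the following array rows: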

...
  ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 246),
  ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 246),
  ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 222),
  ('2017-09-13T11:04:00.000000',  1.32683,  1.32686,  1.32682,  1.32685,  1.32682,  1.32684,  1.3268 ,  1.32684,  97),
...

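The 'date' value of the third row differs from that of the fourth, so we need to keep it. No further checks are needed. The 'date' value of the first row is the same as that of the second, so we do not need that row. The same goes for the second row. In code it looks like this: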

# arr_dupes is initialized here, as in the question

# Keep each row whose date differs from the next row's date; always keep the last row
result = arr_dupes[np.concatenate((arr_dupes['date'][:-1] != arr_dupes['date'][1:], np.array([True])))]

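First, each element of the 'date' column is compared with the next element, which produces an array of True and False values. If an index in this boolean array holds True, the arr_dupes element at that index is kept; otherwise it is dropped. Then concatenate() simply appends a final True to the boolean array, because the last element always has to stay in the result.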

This solution is about 17 times faster than the pandas version.
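To make the mask easier to see, here is the same neighbor comparison on a smaller array. This is an illustrative sketch only: the function name keep_last_of_each_run and the reduced two-field sample are hypothetical, and it relies on the same assumption as above, namely that rows with equal dates are adjacent.

```python
import numpy as np

def keep_last_of_each_run(arr, key):
    """Keep the last row of every run of equal consecutive arr[key] values."""
    if len(arr) == 0:
        return arr
    keys = arr[key]
    # True where a row's key differs from the key of the row after it.
    differs_from_next = keys[:-1] != keys[1:]
    # The final row has no successor, so it is always kept.
    mask = np.append(differs_from_next, True)
    return arr[mask]

# Hypothetical reduced sample: only the date and volume fields.
sample = np.array(
    [('2017-09-13T11:05', 246), ('2017-09-13T11:05', 246),
     ('2017-09-13T11:05', 222), ('2017-09-13T11:04', 97),
     ('2017-09-13T11:02', 299), ('2017-09-13T11:02', 299)],
    dtype=[('date', '<M8[m]'), ('volume', '<i8')],
)

result = keep_last_of_each_run(sample, 'date')
print(result['volume'])  # volumes of the rows that survive: 222, 97, 299
```

If equal dates are not adjacent, this keeps one row per run rather than one row per date, so it is only a drop-in replacement when the array is sorted or grouped by date.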

Regarding python-3.x - Removing duplicates from a structured NumPy array in Python 3.x, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46390376/
