python - All indices of each unique element in a Python list

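I'm working with a very large dataset (about 75 million entries), and I'm trying to drastically cut the time my code takes to run (right now it uses a loop and takes anywhere from hours to days) while keeping memory usage very low.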

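I have two numpy arrays of the same length (clients and units). My goal is to get, for each value that appears in the first array (clients), the list of every index at which it occurs, and then sum the entries of the second array (units) at those indices.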

This is what I have tried (np is the numpy library, imported earlier):

# create a list of each value that appears in clients
unq = np.unique(clients)
arr = np.zeros(len(unq))
tmp = np.arange(len(clients))
# for each unique value i in clients
for i in range(len(unq)):
    # create a list inds of all the indices at which unq[i] occurs in clients
    inds = tmp[clients == unq[i]]
    # add the sum of all the elements in units at the indices inds to a list
    arr[i] = sum(units[inds])

Does anyone know a way to compute these sums without looping over every element of unq?

Best answer

With Pandas, this can be done easily using the groupby() function:

import pandas as pd

# some fake data
df = pd.DataFrame({'clients': ['a', 'b', 'a', 'a'], 'units': [1, 1, 1, 1]})

print(df.groupby(['clients'], sort=False).sum())

This gives you the desired output:

         units
clients       
a            3
b            1
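
The same call can be applied to the two numpy arrays from the question by wrapping them in a DataFrame first. The following is only a minimal sketch: the small clients and units arrays are made-up stand-ins for the real 75-million-entry data, and the name totals is chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's two equal-length arrays.
clients = np.array([3, 1, 3, 2, 1, 3])
units = np.array([10, 20, 30, 40, 50, 60])

# Put both arrays side by side, then sum units within each client group.
df = pd.DataFrame({'clients': clients, 'units': units})
totals = df.groupby('clients', sort=False)['units'].sum()

print(totals)
```

Here totals is a Series indexed by client value, so totals[3] looks up the sum for client 3 directly.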

I use the sort=False option because it may give a speedup (by default the groups are sorted by key, which can take a while on a huge dataset).
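
If adding a pandas dependency is not an option, the sums can probably also be computed with numpy alone. This is not the method from the answer above but a swapped-in alternative, sketched under the assumption that units is numeric; the sample arrays and the names inverse, order, counts and groups are illustrative only.

```python
import numpy as np

# Hypothetical stand-ins for the question's two equal-length arrays.
clients = np.array([3, 1, 3, 2, 1, 3])
units = np.array([10, 20, 30, 40, 50, 60])

# unq holds the sorted unique values; inverse maps every entry of
# clients to the position of its value in unq.
unq, inverse = np.unique(clients, return_inverse=True)

# Sum units per unique client in one vectorized pass.
# With weights given, bincount returns a float array.
arr = np.bincount(inverse, weights=units)

# Optional: the indices at which each unique value occurs.
# A stable sort keeps the indices of each group in ascending order.
order = np.argsort(clients, kind='stable')
counts = np.bincount(inverse)
groups = np.split(order, np.cumsum(counts)[:-1])

print(unq)     # unique client values
print(arr)     # arr[i] is the total of units for unq[i]
print(groups)  # groups[i] lists the indices where unq[i] occurs
```

Unlike groupby(..., sort=False), np.unique always returns the unique values in sorted order, so arr lines up with the unq array from the question's loop. The index lists are only needed if the positions themselves matter; the sums do not depend on them.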

A similar question about "python - All indices of each unique element in a Python list" can be found on Stack Overflow: https://stackoverflow.com/questions/38126477/
