python - Compute the mean of DataFrame row groups given by a 2D list of indices of unequal length

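I have a DataFrame with n rows. I also have a 2D array of indices. This array also has n rows, but each row can have a different length. I need to group the DataFrame rows according to these indices and compute the mean of a column for each group.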

For example:

If I have a DataFrame df and an array ind, I need to get

[df.loc[idx, col_name].mean() for idx in ind]

I have implemented this using the pandas apply function:

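import numpy as np
import pandas as pd

size = 100000
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)
np.random.seed(1)
# one row of 5 random positions per DataFrame row
ind = np.array([np.random.randint(0, size, size=5) for _ in range(size)])
def group(row):
    # look up this row's index list and average column 'a' over it
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()
df['avg'] = df.apply(group, axis=1)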

But this is slow and scales very poorly. In this case, it is much faster to do

df.a.values[ind].mean(axis=1)

However, as far as I can tell, this only works because all elements of ind have the same length; the following code does not work:

new_ind = ind.tolist()
new_ind[0].pop()
df.a.values[new_ind].mean(axis=1)

I tried the pandas groupby method, but without success. Is there another efficient way to group rows by index lists of unequal length and return the mean of a column?

Best answer

Setup
Keep the DataFrame short for demonstration purposes.

np.random.seed(1)

size = 10
df = pd.DataFrame(dict(a=np.arange(size)))

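# list of variable-length sub-arrays (a plain list, because recent NumPy
# versions refuse to build an array from ragged rows without dtype=object)
ind = [
    np.random.randint(
        0, size, size=np.random.randint(1, 11)
    ) for _ in range(size)
]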

Solution
Use np.bincount with its weights argument.
This should be a very fast solution.

# get an array of the lengths of sub-arrays
lengths = np.array([len(x) for x in ind])
# simple np.arange for initial positions
positions = np.arange(len(ind))
# get at the underlying values of column `'a'`
values = df.a.values

# for each position repeated the number of times equal to
# the length of the sub-array at that position,
# add to the bin, identified by the position, the amount
# from values at the indices from the sub-array
# divide sums by lengths to get averages
avg = np.bincount(
    positions.repeat(lengths),
    values[np.concatenate(ind)]
) / lengths

df.assign(avg=avg)

   a       avg
0  0  3.833333
1  1  4.250000
2  2  6.200000
3  3  6.000000
4  4  5.200000
5  5  5.400000
6  6  2.000000
7  7  3.750000
8  8  6.500000
9  9  6.200000
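
To make the np.bincount call easier to follow, here is a small sketch that can be checked by hand. The helper name ragged_mean and the toy inputs are made up for illustration and are not part of the original answer. The minlength argument is added as a precaution so the result keeps one bin per row even if trailing sub-arrays are empty; the setup above never produces empty sub-arrays.

```python
import numpy as np

def ragged_mean(values, ind):
    """Mean of values[idx] for each index list idx in ind (hypothetical helper)."""
    lengths = np.array([len(x) for x in ind])
    # Bin label for every flattened index: 0 repeated len(ind[0]) times, and so on.
    bins = np.arange(len(ind)).repeat(lengths)
    # Flatten the ragged index lists into one integer array.
    flat = np.concatenate([np.asarray(x, dtype=np.intp) for x in ind])
    # Sum the looked-up values per bin, then divide by the group sizes.
    sums = np.bincount(bins, weights=values[flat], minlength=len(ind))
    return sums / lengths

values = np.array([10.0, 20.0, 30.0, 40.0])
ind = [[0, 1], [3], [1, 2, 3]]
print(ragged_mean(values, ind))  # [15. 40. 30.]
```

Row 0 averages 10 and 20, row 1 is just 40, and row 2 averages 20, 30 and 40, which matches the printed result.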

Timing

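In this table, the shortest time in each row is taken as the baseline, and every other value in that row is expressed as a multiple of it. The last column names the fastest method for the data size given by that row.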

Method pir      mcf Best
Size                    
10       1  12.3746  pir
30       1  44.0495  pir
100      1  124.054  pir
300      1    270.6  pir
1000     1  576.505  pir
3000     1  819.034  pir
10000    1  990.847  pir

Code

def mcf(d, i):
    g = lambda r: d.loc[i[d.index.get_loc(r.name)], 'a'].mean()
    return d.assign(avg=d.apply(g, 1))

def pir(d, i):
    lengths = np.array([len(x) for x in i])
    positions = np.arange(len(i))
    values = d.a.values

    avg = np.bincount(
        positions.repeat(lengths),
        values[np.concatenate(i)]
    ) / lengths

    return d.assign(avg=avg)

from timeit import timeit

results = pd.DataFrame(
    index=pd.Index([10, 30, 100, 300, 1000, 3000, 10000], name='Size'),
    columns=pd.Index(['pir', 'mcf'], name='Method'),
    dtype=float
)

for i in results.index:

    df = pd.DataFrame(dict(a=np.arange(i)))
    # plain list: recent NumPy will not build an array from ragged rows
    ind = [
        np.random.randint(
            0, i, size=np.random.randint(1, 11)
        ) for _ in range(i)
    ]

    for j in results.columns:

        stmt = '{}(df, ind)'.format(j)
        setp = 'from __main__ import df, ind, {}'.format(j)
        # DataFrame.set_value was removed in pandas 1.0; .at replaces it
        results.at[i, j] = timeit(stmt, setp, number=10)

results.div(results.min(axis=1), axis=0).round(2).pipe(lambda d: d.assign(Best=d.idxmin(axis=1)))

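import matplotlib.pyplot as plt

# top: raw timings on log-log axes; bottom: timings relative to the fastest method
fig, (a1, a2) = plt.subplots(2, 1, figsize=(6, 6))
results.plot(loglog=True, lw=3, ax=a1)
results.div(results.min(axis=1), axis=0).round(2).plot.bar(logy=True, ax=a2)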
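
Before trusting the timings, it may be worth checking that the vectorised result agrees with the naive per-row mean. The sketch below is not from the original answer: the variable names and the size of 50 are arbitrary choices, and the groupby variant is an alternative technique added here for comparison, since the question mentions trying groupby.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
size = 50
df = pd.DataFrame(dict(a=np.arange(size)))
# Ragged index lists of length 1 to 10, kept in a plain Python list.
ind = [rng.randint(0, size, size=rng.randint(1, 11)) for _ in range(size)]

# Reference: one .loc lookup and mean per row (slow, but easy to reason about).
expected = np.array([df.loc[idx, 'a'].mean() for idx in ind])

# Flatten the ragged lists once and label each entry with its row position.
lengths = np.array([len(x) for x in ind])
bins = np.arange(len(ind)).repeat(lengths)
flat = np.concatenate(ind)
looked_up = df['a'].values[flat]

# Weighted bincount gives per-row sums; dividing by lengths gives means.
via_bincount = np.bincount(bins, weights=looked_up) / lengths

# Alternative (not the answer's method): group the flattened values by row label.
via_groupby = pd.Series(looked_up).groupby(bins).mean().values

assert np.allclose(via_bincount, expected)
assert np.allclose(via_groupby, expected)
```

Both variants rely on the same flattening step; the groupby version is usually easier to extend to other aggregations, while bincount only covers sums and counts.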

Regarding python - Compute the mean of DataFrame row groups given by a 2D list of indices of unequal length, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45624653/
