python - Group by key and aggregate with a custom criterion

My dataframe looks like this:

import pandas as pd

df = pd.DataFrame([['A', 'a', 'web'],
                   ['A', 'b', 'mobile'],
                   ['B', 'c', 'web'],
                   ['C', 'd', 'web'],
                   ['D', 'e', 'mobile'],
                   ['D', 'f', 'web'],
                   ['D', 'g', 'web'],
                   ['D', 'g', 'web']],
                  columns=['seller_id', 'item_id', 'selling_channel'])

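It lists sold items, together with who the seller is and which selling channel was used to sell each item (web or mobile in the example above, but the real data has more possible channels).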

I want to determine which selling channel is the main selling channel for a given seller ID, subject to these additional rules:

If one channel accounts for 75% or more of the seller's sales, that channel is the main channel
If no channel reaches at least 75%, the main channel should be named mixed

So for the sample input above, I expect the following output:

expected = pd.DataFrame([['A', 'mixed'],
                         ['B', 'web'],
                         ['C', 'web'],
                         ['D', 'web']],
                        columns=['seller_id', 'main_selling_channel'])
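
To make the rule concrete, here is a minimal plain-Python sketch that applies it to one seller's list of channels. The function name main_channel and its threshold parameter are made up for illustration. Because the threshold is above 50%, at most one channel can reach it, so checking only the most frequent channel is enough.

```python
from collections import Counter

def main_channel(channels, threshold=0.75):
    """Return the channel whose share is >= threshold, else 'mixed'."""
    counts = Counter(channels)
    # Most frequent channel and how often it occurs
    channel, count = counts.most_common(1)[0]
    return channel if count / len(channels) >= threshold else "mixed"

# Seller A: 1 web, 1 mobile, so each share is 0.50
print(main_channel(["web", "mobile"]))
# Seller D: 1 mobile, 3 web, so the web share is 0.75
print(main_channel(["mobile", "web", "web", "web"]))
```

Seller A comes out as mixed and seller D as web, matching the expected table. This is only the per-seller rule; it does not address the speed problem.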

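Right now I build a map by manually iterating over the rows of the dataframe: for each seller_id I list every channel and its number of occurrences. Then I iterate over that data again to decide which channel is the main one. But with 10k input rows this manual iteration already takes a lot of time, and the real data contains millions of entries.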

Is there an efficient way to do this with the pandas API instead of manual iteration?

Best answer

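Here is one approach. Use df.groupby with value_counts(normalize=True) to get each channel's share within each seller, then check whether that share is greater than or equal to 0.75. np.where keeps the channel name where the check is True and replaces it with mixed otherwise. Finally, df.groupby() with idxmax picks one row per seller: the channel that passed the check, or mixed if none did.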

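import numpy as np

# Share of each channel within each seller, flagged True where it is >= 75%
a = (df.groupby('seller_id')['selling_channel'].value_counts(normalize=True).ge(0.75)
       .rename('Pct').reset_index())

# Keep the channel name where the flag is True, otherwise use 'mixed';
# idxmax then picks one row per seller (the True row if there is one)
out = (a.assign(selling_channel=np.where(a['Pct'], a['selling_channel'], 'mixed'))
       .loc[lambda x: x.groupby('seller_id')['Pct'].idxmax()].drop(columns='Pct'))

print(out)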

  seller_id selling_channel
0         A           mixed
2         B             web
3         C             web
4         D             web
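
As a possible alternative, the same result can be sketched with pd.crosstab, which builds a seller-by-channel table of shares directly. This is not part of the answer above. The names shares, top_share and top_channel are illustrative, and the sketch assumes the number of distinct channels is small enough that a wide table is acceptable. It has not been timed on millions of rows.

```python
import pandas as pd

df = pd.DataFrame(
    [['A', 'a', 'web'], ['A', 'b', 'mobile'], ['B', 'c', 'web'],
     ['C', 'd', 'web'], ['D', 'e', 'mobile'], ['D', 'f', 'web'],
     ['D', 'g', 'web'], ['D', 'g', 'web']],
    columns=['seller_id', 'item_id', 'selling_channel'])

# Rows = sellers, columns = channels, values = share of that seller's sales
shares = pd.crosstab(df['seller_id'], df['selling_channel'], normalize='index')

# Largest share per seller and the channel that holds it
top_share = shares.max(axis=1)
top_channel = shares.idxmax(axis=1)

# Keep the top channel only where it reaches the 75% threshold
out = (top_channel.where(top_share >= 0.75, 'mixed')
       .rename('main_selling_channel')
       .reset_index())
print(out)
```

Here out has the columns seller_id and main_selling_channel, the same layout as the expected output in the question.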

A similar question about "python - Group by key and aggregate with a custom criterion" can be found on Stack Overflow: https://stackoverflow.com/questions/60602586/
