I have a dataset of emails and purchases that looks like this:
Email        Purchaser    order_id  amount
a@gmail.com  a@gmail.com  1         5
b@gmail.com
c@gmail.com  c@gmail.com  2         10
c@gmail.com  c@gmail.com  3         5
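I want to find the total number of people in the dataset, the number of purchasers, the total number of orders, and the total revenue. I know how to do this in SQL with a left join and aggregate functions, but I don't know how to replicate it in Python / pandas.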
In Python, I tried this with pandas and numpy:
table1 = table.groupby(['Email', 'Purchaser']).agg({'amount': np.sum, 'order_id': 'count'})
table1.agg({'Email': 'count', 'Purchaser': 'count', 'amount': np.sum, 'order_id': 'count'})
The problem is that it only returns the rows that have orders (the first and third rows) and drops the others (the second row):
Email        Purchaser    order_id  amount
a@gmail.com  a@gmail.com  1         5
c@gmail.com  c@gmail.com  2         15
The SQL query should look like this:
SELECT count(Email) AS num_ind, count(Purchaser) AS num_purchasers, sum(orders) AS orders, sum(amount) AS revenue
FROM
  (SELECT Email, Purchaser, count(order_id) AS orders, sum(amount) AS amount
   FROM table1
   GROUP BY Email, Purchaser) x
How can I replicate this in Python?
Best answer
This is not implemented in pandas right now: groupby silently drops any row whose group key is NaN.
So an ugly workaround is to replace NaN with some placeholder string, and then replace it back with NaN after agg:
print(table)
         Email    Purchaser  order_id  amount
0  a@gmail.com  a@gmail.com         1       5
1  b@gmail.com          NaN       NaN     NaN
2  c@gmail.com  c@gmail.com         2      10
3  c@gmail.com  c@gmail.com         3       5
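To follow along, a frame matching the printout above can be built as in the sketch below. The question does not show how `table` was loaded, so this construction (column dtypes included) is an assumption made purely for illustration.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data: b@gmail.com never
# purchased, so Purchaser, order_id and amount are missing for that row.
table = pd.DataFrame({
    'Email': ['a@gmail.com', 'b@gmail.com', 'c@gmail.com', 'c@gmail.com'],
    'Purchaser': ['a@gmail.com', np.nan, 'c@gmail.com', 'c@gmail.com'],
    'order_id': [1, np.nan, 2, 3],
    'amount': [5, np.nan, 10, 5],
})
print(table)
```

Because of the NaN entries, pandas stores order_id and amount as floats here, so the printed numbers may show a trailing .0.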
table['Purchaser'] = table['Purchaser'].replace(np.nan, 'dummy')
print(table)
         Email    Purchaser  order_id  amount
0  a@gmail.com  a@gmail.com         1       5
1  b@gmail.com        dummy       NaN     NaN
2  c@gmail.com  c@gmail.com         2      10
3  c@gmail.com  c@gmail.com         3       5
table1 = table.groupby(['Email', 'Purchaser']).agg({'amount': np.sum, 'order_id': 'count'})
print(table1)
                         order_id  amount
Email       Purchaser
a@gmail.com a@gmail.com         1       5
b@gmail.com dummy               0     NaN
c@gmail.com c@gmail.com         2      15
table1 = table1.reset_index()
table1['Purchaser'] = table1['Purchaser'].replace('dummy', np.nan)
print(table1)
         Email    Purchaser  order_id  amount
0  a@gmail.com  a@gmail.com         1       5
1  b@gmail.com          NaN         0     NaN
2  c@gmail.com  c@gmail.com         2      15
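The answer stops at the grouped table and predates a later pandas feature. As far as I know, pandas 1.1 and newer accept `dropna=False` in `groupby`, which keeps NaN keys without the placeholder trick. The sketch below assumes such a version, rebuilds the sample data by hand, and adds the outer aggregation from the SQL query. The names `summary`, `num_ind`, `num_purchasers`, `orders` and `revenue` are taken from the SQL aliases or chosen for illustration.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's sample data.
table = pd.DataFrame({
    'Email': ['a@gmail.com', 'b@gmail.com', 'c@gmail.com', 'c@gmail.com'],
    'Purchaser': ['a@gmail.com', np.nan, 'c@gmail.com', 'c@gmail.com'],
    'order_id': [1, np.nan, 2, 3],
    'amount': [5, np.nan, 10, 5],
})

# Inner query: one row per (Email, Purchaser) pair.
# dropna=False keeps the group whose Purchaser is NaN (needs pandas >= 1.1).
table1 = (
    table.groupby(['Email', 'Purchaser'], dropna=False)
         .agg(orders=('order_id', 'count'), amount=('amount', 'sum'))
         .reset_index()
)

# Outer query: count() skips NaN, just like SQL count(column) skips NULL.
summary = pd.Series({
    'num_ind': table1['Email'].count(),
    'num_purchasers': table1['Purchaser'].count(),
    'orders': table1['orders'].sum(),
    'revenue': table1['amount'].sum(),
})
print(summary)
```

For the sample data this should report 3 people, 2 purchasers, 3 orders and a revenue of 20. Note that `sum` treats the missing amount for b@gmail.com as 0 instead of NaN in current pandas, which matches what the outer SQL `sum` would return.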
This post, "python - How to use group by and return rows with null values", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/34489141/