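I have combinations of domain and month, along with the total number of orders for each month. I want to impute the missing combinations with a value of 0. What are the cheapest aggregation commands in PySpark to achieve this?
I have the following input table: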
domain month year total_orders
google.com 01 2017 20
yahoo.com 02 2017 30
google.com 03 2017 30
yahoo.com 03 2017 40
a.com 04 2017 50
a.com 05 2017 50
a.com 06 2017 50
Expected output:
domain month year total_orders
google.com 01 2017 20
yahoo.com 02 2017 30
google.com 03 2017 30
yahoo.com 03 2017 40
a.com 04 2017 50
a.com 05 2017 50
a.com 06 2017 50
google.com 02 2017 0
google.com 04 2017 0
yahoo.com 04 2017 0
google.com 05 2017 0
yahoo.com 05 2017 0
google.com 06 2017 0
yahoo.com 06 2017 0
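The order of the rows in the expected output does not matter.
Best Answer
The simplest method is to combine every month and year with every domain: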
select my.year, my.month, d.domain, coalesce(t.total_orders, 0) as total_orders
from (select distinct month, year from input) my cross join
     (select distinct domain from input) d left join
     input t
     on t.month = my.month and t.year = my.year and t.domain = d.domain;
Note: this assumes that each year/month combination appears at least once somewhere in the data.
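To check the cross-join query against the sample data, here is a minimal sketch using Python's built-in sqlite3 module. This is an assumption made for illustration: the question targets PySpark, where the same SQL could presumably be run through spark.sql after registering the DataFrame with createOrReplaceTempView. The table name input and the zero-padded string months follow the question.

```python
import sqlite3

# Sample rows from the question; month is kept as a zero-padded string.
rows = [
    ("google.com", "01", 2017, 20),
    ("yahoo.com", "02", 2017, 30),
    ("google.com", "03", 2017, 30),
    ("yahoo.com", "03", 2017, 40),
    ("a.com", "04", 2017, 50),
    ("a.com", "05", 2017, 50),
    ("a.com", "06", 2017, 50),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table input (domain text, month text, year integer, total_orders integer)"
)
conn.executemany("insert into input values (?, ?, ?, ?)", rows)

# Every distinct (month, year) is paired with every distinct domain,
# then the original rows are left-joined back; unmatched pairs get 0.
query = """
select my.year, my.month, d.domain, coalesce(t.total_orders, 0) as total_orders
from (select distinct month, year from input) my cross join
     (select distinct domain from input) d left join
     input t
     on t.month = my.month and t.year = my.year and t.domain = d.domain
"""
filled = conn.execute(query).fetchall()

for row in sorted(filled, key=lambda r: (r[2], r[1])):
    print(row)
```

On the sample data this produces 18 rows (6 months times 3 domains). That is more than the 14 rows listed in the question's expected output, because the cross join also fills months before a domain's first appearance, such as yahoo.com in month 01.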
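Getting the values within a range is a pain because you have split the date into multiple columns. Let me assume that the years are all the same, as in your example: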
select my.year, my.month, d.domain, coalesce(t.total_orders, 0) as total_orders
from (select distinct month, year from input) my join
     (select domain, min(month) as min_month, max(month) as max_month
      from input
      group by domain
     ) d
     on my.month >= d.min_month and my.month <= d.max_month left join
     input t
     on t.month = my.month and t.year = my.year and t.domain = d.domain;
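The range-based variant can be checked the same way. The sketch below again uses Python's sqlite3 module as a stand-in (an assumption, not the PySpark setup from the question) and assumes the per-domain subquery is grouped by domain, so that each domain gets its own minimum and maximum month. Months are compared as zero-padded strings, which orders them correctly only within a single year.

```python
import sqlite3

# Sample rows from the question; month is kept as a zero-padded string.
rows = [
    ("google.com", "01", 2017, 20),
    ("yahoo.com", "02", 2017, 30),
    ("google.com", "03", 2017, 30),
    ("yahoo.com", "03", 2017, 40),
    ("a.com", "04", 2017, 50),
    ("a.com", "05", 2017, 50),
    ("a.com", "06", 2017, 50),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table input (domain text, month text, year integer, total_orders integer)"
)
conn.executemany("insert into input values (?, ?, ?, ?)", rows)

# Each domain is paired only with the months between its own first and
# last month; gaps inside that range are filled with 0.
query = """
select my.year, my.month, d.domain, coalesce(t.total_orders, 0) as total_orders
from (select distinct month, year from input) my join
     (select domain, min(month) as min_month, max(month) as max_month
      from input
      group by domain
     ) d
     on my.month >= d.min_month and my.month <= d.max_month left join
     input t
     on t.month = my.month and t.year = my.year and t.domain = d.domain
"""
ranged = conn.execute(query).fetchall()

for row in sorted(ranged, key=lambda r: (r[2], r[1])):
    print(row)
```

On the sample data this returns 8 rows, and the only imputed row is google.com for month 02. Each domain is filled only between its own first and last month, so this version does not extend google.com or yahoo.com out to month 06 as the question's expected output does.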
Regarding "mysql - How to add rows for missing data combinations and impute the corresponding fields with 0", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/53126438/